Distributed Teams: A Test Failing Because It’s Run West of Newfoundland and Labrador

(( Not quite 500 mile email-level of nonsense, but might be the closest I get. ))

A test was failing.

Not really unusual, that. Tests fail all the time. It’s how we know they’re good tests: protecting us developers from ourselves.

But this one was failing unusually. Y’see, it was failing on my machine.

(Yes, har har, it is a common-enough occurrence given my obvious lack of quality as a developer how did you guess.)

The unusual part was that it was failing only for me… and I hadn’t even touched anything yet. It wasn’t failing on our test infrastructure “try”, and it wasn’t failing on the machine of :raphael, the fellow responsible for the integration test harness itself. We were building Firefox the same way, running telemetry-tests-client the same way… but I was getting different results.

I fairly quickly locked down the problem to be an extra “main” ping with reason “environment-change” being sent during the initial phases of the test. By dumping some logging into Firefox, rebuilding it, and then routing its logs to console with --gecko-log "-" I learned that we were sending a ping because a monitored user preference had been changed: browser.search.region.

When Firefox starts up the first time, it doesn’t know where it is. And it needs to know where it is to properly make a first guess at what language you want and what search engines would work best. Google’s results are pretty bad in Canada unless you use “google.ca”, after all.

But while Firefox doesn’t know where it is, it does know is what timezone it’s in from the settings in your OS’s clock. On top of that it knows what language your OS is set to. So we make a first guess at which search region we’re in based on whether or not the timezone overlaps a US timezone and if your OS’ locale is `en-US` (United States English).

What this fails to take into account is that United States English is the “default” locale reported by many OSes even if you aren’t in the US. And how even if you are in a timezone that overlaps with the US, you might not be there.

So to account for that, Mozilla operates a location service to double-check that the search region is appropriately set. This takes a little time to get back with the correct information, if it gets back to us at all. So if you happen to be in a US-overlapping timezone with an English-language OS Firefox assumes you’re in the US. Then if the location service request gets back with something that isn’t “US”, browser.search.region has to be updated.

And when it updates, it changes the Telemetry Environment.

And when the Environment changes, we send a “main” ping.

And when we send a “main” ping, the test breaks.

…all because my timezone overlaps the OS and my language is “Default” English.

I feel a bit persecuted, but this shows another strength of Distributed Teams. No one else on my team would be able to find this failure. They’re in Germany, Italy, and the US. None of them have that combination of “Not in the US, but in a US timezone” needed to manifest the bug.

So on one hand this sucks. I’m going to have to find a way around this.

But on the other hand… I feel like my Canadianness is a bit of a bug-finding superpower. I’m no Brok Windsor or Captain Canuck, but I can get the job done in a way no one else on my team can.

Not too bad, eh?

:chutten

Advertisements

Privileged to be a Mozillian

Mike Conley noticed a bug. There was a regression on a particular Firefox Nightly build he was tracking down. It looked like this:

A time series plot with a noticeable regression at November 6

A pretty big difference… only there was a slight problem: there were no relevant changes between the two builds. Being the kind of developer he is, :mconley looked elsewhere and found a probe that only started being included in builds starting November 16.

The plot showed him data starting from November 15.

He brought it up on irc.mozilla.org#telemetry. Roberto Vitillo was around and tried to reproduce, without success. For :mconley the regression was on November 5 and the data on the other probe started November 15. For :rvitillo the regression was on November 6 and the data started November 16. After ruling out addons, they assumed it was the dashboard’s fault and roped me into the discussion. This is what I had to say:

Hey, guess what's different between rvitillo and mconley? About 5 hours.

You see, :mconley is in the Toronto (as in Canada) Mozilla office, and Roberto is in the London (as in England) Mozilla office. There was a bug in how dates were being calculated that made it so the data displayed differently depending on your timezone. If you were on or East of the Prime Meridian you got the right answer. West? Everything looks like it happens one day early.

I hammered out a quick fix, which means the dashboard is now correct… but in thinking back over this bug in a post-mortem-kind-of-way, I realized how beneficial working in a distributed team is.

Having team members in multiple timezones not only provided us with a quick test location for diagnosing and repairing the issue, it equipped us with the mindset to think of timezones as a problematic element in the first place. Working in a distributed fashion has conferred upon us a unique and advantageous set of tools, experiences, thought processes, and mechanisms that allow us to ship amazing software to hundreds of millions of users. You don’t get that from just any cube farm.

#justmozillathings

:chutten