Data Science is Hard: Anomalies and What to Do About Them

:mconley’s been looking at tab spinners to try and mitigate their impact on user experience. That’s when he noticed something weird that happened last October on Firefox Developer Edition:

[Figure: tab spinner ping submission volumes by build ID]

It’s a spike a full five orders of magnitude larger than the submission volume any single build has ever produced.

At first I thought it was users getting stuck on an old version. But then :frank noticed that the “by submission date” view of that same graph didn’t tally with that hypothesis:

[Figure: tab spinner ping submission volumes by submission date]

Submissions from Aurora (what the Firefox Developer Edition branch is called internally) 51 tailed off when Aurora 52 was released in exactly the way we’ve come to expect. Aurora 52 had a jump mid-December when we switched to using “main” pings instead of “saved-session” pings to run our aggregates, but otherwise everything’s fine heading into the end of the year.

But then there’s Aurora 51 rising from the dead in late December. Some sort of weird re-adoption update problem? But where are all those users coming from? Or are they actually users? These graphs only plot ping volumes.

(Quick refresher: we get anonymous usage data from Firefox users via “pings”: packets of data that are sent at irregular intervals. A single user can send many pings per day, though more than 25 in a day is pretty darn chatty.)

At this point I filed a bug. It appeared as though, somehow, we were getting new users running Aurora 51 build 20161014.

:mhowell popped the build onto a Windows machine and confirmed that it was updating fine for him. Anyone running that build ought not to be running it for long as they’d update within a couple of hours.

At this point we’d squeezed as much information as the aggregated data could give us, so I wandered off closer to the source to get deeper answers.

First I double-checked that what we were seeing in aggregate was what the data actually showed. Our main_summary dataset confirmed what we were seeing was not some weird artefact… but it also showed that there was no increase in client numbers:

[Figure: Aurora 51 ping count vs. client count]

A quick flip of the query and I learned that a single “client” was sending tens of thousands of pings each and every day from a long-dead non-release build of Firefox Developer Edition.
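For the curious, that “flipped” query amounts to something like the rough sketch below. The Spark session, table name, and column names are assumptions on my part (loosely following the main_summary schema), not the exact query that was run:

```python
# Pings per day vs. distinct clients per day for the suspicious build.
# Column names (normalized_channel, app_build_id, submission_date,
# client_id) are illustrative and may differ from the real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

aurora51 = (
    spark.table("main_summary")
         .where(F.col("normalized_channel") == "aurora")
         .where(F.col("app_build_id").startswith("20161014"))
)

per_day = (
    aurora51.groupBy("submission_date")
            .agg(
                F.count(F.lit(1)).alias("ping_count"),
                F.countDistinct("client_id").alias("client_count"),
            )
            .orderBy("submission_date")
)
per_day.show()
```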

A “client” in this case is identified by “client_id”, a unique identifier that lives in a Firefox profile. Generally we take a single “client” to roughly equal a single “user”, but this isn’t always the case. Sometimes a single user may have multiple profiles (one at work, one at home, for instance). Sometimes multiple users may have the same profile (an enterprise may provision a specific Firefox profile to every terminal).

It seemed likely we were in the second case: one profile, many Firefox installs.

But could we be sure? What could we learn about the “client” sending us this unexpectedly-large amount of data?

So I took a look.

First, a sense of scale:

[Figure: pings per day from this single client]

This single client began sending a few pings around November 15, 2016. This makes sense, as Aurora 51 was still the latest version at that time. Things didn’t ramp up until December when we started seeing over ten thousand pings per day. After a lull during Christmas it settled into what appeared to be some light growth with a large peak on Feb 17 reaching one hundred thousand pings on just that day.

This is kinda weird. If we assume some rule of thumb of, say, two pings per user per day, then we’re talking fifty thousand users running this ancient version of Aurora. What are they doing with it?

Well, we deliberately don’t record too much information about what our users do with their browsers. We don’t know what URLs are being visited, what credentials they’re using, or whether they prefer one hundred duck-sized horses or one horse-sized duck.

But we do know for how long the browser session lasted (from Firefox start to Firefox shutdown), so let’s take a look at that:

[Figure: distribution of session lengths reported by this client’s pings]

Woah. Over half of the sessions reported by the pings were exactly 215 seconds long. Three minutes and 35 seconds.
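That kind of check boils down to a value count over the reported session lengths. A minimal pandas sketch, where the DataFrame and the session_length column are stand-ins for illustration rather than the real data:

```python
import pandas as pd

# Stand-in data: in the real analysis this came from the client's pings.
pings = pd.DataFrame({"session_length": [215, 215, 215, 190, 215, 87, 215]})

# Share of pings at each exact session length, largest first.
length_shares = pings["session_length"].value_counts(normalize=True)
print(length_shares.head())  # in the real data, >50% sat at exactly 215 seconds
```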

It gets weirder. It turns out that these Aurora 51 builds are all running on the same Operating System (Windows XP, about which I’ve blogged before), all have the same addon installed (Random Agent Spoofer, though about 10% also have Alexa Traffic Rank), none have Aurora 51 set to be their default browser, none have application updates enabled, and they come from 418 different geographical locations according to the IP address of the submitters (top 10 locations include 5 in the US, 2 in France, 2 in Britain, and one in Germany).

This is where I would like to report the flash of insight that had me jumping out of the bath shouting Eureka.

But I don’t have one.

Everyone mentioned here and some others besides have thrown their heads at this mystery and can’t come up with anything suitably satisfying. Is it a Windows XP VM that is distributed to help developers test their websites? Is it an embedded browser in some popular piece of software with broad geographic appeal? Is someone just spoofing us by setting their client ids the same? If so, how did they spoof their session lengths?

To me the three-minute-and-thirty-five-second length of sessions just screams that this is some sort of automated process. I’m worried that Firefox might have been packaged into some sort of botnet-type-thingy that has gone out and infected thousands of hosts and is using our robust network stack to… to do what?

And then there’s the problem of what to do about it.

On one hand, this is data from Firefox. It arrived properly-formed, and no one’s trying to attack us with it, so we have no need to stop it entering our data pipeline for processing.

On the other hand, this data is making the Aurora tab spinner graph look wonky for :mconley, and might cause other mischief down the line.

It leads us to question whether we care about data that’s been sent to us by automated processes… and whether we could identify such data if we didn’t.

For now we’re going to block this particular client_id’s data from entering the aggregate dataset. The aggregate dataset is used by telemetry.mozilla.org to display interesting stuff about Firefox users. Human users. So we’re okay with blocking it.

But all Firefox users submit data that might be useful to us, so what we’re not going to do is block this problematic client’s data from entering the pipeline. We’ll continue to collect and collate it in the hopes that it can reveal to us some way to improve Firefox or data collection in the future.

And that’s sadly where we’re at with this: an unsolved mystery, some unanswered questions about the value of automated data, and an unsatisfied sense of curiosity.

:chutten

The Most Satisfying Graph

There were a lot of Firefox users on Beta 44.

Usually this is a good thing. We like having a lot of users[citation needed].

It wasn’t a good thing this time, as Beta had already moved on to 45. Then 46. Eventually we were at Beta 52, and the number of users on Beta 44 was increasing.

We thought maybe it was because Beta 44 had the same watershed as Release 43. Watershed? Every user running a build before a watershed must update to the watershed first before updating to the latest build. If you have Beta 41 and the latest is Beta 52, you must first update to Beta 44 (watershed) so we can better ascertain your cryptography support before continuing on to 48, which is another watershed, this time to do with fourteen-year-old processor extensions. Then, and only then, can you proceed to the currently-most-recent version, Beta 52.

(If you install afresh, the installer has the smarts to figure out your computer’s cryptographic and CPU characteristics and suitability so that new users jump straight to the front of the line)
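In code, the watershed rule above looks something like this toy sketch (an illustration of the idea, not the actual updater logic):

```python
def update_path(current, latest, watersheds):
    """Versions a client must step through, given watershed releases."""
    path = []
    for watershed in sorted(watersheds):
        if current < watershed <= latest:
            path.append(watershed)
            current = watershed
    if current < latest:
        path.append(latest)
    return path

# A Beta 41 user must pass through the 44 and 48 watersheds to reach 52.
print(update_path(41, 52, watersheds=[44, 48]))  # [44, 48, 52]
```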

Beta 44 being a watershed should, indeed, mean a longer-than-usual lifetime for that version’s population. If this were the only effect at play, we’d expect the population to decrease quickly as users updated.

But they didn’t update.

It turns out that whenever the Beta 44 users attempted to download an update to that next watershed release, Beta 48, they were getting a 404 Not Found. At some point, the watershed Beta 48 build on download.mozilla.org was removed, possibly due to age (we can’t keep everything forever). So whenever the users on Beta 44 wanted to update, they couldn’t. To compound things, any time a user before Beta 44 wanted to update, they had to go through Beta 44. Where they were caught.

This was fixed on… well, I’ll let you figure out which day it was fixed on:

[Figure: Beta 44 daily active users]

This is now the most satisfying graph I’ve ever plotted at Mozilla.

:chutten

Data Science is Hard: Client Delays for Crash Pings

Second verse, much like the first: how quickly do we get data from clients?

This time: crash pings.

Recording Delay

The recording delay of crash pings is different from that of main pings in that the only time information we have about when the crash happened is crashDate, which only tells you the day of the crash, not the time. This results in a weird stair-step pattern on the plot as I make a big assumption:

Assumption: If the crash ping was created on the same day that the crash happened, it took essentially 0 time to do so. (If I didn’t make this assumption, the plot would have every line at 0 for the first 24 hours and we’d not have as much information displayed before the 96-hour max)
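One way to operationalize that assumption looks like the sketch below; the field name and the exact handling are illustrative rather than quoted from the analysis notebook:

```python
from datetime import datetime

def recording_delay_hours(crash_date: str, ping_creation: datetime) -> float:
    """Hours between the (day-resolution) crash and the ping's creation."""
    crash_day = datetime.strptime(crash_date, "%Y-%m-%d")
    if ping_creation.date() == crash_day.date():
        return 0.0  # assumption: same-day pings took ~0 time to record
    # Otherwise all we can do is measure from the start of the crash day,
    # which is what produces the stair-step pattern in the plot.
    return (ping_creation - crash_day).total_seconds() / 3600.0

print(recording_delay_hours("2017-03-01", datetime(2017, 3, 3, 6, 0)))  # 54.0
```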

[Figure: crash ping recording delay CDF, by channel]

The recording delay for crash pings is the time between the crash happening and the user restarting their browser. As expected, most users appear to restart their browser immediately. Even the slowest channel (release) has over 80% of its crash pings recorded within two days.

Submission Delay

The submission delay for crash pings, as with all pings, is the time between the creation of the ping and the sending of the ping. What makes the crash ping special is that it isn’t even created until the browser has restarted, so I expected these to be quite short:

[Figure: crash ping submission delay CDF, by channel]

They do not disappoint. Every branch but Nightly has 9 out of every 10 crash pings sent within minutes of being created.

Nightly is a weird one. It starts off having the worst proportion of created pings unsent, but then becomes the best.

Really, all four of these lines should be within an error margin of just being a flat line at the top of the graph, since the code that creates the ping is pretty much the same code that sends it. How in the world are this many crash pings remaining unsent at first, but being sent eventually?

Terribly mysterious.

Combined Delay

[Figure: crash ping combined client delay CDF, by channel]

The combined client delay for crash pings shows that we ought to have over 80% of all crash pings from all channels within a day or two of the crash happening. The coarseness of the crashDate measure makes it hard to say exactly how many and when, but the curve is clearly a much faster one than for the main ping delays previously examined.

Crash Rates

For crash rates that use crashes counted from crash pings and some normalization factor (like usage hours) counted from main pings, it doesn’t actually matter how fast or slow these pings come in. If only 50% of crashes and 50% of usage hours came in within a day, the crash rate would still be correct.
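A toy example of why, with made-up numbers:

```python
# If crashes and usage hours arrive in the same proportion, the crash rate
# is unaffected by how much data is still outstanding.
total_crashes = 500
total_usage_khours = 100.0  # thousands of usage hours

final_rate = total_crashes / total_usage_khours                   # 5.0 per 1000h

# Only half the data has arrived, but the same half on both sides:
early_rate = (0.5 * total_crashes) / (0.5 * total_usage_khours)   # still 5.0

# Pings arriving at different speeds is what skews the estimate:
skewed_rate = (0.8 * total_crashes) / (0.5 * total_usage_khours)  # 8.0, too high
print(final_rate, early_rate, skewed_rate)
```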

What does matter is when the pings arrive at different speeds:

[Figure: combined client delay CDFs for crash pings and main pings, overlaid by channel]

(Please forgive my awful image editing work)

Anywhere that the two same-coloured lines fail to overlap is a time when the server-recorded count of crashes from crash pings will not be from the same proportion of the population as the server-recorded count of usage hours from main pings.

For example: On release (dark blue), if we look at the crash rate at 22 or 30-36 hours out from a given period, the crash rate is likely to approximate what a final tally will give us. But if we check early (before 22h, or between 22 and 30h), when the main pings are lagging, the crash rate will seem higher than reality. If we check later (after 36h), the crash rate will seem lower.

This is where the tyranny of having a day-resolution crashDate really comes into its own. If we could model exactly when a channel’s crash and main submission proportions are equal, we could use that to generate accurate approximations of the final crash rate. Right now, the rather-exact figures I’m giving in the previous paragraph may have no bearing on reality.

Conclusion

If we are to use crash pings and main pings together to measure “something”, we need to fully understand and measure the differences in their client-side delays. If the curves above are stable, we might be able to model their differences with some degree of accuracy. This would require a higher-resolution crash timestamp.

If we wish to use this measured “something” earlier than 24h from the event (like, say, to measure how crashy a new release is), we need to either choose a method that doesn’t rely on main pings, or speed up main ping reporting so that it has a curve closer to that of crash pings.

To do my part I will see if having a better crash timestamp (hours would do, minutes would be the most I would need) is something we might be willing to pursue, and I will lobby for the rapid completion and adoption of pingSender as a method for turning main pings’ submission delay CDF into a carbon copy of crash pings’.

Please peruse the full analysis on reports.telemetry.mozilla.org if you are interested in the details of how these graphs were generated.

:chutten

Data Science is Hard: Units

I like units. Units are fun. When playing with Firefox Telemetry you can get easy units like “number of bookmarks per user” and long units like “main and content but not content-shutdown crashes per thousand usage hours”.

Some units are just transformations of other units. For instance, if you invert the crash rate units (crashes per usage hours) you get something like Mean Time To Failure where you can see how many usage hours there are between crashes. In the real world of Canada I find myself making distance transformations between miles and kilometres and temperature transformations between Fahrenheit and Celsius.
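To make the inversion concrete, a quick worked example (with made-up numbers):

```python
# Inverting a crash rate into a mean-time-to-failure-style figure.
crashes_per_khour = 5.0                  # 5 crashes per 1000 usage hours
mttf_hours = 1000.0 / crashes_per_khour  # usage hours between crashes
print(mttf_hours)                        # 200.0

# The everyday transformations are the same idea:
print(100 * 1.609344)                    # 100 miles in kilometres
print((68 - 32) * 5 / 9)                 # 68 °F in Celsius -> 20.0
```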

My younger brother is a medical resident in Canada and is slowly working his way through the details of what it would mean to practice medicine in Canada or the US. One thing that came up in conversation was the unit differences.

I thought he meant things like millilitres being replaced with fluid ounces or some other vaguely insensible nonsense (I am in favour of the metric system, generally). But no. It’s much worse.

It turns out that various lab results have to be communicated in terms of proportion. How much cholesterol per unit of blood? How much calcium? How much sugar, insulin, salt?

I was surprised when my brother told me that in the United States this is communicated in grams. If you took all of the {cholesterol, calcium, sugar, insulin, salt} out of the blood and weighed it on a (metric!) scale, how much is there?

In Canada, this is communicated in moles. Not the furry animal, but the actual count of molecules. If you took all of the substance out of the blood and counted the molecules, how many are there?

So when you are trained in one system to recognize “good” (typical) values and “bad” (atypical) values, when you shift to the other system you need to learn new values.

No problem, right? Like how you need to multiply by 1.6 to get kilometres out of miles?

No. Since grams vs moles is a difference between “much” and “many” you need to apply a different conversion depending on the molecular weight of the substance you are measuring.

So, yes, there is a multiple you can use for cholesterol. And another for calcium. And another for sugar, yet another for insulin, and still another for salt. It isn’t just one conversion, it’s one conversion per substance.
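To make that concrete, here is a small illustration using approximate textbook molar masses; the substances, values, and helper function are mine, purely for illustration:

```python
# Converting US-style mass concentration (mg/dL) to Canadian-style molar
# concentration (mmol/L). The factor depends on each substance's molar mass.
MOLAR_MASS_G_PER_MOL = {
    "glucose": 180.16,
    "cholesterol": 386.65,
    "calcium": 40.08,
}

def mg_dl_to_mmol_l(substance: str, mg_per_dl: float) -> float:
    """Convert mg/dL to mmol/L: (mg/dL * 10 dL/L) / (g/mol) = mmol/L."""
    return mg_per_dl * 10.0 / MOLAR_MASS_G_PER_MOL[substance]

print(round(mg_dl_to_mmol_l("glucose", 90), 2))       # ~5.0 mmol/L
print(round(mg_dl_to_mmol_l("cholesterol", 200), 2))  # ~5.17 mmol/L
```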

Suddenly “crashes per thousand usage hours” seems reasonable and sane.

:chutten