Data Science is Hard: Anomalies Part 3

So what do you do when you have a duplicate data problem and it just keeps getting worse?

You detect and discard.

Specifically, since we already have a few billion copies of pings with identical document ids (which are extremely-unlikely to collide), there is no benefit to continue storing them. So what we do is write a short report about what the incoming duplicate looked like (so that we can continue to analyze trends in duplicate submissions), then toss out the data without even parsing it.

As before, I’ll leave finding out the time the change went live as an exercise for the reader:newplot(1)

:chutten

Data Science is Hard: Anomalies Part 2

Apparently this is one of those problems that jumps two orders of magnitude if you ignore it:

aurora51-submissions

Since last time we’ve noticed that the vast majority of these incoming pings are duplicate. I don’t mean that they look similar, I mean that they are absolutely identical down to their supposedly-unique document ids.

How could this happen?

Well, with a minimum of speculation we can assume that however these Firefox instances are being distributed, they are being distributed with full copies of the original profile data directory. This would contain not only the user’s configuration information, but also copies of all as-yet-unsent pings. Once the distributed Firefox instance was started in its new home, it would submit these pending pings, which would explain why they are all duplicated: the distributor copy-pasta’d them.

So if we want to learn anything about the population of machines that are actually running these instances, we need to ignore all of these duplicate pings. So I took my analysis from last time and tweaked it.

First off, to demonstrate just how much of the traffic spike we see is the same fifteen duplicate pings, here is a graph of ping volume vs unique ping volume:

output_12_0

The count of non-duplicated pings is minuscule. We can conclude from this that most of these distributed Firefox instances rarely get the opportunity to send more than one ping. (Because if they did, we’d see many more unique pings created on their new hosts)

What can we say about these unique pings?

Besides how infrequent they are? They come from instances that all have the same Random Agent Spoofer addon that we saw in the original analysis. None of them are set as the user’s default browser. The hosts are most likely to have a 2.4GHz or 3.5GHz cpu. The hosts come from a geographically-diverse spread of area, with a peculiarly-popular cluster in Montreal (maybe they like the bagels. I know I do).

All of the pings come from computers running Windows XP. I wish I were more surprised by this, but it really does turn out that running software over a decade past its best before is a bad idea.

Also of note: the length of time the browser is open for is far too short (60-75s mostly) for a human to get anything done with it:

output_26_0

(Telemetry needs 60s after Firefox starts up in order to send a ping so it’s possible that there are browsing sessions that are shorter than a minute that we aren’t seeing.)

What can/should be done about these pings?

These pings are coming in at a rate far exceeding what the entire Aurora 51 population had when it was an active release. Yet, Aurora 51 hasn’t been an active release for six months and Aurora itself is going away.

As such, though its volume seems to continue to increase, this anomaly is less and less likely to cause us real problems day-to-day. These pings are unlikely to accidentally corrupt a meaningful analysis or mis-scale a plot.

And with our duplicate detector identifying these pings as they come in, it isn’t clear that this actually poses an analysis risk at all.

So, should we do anything about this?

Well, it is approaching release-channel-levels of volume per-day, submitted evenly at all hours instead of in the hump-backed periodic wave that our population usually generates:

aurora51-duplicateMainPings

Hundreds of duplicates detected every minute means nearly a million pings a day. We can handle it (in the above plot I turned off release, whose low points coincide with aurora’s high points), but should we?

Maybe for Mozilla’s server budget’s sake we should shut down this data after all. There’s no point in receiving yet another billion copies of the exact same document. The only things that differ are the submission timestamp and submitting IP address.

Another point: it is unlikely that these hosts are participating in this distribution of their free will. The rate of growth, the length of sessions, the geographic spread, and the time of day the duplicates arrive at our servers strongly suggest that it isn’t humans who are operating these Firefox installs. Maybe for the health of these hosts on the Internet we should consider some way to hotpatch these wayward instances into quiescence.

I don’t know what we (mozilla) should do. Heck, I don’t even know what we can do.

I’ll bring this up on fhr-dev and see if we’ll just leave this alone, waiting for people to shut off their Windows XP machines… or if we can come up with something we can (and should) do.

:chutten

Data Science is Hard: Client Delays for Crash Pings

Second verse, much like the first: how quickly do we get data from clients?

This time: crash pings.

Recording Delay

The recording delay of crash pings is different from main pings in that the only time information we have about when the information happens is crashDate, which only tells you the day the crash happened, not the time. This results in a weird stair-step pattern on the plot as I make a big assumption:

Assumption: If the crash ping was created on the same day that the crash happened, it took essentially 0 time to do so. (If I didn’t make this assumption, the plot would have every line at 0 for the first 24 hours and we’d not have as much information displayed before the 96-hour max)

output_19_1

The recording delay for crash pings is the time between the crash happening and the user restarting their browser. As expected, most users appear to restart their browser immediately. Even the slowest channel (release) has over 80% of its crash pings recorded within two days.

Submission Delay

The submission delay for crash pings, as with all pings, is the time between the creation of the ping and the sending of the ping. What makes the crash ping special is that it isn’t even created until the browser has restarted, so I expected these to be quite short:

output_22_1

They do not disappoint. Every branch but Nightly has 9 out of every 10 crash pings sent within minutes of it being created.

Nightly is a weird one. It starts off having the worst proportion of created pings unsent, but then becomes the best.

Really, all four of these lines should be within an error margin of just being a flat line at the top of the graph, since the code that creates the ping is pretty much the same code that sends it. How in the world are these many crash pings remaining unsent at first, but being sent eventually?

Terribly mysterious.

Combined Delay

output_26_1

The combined client delay for crash pings shows that we ought to have over 80% of all crash pings from all channels within a day or two of the crash happening. The coarseness of the crashDate measure makes it hard to say exactly how many and when, but the curve is clearly a much faster one than for the main ping delays previously examined.

Crash Rates

For crash rates that use crashes counted from crash pings and some normalization factor (like usage hours) counted from main pings, it doesn’t actually matter how fast or slow these pings come in. If only 50% of crashes and 50% of usage hours came in within a day, the crash rate would still be correct.

What does matter is when the pings arrive at different speeds:

combined_crashmain_delay

(Please forgive my awful image editing work)

Anywhere that the two same-coloured lines fail to overlap is a time when the server-recorded count of crashes from crash pings will not be from the same proportion of the population as the sever-recorded count of usage hours from main pings.

For example: On release (dark blue), if we look at the crash rate at 22 or 30-36 hours out from a given period, the crash rate is likely to approximate what a final tally will give us. But if we check early (before 22h, or between 22 and 30h), when the main pings are lagging, the crash rate will seem higher than reality. If we check later (after 36h), the crash rate will seem lower.

This is where the tyranny of having a day-resolution crashDate really comes into its own. If we could model exactly when a channels’ crash and main submission proportions are equal, we could use that to generate accurate approximations of the final crash rate. Right now, the rather-exact figures I’m giving in the previous paragraph may have no bearing on reality.

Conclusion

If we are to use crash pings and main pings together to measure “something”, we need to fully understand and measure the differences in their client-side delays. If the curves above are stable, we might be able to model their differences with some degree of accuracy. This would require a higher-resolution crash timestamp.

If we wish to use this measured “something” earlier than 24h from the event (like, say, to measure how crashy a new release is), we need to either chose a method that doesn’t rely on main pings, or speed up main ping reporting so that it has a curve closer to that of crash pings.

To do my part I will see if having a better crash timestamp (hours would do, minutes would be the most I would need) is something we might be willing to pursue, and I will lobby for the rapid completion and adoption of pingSender as a method for turning main pings’ submission delay CDF into a carbon copy of crash pings’.

Please peruse the full analysis on reports.telemetry.mozilla.org if you are interested in the details of how these graphs were generated.

:chutten

Data Science is Hard: Client Delays

Delays suck, but unmeasured delays suck more. So let’s measure them.

I’ve previous talked about delays as they relate to crash pings. This time we’re looking at the core of Firefox Telemetry data collection: the “main” ping. We’ll be looking at a 10% sample of all “main” pings submitted on Tuesday, January 10th[1].

In my previous post on delays, I defined five types of delay: recording, submission, aggregation, migration, and query scheduling. This post is about delays on the client side of the equation, so we’ll be focusing on the first two: recording, and submission.

Recording Delay

How long does it take from something happening, to having a record of it happening? We count HTTP response codes (as one does), so how much time passes from that first HTTP response to the time when that response’s code is packaged into a ping to be sent to our servers?

output_20_1

This is a Cumulative Distribution Functions or CDF. The ones in this post show you what proportion (0% – 100%) of “main” pings we’re looking at arrived with data that falls within a certain timeframe (0 – 96 hours). So in this case, look at the red, “aurora”-branch line. It crosses the 0.9 y-axis line at about the 8 x-axis line. This means 90% of the pings had a recording delay of 8 hours or less.

Which is fantastic news, especially since every other channel (release and beta vying for fastest speeds) gets more of its pings in even faster. 90% of release pings have a recording delay of at most 4 hours.

And notice that shelf at 24 hours, where every curve basically jumps to 100%? If users leave their browsers open for longer than a day, we cut a fresh ping at midnight. Glad to see evidence that it’s working.

All in all it shows that we can expect recording delays of under 30min for most pings across all channels. This is not a huge source of delay.

Submission Delay

With all that data finally part of a “main” ping, how long before the servers are told? For now, Telemetry has to wait for the user to restart their Firefox before it is able to send its pings. How long can that take?

output_23_1

Ouch.

Now we see aurora is no longer the slowest, and has submission delays very similar to release’s submission delays.  The laggard is now beta… and I really can’t figure out why. If Beta users are leaving their browsers open longer, we’d expect to see them be on the slower side of the “Recording Delay CDF” plot. If Beta users are leaving their browser closed longer, we’d expect them to show up lower on Engagement Ratio plots (which they don’t).

A mystery.

Not a mystery is that nightly has the fastest submission times. It receives updates every day so users have an incentive to restart their browsers often.

Comparing Submission Delay to Recording Delay, you can see how this is where we’re focusing most of our “Get More Data, Faster” attentions. If we wait for 90% of “main” pings to arrive, then we have to wait at least 17 hours for nightly data, 28 hours for release and aurora… and over 96 hours for beta.

And that’s just Submission Delay. What if we measured the full client -> server delay for data?

Combined Client Delay

output_27_1

With things the way they were on 2017-01-10, to get 90% of “main” pings we need to wait a minimum of 22 hours (nightly) and a maximum of… you know what, I don’t even know. I can’t tell where beta might cross the 0.9 line, but it certainly isn’t within 96 hours.

If we limit ourselves to 80% we’re back to a much more palatable 11 hours (nightly) to 27 hours (beta). But that’s still pretty horrendous.

I’m afraid things are actually even worse than I’m making it out to be. We rely on getting counts out of “main” pings. To count something, you need to count every single individual something. This means we need 100% of these pings, or as near as we can get. Even nightly pings take longer than 96 hours to get us more than 95% of the way there.

What do we use “main” pings to count? Amongst other things, “usage hours” or “how long has Firefox been open”. This is imperative to normalizing crash information properly so we can determine the health and stability of a release.

As you can imagine, we’re interested in knowing this as fast as possible. And as things stood a couple of Tuesdays ago, we have a lot of room for improvement.

For now, expect more analyses like this one (and more blog posts like this one) examining how slowly or quickly we can possibly get our data from the users who generate it to the Mozillians who use it to improve Firefox.

:chutten

[1]: Why did I look at pings from 2017-01-10? It was a recent Tuesday (less weekend effect) well after Gregorian New Year’s Day, well before Chinese New Year’s Day, and even a decent distance from Epiphany. Also the 01-10 is a mirror which I thought was neat.

What’s the First Firefox Crash a User Sees?

Growth is going to be a big deal across Mozilla in 2017. We spent 2016 solidifying our foundations, and now we’re going to use that to spring to action and grow our influence and user base.

So this got me thinking about new users. We’re constantly getting new users: people who, for one reason or another, choose to install and run Firefox for the first time today. They run it and… well, then what?

Maybe they like it. They open a new tab. Then they open a staggeringly unbelievable number of tabs. They find and install an addon. Or two.

Fresh downloads and installs of Firefox continue at an excellent pace. New people, every day, are choosing Firefox.

So with the number of new users we already see, the key to Growth may not lie in attracting more of them… it might be that we need to keep the ones we already see.

So what might stop a user from using Firefox? Maybe after they open the seventy-first tab, Firefox crashes. It just disappears on them. They open it again, browse for a little while… but can’t forget that the browser, at any time, could just decide to disappear and let them down. So they migrate back to something else, and we lose them.

It is with these ideas in my head that I wondered “Are there particular types of crashes that happen to new users? Do they more likely crash because of a plugin, their GPU misbehaving, running out of RAM… What is their first crash, and how might it compare to the broader ecosystem of crashes we see and fix every day?”

With the new data available to me thanks to Gabriele Svelto’s work on client-side stack traces, I figured I could maybe try to answer it.

My full analysis is here, but let me summarize: sadly there’s too much noise in the process to make a full diagnosis. There are some strange JSON errors I haven’t tracked down… and even if I do, there are too many “mystery” stack frames that we just don’t have a mechanism to figure out yet.

And this isn’t even covering how we need some kind of service or algorithm to detect signatures in these stacks or at least cluster them in some meaningful way.

Work is ongoing, so I hope to have a more definite answer in the future. But for now, all I can do is invite you to tell me what you think causes users to stop using Firefox. You can find me on Twitter, or through my @mozilla.com email address.

:chutten

Privileged to be a Mozillian

Mike Conley noticed a bug. There was a regression on a particular Firefox Nightly build he was tracking down. It looked like this:

A time series plot with a noticeable regression at November 6

A pretty big difference… only there was a slight problem: there were no relevant changes between the two builds. Being the kind of developer he is, :mconley looked elsewhere and found a probe that only started being included in builds starting November 16.

The plot showed him data starting from November 15.

He brought it up on irc.mozilla.org#telemetry. Roberto Vitillo was around and tried to reproduce, without success. For :mconley the regression was on November 5 and the data on the other probe started November 15. For :rvitillo the regression was on November 6 and the data started November 16. After ruling out addons, they assumed it was the dashboard’s fault and roped me into the discussion. This is what I had to say:

Hey, guess what's different between rvitillo and mconley? About 5 hours.

You see, :mconley is in the Toronto (as in Canada) Mozilla office, and Roberto is in the London (as in England) Mozilla office. There was a bug in how dates were being calculated that made it so the data displayed differently depending on your timezone. If you were on or East of the Prime Meridian you got the right answer. West? Everything looks like it happens one day early.

I hammered out a quick fix, which means the dashboard is now correct… but in thinking back over this bug in a post-mortem-kind-of-way, I realized how beneficial working in a distributed team is.

Having team members in multiple timezones not only provided us with a quick test location for diagnosing and repairing the issue, it equipped us with the mindset to think of timezones as a problematic element in the first place. Working in a distributed fashion has conferred upon us a unique and advantageous set of tools, experiences, thought processes, and mechanisms that allow us to ship amazing software to hundreds of millions of users. You don’t get that from just any cube farm.

#justmozillathings

:chutten

All Aboard the Release Train!

I’m working on a dashboard (a word which here means a website that has plots on it) to show Firefox crashes per channel over time. The idea is to create a resource for Release Management to figure out if there’s something wrong with a given version of Firefox with enough lead time to then do something about it.

It’s entering its final form (awaiting further feedback from relman and others to polish it) and while poking around (let’s call it “testing”) I noticed a pattern:auroramcscrashes

Each one of those spikes happens on a “merge day” when the Aurora channel (the channel powering Firefox Developer Edition) updates to the latest code from the Nightly channel (the channel powering Firefox Nightly). From then on, only stabilizing changes are merged so that when the next merge day comes, we have a nice, stable build to promote from Aurora to the Beta channel (the channel powering Firefox Beta). Beta undergoes further stabilization on a slower schedule (a new build every week or two instead of daily) to become a candidate for the Release channel (which powers Firefox).

For what it’s worth, this is called the Train Model. It allows us to ship stable and secure code with the latest features every six-to-eight weeks to hundreds of millions of users. It’s pretty neat.

And what that picture up there shows is that it works. The improvement is especially noticeable on the Aurora branch where we go from crashing 9-10 times for every thousand hours of use to 3-4 times for every thousand hours.

Now, the number and diversity of users on the Aurora branch is limited, so when the code rides the trains to Beta, the crash rate goes up. Suddenly code that seemed stable across the Aurora userbase is exposed to previously-unseen usage patterns and configurations of hardware and software and addons. This is one of the reasons why our pre-release users are so precious to us: they provide us with the early warnings we need to stabilize our code for the wild diversity of our users as well as the wild diversity of the Web.

If you’d like to poke around at the dashboard yourself, it’s over here. Eventually it’ll get merged into telemetry.mozilla.org when it’s been reviewed and formalized.

If you have any brilliant ideas of how to improve it, or find any mistakes or bugs, please comment on the tracking bug where the discussion is currently ongoing.

If you’d like to help in an different way, consider installing Firefox Beta. It should look and act almost exactly like your current Firefox, but with some pre-release goodies that you get to use before anyone else (like how Firefox Beta 50 lets you use Reader Mode when you want to print a web page to skip all the unnecessary sidebars and ads). By doing this you are helping us ship the best Firefox we can.

:chutten