Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

I’m mentoring a Summer of Code project this summer about redesigning the “about:telemetry” interface that ships with each and every version of Firefox.

The minute the first student (:flyingrub) asked me “What is a parent payload and child payload?” I knew I was going to be asked a lot of questions.

To least-effectively answer these questions, I’ll blog the answers as narratives. And to start with this question, here’s how the history of a project makes it difficult to collect data from it.

In the Beginning — or, rather, in the middle of October 2015 when I was hired at Mozilla (so, at my Beginning) — there was single-process Firefox, and all was good. Users had many tabs, but one process. Users had many bookmarks, but one process. Users had many windows, but one process. All this and the web contents themselves were all sharing time within a single construct of electrons and bits and code and pixels: vying with each other for control of the filesystem, the addressable space of RAM, the network resources, and CPU scheduling.

Not satisfied with things being just “good”, we took a page from the book penned by Google Chrome and decided the time was ripe to split the browser into many processes so that a critical failure in one would not trouble the others. To begin with, because our code is venerable, we decided that we would try two processes. One of these twins would be in charge of the browser and the other would be in charge of the web contents.

This project was called Electrolysis after the mechanism by which one might split water into Hydrogen and Oxygen using electricity.

Suddenly the browser became responsive, even in the face of the worst JavaScript written by the least experienced dev at the most privileged startup in Silicon Valley. And out-of-memory errors decreased in frequency because the browser’s memory and the web contents’ memory were able to grow without interfering with each other.

Remember, our code is venerable. Remember, our code hearkens from its single-process past.

Our data-collection code was written in that single-process past. But we had two processes with input events that need to be timed to find problems. We had two processes with memory allocations that need to be examined for regressions.

So the data collection code was made aware that there could be two types of process: parent and child.

Alas, not just one child. There could be many child processes in a row if some webpage were naughty and brought down the child in its anger. So the data collection code was made aware there could be many batches of data from child processes, and one batch of data from parent processes.

The parent data was left looking like single-process data, out in the root of the data collection payload. Child processes’ data were placed in an array of childPayloads where each payload echoed the structure of the parent.

Then, not content with “good”, I had to come along in bug 1218576, a bug whose number I still have locked in my memory, for good or ill.

Firefox needs to have multiple child processes of different types, simultaneously. And many of some of those several types, also simultaneously. What was going to be a quick way to ensure that childPayloads was always of length 1 turned into a months-long exercise to put data exactly where we wanted it to be.

And so now we have childPayloads where the “weird” content child data that resists aggregation remains, and we also have payload.processes.<process type>.* where the cool and hip data lives: histograms, scalars, and keyed variants of both.

Already this approach is showing dividends as some proportions of Nightly users are getting a gpu process, and others are getting some number of content processes. The data files neatly into place with minimal intervention required.

But it means about:telemetry needs to know whether you want the parent’s “weird” data or the child’s. And which child was that, again?

And about:telemetry also needs to know whether you want the parent’s “cool” data, or the content child’s, or the gpu child’s.

So this means that within about:telemetry there are now five places where you can select what process you want. One for “weird” data, and one for each of the four kinds of “cool” data.

Sadly, that brings my storytelling to a close, having reached the present day. Hopefully after this Summer’s Code is done, this will have a happier, more-maintainable, and responsively-designed ending.

But until now, remember that “accident of history” is the answer to most questions. As such it behooves you to learn history well.

:chutten

The Most Satisfying Graph

There were a lot of Firefox users on Beta 44.

Usually this is a good thing. We like having a lot of users[citation needed].

It wasn’t a good thing this time, as Beta had already moved on to 45. Then 46. Eventually we were at Beta 52, and the number of users on Beta 44 was increasing.

We thought maybe it was because Beta 44 had the same watershed as Release 43. Watershed? Every user running a build before a watershed must update to the watershed first before updating to the latest build. If you have Beta 41 and the latest is Beta 52, you must first update to Beta 44 (watershed) so we can better ascertain your cryptography support before continuing on to 48, which is another watershed, this time to do with fourteen-year-old processor extensions. Then, and only then, can you proceed to the currently-most-recent version, Beta 52.

(If you install afresh, the installer has the smarts to figure out your computer’s cryptographic and CPU characteristics and suitability so that new users jump straight to the front of the line)

Beta 44 being a watershed should, indeed, require a longer-than-usual lifetime of the version, with respect to population. If this were the only effect at play we’d expect the population to quickly decrease as users updated.

But they didn’t update.

It turns out that whenever the Beta 44 users attempted to download an update to that next watershed release, Beta 48, they were getting a 404 Not Found. At some point, the watershed Beta 48 build on download.mozilla.org was removed, possibly due to age (we can’t keep everything forever). So whenever the users on Beta 44 wanted to update, they couldn’t. To compound things, any time a user before Beta 44 wanted to update, they had to go through Beta 44. Where they were caught.

This was fixed on… well, I’ll let you figure out which day it was fixed on:

beta44_dau

This is now the most satisfying graph I’ve ever plotted at Mozilla.

:chutten

Data Science is Hard: Client Delays for Crash Pings

Second verse, much like the first: how quickly do we get data from clients?

This time: crash pings.

Recording Delay

The recording delay of crash pings is different from main pings in that the only time information we have about when the information happens is crashDate, which only tells you the day the crash happened, not the time. This results in a weird stair-step pattern on the plot as I make a big assumption:

Assumption: If the crash ping was created on the same day that the crash happened, it took essentially 0 time to do so. (If I didn’t make this assumption, the plot would have every line at 0 for the first 24 hours and we’d not have as much information displayed before the 96-hour max)

output_19_1

The recording delay for crash pings is the time between the crash happening and the user restarting their browser. As expected, most users appear to restart their browser immediately. Even the slowest channel (release) has over 80% of its crash pings recorded within two days.

Submission Delay

The submission delay for crash pings, as with all pings, is the time between the creation of the ping and the sending of the ping. What makes the crash ping special is that it isn’t even created until the browser has restarted, so I expected these to be quite short:

output_22_1

They do not disappoint. Every branch but Nightly has 9 out of every 10 crash pings sent within minutes of it being created.

Nightly is a weird one. It starts off having the worst proportion of created pings unsent, but then becomes the best.

Really, all four of these lines should be within an error margin of just being a flat line at the top of the graph, since the code that creates the ping is pretty much the same code that sends it. How in the world are these many crash pings remaining unsent at first, but being sent eventually?

Terribly mysterious.

Combined Delay

output_26_1

The combined client delay for crash pings shows that we ought to have over 80% of all crash pings from all channels within a day or two of the crash happening. The coarseness of the crashDate measure makes it hard to say exactly how many and when, but the curve is clearly a much faster one than for the main ping delays previously examined.

Crash Rates

For crash rates that use crashes counted from crash pings and some normalization factor (like usage hours) counted from main pings, it doesn’t actually matter how fast or slow these pings come in. If only 50% of crashes and 50% of usage hours came in within a day, the crash rate would still be correct.

What does matter is when the pings arrive at different speeds:

combined_crashmain_delay

(Please forgive my awful image editing work)

Anywhere that the two same-coloured lines fail to overlap is a time when the server-recorded count of crashes from crash pings will not be from the same proportion of the population as the sever-recorded count of usage hours from main pings.

For example: On release (dark blue), if we look at the crash rate at 22 or 30-36 hours out from a given period, the crash rate is likely to approximate what a final tally will give us. But if we check early (before 22h, or between 22 and 30h), when the main pings are lagging, the crash rate will seem higher than reality. If we check later (after 36h), the crash rate will seem lower.

This is where the tyranny of having a day-resolution crashDate really comes into its own. If we could model exactly when a channels’ crash and main submission proportions are equal, we could use that to generate accurate approximations of the final crash rate. Right now, the rather-exact figures I’m giving in the previous paragraph may have no bearing on reality.

Conclusion

If we are to use crash pings and main pings together to measure “something”, we need to fully understand and measure the differences in their client-side delays. If the curves above are stable, we might be able to model their differences with some degree of accuracy. This would require a higher-resolution crash timestamp.

If we wish to use this measured “something” earlier than 24h from the event (like, say, to measure how crashy a new release is), we need to either chose a method that doesn’t rely on main pings, or speed up main ping reporting so that it has a curve closer to that of crash pings.

To do my part I will see if having a better crash timestamp (hours would do, minutes would be the most I would need) is something we might be willing to pursue, and I will lobby for the rapid completion and adoption of pingSender as a method for turning main pings’ submission delay CDF into a carbon copy of crash pings’.

Please peruse the full analysis on reports.telemetry.mozilla.org if you are interested in the details of how these graphs were generated.

:chutten

Data Science is Hard: Units

I like units. Units are fun. When playing with Firefox Telemetry you can get easy units like “number of bookmarks per user” and long units like “main and content but not content-shutdown crashes per thousand usage hours“.

Some units are just transformations of other units. For instance, if you invert the crash rate units (crashes per usage hours) you get something like Mean Time To Failure where you can see how many usage hours there are between crashes. In the real world of Canada I find myself making distance transformations between miles and kilometres and temperature transformations between Fahrenheit and Celsius.

My younger brother is a medical resident in Canada and is slowly working his way through the details of what it would mean to practice medicine in Canada or the US. One thing that came up in conversation was the unit differences.

I thought he meant things like millilitres being replaced with fluid ounces or some other vaguely insensible nonsense (I am in favour of the metric system, generally). But no. It’s much worse.

It turns out that various lab results have to be communicated in terms of proportion. How much cholesterol per unit of blood? How much calcium? How much sugar, insulin, salt?

I was surprised when my brother told me that in the United States this is communicated in grams. If you took all of the {cholesterol, calcium, sugar, insulin, salt} out of the blood and weighed it on a (metric!) scale, how much is there?

In Canada, this is communicated in moles. Not the furry animal, but the actual count of molecules. If you took all of the substance out of the blood and counted the molecules, how many are there?

So when you are trained in one system to recognize “good” (typical) values and “bad” (atypical) values, when you shift to the other system you need to learn new values.

No problem, right? Like how you need to multiply by 1.6 to get kilometres out of miles?

No. Since grams vs moles is a difference between “much” and “many” you need to apply a different conversion depending on the molecular weight of the substance you are measuring.

So, yes, there is a multiple you can use for cholesterol. And another for calcium. And another for sugar, yet another for insulin, and still another for salt. It isn’t just one conversion, it’s one conversion per substance.

Suddenly “crashes per thousand usage hours” seems reasonable and sane.

:chutten

Data Science is Hard: Client Delays

Delays suck, but unmeasured delays suck more. So let’s measure them.

I’ve previous talked about delays as they relate to crash pings. This time we’re looking at the core of Firefox Telemetry data collection: the “main” ping. We’ll be looking at a 10% sample of all “main” pings submitted on Tuesday, January 10th[1].

In my previous post on delays, I defined five types of delay: recording, submission, aggregation, migration, and query scheduling. This post is about delays on the client side of the equation, so we’ll be focusing on the first two: recording, and submission.

Recording Delay

How long does it take from something happening, to having a record of it happening? We count HTTP response codes (as one does), so how much time passes from that first HTTP response to the time when that response’s code is packaged into a ping to be sent to our servers?

output_20_1

This is a Cumulative Distribution Functions or CDF. The ones in this post show you what proportion (0% – 100%) of “main” pings we’re looking at arrived with data that falls within a certain timeframe (0 – 96 hours). So in this case, look at the red, “aurora”-branch line. It crosses the 0.9 y-axis line at about the 8 x-axis line. This means 90% of the pings had a recording delay of 8 hours or less.

Which is fantastic news, especially since every other channel (release and beta vying for fastest speeds) gets more of its pings in even faster. 90% of release pings have a recording delay of at most 4 hours.

And notice that shelf at 24 hours, where every curve basically jumps to 100%? If users leave their browsers open for longer than a day, we cut a fresh ping at midnight. Glad to see evidence that it’s working.

All in all it shows that we can expect recording delays of under 30min for most pings across all channels. This is not a huge source of delay.

Submission Delay

With all that data finally part of a “main” ping, how long before the servers are told? For now, Telemetry has to wait for the user to restart their Firefox before it is able to send its pings. How long can that take?

output_23_1

Ouch.

Now we see aurora is no longer the slowest, and has submission delays very similar to release’s submission delays.  The laggard is now beta… and I really can’t figure out why. If Beta users are leaving their browsers open longer, we’d expect to see them be on the slower side of the “Recording Delay CDF” plot. If Beta users are leaving their browser closed longer, we’d expect them to show up lower on Engagement Ratio plots (which they don’t).

A mystery.

Not a mystery is that nightly has the fastest submission times. It receives updates every day so users have an incentive to restart their browsers often.

Comparing Submission Delay to Recording Delay, you can see how this is where we’re focusing most of our “Get More Data, Faster” attentions. If we wait for 90% of “main” pings to arrive, then we have to wait at least 17 hours for nightly data, 28 hours for release and aurora… and over 96 hours for beta.

And that’s just Submission Delay. What if we measured the full client -> server delay for data?

Combined Client Delay

output_27_1

With things the way they were on 2017-01-10, to get 90% of “main” pings we need to wait a minimum of 22 hours (nightly) and a maximum of… you know what, I don’t even know. I can’t tell where beta might cross the 0.9 line, but it certainly isn’t within 96 hours.

If we limit ourselves to 80% we’re back to a much more palatable 11 hours (nightly) to 27 hours (beta). But that’s still pretty horrendous.

I’m afraid things are actually even worse than I’m making it out to be. We rely on getting counts out of “main” pings. To count something, you need to count every single individual something. This means we need 100% of these pings, or as near as we can get. Even nightly pings take longer than 96 hours to get us more than 95% of the way there.

What do we use “main” pings to count? Amongst other things, “usage hours” or “how long has Firefox been open”. This is imperative to normalizing crash information properly so we can determine the health and stability of a release.

As you can imagine, we’re interested in knowing this as fast as possible. And as things stood a couple of Tuesdays ago, we have a lot of room for improvement.

For now, expect more analyses like this one (and more blog posts like this one) examining how slowly or quickly we can possibly get our data from the users who generate it to the Mozillians who use it to improve Firefox.

:chutten

[1]: Why did I look at pings from 2017-01-10? It was a recent Tuesday (less weekend effect) well after Gregorian New Year’s Day, well before Chinese New Year’s Day, and even a decent distance from Epiphany. Also the 01-10 is a mirror which I thought was neat.

What’s the First Firefox Crash a User Sees?

Growth is going to be a big deal across Mozilla in 2017. We spent 2016 solidifying our foundations, and now we’re going to use that to spring to action and grow our influence and user base.

So this got me thinking about new users. We’re constantly getting new users: people who, for one reason or another, choose to install and run Firefox for the first time today. They run it and… well, then what?

Maybe they like it. They open a new tab. Then they open a staggeringly unbelievable number of tabs. They find and install an addon. Or two.

Fresh downloads and installs of Firefox continue at an excellent pace. New people, every day, are choosing Firefox.

So with the number of new users we already see, the key to Growth may not lie in attracting more of them… it might be that we need to keep the ones we already see.

So what might stop a user from using Firefox? Maybe after they open the seventy-first tab, Firefox crashes. It just disappears on them. They open it again, browse for a little while… but can’t forget that the browser, at any time, could just decide to disappear and let them down. So they migrate back to something else, and we lose them.

It is with these ideas in my head that I wondered “Are there particular types of crashes that happen to new users? Do they more likely crash because of a plugin, their GPU misbehaving, running out of RAM… What is their first crash, and how might it compare to the broader ecosystem of crashes we see and fix every day?”

With the new data available to me thanks to Gabriele Svelto’s work on client-side stack traces, I figured I could maybe try to answer it.

My full analysis is here, but let me summarize: sadly there’s too much noise in the process to make a full diagnosis. There are some strange JSON errors I haven’t tracked down… and even if I do, there are too many “mystery” stack frames that we just don’t have a mechanism to figure out yet.

And this isn’t even covering how we need some kind of service or algorithm to detect signatures in these stacks or at least cluster them in some meaningful way.

Work is ongoing, so I hope to have a more definite answer in the future. But for now, all I can do is invite you to tell me what you think causes users to stop using Firefox. You can find me on Twitter, or through my @mozilla.com email address.

:chutten

Data Science is Hard – Case Study: What is a Firefox Crash?

In the past I’ve gone on at length about the challenge of getting timely data to determine Firefox release quality with respect to how often Firefox crashes. Comparatively I’ve spent essentially no time at all on what a crash actually is.

A crash (broadly) is what happens when a computer process encounters an error it cannot recover from. Since it cannot recover, the system it is running within ends the process abruptly.

Not all crashes are equal. Not all crashes mean the same thing to users and to release managers and to computer programmers.

If you are in the middle of drafting an email and the web page content suddenly goes blank and says “Sorry, this tab has crashed.” then that’s a big deal. It’s even worse if the entire browser disappears without warning.

But what if Firefox crashes, but only after it has mostly shut down? Everything’s been saved properly, but we didn’t clean up after ourselves well. This is a crash (technically), but does it really matter to a user?

What if the process that contains Flash crashes and web advertisements stop working? It can be restarted without too much trouble, and no one likes ads, so is it really that bad of a thing?

And on top of these families of events, there are other horrible things that can happen to users we might want to call “crashes” even though they aren’t. For instance: what if the browser becomes completely unresponsive and the user has no recourse but to close it? The process didn’t encounter a fatal error, but that user’s situation is the same: Something weird happened, and now their data is gone.

Generally speaking, I look at four classes of crash: Main Crashes (M), Content Crashes (C), Content Shutdown Crashes (S), and Plugin Crashes (P).

In my opinion, the most reliable indicator of Firefox’s stability and quality is M + C – S. In plain English, it is the sum of the events where the whole Browser goes poof or the Web Content inside the browser goes poof, ignoring the times when the Web Content goes poof after the user has decided to shut down the browser.

It doesn’t include Plugin crashes, as those are less obtrusive and more predicted by the plugin code, not Firefox code. It does include some events where Firefox became unresponsive (or “hangs” for short) and had to be terminated.

This, to my mind, most accurately encompasses a measure of Firefox quality. If the number of these crashes goes up, that means there are more times where more users are having less fun with Firefox. If the number of these crashes goes down, that means there are fewer times that fewer people are having less fun with Firefox.

It doesn’t tell the whole story. What good is a not-crashing browser if it doesn’t scroll when you ask it to? What good is a stable piece of web content if half of it is missing because we don’t support it? What good is a Firefox that is open all the time if it takes twice as long to load the web pages you care about?

But it gives us one very important part of the Firefox Quality story, and that’s good enough for me.

:chutten