Another Advantage of Decreasing Data Latency: Flatter Graphs

I’ve muttered before about how difficult it can be to measure application crashes. The most important lesson is that you can’t just count the number of crashes; you must normalize that count by some “usage” value to determine whether a crashy day happened because the application got crashier or simply because it was being used more.

Thus you have a numerator (number of crashes) and a denominator (some proxy of application usage) to determine the crash rate: crashes-per-use.

The current dominant denominator for Firefox is “thousand hours that Firefox is open,” or “kilo-usage-hours (kuh).”
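For concreteness, here’s that arithmetic as a tiny Python sketch. The counts are invented; only the formula (crashes divided by thousands of usage hours) is real.

```python
# Invented figures for a single day, purely for illustration.
main_crashes = 2400        # numerator: count of crashes
usage_hours = 800000       # denominator proxy: hours Firefox was open

kilo_usage_hours = usage_hours / 1000          # "kuh"
crash_rate = main_crashes / kilo_usage_hours   # crashes per thousand usage hours

print(f"{crash_rate:.1f} crashes per kilo-usage-hour")  # 3.0
```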

The biggest problem we’ve been facing lately is that our numerator (number of crashes) comes in at a different rate and time than our denominator (kilo-usage-hours), due to the former being transmitted nearly-immediately via “crash” ping and the latter being transmitted occasionally via “main” ping.

With pingsender now sending most “main” pings as soon as they’re created, our client submission delay for “main” pings is now roughly in line with the client submission delay of “crash” pings.

What does this mean? Well, look at this graph from https://telemetry.mozilla.org/crashes:

[Figure: Crash Rates (Telemetry) dashboard, 2017-07-25]

This is the Firefox Beta Main Crash Rate (number of main process crashes on Firefox Beta divided by the number of thousands of hours users had Firefox Beta running) over the past three months or so. The spike in the middle is when we switched from Firefox Beta 54 to Firefox Beta 55. (Most of that spike is a measuring artefact due to a delay between a beta being available and people installing it. Feel free to ignore it for our purposes.)

On the left, in the Beta 54 data, there is a seven-day cycle where Sundays are the lowest point and Saturdays are the highest.

On the right in the Beta 55 data, there is no seven-day cycle. The rate is flat. (It is a little high, but flat. Feel free to ignore its height for our purposes.)

This is because sending “main” pings with pingsender is behaviour that ships in Firefox 55. Starting with 55, instead of having most of our denominator data (usage hours) coming in one day late due to “main” ping delay, we have that data in-sync with the numerator data (main crashes), resulting in a flat rate.
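To see why an out-of-sync denominator makes a wave, here’s a toy simulation (all numbers invented) with a constant underlying crash rate and usage that dips on the weekend. With the denominator one day late the rate oscillates; in sync, it’s flat. The exact shape of the wave depends on the usage pattern, so don’t read the toy’s Saturdays and Sundays too literally.

```python
# Toy model: the true crash rate is constant, but usage dips on weekends.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
usage_kuh = [100, 100, 100, 100, 100, 60, 50]  # invented usage, in kuh
true_rate = 3.0                                # crashes per kuh
crashes = [true_rate * u for u in usage_kuh]   # crashes arrive the same day

for i, day in enumerate(days):
    in_sync = crashes[i] / usage_kuh[i]        # denominator from the same day
    lagged = crashes[i] / usage_kuh[i - 1]     # denominator from the day before (Monday uses Sunday's)
    print(f"{day}: in-sync {in_sync:.1f}, one-day-late {lagged:.1f}")
```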

You can see it in the difference between Firefox ESR 52 (yellow) and Beta 55 (green) in the kusage_hours graph also on https://telemetry.mozilla.org/crashes:

[Figure: Crash Rates (Telemetry) dashboard, 2017-07-27]

On the left, before Firefox Beta 55’s release, they were both in sync with each other, but one day behind the crash counts. On the right, after Beta 55’s release, notice that Beta 55’s cycle is now one day ahead of ESR 52’s.

This results in still more graphs that are quite satisfying. To me at least.

It also, somewhat more importantly, now makes the crash rate graph less time-variable. This reduces cognitive load on people looking at the graphs for explanations of what Firefox users experience in the wild. Decision-makers looking at these graphs no longer need to mentally subtract from the graph for Saturday numbers, adding that back in somehow for Sundays (and conducting more subtle adjustments through the week).

Now the rate is just the rate. And any change is much more likely to mean a change in crashiness, not some odd day-of-week measurement you can ignore.

I’m not making these graphs to have them ignored.

(many thanks to :philipp for noticing this effect and forcing me to explain it)

:chutten


Data Science is Hard: Client Delays for Crash Pings

Second verse, much like the first: how quickly do we get data from clients?

This time: crash pings.

Recording Delay

The recording delay of crash pings differs from that of main pings in that the only time information we have about the crash itself is crashDate, which tells you only the day the crash happened, not the time. This results in a weird stair-step pattern on the plot, as I make a big assumption:

Assumption: If the crash ping was created on the same day that the crash happened, it took essentially 0 time to do so. (If I didn’t make this assumption, the plot would have every line at 0 for the first 24 hours and we’d not have as much information displayed before the 96-hour max)

[Figure: recording delay CDF for crash pings, by channel]

The recording delay for crash pings is the time between the crash happening and the user restarting their browser. As expected, most users appear to restart their browser immediately. Even the slowest channel (release) has over 80% of its crash pings recorded within two days.
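Here’s a minimal sketch of how that same-day assumption could be encoded. The field names (crash_date, creation_ts) are stand-ins of my own, not the actual ping schema.

```python
from datetime import datetime, date

def recording_delay_hours(crash_date: date, creation_ts: datetime) -> float:
    """Hours between the crash and the crash ping's creation.

    crash_date only has day resolution, so if the ping was created on the
    same day as the crash we assume the delay was essentially zero;
    otherwise we measure from midnight of the crash day, which produces
    the stair-step pattern described above.
    """
    if creation_ts.date() == crash_date:
        return 0.0
    midnight = datetime.combine(crash_date, datetime.min.time())
    return (creation_ts - midnight).total_seconds() / 3600.0

# Example: crash on June 1st, ping created the morning of June 3rd.
print(recording_delay_hours(date(2017, 6, 1), datetime(2017, 6, 3, 9, 30)))  # 57.5
```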

Submission Delay

The submission delay for crash pings, as with all pings, is the time between the creation of the ping and the sending of the ping. What makes the crash ping special is that it isn’t even created until the browser has restarted, so I expected these to be quite short:

[Figure: submission delay CDF for crash pings, by channel]

They do not disappoint. Every branch but Nightly has 9 out of every 10 crash pings sent within minutes of their creation.

Nightly is a weird one. It starts off having the worst proportion of created pings unsent, but then becomes the best.

Really, all four of these lines should be within an error margin of just being a flat line at the top of the graph, since the code that creates the ping is pretty much the same code that sends it. How in the world are so many crash pings remaining unsent at first, yet being sent eventually?

Terribly mysterious.
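For the curious, here’s roughly how one could compute a per-channel submission delay CDF like the one plotted above, using pandas. The column names are stand-ins, not the real ping fields, and the records are invented.

```python
import pandas as pd

# Invented ping records; the real data comes from the Telemetry pipeline.
pings = pd.DataFrame({
    "channel": ["release", "release", "beta", "nightly"],
    "creation_ts": pd.to_datetime(["2017-06-01 10:00", "2017-06-01 11:00",
                                   "2017-06-01 09:00", "2017-06-01 08:00"]),
    "submission_ts": pd.to_datetime(["2017-06-01 10:02", "2017-06-02 12:00",
                                     "2017-06-01 09:01", "2017-06-03 08:00"]),
})

pings["delay_h"] = (
    (pings["submission_ts"] - pings["creation_ts"]).dt.total_seconds() / 3600
)

# Fraction of each channel's pings submitted within h hours, for h up to 96.
cdf = {
    channel: [(group["delay_h"] <= h).mean() for h in range(97)]
    for channel, group in pings.groupby("channel")
}
print(cdf["release"][:3])  # fraction submitted within 0h, 1h, 2h
```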

Combined Delay

[Figure: combined client delay CDF for crash pings, by channel]

The combined client delay for crash pings shows that we ought to have over 80% of all crash pings from all channels within a day or two of the crash happening. The coarseness of the crashDate measure makes it hard to say exactly how many and when, but the curve is clearly a much faster one than for the main ping delays previously examined.

Crash Rates

For crash rates that use crashes counted from crash pings and some normalization factor (like usage hours) counted from main pings, it doesn’t actually matter how fast or slow these pings come in. If only 50% of crashes and 50% of usage hours came in within a day, the crash rate would still be correct.
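A quick arithmetic check with invented totals: scaling both counts by the same fraction leaves the ratio alone; scaling them by different fractions does not.

```python
total_crashes = 3000
total_kuh = 1000  # invented totals; the true rate is 3.0 crashes per kuh

# Same proportion of both has arrived: the rate is already correct.
print((0.5 * total_crashes) / (0.5 * total_kuh))  # 3.0

# Crashes have arrived faster than usage hours: the early rate is inflated.
print((0.8 * total_crashes) / (0.5 * total_kuh))  # 4.8
```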

What does matter is when the pings arrive at different speeds:

[Figure: combined client delay CDFs for crash pings and main pings, overlaid by channel]

(Please forgive my awful image editing work)

Anywhere that the two same-coloured lines fail to overlap is a time when the server-recorded count of crashes from crash pings will not be from the same proportion of the population as the server-recorded count of usage hours from main pings.

For example: On release (dark blue), if we look at the crash rate 22 or 30-36 hours out from a given period, it is likely to approximate what a final tally will give us. But if we check early (before 22h, or between 22 and 30h), when the main pings are lagging, the crash rate will seem higher than reality. If we check later (after 36h), the crash rate will seem lower.

This is where the tyranny of having a day-resolution crashDate really comes into its own. If we could model exactly when a channel’s crash and main submission proportions are equal, we could use that to generate accurate approximations of the final crash rate. Right now, the rather-exact figures I’m giving in the previous paragraph may have no bearing on reality.
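If the curves ever do prove stable, a correction could look something like this sketch: divide the naive early rate by the ratio of the two submission proportions at that moment. The proportions below are invented placeholders, not measurements.

```python
def corrected_crash_rate(crashes_so_far, kuh_so_far,
                         crash_proportion, main_proportion):
    """Estimate the final crash rate from partial counts.

    crash_proportion and main_proportion are the (assumed known and stable)
    fractions of crash pings and main pings expected to have arrived by now.
    """
    naive_rate = crashes_so_far / kuh_so_far
    # Crashes are over-represented by this factor relative to usage hours.
    bias = crash_proportion / main_proportion
    return naive_rate / bias

# Invented example: 90% of crash pings but only 60% of main pings are in,
# so the naive rate (4.5) overstates the eventual rate (3.0) by 1.5x.
print(corrected_crash_rate(2700, 600, crash_proportion=0.9, main_proportion=0.6))
```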

Conclusion

If we are to use crash pings and main pings together to measure “something”, we need to fully understand and measure the differences in their client-side delays. If the curves above are stable, we might be able to model their differences with some degree of accuracy. This would require a higher-resolution crash timestamp.

If we wish to use this measured “something” earlier than 24h from the event (like, say, to measure how crashy a new release is), we need to either choose a method that doesn’t rely on main pings, or speed up main ping reporting so that it has a curve closer to that of crash pings.

To do my part I will see if having a better crash timestamp (hours would do, minutes would be the most I would need) is something we might be willing to pursue, and I will lobby for the rapid completion and adoption of pingSender as a method for turning main pings’ submission delay CDF into a carbon copy of crash pings’.

Please peruse the full analysis on reports.telemetry.mozilla.org if you are interested in the details of how these graphs were generated.

:chutten

What’s the First Firefox Crash a User Sees?

Growth is going to be a big deal across Mozilla in 2017. We spent 2016 solidifying our foundations, and now we’re going to use that to spring into action and grow our influence and user base.

So this got me thinking about new users. We’re constantly getting new users: people who, for one reason or another, choose to install and run Firefox for the first time today. They run it and… well, then what?

Maybe they like it. They open a new tab. Then they open a staggeringly unbelievable number of tabs. They find and install an addon. Or two.

Fresh downloads and installs of Firefox continue at an excellent pace. New people, every day, are choosing Firefox.

So with the number of new users we already see, the key to Growth may not lie in attracting more of them… it might be that we need to keep the ones we already see.

So what might stop a user from using Firefox? Maybe after they open the seventy-first tab, Firefox crashes. It just disappears on them. They open it again, browse for a little while… but can’t forget that the browser, at any time, could just decide to disappear and let them down. So they migrate back to something else, and we lose them.

It is with these ideas in my head that I wondered “Are there particular types of crashes that happen to new users? Are they more likely to crash because of a plugin, their GPU misbehaving, running out of RAM… What is their first crash, and how might it compare to the broader ecosystem of crashes we see and fix every day?”

With the new data available to me thanks to Gabriele Svelto’s work on client-side stack traces, I figured I could maybe try to answer it.

My full analysis is here, but let me summarize: sadly there’s too much noise in the process to make a full diagnosis. There are some strange JSON errors I haven’t tracked down… and even if I do, there are too many “mystery” stack frames that we just don’t have a mechanism to figure out yet.

And this isn’t even covering how we need some kind of service or algorithm to detect signatures in these stacks or at least cluster them in some meaningful way.

Work is ongoing, so I hope to have a more definite answer in the future. But for now, all I can do is invite you to tell me what you think causes users to stop using Firefox. You can find me on Twitter, or through my @mozilla.com email address.

:chutten

Data Science is Hard – Case Study: What is a Firefox Crash?

In the past I’ve gone on at length about the challenge of getting timely data to determine Firefox release quality with respect to how often Firefox crashes. Comparatively I’ve spent essentially no time at all on what a crash actually is.

A crash (broadly) is what happens when a computer process encounters an error it cannot recover from. Since it cannot recover, the system it is running within ends the process abruptly.

Not all crashes are equal. Not all crashes mean the same thing to users and to release managers and to computer programmers.

If you are in the middle of drafting an email and the web page content suddenly goes blank and says “Sorry, this tab has crashed.” then that’s a big deal. It’s even worse if the entire browser disappears without warning.

But what if Firefox crashes, but only after it has mostly shut down? Everything’s been saved properly, but we didn’t clean up after ourselves well. This is a crash (technically), but does it really matter to a user?

What if the process that contains Flash crashes and web advertisements stop working? It can be restarted without too much trouble, and no one likes ads, so is it really that bad of a thing?

And on top of these families of events, there are other horrible things that can happen to users that we might want to call “crashes” even though they aren’t. For instance: what if the browser becomes completely unresponsive and the user has no recourse but to close it? The process didn’t encounter a fatal error, but that user’s situation is the same: something weird happened, and now their data is gone.

Generally speaking, I look at four classes of crash: Main Crashes (M), Content Crashes (C), Content Shutdown Crashes (S), and Plugin Crashes (P).

In my opinion, the most reliable indicator of Firefox’s stability and quality is M + C – S. In plain English, it is the sum of the events where the whole Browser goes poof or the Web Content inside the browser goes poof, ignoring the times when the Web Content goes poof after the user has decided to shut down the browser.
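As a trivial sketch (the crash-class names follow the description above; the counts are invented):

```python
def stability_measure(main, content, content_shutdown):
    """M + C - S: browser and content crashes, minus content crashes that
    happened while the browser was already shutting down."""
    return main + content - content_shutdown

# Invented daily counts, for illustration only.
print(stability_measure(main=1200, content=4500, content_shutdown=1800))  # 3900
```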

It doesn’t include Plugin crashes, as those are less obtrusive and more predicted by the plugin code, not Firefox code. It does include some events where Firefox became unresponsive (or “hangs” for short) and had to be terminated.

This, to my mind, most accurately encompasses a measure of Firefox quality. If the number of these crashes goes up, that means there are more times where more users are having less fun with Firefox. If the number of these crashes goes down, that means there are fewer times that fewer people are having less fun with Firefox.

It doesn’t tell the whole story. What good is a not-crashing browser if it doesn’t scroll when you ask it to? What good is a stable piece of web content if half of it is missing because we don’t support it? What good is a Firefox that is open all the time if it takes twice as long to load the web pages you care about?

But it gives us one very important part of the Firefox Quality story, and that’s good enough for me.

:chutten

Data Science is Hard – Case Study: Latency of Firefox Crash Rates

Firefox crashes sometimes. This bothers users, so a large amount of time, money, and developer effort is devoted to keep it from happening.

That’s why I like that image of Firefox Aurora’s crash rate from my post about Firefox’s release model. It clearly demonstrates the result of those efforts to reduce crash events:

[Figure: Firefox Aurora crash rate over time]

So how do we measure crashes?

That picture I like so much comes from this dashboard I’m developing, and uses a very specific measure of both what a crash is, and what we normalize it by so we can use it as a measure of Firefox’s quality.

Specifically, we count the number of times Firefox or the web page content disappears from the user’s view without warning. Unfortunately, this simple count of crash events doesn’t give us a full picture of Firefox’s quality, unless you think Firefox is miraculously 30% less crashy on weekends:

[Figure: daily crash counts on Firefox Aurora 51]

So we need to normalize it based on some measure of how much Firefox is being used. We choose to normalize it by thousands of “usage hours” where a usage hour is one hour Firefox was open and running for a user without crashing.

Unfortunately, this choice of crashes per thousand usage hours as our metric, and how we collect data to support it, has its problems. Most significant amongst these problems is the delay between when a new build is released and when this measure can tell you if it is a good build or not.

Crashes tend to come in quickly. Generally speaking, when a user’s Firefox disappears out from under them, they are quick to restart it. This means this new Firefox instance is able to send us information about that crash usually within minutes of it happening. So for now, we can ignore the delay between a crash happening and our servers being told about it.

The second part is harder: when should users tell us that everything is fine?

We can introduce code into Firefox that would tell us every minute that nothing bad happened… but could you imagine the bandwidth costs? Even every hour might be too often. Presently we record this information when the user closes their browser (or if the user doesn’t close their browser, at the user’s local midnight).

The difference between the user experiencing an hour of un-crashing Firefox and that data being recorded is recording delay. This tends to not exceed 24 hours.

If the user shuts down their browser for the day, there isn’t an active Firefox instance to send us the data for collection. This means we have to wait for the next time the user starts up Firefox to send us their “usage hours” number. If this was a Friday’s record, it could easily take until Monday to be sent.

The difference between the data being recorded and the data being sent is the submission delay. This can take an arbitrary length of time, but we tend to see a decent chunk of the data within two days’ time.

This data is being sent in throughout each and every day. Somewhere at this very moment (or very soon) a user is starting up Firefox and that Firefox will send us some Telemetry. We have the facilities to calculate at any given time the usage hours and the crash numbers for each and every part of each and every day… but this would be a wasteful approach. Instead, a scheduled task performs an aggregation of crash figures and usage hour records per day. This happens once per day and the result is put in the CrashAggregates dataset.
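Conceptually, that daily aggregation looks something like this pandas sketch. The real job runs on the Telemetry pipeline, not pandas, and the column names are illustrative stand-ins rather than the actual CrashAggregates schema.

```python
import pandas as pd

# Invented submitted records: one row per ping.
submissions = pd.DataFrame({
    "activity_date": ["2016-10-01", "2016-10-01", "2016-10-01", "2016-10-02"],
    "channel": ["aurora", "aurora", "beta", "aurora"],
    "main_crashes": [1, 0, 2, 0],
    "usage_hours": [3.5, 8.0, 2.0, 6.0],
})

# Roll everything up to one row per day per channel.
daily = (
    submissions
    .groupby(["activity_date", "channel"], as_index=False)
    .agg(main_crashes=("main_crashes", "sum"), usage_hours=("usage_hours", "sum"))
)
daily["crash_rate"] = daily["main_crashes"] / (daily["usage_hours"] / 1000)
print(daily)
```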

The difference between a crash or usage hour record being submitted and it being present in this daily derived dataset is aggregation delay. This can be anywhere from 0 to 23 hours.

This dataset is stored in one format (parquet), but queried in another (prestodb fronted by re:dash). This migration task is performed once per day some time after the dataset is derived.

The difference between the aggregate dataset being derived and its appearance in the query interface is migration delay. This is roughly an hour or two.

Many queries run against this dataset and are scheduled sequentially or on an ad hoc basis. The query that supplies the data to the telemetry crash dashboard runs once per day at 2pm UTC.

The difference between the dataset being present in the query interface and the query running is query scheduling delay. This is about an hour.

This provides us with a handy equation:

latency = recording delay + submission delay + aggregation delay + migration delay + query scheduling delay

With typical values, we’re seeing:

latency = 6 hours + 24 hours + 12 hours + 1 hour + 1 hour = 44 hours

latency ≈ 2 days

And since submission delay is unbounded (and tends to be longer than 24 hours on weekends and over holidays), the latency is actually a range of probable values. We’re never really sure when we’ve heard from everyone.
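For completeness, summing the typical values above:

```python
# Typical component delays from above, in hours.
delays_h = {
    "recording": 6,
    "submission": 24,
    "aggregation": 12,
    "migration": 1,
    "query scheduling": 1,
}

total_h = sum(delays_h.values())
print(f"{total_h} hours, or about {total_h / 24:.1f} days")  # 44 hours, about 1.8 days
```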

So what’s to blame, and what can we do about it?

The design of Firefox’s Telemetry data reporting system is responsible for recording delay and submission delay: two of the worst offenders. submission delay could be radically improved if we devoted engineering resources to submitting Telemetry (both crash numbers and “usage hour” reports) without an active Firefox running (using, say, a small executable that runs as soon as Firefox crashes or closes). recording delay will probably not be adjusted very much as we don’t want to saturate our users’ bandwidth (or our own).

We can improve aggregation delay simply by running the aggregation, migration, and query multiple times a day, as information is coming in. Proper scheduling infrastructure can remove all the non-processing overhead from migration delay and query scheduling delay which can bring them easily down below a single hour, combined.

In conclusion, even given a clear and specific metric and a data collection mechanism with which to collect all the data necessary to measure it, there are still problems when you try to use it to make timely decisions. There are technical solutions to these technical problems, but they require a focused approach to improve the timeliness of reported data.

:chutten

All Aboard the Release Train!

I’m working on a dashboard (a word which here means a website that has plots on it) to show Firefox crashes per channel over time. The idea is to create a resource for Release Management to figure out if there’s something wrong with a given version of Firefox with enough lead time to then do something about it.

It’s entering its final form (awaiting further feedback from relman and others to polish it) and while poking around (let’s call it “testing”) I noticed a pattern:

[Figure: Firefox Aurora crash rate over time]

Each one of those spikes happens on a “merge day” when the Aurora channel (the channel powering Firefox Developer Edition) updates to the latest code from the Nightly channel (the channel powering Firefox Nightly). From then on, only stabilizing changes are merged so that when the next merge day comes, we have a nice, stable build to promote from Aurora to the Beta channel (the channel powering Firefox Beta). Beta undergoes further stabilization on a slower schedule (a new build every week or two instead of daily) to become a candidate for the Release channel (which powers Firefox).

For what it’s worth, this is called the Train Model. It allows us to ship stable and secure code with the latest features every six-to-eight weeks to hundreds of millions of users. It’s pretty neat.

And what that picture up there shows is that it works. The improvement is especially noticeable on the Aurora branch where we go from crashing 9-10 times for every thousand hours of use to 3-4 times for every thousand hours.

Now, the number and diversity of users on the Aurora branch is limited, so when the code rides the trains to Beta, the crash rate goes up. Suddenly code that seemed stable across the Aurora userbase is exposed to previously-unseen usage patterns and configurations of hardware and software and addons. This is one of the reasons why our pre-release users are so precious to us: they provide us with the early warnings we need to stabilize our code for the wild diversity of our users as well as the wild diversity of the Web.

If you’d like to poke around at the dashboard yourself, it’s over here. Eventually it’ll get merged into telemetry.mozilla.org when it’s been reviewed and formalized.

If you have any brilliant ideas of how to improve it, or find any mistakes or bugs, please comment on the tracking bug where the discussion is currently ongoing.

If you’d like to help in a different way, consider installing Firefox Beta. It should look and act almost exactly like your current Firefox, but with some pre-release goodies that you get to use before anyone else (like how Firefox Beta 50 lets you use Reader Mode when you want to print a web page to skip all the unnecessary sidebars and ads). By doing this you are helping us ship the best Firefox we can.

:chutten