Another Advantage of Decreasing Data Latency: Flatter Graphs

I’ve muttered before about how difficult it can be to measure application crashes. The most important lesson is that you can’t just count the number of crashes, you must normalize it by some “usage” value in order to determine whether a crashy day is because the application got crashier or because the application was just being used more.

Thus you have a numerator (number of crashes) and a denominator (some proxy of application usage) to determine the crash rate: crashes-per-use.

The current dominant denominator for Firefox is “thousand hours that Firefox is open,” or “kilo-usage-hours (kuh).”

The biggest problem we’ve been facing lately is how our numerator (number of crashes) comes in at a different rate and time than our denominator (kilo-usage-hours) due to the former being transmitted nearly-immediately via “crash” ping and the former being transmitted occasionally via “main” ping.

With pingsender now sending most “main” pings as soon as they’re created, our client submission delay for “main” pings is now roughly in line with the client submission delay of “crash” pings.

What does this mean? Well, look at this graph from https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-25 Crash Rates (Telemetry)

This is the Firefox Beta Main Crash Rate (number of main process crashes on Firefox Beta divided by the number of thousands of hours users had Firefox Beta running) over the past three months or so. The spike in the middle is when we switched from Firefox Beta 54 to Firefox Beta 55. (Most of that spike is a measuring artefact due to a delay between a beta being available and people installing it. Feel free to ignore it for our purposes.)

On the left in the Beta 54 data there is a seven-day cycle where Sundays are the lowest point and Saturday is the highest point.

On the right in the Beta 55 data, there is no seven-day cycle. The rate is flat. (It is a little high, but flat. Feel free to ignore its height for our purposes.)

This is because sending “main” pings with pingsender is behaviour that ships in Firefox 55. Starting with 55, instead of having most of our denominator data (usage hours) coming in one day late due to “main” ping delay, we have that data in-sync with the numerator data (main crashes), resulting in a flat rate.

You can see it in the difference between Firefox ESR 52 (yellow) and Beta 55 (green) in the kusage_hours graph also on https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-27 Crash Rates (Telemetry)

On the left, before Firefox Beta 55’s release, they were both in sync with each other, but one day behind the crash counts. On the right, after Beta 55’s release, notice that Beta 55’s cycle is now one day ahead of ESR 52’s.

This results in still more graphs that are quite satisfying. To me at least.

It also, somewhat more importantly, now makes the crash rate graph less time-variable. This reduces cognitive load on people looking at the graphs for explanations of what Firefox users experience in the wild. Decision-makers looking at these graphs no longer need to mentally subtract from the graph for Saturday numbers, adding that back in somehow for Sundays (and conducting more subtle adjustments through the week).

Now the rate is just the rate. And any change is much more likely to mean a change in crashiness, not some odd day-of-week measurement you can ignore.

I’m not making these graphs to have them ignored.

(many thanks to :philipp for noticing this effect and forcing me to explain it)

:chutten

Advertisements

Latency Improvements, or, Yet Another Satisfying Graph

This is the third in my ongoing series of posts containing satisfying graphs.

Today’s feature: a plot of the mean and 95th percentile submission delays of “main” pings received by Firefox Telemetry from users running Firefox Beta.

Screenshot-2017-7-12 Beta _Main_ Ping Submission Delay in hours (mean, 95th %ile)

We went from receiving 95% of pings after about, say, 130 hours (or 5.5 days) down to getting them within about 55 hours (2 days and change). And the numbers will continue to fall as more beta users get the modern beta builds with lower latency ping sending thanks to pingsender.

What does this mean? This means that you should no longer have to wait a week to get a decently-rigorous count of data that comes in via “main” pings (which is most of our data). Instead, you only have to wait a couple of days.

Some teams were using the rule-of-thumb of ten (10) days before counting anything that came in from “main” pings. We should be able to reduce that significantly.

How significantly? Time, and data, will tell. This quarter I’m looking into what guarantees we might be able to extend about our data quality, which includes timeliness… so stay tuned.

For a more rigorous take on this, partake in any of dexter’s recent reports on RTMO. He’s been tracking the latency improvements and possible increases in duplicate ping rates as these changes have ridden the trains towards release. He’s blogged about it if you want all the rigour but none of Python.

:chutten

FINE PRINT: Yes, due to how these graphs work they will always look better towards the end because the really delayed stuff hasn’t reached us yet. However, even by the standards of the pre-pingsender mean and 95th percentiles we are far enough after the massive improvement for it to be exceedingly unlikely to change much as more data is received. By the post-pingsender standards, it is almost impossible. So there.

FINER PRINT: These figures include adjustments for client clocks having skewed relative to server clocks. Time is a really hard problem when even on a single computer and trying to reconcile it between many computers separated by oceans both literal and metaphorical is the subject of several dissertations and, likely, therapy sessions. As I mentioned above, for rigour and detail about this and other aspects, see RTMO.

Data Science is Hard: Anomalies Part 3

So what do you do when you have a duplicate data problem and it just keeps getting worse?

You detect and discard.

Specifically, since we already have a few billion copies of pings with identical document ids (which are extremely-unlikely to collide), there is no benefit to continue storing them. So what we do is write a short report about what the incoming duplicate looked like (so that we can continue to analyze trends in duplicate submissions), then toss out the data without even parsing it.

As before, I’ll leave finding out the time the change went live as an exercise for the reader:newplot(1)

:chutten

The Most Satisfying Graph

There were a lot of Firefox users on Beta 44.

Usually this is a good thing. We like having a lot of users[citation needed].

It wasn’t a good thing this time, as Beta had already moved on to 45. Then 46. Eventually we were at Beta 52, and the number of users on Beta 44 was increasing.

We thought maybe it was because Beta 44 had the same watershed as Release 43. Watershed? Every user running a build before a watershed must update to the watershed first before updating to the latest build. If you have Beta 41 and the latest is Beta 52, you must first update to Beta 44 (watershed) so we can better ascertain your cryptography support before continuing on to 48, which is another watershed, this time to do with fourteen-year-old processor extensions. Then, and only then, can you proceed to the currently-most-recent version, Beta 52.

(If you install afresh, the installer has the smarts to figure out your computer’s cryptographic and CPU characteristics and suitability so that new users jump straight to the front of the line)

Beta 44 being a watershed should, indeed, require a longer-than-usual lifetime of the version, with respect to population. If this were the only effect at play we’d expect the population to quickly decrease as users updated.

But they didn’t update.

It turns out that whenever the Beta 44 users attempted to download an update to that next watershed release, Beta 48, they were getting a 404 Not Found. At some point, the watershed Beta 48 build on download.mozilla.org was removed, possibly due to age (we can’t keep everything forever). So whenever the users on Beta 44 wanted to update, they couldn’t. To compound things, any time a user before Beta 44 wanted to update, they had to go through Beta 44. Where they were caught.

This was fixed on… well, I’ll let you figure out which day it was fixed on:

beta44_dau

This is now the most satisfying graph I’ve ever plotted at Mozilla.

:chutten