Another Advantage of Decreasing Data Latency: Flatter Graphs

I’ve muttered before about how difficult it can be to measure application crashes. The most important lesson is that you can’t just count the number of crashes, you must normalize it by some “usage” value in order to determine whether a crashy day is because the application got crashier or because the application was just being used more.

Thus you have a numerator (number of crashes) and a denominator (some proxy of application usage) to determine the crash rate: crashes-per-use.

The current dominant denominator for Firefox is “thousand hours that Firefox is open,” or “kilo-usage-hours (kuh).”

The biggest problem we’ve been facing lately is how our numerator (number of crashes) comes in at a different rate and time than our denominator (kilo-usage-hours) due to the former being transmitted nearly-immediately via “crash” ping and the former being transmitted occasionally via “main” ping.

With pingsender now sending most “main” pings as soon as they’re created, our client submission delay for “main” pings is now roughly in line with the client submission delay of “crash” pings.

What does this mean? Well, look at this graph from https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-25 Crash Rates (Telemetry)

This is the Firefox Beta Main Crash Rate (number of main process crashes on Firefox Beta divided by the number of thousands of hours users had Firefox Beta running) over the past three months or so. The spike in the middle is when we switched from Firefox Beta 54 to Firefox Beta 55. (Most of that spike is a measuring artefact due to a delay between a beta being available and people installing it. Feel free to ignore it for our purposes.)

On the left in the Beta 54 data there is a seven-day cycle where Sundays are the lowest point and Saturday is the highest point.

On the right in the Beta 55 data, there is no seven-day cycle. The rate is flat. (It is a little high, but flat. Feel free to ignore its height for our purposes.)

This is because sending “main” pings with pingsender is behaviour that ships in Firefox 55. Starting with 55, instead of having most of our denominator data (usage hours) coming in one day late due to “main” ping delay, we have that data in-sync with the numerator data (main crashes), resulting in a flat rate.

You can see it in the difference between Firefox ESR 52 (yellow) and Beta 55 (green) in the kusage_hours graph also on https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-27 Crash Rates (Telemetry)

On the left, before Firefox Beta 55’s release, they were both in sync with each other, but one day behind the crash counts. On the right, after Beta 55’s release, notice that Beta 55’s cycle is now one day ahead of ESR 52’s.

This results in still more graphs that are quite satisfying. To me at least.

It also, somewhat more importantly, now makes the crash rate graph less time-variable. This reduces cognitive load on people looking at the graphs for explanations of what Firefox users experience in the wild. Decision-makers looking at these graphs no longer need to mentally subtract from the graph for Saturday numbers, adding that back in somehow for Sundays (and conducting more subtle adjustments through the week).

Now the rate is just the rate. And any change is much more likely to mean a change in crashiness, not some odd day-of-week measurement you can ignore.

I’m not making these graphs to have them ignored.

(many thanks to :philipp for noticing this effect and forcing me to explain it)

:chutten