Data Science is Hard – Case Study: Latency of Firefox Crash Rates

Firefox crashes sometimes. This bothers users, so a large amount of time, money, and developer effort is devoted to keep it from happening.

That’s why I like that image of Firefox Aurora’s crash rate from my post about Firefox’s release model. It clearly demonstrates the result of those efforts to reduce crash events:auroramcscrashes

So how do we measure crashes?

That picture I like so much comes from this dashboard I’m developing, and uses a very specific measure of both what a crash is, and what we normalize it by so we can use it as a measure of Firefox’s quality.

Specifically, we count the number of times Firefox or the web page content disappears from the user’s view without warning. Unfortunately, this simple count of crash events doesn’t give us a full picture of Firefox’s quality, unless you think Firefox is miraculously 30% less crashy on weekends:aurora51a2crashes

So we need to normalize it based on some measure of how much Firefox is being used. We choose to normalize it by thousands of “usage hours” where a usage hour is one hour Firefox was open and running for a user without crashing.

Unfortunately, this choice of crashes per thousand usage hours as our metric, and how we collect data to support it, has its problems. Most significant amongst these problems is the delay between when a new build is released and when this measure can tell you if it is a good build or not.

Crashes tend to come in quickly. Generally speaking, when a user’s Firefox disappears out from under them, they are quick to restart it. This means this new Firefox instance is able to send us information about that crash usually within minutes of it happening. So for now, we can ignore the delay between a crash happening and our servers being told about it.

The second part is harder: when should users tell us that everything is fine?everythingisok

We can introduce code into Firefox that would tell us every minute that nothing bad happened… but could you imagine the bandwidth costs? Even every hour might be too often. Presently we record this information when the user closes their browser (or if the user doesn’t close their browser, at the user’s local midnight).

The difference between the user experiencing an hour of un-crashing Firefox and that data being recorded is recording delay. This tends to not exceed 24 hours.

If the user shuts down their browser for the day, there isn’t an active Firefox instance to send us the data for collection. This means we have to wait for the next time the user starts up Firefox to send us their “usage hours” number. If this was a Friday’s record, it could easily take until Monday to be sent.

The difference between the data being recorded and the data being sent is the submission delay. This can take an arbitrary length of time, but we tend to see a decent chunk of the data within two days’ time.

This data is being sent in throughout each and every day. Somewhere at this very moment (or very soon) a user is starting up Firefox and that Firefox will send us some Telemetry. We have the facilities to calculate at any given time the usage hours and the crash numbers for each and every part of each and every day… but this would be a wasteful approach. Instead, a scheduled task performs an aggregation of crash figures and usage hour records per day. This happens once per day and the result is put in the CrashAggregates dataset.

The difference between a crash or usage hour record being submitted and it being present in this daily derived dataset is aggregation delay. This can be anywhere from 0 to 23 hours.

This dataset is stored in one format (parquet), but queried in another (prestodb fronted by re:dash). This migration task is performed once per day some time after the dataset is derived.

The difference between the aggregate dataset being derived and its appearance in the query interface is migration delay. This is roughly an hour or two.

Many queries run against this dataset and are scheduled sequentially or on an ad hoc basis. The query that supplies the data to the telemetry crash dashboard runs once per day at 2pm UTC.

The difference between the dataset being present in the query interface and the query running is query scheduling delay. This is about an hour.

This provides us with a handy equation:

latency = reporting delay + submission delay + aggregation delay + migration delay + query scheduling delay

With typical values, we’re seeing:

latency = 6 hours + 24 hours + 12 hours + 1 hour + 1 hour

latency = 2 days

And since submission delay is unbounded (and tends to be longer than 24 hours on weekends and over holidays), the latency is actually a range of probable values. We’re never really sure when we’ve heard from everyone.

So what’s to blame, and what can we do about it?

The design of Firefox’s Telemetry data reporting system is responsible for reporting delay and submission delay: two of the worst offenders. submission delay could be radically improved if we devoted engineering resources to submitting Telemetry (both crash numbers and “usage hour” reports) without an active Firefox running (using, say, a small executable that runs as soon as Firefox crashes or closes). reporting delay will probably not be adjusted very much as we don’t want to saturate our users’ bandwidth (or our own).

We can improve aggregation delay simply by running the aggregation, migration, and query multiple times a day, as information is coming in. Proper scheduling infrastructure can remove all the non-processing overhead from migration delay and query scheduling delay which can bring them easily down below a single hour, combined.

In conclusion, even given a clear and specific metric and a data collection mechanism with which to collect all the data necessary to measure it, there are still problems when you try to use it to make timely decisions. There are technical solutions to these technical problems, but they require a focused approach to improve the timeliness of reported data.