Another Advantage of Decreasing Data Latency: Flatter Graphs

I’ve muttered before about how difficult it can be to measure application crashes. The most important lesson is that you can’t just count the number of crashes, you must normalize it by some “usage” value in order to determine whether a crashy day is because the application got crashier or because the application was just being used more.

Thus you have a numerator (number of crashes) and a denominator (some proxy of application usage) to determine the crash rate: crashes-per-use.

The current dominant denominator for Firefox is “thousand hours that Firefox is open,” or “kilo-usage-hours (kuh).”

The biggest problem we’ve been facing lately is how our numerator (number of crashes) comes in at a different rate and time than our denominator (kilo-usage-hours) due to the former being transmitted nearly-immediately via “crash” ping and the former being transmitted occasionally via “main” ping.

With pingsender now sending most “main” pings as soon as they’re created, our client submission delay for “main” pings is now roughly in line with the client submission delay of “crash” pings.

What does this mean? Well, look at this graph from https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-25 Crash Rates (Telemetry)

This is the Firefox Beta Main Crash Rate (number of main process crashes on Firefox Beta divided by the number of thousands of hours users had Firefox Beta running) over the past three months or so. The spike in the middle is when we switched from Firefox Beta 54 to Firefox Beta 55. (Most of that spike is a measuring artefact due to a delay between a beta being available and people installing it. Feel free to ignore it for our purposes.)

On the left in the Beta 54 data there is a seven-day cycle where Sundays are the lowest point and Saturday is the highest point.

On the right in the Beta 55 data, there is no seven-day cycle. The rate is flat. (It is a little high, but flat. Feel free to ignore its height for our purposes.)

This is because sending “main” pings with pingsender is behaviour that ships in Firefox 55. Starting with 55, instead of having most of our denominator data (usage hours) coming in one day late due to “main” ping delay, we have that data in-sync with the numerator data (main crashes), resulting in a flat rate.

You can see it in the difference between Firefox ESR 52 (yellow) and Beta 55 (green) in the kusage_hours graph also on https://telemetry.mozilla.org/crashes:

Screenshot-2017-7-27 Crash Rates (Telemetry)

On the left, before Firefox Beta 55’s release, they were both in sync with each other, but one day behind the crash counts. On the right, after Beta 55’s release, notice that Beta 55’s cycle is now one day ahead of ESR 52’s.

This results in still more graphs that are quite satisfying. To me at least.

It also, somewhat more importantly, now makes the crash rate graph less time-variable. This reduces cognitive load on people looking at the graphs for explanations of what Firefox users experience in the wild. Decision-makers looking at these graphs no longer need to mentally subtract from the graph for Saturday numbers, adding that back in somehow for Sundays (and conducting more subtle adjustments through the week).

Now the rate is just the rate. And any change is much more likely to mean a change in crashiness, not some odd day-of-week measurement you can ignore.

I’m not making these graphs to have them ignored.

(many thanks to :philipp for noticing this effect and forcing me to explain it)

:chutten

Advertisements

Latency Improvements, or, Yet Another Satisfying Graph

This is the third in my ongoing series of posts containing satisfying graphs.

Today’s feature: a plot of the mean and 95th percentile submission delays of “main” pings received by Firefox Telemetry from users running Firefox Beta.

Screenshot-2017-7-12 Beta _Main_ Ping Submission Delay in hours (mean, 95th %ile)

We went from receiving 95% of pings after about, say, 130 hours (or 5.5 days) down to getting them within about 55 hours (2 days and change). And the numbers will continue to fall as more beta users get the modern beta builds with lower latency ping sending thanks to pingsender.

What does this mean? This means that you should no longer have to wait a week to get a decently-rigorous count of data that comes in via “main” pings (which is most of our data). Instead, you only have to wait a couple of days.

Some teams were using the rule-of-thumb of ten (10) days before counting anything that came in from “main” pings. We should be able to reduce that significantly.

How significantly? Time, and data, will tell. This quarter I’m looking into what guarantees we might be able to extend about our data quality, which includes timeliness… so stay tuned.

For a more rigorous take on this, partake in any of dexter’s recent reports on RTMO. He’s been tracking the latency improvements and possible increases in duplicate ping rates as these changes have ridden the trains towards release. He’s blogged about it if you want all the rigour but none of Python.

:chutten

FINE PRINT: Yes, due to how these graphs work they will always look better towards the end because the really delayed stuff hasn’t reached us yet. However, even by the standards of the pre-pingsender mean and 95th percentiles we are far enough after the massive improvement for it to be exceedingly unlikely to change much as more data is received. By the post-pingsender standards, it is almost impossible. So there.

FINER PRINT: These figures include adjustments for client clocks having skewed relative to server clocks. Time is a really hard problem when even on a single computer and trying to reconcile it between many computers separated by oceans both literal and metaphorical is the subject of several dissertations and, likely, therapy sessions. As I mentioned above, for rigour and detail about this and other aspects, see RTMO.

Data Science is Hard: Anomalies Part 3

So what do you do when you have a duplicate data problem and it just keeps getting worse?

You detect and discard.

Specifically, since we already have a few billion copies of pings with identical document ids (which are extremely-unlikely to collide), there is no benefit to continue storing them. So what we do is write a short report about what the incoming duplicate looked like (so that we can continue to analyze trends in duplicate submissions), then toss out the data without even parsing it.

As before, I’ll leave finding out the time the change went live as an exercise for the reader:newplot(1)

:chutten

Data Science is Hard: Anomalies Part 2

Apparently this is one of those problems that jumps two orders of magnitude if you ignore it:

aurora51-submissions

Since last time we’ve noticed that the vast majority of these incoming pings are duplicate. I don’t mean that they look similar, I mean that they are absolutely identical down to their supposedly-unique document ids.

How could this happen?

Well, with a minimum of speculation we can assume that however these Firefox instances are being distributed, they are being distributed with full copies of the original profile data directory. This would contain not only the user’s configuration information, but also copies of all as-yet-unsent pings. Once the distributed Firefox instance was started in its new home, it would submit these pending pings, which would explain why they are all duplicated: the distributor copy-pasta’d them.

So if we want to learn anything about the population of machines that are actually running these instances, we need to ignore all of these duplicate pings. So I took my analysis from last time and tweaked it.

First off, to demonstrate just how much of the traffic spike we see is the same fifteen duplicate pings, here is a graph of ping volume vs unique ping volume:

output_12_0

The count of non-duplicated pings is minuscule. We can conclude from this that most of these distributed Firefox instances rarely get the opportunity to send more than one ping. (Because if they did, we’d see many more unique pings created on their new hosts)

What can we say about these unique pings?

Besides how infrequent they are? They come from instances that all have the same Random Agent Spoofer addon that we saw in the original analysis. None of them are set as the user’s default browser. The hosts are most likely to have a 2.4GHz or 3.5GHz cpu. The hosts come from a geographically-diverse spread of area, with a peculiarly-popular cluster in Montreal (maybe they like the bagels. I know I do).

All of the pings come from computers running Windows XP. I wish I were more surprised by this, but it really does turn out that running software over a decade past its best before is a bad idea.

Also of note: the length of time the browser is open for is far too short (60-75s mostly) for a human to get anything done with it:

output_26_0

(Telemetry needs 60s after Firefox starts up in order to send a ping so it’s possible that there are browsing sessions that are shorter than a minute that we aren’t seeing.)

What can/should be done about these pings?

These pings are coming in at a rate far exceeding what the entire Aurora 51 population had when it was an active release. Yet, Aurora 51 hasn’t been an active release for six months and Aurora itself is going away.

As such, though its volume seems to continue to increase, this anomaly is less and less likely to cause us real problems day-to-day. These pings are unlikely to accidentally corrupt a meaningful analysis or mis-scale a plot.

And with our duplicate detector identifying these pings as they come in, it isn’t clear that this actually poses an analysis risk at all.

So, should we do anything about this?

Well, it is approaching release-channel-levels of volume per-day, submitted evenly at all hours instead of in the hump-backed periodic wave that our population usually generates:

aurora51-duplicateMainPings

Hundreds of duplicates detected every minute means nearly a million pings a day. We can handle it (in the above plot I turned off release, whose low points coincide with aurora’s high points), but should we?

Maybe for Mozilla’s server budget’s sake we should shut down this data after all. There’s no point in receiving yet another billion copies of the exact same document. The only things that differ are the submission timestamp and submitting IP address.

Another point: it is unlikely that these hosts are participating in this distribution of their free will. The rate of growth, the length of sessions, the geographic spread, and the time of day the duplicates arrive at our servers strongly suggest that it isn’t humans who are operating these Firefox installs. Maybe for the health of these hosts on the Internet we should consider some way to hotpatch these wayward instances into quiescence.

I don’t know what we (mozilla) should do. Heck, I don’t even know what we can do.

I’ll bring this up on fhr-dev and see if we’ll just leave this alone, waiting for people to shut off their Windows XP machines… or if we can come up with something we can (and should) do.

:chutten

Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

I’m mentoring a Summer of Code project this summer about redesigning the “about:telemetry” interface that ships with each and every version of Firefox.

The minute the first student (:flyingrub) asked me “What is a parent payload and child payload?” I knew I was going to be asked a lot of questions.

To least-effectively answer these questions, I’ll blog the answers as narratives. And to start with this question, here’s how the history of a project makes it difficult to collect data from it.

In the Beginning — or, rather, in the middle of October 2015 when I was hired at Mozilla (so, at my Beginning) — there was single-process Firefox, and all was good. Users had many tabs, but one process. Users had many bookmarks, but one process. Users had many windows, but one process. All this and the web contents themselves were all sharing time within a single construct of electrons and bits and code and pixels: vying with each other for control of the filesystem, the addressable space of RAM, the network resources, and CPU scheduling.

Not satisfied with things being just “good”, we took a page from the book penned by Google Chrome and decided the time was ripe to split the browser into many processes so that a critical failure in one would not trouble the others. To begin with, because our code is venerable, we decided that we would try two processes. One of these twins would be in charge of the browser and the other would be in charge of the web contents.

This project was called Electrolysis after the mechanism by which one might split water into Hydrogen and Oxygen using electricity.

Suddenly the browser became responsive, even in the face of the worst JavaScript written by the least experienced dev at the most privileged startup in Silicon Valley. And out-of-memory errors decreased in frequency because the browser’s memory and the web contents’ memory were able to grow without interfering with each other.

Remember, our code is venerable. Remember, our code hearkens from its single-process past.

Our data-collection code was written in that single-process past. But we had two processes with input events that need to be timed to find problems. We had two processes with memory allocations that need to be examined for regressions.

So the data collection code was made aware that there could be two types of process: parent and child.

Alas, not just one child. There could be many child processes in a row if some webpage were naughty and brought down the child in its anger. So the data collection code was made aware there could be many batches of data from child processes, and one batch of data from parent processes.

The parent data was left looking like single-process data, out in the root of the data collection payload. Child processes’ data were placed in an array of childPayloads where each payload echoed the structure of the parent.

Then, not content with “good”, I had to come along in bug 1218576, a bug whose number I still have locked in my memory, for good or ill.

Firefox needs to have multiple child processes of different types, simultaneously. And many of some of those several types, also simultaneously. What was going to be a quick way to ensure that childPayloads was always of length 1 turned into a months-long exercise to put data exactly where we wanted it to be.

And so now we have childPayloads where the “weird” content child data that resists aggregation remains, and we also have payload.processes.<process type>.* where the cool and hip data lives: histograms, scalars, and keyed variants of both.

Already this approach is showing dividends as some proportions of Nightly users are getting a gpu process, and others are getting some number of content processes. The data files neatly into place with minimal intervention required.

But it means about:telemetry needs to know whether you want the parent’s “weird” data or the child’s. And which child was that, again?

And about:telemetry also needs to know whether you want the parent’s “cool” data, or the content child’s, or the gpu child’s.

So this means that within about:telemetry there are now five places where you can select what process you want. One for “weird” data, and one for each of the four kinds of “cool” data.

Sadly, that brings my storytelling to a close, having reached the present day. Hopefully after this Summer’s Code is done, this will have a happier, more-maintainable, and responsively-designed ending.

But until now, remember that “accident of history” is the answer to most questions. As such it behooves you to learn history well.

:chutten

The Most Satisfying Graph

There were a lot of Firefox users on Beta 44.

Usually this is a good thing. We like having a lot of users[citation needed].

It wasn’t a good thing this time, as Beta had already moved on to 45. Then 46. Eventually we were at Beta 52, and the number of users on Beta 44 was increasing.

We thought maybe it was because Beta 44 had the same watershed as Release 43. Watershed? Every user running a build before a watershed must update to the watershed first before updating to the latest build. If you have Beta 41 and the latest is Beta 52, you must first update to Beta 44 (watershed) so we can better ascertain your cryptography support before continuing on to 48, which is another watershed, this time to do with fourteen-year-old processor extensions. Then, and only then, can you proceed to the currently-most-recent version, Beta 52.

(If you install afresh, the installer has the smarts to figure out your computer’s cryptographic and CPU characteristics and suitability so that new users jump straight to the front of the line)

Beta 44 being a watershed should, indeed, require a longer-than-usual lifetime of the version, with respect to population. If this were the only effect at play we’d expect the population to quickly decrease as users updated.

But they didn’t update.

It turns out that whenever the Beta 44 users attempted to download an update to that next watershed release, Beta 48, they were getting a 404 Not Found. At some point, the watershed Beta 48 build on download.mozilla.org was removed, possibly due to age (we can’t keep everything forever). So whenever the users on Beta 44 wanted to update, they couldn’t. To compound things, any time a user before Beta 44 wanted to update, they had to go through Beta 44. Where they were caught.

This was fixed on… well, I’ll let you figure out which day it was fixed on:

beta44_dau

This is now the most satisfying graph I’ve ever plotted at Mozilla.

:chutten

Data Science is Hard: Client Delays for Crash Pings

Second verse, much like the first: how quickly do we get data from clients?

This time: crash pings.

Recording Delay

The recording delay of crash pings is different from main pings in that the only time information we have about when the information happens is crashDate, which only tells you the day the crash happened, not the time. This results in a weird stair-step pattern on the plot as I make a big assumption:

Assumption: If the crash ping was created on the same day that the crash happened, it took essentially 0 time to do so. (If I didn’t make this assumption, the plot would have every line at 0 for the first 24 hours and we’d not have as much information displayed before the 96-hour max)

output_19_1

The recording delay for crash pings is the time between the crash happening and the user restarting their browser. As expected, most users appear to restart their browser immediately. Even the slowest channel (release) has over 80% of its crash pings recorded within two days.

Submission Delay

The submission delay for crash pings, as with all pings, is the time between the creation of the ping and the sending of the ping. What makes the crash ping special is that it isn’t even created until the browser has restarted, so I expected these to be quite short:

output_22_1

They do not disappoint. Every branch but Nightly has 9 out of every 10 crash pings sent within minutes of it being created.

Nightly is a weird one. It starts off having the worst proportion of created pings unsent, but then becomes the best.

Really, all four of these lines should be within an error margin of just being a flat line at the top of the graph, since the code that creates the ping is pretty much the same code that sends it. How in the world are these many crash pings remaining unsent at first, but being sent eventually?

Terribly mysterious.

Combined Delay

output_26_1

The combined client delay for crash pings shows that we ought to have over 80% of all crash pings from all channels within a day or two of the crash happening. The coarseness of the crashDate measure makes it hard to say exactly how many and when, but the curve is clearly a much faster one than for the main ping delays previously examined.

Crash Rates

For crash rates that use crashes counted from crash pings and some normalization factor (like usage hours) counted from main pings, it doesn’t actually matter how fast or slow these pings come in. If only 50% of crashes and 50% of usage hours came in within a day, the crash rate would still be correct.

What does matter is when the pings arrive at different speeds:

combined_crashmain_delay

(Please forgive my awful image editing work)

Anywhere that the two same-coloured lines fail to overlap is a time when the server-recorded count of crashes from crash pings will not be from the same proportion of the population as the sever-recorded count of usage hours from main pings.

For example: On release (dark blue), if we look at the crash rate at 22 or 30-36 hours out from a given period, the crash rate is likely to approximate what a final tally will give us. But if we check early (before 22h, or between 22 and 30h), when the main pings are lagging, the crash rate will seem higher than reality. If we check later (after 36h), the crash rate will seem lower.

This is where the tyranny of having a day-resolution crashDate really comes into its own. If we could model exactly when a channels’ crash and main submission proportions are equal, we could use that to generate accurate approximations of the final crash rate. Right now, the rather-exact figures I’m giving in the previous paragraph may have no bearing on reality.

Conclusion

If we are to use crash pings and main pings together to measure “something”, we need to fully understand and measure the differences in their client-side delays. If the curves above are stable, we might be able to model their differences with some degree of accuracy. This would require a higher-resolution crash timestamp.

If we wish to use this measured “something” earlier than 24h from the event (like, say, to measure how crashy a new release is), we need to either chose a method that doesn’t rely on main pings, or speed up main ping reporting so that it has a curve closer to that of crash pings.

To do my part I will see if having a better crash timestamp (hours would do, minutes would be the most I would need) is something we might be willing to pursue, and I will lobby for the rapid completion and adoption of pingSender as a method for turning main pings’ submission delay CDF into a carbon copy of crash pings’.

Please peruse the full analysis on reports.telemetry.mozilla.org if you are interested in the details of how these graphs were generated.

:chutten