Latency Improvements, or, Yet Another Satisfying Graph

This is the third in my ongoing series of posts containing satisfying graphs.

Today’s feature: a plot of the mean and 95th percentile submission delays of “main” pings received by Firefox Telemetry from users running Firefox Beta.

Screenshot-2017-7-12 Beta _Main_ Ping Submission Delay in hours (mean, 95th %ile)

We went from receiving 95% of pings after about, say, 130 hours (or 5.5 days) down to getting them within about 55 hours (2 days and change). And the numbers will continue to fall as more beta users get the modern beta builds with lower latency ping sending thanks to pingsender.

What does this mean? This means that you should no longer have to wait a week to get a decently-rigorous count of data that comes in via “main” pings (which is most of our data). Instead, you only have to wait a couple of days.

Some teams were using the rule-of-thumb of ten (10) days before counting anything that came in from “main” pings. We should be able to reduce that significantly.

How significantly? Time, and data, will tell. This quarter I’m looking into what guarantees we might be able to extend about our data quality, which includes timeliness… so stay tuned.

For a more rigorous take on this, partake in any of dexter’s recent reports on RTMO. He’s been tracking the latency improvements and possible increases in duplicate ping rates as these changes have ridden the trains towards release. He’s blogged about it if you want all the rigour but none of Python.


FINE PRINT: Yes, due to how these graphs work they will always look better towards the end because the really delayed stuff hasn’t reached us yet. However, even by the standards of the pre-pingsender mean and 95th percentiles we are far enough after the massive improvement for it to be exceedingly unlikely to change much as more data is received. By the post-pingsender standards, it is almost impossible. So there.

FINER PRINT: These figures include adjustments for client clocks having skewed relative to server clocks. Time is a really hard problem when even on a single computer and trying to reconcile it between many computers separated by oceans both literal and metaphorical is the subject of several dissertations and, likely, therapy sessions. As I mentioned above, for rigour and detail about this and other aspects, see RTMO.

Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

I’m mentoring a Summer of Code project this summer about redesigning the “about:telemetry” interface that ships with each and every version of Firefox.

The minute the first student (:flyingrub) asked me “What is a parent payload and child payload?” I knew I was going to be asked a lot of questions.

To least-effectively answer these questions, I’ll blog the answers as narratives. And to start with this question, here’s how the history of a project makes it difficult to collect data from it.

In the Beginning — or, rather, in the middle of October 2015 when I was hired at Mozilla (so, at my Beginning) — there was single-process Firefox, and all was good. Users had many tabs, but one process. Users had many bookmarks, but one process. Users had many windows, but one process. All this and the web contents themselves were all sharing time within a single construct of electrons and bits and code and pixels: vying with each other for control of the filesystem, the addressable space of RAM, the network resources, and CPU scheduling.

Not satisfied with things being just “good”, we took a page from the book penned by Google Chrome and decided the time was ripe to split the browser into many processes so that a critical failure in one would not trouble the others. To begin with, because our code is venerable, we decided that we would try two processes. One of these twins would be in charge of the browser and the other would be in charge of the web contents.

This project was called Electrolysis after the mechanism by which one might split water into Hydrogen and Oxygen using electricity.

Suddenly the browser became responsive, even in the face of the worst JavaScript written by the least experienced dev at the most privileged startup in Silicon Valley. And out-of-memory errors decreased in frequency because the browser’s memory and the web contents’ memory were able to grow without interfering with each other.

Remember, our code is venerable. Remember, our code hearkens from its single-process past.

Our data-collection code was written in that single-process past. But we had two processes with input events that need to be timed to find problems. We had two processes with memory allocations that need to be examined for regressions.

So the data collection code was made aware that there could be two types of process: parent and child.

Alas, not just one child. There could be many child processes in a row if some webpage were naughty and brought down the child in its anger. So the data collection code was made aware there could be many batches of data from child processes, and one batch of data from parent processes.

The parent data was left looking like single-process data, out in the root of the data collection payload. Child processes’ data were placed in an array of childPayloads where each payload echoed the structure of the parent.

Then, not content with “good”, I had to come along in bug 1218576, a bug whose number I still have locked in my memory, for good or ill.

Firefox needs to have multiple child processes of different types, simultaneously. And many of some of those several types, also simultaneously. What was going to be a quick way to ensure that childPayloads was always of length 1 turned into a months-long exercise to put data exactly where we wanted it to be.

And so now we have childPayloads where the “weird” content child data that resists aggregation remains, and we also have payload.processes.<process type>.* where the cool and hip data lives: histograms, scalars, and keyed variants of both.

Already this approach is showing dividends as some proportions of Nightly users are getting a gpu process, and others are getting some number of content processes. The data files neatly into place with minimal intervention required.

But it means about:telemetry needs to know whether you want the parent’s “weird” data or the child’s. And which child was that, again?

And about:telemetry also needs to know whether you want the parent’s “cool” data, or the content child’s, or the gpu child’s.

So this means that within about:telemetry there are now five places where you can select what process you want. One for “weird” data, and one for each of the four kinds of “cool” data.

Sadly, that brings my storytelling to a close, having reached the present day. Hopefully after this Summer’s Code is done, this will have a happier, more-maintainable, and responsively-designed ending.

But until now, remember that “accident of history” is the answer to most questions. As such it behooves you to learn history well.


Firefox on Windows XP: End of the Line

With the release of Firefox 52 to all users worldwide, we now have the final Windows XP-supported Firefox release out the door.

This isn’t to say that support is done. As I’ve mentioned before, Windows XP users will be transitioned to the ESR update channel where they’ll continue to receive security updates for the next year or so.

And I don’t expect this to be the end of me having to blog about weird clients that are inexplicably on Windows XP.

However, this does take care of one of the longest-standing data questions I’ve looked at on this blog and in my career at Mozilla. So I feel that it’s worth taking a moment to mark the occasion.

Windows XP is dead. Long live Windows XP.


Data Science is Hard: Anomalies and What to Do About Them

:mconley‘s been looking at tab spinners to try and mitigate their impact on user experience. That’s when he noticed something weird that happened last October on Firefox Developer Edition:


It’s a spike a full five orders of magnitude larger than submission volumes for a single build have ever been.

At first I thought it was users getting stuck on an old version. But then :frank noticed that the “by submission date” of that same graph didn’t tally with that hypothesis:


Submissions from Aurora (what the Firefox Developer Edition branch is called internally) 51 tailed of when Aurora 52 was released in exactly the way we’ve come to expect. Aurora 52 had a jump mid-December when we switched to using “main” pings instead of “saved-session” pings to run our aggregates, but otherwise everything’s fine heading into the end of the year.

But then there’s Aurora 51 rising from the dead in late December. Some sort of weird re-adoption update problem? But where are all those users coming from? Or are they actually users? These graphs only plot ping volumes.

( Quick refresher: we get anonymous usage data from Firefox users via “pings”: packets of data that are sent at irregular intervals. A single user can send many pings per day, though more than 25 in a day is a pretty darn chatty. )

At this point I filed a bug. It appeared as though, somehow, we were getting new users running Aurora 51 build 20161014.

:mhowell popped the build onto a Windows machine and confirmed that it was updating fine for him. Anyone running that build ought not to be running it for long as they’d update within a couple of hours.

At this point we’d squeezed as much information as the aggregated data could give us, so I wandered off closer to the source to get deeper answers.

First I double-checked that what we were seeing in aggregate was what the data actually showed. Our main_summary dataset confirmed what we were seeing was not some weird artefact… but it also showed that there was no increase in client numbers:


A quick flip of the query and I learned that a single “client” was sending tens of thousands of pings each and every day from a long-dead non-release build of Firefox Developer Edition.

A “client” in this case is identified by “client_id”, a unique identifier that lives in a Firefox profile. Generally we take a single “client” to roughly equal a single “user”, but this isn’t always the case. Sometimes a single user may have multiple profiles (one at work, one at home, for instance). Sometimes multiple users may have the same profile (an enterprise may provision a specific Firefox profile to every terminal).

It seemed likely we were in the second case: one profile, many Firefox installs.

But could we be sure? What could we learn about the “client” sending us this unexpectedly-large amount of data?

So I took a look.

First, a sense of scaleoutput_11_0

This single client began sending a few pings around November 15, 2016. This makes sense, as Aurora 51 was still the latest version at that time. Things didn’t ramp up until December when we started seeing over ten thousand pings per day. After a lull during Christmas it settled into what appeared to be some light growth with a large peak on Feb 17 reaching one hundred thousand pings on just that day.

This is kinda weird. If we assume some rule-of-thumb of say, two pings per user per day, then we’re talking fifty thousand users running this ancient version of Aurora. What are they doing with it?

Well, we deliberately don’t record too much information about what our users do with their browsers. We don’t know what URLs are being visited, what credentials they’re using, or whether they prefer one hundred duck-sized horses or one horse-sized duck.

But we do know for how long the browser session lasted (from Firefox start to Firefox shutdown), so let’s take a look at that:output_23_0

Woah. Over half of the sessions reported by the pings were exactly 215 seconds long. Two minutes and 35 seconds.

It gets weirder. It turns out that these Aurora 51 builds are all running on the same Operating System (Windows XP, about which I’ve blogged before), all have the same addon installed (Random Agent Spoofer, though about 10% also have Alexa Traffic Rank), none have Aurora 51 set to be their default browser, none have application updates enabled, and they come from 418 different geographical locations according to the IP address of the submitters (top 10 locations include 5 in the US, 2 in France, 2 in Britain, and one in Germany).

This is where I would like to report the flash of insight that had me jumping out of the bath shouting Eureka.

But I don’t have one.

Everyone mentioned here and some others besides have thrown their heads at this mystery and can’t come up with anything suitably satisfying. Is it a Windows XP VM that is distributed to help developers test their websites? Is it an embedded browser in some popular piece of software with broad geographic appeal? Is someone just spoofing us by setting their client ids the same? If so, how did they spoof their session lengths?

To me the two-minute-and-thirty-five-second length of sessions just screams that this is some sort of automated process. I’m worried that Firefox might have been packaged into some sort of botnet-type-thingy that has gone out and infected thousands of hosts and is using our robust network stack to… to do what?

And then there’s the problem of what to do about it.

On one hand, this is data from Firefox. It arrived properly-formed, and no one’s trying to attack us with it, so we have no need to stop it entering our data pipeline for processing.

On the other hand, this data is making the Aurora tab spinner graph look wonky for :mconley, and might cause other mischief down the line.

It leads us to question whether we care about data that’s been sent to use by automated processes… and whether we could identify such data if we didn’t.

For now we’re going to block this particular client_id’s data from entering the aggregate dataset. The aggregate dataset is used by to display interesting stuff about Firefox users. Human users. So we’re okay with blocking it.

But all Firefox users submit data that might be useful to us, so what we’re not going to do is block this problematic client’s data from entering the pipeline. We’ll continue to collect and collate it in the hopes that it can reveal to us some way to improve Firefox or data collection in the future.

And that’s sadly where we’re at with this: an unsolved mystery, some unanswered questions about the value of automated data, and an unsatisfied sense of curiosity.


The Most Satisfying Graph

There were a lot of Firefox users on Beta 44.

Usually this is a good thing. We like having a lot of users[citation needed].

It wasn’t a good thing this time, as Beta had already moved on to 45. Then 46. Eventually we were at Beta 52, and the number of users on Beta 44 was increasing.

We thought maybe it was because Beta 44 had the same watershed as Release 43. Watershed? Every user running a build before a watershed must update to the watershed first before updating to the latest build. If you have Beta 41 and the latest is Beta 52, you must first update to Beta 44 (watershed) so we can better ascertain your cryptography support before continuing on to 48, which is another watershed, this time to do with fourteen-year-old processor extensions. Then, and only then, can you proceed to the currently-most-recent version, Beta 52.

(If you install afresh, the installer has the smarts to figure out your computer’s cryptographic and CPU characteristics and suitability so that new users jump straight to the front of the line)

Beta 44 being a watershed should, indeed, require a longer-than-usual lifetime of the version, with respect to population. If this were the only effect at play we’d expect the population to quickly decrease as users updated.

But they didn’t update.

It turns out that whenever the Beta 44 users attempted to download an update to that next watershed release, Beta 48, they were getting a 404 Not Found. At some point, the watershed Beta 48 build on was removed, possibly due to age (we can’t keep everything forever). So whenever the users on Beta 44 wanted to update, they couldn’t. To compound things, any time a user before Beta 44 wanted to update, they had to go through Beta 44. Where they were caught.

This was fixed on… well, I’ll let you figure out which day it was fixed on:


This is now the most satisfying graph I’ve ever plotted at Mozilla.


Data Science is Hard: Client Delays

Delays suck, but unmeasured delays suck more. So let’s measure them.

I’ve previous talked about delays as they relate to crash pings. This time we’re looking at the core of Firefox Telemetry data collection: the “main” ping. We’ll be looking at a 10% sample of all “main” pings submitted on Tuesday, January 10th[1].

In my previous post on delays, I defined five types of delay: recording, submission, aggregation, migration, and query scheduling. This post is about delays on the client side of the equation, so we’ll be focusing on the first two: recording, and submission.

Recording Delay

How long does it take from something happening, to having a record of it happening? We count HTTP response codes (as one does), so how much time passes from that first HTTP response to the time when that response’s code is packaged into a ping to be sent to our servers?


This is a Cumulative Distribution Functions or CDF. The ones in this post show you what proportion (0% – 100%) of “main” pings we’re looking at arrived with data that falls within a certain timeframe (0 – 96 hours). So in this case, look at the red, “aurora”-branch line. It crosses the 0.9 y-axis line at about the 8 x-axis line. This means 90% of the pings had a recording delay of 8 hours or less.

Which is fantastic news, especially since every other channel (release and beta vying for fastest speeds) gets more of its pings in even faster. 90% of release pings have a recording delay of at most 4 hours.

And notice that shelf at 24 hours, where every curve basically jumps to 100%? If users leave their browsers open for longer than a day, we cut a fresh ping at midnight. Glad to see evidence that it’s working.

All in all it shows that we can expect recording delays of under 30min for most pings across all channels. This is not a huge source of delay.

Submission Delay

With all that data finally part of a “main” ping, how long before the servers are told? For now, Telemetry has to wait for the user to restart their Firefox before it is able to send its pings. How long can that take?



Now we see aurora is no longer the slowest, and has submission delays very similar to release’s submission delays.  The laggard is now beta… and I really can’t figure out why. If Beta users are leaving their browsers open longer, we’d expect to see them be on the slower side of the “Recording Delay CDF” plot. If Beta users are leaving their browser closed longer, we’d expect them to show up lower on Engagement Ratio plots (which they don’t).

A mystery.

Not a mystery is that nightly has the fastest submission times. It receives updates every day so users have an incentive to restart their browsers often.

Comparing Submission Delay to Recording Delay, you can see how this is where we’re focusing most of our “Get More Data, Faster” attentions. If we wait for 90% of “main” pings to arrive, then we have to wait at least 17 hours for nightly data, 28 hours for release and aurora… and over 96 hours for beta.

And that’s just Submission Delay. What if we measured the full client -> server delay for data?

Combined Client Delay


With things the way they were on 2017-01-10, to get 90% of “main” pings we need to wait a minimum of 22 hours (nightly) and a maximum of… you know what, I don’t even know. I can’t tell where beta might cross the 0.9 line, but it certainly isn’t within 96 hours.

If we limit ourselves to 80% we’re back to a much more palatable 11 hours (nightly) to 27 hours (beta). But that’s still pretty horrendous.

I’m afraid things are actually even worse than I’m making it out to be. We rely on getting counts out of “main” pings. To count something, you need to count every single individual something. This means we need 100% of these pings, or as near as we can get. Even nightly pings take longer than 96 hours to get us more than 95% of the way there.

What do we use “main” pings to count? Amongst other things, “usage hours” or “how long has Firefox been open”. This is imperative to normalizing crash information properly so we can determine the health and stability of a release.

As you can imagine, we’re interested in knowing this as fast as possible. And as things stood a couple of Tuesdays ago, we have a lot of room for improvement.

For now, expect more analyses like this one (and more blog posts like this one) examining how slowly or quickly we can possibly get our data from the users who generate it to the Mozillians who use it to improve Firefox.


[1]: Why did I look at pings from 2017-01-10? It was a recent Tuesday (less weekend effect) well after Gregorian New Year’s Day, well before Chinese New Year’s Day, and even a decent distance from Epiphany. Also the 01-10 is a mirror which I thought was neat.