Latency Improvements, or, Yet Another Satisfying Graph

This is the third in my ongoing series of posts containing satisfying graphs.

Today’s feature: a plot of the mean and 95th percentile submission delays of “main” pings received by Firefox Telemetry from users running Firefox Beta.

[Plot: Beta “main” ping submission delay in hours (mean, 95th percentile)]

We went from receiving 95% of pings after about, say, 130 hours (or 5.5 days) down to getting them within about 55 hours (2 days and change). And the numbers will continue to fall as more beta users get the modern beta builds with lower latency ping sending thanks to pingsender.
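For the curious, “submission delay” is simply the time between a ping’s creation on the client and its receipt by our servers. Here’s a minimal sketch of computing the two plotted statistics over a toy list of timestamp pairs (the real job also applies the clock-skew adjustment described in the fine print below):

```python
from datetime import datetime
import statistics

# Hypothetical (creation_time, received_time) pairs, both in UTC.
pings = [
    (datetime(2017, 7, 10, 12, 0), datetime(2017, 7, 10, 14, 30)),
    (datetime(2017, 7, 10, 12, 0), datetime(2017, 7, 12, 9, 0)),
    (datetime(2017, 7, 10, 12, 0), datetime(2017, 7, 10, 12, 5)),
]

delays_hours = sorted(
    (received - created).total_seconds() / 3600.0
    for created, received in pings
)

mean_delay = statistics.mean(delays_hours)
# Nearest-rank 95th percentile: the delay below which 95% of pings arrived.
p95_index = min(len(delays_hours) - 1, int(0.95 * len(delays_hours)))
p95_delay = delays_hours[p95_index]

print(f"mean: {mean_delay:.1f}h, 95th percentile: {p95_delay:.1f}h")
```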

What does this mean? This means that you should no longer have to wait a week to get a decently-rigorous count of data that comes in via “main” pings (which is most of our data). Instead, you only have to wait a couple of days.

Some teams were using the rule-of-thumb of ten (10) days before counting anything that came in from “main” pings. We should be able to reduce that significantly.

How significantly? Time, and data, will tell. This quarter I’m looking into what guarantees we might be able to extend about our data quality, which includes timeliness… so stay tuned.

For a more rigorous take on this, partake in any of :dexter’s recent reports on RTMO. He’s been tracking the latency improvements and possible increases in duplicate ping rates as these changes have ridden the trains towards release. He’s blogged about it if you want all the rigour but none of the Python.

:chutten

FINE PRINT: Yes, due to how these graphs work they will always look better towards the end because the really delayed stuff hasn’t reached us yet. However, even by the standards of the pre-pingsender mean and 95th percentiles we are far enough after the massive improvement for it to be exceedingly unlikely to change much as more data is received. By the post-pingsender standards, it is almost impossible. So there.

FINER PRINT: These figures include adjustments for client clocks having skewed relative to server clocks. Time is a really hard problem even on a single computer, and trying to reconcile it between many computers separated by oceans both literal and metaphorical is the subject of several dissertations and, likely, therapy sessions. As I mentioned above, for rigour and detail about this and other aspects, see RTMO.

Apple Didn’t Kill BlackBerry

It was Oracle.

And I don’t mean “an Oracle” in the allegorical way Shakespeare had it, where it was Macbeth’s prophecy-fueled hubris what incited the incidents (though it is pretty easy to cast anything in the mobile space as a reimagining of the Scottish Play). I mean the company Oracle was the primary agent of the downfall of the company then known as Research in Motion.

And they probably didn’t even mean to do it.

To be clear: this is my theory, these are my opinions, and all of it’s based on what I can remember from nearly a decade ago.

At the end of June 2007, Apple released the iPhone in the US. It was an odd little device. It didn’t have apps or GPS or 3G (wouldn’t have any of those until July 2008), it was only available on the AT&T network (a one-year exclusivity agreement), and it didn’t have copy-paste (that took until June 2009).

Worst of all, it didn’t even run Java.

Java was incredibly important in the 2000s. It was the only language both powerful enough on the day’s mobile hardware to be useful and sandboxed enough from that hardware to be safe to run.

And the iPhone didn’t have it! In fact, in the SDK agreement released in 2008, Apple excluded Java (and browser engines like Firefox’s Gecko) by disallowing the running of interpreted code.

It is understandable, then, that the executives in Research in Motion weren’t too worried. The press immediately called the iPhone a BlackBerry Killer… but they’d done that for the Motorola Q9H, the Nokia E61i, and the Samsung BlackJack. (You don’t have to feel bad if you’ve never heard of them. I only know they exist because I worked for BlackBerry starting in June 2008.)

I remember a poorly-chroma-keyed presentation featuring then-CTO David Yach commanding a starship that destroyed each of these devices in turn with our phasers of device portfolio depth, photon torpedoes of enterprise connectivity, and warp factor BlackBerry OS 4.6. Clearly we could deal with this Apple upstart the way we dealt with the others: by continuing to be excellent at what we did.

Still, a new competitor is a new competitor. Measures had to be taken.

Especially when, in November of 2007, it was pretty clear that Google had stepped into the ring with the announcement of Android.

Android was the scarier competitor. Google was a well-known software giant with an audacious plan to marry its software expertise (and incredible buying, hiring, and lawyering power) with chipsets, handsets, and carrier reach from dozens of companies including Qualcomm, Motorola, and T-Mobile.

The Android announcements exploded across the boardrooms of RIM’s Waterloo campus.

But with competition comes opportunity.

You see, Android ran Java. Well, code written in Java could run on Android. And this meant Google had the hearts and minds of every mobile developer in the then-nascent app ecosystem. All they had to do was not call it Java: that way they could offer a far more recent version of Java’s own APIs than BlackBerry was allowed, running atop a high-performance non-Java virtual machine called Dalvik.

BlackBerry couldn’t match this due to the terms of its license agreement, while Google didn’t even need to pay Sun Microsystems (Java’s owner) a per-device license fee.

Quickly, a plan was hatched: Project Highlander (no, I’m not joking). It was going to be the one platform for all BlackBerry devices, one that would allow us to wield the sword of the Katana filesystem (still not joking) and defeat our enemies. Yes, even the execs were dorks at RIM in early 2009.

Specifically, RIM was going to adopt a new operating system for our mobile devices that would run Dalvik, allowing us not only to finally evolve past the evolutionary barriers Sun had refused to lift from in front of BlackBerry Java… but also to eat Google’s lunch at the same time. No matter how much money Google poured into app development for Android, we would reap the benefit through Highlander’s Android compatibility.

By essentially joining Google in the platform war against the increasingly-worrisome growth of Apple, we would be able to punch above our weight in the US. And by not running Android, we could keep our security clearance and be sold places Google couldn’t reach.

It was to be perfect: the radio core running RIM’s low-power, high-performance Nessus OS talking over secure hardware to the real-time QNX OS atop which would be running an Android-compatible Dalvik VM managing the applications RIM’s developers had written in the language they had spent years mastering: Java. With the separation of the radio and application cores we were even planning how to cut a deal with mobile carriers to only certify the radio core so we’d be free to update the user-facing parts of the OS without having to go through their lengthy, costly, irritating process.

A pity it never happened.

RIM’s end properly began on April 20, 2009, when Oracle announced it was in agreement to purchase Sun Microsystems, maker of Java.

Oracle, it was joked, was a tech company where the size of its Legal department outstripped that of the rest of its business units… combined.

Even I, a lowly grunt working on the BlackBerry Browser, knew what was going to happen next.

After Oracle completed its acquisition of Sun it took only seven months for them to file suit against Google over Android’s use of Java.

These two events held monumental importance for Research in Motion:

Oracle had bought Sun, which meant there was now effectively zero chance of a new version of mobile Java that would allow BlackBerry Java to innovate within the terms of RIM’s license to Sun.

Oracle had sued Google, which meant RIM would be squashed like a bug under the litigious might of Sun’s new master if it tried to pave its own not-Android way to its own, modern Java-alike.

All of RIM’s application engineers had lived and breathed Java for years. And now that expertise was to be sequestered, squandered, and then removed.

While Java-based BlackBerry 6 and 7 devices continued to keep the lights on under steadily decreasing order volumes, the BlackBerry PlayBook was announced, delayed, released, and scrapped. The PlayBook was such a good example of a cautionary tale that BlackBerry 10 required an extra year of development to rewrite most of the things it got wrong.

Under that extra year of pressure-cooker development, BlackBerry 10 bristled with ideas. This was a problem. Instead of evolving with patient direction, adding innovation step by step, guiding users over the years from 2009 to BlackBerry 10’s release in 2013, all of the pent-up ideas about user interaction, user experience paradigms, and content-first design landed in users’ laps all at once.

This led to confusion, which led to frustration, which led to devices being returned.

BlackBerry 10 couldn’t sell, and with users’ last good graces spent, the company, suddenly renamed BlackBerry, just couldn’t find something it could release that consumers would want to buy.

Mass layoffs, begun during the extra year of BlackBerry 10 development with the removal of entire teams of Java developers, continued as the company tried to resize itself to match its market. Handset prices increased to sweeten fallen margins. Developers shuffled over to the Enterprise business unit, where BlackBerry was still paying bonuses and achieving sales targets.

The millions of handsets sold and billions of dollars in revenue were gone. And yet, despite finding itself beneath the footfalls of fighting giants, BlackBerry was not dead — is still not dead.

Its future may not lie with smartphones, but when I left BlackBerry in late 2015, having myself survived many layoffs and reorganizations, I left with the opinion that it does indeed have a future.

Maybe it’ll focus on its enterprise deployments and niche device releases.

Maybe it’ll find a product millions of consumers will need.

Maybe it’ll be bought by Oracle.

:chutten

Data Science is Hard: Anomalies Part 3

So what do you do when you have a duplicate data problem and it just keeps getting worse?

You detect and discard.

Specifically, since we already have a few billion copies of pings with identical document ids (which are extremely unlikely to collide by chance), there is no benefit to continuing to store them. So what we do is write a short report about what the incoming duplicate looked like (so that we can continue to analyze trends in duplicate submissions), then toss out the data without even parsing it.
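Here’s a minimal sketch of that detect-and-discard idea. The names and the in-memory set are illustrative only; the pipeline’s real deduplication machinery is persistent and distributed, and this just shows the shape of the logic:

```python
# Toy detect-and-discard deduplication keyed on document id.
seen_ids = set()  # the real store is persistent, not an in-memory set

def record_duplicate_report(doc_id: str, size: int) -> None:
    # Stand-in for writing the short report about the duplicate.
    print(f"duplicate: {doc_id} ({size} bytes)")

def process_ping(raw_body: bytes) -> None:
    # Stand-in for the real parse-and-store path.
    print(f"processing {len(raw_body)} bytes")

def handle_incoming(doc_id: str, raw_body: bytes) -> None:
    if doc_id in seen_ids:
        # Note what the duplicate looked like for trend analysis...
        record_duplicate_report(doc_id, len(raw_body))
        # ...then toss it without even parsing it.
        return
    seen_ids.add(doc_id)
    process_ping(raw_body)

handle_incoming("abc-123", b'{"payload": 1}')
handle_incoming("abc-123", b'{"payload": 1}')  # detected, reported, discarded
```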

As before, I’ll leave finding out the time the change went live as an exercise for the reader:

[Plot: incoming duplicate ping volume over time]

:chutten

Data Science is Hard: Anomalies Part 2

Apparently this is one of those problems that jumps two orders of magnitude if you ignore it:

[Plot: Aurora 51 ping submissions over time]

Since last time we’ve noticed that the vast majority of these incoming pings are duplicates. I don’t mean that they look similar, I mean that they are absolutely identical, down to their supposedly-unique document ids.

How could this happen?

Well, with a minimum of speculation we can assume that however these Firefox instances are being distributed, they are being distributed with full copies of the original profile data directory. This would contain not only the user’s configuration information, but also copies of all as-yet-unsent pings. Once the distributed Firefox instance was started in its new home, it would submit these pending pings, which would explain why they are all duplicated: the distributor copy-pasta’d them.

So if we want to learn anything about the population of machines that are actually running these instances, we need to ignore all of these duplicate pings. So I took my analysis from last time and tweaked it.
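The tweak is essentially deduplication on document id before counting anything. A minimal sketch in pandas, using a toy DataFrame (the column names are illustrative; the real analysis ran against the Telemetry ping data, not a local table):

```python
import pandas as pd

# Toy stand-in for the incoming ping records.
pings = pd.DataFrame({
    "document_id": ["a", "a", "a", "b", "c", "c"],
    "submission_date": ["20170601"] * 6,
})

# Keep only the first ping seen for each document id.
unique_pings = pings.drop_duplicates(subset="document_id")

print(f"all pings: {len(pings)}, unique pings: {len(unique_pings)}")
# Everything below this point analyzes unique_pings, not pings.
```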

First off, to demonstrate just how much of the traffic spike we see is the same fifteen duplicate pings, here is a graph of ping volume vs unique ping volume:

[Plot: total ping volume vs. unique ping volume]

The count of non-duplicated pings is minuscule. We can conclude from this that most of these distributed Firefox instances rarely get the opportunity to send more than one ping. (Because if they did, we’d see many more unique pings created on their new hosts.)

What can we say about these unique pings?

Besides how infrequent they are? They come from instances that all have the same Random Agent Spoofer addon that we saw in the original analysis. None of them are set as the user’s default browser. The hosts are most likely to have a 2.4GHz or 3.5GHz CPU. The hosts come from a geographically diverse spread of locations, with a peculiarly popular cluster in Montreal (maybe they like the bagels. I know I do).

All of the pings come from computers running Windows XP. I wish I were more surprised by this, but it really does turn out that running software over a decade past its best-before date is a bad idea.

Also of note: the length of time the browser is open for is far too short (60-75s mostly) for a human to get anything done with it:

[Plot: session length distribution of the unique pings]

(Telemetry needs 60s after Firefox starts up in order to send a ping, so it’s possible that there are browsing sessions shorter than a minute that we aren’t seeing.)

What can/should be done about these pings?

These pings are coming in at a rate far exceeding what the entire Aurora 51 population had when it was an active release. Yet, Aurora 51 hasn’t been an active release for six months and Aurora itself is going away.

As such, though its volume seems to continue to increase, this anomaly is less and less likely to cause us real problems day-to-day. These pings are unlikely to accidentally corrupt a meaningful analysis or mis-scale a plot.

And with our duplicate detector identifying these pings as they come in, it isn’t clear that this actually poses an analysis risk at all.

So, should we do anything about this?

Well, it is approaching release-channel-levels of volume per-day, submitted evenly at all hours instead of in the hump-backed periodic wave that our population usually generates:

[Plot: duplicate “main” ping submissions, Aurora 51]

Hundreds of duplicates detected every minute means nearly a million pings a day. We can handle it (in the above plot I turned off release, whose low points coincide with aurora’s high points), but should we?

Maybe for Mozilla’s server budget’s sake we should shut down this data after all. There’s no point in receiving yet another billion copies of the exact same document. The only things that differ are the submission timestamp and submitting IP address.

Another point: it is unlikely that these hosts are participating in this distribution of their free will. The rate of growth, the length of sessions, the geographic spread, and the time of day the duplicates arrive at our servers strongly suggest that it isn’t humans who are operating these Firefox installs. Maybe for the health of these hosts on the Internet we should consider some way to hotpatch these wayward instances into quiescence.

I don’t know what we (mozilla) should do. Heck, I don’t even know what we can do.

I’ll bring this up on fhr-dev and see if we’ll just leave this alone, waiting for people to shut off their Windows XP machines… or if we can come up with something we can (and should) do.

:chutten

Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

I’m mentoring a Summer of Code project this summer about redesigning the “about:telemetry” interface that ships with each and every version of Firefox.

The minute the first student (:flyingrub) asked me “What is a parent payload and child payload?” I knew I was going to be asked a lot of questions.

To least-effectively answer these questions, I’ll blog the answers as narratives. And to start with this question, here’s how the history of a project makes it difficult to collect data from it.

In the Beginning — or, rather, in the middle of October 2015 when I was hired at Mozilla (so, at my Beginning) — there was single-process Firefox, and all was good. Users had many tabs, but one process. Users had many bookmarks, but one process. Users had many windows, but one process. All this and the web contents themselves were all sharing time within a single construct of electrons and bits and code and pixels: vying with each other for control of the filesystem, the addressable space of RAM, the network resources, and CPU scheduling.

Not satisfied with things being just “good”, we took a page from the book penned by Google Chrome and decided the time was ripe to split the browser into many processes so that a critical failure in one would not trouble the others. To begin with, because our code is venerable, we decided that we would try two processes. One of these twins would be in charge of the browser and the other would be in charge of the web contents.

This project was called Electrolysis after the mechanism by which one might split water into Hydrogen and Oxygen using electricity.

Suddenly the browser became responsive, even in the face of the worst JavaScript written by the least experienced dev at the most privileged startup in Silicon Valley. And out-of-memory errors decreased in frequency because the browser’s memory and the web contents’ memory were able to grow without interfering with each other.

Remember, our code is venerable. Remember, our code hearkens from its single-process past.

Our data-collection code was written in that single-process past. But we had two processes with input events that needed to be timed to find problems. We had two processes with memory allocations that needed to be examined for regressions.

So the data collection code was made aware that there could be two types of process: parent and child.

Alas, not just one child. There could be many child processes in a row if some webpage were naughty and brought down the child in its anger. So the data collection code was made aware there could be many batches of data from child processes, and one batch of data from parent processes.

The parent data was left looking like single-process data, out in the root of the data collection payload. Child processes’ data were placed in an array of childPayloads where each payload echoed the structure of the parent.

Then, not content with “good”, I had to come along in bug 1218576, a bug whose number I still have locked in my memory, for good or ill.

Firefox needs to have multiple child processes of different types, simultaneously. And many of some of those several types, also simultaneously. What was going to be a quick way to ensure that childPayloads was always of length 1 turned into a months-long exercise to put data exactly where we wanted it to be.

And so now we have childPayloads where the “weird” content child data that resists aggregation remains, and we also have payload.processes.<process type>.* where the cool and hip data lives: histograms, scalars, and keyed variants of both.
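Schematically, a modern “main” ping payload carries both shapes at once. A sketch of the layout (the measurement contents are invented for illustration; only the structure matters here):

```python
# The shape, not the contents, of a modern "main" ping payload.
ping_payload = {
    # Parent data still sits at the root, looking like the old
    # single-process payloads always did.
    "simpleMeasurements": {"firstPaint": 1234},

    # The legacy array for "weird" content-child data that resists
    # aggregation: one entry per child batch, echoing the parent's structure.
    "childPayloads": [
        {"simpleMeasurements": {"firstPaint": 2345}},
    ],

    # The newer home for the "cool" aggregable data, keyed by process type.
    "processes": {
        "content": {
            "histograms": {},
            "scalars": {},
            "keyedHistograms": {},
            "keyedScalars": {},
        },
        "gpu": {
            "histograms": {},
            "scalars": {},
        },
    },
}
```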

Already this approach is showing dividends as some proportions of Nightly users are getting a gpu process, and others are getting some number of content processes. The data files neatly into place with minimal intervention required.

But it means about:telemetry needs to know whether you want the parent’s “weird” data or the child’s. And which child was that, again?

And about:telemetry also needs to know whether you want the parent’s “cool” data, or the content child’s, or the gpu child’s.

So this means that within about:telemetry there are now five places where you can select what process you want. One for “weird” data, and one for each of the four kinds of “cool” data.

Sadly, that brings my storytelling to a close, having reached the present day. Hopefully after this Summer’s Code is done, this will have a happier, more-maintainable, and responsively-designed ending.

But until then, remember that “accident of history” is the answer to most questions. As such it behooves you to learn history well.

:chutten

Firefox on Windows XP: End of the Line

With the release of Firefox 52 to all users worldwide, we now have the final Windows XP-supported Firefox release out the door.

This isn’t to say that support is done. As I’ve mentioned before, Windows XP users will be transitioned to the ESR update channel where they’ll continue to receive security updates for the next year or so.

And I don’t expect this to be the end of me having to blog about weird clients that are inexplicably on Windows XP.

However, this does take care of one of the longest-standing data questions I’ve looked at on this blog and in my career at Mozilla. So I feel that it’s worth taking a moment to mark the occasion.

Windows XP is dead. Long live Windows XP.

:chutten

Data Science is Hard: Anomalies and What to Do About Them

:mconley‘s been looking at tab spinners to try and mitigate their impact on user experience. That’s when he noticed something weird that happened last October on Firefox Developer Edition:

[Plot: tab spinner ping submissions, by build id]

It’s a spike a full five orders of magnitude larger than submission volumes for a single build have ever been.

At first I thought it was users getting stuck on an old version. But then :frank noticed that the “by submission date” view of that same graph didn’t tally with that hypothesis:

[Plot: tab spinner ping submissions, by submission date]

Submissions from Aurora (what the Firefox Developer Edition branch is called internally) 51 tailed off when Aurora 52 was released in exactly the way we’ve come to expect. Aurora 52 had a jump mid-December when we switched to using “main” pings instead of “saved-session” pings to run our aggregates, but otherwise everything’s fine heading into the end of the year.

But then there’s Aurora 51 rising from the dead in late December. Some sort of weird re-adoption update problem? But where are all those users coming from? Or are they actually users? These graphs only plot ping volumes.

( Quick refresher: we get anonymous usage data from Firefox users via “pings”: packets of data that are sent at irregular intervals. A single user can send many pings per day, though more than 25 in a day is pretty darn chatty. )

At this point I filed a bug. It appeared as though, somehow, we were getting new users running Aurora 51 build 20161014.

:mhowell popped the build onto a Windows machine and confirmed that it was updating fine for him. Anyone running that build ought not to be running it for long as they’d update within a couple of hours.

At this point we’d squeezed as much information as the aggregated data could give us, so I wandered off closer to the source to get deeper answers.

First I double-checked that what we were seeing in aggregate was what the data actually showed. Our main_summary dataset confirmed what we were seeing was not some weird artefact… but it also showed that there was no increase in client numbers:

[Plot: Aurora 51 ping count vs. client count]

A quick flip of the query and I learned that a single “client” was sending tens of thousands of pings each and every day from a long-dead non-release build of Firefox Developer Edition.

A “client” in this case is identified by “client_id”, a unique identifier that lives in a Firefox profile. Generally we take a single “client” to roughly equal a single “user”, but this isn’t always the case. Sometimes a single user may have multiple profiles (one at work, one at home, for instance). Sometimes multiple users may have the same profile (an enterprise may provision a specific Firefox profile to every terminal).
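The flipped query was essentially a group-by on client_id. A sketch of its shape in PySpark (the dataset path and column names are assumptions for illustration, not the exact ones):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Illustrative path; the real main_summary dataset lives elsewhere.
main_summary = spark.read.parquet("s3://example-bucket/main_summary/")

# How many pings did each client send per day on that dead Aurora build?
(main_summary
    .where(F.col("normalized_channel") == "aurora")
    .where(F.col("app_build_id").startswith("20161014"))
    .groupBy("client_id", "submission_date")
    .count()
    .orderBy(F.desc("count"))
    .show(10))
# The top row: one client_id, tens of thousands of pings per day.
```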

It seemed likely we were in the second case: one profile, many Firefox installs.

But could we be sure? What could we learn about the “client” sending us this unexpectedly-large amount of data?

So I took a look.

First, a sense of scale:

[Plot: daily ping volume from this single client]

This single client began sending a few pings around November 15, 2016. This makes sense, as Aurora 51 was still the latest version at that time. Things didn’t ramp up until December when we started seeing over ten thousand pings per day. After a lull during Christmas it settled into what appeared to be some light growth with a large peak on Feb 17 reaching one hundred thousand pings on just that day.

This is kinda weird. If we assume some rule of thumb of, say, two pings per user per day, then we’re talking fifty thousand users running this ancient version of Aurora. What are they doing with it?

Well, we deliberately don’t record too much information about what our users do with their browsers. We don’t know what URLs are being visited, what credentials they’re using, or whether they prefer one hundred duck-sized horses or one horse-sized duck.

But we do know for how long the browser session lasted (from Firefox start to Firefox shutdown), so let’s take a look at that:

[Plot: session length distribution for this client’s pings]

Woah. Over half of the sessions reported by the pings were exactly 215 seconds long. Three minutes and 35 seconds.

It gets weirder. It turns out that these Aurora 51 builds are all running on the same Operating System (Windows XP, about which I’ve blogged before), all have the same addon installed (Random Agent Spoofer, though about 10% also have Alexa Traffic Rank), none have Aurora 51 set to be their default browser, none have application updates enabled, and they come from 418 different geographical locations according to the IP address of the submitters (top 10 locations include 5 in the US, 2 in France, 2 in Britain, and one in Germany).

This is where I would like to report the flash of insight that had me jumping out of the bath shouting Eureka.

But I don’t have one.

Everyone mentioned here and some others besides have thrown their heads at this mystery and can’t come up with anything suitably satisfying. Is it a Windows XP VM that is distributed to help developers test their websites? Is it an embedded browser in some popular piece of software with broad geographic appeal? Is someone just spoofing us by setting their client ids the same? If so, how did they spoof their session lengths?

To me the three-minute-and-thirty-five-second length of sessions just screams that this is some sort of automated process. I’m worried that Firefox might have been packaged into some sort of botnet-type-thingy that has gone out and infected thousands of hosts and is using our robust network stack to… to do what?

And then there’s the problem of what to do about it.

On one hand, this is data from Firefox. It arrived properly-formed, and no one’s trying to attack us with it, so we have no need to stop it entering our data pipeline for processing.

On the other hand, this data is making the Aurora tab spinner graph look wonky for :mconley, and might cause other mischief down the line.

It leads us to question whether we care about data that’s been sent to us by automated processes… and whether we could identify such data if we didn’t.

For now we’re going to block this particular client_id’s data from entering the aggregate dataset. The aggregate dataset is used by telemetry.mozilla.org to display interesting stuff about Firefox users. Human users. So we’re okay with blocking it.
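Mechanically, that block is just a filter applied where pings enter the aggregate dataset; the raw pipeline is untouched. A sketch (the id below is a placeholder, not the real one):

```python
# Keep one misbehaving client out of the aggregates without dropping
# its data from the pipeline itself.
BLOCKED_CLIENT_IDS = {"00000000-0000-0000-0000-000000000000"}  # placeholder

def include_in_aggregates(ping: dict) -> bool:
    return ping.get("client_id") not in BLOCKED_CLIENT_IDS

pings = [
    {"client_id": "00000000-0000-0000-0000-000000000000", "payload": {}},
    {"client_id": "some-other-client", "payload": {}},
]
aggregate_input = [p for p in pings if include_in_aggregates(p)]
print(len(aggregate_input))  # 1: the blocked client's ping is excluded
```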

But all Firefox users submit data that might be useful to us, so what we’re not going to do is block this problematic client’s data from entering the pipeline. We’ll continue to collect and collate it in the hopes that it can reveal to us some way to improve Firefox or data collection in the future.

And that’s sadly where we’re at with this: an unsolved mystery, some unanswered questions about the value of automated data, and an unsatisfied sense of curiosity.

:chutten