Perplexing Graphs: The Case of the 0KB Virtual Memory Allocations

Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.

This past Monday there was one. It was a little odd.

Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.

That was weird.

So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on telemetry.mozilla.org to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:

(Plot: the VSIZE distribution, split by operating system)

Hello, Windows.

(Plot: the Windows VSIZE distribution, split by architecture)

Specifically: hello Windows 64-bit.

With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.

Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.

I had no idea how a control flow integrity feature would result in 0-size virtual memory allowances, but when :erahm gives you a hint, you take it. I commented on the bug.

Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor who had the smartest money in the room, and crucial help from :ted who was able to reproduce the bug.

It turns out that turning CFG on made our Virtual Memory allowances jump above two terabytes.

Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)

In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.
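If it helps to see the idea as code, here is a tiny conceptual sketch (the names and structures are made up for illustration; the real bitmap lives inside Windows and is nothing an application would implement itself):

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Toy illustration of "a bit per possible jump location". Windows' real
// structure is far more compact and is maintained by the OS itself.
struct JumpTargetBitmap {
  std::vector<uint8_t> bits;

  bool IsValidTarget(uintptr_t address) const {
    size_t index = address / 8;
    uint8_t mask = static_cast<uint8_t>(1u << (address % 8));
    return index < bits.size() && (bits[index] & mask);
  }
};

void CheckedIndirectJump(const JumpTargetBitmap& map, uintptr_t target) {
  if (!map.IsValidTarget(target)) {
    std::abort();  // Windows interrupts the process; a toy can only abort.
  }
  // ...otherwise, carry on with the jump.
}
```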

When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ storage allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions (8.1 Update 3 and 10)) has a 2TB virtual memory allocation.

So. We have an abnormally-large value for Virtual Memory. How does that become 0?

Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back) will be thinking about the word “overflow”.

And you’d be wrong. Ish.

The raw number :ted was seeing was 2201166503936. That’s the number of bytes in his virtual memory allocation, and it’s a few powers of two above what we can fit in 32 bits. However, we report the number of kilobytes. The number of kilobytes is 2149576664, well underneath the maximum number you can store in an unsigned 32-bit integer, which we all know (*eyeroll*) is 4294967295. So instead of a number about 512x too big to fit, we get one that can fit almost twice over.

Welll….

So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number 2149576664 into any histogram did indeed record as 0. I filed a new bug.

Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 - 1. Zero.

With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.

All numbers between 2^31 and 2^32 were being recorded as 0.

(Screenshot: the error a sensible compiler gives you when you try)

In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.

In C++, however, you can silently convert from unsigned to signed. For values between 0 and 2^31 this doesn’t matter. For values between 2^31 and 2^32, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.

Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.
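You can reproduce the failure mode in a handful of lines. (A minimal sketch of the coercion and the clamp using :ted’s number; it is not the actual Telemetry code.)

```cpp
#include <cstdint>
#include <iostream>

// Stand-in for the clamping a histogram does before recording a sample.
int32_t ClampToNonNegative(int32_t sample) {
  return sample < 0 ? 0 : sample;
}

int main() {
  uint32_t vsizeKB = 2149576664;  // :ted's virtual memory size, in kilobytes

  // The silent part: an implicit unsigned-to-signed conversion.
  // 2149576664 is above 2^31, so on the usual two's-complement platforms
  // it comes out negative. No error, no warning by default.
  int32_t coerced = vsizeKB;

  std::cout << coerced << "\n";                      // -2145390632
  std::cout << ClampToNonNegative(coerced) << "\n";  // 0
}
```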

Now that we’ve found the problem, fixed the problem, and documented the problem, we are collecting data about the data[citation] we may have lost because of the problem.

But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.

All in a day’s work, I guess?

:chutten


Firefox Telemetry Use Counters: Over-estimating usage, now fixed

Firefox Telemetry records the usage of certain web features via a mechanism called Use Counters. Essentially, for every document that Firefox loads, we record a “false” if the document didn’t use a counted feature, and a “true” if the document did use that counted feature.

(( We technically count it when the documents are destroyed, not loaded, since a document could use a feature at any time during its lifetime. We also count top-level documents (pages) separately from the count of all documents (including iframes), so we can see if it is the pages that users load that are using a feature or if it’s the subdocuments that the page author loads on the user’s behalf that are contributing the counts. ))

To save space, we decided to count the number of documents once, and the number of “true” values in each use counter. This saved users from having to tell us they didn’t use any of Feature 1, Feature 2, Feature 5, Feature 7, … the “no-use” use counters. They could just tell us which features they did see used, and we could work out the rest.

Only, we got it wrong.

The server-side adjustment of the counts took every use counter we were told about, and filled in the “false” values. A simple fix.

But it didn’t add in the “no-use” use counters. Users who didn’t see a feature used at all weren’t having their “false” values counted.

This led us to under-count the number of “false” values (since we only counted “falses” from users who had at least one “true”), which led us to overestimate the usage of features.
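If you prefer code to prose, here is a sketch of the difference between the buggy and the corrected aggregation (the structures and names are invented for illustration; the real pipeline is neither C++ nor this simple):

```cpp
#include <cstdint>
#include <vector>

// Invented, simplified per-client submission: how many documents the client
// destroyed, and how many of them used the feature in question.
struct ClientReport {
  uint64_t documentCount;
  uint64_t featureTrueCount;  // 0 if the client never saw the feature used
};

// Buggy aggregation: only clients that reported the use counter at all
// (i.e. had at least one "true") contribute their "false" values.
double BuggyUsageRate(const std::vector<ClientReport>& reports) {
  uint64_t trues = 0, falses = 0;
  for (const auto& r : reports) {
    if (r.featureTrueCount == 0) continue;  // "no-use" clients dropped!
    trues += r.featureTrueCount;
    falses += r.documentCount - r.featureTrueCount;
  }
  return static_cast<double>(trues) / (trues + falses);
}

// Corrected aggregation: every client's documents count toward "false",
// whether or not that client ever saw the feature used.
double FixedUsageRate(const std::vector<ClientReport>& reports) {
  uint64_t trues = 0, documents = 0;
  for (const auto& r : reports) {
    trues += r.featureTrueCount;
    documents += r.documentCount;
  }
  return static_cast<double>(trues) / documents;
}
```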

Of all the errors to have, this one was probably the most benign. In failing in the “overestimate” direction we didn’t incorrectly remove features that were being used more than measured… but we may have kept some features that we could have removed, costing Mozilla time and energy for their maintenance.

Once we detected the fault, we started addressing it. First, we started educating people whenever the topic came up in email and bugzilla. Second, :gfritzsche added a fancy Use Counter Dashboard that did a client-side adjustment using the correct “true” and “false” values for a given population.

Third, and finally, we fixed the server-side aggregator service to serve the correct values for all data, current and historical.

And that brings us to today: Use Counters are fixed! Please use them, they’re kind of cool.

:chutten

(Plot: Before)
(Plot: After (4B more samples))

Data Science is Hard: What’s in a Dashboard

(Dashboard screenshot. The data is fake, don’t get excited.)

Firefox Quantum is here! Please do give it a go. We have been working really hard on it for quite some time, now. We’re very proud of what we’ve achieved.

To show Mozillians how the release is progressing, and show off a little about what cool things we can learn from the data Telemetry collects, we’ve built a few internal dashboards. The Data Team dashboard shows new user count, uptake, usage, install success, pages visited, and session hours (as seen above, with faked data). If you visit one of our Mozilla Offices, you may see it on the big monitors in the common areas.

The dashboard doesn’t look like much: six plots and a little writing. What’s the big deal?

Well, doing things right involved quite a lot more than just one person whipping something together overnight:

1. Meetings for this dashboard started on Hallowe’en, two weeks before launch. Each meeting had between eight and fourteen attendees and ran for its full half-hour allotment.

2. In addition there were several one-off meetings: with Comms (internal and external) to make sure we weren’t putting our foot in our mouth, with Data ops to make sure we weren’t depending on datasets that would go down at the wrong moment, with other teams with other dashboards to make sure we weren’t stealing anyone’s thunder, and with SVPs and C-levels to make sure we had a final sign-off.

3. Outside of meetings we spent hours and hours on dashboard design and development, query construction and review, discussion after discussion after discussion…

4. To say nothing of all the bikeshedding.

It’s hard to do things right. It’s hard to do even the simplest things, sometimes. But that’s the job. And Mozilla seems to be pretty good at it.

One last plug: if you want to nudge these graphs a little higher, download and install and use and enjoy the new Firefox Quantum. And maybe encourage others to do the same?

:chutten

Anatomy of a Firefox Update

Alessio (:Dexter) recently landed a new ping for Firefox 56: the “update” ping with reason “ready”. It lets us know when a client’s Firefox has downloaded and installed an update and is only waiting for the user to restart the browser for the update to take effect.

In Firefox 57 he added a second reason for the “update” ping: reason “success”. This lets us know when the user’s started their newly-updated Firefox.

I thought I might as well see what sort of information we could glean from this new data, using the recent shipping of the new Firefox Quantum Beta as a case study.

This is exploratory work and you know what that means[citation needed]: Lots of pretty graphs!

First: the data we knew before the “update” ping: Nothing.

Well, nothing specific. We would know when a given client would use a newly-released build because their Telemetry pings would suddenly have the new version number in them. Whenever the user got around to sending them to us.

We do have data about installs, though. Our stub installer lets us know how and when installs are downloaded and applied. We compile those notifications into a dataset called download_stats. (for anyone who’s interested: this particular data collection isn’t Telemetry. These data points are packaged and sent in different ways.) Its data looks like this:

(Screenshot: Recent Beta Downloads)

Whoops. Well that ain’t good.

On the left we have the tailing edge of users continuing to download installs for Firefox Beta 56 at the rate of 50-150 per hour… and then only a trace level of Firefox Beta 57 after the build was pushed.

It turns out that the stub installer notifications were being rejected as malformed. Luckily we kept the malformed reports around so that after we fixed the problem we could backfill the dataset:

(Screenshot: Recent Beta Downloads, after the backfill)

Now that’s better. We can see up to 4000 installs per hour of users migrating to Beta 57, with distinct time-of-day effects. Perfectly cromulent, though the volume seems a little low.

But that’s installs, not updates.

What do we get with “update” pings? Well, for one, we can run queries rather quickly. Querying “main” pings to find the one where a user switched versions requires sifting through terabytes of data. The query below took two minutes to run:

(Plot: Users Updating to Firefox Quantum Beta 57, per hour)

The red line is update/ready: the number of pings we received in that hour telling us that the user had downloaded an update to Beta 57 and it was ready to go. The blue line is update/success: the number of pings we received that hour telling us the user had started their new Firefox Quantum Beta instance.

And here it is per-minute, just because we can:

(Plot: Users Updating to Firefox Quantum Beta 57, per minute)

September 30 and October 1 were the weekend. As such, we’d expect their volumes to be lower than the weekdays surrounding them. However, looking at the per-minute graph for update/ready (red), why is Friday the 29th the same height as Saturday the 30th? Fridays are usually noticeably busier than Saturdays.

Friday was Navratri in India (one of our largest markets for Beta), but that’s a multi-day festival that started on the Wednesday (and other sources of client data show only a 15% or so dip in user activity on that date in India), so it’s unlikely to have caused a single day’s dip. Friday wasn’t a holiday at all in any of our other larger markets. There weren’t any problems with the updater or “update” ping ingestion. There haven’t been any dataset failures that would explain it. So what gives?

It turns out that Friday’s numbers weren’t low: Saturday’s were high. In order to improve the stability of what was going to become the Firefox 56 release we began on the 26th to offer updates to the new Firefox Quantum Beta to only half of updating Firefox Beta users. To the other half we offered an update to the Firefox 56 Release Candidate.

What is a Release Candidate? Well, for Firefox it is the stabilized, optimized, rebuilt, rebranded version of Firefox that is just about ready to ship to our release population. It is the last chance we have to catch things before it reaches hundreds of millions of users.

It wasn’t until late on the 29th that we opened the floodgates and let the rest of the Beta users update to Beta 57. This contributed to a higher than expected update volume on the 30th, allowing the Saturday numbers to be nearly as voluminous as the Friday ones. You can actually see exactly when we made the change: there’s a sharp jump in the red line late on September 29 that you can see clearly on both “update”-ping-derived plots.

That’s something we wouldn’t see in “main” pings: they only report what version the user is running, not what version they downloaded and when. And that’s not all we get.

The “update”-ping-fueled graphs have two lines, which rather piques my curiosity about how they relate to each other. Visually, the update/ready line (red) is almost always higher than the update/success line (blue). This means that we have more clients downloading and installing updates than we have clients restarting into the updated browser in those intervals. We can count these clients by subtracting the blue line from the red and summing over time:

(Plot: Outstanding Updates for Users Updating to Firefox Quantum Beta 57)
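The arithmetic behind that last plot is nothing fancy. Here is a sketch with invented names, assuming we already have per-hour counts for each line (it is not the actual query):

```cpp
#include <cstdint>
#include <vector>

// Running total of (update/ready - update/success): roughly, how many
// clients have downloaded the update but haven't restarted into it yet.
// Assumes both vectors cover the same hours, in order.
int64_t OutstandingUpdates(const std::vector<int64_t>& readyPerHour,
                           const std::vector<int64_t>& successPerHour) {
  int64_t outstanding = 0;
  for (size_t hour = 0; hour < readyPerHour.size(); ++hour) {
    outstanding += readyPerHour[hour] - successPerHour[hour];
  }
  return outstanding;
}
```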

There are, as of the time I was drafting this post, about one half of one million Beta clients who have the new Firefox Quantum Beta… but haven’t run it yet.

Given the delicious quantity of improvements in the new Firefox Quantum Beta, they’re in for a pleasant surprise when they do.

And you can join in, if you’d like.

:chutten

(NOTE: earlier revisions of this post erroneously said download_stats counted updater notifications. It counts stub installer notifications. I have reworded the post to correct for this error. Many thanks to :ddurst for catching that)

Data Science is Hard: Dangerous Data

I sit next to a developer at my coworking location (I’m one of the many Mozilla staff who work remotely) who recently installed the new Firefox Quantum Beta on his home and work machines. I showed him what I was working on at the time (that graph below showing how nicely our Nightly population has increased in the past six months), and we talked about how we count users.

(Plot: Desktop Nightly DAU/MAU for the Last Six Months, by Version)

=> “But of course we’ll be counting you twice, since you started a fresh profile on each Beta you installed. Actually four times, since you used Nightly to download and install those builds.” This, among other reasons, is why counting users is hard.

<= “Well, you just have to link it to my Firefox Account and then I’ll only count as one.” He figured it’d be a quick join and then we’d have better numbers for some users.

=> “Are you nuts?! We don’t link your Firefox Account to Telemetry! Imagine what an attacker could do with that!”

In a world with adversarial trackers, advertising trackers, and ever more additional trackers, it was novel to this pseudo-coworker of mine that Mozilla would specifically not integrate its systems.

Wouldn’t it be helpful to ourselves and our partners to know more about our users? About their Firefox Accounts? About their browsing history…

Mozilla doesn’t play that game. And our mission, our policies, and our practices help keep us from accidentally providing “value” of this kind for anyone else.

We know the size of users’ history databases, but not what’s in them.

We know you’re the same user when you close and reopen Firefox, but not who you are.

We know whether users have a Firefox Account, but not which ones they are.

We know how many bookmarks users have, but not what they’re for.

We know how many tabs users have open, but not why. (And for those users reporting over 1000 tabs: WHY?!)

And even this much we only know when you let us:

(Screenshot: the Firefox data collection preferences)

Why? Why do we hamstring our revenue stream like this? Why do we compromise on the certainty that having complete information would provide? Why do we allow ourselves to wonder and move cautiously into the unknown when we could measure and react with surety?

Why do we make Data Science even harder by doing this?

Because we care about our users. We think about what a Bad Actor could do if they had access to the data we collect. Before we okay a new data collection we think of all the ways it could be abused: Can it identify the user? Does it link to another dataset? Might it reveal something sensitive?

Yes, we have confidence in our security, our defenses in depth, our privacy policies, and our motivations to work for users and their interests.

But we are also confident that others have motivations and processes and policies that don’t align with ours… and might be given either the authority or the opportunity to gain access in the future.

This is why Firefox Send doesn’t know your encryption key for the files you share with your friends. This is why Firefox Accounts only knows six things (two of them optional) about you, and why Firefox Sync cannot read the data it’s storing for you.

And this is why Telemetry doesn’t know your Firefox Account id.

:chutten

Another Advantage of Decreasing Data Latency: Flatter Graphs

I’ve muttered before about how difficult it can be to measure application crashes. The most important lesson is that you can’t just count the number of crashes, you must normalize it by some “usage” value in order to determine whether a crashy day is because the application got crashier or because the application was just being used more.

Thus you have a numerator (number of crashes) and a denominator (some proxy of application usage) to determine the crash rate: crashes-per-use.

The current dominant denominator for Firefox is “thousand hours that Firefox is open,” or “kilo-usage-hours (kuh).”
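Expressed as code, the rate is trivial (an invented function, just to pin down the units):

```cpp
#include <cstdint>

// Main-process crashes per thousand usage hours ("kilo-usage-hours", kuh).
double CrashesPerKiloUsageHour(uint64_t mainCrashes, double usageHours) {
  return static_cast<double>(mainCrashes) / (usageHours / 1000.0);
}
```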

The biggest problem we’ve been facing lately is how our numerator (number of crashes) comes in at a different rate and time than our denominator (kilo-usage-hours), due to the former being transmitted nearly-immediately via “crash” ping and the latter being transmitted occasionally via “main” ping.

With pingsender now sending most “main” pings as soon as they’re created, our client submission delay for “main” pings is now roughly in line with the client submission delay of “crash” pings.

What does this mean? Well, look at this graph from https://telemetry.mozilla.org/crashes:

(Plot: Crash Rates (Telemetry))

This is the Firefox Beta Main Crash Rate (number of main process crashes on Firefox Beta divided by the number of thousands of hours users had Firefox Beta running) over the past three months or so. The spike in the middle is when we switched from Firefox Beta 54 to Firefox Beta 55. (Most of that spike is a measuring artefact due to a delay between a beta being available and people installing it. Feel free to ignore it for our purposes.)

On the left in the Beta 54 data there is a seven-day cycle where Sundays are the lowest point and Saturdays are the highest.

On the right in the Beta 55 data, there is no seven-day cycle. The rate is flat. (It is a little high, but flat. Feel free to ignore its height for our purposes.)

This is because sending “main” pings with pingsender is behaviour that ships in Firefox 55. Starting with 55, instead of having most of our denominator data (usage hours) coming in one day late due to “main” ping delay, we have that data in-sync with the numerator data (main crashes), resulting in a flat rate.

You can see it in the difference between Firefox ESR 52 (yellow) and Beta 55 (green) in the kusage_hours graph also on https://telemetry.mozilla.org/crashes:

(Plot: kusage_hours, Firefox ESR 52 in yellow and Beta 55 in green)

On the left, before Firefox Beta 55’s release, they were both in sync with each other, but one day behind the crash counts. On the right, after Beta 55’s release, notice that Beta 55’s cycle is now one day ahead of ESR 52’s.

This results in still more graphs that are quite satisfying. To me at least.

It also, somewhat more importantly, now makes the crash rate graph less time-variable. This reduces cognitive load on people looking at the graphs for explanations of what Firefox users experience in the wild. Decision-makers looking at these graphs no longer need to mentally subtract from the graph for Saturday numbers, adding that back in somehow for Sundays (and conducting more subtle adjustments through the week).

Now the rate is just the rate. And any change is much more likely to mean a change in crashiness, not some odd day-of-week measurement you can ignore.

I’m not making these graphs to have them ignored.

(many thanks to :philipp for noticing this effect and forcing me to explain it)

:chutten

Latency Improvements, or, Yet Another Satisfying Graph

This is the third in my ongoing series of posts containing satisfying graphs.

Today’s feature: a plot of the mean and 95th percentile submission delays of “main” pings received by Firefox Telemetry from users running Firefox Beta.

(Plot: Beta “main” Ping Submission Delay in hours (mean, 95th %ile))

We went from receiving 95% of pings after about, say, 130 hours (or 5.5 days) down to getting them within about 55 hours (2 days and change). And the numbers will continue to fall as more beta users get the modern beta builds with lower latency ping sending thanks to pingsender.
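For the curious: the per-ping quantity being summarized is just the gap between when a ping was created and when we received it. Summarizing a batch of those delays looks something like this sketch (invented names, one common percentile definition; not the actual analysis code):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct DelaySummary {
  double meanHours;
  double p95Hours;
};

// delaysHours: per-ping submission delays (server receive time minus client
// creation time), in hours, already adjusted for client clock skew.
DelaySummary Summarize(std::vector<double> delaysHours) {
  DelaySummary summary{0.0, 0.0};
  if (delaysHours.empty()) return summary;

  summary.meanHours =
      std::accumulate(delaysHours.begin(), delaysHours.end(), 0.0) /
      delaysHours.size();

  std::size_t p95Index =
      static_cast<std::size_t>(0.95 * (delaysHours.size() - 1));
  std::nth_element(delaysHours.begin(), delaysHours.begin() + p95Index,
                   delaysHours.end());
  summary.p95Hours = delaysHours[p95Index];
  return summary;
}
```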

What does this mean? This means that you should no longer have to wait a week to get a decently-rigorous count of data that comes in via “main” pings (which is most of our data). Instead, you only have to wait a couple of days.

Some teams were using the rule-of-thumb of ten (10) days before counting anything that came in from “main” pings. We should be able to reduce that significantly.

How significantly? Time, and data, will tell. This quarter I’m looking into what guarantees we might be able to extend about our data quality, which includes timeliness… so stay tuned.

For a more rigorous take on this, partake in any of :dexter’s recent reports on RTMO. He’s been tracking the latency improvements and possible increases in duplicate ping rates as these changes have ridden the trains towards release. He’s blogged about it if you want all the rigour but none of the Python.

:chutten

FINE PRINT: Yes, due to how these graphs work they will always look better towards the end because the really delayed stuff hasn’t reached us yet. However, even by the standards of the pre-pingsender mean and 95th percentiles we are far enough after the massive improvement for it to be exceedingly unlikely to change much as more data is received. By the post-pingsender standards, it is almost impossible. So there.

FINER PRINT: These figures include adjustments for client clocks having skewed relative to server clocks. Time is a really hard problem even on a single computer, and trying to reconcile it between many computers separated by oceans both literal and metaphorical is the subject of several dissertations and, likely, therapy sessions. As I mentioned above, for rigour and detail about this and other aspects, see RTMO.