Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.
This past Monday there was one. It was a little odd.
Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.
That was weird.
So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on telemetry.mozilla.org to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:
Specifically: hello Windows 64-bit.
With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.
Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.
I had no idea how a control flow integrity feature would result in 0-size virtual memory allowances, but when :erahm gives you a hint, you take it. I commented on the bug.
Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor who had the smartest money in the room, and crucial help from :ted who was able to reproduce the bug.
It turns out that turning CFG on made our Virtual Memory allowances jump above two terabytes.
Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)
In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.
When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ storage allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions (8.1 Update 3 and 10)) has a 2TB virtual memory allocation.
So. We have an abnormally-large value for Virtual Memory. How does that become 0?
Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back), will be thinking about the word “overflow”.
And you’d be wrong. Ish.
The raw number :ted was seeing was the number
So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number a new bug.
into any histogram did indeed record as 0. I filed
Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 – 1. Zero.
With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.
All numbers between 2^31 and 2^32 were being recorded as 0.
In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.
In C++, however, you can silently convert from unsigned to signed. For values between 0 and 2^31 this doesn’t matter. For values between 2^31 and 2^32, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.
Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.
Now that we’ve found the problem, fixed the problem, and documented the problem we are collecting data about the data[citation] we may have lost because of the problem.
But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.
All in a day’s work, I guess?