Perplexing Graphs: The Case of the 0KB Virtual Memory Allocations

Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.

This past Monday there was one. It was a little odd.

Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.

That was weird.

So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on telemetry.mozilla.org to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:

[Image: windowsVSIZE]

Hello, Windows.

[Image: windowsx64VSIZE]

Specifically: hello Windows 64-bit.

With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.

Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.

I had no idea how a control flow integrity feature would result in 0-size virtual memory allocations, but when :erahm gives you a hint, you take it. I commented on the bug.

Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor, who had the smartest money in the room, and crucial help came from :ted, who was able to reproduce the bug.

It turns out that turning CFG on made our Virtual Memory allocations jump above two terabytes.

Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)

In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.
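As a toy model of that "one bit per possible jump location" idea, here is a minimal sketch. It is not Windows' actual data structure or API: the class name `JumpBitmap`, its methods, and the idea of numbering locations as small integer "slots" are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only. The names and the slot numbering are hypothetical;
// a real implementation maps process addresses, not small integers.
class JumpBitmap {
public:
    // One bit per slot, packed into 64-bit words, all initially 0 (forbidden).
    explicit JumpBitmap(std::size_t slots)
        : words_((slots + 63) / 64, 0), slots_(slots) {}

    // Mark a slot as a valid indirect jump target (bit = 1).
    void allow(std::size_t slot) {
        assert(slot < slots_);
        words_[slot / 64] |= (std::uint64_t{1} << (slot % 64));
    }

    // The check made before an indirect jump: one index, one shift, one mask.
    bool is_allowed(std::size_t slot) const {
        if (slot >= slots_) return false;  // outside the map: never valid
        return (words_[slot / 64] >> (slot % 64)) & 1u;
    }

    // How much memory the map itself occupies.
    std::size_t size_in_bytes() const {
        return words_.size() * sizeof(std::uint64_t);
    }

private:
    std::vector<std::uint64_t> words_;
    std::size_t slots_;
};
```

The lookup is cheap because it never searches anything. The price is that the map has to cover every location that could ever be a target, whether or not anything lives there.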

When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ virtual memory allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions: 8.1 Update 3 and 10) has a 2TB virtual memory allocation.

So. We have an abnormally-large value for Virtual Memory. How does that become 0?

Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back) will be thinking about the word “overflow”.

And you’d be wrong. Ish.

The raw number :ted was seeing was 2201166503936. This is the number of bytes in his virtual memory allocation, and it is about nine powers of two above what we can fit in 32 bits. However, we report the number of kilobytes. The number of kilobytes is 2149576664, well underneath the maximum number you can store in an unsigned 32-bit integer, which we all know (*eyeroll*) is 4294967295. So instead of a number about 512x too big to fit, we get one that can fit almost twice over.
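For anyone who wants to check that arithmetic without a calculator, here is a small compile-time sketch. The constant names are made up for illustration, and it assumes a kilobyte here means 1024 bytes, which is what makes the two figures in the text agree.

```cpp
#include <cstdint>
#include <limits>

// The raw byte count :ted saw, copied from the text.
constexpr std::uint64_t kVsizeBytes = 2201166503936ULL;

// Assumption: the reported unit is 1 kilobyte = 1024 bytes.
constexpr std::uint64_t kVsizeKilobytes = kVsizeBytes / 1024;

// Largest value an unsigned 32-bit integer can hold: 2^32 - 1.
constexpr std::uint64_t kUint32Max = std::numeric_limits<std::uint32_t>::max();

static_assert(kVsizeKilobytes == 2149576664ULL,
              "the kilobyte figure matches the one in the text");
static_assert(kVsizeBytes > kUint32Max,
              "the byte count does not fit in 32 bits");
static_assert(kVsizeBytes / (kUint32Max + 1) == 512,
              "the byte count is about 512x too big");
static_assert(kVsizeKilobytes <= kUint32Max,
              "the kilobyte count fits in an unsigned 32-bit integer");
```

If any of those claims were false, the file would simply refuse to compile.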

Welll….

So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number 2149576664 into any histogram did indeed record as 0. I filed a new bug.

Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 – 1. Zero.

With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.

All numbers between 2^31 and 2^32 were being recorded as 0.

[Image: sensibleError]

In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.

In C++, however, you can silently convert from unsigned to signed. For values below 2^31 this doesn’t matter. For values from 2^31 up to 2^32 – 1, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.

Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.
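The behaviour described above can be reproduced in a few lines. This is a minimal sketch and not the actual Telemetry code: `accumulate_clamped`, `record`, and `replay_experiments` are hypothetical names, and the order of operations (keep the low 32 bits, convert to signed, clamp) is an assumption chosen because it matches the values tried in this post.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the histogram's accumulate step: it takes a
// signed 32-bit sample and clamps negatives to 0.
std::uint32_t accumulate_clamped(std::int32_t sample) {
    return sample < 0 ? 0u : static_cast<std::uint32_t>(sample);
}

// Hypothetical caller holding an unsigned value.
std::uint32_t record(std::uint64_t value) {
    // Keeping only the low 32 bits models the overflow seen at 2^32 + 1.
    const std::uint32_t as_unsigned = static_cast<std::uint32_t>(value);
    // No cast here: the unsigned argument converts to the signed parameter
    // implicitly. The wraparound is implementation-defined before C++20 and
    // is two's complement on mainstream compilers.
    return accumulate_clamped(as_unsigned);
}

// Replays the values tried in the text.
void replay_experiments() {
    assert(record(2149576664ULL) == 0u);  // :ted's magic number, in kilobytes
    assert(record(2147483649ULL) == 0u);  // 2^31 + 1
    assert(record(4294967295ULL) == 0u);  // 2^32 - 1
    assert(record(4294967297ULL) == 1u);  // 2^32 + 1: the overflow
    assert(record(4294967298ULL) == 2u);  // 2^32 + 2
    assert(record(2147483647ULL) == 2147483647u);  // 2^31 - 1 survives
}
```

Nothing in `record` looks wrong at a glance, and a default build is unlikely to complain about it, which is rather the point.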

Now that we’ve found the problem, fixed the problem, and documented the problem, we are collecting data about the data we may have lost because of the problem.

But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.

All in a day’s work, I guess?

:chutten

Software Ideas People Should Steal, Edition One

Here are five little ideas that I think every relevant software project should implement immediately.

1) WordPress has an excellent feature for linkifying text where pasting links over selected text will linkify the selected text to point to the link. All rich-text editing software needs to implement this on the double: if the clipboard you’re overpasting with starts with ‘http’, then linkify the text, don’t replace it.

2) My new Samsung Galaxy A5 has a little touch where it checks the ambient light level before turning on the screen. If it is dim where you are, it gradually increases the brightness as you turn on the screen instead of immediately jumping to the current, adaptive screen brightness level. This saves my eyeballs from wincing. All phone manufacturers need to implement this.

3) Speaking of phones, when you’re about to go to sleep at night, you need to tell your phone to be quiet (except for the alarm, which should be loud). On BlackBerry 10 you could do this from the lock screen by drawing a shade down over the phone, putting it into Bedside Mode. Nearest I can figure, no other device allows you to do this without unlocking the phone. Lock screen Bedside Mode should’ve been copied by the other phone OSes years ago.

4) Speaking of BlackBerry 10, it still has the best text selection I’ve encountered in a phone. You want to select a paragraph of text. On Android or iPhone you press-hold until it selects a word, then you grab handles and laboriously drag them to where you want. On BB10 you press-hold until it selects a word, and then you keep holding. It selects a sentence. Keep holding. It selects a paragraph. Keep holding. It will visually start selecting further down the page until you finally release. “Expandable Text Selection” is discoverable, delightful, and useful. Phone OS developers, please implement this yesterday.

5) May as well round this off with yet another BlackBerry idea. This time, the BB10 Keyboard. You start typing a message but then realize halfway through that your wording reads as insensitive. The first half’s fine, but your phrasing went downhill six words ago. In the BB10 keyboard just swipe to the left (or right in RTL) six times. Each swipe deletes a word. Then you can start typing again. Near as I can figure, every other keyboard relies on mobile OS text selection to quickly replace more than a few letters at a time. Take this idea, keyboard developers. It’s wonderful.
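To show how little code idea 1 needs, here is a minimal sketch of the paste rule. It is not how WordPress implements it: `paste_over_selection` is a hypothetical helper, the HTML anchor output is just one possible representation, and escaping of the link and text is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns what should replace the current selection
// when the clipboard is pasted over it.
std::string paste_over_selection(const std::string& selected,
                                 const std::string& clipboard) {
    // The rule from idea 1: does the clipboard start with "http"?
    const bool looks_like_link = clipboard.compare(0, 4, "http") == 0;
    if (looks_like_link && !selected.empty()) {
        // Linkify: keep the selected text and point it at the pasted link.
        return "<a href=\"" + clipboard + "\">" + selected + "</a>";
    }
    // Ordinary paste: the clipboard replaces the selection.
    return clipboard;
}

// Two example pastes over the selected text "my blog".
void paste_examples() {
    assert(paste_over_selection("my blog", "https://example.com") ==
           "<a href=\"https://example.com\">my blog</a>");
    assert(paste_over_selection("my blog", "hello") == "hello");
}
```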

That’s all for now, folks. If anyone’s surprised at how many of these are ideas from BlackBerry 10, I’d introduce you to the list I’m not writing about all of the ideas that current smartphones _already_ copied from that now-failed platform. It’s much longer.

:chutten