Going from New Laptop to Productive Mozillian


My old laptop had so many great stickers on it I didn’t want to say goodbye. So I put off my hardware refresh cycle from the recommended 2 years to almost 3.

To tell the truth, it wasn’t only the stickers that made me wary of switching. I had a workflow that worked. The system wasn’t slow. It was only three years old.

But then Windows started crashing on me during video calls. And my Firefox build times became long enough that I ported changes to my Linux desktop before building them. It was time to move on.

Of course this opened up a can of worms. Questions, in the order they presented themselves, included:

Should I move to Mac, or stick with Windows? My lingering dislike for Apple products and complete unfamiliarity with OSX made that choice easy.

Of the Windows laptops, which should I go for? Microsoft’s Surface lineup keeps improving. I had no complaints from my previous Lenovo X1 Carbon. And the Dell XPS 15 and 13 were enjoyed by several of my coworkers.

The Dells I nixed because I didn’t want anything bigger than the X1 I was retiring, and because the webcam is positioned at knuckle-height. I felt wary of the Surface Books because of the number that mhoye had put in the ground thanks to manufacturing defects. Yes, I know he has an outsized effect on hardware and software. My wariness really only served to highlight how much importance I put on familiarity and habit.

X1 Carbon 6th Generation it is, then.

So I initiated the purchase order. It would be sent to Mozilla Toronto, the location charged with providing my IT support, where it would be configured and given an asset number. Then it would be sent to me. And only then would the work begin in setting it up so that I could actually get work done on it.

First, not being a fan of sending keypresses over the network, I disabled Bing search from the Start Menu by setting the following registry keys:

HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Search
BingSearchEnabled dword:00000000
AllowSearchToUseLocation dword:00000000
CortanaConsent dword:00000000
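
As a rough sketch, the three values above could be written into a .reg file rather than typed into the registry editor by hand. The key path and value names come from the list above; the helper name `build_reg_file` and the idea of generating a file at all are assumptions made for illustration.

```python
# Sketch: build the text of a .reg file for the three Search values.
# Assumption: generating a .reg file is just one way to apply these values;
# the helper name below is hypothetical.

KEY_PATH = r"HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Search"

SEARCH_VALUES = {
    "BingSearchEnabled": 0,
    "AllowSearchToUseLocation": 0,
    "CortanaConsent": 0,
}


def build_reg_file(key_path, values):
    """Return .reg file text that sets each name to a DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key_path + "]"]
    for name, value in values.items():
        # DWORDs are written as eight hexadecimal digits.
        lines.append('"{}"=dword:{:08x}'.format(name, value))
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file(KEY_PATH, SEARCH_VALUES))
```

The printed text can be saved with a .reg extension and imported on the new machine.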

Then I fixed some odd defaults in Lenovo’s hardware. Middle-click should middle-click, not enter into a scroll. Fn should need to be pressed to perform special functions on the F keys (it’s like FnLock was default-enabled).

I installed all editions of Firefox. Firefox Beta installed over the release-channel that came pre-installed. Firefox Developer Edition and Nightly came next and added their own icons. I had to edit the shortcuts for each of these individually on the Desktop and in the Quick Launch bar to have -P --no-remote arguments so I wouldn’t accidentally start the wrong edition with the wrong profile and lose all of my data. (This should soon be addressed.)
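
To illustrate what each edited shortcut ends up running, here is a minimal sketch that builds the command line per edition. The install paths and profile names are hypothetical; only the -P and --no-remote arguments come from the setup described above.

```python
# Sketch: the command line each shortcut should launch.
# Assumption: install paths and profile names are made up for illustration.

EDITIONS = {
    "beta": (r"C:\Program Files\Mozilla Firefox\firefox.exe", "work-beta"),
    "developer": (r"C:\Program Files\Firefox Developer Edition\firefox.exe", "dev-edition"),
    "nightly": (r"C:\Program Files\Firefox Nightly\firefox.exe", "nightly"),
}


def launch_args(exe_path, profile):
    """Pin one edition to one named profile and keep it in its own process."""
    return [exe_path, "-P", profile, "--no-remote"]


if __name__ == "__main__":
    for edition, (exe_path, profile) in EDITIONS.items():
        print(edition, launch_args(exe_path, profile))
```

Naming the profile after -P is the part that keeps an edition away from another edition's data.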

In Firefox Beta I logged in to sync to my work Firefox Account. This brought me 60% of the way to being useful right there. So much of my work is done in the browser, and so much of my browsing experience can be brought to life by logging in to Firefox Sync.

The other 40% took the most effort and the most time. This is because I want to be able to compile Firefox on Windows, for my sins, and this isn’t the most pleasant of experiences. Luckily we have “Building Firefox for Windows” instructions on MDN. Unluckily, I want to use git instead of mercurial for version control.

  1. Install mozilla-build
  2. Install Microsoft Visual Studio Community Edition (needed for Win10 SDKs)
  3. Copy over my .vimrc, .bashrc, .gitconfig, and my ssh keys into the mozilla-build shell environment
  4. Add exclusions to Windows Defender for my entire development directory in an effort to speed up Windows’ notoriously-slow filesystem speeds
  5. Install Git for Windows
  6. Clone and configure git-cinnabar for working with Mozilla’s mercurial repositories
  7. Clone mozilla-unified
    • This takes hours to complete. The download is pretty quick, but turning all of the mercurial changesets into git commits requires a lot of filesystem operations.
  8. Download git-prompt.sh so I can see the current branch in my mozilla-build prompt
  9. ./mach bootstrap
    • This takes dozens of minutes and can’t be left alone as it has questions that need answers at various points in the process.
  10. ./mach build
    • This originally failed because when I checked out mozilla-unified in Step 7 my git used the wrong line-endings. (core.eol should be set to lf and core.autocrlf to false)
    • Then it failed because ./mach bootstrap downloaded the wrong rust std library. I managed to find rustup in ~/.cargo/bin which allowed me to follow the build system’s error message and fix things
  11. Just under 50min later I have a Firefox build
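
The line-ending fix from Step 10 is easy to forget, so here is a hedged sketch of applying it from a script before cloning. The two settings (core.eol and core.autocrlf) are the ones named above; using the --global scope and the helper names are assumptions.

```python
# Sketch: set git's line-ending options before cloning mozilla-unified.
# Assumption: --global scope is a guess; a per-repository setting would also work.
import subprocess

LINE_ENDING_SETTINGS = {"core.eol": "lf", "core.autocrlf": "false"}


def git_config_commands(settings, scope="--global"):
    """Return one `git config` argument list per setting."""
    return [["git", "config", scope, key, value] for key, value in settings.items()]


def apply_settings(commands):
    """Run each command, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in git_config_commands(LINE_ENDING_SETTINGS):
        print(" ".join(command))
```

Calling `apply_settings(git_config_commands(LINE_ENDING_SETTINGS))` before Step 7 should avoid the first build failure; the main block only prints the commands so nothing changes by accident.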

And that’s not all. I haven’t installed the necessary tools for uploading patches to Mozilla’s Phabricator instance so they can undergo code review. I haven’t installed Chrome so I can check if things are broken for everyone or just for Firefox. I haven’t cloned and configured the frankly-daunting number of GitHub repositories in use by my team and the wider org.

Only with all this done can I be a productive mozillian. It takes hours, and knowledge gained over my nearly-3 years of employment here.

Could it be automated? Technologically, almost certainly yes. The latest mozilla-build can be fetched from a central location. mozilla-unified can be cloned using the version control setup of choice. The correct version of Visual Studio Community can be installed (but maybe not usably given its reliance on Microsoft Accounts). We might be able to get all the way to a working Firefox build from a recent checkout of the source tree before the laptop leaves IT’s hands.

It might not be worth it. How many mozillians even need a working Firefox build, anyway? And how often are they requesting new hardware?

Ignoring the requirement to build Firefox, then, why was the laptop furnished with a release-channel version of Firefox? Shouldn’t it at least have been Beta?

And could this process of setup be better documented? The parts common to multiple teams appear well documented to begin with. The “Building Firefox on Windows” documentation on MDN is exceedingly clear to work with despite the frightening complexity of its underpinnings. And my team has onboarding docs focused on getting new employees connected and confident.

Ultimately I believe this is probably as simple and as efficient as this process will get. Maybe it’s a good thing that I only undertook this after three years. That seems like a nice length of time to amortize the hours of cost it took to get back to productive.

Oh, and as for the stickers… well, Mozilla has a program for buying your own old laptop. I splurged and am using it to replace my 2009 Aspire Revo to connect to my TV and provide living room computing. It is working out just swell.

:chutten


Perplexing Graphs: The Case of the 0KB Virtual Memory Allocations

Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.

This past Monday there was one. It was a little odd.

Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.

That was weird.

So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on telemetry.mozilla.org to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:

[Image: windowsVSIZE]

Hello, Windows.

[Image: windowsx64VSIZE]

Specifically: hello Windows 64-bit.
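
The split by operating system and architecture can be sketched in a few lines. The sample reports here are entirely made up for illustration, and the function name is hypothetical; the real triage was done on telemetry.mozilla.org, not with code like this.

```python
# Sketch: what fraction of reports in each (OS, architecture) group is zero?
# Assumption: SAMPLE_REPORTS is invented data, not real Telemetry.
from collections import defaultdict

SAMPLE_REPORTS = [
    ("Windows", "x86-64", 0),
    ("Windows", "x86-64", 0),
    ("Windows", "x86-64", 1800000),
    ("Windows", "x86", 1500000),
    ("Linux", "x86-64", 2100000),
    ("Darwin", "x86-64", 3900000),
]


def zero_fraction_by_group(reports):
    """Map each (os, arch) pair to the share of its reports that are 0."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for os_name, arch, value in reports:
        group = (os_name, arch)
        totals[group] += 1
        if value == 0:
            zeros[group] += 1
    return {group: zeros[group] / totals[group] for group in totals}


if __name__ == "__main__":
    for group, fraction in zero_fraction_by_group(SAMPLE_REPORTS).items():
        print(group, round(fraction, 2))
```

A group whose zero share stands far above the rest is the one worth chasing.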

With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.

Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.

I had no idea how a control flow integrity feature would result in 0-size virtual memory allocations, but when :erahm gives you a hint, you take it. I commented on the bug.

Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor, who had the smartest money in the room, and crucial help came from :ted, who was able to reproduce the bug.

It turns out that turning CFG on made our Virtual Memory allocations jump above two terabytes.

Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)

In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.
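
A toy version of that bit-per-location map, as a sketch only: this models the idea described above over a tiny address space and is not how Windows actually lays out or checks its bitmap. The class and method names are hypothetical.

```python
# Sketch: one bit per possible jump location, 1 = valid target, 0 = forbidden.
# Assumption: a toy model over a tiny address space, not Windows' real layout.


class JumpBitmap:
    def __init__(self, address_space_size):
        self.size = address_space_size
        # Eight locations fit in each byte, rounded up.
        self.bits = bytearray((address_space_size + 7) // 8)

    def allow(self, address):
        """Mark an address as a valid indirect jump target."""
        self.bits[address // 8] |= 1 << (address % 8)

    def is_allowed(self, address):
        """Check the bit for an address before jumping to it."""
        return bool(self.bits[address // 8] & (1 << (address % 8)))


if __name__ == "__main__":
    bitmap = JumpBitmap(64)
    bitmap.allow(10)
    for target in (10, 11):
        verdict = "jump" if bitmap.is_allowed(target) else "interrupt the process"
        print(target, verdict)
```

Even in the toy, the storage cost grows with the size of the address space rather than with the number of valid targets.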

When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ storage allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions (8.1 Update 3 and 10)) has a 2TB virtual memory allocation.

So. We have an abnormally-large value for Virtual Memory. How does that become 0?

Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back) will be thinking about the word “overflow”.

And you’d be wrong. Ish.

The raw number :ted was seeing was 2201166503936. This is the number of bytes in his virtual memory allocation and is a few powers of two above what we can fit in 32 bits. However, we report the number of kilobytes. The number of kilobytes is 2149576664, well underneath the maximum number you can store in an unsigned 32-bit integer, which we all know (*eyeroll*) is 4294967295. So instead of a number about 512x too big to fit, we get one that can fit almost twice over.
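
The arithmetic in that paragraph can be checked directly. The variable names are mine; the byte count is the one :ted saw.

```python
# Sketch: check the bytes-to-kilobytes arithmetic against a 32-bit limit.
VSIZE_BYTES = 2201166503936      # raw value reported in bytes
UINT32_MAX = 2**32 - 1           # largest unsigned 32-bit integer

vsize_kb = VSIZE_BYTES // 1024   # the histogram records kilobytes

# How far past 32 bits is the byte count, and how much room does the
# kilobyte count leave?
bytes_over = VSIZE_BYTES / 2**32
kb_headroom = UINT32_MAX / vsize_kb

if __name__ == "__main__":
    print(vsize_kb)              # 2149576664
    print(round(bytes_over, 1), round(kb_headroom, 3))
```

The byte count is a little over 512 times too large for 32 bits, while the kilobyte count leaves just under a factor of two to spare.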

Welll….

So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number 2149576664 into any histogram did indeed record as 0. I filed a new bug.

Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 – 1. Zero.

With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.

All numbers between 2^31 and 2^32 were being recorded as 0.

[Image: sensibleError]

In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.

In C++, however, you can silently convert from unsigned to signed. For values between 0 and 2^31 this doesn’t matter. For values between 2^31 and 2^32, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.

Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.
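
The observed behaviour can be modelled in a few lines. This is a sketch of the arithmetic only, under the assumption that the value is truncated to 32 bits, reinterpreted as signed, and then clamped; it is not the actual Telemetry code, and the function names are hypothetical.

```python
# Sketch: model "unsigned 32-bit value silently treated as signed, then clamped".
# Assumption: this mimics the symptoms described, not the real C++ code path.


def as_uint32(value):
    """Keep only the low 32 bits, as an unsigned 32-bit integer would."""
    return value & 0xFFFFFFFF


def as_int32(value):
    """Reinterpret the low 32 bits as a two's-complement signed integer."""
    value = as_uint32(value)
    return value - 2**32 if value >= 2**31 else value


def record(value):
    """Histograms don't record negatives, so clamp them to 0."""
    return max(0, as_int32(value))


if __name__ == "__main__":
    for sample in (2149576664, 2**31 + 1, 2**32 - 1, 2**32 + 1, 2**32 + 2):
        print(sample, "->", record(sample))
```

Everything from 2^31 up to 2^32 - 1 goes negative under the signed reading and is clamped to 0, while 2^32 + 1 wraps around to 1 before the signed reading ever matters.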

Now that we’ve found the problem, fixed the problem, and documented the problem, we are collecting data about the data[citation] we may have lost because of the problem.

But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.

All in a day’s work, I guess?

:chutten