Distributed Teams: On the non-Universality of “Not it!”

I’ve surprisingly not written a lot over here about working on a distributed team in a distributed organization. Mozilla is about 60% people who work in MoLos (office workers) and 40% people who don’t (remotees). My team is 50/50: I’m remote near Toronto, one works from his home in Italy, and the other two sit in the Berlin office most days.

If I expand to encompass one extra level of organization, I work with people in Toronto, San Francisco, Poland, Iowa, Illinois, more Berlin, Nova Scotia… well, you get the idea. For the past two and a half years I’ve been working with people from all over the world and I have been learning how that’s different from the rather-more-monocultural experience I had working in offices for the previous 8-10.

So today when I shouted “Not it!” into the IRC channel in response to the dawning realization that someone would have to investigate and take ownership of some test bustage, I followed it up within the minute with a cultural note:

09:35 <chutten> (Actually, that's a cultural thing that may need
 explanation. As kids, usually at summer sleep-away camp, if there
 is an undesirable thing that needs to be done by one person in
 the cabin the last person to say "Not it" is "It" and thus, has
 to do the undesirable thing.)

“Not it” is cultural. I think. I’ve been able to find surprisingly little about its origins in the usual places. It seems to share some heritage with the game of Tag. It seems to be an evolution of the game “Nose Goes,” but it’s hard to tell exactly where any of it started. Wikipedia can’t find an origin earlier than the 1979 Canadian film “Meatballs” where the nose game was already assumed to be a part of camp life.

Regardless of origin, I can’t assume it’s shared amongst my team. Thus my little note. Lucky for me, they seem to enjoy learning things like this. Even luckier, they enjoy sharing back. For instance, :gfritzsche once said his thumbs were pressed that we’d get something done by week’s end… what?

There were at least two things I didn’t understand about that: the first was what he meant, the second was how one pressed one’s thumbs. I mean, do you put them in your fist and squeeze, or do you press them on the outside of your fist and pretend you’re having a Thumb War (yet another cultural artefact)?

First, it means hoping for good luck. Second, it’s with thumbs inside your fist, not outside. I’m very lucky there’s a similar behaviour and expression that I’m already familiar with (“fingers crossed”). This will not always be the case, and it won’t ever be an even exchange…

All four of my team members speak the language I spoke at home while I was growing up. A lot of my culture is exported by the US via Hollywood, embedding it into the brains of the people with whom I work. I have a huge head-start on understanding and being understood, and I need to remain mindful of it.

Hopefully I’ll be able to catch some of my more egregious references before I need to explain camp songs, cringe-worthy 90s slang, or just how many hours I spent in a minivan with my siblings looking for the letter X on a license plate.

Or, then again, maybe those explanations are just part of being a distributed team?

:chutten

Advertisements

TIL: C++ character constants aren’t created equal

arrPtrTerm

When reviewing a patchset involving files, strings, timers, and multiple threads (essentially all the “tough to review” checkboxes right there), a comment from :froydnj caught my eye:

> +const char* kPersistenceFileName = "gv_measurements.json";
> +const char* kPersistenceFileNameNoExt = "gv_measurements";

Please make these `const char kPersistence...[] = "..."`, which is slightly smaller on all of our usual systems.

Why? Aren’t C arrays just pointers to the first element? Don’t we treat char[] and char* identically when logging or appending them to higher-order string objects?

Influenced by :Dexter and :jan-erik, I tried this out myself. I made two small C programs arr.c and ptr.c where the only difference was that one had a const char* and the other a const char[].

I compiled them both with gcc and then checked their sizes in bytes.arrPtrSizes

Sure enough, though ptr.c was smaller by one byte (* is one character and [] is two), after being compiled ptr was larger by a whole 8 bytes!

This is because C arrays aren’t identical to pointers to the first element. Instead they are identifiers that are (most of the time) implicitly converted to be pointers to the first element. For instance, if you’re calling a function void foo(char c[]), inside the function `c` is implicitly a char* and its size is that of a pointer, but outside of the function c could be an array with its size being the length of the array. As an example:

void foo(char c[]) {
  printf("sizeof(c): %d\n", sizeof c);
}
int main(void) {
  char arr[] = "a";
  printf("sizeof(arr): %d\n", sizeof arr);
  return 0;
}

Prints:

sizeof(arr): 2
sizeof(c): 8

Another way to think about this is that the char* form allocates two things: the actual string of four (“abc” plus “\0”) characters, and a pointer called ptr that contains the address of the place the ‘a’ character is stored.

This is in contrast to the char[] form which allocates just the four-character string and allows you to use the name arr to refer to the place the ‘a’ character is stored.

So, in conclusion, if you want to save yourself a pointer’s width on your constant strings, you may wish to declare them as char[] instead of char*.

:chutten

(( :glandium wrote a much deeper look at this problem if you’d like to see how much worse this can actually get thanks to relocations and symbol exporting ))

(( :bsmedberg wrote in to mention that you almost never want const char* for a constant anyway as the pointer isn’t const, only the data is. ))

Mozilla All-Hands Tips

25486648678_90fa78a27e_k
All Hands Austin, December 2017, Mitchell Baker presenting. (Photo used under CC BY-NC-SA 2.0)

Twice a year, Mozilla gathers employees, volunteers, and assorted hangers-on in a single place to have a week of planning, working, and socializing. Being as distributed an organization as we are, it’s a bit rare to get enough of us in a single place to generate the kind of cross-talk and beneficial synergistic happenstances that help us work smarter and move in more-or-less the same direction. These are our All Hands events.

They’re a Pretty Big Deal(tm).

So here you are, individual contributor or manager, staff or volunteer, veteran or first-timer. With all these Big Plans, what are we littler folk to do to not become overwhelmed?

I have some tips.

Before You Go

Set up a mail folder/label for relevant email: You’ll be getting some email with details about where you should be, what you should be doing, and when. Organizing these into one place is helpful for reference, so come up with a label (maybe “201807-sanfran” or “mozsf2” or “fogzilla” or something) and organize those emails as they come in.

Act on those emails immediately: If they contain instructions or an announcement that bookings or registration is now open… then do that thing right then. Do not file the email and forget. Do the thing while you are looking at that email. Only then should you file that email and get back to where you were in your brain. If you absolutely can’t just then (have to synchronize with family or what-have-you), put a calendar reminder in that repeats every weekday until you handle it.

Do not upgrade Nightly: You’re running Nightly, right? You’ll be travelling through a land of uncertain connectivity, and the last thing you want is to use it downloading a multi-MB Nightly update that might have accidentally disabled Captive Portal Detection. If it works, keep your Nightly build until you’re certain you have the bandwidth to download a new one. All else fails, keep it until you get back.

Make sure your laptop is in shape: My laptop is often neglected in favour of its Desktop comrade: updates may be pending, credentials may have expired, the source code checkouts might be weeks old, and there may have even been a new version of Mozilla Build released since the last time I tried to compile Firefox. With luck, while at an All Hands you won’t have to compile Gecko on a laptop in your hotel… but we make our own luck, we who are prepared. Prepare your laptop.

Prepare your family: If you don’t live alone, you’ll have non-mozilla prepwork to do. Spouse and kids or roomates and pets, there are lifeforms who normally expect to see you that won’t. Clear the family schedule for the week you’re gone, and do as much preparation ahead of time as you can. Laundry, meal planning, groceries, sitters, dog walkers, even lawn services are things you can arrange to lighten the load that your absence will place on those around you. Even if you’re bringing them with you.

While You’re There

Do not fear missing out: You will not be able to attend both Boardgame Night and your team dinner. There will be karaoke parties you won’t get to, or be invited to. This is fine. This is expected. This is unavoidable when you have so many people disorganizing so many things simultaneously. So don’t fret about it. Prioritize.

Say no: Speaking of prioritizing: prioritize for yourself. You may very well be operating as a Level 100 You for hours at a time. So many people to talk to, so many talks and social events to organize, deliver, and attend… No. You don’t have to stay the entire length of the party. You don’t even have to go. If you feel yourself fading, get out while you have the strength. Regroup. Find a quiet corner or go to sleep early… At my first All Hands, I napped on both Wednesday and Thursday. And I wasn’t even in a different timezone. It really helped.

Wash your hands: Lots. Before meals. After meals. You’ll be talking, working, eating, and otherwise hanging out with a thousand of your closest coworkers. It’s probably your best bet for not catching mozflu, and it’s definitely your best bet to not transmit it.

After You’re Back

Consider taking a day: Generally speaking you’ll be flying back on Saturday and returning to work on Monday. Depending on distance to travel, available flight times, and cancellations, this may result in only a few hours between stumbling through your door and stumbling back to work. Consider booking that Monday off (or, honestly, if your trip back was heinous, don’t even book it off. Just take it. Get some sleep. Work can wait until Tuesday.)

Check in: If you live with family, you haven’t seen them for a week. Even if you brought them with you, you’ve been in meetings and talks and stuff most hours. Check in with them. Get up to speed on what’s been happening in their lives while you’ve been away.

Get excited for the next one: Even immediately back from an All Hands, it’s still only six months to the next one. Take stock of what you liked and what you didn’t like about this one. Rest up, and try not to get impatient :)

:chutten

(( Great minds think alike, because Seburo recently wrote a Wiki article covering even more excellent tips for All Hands events. Check that out, too! ))

Annoying Graphs: Did the Facebook Container Add-on Result in More New Firefox Profiles?

Yesterday, Mozilla was in the news again for releasing a Firefox add-on called Facebook Container. The work of (amongst others) :groovecoder, :pdol, :pdehaan, :rfeeley, :tanvi, and :jkt, Facebook Container puts Facebook in a little box and doesn’t let it see what else you do on the web.

You can try it out right now if you’d like. It’s really just as simple as going to the Facebook Container page on addons.mozilla.org and clicking on the “+ Add to Firefox” button. From then on Facebook will only be able to track you with their cookies while you are actually visiting Facebook.

It’s easy-to-use, open source, and incredibly timely. So it quickly hit the usual nerdy corners of the web… but then it spread. Even Forbes picked it up. We started seeing incredible numbers of hits on the blogpost (I don’t have plots for that, sorry).

With all this positive press did we see any additional new Firefox users because of it?

Normally this is where I trot out the usual gimmick “Well, it depends on how you word the question.” “Additional” compared to what, exactly? Compared to the day before? The same day a week ago? A month ago?

In this case it really doesn’t depend. I can’t tell, no matter how I word the question. And this annoys me.

I mean, look at these graphs:

Here’s one showing the new-profile pings we receive each minute of some interesting days:c52dd445-e624-47aa-a44d-d5e758b56b04

Summer Time lining up with Daylight Saving Time means that different parts of the world were installing Firefox at different times of the day. The shapes of the curves don’t line up, making it impossible to compare between days.

So here’s one showing the number of new-profile pings we received each day this month:ebce02bb-1c78-4c52-9878-9a9e8d78e459

Yesterday’s numbers are low comparing to other Tuesdays these past four weeks, but look at how low Monday’s numbers are! Clearly this is some weird kinda week, making it impossible to compare between weeks.

So here’s one showing approximate Firefox client counts of last April:1d44c744-0267-4216-9371-5bf042ba47e7

This highlights a seasonal depression starting the week of April 10 similar to the one shown in the previous plot. This is expected since we’re in the weeks surrounding Easter… but why did I look at last April instead of last March? Easter changes its position relative to the civil calendar, making it impossible to compare between years.

So, did we see any additional new Firefox users thanks to all of the hard work put into Facebook Container?

¯\_(-.-)_/¯

:chutten

TIL: Feature Detection in Windows using GetProcAddress

In JavaScript, if you want to use a function that was introduced only in certain versions of browsers, you use Feature Detection. For example, you can ask “Hey, browser, do you have a function called `includes` on Array?” If the browser has it, you use it; and if it doesn’t, you either get along without it or load your own implementation.

It turns out that this same concept can be (and, in Firefox, is) done with Windows APIs.

Firefox for Windows is built against the Windows 10 SDK. This means the compiler knows the API calls and type definitions for all sorts of wondrous modern features like toast notifications and enumerating graphics adapters in a specific order.

However, as of writing, Firefox for Windows supports Windows 7 and up. What would happen if Firefox tried to use those fancy new Windows 10 features when running on Windows 7?

Well, at compile time (when Mozilla builds Firefox), it knows everything it needs to about the sizes and names of things used in the new features thanks to the SDK. At runtime (when a user runs Firefox), it needs to ask Windows at what address exactly all of those fancy new features live so that it can use them.

If Firefox can’t find a feature it expects to be there, it won’t start. We want Firefox to start, though, and we want to use the new features when available. So how do we both use the new feature (if it’s there) and not (if it’s not)?

Windows provides an API called GetProcAddress that allows the running program to perform some Feature Detection. It is asking Windows “Hey, so I’m looking for the address of this fancy new feature named FancyNewAPI. Do you know where that is?”. Windows will either reply “No, sorry” at which point you work around it, or “Yes, it’s over at address X” at which point to convert address X into a function pointer that takes the same number and types of arguments that the documentation said it takes and then instruct your program to jump into it and start executing.

We use this in Firefox to detect gamepad input modules, cancelable synchronous IO, display density measurements, and a whole bunch of graphics and media acceleration stuff.

And today (well, yesterday at this point) I learned about it. And now so have you.

:chutten

–edited to remove incorrect note that GetProcAddress started in WinXP– :aklotz noted that GetProcAddress has been around since ancient times, MSDN just periodically updates its “Minimum Supported Release” fields to drop older versions.

Perplexing Graphs: The Case of the 0KB Virtual Memory Allocations

Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.

This past Monday there was one. It was a little odd.489b9ce7-84e6-4de0-b52d-e0179a9fdb1a

Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.

That was weird.

So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on telemetry.mozilla.org to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:

windowsVSIZE

Hello, Windows.

windowsx64VSIZE

Specifically: hello Windows 64-bit.

With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.

Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.

I had no idea how a control flow integrity feature would result in 0-size virtual memory allowances, but when :erahm gives you a hint, you take it. I commented on the bug.

Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor who had the smartest money in the room, and crucial help from :ted who was able to reproduce the bug.

It turns out that turning CFG on made our Virtual Memory allowances jump above two terabytes.

Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)

In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.

When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ storage allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions (8.1 Update 3 and 10)) has a 2TB virtual memory allocation.

So. We have an abnormally-large value for Virtual Memory. How does that become 0?

Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back), will be thinking about the word “overflow”.

And you’d be wrong. Ish.

The raw number :ted was seeing was the number 2201166503936. This number is the number of bytes in his virtual memory allocation and is a few powers of two above what we can fit in 32 bits. However, we report the number of kilobytes. The number of kilobytes is 2149576664, well underneath the maximum number you can store in an unsigned 32-bit integer, which we all know (*eyeroll*) is 4294967296. So instead of a number about 512x too big to fit, we get one that can fit almost twice over.

Welll….

So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number 2149576664 into any histogram did indeed record as 0. I filed a new bug.

Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 – 1. Zero.

With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.

All numbers between 2^31 and 2^32 were being recorded as 0.

sensibleError

In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.

In C++, however, you can silently convert from unsigned to signed. For values between 0 and 2^31 this doesn’t matter. For values between 2^31 and 2^32, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.

Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.

Now that we’ve found the problem, fixed the problem, and documented the problem we are collecting data about the data[citation] we may have lost because of the problem.

But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.

All in a day’s work, I guess?

:chutten

Firefox Telemetry Use Counters: Over-estimating usage, now fixed

Firefox Telemetry records the usage of certain web features via a mechanism called Use Counters. Essentially, for every document that Firefox loads, we record a “false” if the document didn’t use a counted feature, and a “true” if the document did use that counted feature.

(( We technically count it when the documents are destroyed, not loaded, since a document could use a feature at any time during its lifetime. We also count top-level documents (pages) separately from the count of all documents (including iframes), so we can see if it is the pages that users load that are using a feature or if it’s the subdocuments that the page author loads on the user’s behalf that are contributing the counts. ))

To save space, we decided to count the number of documents once, and the number of “true” values in each use counter. This saved users from having to tell us they didn’t use any of Feature 1, Feature 2, Feature 5, Feature 7, … the “no-use” use counters. They could just tell us which features they did see used, and we could work out the rest.

Only, we got it wrong.

The server-side adjustment of the counts took every use counter we were told about, and filled in the “false” values. A simple fix.

But it didn’t add in the “no-use” use counters. Users who didn’t see a feature used at all weren’t having their “false” values counted.

This led us to under-count the number of “false” values (since we only counted “falses” from users who had at least one “true”), which led us to overestimate the usage of features.

Of all the errors to have, this one was probably the more benign. In failing in the “overestimate” direction we didn’t incorrectly remove features that were being used more than measured… but we may have kept some features that we could have removed, costing mozilla time and energy for their maintenance.

Once we detected the fault, we started addressing it. First, we started educating people whenever the topic came up in email and bugzilla. Second, :gfritzsche added a fancy Use Counter Dashboard that did a client-side adjustment using the correct “true” and “false” values for a given population.

Third, and finally, we fixed the server-side aggregator service to serve the correct values for all data, current and historical.

And that brings us to today: Use Counters are fixed! Please use them, they’re kind of cool.

:chutten

bfcbd97c-80cf-483b-8707-def6057474e6
Before
beb4afee-f937-4729-b210-f4e212da7504
After (4B more samples)