When do All Firefox Users Update?

Last time we talked about updates I wrote about all of what goes into an individual Firefox user’s ability to update to a new release. We looked into how often Firefox checks for updates, and how we sometimes lie and say that there isn’t an update even after release day.

But how does that translate to a population?

Well, let’s look at some pictures. First, the number of “update” pings we received from users during the recent Firefox 61 release:update_ping_volume_release61

This is a real-time look at how many updates were received or installed by Firefox users each minute. There is a plateau’s edge on June 26th shortly after the update went live (around 10am PDT), and then a drop almost exactly 24 hours later when we turned updates off. This plot isn’t the best for looking into the mechanics of how this works since it shows volume of all types of “update” pings from all versions, so it includes users finally installing Firefox Quantum 57 from last November as well as users being granted the fresh update for Firefox 61.

Now that it’s been a week we can look at our derived dataset of “update” pings and get a more nuanced view (actually, latency on this dataset is much lower than a week, but it’s been at least a week). First, here’s the same graph, but filtered to look at only clients who are letting us know they have the Firefox 61 update (“update” ping, reason: “ready”) or they have already updated and are running Firefox 61 for the first time after update (“update” ping, reason: “success”):update_ping_volume_61only_wide

First thing to notice is how closely the two graphs line up. This shows how, during an update, the volume of “update” pings is dominated by those users who are updating to the most recent version.

And it’s also nice validation that we’re looking at the same data historically that we were in real time.

To step into the mechanics, let’s break the graph into its two constituent parts: the users reporting that they’ve received the update (reason: “ready”) and the users reporting that they’re now running the updated Firefox (reason: “success”).

update_ping_volume_stackedupdate_ping_volume_unstacked

The first graph shows the two lines stacked for maximum similarity to the graphs above. The second unstacks the two so we can examine them individually.

It is now much clearer to see how and when we turned updates on and off during this release. We turned them on June 26, off June 27, then on again June 28. The blue line also shows us some other features: the Canada Day weekend of lower activity June 30 and July 1, and even time-of-day effects where our sharpest peaks are (EDT) 6-8am, 12-2pm, and a noticeable hook at 1am.

(( That first peak is mostly made up of countries in the Central European Timezone UTC+1 (e.g. Germany, France, Poland). Central Europe’s peak is so sharp because Firefox users, like the populations of European countries, are concentrated mostly in that one timezone. The second peak is North and South America (e.g. United States, Brazil). It is broader because of how many timezones the Americas span (6) and how populations are dense in the Eastern Timezone and Pacific Timezone which are 3 hours apart (UTC-5 to UTC-8). The noticeable hook is China and Indonesia. China has one timezone for its entire population (UTC+8), and Indonesia has three. ))

This blue line shows us how far we got in the last post: delivering the updates to the user.

The red line shows why that’s only part of the story, and why we need to look at populations in addition to just an individual user.

Ultimately we want to know how quickly we can reach our users with updated code. We want, on release day, to get our fancy new bells and whistles out to a population of users small enough that if something goes wrong we’re impacting the fewest number of them, but big enough that we can consider their experience representative of what the whole Firefox user population would experience were we to release to all of them.

To do this we have a two levers: we can change how frequently Firefox asks for updates, and we can change how many Firefox installs that ask for updates actually get it right away. That’s about it.

So what happens if we change Firefox to check for updates every 6 hours instead of every 12? Well, that would ensure more users will check for updates during periods they’re offered. It would also increase the likelihood of a given user being offered the update when their Firefox asks for one. It would raise the blue line a bit in those first hours.

What if we change the %ge of update requests that are offers? We could tune up or down the number of users who are offered the updates. That would also raise the blue line in those first hours. We could offer the update to more users faster.

But neither of these things would necessarily increase the speed at which we hit that Goldilocks number of users that is both big enough to be representative, and small enough to be prudent. Why? The red line is why. There is a delay between a Firefox having an update and a user restarting it to take advantage of it.

Users who have an update won’t install it immediately (see how the red line is lower than the blue except when we turn updates off completely), and even if we turn updates off it doesn’t stop users who have the update from installing it (see how the red line continues even when the blue line is floored).

Even with our current practices of serving updates, more users receive the update than install it for at least the first week after release.

If we want to accelerate users getting update code, we need to control the red line. Which we don’t. And likely can’t.

I mean, we can try. When an update has been pending for long enough, you get a little arrow on your Firefox menu: update_arrow.png

If you leave it even longer, we provide a larger piece of UI: a doorhanger.

restartDoorhanger

We can tune how long it takes to show these. We can show the doorhanger immediately when the update is ready, asking the user to stop what they’re doing and–

windows_update_bluewindows_updatewindows_update_blue_lg

…maybe we should just wait until users update their own selves, and just offer some mild encouragement if they’ve put it off for, say, four days? We’ll give the user eight days before we show them anything as invasive as the doorhanger. And if they dismiss that doorhanger, we’ll just not show it again and trust they’ll (eventually) restart their browser.

…if only because Windows restarted their whole machines when they weren’t looking.

(( If you want your updates faster and more frequent, may I suggest installing Firefox Beta (updates every week) or Firefox Nightly (updates twice a day)? ))

This means that if the question is “When do all Firefox users update?” the answer is, essentially, “When they want to.” We can try and serve the updates faster, or try to encourage them to restart their browsers sooner… but when all is said and done our users will restart their browsers when they want to restart their browsers, and not a moment before.

Maybe in the future we’ll be able to smoothly update Firefox when we detect the user isn’t using it and when all the information they have entered can be restored. Maybe in the future we’ll be able to update seamlessly by porting the running state from instance to instance. But for now, given our current capabilities, we don’t want to dictate when a user has to restart their browser.

…but what if we could ship new features more quickly to an even more controlled segment of the release population… and do it without restarting the user’s browser? Wouldn’t that be even better than updates?

Well, we might answers for that… but that’s a subject for another time.

:chutten

Advertisements

Faster Event Telemetry with “event” Pings

Screenshot_2018-07-04 New Query(1).pngEvent Telemetry is the means by which we can send ordered interaction data from Firefox users back to Mozilla where we can use it to make product decisions.

For example, we know from a histogram that the most popular way of opening the Developer Tools in Firefox Beta 62 is by the shortcut key (Ctrl+Shift+I). And it’s nice to see that the number of times the Javascript Debugger was opened was roughly 1/10th of the number of times the shortcut key was used.

…but are these connected? If so, how?

And the Javascript Profiler is opened only half as often as the Debugger. Why? Isn’t it easy to find that panel from the Debugger? Are users going there directly from the DOM view or is it easier to find from the Debugger?

To determine what parts of Firefox our users are having trouble finding or using, we often need to know the order things happen. That’s where Event Telemetry comes into play: we timestamp things that happen all over the browser so we can see what happens and in what order (and a little bit of how long it took to happen).

Event Telemetry isn’t new: it’s been around for about 2 years now. And for those two years it has been piggy-backing on the workhorse of the Firefox Telemetry system: the “main” ping.

The “main” ping carries a lot of information and is usually sent once per time you close your Firefox (or once per day, whichever is shorter). As such, Event Telemetry was constrained in how it was able to report this ordered data. It takes two whole days to get 95% of it (because that’s how long it takes us to get “main” pings), and it isn’t allowed to send more than one thousand events per process (lest it balloon the size of the “main” ping, causing problems).

This makes the data slow, and possibly incomplete.

With the landing of bug 1460595 in Firefox Nightly 63 last week, Event Telemetry now has its own ping: the “event” ping.

The “event” ping maintains the same 1000-events-per-process-per-ping limit as the “main” ping, but can send pings as frequently as one ping every ten minutes. Typically, though, it waits the full hour before sending as there isn’t any rush. A maximum delay of an hour still makes for low-latency data, and a minimum delay of ten minutes is unlikely to be overrun by event recordings which means we should get all of the events.

This means it takes less time to receive data that is more likely to be complete. This in turn means we can use less of it to get our answers. And it means more efficiency in our decision-making process, which is important when you’re competing against giants.

If you use Event Telemetry to answer your questions with data, now you can look forward to being able to do so faster and with less worry about losing data along the way.

And if you don’t use Event Telemetry to answer your questions, maybe now would be a good time to start.

The “event” ping landed in Firefox Nightly 63 (build id 20180627100027) and I hope to have it uplifted to Firefox Beta 62 in the coming days.

Thanks to :sunahsuh for her excellent work reviewing the proposal and in getting the data into the derived datasets so they can be easily queried, and further thanks to the Data Team for their support.

:chutten

Some More Very Satisfying Graphs

I guess I just really like graphs that step downwards:

Screenshot_2018-06-27 Telemetry Budget Forecasting

Earlier this week :mreid noticed that our Nightly population suddenly started sending us, on average, 150 fewer kilobytes (uncompressed) of data per ping. And they started doing this in the middle of the previous week.

Step 1 was to panic that we were missing information. However, no one had complained yet and we can usually count on things that break to break loudly, so we cautiously-optimistically put our panic away.

Step 2 was to see if the number of pings changed. It could be we were being flooded with twice as many pings at half the size, for the same volume. This was not the case:

Screenshot_2018-06-27 Telemetry Budget Forecasting(2)

Step 3 was to do some code archaeology to try and determine the “culprit” change that was checked into Firefox and resulted in us sending so much less data. We quickly hit upon the removal of BrowserUITelemetry and that was that.

…except… when I went to thank :Standard8 for removing BrowserUITelemetry and saving us and our users so much bandwidth, he was confused. To the best of his knowledge, BrowserUITelemetry was already not being sent. And then I remembered that, indeed, back in March :janerik had been responsible for stopping many things like BrowserUITelemetry from being sent (since they were unmaintained and unused).

So I fired up an analysis notebook and started poking to see if I could find out what parts of the payload had suddenly decreased in size. Eventually, I generated a plot that showed quite clearly that it was the keyedHistograms section that had decreased so radically.

Screenshot_2018-06-27 main_ping_size - Databricks

Around the same time :janerik found the culprit in the list of changes that went into the build: we are no longer sending a couple of incredibly-verbose keyed histograms because their information is now much more readily available in profiles.

The power of cleaning up old code: removing 150kb from the average “main” ping sent multiple times per day by each and every Firefox Nightly user.

Very satisfying.

:chutten

When, Exactly, does a Firefox User Update, Anyway?

There’s a brand new Firefox version coming. There always is, since we adopted the Rapid Release scheme way back in 2011 for Firefox 5. Every 6-8 weeks we instantly update every Firefox user in the wild with the latest security and performance…

Well, no. Not every Firefox user. We have a dashboard telling us only about 90% of Firefox users are presently up-to-date. And “up-to-date” here means any user within two versions of the latest:

Pie chart of "Up to date and out of date client distribution" showing 83.9% up to date, 8% out of date and potentially of concern, and 2.1% out of date and of concern.

Why two versions? Shouldn’t “Up to date” mean up-to-date?

Say it’s June 26, 2018, and Firefox 61 is ready to go. The first code new to this version landed March 13, and it has spent the past three months under intense development and scrutiny. During its time as Firefox Nightly 61.0a1 it had features added, performance and security improvements layered in, and some rearranging of home page settings. While it was known as Firefox Beta 61.0b{1-14} it has stabilized, with progressively fewer code changes accepted as we prepare it for our broadest population.

So we put Firefox 61.0 up on the server, ring the bell and yell “Come and get it~!”?

No.

Despite our best efforts, our Firefox Beta-running population is not as diverse as our release population. The release population has thousands of gpu+driver combinations 61 has never run on before. Some users will use it for kiosks and internet cafes in areas of the world we have no beta users. Other users will have combinations of addons and settings that are unique to them alone… and we’ll be shipping a fresh browsing experience to each and every one of them.

So maybe we shouldn’t send it to everyone at once in case it breaks something for those users whose configurations we haven’t had an opportunity to test.

As a result, our update servers will, upon being asked by your Firefox instance if there is an update available, determine whether or not to lie. Our release managers will chose to turn the release dial to, say, 10% to begin. So when your Firefox asks if there is an update, nine out of ten times we will lie and say “No, try again later.” And that random response is cached so that everyone else trying for the next one or two minutes will get the same response you did.

At 10% roughly one out of every ten 1-2min periods will tell the truth: “Yes, there is an update and you can find it here: <url>”. This adds a bit of a time-delay between “releasing” a new version and users having it.

Eventually, after a couple of days or maybe up to a week, we will turn the dial up to 100% and everyone will be able to receive the update and in a matter of hours the entire population will be up-to-date and…

No.

When does a Firefox instance ask for an update? We “only” release a new update every six-to-eight weeks, it would be wasteful to be asking all the time. When -should- we ask?

If you’ve ever listened to a programmer complain about time, you might have an inkling of the complexity involved in simply trying to figure out when to ask if there’s an update available.

The simplest two cases of Firefox instances asking for updates are: “When the user tells it to”, and “If the Firefox instance was released more than 63 days ago.”

For the first of these two cases, you can at any time open Help > About Firefox and it will check to see if your Firefox is up-to-date. There is also a button labeled “Check for Updates” in Preferences/Options.

For the second, we have a check during application startup that compares the date we built your Firefox to the date your computer thinks it is. If they differ by more than 63 days, we will ask for an update.

We can’t rely on users to know how to check for updates, and we don’t want our users to wait over two -more- months to benefit from all of our hard work.

Maybe we should check twice a day or so. That seems like a decent compromise. So that’s what we do for Firefox release users: we check every 12 hours or so. If the user isn’t running Firefox for more than 12 hours, then when they start it up again we check against the client’s clock to see if it’s been 12 hours since our last check.

Putting this all together:

Firefox must be running. It must have been at least 12 hours since the last time we checked for updates. If we are still throttling updates to, say, 10% we (or the client who asked previously within the past 1-2min) must be lucky to be told the truth that there is an update available. Firefox must be able to download the entire update (which can be interrupted if the user closes Firefox before the download is complete). Firefox must be able to apply the update. The user must restart Firefox to start running the new version.

And then, and only then, is that one Firefox user running the updated Firefox.

How does this look like for an entire population of Firefox users whose configurations and usage behaviours I already mentioned are the most diverse of all of our user populations?

That’ll have to wait for another time, as it sure isn’t a simpler story than this one. For now, you can enjoy looking at some graphs I used to explore a similar topic, earlier.

:chutten

Distributed Teams: On the non-Universality of “Not it!”

I’ve surprisingly not written a lot over here about working on a distributed team in a distributed organization. Mozilla is about 60% people who work in MoLos (office workers) and 40% people who don’t (remotees). My team is 50/50: I’m remote near Toronto, one works from his home in Italy, and the other two sit in the Berlin office most days.

If I expand to encompass one extra level of organization, I work with people in Toronto, San Francisco, Poland, Iowa, Illinois, more Berlin, Nova Scotia… well, you get the idea. For the past two and a half years I’ve been working with people from all over the world and I have been learning how that’s different from the rather-more-monocultural experience I had working in offices for the previous 8-10.

So today when I shouted “Not it!” into the IRC channel in response to the dawning realization that someone would have to investigate and take ownership of some test bustage, I followed it up within the minute with a cultural note:

09:35 <chutten> (Actually, that's a cultural thing that may need
 explanation. As kids, usually at summer sleep-away camp, if there
 is an undesirable thing that needs to be done by one person in
 the cabin the last person to say "Not it" is "It" and thus, has
 to do the undesirable thing.)

“Not it” is cultural. I think. I’ve been able to find surprisingly little about its origins in the usual places. It seems to share some heritage with the game of Tag. It seems to be an evolution of the game “Nose Goes,” but it’s hard to tell exactly where any of it started. Wikipedia can’t find an origin earlier than the 1979 Canadian film “Meatballs” where the nose game was already assumed to be a part of camp life.

Regardless of origin, I can’t assume it’s shared amongst my team. Thus my little note. Lucky for me, they seem to enjoy learning things like this. Even luckier, they enjoy sharing back. For instance, :gfritzsche once said his thumbs were pressed that we’d get something done by week’s end… what?

There were at least two things I didn’t understand about that: the first was what he meant, the second was how one pressed one’s thumbs. I mean, do you put them in your fist and squeeze, or do you press them on the outside of your fist and pretend you’re having a Thumb War (yet another cultural artefact)?

First, it means hoping for good luck. Second, it’s with thumbs inside your fist, not outside. I’m very lucky there’s a similar behaviour and expression that I’m already familiar with (“fingers crossed”). This will not always be the case, and it won’t ever be an even exchange…

All four of my team members speak the language I spoke at home while I was growing up. A lot of my culture is exported by the US via Hollywood, embedding it into the brains of the people with whom I work. I have a huge head-start on understanding and being understood, and I need to remain mindful of it.

Hopefully I’ll be able to catch some of my more egregious references before I need to explain camp songs, cringe-worthy 90s slang, or just how many hours I spent in a minivan with my siblings looking for the letter X on a license plate.

Or, then again, maybe those explanations are just part of being a distributed team?

:chutten

TIL: C++ character constants aren’t created equal

arrPtrTerm

When reviewing a patchset involving files, strings, timers, and multiple threads (essentially all the “tough to review” checkboxes right there), a comment from :froydnj caught my eye:

> +const char* kPersistenceFileName = "gv_measurements.json";
> +const char* kPersistenceFileNameNoExt = "gv_measurements";

Please make these `const char kPersistence...[] = "..."`, which is slightly smaller on all of our usual systems.

Why? Aren’t C arrays just pointers to the first element? Don’t we treat char[] and char* identically when logging or appending them to higher-order string objects?

Influenced by :Dexter and :jan-erik, I tried this out myself. I made two small C programs arr.c and ptr.c where the only difference was that one had a const char* and the other a const char[].

I compiled them both with gcc and then checked their sizes in bytes.arrPtrSizes

Sure enough, though ptr.c was smaller by one byte (* is one character and [] is two), after being compiled ptr was larger by a whole 8 bytes!

This is because C arrays aren’t identical to pointers to the first element. Instead they are identifiers that are (most of the time) implicitly converted to be pointers to the first element. For instance, if you’re calling a function void foo(char c[]), inside the function `c` is implicitly a char* and its size is that of a pointer, but outside of the function c could be an array with its size being the length of the array. As an example:

void foo(char c[]) {
  printf("sizeof(c): %d\n", sizeof c);
}
int main(void) {
  char arr[] = "a";
  printf("sizeof(arr): %d\n", sizeof arr);
  return 0;
}

Prints:

sizeof(arr): 2
sizeof(c): 8

Another way to think about this is that the char* form allocates two things: the actual string of four (“abc” plus “\0”) characters, and a pointer called ptr that contains the address of the place the ‘a’ character is stored.

This is in contrast to the char[] form which allocates just the four-character string and allows you to use the name arr to refer to the place the ‘a’ character is stored.

So, in conclusion, if you want to save yourself a pointer’s width on your constant strings, you may wish to declare them as char[] instead of char*.

:chutten

(( :glandium wrote a much deeper look at this problem if you’d like to see how much worse this can actually get thanks to relocations and symbol exporting ))

(( :bsmedberg wrote in to mention that you almost never want const char* for a constant anyway as the pointer isn’t const, only the data is. ))

Mozilla All-Hands Tips

25486648678_90fa78a27e_k
All Hands Austin, December 2017, Mitchell Baker presenting. (Photo used under CC BY-NC-SA 2.0)

Twice a year, Mozilla gathers employees, volunteers, and assorted hangers-on in a single place to have a week of planning, working, and socializing. Being as distributed an organization as we are, it’s a bit rare to get enough of us in a single place to generate the kind of cross-talk and beneficial synergistic happenstances that help us work smarter and move in more-or-less the same direction. These are our All Hands events.

They’re a Pretty Big Deal(tm).

So here you are, individual contributor or manager, staff or volunteer, veteran or first-timer. With all these Big Plans, what are we littler folk to do to not become overwhelmed?

I have some tips.

Before You Go

Set up a mail folder/label for relevant email: You’ll be getting some email with details about where you should be, what you should be doing, and when. Organizing these into one place is helpful for reference, so come up with a label (maybe “201807-sanfran” or “mozsf2” or “fogzilla” or something) and organize those emails as they come in.

Act on those emails immediately: If they contain instructions or an announcement that bookings or registration is now open… then do that thing right then. Do not file the email and forget. Do the thing while you are looking at that email. Only then should you file that email and get back to where you were in your brain. If you absolutely can’t just then (have to synchronize with family or what-have-you), put a calendar reminder in that repeats every weekday until you handle it.

Do not upgrade Nightly: You’re running Nightly, right? You’ll be travelling through a land of uncertain connectivity, and the last thing you want is to use it downloading a multi-MB Nightly update that might have accidentally disabled Captive Portal Detection. If it works, keep your Nightly build until you’re certain you have the bandwidth to download a new one. All else fails, keep it until you get back.

Make sure your laptop is in shape: My laptop is often neglected in favour of its Desktop comrade: updates may be pending, credentials may have expired, the source code checkouts might be weeks old, and there may have even been a new version of Mozilla Build released since the last time I tried to compile Firefox. With luck, while at an All Hands you won’t have to compile Gecko on a laptop in your hotel… but we make our own luck, we who are prepared. Prepare your laptop.

Prepare your family: If you don’t live alone, you’ll have non-mozilla prepwork to do. Spouse and kids or roomates and pets, there are lifeforms who normally expect to see you that won’t. Clear the family schedule for the week you’re gone, and do as much preparation ahead of time as you can. Laundry, meal planning, groceries, sitters, dog walkers, even lawn services are things you can arrange to lighten the load that your absence will place on those around you. Even if you’re bringing them with you.

While You’re There

Do not fear missing out: You will not be able to attend both Boardgame Night and your team dinner. There will be karaoke parties you won’t get to, or be invited to. This is fine. This is expected. This is unavoidable when you have so many people disorganizing so many things simultaneously. So don’t fret about it. Prioritize.

Say no: Speaking of prioritizing: prioritize for yourself. You may very well be operating as a Level 100 You for hours at a time. So many people to talk to, so many talks and social events to organize, deliver, and attend… No. You don’t have to stay the entire length of the party. You don’t even have to go. If you feel yourself fading, get out while you have the strength. Regroup. Find a quiet corner or go to sleep early… At my first All Hands, I napped on both Wednesday and Thursday. And I wasn’t even in a different timezone. It really helped.

Wash your hands: Lots. Before meals. After meals. You’ll be talking, working, eating, and otherwise hanging out with a thousand of your closest coworkers. It’s probably your best bet for not catching mozflu, and it’s definitely your best bet to not transmit it.

After You’re Back

Consider taking a day: Generally speaking you’ll be flying back on Saturday and returning to work on Monday. Depending on distance to travel, available flight times, and cancellations, this may result in only a few hours between stumbling through your door and stumbling back to work. Consider booking that Monday off (or, honestly, if your trip back was heinous, don’t even book it off. Just take it. Get some sleep. Work can wait until Tuesday.)

Check in: If you live with family, you haven’t seen them for a week. Even if you brought them with you, you’ve been in meetings and talks and stuff most hours. Check in with them. Get up to speed on what’s been happening in their lives while you’ve been away.

Get excited for the next one: Even immediately back from an All Hands, it’s still only six months to the next one. Take stock of what you liked and what you didn’t like about this one. Rest up, and try not to get impatient :)

:chutten

(( Great minds think alike, because Seburo recently wrote a Wiki article covering even more excellent tips for All Hands events. Check that out, too! ))