My StarCon 2019 Talk: Collecting Data Responsibly and at Scale

 

Back in January I was privileged to speak at StarCon 2019 at the University of Waterloo about responsible data collection. It was a bitterly-cold weekend with beautiful sun dogs ringing the morning sun. I spent it inside talking about good ways to collect data and how Mozilla serves as a concrete example. It’s 15 minutes short and aimed at a general audience. I hope you like it.

I encourage you to also sample some of the other talks. Two I remember fondly are Aaron Levin’s “Conjure ye File System, transmorgifier” about video games that look like file systems and Cory Dominguez’s lovely analysis of Moby Dick editions in “or, the whale”. Since I missed a whole day, I now get to look forward to discovering new favourites from the full list.

:chutten

Firefox Origin Telemetry: Putting Prio in Practice

Prio is neat. It allows us to learn counts of things that happen across the Firefox population without ever being able to learn which Firefox sent us which pieces of information.

For example, Content Blocking will soon be using this to count how often different trackers are blocked and exempted from blocking so we can more quickly roll out Enhanced Tracking Protection to our users to protect them from companies that want to track their activities across the Web.

To get from “Prio is neat” to “Content Blocking is using it” required a lot of effort and the design and implementation of a system I called Firefox Origin Telemetry.

Prio on its own has some very rough edges. It can only operate on a list of at most 2046 yes or no questions (a bit vector). It needs to know cryptographic keys from the servers that will be doing the sums and decryption. It needs to know what a “Batch ID” is. And it needs something to reliably and reasonably-frequently send the data once it has been encoded.

So how can we turn “tracker fb.com was blocked” into a bit in a bit vector into an encoded Prio buffer into a network payload…

Firefox Origin Telemetry has two lists: a list of “origins” and a list of “metrics”. The list of origins is a list of where things happen. Did you block fb.com or google.com? Each of those trackers is an “origin”. The list of metrics is a list of what happened. Did you block fb.com or did you have to exempt it from blocking because otherwise the site broke? Both “blocked” and “exempt” are “metrics”.

In this way Content Blocking can, whenever fb.com is blocked, call

Telemetry::RecordOrigin(OriginMetricID::ContentBlocking_Blocked, "fb.com");

And Firefox Origin Telemetry will take it from there.

Step 0 is in-memory storage. Firefox Origin Telemetry stores tables mapping from encoding id (ContentBlocking_Blocked) to tables of origins mapped to counts (“fb.com”: 1). If there’s any data in Firefox Origin Telemetry, you can view it in about:telemetry and it might look something like this:
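
The shape of that storage is simple enough to sketch in Python (the real storage lives in C++ inside Firefox; this is purely illustrative):

```python
from collections import defaultdict

# Illustrative sketch only: a table of tables,
# mapping encoding id -> (origin -> count).
storage = defaultdict(lambda: defaultdict(int))

def record_origin(metric_id, origin):
    """Roughly what Telemetry::RecordOrigin does: bump a per-origin count."""
    storage[metric_id][origin] += 1

record_origin("ContentBlocking_Blocked", "fb.com")
```

After that call, `storage["ContentBlocking_Blocked"]["fb.com"]` is 1, which is exactly the `“fb.com”: 1` entry described above.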

[Screenshot: Origin Telemetry data shown in about:telemetry]

Step 1 is App Encoding: turning “ContentBlocking_Blocked: {fb.com: 1}” into “bit twelve on shard 2 should be set to 1 for encoding ‘content-blocking-blocked’”.

The full list of origins is too long to talk to Prio. So Firefox Origin Telemetry splits the list into 2046-element “shards”. The order of the origins list and the split locations for the shards must be stable and known ahead of time. When we change it in the future (either because Prio starts accepting larger or smaller buffers, or because the list of origins changes) we will have to change the name of the encoding from ‘content-blocking-blocked’ to, maybe, ‘content-blocking-blocked-v2’.
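
The sharding rule itself is just integer division over a stable, ordered list. A sketch (the origins list here is made up for illustration):

```python
SHARD_SIZE = 2046  # Prio's limit on yes/no questions per buffer

# A made-up stand-in for the real, stable, ordered origins list.
origins = ["origin%d.example" % i for i in range(5000)]

def origin_position(origin):
    """Map an origin to (shard index, bit index within that shard)."""
    i = origins.index(origin)
    return i // SHARD_SIZE, i % SHARD_SIZE
```

Because both the order and the split points are fixed, the server can run the same computation in reverse to recover which origin each bit belongs to.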

Step 2 is Prio Encoding: Firefox Origin Telemetry generates batch IDs of the encoding name suffixed with the shard number: for our example the batch ID is “content-blocking-blocked-2”. The server keys are communicated by Firefox Preferences (you can see them in about:config). With those pieces and the bit vector shards themselves, Prio has everything it needs to generate opaque binary blobs about 50 kilobytes in size.

Yeah, about 2 kilobits (one 2046-bit shard) of data in a 50 kilobyte package. Not a small increase.
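
The batch-ID construction is simple enough to sketch (a hypothetical helper, not Firefox’s actual code):

```python
def batch_id(encoding_name, shard_index):
    """Batch ID = the encoding name suffixed with the shard number."""
    return "%s-%d" % (encoding_name, shard_index)

print(batch_id("content-blocking-blocked", 2))  # content-blocking-blocked-2
```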

Step 3 is Base64 Encoding, where we turn those 50kb binary blobs into 67kb strings made up of the letters a–z and A–Z, the numbers 0–9, and the symbols “+” and “/”. This is so we can send them in a normal Telemetry ping.
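
That 50kb-to-67kb growth is just Base64’s fixed 4/3 expansion, which is easy to demonstrate (the random blob here is only a stand-in for a real Prio buffer):

```python
import base64
import os

blob = os.urandom(50 * 1024)  # stand-in for a ~50 kB Prio blob
encoded = base64.b64encode(blob)

# Base64 emits 4 output characters for every 3 input bytes: a 4/3 expansion.
print(len(blob), len(encoded))  # 51200 68268
```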

Step 4 is the “prio” ping. Once a day, or when Firefox shuts down, if there’s data to be sent we send a ping containing these pairs of batch IDs and base64-encoded strings plus a minimal amount of environmental data (Firefox version, current date, etc.). In the event that sending fails, we need to retry (TelemetrySend). After sending, the ping should be available to be inspected for a period of time (TelemetryArchive).

…basically, this is where Telemetry does what Telemetry does best.

And then the ping becomes the problem of the servers who need to count and verify and sum and decode and… stuff. I dunno, I’m a Firefox Telemetry Engineer, not a Data Engineer. :amiyaguchi’s doing that part, not me : )

I’ve smoothed over some details here, but I hope I’ve given you an idea of what value Firefox Origin Telemetry brings to Firefox’s data collection systems. It makes Prio usable for callers like Content Blocking and establishes systems for managing the keys and batch IDs necessary for decoding on the server side (Prio will generate int vector shards for us, but how will we know which position of which shard maps back to which origin and which metric?).

Firefox Origin Telemetry is shipping in Firefox 68 and is currently only enabled for Firefox Nightly and Beta. Content Blocking is targeting Firefox 69 to start using Origin Telemetry to measure tracker blocking and exempting for 0.014% of pageloads of 1% of clients.

:chutten

Distributed Teams: A Test Failing Because It’s Run West of Newfoundland and Labrador

(( Not quite “500-mile email” levels of nonsense, but might be the closest I get. ))

A test was failing.

Not really unusual, that. Tests fail all the time. It’s how we know they’re good tests: protecting us developers from ourselves.

But this one was failing unusually. Y’see, it was failing on my machine.

(Yes, har har, it is a common-enough occurrence given my obvious lack of quality as a developer how did you guess.)

The unusual part was that it was failing only for me… and I hadn’t even touched anything yet. It wasn’t failing on our test infrastructure “try”, and it wasn’t failing on the machine of :raphael, the fellow responsible for the integration test harness itself. We were building Firefox the same way, running telemetry-tests-client the same way… but I was getting different results.

I fairly quickly narrowed the problem down to an extra “main” ping with reason “environment-change” being sent during the initial phases of the test. By dumping some logging into Firefox, rebuilding it, and then routing its logs to the console with --gecko-log "-", I learned that we were sending a ping because a monitored user preference had changed: browser.search.region.

When Firefox starts up the first time, it doesn’t know where it is. And it needs to know where it is to properly make a first guess at what language you want and what search engines would work best. Google’s results are pretty bad in Canada unless you use “google.ca”, after all.

But while Firefox doesn’t know where it is, it does know what timezone it’s in from the settings in your OS’s clock. On top of that it knows what language your OS is set to. So we make a first guess at which search region we’re in based on whether or not the timezone overlaps a US timezone and whether your OS’s locale is `en-US` (United States English).

What this fails to take into account is that United States English is the “default” locale reported by many OSes even if you aren’t in the US. And even if you are in a timezone that overlaps with the US, you might not actually be there.

So to account for that, Mozilla operates a location service to double-check that the search region is appropriately set. This takes a little time to get back with the correct information, if it gets back to us at all. So if you happen to be in a US-overlapping timezone with an English-language OS Firefox assumes you’re in the US. Then if the location service request gets back with something that isn’t “US”, browser.search.region has to be updated.
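
The first-guess-then-correct flow can be sketched roughly like this (a simplification of Firefox’s actual logic; the function names are mine):

```python
def guess_search_region(tz_overlaps_us, os_locale):
    """First guess at browser.search.region before the location service
    answers: US timezone + en-US locale means 'probably the US'."""
    if tz_overlaps_us and os_locale == "en-US":
        return "US"
    return None  # no guess; wait for the location service

def apply_location_service(current_guess, service_region):
    """If the service disagrees with the guess, the pref gets updated
    (which is what changed the Telemetry Environment in my test)."""
    return service_region if service_region is not None else current_guess
```

In my case: US-overlapping timezone plus a “Default” en-US locale yields a first guess of “US”, and then the location service comes back with “CA” and flips the pref.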

And when it updates, it changes the Telemetry Environment.

And when the Environment changes, we send a “main” ping.

And when we send a “main” ping, the test breaks.

…all because my timezone overlaps the US and my language is “Default” English.

I feel a bit persecuted, but this shows another strength of Distributed Teams. No one else on my team would be able to find this failure. They’re in Germany, Italy, and the US. None of them have that combination of “Not in the US, but in a US timezone” needed to manifest the bug.

So on one hand this sucks. I’m going to have to find a way around this.

But on the other hand… I feel like my Canadianness is a bit of a bug-finding superpower. I’m no Brok Windsor or Captain Canuck, but I can get the job done in a way no one else on my team can.

Not too bad, eh?

:chutten

Blast from the Past: I filed a bug against Firefox 3.6.6

[Screenshot: the old Bugzilla duplicate-finder UI with the text inside table cells not rendering at all]

On June 30, 2010 I was:

  • Sleepy. My daughter had just been born a few months prior and was spending her time pooping, crying, and not sleeping (as babies do).
  • Busy. I was working at Research in Motion (it would be three years before it would be renamed BlackBerry) on the BlackBerry Browser for BlackBerry 6.0. It was a big shift for us since that release was the first one using WebKit instead of the in-house “mango” rendering engine written in BlackBerry-mobile-dialect Java.
  • Keen. Apparently I was filing a bug against Firefox 3.6.6?!

Yeah. I had completely forgotten about this. Apparently while reading my RSS feeds in Google Reader (that doesn’t make me old, does it?) taking in news from Dragonmount about the Wheel of Time (so I guess I’ve always been a nerd, then) the text would sometimes just fail to render. I even caught it happening on the old Bugzilla “possible duplicate finder” UI (see above).

The only reason I was reminded this exists was because I received bugmail on my personal email address when someone accidentally added and removed themselves from the Cc list.

Pretty sure this bug, being no longer reproducible, still in UNCONFIRMED state, and filed against a pre-rapid-release version of Firefox, is something I should close. Yeah, I’ll just go and do that.

:chutten

 

Going from New Laptop to Productive Mozillian

[Photo: my old laptop, covered in stickers]

My old laptop had so many great stickers on it I didn’t want to say goodbye. So I put off my hardware refresh cycle from the recommended 2 years to almost 3.

To speak the truth it wasn’t only the stickers that made me wary of switching. I had a workflow that worked. The system wasn’t slow. It was only three years old.

But then Windows started crashing on me during video calls. And my Firefox build times became long enough that I ported changes to my Linux desktop before building them. It was time to move on.

Of course this opened up a can of worms. Questions, in order that they presented themselves, included:

Should I move to Mac, or stick with Windows? My lingering dislike for Apple products and complete unfamiliarity with OSX made that choice easy.

Of the Windows laptops, which should I go for? Microsoft’s Surface lineup keeps improving. I had no complaints from my previous Lenovo X1 Carbon. And the Dell XPS 15 and 13 were enjoyed by several of my coworkers.

The Dells I nixed because I didn’t want anything bigger than the X1 I was retiring, and because the webcam is positioned at knuckle height. I was wary of the Surface Books given the number that mhoye had put in the ground due to manufacturing defects. Yes, I know he has an outsized effect on hardware and software. It really only served to highlight how much importance I put on familiarity and habit.

X1 Carbon 6th Generation it is, then.

So I initiated the purchase order. It would be sent to Mozilla Toronto, the location charged with providing my IT support, where it would be configured and given an asset number. Then it would be sent to me. And only then would the work begin in setting it up so that I could actually get work done on it.

First, not being a fan of sending keypresses over the network, I disabled Bing search from the Start Menu by setting the following registry keys:

HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Search
BingSearchEnabled dword:00000000
AllowSearchToUseLocation dword:00000000
CortanaConsent dword:00000000

Then I fixed some odd defaults in Lenovo’s hardware. Middle-click should middle-click, not enter into a scroll. Fn should need to be pressed to perform special functions on the F keys (it’s like FnLock was default-enabled).

I installed all editions of Firefox. Firefox Beta installed over the release-channel that came pre-installed. Firefox Developer Edition and Nightly came next and added their own icons. I had to edit the shortcuts for each of these individually on the Desktop and in the Quick Launch bar to have -P --no-remote arguments so I wouldn’t accidentally start the wrong edition with the wrong profile and lose all of my data. (This should soon be addressed)

In Firefox Beta I logged in to sync to my work Firefox Account. This brought me 60% of the way to being useful right there. So much of my work is done in the browser, and so much of my browsing experience can be brought to life by logging in to Firefox Sync.

The other 40% took the most effort and the most time. This is because I want to be able to compile Firefox on Windows, for my sins, and this isn’t the most pleasant of experiences. Luckily we have “Building Firefox for Windows” instructions on MDN. Unluckily, I want to use git instead of mercurial for version control.

  1. Install mozilla-build
  2. Install Microsoft Visual Studio Community Edition (needed for Win10 SDKs)
  3. Copy over my .vimrc, .bashrc, .gitconfig, and my ssh keys into the mozilla-build shell environment
  4. Add exclusions to Windows Defender for my entire development directory in an effort to speed up Windows’ notoriously-slow filesystem speeds
  5. Install Git for Windows
  6. Clone and configure git-cinnabar for working with Mozilla’s mercurial repositories
  7. Clone mozilla-unified
    • This takes hours to complete. The download is pretty quick, but turning all of the mercurial changesets into git commits requires a lot of filesystem operations.
  8. Download git-prompt.sh so I can see the current branch in my mozilla-build prompt
  9. ./mach bootstrap
    • This takes dozens of minutes and can’t be left alone as it has questions that need answers at various points in the process.
  10. ./mach build
    • This originally failed because when I checked out mozilla-unified in Step 7 my git used the wrong line-endings. (core.eol should be set to lf and core.autocrlf to false)
    • Then it failed because ./mach bootstrap downloaded the wrong rust std library. I managed to find rustup in ~/.cargo/bin which allowed me to follow the build system’s error message and fix things
  11. Just under 50min later I have a Firefox build

And that’s not all. I haven’t installed the necessary tools for uploading patches to Mozilla’s Phabricator instance so they can undergo code review. I haven’t installed Chrome so I can check if things are broken for everyone or just for Firefox. I haven’t cloned and configured the frankly-daunting number of github repositories in use by my team and the wider org.

Only with all this done can I be a productive mozillian. It takes hours, and knowledge gained over my nearly-3 years of employment here.

Could it be automated? Technologically, almost certainly yes. The latest mozilla-build can be fetched from a central location. mozilla-unified can be cloned using the version control setup of choice. The correct version of Visual Studio Community can be installed (but maybe not usably given its reliance on Microsoft Accounts). We might be able to get all the way to a working Firefox build from a recent checkout of the source tree before the laptop leaves IT’s hands.

It might not be worth it. How many mozillians even need a working Firefox build, anyway? And how often are they requesting new hardware?

Ignoring the requirement to build Firefox, then, why was the laptop furnished with a release-channel version of Firefox? Shouldn’t it at least have been Beta?

And could this process of setup be better documented? The parts common to multiple teams appear well documented to begin with. The “Building Firefox on Windows” documentation on MDN is exceedingly clear to work with despite the frightening complexity of its underpinnings. And my team has onboarding docs focused on getting new employees connected and confident.

Ultimately I believe this is probably as simple and as efficient as this process will get. Maybe it’s a good thing that I only undertook this after three years. That seems like a nice length of time to amortize the hours of cost it took to get back to productive.

Oh, and as for the stickers… well, Mozilla has a program for buying your own old laptop. I splurged and am using it to replace my 2009 Aspire Revo to connect to my TV and provide living room computing. It is working out just swell.

:chutten

The End of Firefox Windows XP Support

Firefox 62 has been released. Go give it a try!

At the same time, on the Extended Support Release channel, we released Firefox ESR 60.2 and stopped supporting Firefox ESR 52: the final version of Firefox with Windows XP support.

Now, we don’t publish all-channel user proportions grouped by operating system, but as part of the Firefox Public Data Report we do have data from the release channel back before we switched our XP users to the ESR channel. At the end of February 2016, XP users made up 12% of release Firefox. By the end of February 2017, XP users made up 8% of release Firefox.

If this trend continued without much change after we switched XP users to ESR, XP Firefox users would presently amount to about 2% of release users.
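
The arithmetic behind that 2% figure, assuming the decline stayed linear:

```python
# Back-of-the-envelope linear extrapolation from the two data points above.
feb_2016 = 12.0  # % of release users on XP, end of Feb 2016
feb_2017 = 8.0   # % of release users on XP, end of Feb 2017
slope = feb_2017 - feb_2016       # -4 percentage points per year
years_since_feb_2017 = 1.6        # roughly to September 2018
estimate = feb_2017 + slope * years_since_feb_2017
print(round(estimate, 1))  # 1.6, i.e. about 2%
```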

That’s millions of users we kept safe on the Internet despite running a nearly-17-year-old operating system whose last patch was over 4 years ago. That’s a year and a half of extra support for users who probably don’t feel they have much ability to protect themselves online.

It required effort, and it required devoting resources to supporting XP well after Microsoft stopped doing so. It meant we couldn’t do other things, since we were busy with XP.

I think we did a good thing for these users. I think we did the right thing for these users. And now we’re wishing these users the very best of luck.

…and that they please oh please upgrade so we can go on protecting them into the future.

:chutten

 

Data Science is Hard: Counting Users

[Screenshot: User Activity, from the Firefox Public Data Report]

Counting is harder than you think. No, really!

Intuitively, as you look around you, you think this can’t be true. If you see a parking lot you can count the cars, right?

But do cars that have left the parking lot count? What about cars driving through it without stopping? What about cars driving through looking for a space? (And can you tell the difference between those two kinds from a distance?)

These cars all count if you’re interested in usage. It’s all well and good to know the number of cars using your parking lot right now… but is it lower on weekends? Holidays? Are you measuring on a rainy day when fewer people take bicycles, or in the Summer when more people are on vacation? Do you need better signs or more amenities to get more drivers to stop? Are you going to have expand capacity this year, or next?

Yesterday we released the Firefox Public Data Report. Go take a look! It is the culmination of months of work of many mozillians (not me, I only contributed some early bug reports). In it you can find out how many users Firefox has, the most popular addons, and how quickly Firefox users update to the latest version. And you can choose whether to look at how these plots look for the worldwide user base or for one of the top ten (by number of Firefox users) countries individually.

It’s really cool.

The first two plots are a little strange, though. They count the number of Firefox users over time… and they don’t agree. They don’t even come close!

For the week including August 17, 2018 the Yearly Active User (YAU) count is 861,884,770 (or about 862M)… but the Monthly Active User (MAU) count is 256,092,920 (or about 256M)!

That’s over 600M difference! Which one is right?

Well, they both are.

Returning to our parking lot analogy, MAU is about counting how many cars use the parking lot over a 28-day period. So, starting Feb 1, count cars. If someone you saw earlier returns the next day or after a week, don’t count them again: we only want unique cars. Then, at the end of the 28-day period, that was the MAU for Feb 28. The MAU for Mar 1 (on non-leap-years) is the same thing, but you start counting on Feb 2.

Similarly for YAU, but you count over the past 365 days.
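
Both counts are the same computation with different window lengths. Here’s a sketch in Python, using made-up parking-lot visit data:

```python
from datetime import date, timedelta

def active_users(daily_visits, end_day, window_days):
    """Count unique users seen in the window of `window_days` ending on
    `end_day`. daily_visits maps a date to the set of user ids seen."""
    start = end_day - timedelta(days=window_days - 1)
    seen = set()
    for day, users in daily_visits.items():
        if start <= day <= end_day:
            seen |= users
    return len(seen)

# Made-up visit data: two cars this month, one visitor back in January.
visits = {
    date(2018, 8, 1): {"car_a", "car_b"},
    date(2018, 1, 15): {"car_c"},
}
mau = active_users(visits, date(2018, 8, 17), 28)   # 2
yau = active_users(visits, date(2018, 8, 17), 365)  # 3
```

The January visitor shows exactly why YAU is always at least as large as MAU: the longer window catches every unique car the shorter one does, plus everyone who only came by earlier in the year.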

It stands to reason that you’ll see more unique cars over the year than you will over the month: you’ll see visitors, tourists, people using the lot just once, and people who have changed jobs and haven’t been back in four months.

So how many of these 600M who are in the YAU but not in the MAU are gone forever? How many are coming back? We don’t know.

Well, we don’t know _precisely_.

We’ve been at the browser game for long enough to see patterns in the data. We’re in the Summer slump for MAU numbers, and we have a model for how much higher the numbers are likely to be come October. We have surveyed people of varied backgrounds and have some ideas of why people change browsers to or away from Firefox.

We have the no-longer users, the lapsed users, the lost-and-regained users, the tried-us-once users, the non-human users, … we have categories and rough proportions on what we think we know about our population, and how that influences how we can better make the internet better for them.

Ultimately, to me, it doesn’t matter too much. I work on Firefox, a product that hundreds of millions of people use. How many hundreds of millions doesn’t matter: we’re above the threshold that makes me feel like I’m making the world better.

(( Well… I say that, but it is actually my job to understand the mechanisms behind these numbers and why they can’t be exact, so I do have a bit of a vested interest. And there are a myriad of technological and behavioural considerations to account for in code and in documentation and in analysis which makes it an interesting job. But, you know. Hundreds of millions is precise enough for my job satisfaction index. ))

But once again we reach the inescapable return to the central thesis. Counting is harder than you think: one of the leading candidates for the Data Team’s motto. (Others include “Well, it depends.” and “¯\_(ツ)_/¯”). And now we’re counting in the open, so you get to experience its difficulty firsthand. Go have another look.

:chutten