This Week in Glean: Data Reviews are Important, Glean Parser makes them Easy

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.) All “This Week in Glean” blog posts are listed in the TWiG index.

At Mozilla we put a lot of stock in Openness. Source? Open. Bug tracker? Open. Discussion Forums (Fora?)? Open (synchronous and asynchronous).

We also have an open process for determining if a new or expanded data collection in a Mozilla project is in line with our Privacy Principles and Policies: Data Review.

Basically, when a new piece of instrumentation is put up for code review (or before, or after), the instrumentor fills out a form and asks a volunteer Data Steward to review it. If the instrumentation (as explained in the filled-in form) is obviously in line with our privacy commitments to our users, the Data Steward gives it the go-ahead to ship.

(If it isn’t _obviously_ okay then we kick it up to our Trust Team to make the decision. They sit next to Legal, in case you need to find them.)

The Data Review Process and its forms are very generic. They’re designed to work for any instrumentation (tab count, bytes transferred, theme colour) being added to any project (Firefox Desktop, mozilla.org, Focus) and being collected by any data collection system (Firefox Telemetry, Crash Reporter, Glean). This is great for the process as it means we can use it and rely on it anywhere.

It isn’t so great for users _of_ the process. If you only ever write Data Reviews for one system, you’ll find yourself answering the same questions with the same answers every time.

And Glean makes this worse (better?) by including in its metrics definitions almost every piece of information you need in order to answer the review. So now you get to write the answers first in YAML and then in English during Data Review.

But no more! Introducing glean_parser data-review and mach data-review: command-line tools that will generate for you a Data Review Request skeleton with all the easy parts filled in. It works like this:

  1. Write your instrumentation, providing full information in the metrics definition.
  2. Call python -m glean_parser data-review <bug_number> <list of metrics.yaml files> (or mach data-review <bug_number> if you’re adding the instrumentation to Firefox Desktop).
  3. glean_parser will parse the metrics definitions files, pull out only the definitions that were added or changed in <bug_number>, and then output a partially-filled-out form for you.

Here’s an example. Say I’m working on bug 1664461 and add a new piece of instrumentation to Firefox Desktop:

fog.ipc:
  replay_failures:
    type: counter
    description: |
      The number of times the ipc buffer failed to be replayed in the
      parent process.
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_sensitivity:
      - technical
    notification_emails:
      - chutten@mozilla.com
      - glean-team@mozilla.com
    expires: never

I’m sure to fill in the `bugs` field correctly (because that’s important on its own _and_ it’s what glean_parser data-review uses to find which data I added), and have categorized the data_sensitivity. I also included a helpful description. (The data_reviews field currently points at the bug I’ll attach the Data Review Request for. I’d better remember to come back before I land this code and update it to point at the specific comment…)

Then I can simply use mach data-review 1664461 and it spits out:

!! Reminder: it is your responsibility to complete and check the correctness of
!! this automatically-generated request skeleton before requesting Data
!! Collection Review. See https://wiki.mozilla.org/Data_Collection for details.

DATA REVIEW REQUEST
1. What questions will you answer with this data?

TODO: Fill this in.

2. Why does Mozilla need to answer these questions? Are there benefits for users?
   Do we need this information to address product or business requirements?

TODO: Fill this in.

3. What alternative methods did you consider to answer these questions?
   Why were they not sufficient?

TODO: Fill this in.

4. Can current instrumentation answer these questions?

TODO: Fill this in.

5. List all proposed measurements and indicate the category of data collection for each
   measurement, using the Firefox data collection categories found on the Mozilla wiki.

Measurement Name | Measurement Description | Data Collection Category | Tracking Bug
---------------- | ----------------------- | ------------------------ | ------------
fog_ipc.replay_failures | The number of times the ipc buffer failed to be replayed in the parent process.  | technical | https://bugzilla.mozilla.org/show_bug.cgi?id=1664461


6. Please provide a link to the documentation for this data collection which
   describes the ultimate data set in a public, complete, and accurate way.

This collection is Glean so is documented
[in the Glean Dictionary](https://dictionary.telemetry.mozilla.org).

7. How long will this data be collected?

This collection will be collected permanently.
**TODO: identify at least one individual here** will be responsible for the permanent collections.

8. What populations will you measure?

All channels, countries, and locales. No filters.

9. If this data collection is default on, what is the opt-out mechanism for users?

These collections are Glean. The opt-out can be found in the product's preferences.

10. Please provide a general description of how you will analyze this data.

TODO: Fill this in.

11. Where do you intend to share the results of your analysis?

TODO: Fill this in.

12. Is there a third-party tool (i.e. not Telemetry) that you
    are proposing to use for this data collection?

No.

As you can see, this Data Review Request skeleton comes partially filled out. Everything you previously had to mechanically fill out has been done for you, leaving you more time to focus on only the interesting questions like “Why do we need this?” and “How are you going to use it?”.

Also, this saves you from having to remember the URL to the Data Review Request Form Template each time you need it. We’ve got you covered.

And since this is part of Glean, this means this is already available to every project you can see here. This isn’t just a Firefox Desktop thing. 

Hope this saves you some time! If you can think of other time-saving improvements we could add to Glean once so that every Mozilla project can take advantage of them, please tell us on Matrix.

If you’re interested in how this is implemented, glean_parser’s part of this is over here, while the mach command part is here.

:chutten

I Assembled A Home Audio Thingy

Or, how to use Volumio, an old Raspberry Pi B+ (from 2014!), and an even older Denon stereo receiver+amplifier to pipe my wife’s MP3 collection to wired speakers in my house.

We like ourselves some music in our house. We’re not Hi Fi snobs. We don’t follow bands, really. We just like to have tunes around to help make chores a little less dreary <small>and to fill the gaping void we all hide inside ourselves</small>. Up until getting this house of ours a half decade ago we accomplished this by turning our computer speakers or CD player up to Rather Loud and trying not to spend too much time too close to it.

This “new” house came with a set of speakers in the kitchen and a nest of speaker wires connecting various corners of the main floor to a central location via the drop ceiling in the basement. With a couple of shelf speakers I ripped the proprietary connectors off of, plus two more speakers and a receiver donated by a far more Hi-Fi-snobbish friend of ours (though not really a snob. But he does rather care about surround sound, and waxes poetic about Master and Commander and House of Flying Daggers for their sound fields), I had six speakers in four rooms.

But I had nothing to play on it. No audio source.

For fun I hooked up the PS4 via toslink/spdif/that optical thingy so I could play Uncharted in surround… but it seems Sony’s dream of the PlayStation being the command center of your home entertainment centre never really got off the ground as it can’t even play one of our (many) audio CDs.

(For the youngins: An audio CD is like a Spotify Playlist that is at most an hour long, but doesn’t require an Internet connection to play).

The PS3 was closer to that vision and had the hardware to play CDs, so it got unmothballed and used as a CD Player? Disc Deck? An audio source that did nothing but play audio CDs. The receiver had a 5CH Stereo setting so we had left+right channels in the rooms that had multiple speakers (and the two that only had single speakers I threw on L because Mono)…

Suffice to say we had an “okay” setup, given I spent a grand total of zero dollabux on it.

But my wife and I? We have MP3 collections that far outstrip our CD collections.

(For the youngins: An MP3 is like a stream of audio that you don’t need the Internet to play.)

(I’m ignoring the cassette tape collection, which plays only in the basement on the Hi Fi Enthusiast Hardware of the Late Eighties that the previous owners of the house didn’t deign to take with them. It’s delightful.) How was I going to hook those MP3s up so they could play through the house as easily as the Audio CDs?

For a while I tried to get it to work via the Home Theatre PC.

(For the youngins: A Home Theatre PC is a computer which you connect to a TV so you can do computer things on your TV. Like a Smart TV in two pieces, both of which I control. Or like a laptop, but with a much larger screen that has a remote control.)

Unfortunately the HTPC’s dock was acting up when it came to audio, and even the headphone jack was giving me grief. Plus, the HTPC’s media software stack was based on Kodi which, though lovely and remotely controllable over the local network via both its web interface Chorus2 and its official app Kore, is far more interested in video than audio. (For example: playlists don’t exist in Kore, and can’t really be managed in Chorus2.)

But I learned a lot about what I wanted from such a system in trying to squish it into the HTPC which already had a job to do, so I decided to try making the audio player its own thing. Do one thing and do it well, jacks of all trades are masters of none. That sort of thing.

That’s when I remembered I had an old Raspberry Pi B+ in my closet. 700MHz CPU. 512MB RAM. Not the fastest machine in the park… but all it had to do was supply an interface in front of a largish (8k tracks) MP3 collection.

I found this project called Volumio which aimed to catalogue and provide a good, network-aware frontend on an audio collection (and do other stuff). It even had a plugin for playing Audio CDs so I could finally return the PS3 to game playing duty in the basement with the other previous generations of video gaming hardware.

It was a bit fiddly, though. Here’s the process:

  1. Install stock Volumio onto a microSD card which you then insert into the Raspberry Pi
    • This was very straightforward except for when I learned that the microsd card I wanted to use actually had bad-sector-ed itself to unusability. Luckily I had a spare.
  2. Adjust Volumio’s settings
    • Be sure to change playback to “Single” from “Continuous” or when you press play on a single track in a long list it’ll add every track in that list to the queue… which, on the B+’s anemic processor, takes a goodly while.
  3. Install the NanoSound CD Plugin
    • This is where it gets tricky. You could “just” pay for a subscription to Volumio and get first-party audio CD support including upsampling and other Hi Fi things. I’m using the B+’s headphone jack for output so Hi Fi is clearly none of my concern. And I’m too frugal for my own good, so I’m gonna do this myself.
    • Don’t install the plugin from the repository because it won’t work. Install the dependencies as described, then use the install script method. This will take a while as it compiles from source, and my B+ is not fast.
    • I’d like the CD to autoplay when inserted. There are instructions on the support page for how to script this: don’t use them. They have fancy quotation marks and emdashes which confuse both bash and curl when you try. Use instead the instructions on the source comment but don’t reset the volume.
  4. Install the Volumio App on your phone for remote control.
    • The “App” appears to be a webview that just loads http://volumio.local/ — for whatever reason my phone won’t resolve that host properly so I can’t just use the browsers I have already installed to access the UI.
  5. Move all the MP3s to a computer that is always on
    • You could use a USB drive attached to the Pi if you wanna, but I had space leftover on the Home Theatre PC, so I simply directed Volumio at the network share. Note that it demands credentials even for CIFS/Samba/Windows shares that don’t require credentials, so be prepared to add a straw account.

This was when we learned that our MP3 collection isn’t exactly nicely organized. As you’d expect of anything assembled in the days of Napster, eDonkey, Limewire, and Kazaa, there were multiple slightly-different copies of some tracks and even entire albums. Tracks weren’t really clear about what album, artist, and title they belonged to… the metadata was a nightmare.

I’ve turned to Picard to help with the metadata challenges. So far it’s… fine? I dunno, AcoustID isn’t as foolproof as I was expecting it to be, and sometimes it decides to not group tracks into albums… it’s fine. So far.

Also, the gain levels of each track were different. Some were whisper-quiet and some were Cable TV Advertisement Loud. I’d hoped Volumio’s own volume normalization would help, but it seemed to silence already-quiet tracks and amplify high-gain recordings in the exact opposite of what I wanted. So I ran MP3Gain (yes, on sourceforge. Yes it hasn’t had a non-UI-language update since like 2008) for a few hours to get everyone singing at the same level, and turned off Volumio’s volume normalization.

And that’s where we are now. I’m not fully done with Picard (so many tracks to go). I haven’t added my own MP3 collection to the mix, with its additional duplicates and bad gain and whatnot…

…but it’s working. And it’s encouraging my wife and me to discover music we haven’t played in years. Which is wonderful.

If only because it annoys our preteen to learn that she kinda likes her parents’ tunes.

This Week in Glean: Firefox Telemetry is to Glean as C++ is to Rust

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I had this goofy idea that, like Rust, the Glean SDKs (and Ecosystem) aim to bring safety and higher-level thought to their domain. This is in comparison to how, like C++, Firefox Telemetry is built out of flexible primitives that assume you very much know what you’re doing and cannot (will not?) provide any clues in its design as to how to do things properly.

I have these goofy thoughts a lot. I’m a goofy guy. But the more I thought about it, the more the comparison seemed apt.

In Glean, wherever we can, we intentionally forbid behaviour we cannot guarantee is safe (e.g. we forbid non-commutative operations in FOG IPC, and we forbid decrementing counters). And in situations where we need to permit perhaps-unsafe data practices, we do it in tightly-scoped areas that are identified as unsafe (e.g. if a timing_distribution uses accumulate_raw_samples_nanos you know to look at its data with more skepticism).

In Glean we encourage instrumentors to think at a higher level (e.g. memory_distribution instead of a Histogram of unknown buckets and samples) thereby permitting Glean to identify errors early (e.g. you can’t start a timespan twice) and allowing Glean to do clever things about it (e.g. in our tooling we know counter metrics are interesting when summed, but quantity metrics are not). Speaking of those errors, we are able to forbid error-prone behaviour through design and use of language features (e.g. In languages with type systems we can prevent you from collecting the wrong type of data) and when the error is only detectable at runtime we can report it with a high degree of specificity to make it easier to diagnose.
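
To make that a bit more concrete, here’s a minimal sketch using the Glean Python bindings. The metric category and names are made up for illustration, and the exact binding calls can differ a little between SDK versions, so treat it as a sketch rather than gospel: a counter only exposes add, so there’s no way to decrement it, and a timing distribution hands you a timer id so the SDK can catch mismatched starts and stops and record an error instead of bad data.

import time

from glean import load_metrics

# A sketch, not Firefox code: this metrics.yaml and these metric names are hypothetical.
metrics = load_metrics("metrics.yaml")

# Counters only go up: there's an add(), but no subtract() or set().
metrics.downloads.completed.add(1)

# Timing distributions hand back a timer id so mismatched start/stop pairs
# get recorded as errors rather than as garbage samples.
timer_id = metrics.downloads.transfer_time.start()
time.sleep(0.1)  # stand-in for the work being measured
metrics.downloads.transfer_time.stop_and_accumulate(timer_id)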

There are more analogues, but the metaphor gets strained. (( I mean, I guess a timing_distribution’s `TimerId` is kinda the closest thing to a borrow checker we have? Maybe? )) So I should probably stop here.

Now, those of you paying attention might have already seen this relationship. After all, as we all know, glean-core (which underpins most of the Glean SDKs regardless of language) is actually written in Rust whereas Firefox Telemetry’s core of Histograms, Scalars, and Events is written in C++. Maybe we shouldn’t be too surprised when the language the system is written in happens to be reflected in the top-level design.

But! glean-core was (for a long time) written in Kotlin from stem to stern. So maybe it’s not due to language determinism, but rather to thoughtful design, careful change processes, and a list of principles we hold to firmly as the number of supported languages and metric types continues to grow.

I certainly don’t know. I’m just goofing around.

:chutten

Responsible Data Collection is Good, Actually (Ubisoft Data Summit 2021)

In June I was invited to talk at Ubisoft’s Data Summit about how Mozilla does data. I’ve given a short talk on this subject before, but this was an opportunity to update the material, cover more ground, and include more stories. The talk, including questions, comes in at just under an hour and is probably best summarized by the synopsis:

Learn how responsible data collection as practiced at Mozilla makes cataloguing easy, stops instrumentation mistakes before they ship, and allows you to build self-serve analysis tooling that gets everyone invested in data quality. Oh, and it’s cheaper, too.

If you want to skip to the best bits, I included shameless advertising for Mozilla VPN at 3:20 and becoming a Mozilla contributor at 14:04, and I lose my place in my notes at about 29:30.

Many thanks to Mathieu Nayrolles, Sebastien Hinse and the Data Summit committee at Ubisoft for guiding me through the process and organizing a wonderful event.

:chutten

Data Science is Interesting: Why are there so many Canadians in India?

Any time India comes up in the context of Firefox and Data I know it’s going to be an interesting day.

They’re our largest Beta population:

[Pie chart of Firefox Beta population by country: India is by far the largest at 33.2%.]

They’re our second-largest English user base (after the US):

[Pie chart of English-locale Firefox users by country: the US is largest at 37.8%, then India at 10.8%.]


But this is the interesting stuff about India that you just take for granted in Firefox Data. You come across these factoids for the first time and your mind is all blown and you hear the perhaps-apocryphal stories about Indian ISPs distributing Firefox Beta on CDs to their customers back in the Firefox 4 days… and then you move on. But every so often something new comes up and you’re reminded that no matter how much you think you’re prepared, there’s always something new you learn and go “Huh? What? Wait, what?!”

Especially when it’s India.

One of the facts I like to trot out to catch folks’ interest is how, when we first released the Canadian English localization of Firefox, India had more Canadians than Canada. Even today India is, after Canada and the US, the third largest user base of Canadian English Firefox:

[Pie chart of en-CA Firefox clients by country: Canada at 75.5%, the US at 8.35%, then India at 5.41%.]


Back in September 2018 Mozilla released the official Canadian English-localized Firefox. You can try it yourself by selecting it from the drop down menu in Firefox’s Preferences/Options in the “Language” section. You may have to click ‘Search for More Languages’ to be able to add it to the list first, but a few clicks later and you’ll be good to go, eh?

(( Or, if you don’t already have Firefox installed, you can select which language and dialect of Firefox you want from this download page. ))

Anyhoo, the Canadian English locale quickly gained a chunk of our install base:

[Uptake chart for en-CA Firefox users in September 2018: a sharp rise followed by a weekly seasonal pattern, with weekends lower than weekdays.]

…actually, it very quickly gained an overlarge chunk of our install base. Within a week we’d reached over three quarters of the entire Canadian user base?! Say we have one million Canadian users: that first peak in the chart was over 750k!

Now, we Canadian Mozillians suspected that there was some latent demand for the localized edition (they were just too polite to bring it up, y’know)… but not to this order of magnitude.

So back around that time a group of us including :flod, :mconnor, :catlee, :Aryx, :callek (and possibly others) fell down the rabbit hole trying to figure out where these Canadians were coming from. We ran down the obvious possibilities first: errors in data, errors in queries, errors in visualization… who knows, maybe I was counting some clients more than once a day? Maybe I was counting other Englishes (like South African and Great Britain) as well? Nothing panned out.

Then we guessed that maybe Canadians in Canada weren’t the only ones interested in the Canadian English localization. Originally I think we made a joke about how much Canadians love to travel, but then the query stopped running and showed us just how many Canadians there must be in India.

We were expecting a fair number of Canadians in the US. It is, after all, home to Firefox’s largest user base. But India? Why would India have so many Canadians? Or, if it’s not Canadians, why would Indians have such a preference for the English spoken in ten provinces and three territories? What is it about one of two official languages spoken from sea to sea to sea that could draw their attention?

Another thing that was puzzling was the raw speed of the uptake. If users were choosing the new localization themselves, we’d have seen a shallow curve with spikes as various news media made announcements or as we started promoting it ourselves. But this was far sharper an incline. This spoke to some automated process.

And the final curiosity (or clue, depending on your point of view) was discovered when we overlaid British English (en-GB) on top of the Canadian English (en-CA) uptake and noticed that (after accounting for some seasonality at the time due to the start of the school year) this suddenly-large number of Canadian English Firefoxes was drawn almost entirely from the number previously using British English:

[Chart of British and Canadian English use in Firefox in September 2018: the rise in Canadian English is matched by a fall in British English.]

It was all of this, put together that day, that led us to our Best Guess. I’ll give you a little space to make your own guess. If you think yours is a better fit for the evidence, or simply want to help out with Firefox in Canadian English, drop by the Canadian English (en-CA) Localization matrix room and let us know! We’re a fairly quiet bunch who are always happy to have folks help us keep on top of the new strings added or changed in Mozilla projects or just chat about language stuff.

Okay, got your guess made? Here’s ours:

en-CA is alphabetically before en-GB.

Which is to say that the Canadian English Firefox, when put in a list with all the other Firefox builds (like this one which lists all the locales Firefox 88 comes in for Windows 64-bit), comes before the British English Firefox. We assume there is a population of Firefoxes, heavily represented in India (and somewhat in the US and elsewhere), that are installed automatically from a list like this one. This automatic installation is looking for the first English build in this list, and it doesn’t care which dialect. Starting September of 2018, instead of grabbing British English like it’s been doing for who knows how long, it had a new English higher in the list: Canadian English.
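
If you want to see the guess in miniature, here’s a tiny sketch (pure Python, with a made-up list of build locales) of what an automated “grab the first English build” installer would do before and after en-CA showed up:

# Hypothetical locale lists, in the alphabetical order a build listing would use.
before = ["de", "en-GB", "en-US", "fr", "hi-IN"]
after = ["de", "en-CA", "en-GB", "en-US", "fr", "hi-IN"]

def first_english(locales):
    # "I want an English build, any English build": take the first match.
    return next(locale for locale in sorted(locales) if locale.startswith("en"))

print(first_english(before))  # en-GB
print(first_english(after))   # en-CA: same logic, new winner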

But who can say! All I know is that any time India comes up in the data, it’s going to be an interesting day.

:chutten

Doubling the Speed of Windows Firefox Builds using sccache-dist

I’m one of the many users but few developers of Firefox on Windows. One of the biggest obstacles stopping me from doing more development on Windows instead of this beefy Linux desktop I have sitting under my table is how slow builds are.

Luckily, distributed compilation (and caching) using sccache is here to help. This post is a step-by-step version of the rather-more-scattered docs I found on the github repo and in Firefox’s documentation. Those guides are excellent and have all of the same information (though they forgot to remind me to put the ports on the url config variables), but they have to satisfy many audiences with many platforms and many use cases so I found myself having to switch between all three to get myself set up.

To synthesize what I learned all in one place, I’m writing my Home Office Version to be specific to “using a Linux machine to help your Windows machine compile Firefox on a local network”. Here’s how it goes:

  1. Ensure the Build Scheduler (Linux-only), Build Servers (Linux-only), and Build Clients (any of Linux, MacOS, Windows) all have sccache-dist.
    • If you have a Firefox Build present, ./mach bootstrap already gave you a copy at .mozbuild/sccache/bin
    • My Build Scheduler and solitary Build Server are both the same Linux machine.
  2. Configure how the pieces all talk together by configuring the Scheduler.
    • Make a file someplace (I put mine in ~/sccache-dist/scheduler.conf) and put in the public-facing IP address of the scheduler (better be static), the method and secret that Clients use to authenticate themselves, and the method and secret that Servers use to authenticate themselves.
    • Keep the tokens and secret keys, y’know, secret.
# Don't forget the port, and don't use an internal iface address like 127.0.0.1.
# This is where the Clients and Servers should find the Scheduler
public_addr = "192.168.1.1:10600"

[client_auth]
type = "token"
# You can use whatever source of random, long, hard-to-guess token you'd like.
# But chances are you have openssl anyway, and it's good enough unless you're in
# a VM or other restrained-entropy situation.
token = "<whatever the output of `openssl rand -hex 64` gives you>"

[server_auth]
type = "jwt_hs256"
secret_key = "<whatever the output of `sccache-dist auth generate-jwt-hs256-key` is>"
  3. Start the Scheduler to see if it complains about your configuration.
    • ~/.mozbuild/sccache/sccache-dist scheduler --config ~/sccache-dist/scheduler.conf
    • If it fails fatally, it’ll let you know. But you might also want to pass `--syslog trace` while we’re setting things up so you can follow the verbose logging with `tail -f /var/log/syslog`
  4. Configure the Build Server.
    • Ensure you have bubblewrap >= 0.3.0 to sandbox your build jobs away from the rest of your computer
    • Make a file someplace (I put mine in ~/sccache-dist/server.conf) and put in the public-facing IP address of the server (better be static) and things like where and how big the toolchain cache should be, where the Scheduler is, and how you authenticate the Server with the Scheduler.
# Toolchains are how a Linux Server can build for a Windows Client.
# The Server needs a place to cache these so Clients don’t have to send them along each time.
cache_dir = "/tmp/toolchains"
# You can also config the cache size with toolchain_cache_size, but the default of 10GB is fine.

# This is where the Scheduler can find the Server. Don’t forget the port.
public_addr = "192.168.1.1:10501"

# This is where the Server can find the Scheduler. Don’t forget http. Don’t forget the port.
# Ideally you’d have an https server in front that’d add a layer of TLS and
# redirect to the port for you, but this is Home Office Edition.
scheduler_url = "http://192.168.1.1:10600"

[builder]
type = "overlay" # I don’t know what this means
build_dir = "/tmp/build" # Where on the fs you want that sandbox of build jobs to live
bwrap_path = "/usr/bin/bwrap" # Where the bubblewrap 0.3.0+ binary lives

[scheduler_auth]
type = "jwt_token"
token = "<what sccache-dist auth generate-jwt-hs256-server-token --secret-key <that key from scheduler.conf> --server <the value in public_addr including port>"
  5. Start the Build Server
    • `sudo` is necessary for this part to satisfy bubblewrap
    • sudo ~/.mozbuild/sccache/sccache-dist server --config ~/sccache-dist/server.conf
    • I’m not sure if it’s just me, but the build server runs in foreground without logs. Personally, I’d prefer a daemon.
    • If your scheduler’s tracelogging to syslog, you should see something in /var/log about the server authenticating successfully. If it isn’t, you can query the whole build network’s status in Step 7.
  6. Configure the Build Client.
    • This config file needs to have a specific name and location to be picked up by sccache. On Windows it’s `%APPDATA%\Mozilla\sccache\config\config`.
    • In it you need to write down how the Client can find and authenticate itself with the Scheduler. On not-Linux you also need to specify the toolchains you’ll be asking your Build Servers to use to compile your code.
[dist]
scheduler_url = "http://192.168.1.1:10600" # Don’t forget the protocol or port
toolchain_cache_size = 5368709120 # The default of 10GB is at least twice as big as you need.

# Gonna need two toolchains, one for C++ and one for Rust
# Remember to replace all <user> with your user name on disk
[[dist.toolchains]]
type = "path_override"
compiler_executable = "C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe"
archive = "C:/Users/<user>/.mozbuild/clang-dist-toolchain.tar.xz"
archive_compiler_executable = "/builds/worker/toolchains/clang/bin/clang"

[[dist.toolchains]]
type = "path_override"
compiler_executable = "C:/Users/<user>/.rustup/toolchains/stable-x86_64-pc-windows-msvc/bin/rustc.exe"
archive = "C:/Users/<user>/.mozbuild/rustc-dist-toolchain.tar.xz"
archive_compiler_executable = "/builds/worker/toolchains/rustc/bin/rustc"

# Near as I can tell, these dist.toolchains blocks tell sccache
# that if a job requires a tool at `compiler_executable` then it should instead
# distribute the job to be compiled using the tool present in `archive` at
# the path within the archive of `archive_compiler_executable`.
# You’ll notice that the `archive_compiler_executable` binaries do not end in `.exe`.

[dist.auth]
type = "token"
token = "<the value of scheduler.conf’s client_auth.token>"
  7. Perform a status check from the Client.
    • With the Scheduler and Server both running, go to the Client and run `.mozbuild/sccache/sccache.exe --dist-status`
    • It will start an sccache “client server” (ugh) in the background and try to connect. Ideally you’re looking for a non-0 “num_servers” and non-0 “num_cpus”
  8. Configure mach to use sccache
    • You need to tell it that it has a ccache and to configure clang to use `cl` driver mode (because when executing compiles on the Build Server it will see it’s called `clang` not `clang-cl` and thus forget to use `cl` mode unless you remind it to)
# Remember to replace all <user> with your user name on disk
ac_add_options CCACHE="C:/Users/<user>/.mozbuild/sccache/sccache.exe"

export CC="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export CXX="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export HOST_CC="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export HOST_CXX="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
  9. Run a test build
    • Using the value of “num_cpus” from Step 7’s `--dist-status`, run `./mach build -j<num_cpus>`
    • To monitor if everything’s working, you have some choices
      • You can look at network traffic (expect your network to be swamped with jobs going out and artefacts coming back)
      • You can look at resource-using processes on the Build Server (you can use `top` to watch the number of `clang` processes)
      • If your Scheduler or Server is logging, you can `tail -f /var/log/syslog` to watch the requests and responses in real time

Oh, dang, I should manufacture a final step so it’s How To Speed Up Windows Firefox Builds In Ten Easy Steps (if you have a fast Linux machine and network). Oh well.

Anyhoo, I’m not sure if this is useful to anyone else, but I hope it is. No doubt your setup is less weird than mine somehow so you’ll be better off reading the general docs instead. Happy Firefox developing!

:chutten

[Image: list of 14 Project FOG proposals, including “C++ API”, “Documentation Design”, and “Glean API Frontend for Firefox Telemetry”, among others.]

This Week in Glean: Proposals for Asynchronous Design

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

At last count there are 14 proposals for Firefox on Glean, the effort that, last year, brought the Glean SDK to Firefox Desktop. What in the world is a small, scrappy team in a small, scrappy company like Mozilla doing wasting so much time with old-school Waterfall Model overhead?!

Because it’s cheaper than the alternative.

Design is crucial before tackling difficult technological problems that affect multiple teams. At the very least you’re writing an API and you need to know what people want to do with it. So how do you get agreement? How do you reach the least bad design in the shortest time?

We in the Data Org use a Proposal Process. It’s a very lightweight thing. You write down in a (sigh) Google Doc what it is you’re proposing (we have a snazzy template), attach it to a bug, then needinfo folks who should look at it. They use Google Docs’ commenting and suggested changes features to improve the proposal in small ways and discuss it, and use Bugzilla’s comments and flags to provide overall feedback on the proposal itself (like, should it even exist) and to ensure they keep getting reminded to look at the proposal until the reviewer’s done reviewing. All in all, it’ll take a week or two of part-time effort to write the proposal, find the right people to review it, and then incorporate the feedback and consider it approved.

(( Full disclosure, the parts involving Bugzilla are my spin on the Proposal Process. It just says you should get feedback, not how. ))

Proposals vs Meetings

Why not use a meeting? Wouldn’t that be faster?

Think about who gets to review things in a meeting as a series of filters. First and foremost, only those who attend can review. I’ve talked before about how distributed across the globe my org is, and a lot of the proposals in Project FOG also needed feedback from subject matter experts across Mozilla as a whole (we are not jumping into the XPIDL swamp without a guide). No way could I find a space in all those calendars, assuming that any of them even overlap due to time zones.

Secondly, with a defensive Proposer, feedback will be limited to those reviewers they can’t overpower in a meeting. So if someone wants to voice a subtle flaw in the C++ Metrics API Design (like how I forgot to include any details about how to handle Labeled Metrics), they have to first get me to stop talking. And even though I’m getting better at that (still a ways to go), if you are someone who doesn’t feel comfortable providing feedback in a meeting (perhaps you’re new and hesitant, or you only kinda know about the topic and are worried about looking foolish, or you are generally averse to speaking in front of others) it won’t matter how quiet I am. The proposal won’t be able to benefit from your input.

Thirdly, some feedback can’t be thought of in a meeting. There’s a rough-and-readiness, an immediacy, to feedback in a meeting setting. You’re thinking on your feet, even if the Proposal and meeting agenda are set well in advance. Some critiques need time to percolate, or additional critical voices to bounce off of. Meetings aren’t great for that unless you can get everyone in a room for a day. Pandemic aside, when was the last time you all had that much time?

Proposal documents are just so much more inclusive than design meetings. You probably still want to have a meeting for early prototyping with a small group of insiders, and another at the end to coax out any lingering doubts… but having the main review stages be done asynchronously to your reviewers’ schedules allows you to include a wider variety of voices. You wouldn’t feel comfortable asking a VP to an hour-long design meeting, but you might feel comfortable sending the doc in an email for visibility.

Asynchronicity For You and Me

On top of being more inclusive, proposals are also more respectful. I don’t know what your schedule is today. I don’t know what life you’re living. But I can safely assume that, unless you’re on vacation, you’ll have enough time between now and, say, next Friday to skim a doc and see if there’s anything foolish in it you need to stop me from doing. Or think of someone else who I didn’t think of who should really take a look.

And by setting a feedback deadline, you the Proposer are setting yourself free. You’ll be getting emails as feedback comes in. You’ll be responding to questions, accepting and rejecting changes, and having short little chats. But you can handle that in bite sized chunks on your own schedule, asynchronously, and give yourself the freedom to schedule synchronous work and meetings in the meantime.

Proposal Evolution

Name a Design that was implemented exactly as written. Go on, I’ll wait.

No? Can’t think of one? Neither can I.

Designs (and thus Proposals) are always incomplete. They can’t take into consideration everything. They’re necessarily at a higher level than the implementation. So in some way, the implementation is the evolution of the Design. But implementations lose the valuable information about Why and How that was so important to set down in the Design. When someone new comes to the project and asks you why we implemented it this way, will you have to rely on the foggy remembrance of oral organizational history? Or will you find some way of keeping an objective record?

Only now have we started to develop the habit of indexing and archiving Proposals internally. That’s how I know there have been fourteen Project FOG proposals (so far). But I don’t think a dusty wiki is the correct place for them.

I think, once accepted, Proposals should evolve into Documentation. Documentation is a Design adjusted by the realities encountered during implementation and maintained by users asking questions. Documentation is a living document explaining Why and How, kept in sync with the implementation’s explanation of What.

But Documentation is a discussion for another time. Reference Documentation vs User Guides vs Design Documentation vs Marketing Copy vs… so much variety, so little time. And I’ve already written too much.

:chutten

This Week in Glean: Glean is Frictionless Data Collection

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

So you want to collect data in your project? Okay, it’s pretty straightforward.

  1. API: You need a way to combine the name of your data with the value that data has. Ideally you want it to be ergonomic to your developers to encourage them to instrument things without asking you for help, so it should include as many compile-time checks as you can and should be friendly to the IDEs and languages in use. Note the plurals.
  2. Persistent Storage: Keyed by the name of your data, you need some place to put the value. Ideally this will be common regardless of the instrumentation’s language or thread of execution. And since you really don’t want crashes or sudden application shutdowns or power outages to cause you to lose everything, you need to persist this storage. You can write it to a file on disk (if your platforms have such access), but be sure to write the serialization and deserialization functions with backwards-compatibility in mind because you’ll eventually need to change the format (there’s a small sketch of what I mean just after this list).
  3. Networking: Data stored with the product has its uses, but chances are you want this data to be combined with more data from other installations. You don’t need to write the network code yourself, there are libraries for HTTPS after all, but you’ll need to write a protocol on top of it to serialize your data for transmission.
  4. Scheduling: Sending data each time a new piece of instrumentation comes in might be acceptable for some products whose nature is only-online. Messaging apps and MMOs send so much low-latency data all the time that you might as well send your data as it comes in. But chances are you aren’t writing something like that, or you respect the bandwidth of your users too much to waste it, so you’ll only want to be sending data occasionally. Maybe daily. Maybe when the user isn’t in the middle of something. Maybe regularly. Maybe when the stored data reaches a certain size. This could get complicated, so spend some time here and don’t be afraid to change it as you find new corners.
  5. Errors: Things will go wrong. Instrumentation will, despite your ergonomic API, do something wrong and write the wrong value or call stop() before start(). Your networking code will encounter the weirdness of the full Internet. Your storage will get full. You need some way to communicate the health of your data collection system to yourself (the owner who needs to adjust scheduling and persistence and other stuff to decrease errors) and to others (devs who need to fix their instrumentation, analysts who should be told if there’s a problem with the data, QA so they can write tests for these corner cases).
  6. Ingestion: You’ll need something on the Internet listening for your data coming in. It’ll need to scale to the size of your product’s base and be resilient to Internet Attacks. It should speak the protocol you defined in #3, so you should probably have some sort of machine-readable definition of that protocol that product and ingestion can share. And you should spend some time thinking about what to do when an old product with an old version of the protocol wants to send data to your latest ingestion endpoint.
  7. Pipeline: Not all data will go to the same place. Some is from a different product. Some adheres to a different schema. Some is wrong but ingestion (because it needs to scale) couldn’t do the verification of it, so now you need to discard it more expensively. Thus you’ll be wanting some sort of routing infrastructure to take ingested data and do some processing on it.
  8. Warehousing: Once you receive all these raw payloads you’ll need a place to put them. You’ll want this place to be scalable, high-performance, and highly-available.
  9. Datasets: Performing analysis to gain insight from raw payloads is possible (even I have done it), but it is far more pleasant to consolidate like payloads with like, perhaps ordered or partitioned by time and by some dimensions within the payload that’ll make analyses quicker. Maybe you’ll want to split payloads into multiple rows of a tabular dataset, or combine multiple payloads into single rows. Talk to the people doing the analyses and ask them what would make their lives easier.
  10. Tooling: Democratizing data analysis is a good way to scale up the number of insights your organization can find at once, and it’s a good way to build data intuition. You might want to consider low-barrier data analysis tooling to encourage exploration. You might also want to consider some high-barrier data tooling for operational analyses and monitoring (good to know that the update is rolling out properly and isn’t bricking users’ devices). And some things for the middle ground of folks that know data and have questions, but don’t know SQL or Python or R.
  11. Tests: Don’t forget that every piece of this should be testable and tested in isolation and in integration. If you can manage it, a suite of end-to-end tests does wonders for making you feel good that the whole system will continue to work as you develop it.
  12. Documentation: You’ll need two types of documentation: User and Developer. The former is for the “user” of the piece (developers who wish to instrument back in #1, analysts who have questions that need answering in #10). The latter is for anyone going in trying to understand the “Why” and “How” of the pieces’ architecture and design choices.
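
Here’s that Persistent Storage point (#2 above) in miniature: a hypothetical sketch, not anything Glean actually ships, of why you want a version field in your on-disk format from the very first release.

import json

FORMAT_VERSION = 2

def serialize(data: dict) -> str:
    # Always write the version so future-you can tell old files from new ones.
    return json.dumps({"version": FORMAT_VERSION, "data": data})

def deserialize(raw: str) -> dict:
    doc = json.loads(raw)
    version = doc.get("version", 1)  # files from before the field existed count as v1
    if version == 1:
        # v1 stored the payload at the top level; migrate it forward.
        doc = {"version": FORMAT_VERSION,
               "data": {k: v for k, v in doc.items() if k != "version"}}
    return doc["data"]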

You get all that? Thread safety. File formats. Networking protocols. Scheduling using real wall-clock time. Schema validation. Open ports on the Internet. At scale. User-facing tools and documentation. All tested and verified.

Look, I said it’d be straightforward, not that it’d be easy. I’m sure it’ll only take you a few years and a couple tries to get it right.

Or, y’know, if you’re a Mozilla project you could just use Glean which already has all of these things…

  1. API: The Glean SDK API aims to be ergonomic and idiomatic in each supported language.
  2. Persistent Storage: The Glean SDK uses rkv as a persistent store for unsubmitted data, and a documented flat file format for submitted but not yet sent data.
  3. Networking: The Glean SDK provides an API for embedding applications to provide their own networking stack (useful when we’re embedded in a browser), and some default implementations if you don’t care to provide one. The payload protocol is built on Structured Ingestion and has a schema that generates and deploys new versions daily.
  4. Scheduling: Each Glean SDK payload has its own schedule to respect the character of the data it contains, from as frequently as the user foregrounds the app to, at most, once a day.
  5. Errors: The Glean SDK builds error reporting for user metrics and internal health metrics into the SDK itself.
  6. Ingestion: The edge servers and schema validation are all documented and tested. We autoscale quite well and have a process for handling incidents.
  7. Pipeline: We have a pubsub system on GCP that handles a variety of different types of data.
  8. Warehousing: I can’t remember if we still call this the Data Lake or not.
  9. Datasets: We have a few. They are monitored. Our workflow software for deriving the datasets is monitored as well.
  10. Tooling: Quite a few of them are linked from the Telemetry Index.
  11. Tests: Each piece is tested individually. Adjacent pieces sometimes have integration suites. And Raphael recently spun up end-to-end tests that we’re very appreciative of. And if you’re just a dev wondering if your new instrumentation is working? We have the debug ping viewer.
  12. Documentation: Each piece has developer documentation. Some pieces, like the SDK, also have user documentation. And the system at large? Even more documentation.

Glean takes this incredibly complex problem, breaks it into pieces, solves each piece individually, then puts the solution together in a way that makes it greater than the sum of its parts.

All you need is to follow the six steps to integrate the Glean SDK and notify the Ecosystem that your project exists, and then your responsibilities shrink to just instrumentation and analysis.
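
For a sense of scale, “integrate the Glean SDK” is roughly this much application code. This is a minimal sketch using the Python bindings; the exact initialize arguments vary by SDK version and language, and the application id and metric here are made up, so treat it as illustrative rather than canonical.

from glean import Glean, load_metrics

# Illustrative values only: use your own application id and version.
Glean.initialize(
    application_id="org.example.myproject",
    application_version="1.0.0",
    upload_enabled=True,
)

metrics = load_metrics("metrics.yaml")
metrics.usage.launches.add(1)  # a hypothetical counter from your metrics.yaml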

If that isn’t frictionless data collection, I don’t know what is.

:chutten

(( If you’re not a Mozilla project, and thus don’t by default get to use the Data Platform (numbers 6-10) for your project, come find us on the #glean channel on Matrix and we’ll see what help we can get you. ))

Data Science is Hard: ALSA in Firefox

(( We’re overdue for another episode in this series on how Data Science is Hard. Today is a story from 2016 which I think illustrates many important things to do with data. ))

It’s story time. Gather ’round.

In July of 2016, Anthony Jones made the case that the Mozilla-built Firefox for Linux should stop supporting the ALSA backend (and also the WinXP WinMM backend) so that we could innovate on features for more modern audio backends.

(( You don’t need to know what an audio backend is to understand this story. ))

The code supporting ALSA would remain in tree for any Linux distribution who wished to maintain the backend and build it for themselves, but Mozilla would stop shipping Firefox with that code in it.

But how could we ensure the number of Firefoxen relying on this backend was small enough that we wouldn’t be removing something our users desperately needed? Luckily :padenot had just added an audio backend measurement to Telemetry. “We’ll have data soon,” he wrote.

By the end of August we’d heard from Firefox Nightly and Firefox Developer Edition that only 3.5% and 2% (respectively) of Linux subsessions with audio used ALSA. This was small enough for the removal to move ahead.

Fast-forward to March of 2017. Seven months have passed. The removal has wound its way through Nightly, Developer Edition, Beta, and now into the stable Release channel. Linux users following this update channel update their Firefox and… suddenly the web grows silent for a large number of users.

Bugs are filed (thirteen of them). The mailing list thread with Anthony’s original proposal is revived with some very angry language. It seems as though far more than just a fraction of a fraction of users were using ALSA. There were entire Linux distributions that didn’t ship anything besides ALSA. How did Telemetry miss them?

It turns out that many of those same ALSA-only Linux distributions also turned off Telemetry when they repackaged Firefox for their users. And for any that shipped with Telemetry at all, many users disabled it themselves. Those users’ Firefoxen had no way to phone home to tell Mozilla how important ALSA was to them… and now it was too late.

Those Linux distributions started building ALSA support into their distributed Firefox builds… and hopefully began reporting Telemetry by default to prevent this from happening again. I don’t know if they did for sure (we don’t collect fine-grained information like that because we don’t need it).

But it serves as a cautionary tale: Mozilla can only support a finite number of things. Far fewer now than we did back in 2016. We prioritize what we support based on its simplicity and its reach. That first one we can see for ourselves, and for the second we rely on data collection like Telemetry to tell us.

Counting things is harder than it looks. Counting things that are invisible is damn near impossible. So if you want to be counted: turn Telemetry on (it’s in the Preferences) and leave it on.

:chutten

Five-Year Moziversary

Wowee what a year that was. And I’m pretty sure the year to come will be even more so.

Me, in last year’s moziversary post

Oof. I hate being right for the wrong reasons. And that’s all I’ll say about COVID-19 and the rest of the 2020 dumpster fire.

In team news, Georg’s short break turned into the neverending kind as he left Mozilla late last year. We gained Michael Droettboom as our new fearless leader, and from my perspective he seems to be doing quite well at the managery things. Bea and Travis, our two newer team members, have really stepped into their roles well, providing much needed bench depth on Rust and Mobile. And Jan-Erik has taken over leadership of the SDK, freeing up Alessio to think about data collection for Web Extensions.

2020 is indeed being the Year of Glean on the Desktop with several projects already embedding the now-successful Glean SDK, including our very own mach (Firefox Build Tooling Commandline) and mozregression (Firefox Bug Regression Window Finding Tool). Oh, and Jan-Erik and I’ve spent ten months planning and executing on Project FOG (Firefox on Glean) (maybe you’ve heard of it), on track (more or less) to be able to recommend it for all new data collections by the end of the year.

My blogging frequency has cratered. Though I have a mitt full of ideas, I’ve spent no time developing them into proper posts beyond taking my turn at This Week in Glean. In the hopper I have “Naming Your Kid Based on how you Yell At Them”, “Tools Externalize Costs to their Users”, “Writing Code for two Wolves: Computers and Developers”, “Glean is Frictionless”, “Distributed Teams: Proposals are Inclusive”, and whatever of the twelve (Twelve?!) drafts I have saved up in wordpress that have any life in them.

Progress on my resolutions to blog more, continue improving, and put Glean on Firefox? Well, I think I’ve done the latter two. And I think those resolutions are equally valid for the next year, though I may tweak “put Glean on Firefox” to “support migrating Firefox Telemetry to Glean” which is more or less the same thing.

:chutten