Data Science is Interesting: Why are there so many Canadians in India?

Any time India comes up in the context of Firefox and Data I know it’s going to be an interesting day.

They’re our largest Beta population:

[Pie chart showing India by far the largest at 33.2%]

They’re our second-largest English user base (after the US):

[Pie chart showing the US as largest with 37.8%, then India with 10.8%]


But this is the interesting stuff about India that you just take for granted in Firefox Data. You come across these factoids for the first time and your mind is all blown and you hear the perhaps-apocryphal stories about Indian ISPs distributing Firefox Beta on CDs to their customers back in the Firefox 4 days… and then you move on. But every so often something new comes up and you’re reminded that no matter how much you think you’re prepared, there’s always something new you learn and go “Huh? What? Wait, what?!”

Especially when it’s India.

One of the facts I like to trot out to catch folks’ interest is how, when we first released the Canadian English localization of Firefox, India had more Canadians than Canada. Even today India is, after Canada and the US, the third largest user base of Canadian English Firefox:

[Pie chart of en-CA Firefox clients by country: Canada at 75.5%, US at 8.35%, then India at 5.41%]


Back in September 2018 Mozilla released the official Canadian English-localized Firefox. You can try it yourself by selecting it from the drop-down menu in Firefox's Preferences/Options in the "Language" section. You may have to click 'Search for More Languages' to be able to add it to the list first, but a few clicks later and you'll be good to go, eh?

(( Or, if you don’t already have Firefox installed, you can select which language and dialect of Firefox you want from this download page. ))

Anyhoo, the Canadian English locale quickly gained a chunk of our install base:

[Uptake chart for en-CA users in Firefox in September 2018, showing a sharp uptake followed by a weekly seasonal pattern with weekends lower than weekdays]

…actually, it very quickly gained an overlarge chunk of our install base. Within a week we’d reached over three quarters of the entire Canadian user base?! Say we have one million Canadian users, that first peak in the chart was over 750k!

Now, we Canadian Mozillians suspected that there was some latent demand for the localized edition (they were just too polite to bring it up, y’know)… but not to this order of magnitude.

So back around that time a group of us including :flod, :mconnor, :catlee, :Aryx, :callek (and possibly others) fell down the rabbit hole trying to figure out where these Canadians were coming from. We ran down the obvious possibilities first: errors in data, errors in queries, errors in visualization… who knows, maybe I was counting some clients more than once a day? Maybe I was counting other Englishes (like South African and Great Britain) as well? Nothing panned out.

Then we guessed that maybe Canadians in Canada weren’t the only ones interested in the Canadian English localization. Originally I think we made a joke about how much Canadians love to travel, but then the query finished running and showed us just how many Canadians there must be in India.

We were expecting a fair number of Canadians in the US. It is, after all, home to Firefox’s largest user base. But India? Why would India have so many Canadians? Or, if it’s not Canadians, why would Indians have such a preference for the English spoken in ten provinces and three territories? What is it about one of two official languages spoken from sea to sea to sea that could draw their attention?

Another thing that was puzzling was the raw speed of the uptake. If users were choosing the new localization themselves, we’d have seen a shallow curve with spikes as various news media made announcements or as we started promoting it ourselves. But this was far sharper an incline. This spoke to some automated process.

And the final curiosity (or clue, depending on your point of view) was discovered when we overlaid British English (en-GB) on top of the Canadian English (en-CA) uptake and noticed that (after accounting for some seasonality at the time due to the start of the school year) this suddenly-large number of Canadian English Firefoxes was drawn almost entirely from the number previously using British English:

[Chart showing use of British and Canadian English in Firefox in September 2018. The rise in use of Canadian English is matched by a fall in the use of British English.]

It was all of this put together that day that led us to our Best Guess. I’ll give you a little space to make your own guess. If you think yours is a better fit for the evidence, or simply want to help out with Firefox in Canadian English, drop by the Canadian English (en-CA) Localization matrix room and let us know! We’re a fairly quiet bunch who are always happy to have folks help us keep on top of the new strings added or changed in Mozilla projects or just chat about language stuff.

Okay, got your guess made? Here’s ours:

en-CA is alphabetically before en-GB.

Which is to say that the Canadian English Firefox, when put in a list with all the other Firefox builds (like this one which lists all the locales Firefox 88 comes in for Windows 64-bit), comes before the British English Firefox. We assume there is a population of Firefoxes, heavily represented in India (and somewhat in the US and elsewhere), that are installed automatically from a list like this one. This automatic installation looks for the first English build in the list, and it doesn’t care which dialect. Starting September of 2018, instead of grabbing British English like it had been doing for who knows how long, it found a new English higher in the list: Canadian English.
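If you’d like the guess in executable form, here’s a minimal sketch (Python, with a hypothetical locale list; we never saw the actual installers) of the behaviour we’re imagining:

# A hypothetical automated installer: take the first English build, any dialect,
# from an alphabetically-sorted list of available locales.
def pick_english_build(available_locales):
    for locale in sorted(available_locales):
        if locale.startswith("en"):
            return locale
    return None

before = ["de", "en-GB", "en-US", "fr", "hi-IN"]
after = before + ["en-CA"]  # Canadian English ships in September 2018

print(pick_english_build(before))  # en-GB
print(pick_english_build(after))   # en-CA: hello, new "Canadians"

The moment en-CA enters the list, a dialect-agnostic "first English wins" rule flips from British English to Canadian English: exactly the shape of the switchover in the chart above.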

But who can say! All I know is that any time India comes up in the data, it’s going to be an interesting day.

:chutten

Doubling the Speed of Windows Firefox Builds using sccache-dist

I’m one of the many users but few developers of Firefox on Windows. One of the biggest obstacles stopping me from doing more development on Windows instead of this beefy Linux desktop I have sitting under my table is how slow builds are.

Luckily, distributed compilation (and caching) using sccache is here to help. This post is a step-by-step version of the rather-more-scattered docs I found on the github repo and in Firefox’s documentation. Those guides are excellent and have all of the same information (though they forgot to remind me to put the ports on the url config variables), but they have to satisfy many audiences with many platforms and many use cases so I found myself having to switch between all three to get myself set up.

To synthesize what I learned all in one place, I’m writing my Home Office Version to be specific to “using a Linux machine to help your Windows machine compile Firefox on a local network”. Here’s how it goes:

  1. Ensure the Build Scheduler (Linux-only), Build Servers (Linux-only), and Build Clients (any of Linux, MacOS, Windows) all have sccache-dist.
    • If you have a Firefox Build present, ./mach bootstrap already gave you a copy at .mozbuild/sccache/bin
    • My Build Scheduler and solitary Build Server are both the same Linux machine.
  2. Configure how the pieces all talk together by configuring the Scheduler.
    • Make a file someplace (I put mine in ~/sccache-dist/scheduler.conf) and put in the public-facing IP address of the scheduler (better be static), the method and secret that Clients use to authenticate themselves, and the method and secret that Servers use to authenticate themselves.
    • Keep the tokens and secret keys, y’know, secret.
# Don't forget the port, and don't use an internal iface address like 127.0.0.1.
# This is where the Clients and Servers should find the Scheduler
public_addr = "192.168.1.1:10600"

[client_auth]
type = "token"
# You can use whatever source of random, long, hard-to-guess token you'd like.
# But chances are you have openssl anyway, and it's good enough unless you're in
# a VM or other restrained-entropy situation.
token = "<whatever the output of `openssl rand -hex 64` gives you>"

[server_auth]
type = "jwt_hs256"
secret_key = "<whatever the output of `sccache-dist auth generate-jwt-hs256-key` is>"
  3. Start the Scheduler to see if it complains about your configuration.
    • ~/.mozbuild/sccache/sccache-dist scheduler --config ~/sccache-dist/scheduler.conf
    • If it fails fatally, it’ll let you know. But you might also want to pass `--syslog trace` while we’re setting things up so you can follow the verbose logging with `tail -f /var/log/syslog`
  4. Configure the Build Server.
    • Ensure you have bubblewrap >= 0.3.0 to sandbox your build jobs away from the rest of your computer
    • Make a file someplace (I put mine in ~/sccache-dist/server.conf) and put in the public-facing IP address of the server (better be static) and things like where and how big the toolchain cache should be, where the Scheduler is, and how you authenticate the Server with the Scheduler.
# Toolchains are how a Linux Server can build for a Windows Client.
# The Server needs a place to cache these so Clients don’t have to send them along each time.
cache_dir = "/tmp/toolchains"
# You can also config the cache size with toolchain_cache_size, but the default of 10GB is fine.

# This is where the Scheduler can find the Server. Don’t forget the port.
public_addr = "192.168.1.1:10501"

# This is where the Server can find the Scheduler. Don’t forget http. Don’t forget the port.
# Ideally you’d have an https server in front that’d add a layer of TLS and
# redirect to the port for you, but this is Home Office Edition.
scheduler_url = "http://192.168.1.1:10600"

[builder]
type = "overlay" # I don’t know what this means
build_dir = "/tmp/build" # Where on the fs you want that sandbox of build jobs to live
bwrap_path = "/usr/bin/bwrap" # Where the bubblewrap 0.3.0+ binary lives

[scheduler_auth]
type = "jwt_token"
token = "<what sccache-dist auth generate-jwt-hs256-server-token --secret-key <that key from scheduler.conf> --server <the value in public_addr including port>"
  5. Start the Build Server
    • `sudo` is necessary for this part to satisfy bubblewrap
    • sudo ~/.mozbuild/sccache/sccache-dist server --config ~/sccache-dist/server.conf
    • I’m not sure if it’s just me, but the build server runs in foreground without logs. Personally, I’d prefer a daemon.
    • If your scheduler’s tracelogging to syslog, you should see something in /var/log about the server authenticating successfully. If it isn’t, you can query the whole build network’s status in Step 7.
  6. Configure the Build Client.
    • This config file needs to have a specific name and location to be picked up by sccache. On Windows it’s `%APPDATA%\Mozilla\sccache\config\config`.
    • In it you need to write down how the Client can find and authenticate itself with the Scheduler. On not-Linux you also need to specify the toolchains you’ll be asking your Build Servers to use to compile your code.
[dist]
scheduler_url = "http://192.168.1.1:10600" # Don’t forget the protocol or port
toolchain_cache_size = 5368709120 # The default of 10GB is at least twice as big as you need.

# Gonna need two toolchains, one for C++ and one for Rust
# Remember to replace all <user> with your user name on disk
[[dist.toolchains]]
type = "path_override"
compiler_executable = "C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe"
archive = "C:/Users/<user>/.mozbuild/clang-dist-toolchain.tar.xz"
archive_compiler_executable = "/builds/worker/toolchains/clang/bin/clang"

[[dist.toolchains]]
type = "path_override"
compiler_executable = "C:/Users/<user>/.rustup/toolchains/stable-x86_64-pc-windows-msvc/bin/rustc.exe"
archive = "C:/Users/<user>/.mozbuild/rustc-dist-toolchain.tar.xz"
archive_compiler_executable = "/builds/worker/toolchains/rustc/bin/rustc"

# Near as I can tell, these dist.toolchains blocks tell sccache
# that if a job requires a tool at `compiler_executable` then it should instead
# distribute the job to be compiled using the tool present in `archive` at
# the path within the archive of `archive_compiler_executable`.
# You’ll notice that the `archive_compiler_executable` binaries do not end in `.exe`.

[dist.auth]
type = "token"
token = "<the value of scheduler.conf’s client_auth.token>"
  7. Perform a status check from the Client.
    • With the Scheduler and Server both running, go to the Client and run `.mozbuild/sccache/sccache.exe --dist-status`
    • It will start a sccache “client server” (ugh) in the background and try to connect. Ideally you’re looking for a non-0 “num_servers” and non-0 “num_cpus”
  8. Configure mach to use sccache
    • You need to tell it that it has a ccache and to configure clang to use `cl` driver mode (because when executing compiles on the Build Server it will see it’s called `clang` not `clang-cl` and thus forget to use `cl` mode unless you remind it to)
# Remember to replace all <user> with your user name on disk
ac_add_options CCACHE="C:/Users/<user>/.mozbuild/sccache/sccache.exe"

export CC="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export CXX="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export HOST_CC="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
export HOST_CXX="C:/Users/<user>/.mozbuild/clang/bin/clang-cl.exe --driver-mode=cl"
  9. Run a test build
    • Using the value of “num_cpus” from Step 7’s `--dist-status`, run `./mach build -j<num_cpus>`
    • To monitor if everything’s working, you have some choices
      • You can look at network traffic (expect your network to be swamped with jobs going out and artefacts coming back)
      • You can look at resource-using processes on the Build Server (you can use `top` to watch the number of `clang` processes)
      • If your Scheduler or Server is logging, you can `tail -f /var/log/syslog` to watch the requests and responses in real time

Oh, dang, I should manufacture a final step so it’s How To Speed Up Windows Firefox Builds In Ten Easy Steps (if you have a fast Linux machine and network). Oh well.

Anyhoo, I’m not sure if this is useful to anyone else, but I hope it is. No doubt your setup is less weird than mine somehow so you’ll be better off reading the general docs instead. Happy Firefox developing!

:chutten

This Week in Glean: Glean is Frictionless Data Collection

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

So you want to collect data in your project? Okay, it’s pretty straightforward.

  1. API: You need a way to combine the name of your data with the value that data has. Ideally you want it to be ergonomic to your developers to encourage them to instrument things without asking you for help, so it should include as many compile-time checks as you can and should be friendly to the IDEs and languages in use. Note the plurals.
  2. Persistent Storage: Keyed by the name of your data, you need some place to put the value. Ideally this will be common regardless of the instrumentation’s language or thread of execution. And since you really don’t want crashes or sudden application shutdowns or power outages to cause you to lose everything, you need to persist this storage. You can write it to a file on disk (if your platforms have such access), but be sure to write the serialization and deserialization functions with backwards-compatibility in mind because you’ll eventually need to change the format.
  3. Networking: Data stored with the product has its uses, but chances are you want this data to be combined with more data from other installations. You don’t need to write the network code yourself, there are libraries for HTTPS after all, but you’ll need to write a protocol on top of it to serialize your data for transmission.
  4. Scheduling: Sending data each time a new piece of instrumentation comes in might be acceptable for some products whose nature is only-online. Messaging apps and MMOs send so much low-latency data all the time that you might as well send your data as it comes in. But chances are you aren’t writing something like that, or you respect the bandwidth of your users too much to waste it, so you’ll only want to be sending data occasionally. Maybe daily. Maybe when the user isn’t in the middle of something. Maybe regularly. Maybe when the stored data reaches a certain size. This could get complicated, so spend some time here and don’t be afraid to change it as you find new corners.
  5. Errors: Things will go wrong. Instrumentation will, despite your ergonomic API, do something wrong and write the wrong value or call stop() before start(). Your networking code will encounter the weirdness of the full Internet. Your storage will get full. You need some way to communicate the health of your data collection system to yourself (the owner who needs to adjust scheduling and persistence and other stuff to decrease errors) and to others (devs who need to fix their instrumentation, analysts who should be told if there’s a problem with the data, QA so they can write tests for these corner cases).
  6. Ingestion: You’ll need something on the Internet listening for your data coming in. It’ll need to scale to the size of your product’s base and be resilient to Internet Attacks. It should speak the protocol you defined in #3, so you should probably have some sort of machine-readable definition of that protocol that product and ingestion can share. And you should spend some time thinking about what to do when an old product with an old version of the protocol wants to send data to your latest ingestion endpoint.
  7. Pipeline: Not all data will go to the same place. Some is from a different product. Some adheres to a different schema. Some is wrong but ingestion (because it needs to scale) couldn’t do the verification of it, so now you need to discard it more expensively. Thus you’ll be wanting some sort of routing infrastructure to take ingested data and do some processing on it.
  8. Warehousing: Once you receive all these raw payloads you’ll need a place to put them. You’ll want this place to be scalable, high-performance, and highly-available.
  9. Datasets: Performing analysis to gain insight from raw payloads is possible (even I have done it), but it is far more pleasant to consolidate like payloads with like, perhaps ordered or partitioned by time and by some dimensions within the payload that’ll make analyses quicker. Maybe you’ll want to split payloads into multiple rows of a tabular dataset, or combine multiple payloads into single rows. Talk to the people doing the analyses and ask them what would make their lives easier.
  10. Tooling: Democratizing data analysis is a good way to scale up the number of insights your organization can find at once, and it’s a good way to build data intuition. You might want to consider low-barrier data analysis tooling to encourage exploration. You might also want to consider some high-barrier data tooling for operational analyses and monitoring (good to know that the update is rolling out properly and isn’t bricking users’ devices). And some things for the middle ground of folks that know data and have questions, but don’t know SQL or Python or R.
  11. Tests: Don’t forget that every piece of this should be testable and tested in isolation and in integration. If you can manage it, a suite of end-to-end tests does wonders for making you feel good that the whole system will continue to work as you develop it.
  12. Documentation: You’ll need two types of documentation: User and Developer. The former is for the “user” of the piece (developers who wish to instrument back in #1, analysts who have questions that need answering in #10). The latter is for anyone going in trying to understand the “Why” and “How” of the pieces’ architecture and design choices.

You get all that? Thread safety. File formats. Networking protocols. Scheduling using real wall-clock time. Schema validation. Open ports on the Internet. At scale. User-facing tools and documentation. All tested and verified.
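To make just the first two pieces concrete, here’s a deliberately naive sketch in Python (a hypothetical API of my own invention, nothing like the real Glean one) of #1 and #2, with comments noting where some of the other pieces come knocking:

import json
import os
import tempfile
import threading

class Store:
    # Piece #2 (Persistent Storage): naive flat-file edition.
    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()  # instrumentation arrives from many threads
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)  # change the format? Hope you wrote migration code.

    def increment(self, name, amount):
        with self._lock:
            self._data[name] = self._data.get(name, 0) + amount
            self._flush()

    def _flush(self):
        # Write-then-rename so a crash or power outage can't leave half a file behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self._path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self._data, f)
        os.replace(tmp, self._path)

class Counter:
    # Piece #1 (API): a single metric type. Real systems want many, in many languages.
    def __init__(self, store, name):
        self._store = store
        self._name = name

    def add(self, amount=1):
        if amount < 1:
            # Piece #5 (Errors): record instrumentation misuse instead of crashing.
            self._store.increment(f"errors/{self._name}/invalid_add", 1)
            return
        self._store.increment(self._name, amount)

# Pieces #3 through #12 (networking, scheduling, ingestion, pipeline, warehousing,
# datasets, tooling, tests, documentation) left as an exercise. For a few years.
store = Store("metrics.json")
live_bookmarks = Counter(store, "live_bookmarks.count")
live_bookmarks.add(3)

Even this toy rewrites the whole file on every increment and trusts its own serialization format forever. The “easy” pieces have teeth.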

Look, I said it’d be straightforward, not that it’d be easy. I’m sure it’ll only take you a few years and a couple tries to get it right.

Or, y’know, if you’re a Mozilla project you could just use Glean which already has all of these things…

  1. API: The Glean SDK API aims to be ergonomic and idiomatic in each supported language.
  2. Persistent Storage: The Glean SDK uses rkv as a persistent store for unsubmitted data, and a documented flat file format for submitted but not yet sent data.
  3. Networking: The Glean SDK provides an API for embedding applications to provide their own networking stack (useful when we’re embedded in a browser), and some default implementations if you don’t care to provide one. The payload protocol is built on Structured Ingestion and has a schema that generates and deploys new versions daily.
  4. Scheduling: Each Glean SDK payload has its own schedule to respect the character of the data it contains, from as frequently as the user foregrounds the app to, at most, once a day.
  5. Errors: The Glean SDK builds user metric and internal health metrics into the SDK itself.
  6. Ingestion: The edge servers and schema validation are all documented and tested. We autoscale quite well and have a process for handling incidents.
  7. Pipeline: We have a pubsub system on GCP that handles a variety of different types of data.
  8. Warehousing: I can’t remember if we still call this the Data Lake or not.
  9. Datasets: We have a few. They are monitored. Our workflow software for deriving the datasets is monitored as well.
  10. Tooling: Quite a few of them are linked from the Telemetry Index.
  11. Tests: Each piece is tested individually. Adjacent pieces sometimes have integration suites. And Raphael recently spun up end-to-end tests that we’re very appreciative of. And if you’re just a dev wondering if your new instrumentation is working? We have the debug ping viewer.
  12. Documentation: Each piece has developer documentation. Some pieces, like the SDK, also have user documentation. And the system at large? Even more documentation.

Glean takes this incredibly complex problem, breaks it into pieces, solves each piece individually, then puts the solution together in a way that makes it greater than the sum of its parts.

All you need is to follow the six steps to integrate the Glean SDK and notify the Ecosystem that your project exists, and then your responsibilities shrink to just instrumentation and analysis.

If that isn’t frictionless data collection, I don’t know what is.

:chutten

(( If you’re not a Mozilla project, and thus don’t by default get to use the Data Platform (numbers 6-10) for your project, come find us on the #glean channel on Matrix and we’ll see what help we can get you. ))

Data Science is Hard: ALSA in Firefox

(( We’re overdue for another episode in this series on how Data Science is Hard. Today is a story from 2016 which I think illustrates many important things to do with data. ))

It’s story time. Gather ’round.

In July of 2016, Anthony Jones made the case that the Mozilla-built Firefox for Linux should stop supporting the ALSA backend (and also the WinXP WinMM backend) so that we could innovate on features for more modern audio backends.

(( You don’t need to know what an audio backend is to understand this story. ))

The code supporting ALSA would remain in tree for any Linux distribution who wished to maintain the backend and build it for themselves, but Mozilla would stop shipping Firefox with that code in it.

But how could we ensure the number of Firefoxen relying on this backend was small enough that we wouldn’t be removing something our users desperately needed? Luckily :padenot had just added an audio backend measurement to Telemetry. “We’ll have data soon,” he wrote.

By the end of August we’d heard from Firefox Nightly and Firefox Developer Edition that only 3.5% and 2% (respectively) of Linux subsessions with audio used ALSA. This was small enough for the removal to move ahead.

Fast-forward to March of 2017. Seven months have passed. The removal has wound its way through Nightly, Developer Edition, Beta, and now into the stable Release channel. Linux users following this update channel update their Firefox and… suddenly the web grows silent for a large number of users.

Bugs are filed (thirteen of them). The mailing list thread with Anthony’s original proposal is revived with some very angry language. It seems as though far more than just a fraction of a fraction of users were using ALSA. There were entire Linux distributions that didn’t ship anything besides ALSA. How did Telemetry miss them?

It turns out that many of those same ALSA-only Linux distributions also turned off Telemetry when they repackaged Firefox for their users. And for any that shipped with Telemetry at all, many users disabled it themselves. Those users’ Firefoxen had no way to phone home to tell Mozilla how important ALSA was to them… and now it was too late.

Those Linux distributions started building ALSA support into their distributed Firefox builds… and hopefully began reporting Telemetry by default to prevent this from happening again. I don’t know if they did for sure (we don’t collect fine-grained information like that because we don’t need it).

But it serves as a cautionary tale: Mozilla can only support a finite number of things. Far fewer now than we did back in 2016. We prioritize what we support based on its simplicity and its reach. That first one we can see for ourselves, and for the second we rely on data collection like Telemetry to tell us.

Counting things is harder than it looks. Counting things that are invisible is damn near impossible. So if you want to be counted: turn Telemetry on (it’s in the Preferences) and leave it on.

:chutten

Five-Year Moziversary

Wowee what a year that was. And I’m pretty sure the year to come will be even more so.

Me, in last year’s moziversary post

Oof. I hate being right for the wrong reasons. And that’s all I’ll say about COVID-19 and the rest of the 2020 dumpster fire.

In team news, Georg’s short break turned into the neverending kind as he left Mozilla late last year. We gained Michael Droettboom as our new fearless leader, and from my perspective he seems to be doing quite well at the managery things. Bea and Travis, our two newer team members, have really stepped into their roles well, providing much needed bench depth on Rust and Mobile. And Jan-Erik has taken over leadership of the SDK, freeing up Alessio to think about data collection for Web Extensions.

2020 is indeed being the Year of Glean on the Desktop with several projects already embedding the now-successful Glean SDK, including our very own mach (Firefox Build Tooling Commandline) and mozregression (Firefox Bug Regression Window Finding Tool). Oh, and Jan-Erik and I have spent ten months planning and executing on Project FOG (Firefox on Glean) (maybe you’ve heard of it), on track (more or less) to be able to recommend it for all new data collections by the end of the year.

My blogging frequency has cratered. Though I have a mitt full of ideas, I’ve spent no time developing them into proper posts beyond taking my turn at This Week in Glean. In the hopper I have “Naming Your Kid Based on how you Yell At Them”, “Tools Externalize Costs to their Users”, “Writing Code for two Wolves: Computers and Developers”, “Glean is Frictionless”, “Distributed Teams: Proposals are Inclusive”, and whatever of the twelve (Twelve?!) drafts I have saved up in wordpress that have any life in them.

Progress on my resolutions to blog more, continue improving, and put Glean on Firefox? Well, I think I’ve done the latter two. And I think those resolutions are equally valid for the next year, though I may tweak “put Glean on Firefox” to “support migrating Firefox Telemetry to Glean” which is more or less the same thing.

:chutten

This Week in Glean: Project FOG Update, end of H12020

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

It’s been a while since last I wrote on Project FOG, so I figure I should update all of you on the progress we’ve made.

A reminder: Project FOG (Firefox on Glean) is the year-long effort to bring the Glean SDK to Firefox. This means answering such varied questions as “Where are the docs going to live?” (here) “How do we update the SDK when we need to?” (this way) “How are tests gonna work?” (with difficulty) and so forth. In a project this long you can expect updates from time-to-time. So where are we?

First, we’ve added the Glean SDK to Firefox Desktop and include it in Firefox Nightly. This is only a partial integration, though, so the only builtin ping it sends is the “deletion-request” ping when the user opts out of data collection in the Preferences. We don’t actually collect any data, so the ping doesn’t do anything, but we’re sending it and soon we’ll have a test ensuring that we keep sending it. So that’s nice.

Second, we’ve written a lot of Design Proposals. The Glean Team and all the other teams our work impacts are widely distributed across a non-trivial fragment of the globe. To work together and not step on each others’ toes we have a culture of putting most things larger than a bugfix into Proposal Documents which we then pass around asynchronously for ideation, feedback, review, and signoff. For something the size and scope of adding a data collection library to Firefox Desktop, we’ve needed more than one. These design proposals are Google Docs for now, but will evolve to in-tree documentation (like this) as the proposals become code. This way the docs live with the code and hopefully remain up-to-date for our users (product developers, data engineers, data scientists, and other data consumers), and are made open to anyone in the community who’s interested in learning how it all works.

Third, we have a Glean SDK Rust API! Sorta. To limit scope creep we haven’t added the Rust API to mozilla/glean and are testing its suitability in FOG itself. This allows us to move a little faster by mixing our IPC implementation directly into the API, at the expense of needing to extract the common foundation later. But when we do extract it, it will be fully-formed and ready for consumers since it’ll already have been serving the demanding needs of FOG.

Fourth, we have tests. This was a bit of a struggle as the build order of Firefox means that any Rust code we write that touches Firefox internals can’t be tested in Rust tests (they must be tested by higher-level integration tests instead). By damming off the Firefox-adjacent pieces of the code we’ve been able to write and run Rust tests of the metrics API after all. Our code coverage is still a little low, but it’s better than it was.

Fifth, we are using Firefox’s own network stack to send pings. In a stroke of good fortune the application-services team (responsible for fan-favourite Firefox features “Sync”, “Send Tab”, and “Firefox Accounts”) was bringing a straightforward Rust networking API called Viaduct to Firefox Desktop almost exactly when we found ourselves in need of one. Plugging into Viaduct was a breeze, and now our “deletion-request” pings can correctly work their way through all the various proxies and protocols to get to Mozilla’s servers.

Sixth, we have firm designs on how to implement both the C++ and JS APIs in Firefox. They won’t be fully-fledged language bindings the way that Kotlin, Python, and Swift are (( they’ll be built atop the Rust language binding so they’re really more like shims )), but they need to have every metric type and every metric instance that a full language binding would have, so it’s no small amount of work.

But where does that leave our data consumers? For now, sadly, there’s little to report on both the input and output sides: We have no way for product engineers to collect data in Firefox Desktop (and no pings to send the data on), and we have no support in the pipeline for receiving data, not that we have any to analyse. These will be coming soon, and when they do we’ll start cautiously reaching out to potential first customers to see whether their needs can be satisfied by the pieces we’ve built so far.

And after that? Well, we need to do some validation work to ensure we’re doing things properly. We need to implement the designs we proposed. We need to establish how tasks accomplished in Telemetry can now be accomplished in the Glean SDK. We need to start building and shipping FOG and the Glean SDK beyond Nightly to Beta and Release. We need to implement the builtin Glean SDK pings. We need to document the designs so others can understand them, best practices so our users can follow them, APIs so engineers can use them, test guarantees so QA can validate them, and grand processes for migration from Telemetry to Glean so that organizations can start roadmapping their conversions.

In short: plenty has been done, and there’s still plenty to do. 

I guess we’d better be about it, then.

:chutten

This Week in Glean: How Much Does That Data Cost?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I’ve written before about data, but never tackled the business perspective. To a business, what is data? It could be considered an asset, I suppose: a tool, like a printer, to make your business more efficient.

But like that printer and other assets, data has a cost. We can quite easily look up how much it costs to store arbitrary data on AWS (less than 2.3 cents USD per GB per month) but that only provides the cost of the data at rest. It doesn’t consider what it took for the data to get there or how much it costs to be useful once it’s stored.

So let’s imagine that you come across a decision that can only be made with data. You’ve tried your best to do without it, but you really do need to know how many Live Bookmarks there are per Firefox profile… maybe it’s in wide use and we should assign someone to spruce it up. Maybe almost no one uses it and so Live Bookmarks should be removed and instead become a feature provided by extensions.

This should be easy, right? Slap the number into an HTTP payload and send it to a Mozilla-controlled server. Then just count them all up!

As one of the Data Organization’s unofficial mottos puts it: Counting is Harder Than It Looks.

Let’s look at the full lifecycle of the metric from ideation and instrumentation to expiry and deletion. I’ll measure money and time costs, being clear about the assumptions guiding my estimates and linking to sources where available.

For a rule of thumb, time costs are $50 per hour. Developers and Managers and PMs cost more than $100k per year in total compensation in many jurisdictions, and less in many others. Let’s go with this because why not. I considered ignoring labour costs altogether because these people are doing their jobs whether they’re performing their part in this collection or not… but that’s assuming they have the spare capacity and would otherwise be doing nothing. Everyone I talk to is busy, so everyone’s doing this data collection work instead of something else they could be doing: so there is an opportunity cost.

Fixed costs, like the cost of building and maintaining a data collection library, data collection pipeline, bug trackers, code review tooling, dev computers are all ignored. We could amortize that per data collection… but it’d probably work out to $0 anyway.

Also, for the purposes of measuring data we’re counting only the size of the data itself (the count of the number of Live Bookmarks). To be more complete we’d need to amortize the cost of sending the data point (HTTP headers, payload metadata, the data point’s identifier, etc.) and factor in additional complexity (transfer encoding, compression, etc.). This would require a lot of words, and in the present Firefox Telemetry system this amortizes to 0 because the “main” ping has many data points in it and gzip compression is pretty good.

Also, this is a Best Case Estimate. I keep many of my assumptions small in order to make this a lower-bound cost: what it would take if everything goes according to plan and everyone acts the way they should.

Ideation – Time: 30min, Cost: $25

How long does it take you to figure out how to measure something? You need to know the feature you’re measuring, the capabilities of the data collection library you’re using to do the measuring, and some idea of how you’ll analyse it at the other end.  If you’re trying to send something clever like the state of a customizable UI element or do something that requires custom analysis, this will take longer and take more people which will cost more money.

But for our example we know what we’re collecting: numbers of things. The data collection library is old and well understood. The analysis is straightforward. This takes one person a half hour to think through.

Instrumentation – Time: 60min, Cost: $50

Knowing the feature is not the same as knowing the code. You need a subject matter expert (developer who knows the feature and the code as well as the data collection library’s API) to figure out on exactly which line of code we should call exactly what method with exactly which count. If it’s complicated, several people may need to meet in order to figure out what to do here: are the input event timestamps the same format on Windows and Mac? Does time when the computer is asleep count or not?

For our example we have questions: Should we count the number of Live Bookmarks in the database? The number in the Bookmark Menu? The Bookmark Toolbar? What if the user deletes one, should we count before or after the delete?

This is small enough that we can find a single subject matter expert who knows it all. They read some documentation, make some decisions, write some code, and take an hour to do this themselves.

Review – Time: 30min, Cost: $25

Both the code and the data collection need review. The simplicity of the data collection and the code make this quick. Mozilla’s code review tooling helps a lot here, too. Though it takes a day or two for the Module Peer and the Data Steward to find time to get to the reviews, it only takes a combination of a half hour for them to okay it to ship.

Storage (user) – Cost: $0

Data takes up space. Its definition takes up some bytes in the Firefox binary that you installed. It takes up bytes in your computer’s memory. It takes up bytes on disk while it waits to be sent and afterwards so you can look at it if you type about:telemetry into your address bar. (Try it yourself!)

The marginal cost to the user of the tens of bytes of memory and disk from our single number of Live Bookmarks is most accurately represented as a zero not only because memory and disk are excitingly cheap these days but also because there was likely some spare capacity in those systems.

Bandwidth (user) – Cost: $0.00 (but not zero)

Data requires network bandwidth to be reported, and network bandwidth costs money. Many consumer plans are flat-rate and so the marginal cost of the extra bytes is not felt at all (we’re using a little of the slack), so we can flatten this to zero.

But hey, let’s do some recreational math for fun! (We all do this in our spare time, right? It’s not just me?)

If we were paying per-byte and sending this from a smartphone, the first GB in Canada (where mobile data makes the most money for the service providers in the world) costs $30 per month. That’s about 3 thousandths of a cent per kilobyte.

The data collection is a number, which is about 4 bytes of data. We send it about three times per day and individual profiles are in use by Firefox on average 12 days a month (engagement ratio of 0.4). (If you’re interested, this is due to a bunch of factors including users having multiple profiles at school, work, and home… but that’s another blog post).

4 bytes x 3 per day x 12 days in a month ~= 144 bytes per month

Thus a more accurate cost estimate of user bandwidth for this data would be 4 ten-thousandths of a cent (in Canadian dollars). It would take over 200 years of reporting this figure to cost the user a single penny. So let’s call it 0 for our purposes here.
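(Here’s that recreational math as a snippet, if you’d like to check my work. Python, same assumptions as above:)

bytes_per_month = 4 * 3 * 12     # a 4-byte count, thrice a day, 12 active days: 144 bytes
dollars_per_byte = 30 / 1e9      # $30 (CAD) for that first mobile gigabyte
cost_per_month = bytes_per_month * dollars_per_byte
print(f"{cost_per_month * 100:.6f} cents per month")       # ~0.000432: 4 ten-thousandths of a cent
print(f"{0.01 / cost_per_month:.0f} months to one penny")  # ~2300 months: centuries, anyway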

Though… however close the cost is to 0, it isn’t 0. This means that, over time and over enough data points and over our full Firefox population, there is a measurable cost. Though its weight is light when it is but a single data point sent infrequently by each of our users, put together it is still hefty enough that we shouldn’t ignore it.

Bandwidth (Mozilla) – Cost: $0

Internet Service Providers have a nice gig: they can charge the user when the bytes leave their machine and charge Mozilla when the bytes enter their machine. However, cloud data platform providers (Amazon’s AWS, Google’s GCP, Microsoft’s Azure, etc) don’t charge for bandwidth for the data coming into their services.

You do get charged for bandwidth _leaving_ their systems. And for anything you do _on_ their systems. If I were feeling uncharitable I guess I’d call this a vendor lock-in data roach motel.

At any rate, the cost for this step is 0.

Pipeline Processing – Cost: $15.12

Once our Live Bookmarks data reaches the pipeline, there are a few steps the data needs to go through. It needs to be decompressed, examined for adherence to the data schema (malformed data gets thrown out), and a response written to the client to tell it that we received it all okay. It needs to be processed, examined, and funneled to the correct storage locations while also being made available for realtime analysis if we need it.

For our little 4-byte number that shouldn’t be too bad, right?

Well, now that we’re on Mozilla’s side of the operation we need to consider the scale. Just how many Firefox profiles are sending how many of these numbers at us? About 250M of them each month. (At time of writing this isn’t up-to-date beyond EOY2019. Sorry about that. We’re working on it). With an engagement ratio of about 0.4, data being sent about thrice a day, and each count of Live Bookmarks taking up 4 bytes of space, we’re looking at 12GB of data per month.

At our present levels, ingestion and processing costs about $90 per TB. This comes out to $1.08 of cost for this step, each month. Multiplied by 14 “months”, that’s $15.12.

About Months

In saying “14 months” for how long the pipeline needs to put up with the collection coming from the entire Firefox population I glossed over quite a lot of detail. The main piece of information is that the default expiry for new data collections in Firefox is five or six Firefox versions (which should come out to about six months).

However, as I’ve mentioned before, updates don’t all happen at once. Though we have about 90% of the Firefox population within 3 versions of up-to-date at any one time, there’s a long tail of Firefox profiles from ancient versions sending us data.

To calculate 14 months I looked at the total data collection volumes for five versions of Firefox: Firefox 69-73 (inclusive). This avoids Firefox ESR 68 gumming up the works (its support lifetime is much longer than a normal release, and we’re aiming for a best-case cost estimate) and is both far enough in the past that Firefox 69 ought to be winding down around now _and_ is recent enough that we’ll not have thrown out the data yet (more on retention periods later) and it is closer in behaviour to releases we’re doing this year.

Here’s what that looks like:

[Time series plot showing data volumes from five Firefox versions]

So I said this was far enough in the past that Firefox 69 ought to be winding down around now? Well, if you look really closely at the bottom-right you might be able to see that we’re still receiving data from users still on that Firefox version. Lots of them.

But this is where we are in history, and I’m not running this query again (it only cost 15 cents, but it took half an hour), so let’s do the math. The total amount of data received from these five releases so far divided by the amount of data I said above that the user population would be sending each month (12GB) comes out to about 13.7 months.

To account for the seriously-annoying number of pings from those five versions that we presumably will continue receiving into the future, I rounded up to 14.

Storage (Mozilla) – Cost: $84

Once the data has been processed it needs to live somewhere. This costs us 2 cents per gigabyte stored, per month we decide to store it. 12GB per month means $0.24, right?

Well, no. We don’t have a way to only store this data for a period of time, so we need to store it for as long as the other stuff we store. For year-over-year forecasting we retain data for two years plus one month: 25 months. (Well, we presently retain data a bit longer than that, but we’re getting there.) So we need to take the 12GB we get each month and store it for 25 months. When we do that for each of the 14 “months” of data we get:

12GB/”month” x 14 “months” x $0.02 per GB per month x 25 months retention = $84

Now if you think this “2 cents per GB” figure is a little high: it is! We should be able to take advantage of lower storage costs for data we don’t write to any more. Unfortunately, we do write to it all the time servicing Deletion Requests (which I’ll get to in a little bit).

Analysis (Mozilla) – Time: 30min, Cost: $25.55

Data stored on some server someplace is of no use. Its value is derived through interrogating it, faceting its aggregations across interesting dimensions, picking it apart and putting it back together.

If this sounds like processing time Mozilla needs to pay for, you are correct!

On-demand analyses in Google’s BigQuery cost $5 per TB of data scanned. Mozilla’s spent some decent time thinking about query patterns to arrange data in a way that minimizes the amount of data we need to look at in any given analysis… but it isn’t perfect. To deliver us a count of the number of Live Bookmarks across our user base we’re going to have to scan more than the 12GB per month.

But this is a Best Case Estimate so let’s figure out how much a perfect query (one that only had to scan the data we wanted to get out of it) would cost:

12GB / 1000GB/TB * 5 $/TB = $0.06

That gives you back a sum of all the Live Bookmarks reported from all the Firefox profiles in a month. The number might be 5, or 5 million, or 5 trillion.

In other words, the number is useless. The real question you want to answer is “How much is this feature used?” which is less about the number of Live Bookmarks reported than it is Live Bookmarks stored per Firefox profile. If the 5 million Live Bookmarks are five thousand reports of 1000 Live Bookmarks all from one fellow named Nick, then we shouldn’t be investing in a feature used by one person, however much that one person uses it.

If the 5 million Live Bookmarks are one hundred thousand profiles reporting various handfuls of times a moderate number of bookmarks, then Live Bookmarks is more likely a broadly-used feature and might just need a little kick to be used even more.

So we need to aggregate the counts per-client and then look at the distribution. We can ask, over all the reports of Live Bookmarks from this one Firefox profile, give us the maximum number reported. Then show us a graph (like this). A perfect query of a month’s data will not only need to look at the 12GB of the month’s Live Bookmarks count, but also the profile identifier (client_id) so we can deduplicate reports. That id is a UUID and is represented as a 36-byte string. This adds another 8x data to scan compared to the 4B Live Bookmarks count we were previously looking at, ballooning our query to 108GB and our cost to $0.54.

But wait! We’re doing two steps: one to crunch these down to the 250M profiles that reported data that month and then a second to count the counts (to make our graph). That second step needs to scan the 250M 4B “maximum counts”, which adds another half a cent.
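(At laptop scale, those two steps look something like this in pandas; the real thing is SQL over BigQuery, but the shape is identical:)

import pandas as pd

# A hypothetical month of reports: several rows per profile.
reports = pd.DataFrame({
    "client_id":      ["a", "a", "b", "c", "c", "c"],
    "live_bookmarks": [3,   5,   0,   1000, 990, 1000],
})

# Step one: crunch down to one row per profile (the maximum count it reported).
per_client = reports.groupby("client_id")["live_bookmarks"].max()

# Step two: count the counts, giving the distribution we'd actually graph.
distribution = per_client.value_counts().sort_index()
print(distribution)  # how many profiles reported 0, how many 5, how many 1000...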

So our Best Case Estimate for querying the data to get the answer to our question is: $0.55 (I rounded up the half cent).

But don’t forget you need an analyst to perform this analysis! Assuming you have a mature suite of data analysis tooling, some rigorous documentation, and a well-entrenched culture of everyone helping everyone, this shouldn’t take longer than a half-hour of a single person’s time. Which is another $25, coming to a grand total of $25.55.

Deletion – Cost: $21

The data’s journey is not complete because any time a user opts their Firefox profile out of data collection we receive an order to delete what data we’ve previously received from that profile. To delete we need to copy out all the not-deleted data into new partitions and drop the old ones. This is a processing cost that is currently using the ad hoc $5/TB rate every time we process a batch of deletions (monthly).

Our Live Bookmarks count is adding 4 bytes of data per row that needs to be copied over. Each of those counts (excepting the ones that are deleted) needs to be copied over 25 times (retention period of 25 months). The amount of deleted data is small (Firefox’s data collection is very specifically designed to only collect what is necessary, so you shouldn’t ever feel as though you need to opt out and trigger deletion) so we’ll ignore its effect on the numbers for the purposes of making this easier to calculate.

12 GB/”month” x 14 “months” x 25 deletions / 1000GB/TB x 5 $/TB = $21

The total lifetime cost of all the deletion batches we process for the Live Bookmarks counts we record is $21. We’re hoping to knock this down a few pegs in cost, but it’ll probably remain in the “some dollars” order of magnitude.

The bigger share of this cost is actually in Storage, above. If we didn’t have to delete our data then, after 90 days, storage costs drop by half per month. This means that, if you want to assign the dollars a little more like blame, Storage costs are “only” $52.08 (full price for 3 months, half for 22) and Deletion costs are $52.92.

Grand Total: $245.67

In the best case, a collection of a single number from the wide Firefox user base will cost Mozilla almost $246 over the collection’s lifetime, split about 50% between labour and data computing platform costs.
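(For the spreadsheet-inclined, here’s the tally in one place, as a quick Python check of the figures above:)

labour = {"ideation": 25.00, "instrumentation": 50.00, "review": 25.00, "analysis": 25.00}
platform = {"pipeline": 15.12, "storage": 84.00, "queries": 0.55, "deletion": 21.00}
total = sum(labour.values()) + sum(platform.values())
print(f"labour: ${sum(labour.values()):.2f}")      # $125.00 in people's time
print(f"platform: ${sum(platform.values()):.2f}")  # $120.67 in data computing platform costs
print(f"grand total: ${total:.2f}")                # $245.67, split roughly 50/50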

So that’s it? Call it a wrap? Well… no. There are some cautionary tales to be learned here.

Lessons

0) Lean Data Practices save money. Our Data Collection Review Request form ensures that we aren’t adding these costs to Mozilla and our users without justifying that the collection is necessary. These practices were put into place to protect our users’ privacy, but they do an equally good job of reducing costs.

1) The simplest permanent data collection costs $228 its first year and $103 every year afterwards even if you never look at it again. It costs $25 (30min) to expire a collection, which pays for itself in a maximum of 2.9 months (the payback period is much shorter if the data collection is bigger than 4B (like a Histogram) because the yearly costs are higher). The best time to have expired that collection was ages ago: the second-best time is now.

2) Spending extra time thinking about a data collection saves you time and money. Even if you uplift a quick expiry patch for a mis-measured collection, the nature of Firefox releases is such that you would still end up paying nearly all of the same $245.67 for a useless collection as you would for a correct one. Spend the time ahead of time to save the expense. Especially for permanent collections.

3) Even small improvements in documentation, process, and tooling will result in large savings. Half of this cost is labour, and lesson #2 is recommending you spend more time on it. Good documentation enables good decisions to be made confidently. Process protects you from collecting the wrong thing. Tooling catches mistakes before they make their way out into the wild. Even small things like consistent naming and language will save time and protect you from mistakes. These are your force multipliers.

4) To reduce costs, efficient data representations matter, and quickly-expiring data collections matter more.

5) Retention periods should be set as short as possible. You shouldn’t have to store Live Bookmarks counts from 2+ years ago.

Where Does Glean Fit In

Glean‘s focus on high-level metric types, end-to-end-testable data collections, and consistent naming makes mistakes in instrumentation easier to find during development. Rather than waiting for instrumentation code to reach release before realizing it isn’t correct, Glean is designed to help you catch those errors earlier.

Also, Glean’s use of per-application identifiers and emphasis on custom pings allows for data segregation that allows for different retention periods per-application or per-feature (e.g. the “metrics” ping might not need to be retained for 25 months even if the “baseline” ping does. And Firefox Desktop’s retention periods could be configured to be of a different length than Firefox Lockwise‘s) and reduces data scanned per analysis. And a consistent ping format and continued involvement of Data Science through design and development reduces analyst labour costs.

Basically the only thing we didn’t address was efficient data transfer encodings, and since Glean controls its ping format as an internal detail (unlike Telemetry) we could decide to address that later on without troubling Product Developers or Data Science.

There’s no doubt more we could do (and if you come up with something, do let us know!), but already I’m confident Glean will be worth its weight in Canadian Dollars.

:chutten

(( Special thanks to :jason and :mreid for helping me nail down costs for the pipeline pieces and for the broader audience of Data Engineers, Data Scientists, Telemetry Engineers, and other folks who reviewed the draft. ))

Distributed Teams: Not Just Working From Home

Technology companies’ curve-flattening exercises of late have resulted in me digging up my old 2017 talk about working as and working with remote employees. Though all of the advice in it holds up even these three years later, surprisingly little of it seemed all that relevant to the newly-working-from-home (WFH) multitudes.

Thinking about it, I reasoned that it’s because the talk (slides are here if you want ’em) is actually more about working on a distributed team than working from home. Though it contained the usual WFH gems of “have a commute”, “connect with people”, “overcommunicate”, etc etc (things that others have explained much better than I ever will); it also spent a significant amount of its time talking about things that are only relevant if your team isn’t working in the same place.

Aspects of distributed work that are unique not to my not being in the office but to my being on a distributed team are things like timezones, cultural differences, personal schedules, presentation, watercooler chats, identity… things that you don’t have to think about or spend effort on if you work in the same place (and, not coincidentally, things I’ve written about in the past). If we’re all in Toronto you know not only that 12cm of snow fell since last night but also what that does to the city in the morning. If we’re all in Italy you know not to schedule any work in August. If we see each other all the time then I can use a picture I took of a glacier in Iceland for my avatar instead of using it as a rare opportunity to be able to show you my face.

So as much as I was hoping that all this sudden interest in WFH was going to result in a sea change in how working on a distributed team is viewed and operates, I’m coming to the conclusion that things probably will not change. Maybe we’ll get some better tools… but none that know anything about being on a distributed team (like how “working hours” aren’t always contiguous (looking at you, Google Calendar)).

At least maybe people will stop making the same seven jokes about how WFH means you’re not actually working.

:chutten

Jira, Bugzilla, and Tales of Issue Trackers Past

It seems as though Mozilla is never not in a period of transition. The distributed nature of the organization and community means that teams and offices and any informal or formal group is its own tiny experimental plot tended by gardeners with radically different tastes.

And if there’s one thing that unites gardeners and tech workers, it’s that both have Feelings about their tools.

Tools are personal things: they’re the only thing that allows us to express ourselves in our craft. I can’t code without an editor. I can’t prune without shears. They’re the part of our work that we actually touch. The code lives Out There, the garden is Outside… but the tools are in our hands.

But tools can also be group things. A shed is a tool for everyone’s tools. A workshop is a tool that others share. An Issue Tracker is a tool that helps us all coordinate work.

And group things require cooperation, agreement, and compromise.

While I was on the Browser team at BlackBerry I used a variety of different Issue Trackers. We started with an outdated version of FogBugz, then we had a Bugzilla fork for the WebKit porting work and MKS Integrity for everything else across the entire company, and then we all standardized on Jira.

With minimal customization, Jira and MKS Integrity both seemed to be perfectly adequate Issue Tracking Software. They had id numbers, relationships, state, attachments, comments… all the things you need in an Issue Tracker. But they couldn’t just be “perfectly adequate”, they had to be better enough than what they were replacing to warrant the switch.

In other words, to make the switch the new thing needs to do something that the previous one couldn’t, wouldn’t, or just didn’t do (or you’ve forgotten that it did). And every time Jira or MKS is customized it seems to stop being Issue Tracking Software and start being Workflow Management Software.

Perhaps because the people in charge of the customization are interested more in workflows than in Issue Tracking?

Regardless of why, once they become Workflow Management Software they become incomparable with Issue Trackers. Apples and Oranges. You end up optimizing for similar but distinct use cases as it might become more important to report about issues than it is to file and fix and close them.

And that’s the state Mozilla might be finding itself in right now as a few teams here and there try to find the best tools for their garden and settle upon Jira. Maybe they tried building workflows in Bugzilla and didn’t make it work. Maybe they were using Github Issues for a while and found it lacking. We already had multiple places to file issues, but now some of the places are Workflow Management Software.

And the rumbling has begun. And it’s no wonder, as even tools that are group things are still personal. They’re still what we touch when we craft.

The GNU-minded part of me thinks that workflow management should be built above and separate from issue tracking by the skillful use of open and performant interfaces. Bugzilla lets you query whatever you want, however you want, so why not build reporting Over There and leave me my issue tracking Here where I Like It.

The practical-minded part of me thinks that it doesn’t matter what we choose, so long as we do it deliberately and consistently.

The schedule-minded part of me notices that I should probably be filing and fixing issues rather than writing on them. And I think now’s the time to let that part win.

:chutten

This Week in Glean: A Distributed Team Echoes Distributed Workflow

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

Last Week: Extending Glean: build re-usable types for new use-cases by Alessio


I was recently struck by a realization that the position of our data org’s team members around the globe mimics the path that data flows through the Glean Ecosystem.

Glean Data takes this five-fold path (corresponding to five teams):

  1. Data is collected in a client using the Glean SDK (Glean Team)
  2. Data is transmitted to the Structured Ingestion pipeline (Data Platform)
  3. Data is stored and maintained in our infrastructure (Data Operations)
  4. Data is presented in our tools (Data Tools)
  5. Data is analyzed and reported on (Data Science)

The geographical midpoint of the Glean Team is about halfway across the North Atlantic. For Data Platform it’s on the continental US, anchored by three members in the midwestern US. Data Ops is further west still, with four members in the Pacific timezone and no Europeans. Data Tools breaks the trend by being a bit further east, with fewer westcoasters. Data Science (for Firefox) is centred farther west still, with only two members east of the Rocky Mountains.

Or, graphically:

[Map of the (approximate) Team Geocentres]
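(If you’re wondering how to compute a geocentre: one way is to average the members’ positions as 3D unit vectors and convert the mean back to latitude and longitude. A Python sketch with made-up coordinates, since I won’t be publishing anyone’s real ones:)

import math

def geocentre(points):
    # Average (lat, lon) pairs as unit vectors in 3D, then convert back to degrees.
    x = y = z = 0.0
    for lat, lon in points:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon

# Hypothetical two-person team: one member near Toronto, one near Berlin.
print(geocentre([(43.7, -79.4), (52.5, 13.4)]))  # lands in the North Atlantic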

Given the rotation of the Earth, the sun rises first on the Glean Team and the data collected by the Glean SDK. Then the data and the sun move West to the Data Platform where it is ingested. Data Tools gets the data from the Platform as morning breaks over Detroit. Data Operations keeps it all running from the midwest. And finally, the West Coast Centre of Firefox Data Science Excellence greets the data from a mountaintop, to try and make sense of it all.

(( Lying orthogonal to the data organization is the secret Sixth Glean Data “Team”: Data Stewardship. They ensure all Glean Data is collected in accordance with Mozilla’s Privacy Promise. The sun never sets on the Stewardship’s global coverage, and it’s a volunteer effort supplied from eight teams (and growing!), so I’ve omitted them from this narrative. ))

Bad metaphors about sunlight aside, I wonder whether this is random or whether this is some sort of emergent behaviour.

Conway’s Law suggests that our system architecture will tend to reflect our orgchart (well, the law is a bit more explicit about “communication structure” independent of organizational structure, but in the data org’s case they’re pretty close). Maybe this is a specific example of that: data architecture as a reflection of orgchart geography.

Or perhaps five dots on a globe that are only mostly in some order is too weak of a coincidence to even bear thinking about? Nah, where’s the blog post in that thinking…

If it’s emergent, it then becomes interesting to consider the “chicken and egg” point of view: did the organization beget the system or the system beget the organization? When I joined Mozilla some of these teams didn’t exist. Some of them only kinda existed within other parts of the company. So is the situation we’re in today a formalization by us of a structure that mirrors the system we’re building, or did we build the system in this way because of the structure we already had?

mindblown.gif

:chutten