This Week in Glean: The Three Roles of Data Engagements

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.) All “This Week in Glean” blog posts are listed in the TWiG index).

I’ve just recently started my sixth year working at Mozilla on data and data-adjacent things. In those years I’ve started to notice some patterns in how data is approached, so I thought I’d set them down in a TWiG because Glean’s got a role to play in them.

Data Engagements

A Data Engagement is when there’s a question that needs to engage with data to be answered. Something like “How many bookmarks are used by Firefox users?”.

(No one calls these Data Engagements but me, and I only do because I need to call them _something_.)

I’ve noticed three roles in Data Engagements at Mozilla:

  1. Data Consumer: The Question-Asker. The Temperature-Taker. This is the one who knows what questions are important, and is frustrated without an answer until and unless data can be collected and analysed to provide it. “We need to know how many bookmarks are used to see if we should invest more in bookmark R&D.”
  2. Data Analyst: The Answer-Maker. The Stats-Cruncher. This is the one who can use Data to answer a Consumer’s Question. “Bookmarks are used by Canadians more than Mexicans most of the time, but only amongst profiles that have at least one bookmark.”
  3. Data Instrumentor: The Data-Digger. The Code-Implementor. This one can sift through product code and find the correct place to collect the right piece of data. “The Places database holds many things, we’ll need to filter for just bookmarks to count them.”

(diagrams courtesy of :brizental)

It’s through these three working in concert — The Consumer having a question that the Instrumentor instruments to generate data the Analyst can analyse to return an answer back to the Consumer — that a Data Engagement succeeds.

At Mozilla, Data Engagements succeed very frequently in certain circumstances. The Graphics team answers many deeply-technical questions about Firefox running in the wild to determine how well WebRender is working. The Telemetry team examines the health of the data collection system as a whole. Mike Conley’s old Tab Switcher Dashboard helped find and solve performance regressions in (unsurprisingly) Tab Switching. These go well, and there’s a common thread here that I think is the secret of why: 

In these and the other high-success-rate Data Engagements, all three roles (Consumer, Analyst, and Instrumentor) are embodied by the same person.

It’s a common problem in the industry. It’s hard to build anything at all, but it’s least hard to build something for yourself. When you are in yourself the Question-Asker, Answer-Maker, and Data-Digger, you don’t often mistakenly dig the wrong data to create an answer that isn’t to the question you had in mind. And when you accidentally do make a mistake (because, remember, this is hard), you can go back in and change the instrumentation, update the analysis, or reword the question.

But when these three roles are in different parts of the org, or different parts of the planet, things get harder. Each role is now trying to speak the others’ languages and infer enough context to do their jobs independently.

In comes the Data Org at Mozilla which has had great successes to date on the theme of “Making it easier for anyone to be their own Analyst”. Data Democratization. When you’re your own Analyst, then there’s fewer situations when the roles are disparate: Instrumentors who are their own Analysts know when data won’t be the right shape to answer their own questions and Consumers who are their own Analysts know when their questions aren’t well-formed.

Unfortunately we haven’t had as much success in making the other roles more accessible. Everyone can theoretically be their own Consumer: curiosity in a data-rich environment is as common as lanyards at an industry conference[1]. Asking _good_ questions is hard, though. Possible, but hard. You could just about imagine someone in a mature data organization becoming able to tell the difference between questions that are important and questions that are just interesting through self-serve tooling and documentation.

As for being your own Instrumentor… that is something that only a small fraction of folks have the patience to do. I (and Mozilla’s Community Managers) welcome you to try: it is possible to download and build Firefox yourself. It’s possible to find out which part of the codebase controls which pieces of UI. It’s… well, it’s more than possible, it’s actually quite pleasant to add instrumentation using Glean… but on the whole, if you are someone who _can_ Instrument Firefox Desktop you probably already have a copy of the source code on your hard drive. If you check right now and it’s not there, then there’s precious little likelihood that will change.

(Unless you come and work for Mozilla, that is.)

So let’s assume for now that democratizing instrumentation is impossible. Why does it matter? Why should it matter that the Consumer is a separate person from the Instrumentor?

Communication

Each role communicates with each other role with a different language:

  • Consumers talk to Instrumentors and Analysts in units of Questions and Answers. “How many bookmarks are there? We need to know whether people are using bookmarks.”
  • Analysts speak Data, Metadata, and Stats. “The median number of bookmarks is, according to a representative sample of Firefox profiles, twelve (confidence interval 99.5%).”
  • Instrumentors speak Data and Code. “There’s a few ways we delete bookmarks, we should cover them all to make sure the count’s correct when the next ping’s sent”

Some more of the Data Org and Mozilla’s greatest successes involve supplying context at the points in a Data Engagement where they’re most needed. We’ve gotten exceedingly good at loading context about data (metadata) to facilitate communication between Instrumentors and Analysts with tools like Glean Dictionary.

Ah, but once again the weak link appears to be the communication of Questions and Answers between Consumers and Instrumentors. Taking the above example, does the number of bookmarks include folders?

The Consumer knows, but the further away they sit from the Instrumentor, the less likely that the data coming from the product and fueling the analysis will be the “correct” one.

(Either including or excluding folders would be “correct” for different cases. Which one do you think was “more correct”?)

So how do we improve this?

Glean

Well, actually, Glean doesn’t have a solution for this. I don’t actually know what the solutions are. I have some ideas. Maybe we should share more context between Consumers and Instrumentors somehow. Maybe we should formalize the act of question-asking. Maybe we should build into the Glean SDK a high-enough level of metric abstraction that instead of asking questions, Consumers learn to speak a language of metrics.

The one thing I do know is that Glean is absolutely necessary to making any of these solutions possible. Without Glean, we have too many systems that are fractally complex for any context to be relevantly shared. How can we talk about sharing context about bookmark counts when we aren’t even counting things consistently[2]?

Glean brings that consistency. And from there we get to start solving these problems.

Expect me to come back to this realm of Engagements and the Three Roles in future posts. I’ve been thinking about:

  • how tooling affects the languages the roles speak amongst themselves and between each other,
  • how the roles are distributed on the org chart,
  • which teams support each role,
  • how Data Stewardship makes communication easier by adding context and formality,
  • how Telemetry and Glean handle the same situations in different ways, and
  • what roles Users play in all this. No model about data is complete without considering where the data comes from.

I’m not sure how many I’ll actually get to, but at least I have ideas.

:chutten

[1] Other rejected similes include “as common as”: maple syrup on Canadian breakfast tables, frustration in traffic, sense isn’t.

[2] Counting is harder than it looks.

This Week in Glean: Data Reviews are Important, Glean Parser makes them Easy

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.) All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we put a lot of stock in Openness. Source? Open. Bug tracker? Open. Discussion Forums (Fora?)? Open (synchronous and asynchronous).

We also have an open process for determining if a new or expanded data collection in a Mozilla project is in line with our Privacy Principles and Policies: Data Review.

Basically, when a new piece of instrumentation is put up for code review (or before, or after), the instrumentor fills out a form and asks a volunteer Data Steward to review it. If the instrumentation (as explained in the filled-in form) is obviously in line with our privacy commitments to our users, the Data Steward gives it the go-ahead to ship.

(If it isn’t _obviously_ okay then we kick it up to our Trust Team to make the decision. They sit next to Legal, in case you need to find them.)

The Data Review Process and its forms are very generic. They’re designed to work for any instrumentation (tab count, bytes transferred, theme colour) being added to any project (Firefox Desktop, mozilla.org, Focus) and being collected by any data collection system (Firefox Telemetry, Crash Reporter, Glean). This is great for the process as it means we can use it and rely on it anywhere.

It isn’t so great for users _of_ the process. If you only ever write Data Reviews for one system, you’ll find yourself answering the same questions with the same answers every time.

And Glean makes this worse (better?) by including in its metrics definitions almost every piece of information you need in order to answer the review. So now you get to write the answers first in YAML and then in English during Data Review.

But no more! Introducing glean_parser data-review and mach data-review: command-line tools that will generate for you a Data Review Request skeleton with all the easy parts filled in. It works like this:

  1. Write your instrumentation, providing full information in the metrics definition.
  2. Call python -m glean_parser data-review <bug_number> <list of metrics.yaml files> (or mach data-review <bug_number> if you’re adding the instrumentation to Firefox Desktop).
  3. glean_parser will parse the metrics definitions files, pull out only the definitions that were added or changed in <bug_number>, and then output a partially-filled-out form for you.

Here’s an example. Say I’m working on bug 1664461 and add a new piece of instrumentation to Firefox Desktop:

fog.ipc:
  replay_failures:
    type: counter
    description: |
      The number of times the ipc buffer failed to be replayed in the
      parent process.
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_sensitivity:
      - technical
    notification_emails:
      - chutten@mozilla.com
      - glean-team@mozilla.com
    expires: never

I’m sure to fill in the `bugs` field correctly (because that’s important on its own _and_ it’s what glean_parser data-review uses to find which data I added), and have categorized the data_sensitivity. I also included a helpful description. (The data_reviews field currently points at the bug I’ll attach the Data Review Request for. I’d better remember to come back before I land this code and update it to point at the specific comment…)

Then I can simply use mach data-review 1664461 and it spits out:

!! Reminder: it is your responsibility to complete and check the correctness of
!! this automatically-generated request skeleton before requesting Data
!! Collection Review. See https://wiki.mozilla.org/Data_Collection for details.

DATA REVIEW REQUEST
1. What questions will you answer with this data?

TODO: Fill this in.

2. Why does Mozilla need to answer these questions? Are there benefits for users?
   Do we need this information to address product or business requirements?

TODO: Fill this in.

3. What alternative methods did you consider to answer these questions?
   Why were they not sufficient?

TODO: Fill this in.

4. Can current instrumentation answer these questions?

TODO: Fill this in.

5. List all proposed measurements and indicate the category of data collection for each
   measurement, using the Firefox data collection categories found on the Mozilla wiki.

Measurement Name | Measurement Description | Data Collection Category | Tracking Bug
---------------- | ----------------------- | ------------------------ | ------------
fog_ipc.replay_failures | The number of times the ipc buffer failed to be replayed in the parent process.  | technical | https://bugzilla.mozilla.org/show_bug.cgi?id=1664461


6. Please provide a link to the documentation for this data collection which
   describes the ultimate data set in a public, complete, and accurate way.

This collection is Glean so is documented
[in the Glean Dictionary](https://dictionary.telemetry.mozilla.org).

7. How long will this data be collected?

This collection will be collected permanently.
**TODO: identify at least one individual here** will be responsible for the permanent collections.

8. What populations will you measure?

All channels, countries, and locales. No filters.

9. If this data collection is default on, what is the opt-out mechanism for users?

These collections are Glean. The opt-out can be found in the product's preferences.

10. Please provide a general description of how you will analyze this data.

TODO: Fill this in.

11. Where do you intend to share the results of your analysis?

TODO: Fill this in.

12. Is there a third-party tool (i.e. not Telemetry) that you
    are proposing to use for this data collection?

No.

As you can see, this Data Review Request skeleton comes partially filled out. Everything you previously had to mechanically fill out has been done for you, leaving you more time to focus on only the interesting questions like “Why do we need this?” and “How are you going to use it?”.

Also, this saves you from having to remember the URL to the Data Review Request Form Template each time you need it. We’ve got you covered.

And since this is part of Glean, this means this is already available to every project you can see here. This isn’t just a Firefox Desktop thing. 

Hope this saves you some time! If you can think of other time-saving improvements we could add once to Glean so every Mozilla project can take advantage of, please tell us on Matrix.

If you’re interested in how this is implemented, glean_parser’s part of this is over here, while the mach command part is here.

:chutten

This Week in Glean: Firefox Telemetry is to Glean as C++ is to Rust

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I had this goofy idea that, like Rust, the Glean SDKs (and Ecosystem) aim to bring safety and higher-level thought to their domain. This is in comparison to how, like C++, Firefox Telemetry is built out of flexible primitives that assume you very much know what you’re doing and cannot (will not?) provide any clues in its design as to how to do things properly.

I have these goofy thoughts a lot. I’m a goofy guy. But the more I thought about it, the more the comparison seemed apt.

In Glean wherever we can we intentionally forbid behaviour we cannot guarantee is safe (e.g. we forbid non-commutative operations in FOG IPC, we forbid decrementing counters). And in situations where we need to permit perhaps-unsafe data practices, we do it in tightly-scoped areas that are identified as unsafe (e.g. if a timing_distribution uses accumulate_raw_samples_nanos you know to look at its data with more skepticism).

In Glean we encourage instrumentors to think at a higher level (e.g. memory_distribution instead of a Histogram of unknown buckets and samples) thereby permitting Glean to identify errors early (e.g. you can’t start a timespan twice) and allowing Glean to do clever things about it (e.g. in our tooling we know counter metrics are interesting when summed, but quantity metrics are not). Speaking of those errors, we are able to forbid error-prone behaviour through design and use of language features (e.g. In languages with type systems we can prevent you from collecting the wrong type of data) and when the error is only detectable at runtime we can report it with a high degree of specificity to make it easier to diagnose.

There are more analogues, but the metaphor gets strained. (( I mean, I guess a timing_distribution’s `TimerId` is kinda the closest thing to a borrow checker we have? Maybe? )) So I should probably stop here.

Now, those of you paying attention might have already seen this relationship. After all, as we all know, glean-core (which underpins most of the Glean SDKs regardless of language) is actually written in Rust whereas Firefox Telemetry’s core of Histograms, Scalars, and Events is written in C++. Maybe we shouldn’t be too surprised when the language the system is written in happens to be reflected in the top-level design.

But! glean-core was (for a long time) written in Kotlin from stem to stern. So maybe it’s not due to language determinism and is more to do with thoughtful design, careful change processes, and a list of principles we hold to firmly as the number of supported languages and metric types continues to grow.

I certainly don’t know. I’m just goofing around.

:chutten

Responsible Data Collection is Good, Actually (Ubisoft Data Summit 2021)

In June I was invited to talk at Ubisoft’s Data Summit about how Mozilla does data. I’ve given a short talk on this subject before, but this was an opportunity to update the material, cover more ground, and include more stories. The talk, including questions, comes in at just under an hour and is probably best summarized by the synopsis:

Learn how responsible data collection as practiced at Mozilla makes cataloguing easy, stops instrumentation mistakes before they ship, and allows you to build self-serve analysis tooling that gets everyone invested in data quality. Oh, and it’s cheaper, too.

If you want to skip to the best bits, I included shameless advertising for Mozilla VPN at 3:20 and becoming a Mozilla contributor at 14:04, and I lose my place in my notes at about 29:30.

Many thanks to Mathieu Nayrolles, Sebastien Hinse and the Data Summit committee at Ubisoft for guiding me through the process and organizing a wonderful event.

:chutten

Data Science is Interesting: Why are there so many Canadians in India?

Any time India comes up in the context of Firefox and Data I know it’s going to be an interesting day.

They’re our largest Beta population:

pie chart showing India by far the largest at 33.2%

They’re our second-largest English user base (after the US):

pie chart showing US as largest with 37.8% then India with 10.8%

 

But this is the interesting stuff about India that you just take for granted in Firefox Data. You come across these factoids for the first time and your mind is all blown and you hear the perhaps-apocryphal stories about Indian ISPs distributing Firefox Beta on CDs to their customers back in the Firefox 4 days… and then you move on. But every so often something new comes up and you’re reminded that no matter how much you think you’re prepared, there’s always something new you learn and go “Huh? What? Wait, what?!”

Especially when it’s India.

One of the facts I like to trot out to catch folks’ interest is how, when we first released the Canadian English localization of Firefox, India had more Canadians than Canada. Even today India is, after Canada and the US, the third largest user base of Canadian English Firefox:

pie chart of en-CA using Firefox clients by country. Canada at 75.5%, US at 8.35%, then India at 5.41%

 

Back in September 2018 Mozilla released the official Canadian English-localized Firefox. You can try it yourself by selecting it from the drop down menu in Firefox’s Preferences/Options in the “Language” section. You may have to click ‘Search for More Languages’ to be able to add it to the list first, but a few clicks later and you’ll be good to go, eh?

(( Or, if you don’t already have Firefox installed, you can select which language and dialect of Firefox you want from this download page. ))

Anyhoo, the Canadian English locale quickly gained a chunk of our install base:

uptake chart for en-CA users in Firefox in September 2018. Shows a sharp uptake followed by a weekly seasonal pattern with weekends lower than week days

…actually, it very quickly gained an overlarge chunk of our install base. Within a week we’d reached over three quarters of the entire Canadian user base?! Say we have one million Canadian users, that first peak in the chart was over 750k!

Now, we Canadian Mozillians suspected that there was some latent demand for the localized edition (they were just too polite to bring it up, y’know)… but not to this order of magnitude.

So back around that time a group of us including :flod, :mconnor, :catlee, :Aryx, :callek (and possibly others) fell down the rabbit hole trying to figure out where these Canadians were coming from. We ran down the obvious possibilities first: errors in data, errors in queries, errors in visualization… who knows, maybe I was counting some clients more than once a day? Maybe I was counting other Englishes (like South African and Great Britain) as well? Nothing panned out.

Then we guessed that maybe Canadians in Canada weren’t the only ones interested in the Canadian English localization. Originally I think we made a joke about how much Canadians love to travel, but then the query stopped running and showed us just how many Canadians there must be in India.

We were expecting a fair number of Canadians in the US. It is, after all, home to Firefox’s largest user base. But India? Why would India have so many Canadians? Or, if it’s not Canadians, why would Indians have such a preference for the English spoken in ten provinces and three territories? What is it about one of two official languages spoken from sea to sea to sea that could draw their attention?

Another thing that was puzzling was the raw speed of the uptake. If users were choosing the new localization themselves, we’d have seen a shallow curve with spikes as various news media made announcements or as we started promoting it ourselves. But this was far sharper an incline. This spoke to some automated process.

And the final curiosity (or clue, depending on your point of view) was discovered when we overlaid British English (en-GB) on top of the Canadian English (en-CA) uptake and noticed that (after accounting for some seasonality at the time due to the start of the school year) this suddenly-large number of Canadian English Firefoxes was drawn almost entirely from the number previously using British English:

chart showing use of British and Canadian English in Firefox in September 2018. The rise in use of Canadian English is matched by a fall in the use of British English.

It was with all this put together that day that lead us to our Best Guess. I’ll give you a little space to make your own guess. If you think yours is a better fit for the evidence, or simply want to help out with Firefox in Canadian English, drop by the Canadian English (en-CA) Localization matrix room and let us know! We’re a fairly quiet bunch who are always happy to have folks help us keep on top of the new strings added or changed in Mozilla projects or just chat about language stuff.

Okay, got your guess made? Here’s ours:

en-CA is alphabetically before en-GB.

Which is to say that the Canadian English Firefox, when put in a list with all the other Firefox builds (like this one which lists all the locales Firefox 88 comes in for Windows 64-bit), comes before the British English Firefox. We assume there is a population of Firefoxes, heavily represented in India (and somewhat in the US and elsewhere), that are installed automatically from a list like this one. This automatic installation is looking for the first English build in this list, and it doesn’t care which dialect. Starting September of 2018, instead of grabbing British English like it’s been doing for who knows how long, it had a new English higher in the list: Canadian English.

But who can say! All I know is that any time India comes up in the data, it’s going to be an interesting day.

:chutten

This Week in Glean: How Much Does That Data Cost?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I’ve written before about data, but never tackled the business perspective. To a business, what is data? It could be considered an asset, I suppose: a tool, like a printer, to make your business more efficient.

But like that printer and other assets, data has a cost. We can quite easily look up how much it costs to store arbitrary data on AWS (less than 2.3 cents USD per GB per month) but that only provides the cost of the data at rest. It doesn’t consider what it took for the data to get there or how much it costs to be useful once it’s stored.

So let’s imagine that you come across a decision that can only be made with data. You’ve tried your best to do without it, but you really do need to know how many Live Bookmarks there are per Firefox profile… maybe it’s in wide use and we should assign someone to spruce it up. Maybe almost no one uses it and so Live Bookmarks should be removed and instead become a feature provided by extensions.

This should be easy, right? Slap the number into an HTTP payload and send it to a Mozilla-controlled server. Then just count them all up!

As one of the Data Organization’s unofficial mottos puts it: Counting is Harder Than It Looks.

Let’s look at the full lifecycle of the metric from ideation and instrumentation to expiry and deletion. I’ll measure money and time costs, being clear about the assumptions guiding my estimates and linking to sources where available.

For a rule of thumb, time costs are $50 per hour. Developers and Managers and PMs cost more than $100k per year in total compensation in many jurisdictions, and less in many others. Let’s go with this because why not. I considered ignoring labour costs altogether because these people are doing their jobs whether they’re performing their part in this collection or not… but that’s assuming they have the spare capacity and would otherwise be doing nothing. Everyone I talk to is busy, so everyone’s doing this data collection work instead of something else they could be doing: so there is an opportunity cost.

Fixed costs, like the cost of building and maintaining a data collection library, data collection pipeline, bug trackers, code review tooling, dev computers are all ignored. We could amortize that per data collection… but it’d probably work out to $0 anyway.

Also, for the purposes of measuring data we’re counting only the size of the data itself (the count of the number of Live Bookmarks). To be more complete we’d need to amortize the cost of sending the data point (HTTP headers, payload metadata, the data point’s identifier, etc.) and factor in additional complexity (transfer encoding, compression, etc.). This would require a lot of words, and in the present Firefox Telemetry system this amortizes to 0 because the “main” ping has many data points in it and gzip compression is pretty good.

Also, this is a Best Case Estimate. I make many assumptions small in order to make this a lower-bound cost if everything goes according to plan and everyone acts the way they should.

Ideation – Time: 30min, Cost: $25

How long does it take you to figure out how to measure something? You need to know the feature you’re measuring, the capabilities of the data collection library you’re using to do the measuring, and some idea of how you’ll analyse it at the other end.  If you’re trying to send something clever like the state of a customizable UI element or do something that requires custom analysis, this will take longer and take more people which will cost more money.

But for our example we know what we’re collecting: numbers of things. The data collection library is old and well understood. The analysis is straightforward. This takes one person a half hour to think through.

Instrumentation – Time: 60min, Cost: $50

Knowing the feature is not the same as knowing the code. You need a subject matter expert (developer who knows the feature and the code as well as the data collection library’s API) to figure out on exactly which line of code we should call exactly what method with exactly which count. If it’s complicated, several people may need to meet in order to figure out what to do here: are the input event timestamps the same format on Windows and Mac? Does time when the computer is asleep count or not?

For our example we have questions: Should we count the number of Live Bookmarks in the database? The number in the Bookmark Menu? The Bookmark Toolbar? What if the user deletes one, should we count before or after the delete?

This is small enough that we can find a single subject matter expert who knows it all. They read some documentation, make some decisions, write some code, and take an hour to do this themselves.

Review – Time: 30min, Cost $25

Both the code and the data collection need review. The simplicity of the data collection and the code make this quick. Mozilla’s code review tooling helps a lot here, too. Though it takes a day or two for the Module Peer and the Data Steward to find time to get to the reviews, it only takes a combination of a half hour for them to okay it to ship.

Storage (user) – Cost: $0

Data takes up space. Its definition takes up some bytes in the Firefox binary that you installed. It takes up bytes in your computer’s memory. It takes up bytes on disk while it waits to be sent and afterwards so you can look at it if you type about:telemetry into your address bar. (Try it yourself!)

The marginal cost to the user of the tens of bytes of memory and disk from our single number of Live Bookmarks is most accurately represented as a zero not only because memory and disk are excitingly cheap these days but also because there was likely some spare capacity in those systems.

Bandwidth (user) – Cost: $0.00 (but not zero)

Data requires network bandwidth to be reported, and network bandwidth costs money. Many consumer plans are flat-rate and so the marginal cost of the extra bytes is not felt at all (we’re using a little of the slack), so we can flatten this to zero.

But hey, let’s do some recreational math for fun! (We all do this in our spare time, right? It’s not just me?)

If we were paying per-byte and sending this from a smartphone, the first GB in Canada (where mobile data makes the most money for the service providers in the world) costs $30 per month. That’s about 3 thousandths of a cent per kilobyte.

The data collection is a number, which is about 4 bytes of data. We send it about three times per day and individual profiles are in use by Firefox on average 12 days a month (engagement ratio of 0.4). (If you’re interested, this is due to a bunch of factors including users having multiple profiles at school, work, and home… but that’s another blog post).

4 bytes x 3 per day x 12 days in a month ~= 144 bytes per month

Thus a more accurate cost estimate of user bandwidth for this data would be 4 ten-thousandths of a cent (in Canadian dollars). It would take over 200 years of reporting this figure to cost the user a single penny. So let’s call it 0 for our purposes here.

Though… however close the cost is to 0, It isn’t 0. This means that, over time and over enough data points and over our full Firefox population, there is a measurable cost. Though its weight is light when it is but a single data point sent infrequently by each of our users, put together it is still hefty enough that we shouldn’t ignore it.

Bandwidth (Mozilla) – Cost: $0

Internet Service Providers have a nice gig: they can charge the user when the bytes leave their machine and charge Mozilla when the bytes enter their machine. However, cloud data platform providers (Amazon’s AWS, Google’s GCP, Microsoft’s Azure, etc) don’t charge for bandwidth for the data coming into their services.

You do get charged for bandwidth _leaving_ their systems. And for anything you do _on_ their systems. If I were feeling uncharitable I guess I’d call this a vendor lock-in data roach motel.

At any rate, the cost for this step is 0.

Pipeline Processing – Cost: $15.12

Once our Live Bookmarks data reaches the pipeline, there’s a few steps the data needs to go through. It needs to be decompressed, examined for adherence to the data schema (malformed data gets thrown out), and a response written to the client to tell it that we received it all okay. It needs to be processed, examined, and funneled to the correct storage locations while also being made available for realtime analysis if we need it.

For our little 4-byte number that shouldn’t be too bad, right?

Well, now that we’re on Mozilla’s side of the operation we need to consider the scale. Just how many Firefox profiles are sending how many of these numbers at us? About 250M of them each month. (At time of writing this isn’t up-to-date beyond EOY2019. Sorry about that. We’re working on it). With an engagement ratio of about 0.4, data being sent about thrice a day, and each count of Live Bookmarks taking up 4 bytes of space, we’re looking at 12GB of data per month.

At our present levels, ingestion and processing costs about $90 per TB. This comes out to $1.08 of cost for this step, each month. Multiplied by 14 “months”, that’s $15.12.

About Months

In saying “14 months” for how long the pipeline needs to put up with the collection coming from the entire Firefox population I glossed over quite a lot of detail. The main piece of information is that the default expiry for new data collections in Firefox is five or six Firefox versions (which should come out to about six months).

However, as I’ve mentioned before, updates don’t all happen at once. Though we have about 90% of the Firefox population within 3 versions of up-to-date at any one time, there’s a long tail of Firefox profiles from ancient versions sending us data.

To calculate 14 months I looked at the total data collection volumes for five versions of Firefox: Firefox 69-73 (inclusive). This avoids Firefox ESR 68 gumming up the works (its support lifetime is much longer than a normal release, and we’re aiming for a best-case cost estimate) and is both far enough in the past that Firefox 69 ought to be winding down around now _and_ is recent enough that we’ll not have thrown out the data yet (more on retention periods later) and it is closer in behaviour to releases we’re doing this year.

Here’s what that looks like:time series plot showing data volumes from five Firefox versions

So I said this was far enough in the past that Firefox 69 ought to be winding down around now? Well, if you look really closely at the bottom-right you might be able to see that we’re still receiving data from users still on that Firefox version. Lots of them.

But this is where we are in history, and I’m not running this query again (it only cost 15 cents, but it took half an hour), so let’s do the math. The total amount of data received from these five releases so far divided by the amount of data I said above that the user population would be sending each month (12GB) comes out to about 13.7 months.

To account for the seriously-annoying number of pings from those five versions that we presumably will continue receiving into the future, I rounded up to 14.

Storage (Mozilla) – Cost: $84

Once the data has been processed it needs to live somewhere. This costs us 2 cents per gigabyte stored, per month we decide to store it. 12GB per month means $0.24, right?

Well, no. We don’t have a way to only store this data for a period of time, so we need to store it for as long as the other stuff we store. For year-over-year forecasting we retain data for two years plus one month: 25 months. (Well, we presently retain data a bit longer than that, but we’re getting there.) So we need to take the 12GB we get each month and store it for 25 months. When we do that for each of the 14 “months” of data we get:

12GB/”month” x 14 “months” x $0.02 per GB per month x 25 months retention = $84

Now if you think this “2 cents per GB” figure is a little high: it is! We should be able to take advantage of lower storage costs for data we don’t write to any more. Unfortunately, we do write to it all the time servicing Deletion Requests (which I’ll get to in a little bit).

Analysis (Mozilla) – Time: 30min, Cost: $25.55

Data stored on some server someplace is of no use. Its value is derived through interrogating it, faceting its aggregations across interesting dimensions, picking it apart and putting it back together.

If this sounds like processing time Mozilla needs to pay for, you are correct!

On-demand analyses in Google’s BigQuery cost $5 per TB of data scanned. Mozilla’s spent some decent time thinking about query patterns to arrange data in a way that minimizes the amount of data we need to look at in any given analysis… but it isn’t perfect. To deliver us a count of the number of Live Bookmarks across our user base we’re going to have to scan more than the 12GB per month.

But this is a Best Case Estimate so let’s figure out how much a perfect query (one that only had to scan the data we wanted to get out of it) would cost:

12GB / 1000GB/TB * 5 $/TB = $0.06

That gives you back a sum of all the Live Bookmarks reported from all the Firefox profiles in a month. The number might be 5, or 5 million, or 5 trillion.

In other words, the number is useless. The real question you want to answer is “How much is this feature used?” which is less about the number of Live Bookmarks reported than it is Live Bookmarks stored per Firefox profile. If the 5 million Live Bookmarks are five thousand reports of 1000 Live Bookmarks all from one fellow named Nick, then we shouldn’t be investing in a feature used by one person, however much that one person uses it.

If the 5 million Live Bookmarks are one hundred thousand profiles reporting various handfuls of times a moderate number of bookmarks, then Live Bookmarks is more likely a broadly-used feature and might just need a little kick to be used even more.

So we need to aggregate the counts per-client and then look at the distribution. We can ask, over all the reports of Live Bookmarks from this one Firefox profile, give us the maximum number reported. Then show us a graph (like this). A perfect query of a month’s data will not only need to look at the 12GB of the month’s Live Bookmarks count, but also the profile identifier (client_id) so we can deduplicate reports. That id is a UUID and is represented as a 36-byte string. This adds another 8x data to scan compared to the 4B Live Bookmarks count we were previously looking at, ballooning our query to 108GB and our cost to $0.54.

But wait! We’re doing two steps: one to crunch these down to the 250M profiles that reported data that month and then a second to count the counts (to make our graph). That second step needs to scan the 250M 4B “maximum counts”, which adds another half a cent.

So our Best Case Estimate for querying the data to get the answer to our question is: $0.55 cents (I rounded up the half cent).

But don’t forget you need an analyst to perform this analysis! Assuming you have a mature suite of data analysis tooling, some rigorous documentation, and a well-entrenched culture of everyone helping everyone, this shouldn’t take longer than a half-hour of a single person’s time. Which is another $25, coming to a grand total of $25.55.

Deletion – Cost: $21

The data’s journey is not complete because any time a user opts their Firefox profile out of data collection we receive an order to delete what data we’ve previously received from that profile. To delete we need to copy out all the not-deleted data into new partitions and drop the old ones. This is a processing cost that is currently using the ad hoc $5/TB rate every time we process a batch of deletions (monthly).

Our Live Bookmarks count is adding 4 bytes of data per row that needs to be copied over. Each of those counts (excepting the ones that are deleted) needs to be copied over 25 times (retention period of 25 months). The amount of deleted data is small (Firefox’s data collection is very specifically designed to only collect what is necessary, so you shouldn’t ever feel as though you need to opt out and trigger deletion) so we’ll ignore its effect on the numbers for the purposes of making this easier to calculate.

12 GB/”month” x 14 “months” x 25 deletions / 1000GB/TB x 5 $/TB = $21

The total lifetime cost of all the deletion batches we process for the Live Bookmarks counts we record is $21. We’re hoping to knock this down a few pegs in cost, but it’ll probably remain in the “some dollars” order of magnitude.

The bigger share of this cost is actually in Storage, above. If we didn’t have to delete our data then, after 90 days, storage costs drop by half per month. This means that, if you want to assign the dollars a little more like blame, Storage costs are “only” $52.08 (full price for 3 months, half for 22) and Deletion costs are $52.92.

Grand Total: $245.67

In the best case, a collection of a single number from the wide Firefox user base will cost Mozilla almost $246 over the collection’s lifetime, split about 50% between labour and data computing platform costs.

So that’s it? Call it a wrap? Well… no. There are some cautionary tales to be learned here.

Lessons

0) Lean Data Practices save money. Our Data Collection Review Request form ensures that we aren’t adding these costs to Mozilla and our users without justifying that the collection is necessary. These practices were put into place to protect our users’ privacy, but they do an equally good job of reducing costs.

1) The simplest permanent data collection costs $228 its first year and $103 every year afterwards even if you never look at it again. It costs $25 (30min) to expire a collection, which pays for itself in a maximum of 2.9 months (the payback period is much shorter if the data collection is bigger than 4B (like a Histogram) because the yearly costs are higher). The best time to have expired that collection was ages ago: the second-best time is now.

2) Spending extra time thinking about a data collection saves you time and money. Even if you uplift a quick expiry patch for a mis-measured collection, the nature of Firefox releases is such that you would still end up paying nearly all of the same $245.67 for a useless collection as you would for a correct one. Spend the time ahead of time to save the expense. Especially for permanent collections.

3) Even small improvements in documentation, process, and tooling will result in large savings. Half of this cost is labour, and lesson #2 is recommending you spend more time on it. Good documentation enables good decisions to be made confidently. Process protects you from collecting the wrong thing. Tooling catches mistakes before they make their way out into the wild. Even small things like consistent naming and language will save time and protect you from mistakes. These are your force multipliers.

4) To reduce costs, efficient data representations matter, and quickly-expiring data collections matter more.

5) Retention periods should be set as short as possible. You shouldn’t have to store Live Bookmarks counts from 2+ years ago.

Where Does Glean Fit In

Glean‘s focus on high-level metric types, end-to-end-testable data collections, and consistent naming makes mistakes in instrumentation easier to find during development. Rather than waiting for instrumentation code to reach release before realizing it isn’t correct, Glean is designed to help you catch those errors earlier.

Also, Glean’s use of per-application identifiers and emphasis on custom pings allows for data segregation that allows for different retention periods per-application or per-feature (e.g. the “metrics” ping might not need to be retained for 25 months even if the “baseline” ping does. And Firefox Desktop’s retention periods could be configured to be of a different length than Firefox Lockwise‘s) and reduces data scanned per analysis. And a consistent ping format and continued involvement of Data Science through design and development reduces analyst labour costs.

Basically the only thing we didn’t address was efficient data transfer encodings, and since Glean controls its ping format as an internal detail (unlike Telemetry) we could decide to address that later on without troubling Product Developers or Data Science.

There’s no doubt more we could do (and if you come up with something, do let us know!), but already I’m confident Glean will be worth its weight in Canadian Dollars.

:chutten

(( Special thanks to :jason and :mreid for helping me nail down costs for the pipeline pieces and for the broader audience of Data Engineers, Data Scientists, Telemetry Engineers, and other folks who reviewed the draft. ))

Distributed Teams: Regional Peculiarities Like Oktoberfest and Bagged Milk

It’s Oktoberfest! You know, that German holiday about beer and lederhosen?

No. As many Germans will tell you it’s not a German thing as much as it is a Bavarian thing. It’s like saying kilts are a British thing (it’s a Scottish thing). Or that milk in bags is a Canadian thing (in Canada it’s an Eastern Canada thing).

In researching what the heck I was talking about when I was making this comparison at a recent team meeting, Alessio found a lovely study on the efficiency of milk bags as milk packaging in Ontario published by The Environment and Plastics Industry Council in 1997.

I highly recommend you skim it for its graphs and the study conclusions. The best parts for me are how it highlights that the consumption of milk (by volume) increased 22% from 1968 to 1995 while at the same time the amount (by mass) of solid waste produced by milk packaging decreased by almost 20%.

I also liked Table 8 which showed the recycling rates of the various packaging types that we’d need to reach in order to match the small amount (by mass) of solid waste generation of the (100% unrecycled) milk bags. (Interestingly, in my region you can recycle milk bags if you first rinse and dry them).

I guess what I’m trying to say about this is three-fold:

  1. Don’t assume regional characteristics are national in your distributed team. Berliners might not look forward to Oktoberfest the way Münchner do, and it’s possible no one in the Vancouver office owns a milk jug or bag cutter.
  2. Milk Bags are kinda neat, and now I feel a little proud about living in a part of the world where they’re common. I’d be a little more confident about this if the data wasn’t presented by the plastics industry, but I’ll take what I can get (and I’ll start recycling my milk bags).
  3. Geez, my team can find data for _any topic_. What differences we have by being distributed around the world are eclipsed by how we’re universally a bunch of nerds.

:chutten

 

My StarCon 2019 Talk: Collecting Data Responsibly and at Scale

 

Back in January I was privileged to speak at StarCon 2019 at the University of Waterloo about responsible data collection. It was a bitterly-cold weekend with beautiful sun dogs ringing the morning sun. I spent it inside talking about good ways to collect data and how Mozilla serves as a concrete example. It’s 15 minutes short and aimed at a general audience. I hope you like it.

I encourage you to also sample some of the other talks. Two I remember fondly are Aaron Levin’s “Conjure ye File System, transmorgifier” about video games that look like file systems and Cory Dominguez’s lovely analysis of Moby Dick editions in “or, the whale“. Since I missed a whole day, I now get to look forward to fondly discovering new ones from the full list.

:chutten

Data Science is Festive: Christmas Light Reliability by Colour

This past weekend was a balmy 5 degrees Celsius which was lucky for me as I had to once again climb onto the roof of my house to deal with my Christmas lights. The middle two strings had failed bulbs somewhere along their length and I had a decent expectation that it was the Blue ones. Again.

Two years ago was our first autumn at our new house. The house needed Christmas lights so we bought four strings of them. Over the course of their December tour they suffered devastating bulb failures rendering alternating strings inoperable. (The bulbs are wired in a single parallel strand making a single bulb failure take down the whole string. However, connectivity is maintained so power flows through the circuit.)

20181104_111900

Last year I tested the four strings and found them all faulty. We bought two replacement strings and I scavenged all the working bulbs from one of the strings to make three working strings out of the old four. All five (four in use, one in reserve) survived the season in working order.

20181104_111948

This year in performing my sanity check before climbing the ladder I had to replace lamps in all three of the original strings to get them back to operating condition. Again.

And then I had an idea. A nerdy idea.

I had myself a wonderful nerdy idea!

“I know just what to do!” I laughed like an old miser.

I’ll gather some data and then visualize’er!

The strings are penta-colour: Red, Orange, Yellow, Green, and Blue. Each string has about an equal number of each colour of bulb and an extra Red and Yellow replacement bulb. Each bulb is made up of an internal LED lamp and an external plastic globe.

The LED lamps are the things that fail either from corrosion on the contacts or from something internal to the diode.

So I started with 6N+12 lamps and 6N+12 globes in total: N of each colour with an extra 1 Red and 1 Yellow per string. Whenever a lamp died I kept its globe. So the losses over time should manifest themselves as a surplus of globes and a defecit of lamps.

If the losses were equal amongst the colours we’d see a equal surplus of Green, Orange, and Blue globes and a slightly lower surplus of Red and Yellow globes (because of the extras). This is not what I saw when I lined them all up, though:

An image of christmas lightbulb globes and LED lamps in a histogram fashion. The blue globes are the most populous followed by yellow, green, then red. Yellow LED lamps are the most populous followed by red and green.

Instead we find ourselves with no oranges (I fitted all the extra oranges into empty blue spots when consolidating), an equal number of lamps and globes of yellow (yellow being one of the colours adjacent to most broken bulbs and, thus, less likely to be chosen for replacement), a mild surplus of red (one red lamp had evidently failed at one point), a larger surplus of green globes (four failed green lamps isn’t great but isn’t bad)…

And 14 excess blue globes.

Now, my sampling frequency isn’t all that high. And my knowledge of confidence intervals is a little rusty. But that’s what I think I can safely call a statistical outlier. I’m pretty sure we can conclude that, on my original set of strings of Christmas lights, Blue LEDs are more likely to fail than any other colour. But why?

I know from my LED history that high-luminance blue LEDs took the longest to be invented (patents filed in 1993 over 30 years after the first red LED). I learned from my friend who works at a display company that blue LEDs are more expensive. If I take those together I can suppose that perhaps the manufacturers of my light strings cheaped out on their lot of blue LEDs one year and stuck me, the consumer, with substandard lamps.

Instead of bringing joy, it brought frustration. But also predictive power because, you know what? On those two broken strings I had to climb up to retrieve this past, unseasonably-warm Saturday two of the four failed bulbs were indeed, as I said at the top, the Blue ones. Again.

 

:chutten

Data Science is Hard: Counting Users

Screenshot_2018-08-29 User Activity Firefox Public Data Report

Counting is harder than you think. No, really!

Intuitively, as you look around you, you think this can’t be true. If you see a parking lot you can count the cars, right?

But do cars that have left the parking lot count? What about cars driving through it without stopping? What about cars driving through looking for a space? (And can you tell the difference between those two kinds from a distance?)

These cars all count if you’re interested in usage. It’s all well and good to know the number of cars using your parking lot right now… but is it lower on weekends? Holidays? Are you measuring on a rainy day when fewer people take bicycles, or in the Summer when more people are on vacation? Do you need better signs or more amenities to get more drivers to stop? Are you going to have expand capacity this year, or next?

Yesterday we released the Firefox Public Data Report. Go take a look! It is the culmination of months of work of many mozillians (not me, I only contributed some early bug reports). In it you can find out how many users Firefox has, the most popular addons, and how quickly Firefox users update to the latest version. And you can choose whether to look at how these plots look for the worldwide user base or for one of the top ten (by number of Firefox users) countries individually.

It’s really cool.

The first two plots are a little strange, though. They count the number of Firefox users over time… and they don’t agree. They don’t even come close!

For the week including August 17, 2018 the Yearly Active User (YAU) count is 861884770 (or about 862M)… but the Monthly Active User (MAU) count is 256092920 (or about 256M)!

That’s over 600M difference! Which one is right?

Well, they both are.

Returning to our parking lot analogy, MAU is about counting how many cars use the parking lot over a 28-day period. So, starting Feb 1, count cars. If someone you saw earlier returns the next day or after a week, don’t count them again: we only want unique cars. Then, at the end of the 28-day period, that was the MAU for Feb 28. The MAU for Mar 1 (on non-leap-years) is the same thing, but you start counting on Feb 2.

Similarly for YAU, but you count over the past 365 days.

It stands to reason that you’ll see more unique cars over the year than you will over the month: you’ll see visitors, tourists, people using the lot just once, and people who have changed jobs and haven’t been back in four months.

So how many of these 600M who are in the YAU but not in the MAU are gone forever? How many are coming back? We don’t know.

Well, we don’t know _precisely_.

We’ve been at the browser game for long enough to see patterns in the data. We’re in the Summer slump for MAU numbers, and we have a model for how much higher the numbers are likely to be come October. We have surveyed people of varied backgrounds and have some ideas of why people change browsers to or away from Firefox.

We have the no-longer users, the lapsed users, the lost-and-regained users, the tried-us-once users, the non-human users, … we have categories and rough proportions on what we think we know about our population, and how that influences how we can better make the internet better for them.

Ultimately, to me, it doesn’t matter too much. I work on Firefox, a product that hundreds of millions of people use. How many hundreds of millions doesn’t matter: we’re above the threshold that makes me feel like I’m making the world better.

(( Well… I say that, but it is actually my job to understand the mechanisms behind these  numbers and why they can’t be exact, so I do have a bit of a vested interest. And there are a myriad of technological and behavioural considerations to account for in code and in documentation and in analysis which makes it an interesting job. But, you know. Hundreds of millions is precise enough for my job satisfaction index. ))

But once again we reach the inescapable return to the central thesis. Counting is harder than you think: one of the leading candidates for the Data Team’s motto. (Others include “Well, it depends.” and “¯\_(ツ)_/¯”). And now we’re counting in the open, so you get to experience its difficulty firsthand. Go have another look.

:chutten