Data Science is Hard: ALSA in Firefox

(( We’re overdue for another episode in this series on how Data Science is Hard. Today is a story from 2016 which I think illustrates many important things to do with data. ))

It’s story time. Gather ’round.

In July of 2016, Anthony Jones made the case that the Mozilla-built Firefox for Linux should stop supporting the ALSA backend (and also the WinXP WinMM backend) so that we could innovate on features for more modern audio backends.

(( You don’t need to know what an audio backend is to understand this story. ))

The code supporting ALSA would remain in tree for any Linux distribution who wished to maintain the backend and build it for themselves, but Mozilla would stop shipping Firefox with that code in it.

But how could we ensure the number of Firefoxen relying on this backend was small enough that we wouldn’t be removing something our users desperately needed? Luckily :padenot had just added an audio backend measurement to Telemetry. “We’ll have data soon,” he wrote.

By the end of August we’d heard from Firefox Nightly and Firefox Developer Edition that only 3.5% and 2% (respectively) of Linux subsessions with audio used ALSA. This was small enough to for the removal to move ahead.

Fast-forward to March of 2017. Seven months have passed. The removal has wound its way through Nightly, Developer Edition, Beta, and now into the stable Release channel. Linux users following this update channel update their Firefox and… suddenly the web grows silent for a large number of users.

Bugs are filed (thirteen of them). The mailing list thread with Anthony’s original proposal is revived with some very angry language. It seems as though far more than just a fraction of a fraction of users were using ALSA. There were entire Linux distributions that didn’t ship anything besides ALSA. How did Telemetry miss them?

It turns out that many of those same ALSA-only Linux distributions also turned off Telemetry when they repackaged Firefox for their users. And for any that shipped with Telemetry at all, many users disabled it themselves. Those users’ Firefoxen had no way to phone home to tell Mozilla how important ALSA was to them… and now it was too late.

Those Linux distributions started building ALSA support into their distributed Firefox builds… and hopefully began reporting Telemetry by default to prevent this from happening again. I don’t know if they did for sure (we don’t collect fine-grained information like that because we don’t need it).

But it serves as a cautionary tale: Mozilla can only support a finite number of things. Far fewer now than we did back in 2016. We prioritize what we support based on its simplicity and its reach. That first one we can see for ourselves, and for the second we rely on data collection like Telemetry to tell us.

Counting things is harder than it looks. Counting things that are invisible is damn near impossible. So if you want to be counted: turn Telemetry on (it’s in the Preferences) and leave it on.


This Week in Glean: How Much Does That Data Cost?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I’ve written before about data, but never tackled the business perspective. To a business, what is data? It could be considered an asset, I suppose: a tool, like a printer, to make your business more efficient.

But like that printer and other assets, data has a cost. We can quite easily look up how much it costs to store arbitrary data on AWS (less than 2.3 cents USD per GB per month) but that only provides the cost of the data at rest. It doesn’t consider what it took for the data to get there or how much it costs to be useful once it’s stored.

So let’s imagine that you come across a decision that can only be made with data. You’ve tried your best to do without it, but you really do need to know how many Live Bookmarks there are per Firefox profile… maybe it’s in wide use and we should assign someone to spruce it up. Maybe almost no one uses it and so Live Bookmarks should be removed and instead become a feature provided by extensions.

This should be easy, right? Slap the number into an HTTP payload and send it to a Mozilla-controlled server. Then just count them all up!

As one of the Data Organization’s unofficial mottos puts it: Counting is Harder Than It Looks.

Let’s look at the full lifecycle of the metric from ideation and instrumentation to expiry and deletion. I’ll measure money and time costs, being clear about the assumptions guiding my estimates and linking to sources where available.

For a rule of thumb, time costs are $50 per hour. Developers and Managers and PMs cost more than $100k per year in total compensation in many jurisdictions, and less in many others. Let’s go with this because why not. I considered ignoring labour costs altogether because these people are doing their jobs whether they’re performing their part in this collection or not… but that’s assuming they have the spare capacity and would otherwise be doing nothing. Everyone I talk to is busy, so everyone’s doing this data collection work instead of something else they could be doing: so there is an opportunity cost.

Fixed costs, like the cost of building and maintaining a data collection library, data collection pipeline, bug trackers, code review tooling, dev computers are all ignored. We could amortize that per data collection… but it’d probably work out to $0 anyway.

Also, for the purposes of measuring data we’re counting only the size of the data itself (the count of the number of Live Bookmarks). To be more complete we’d need to amortize the cost of sending the data point (HTTP headers, payload metadata, the data point’s identifier, etc.) and factor in additional complexity (transfer encoding, compression, etc.). This would require a lot of words, and in the present Firefox Telemetry system this amortizes to 0 because the “main” ping has many data points in it and gzip compression is pretty good.

Also, this is a Best Case Estimate. I make many assumptions small in order to make this a lower-bound cost if everything goes according to plan and everyone acts the way they should.

Ideation – Time: 30min, Cost: $25

How long does it take you to figure out how to measure something? You need to know the feature you’re measuring, the capabilities of the data collection library you’re using to do the measuring, and some idea of how you’ll analyse it at the other end.  If you’re trying to send something clever like the state of a customizable UI element or do something that requires custom analysis, this will take longer and take more people which will cost more money.

But for our example we know what we’re collecting: numbers of things. The data collection library is old and well understood. The analysis is straightforward. This takes one person a half hour to think through.

Instrumentation – Time: 60min, Cost: $50

Knowing the feature is not the same as knowing the code. You need a subject matter expert (developer who knows the feature and the code as well as the data collection library’s API) to figure out on exactly which line of code we should call exactly what method with exactly which count. If it’s complicated, several people may need to meet in order to figure out what to do here: are the input event timestamps the same format on Windows and Mac? Does time when the computer is asleep count or not?

For our example we have questions: Should we count the number of Live Bookmarks in the database? The number in the Bookmark Menu? The Bookmark Toolbar? What if the user deletes one, should we count before or after the delete?

This is small enough that we can find a single subject matter expert who knows it all. They read some documentation, make some decisions, write some code, and take an hour to do this themselves.

Review – Time: 30min, Cost $25

Both the code and the data collection need review. The simplicity of the data collection and the code make this quick. Mozilla’s code review tooling helps a lot here, too. Though it takes a day or two for the Module Peer and the Data Steward to find time to get to the reviews, it only takes a combination of a half hour for them to okay it to ship.

Storage (user) – Cost: $0

Data takes up space. Its definition takes up some bytes in the Firefox binary that you installed. It takes up bytes in your computer’s memory. It takes up bytes on disk while it waits to be sent and afterwards so you can look at it if you type about:telemetry into your address bar. (Try it yourself!)

The marginal cost to the user of the tens of bytes of memory and disk from our single number of Live Bookmarks is most accurately represented as a zero not only because memory and disk are excitingly cheap these days but also because there was likely some spare capacity in those systems.

Bandwidth (user) – Cost: $0.00 (but not zero)

Data requires network bandwidth to be reported, and network bandwidth costs money. Many consumer plans are flat-rate and so the marginal cost of the extra bytes is not felt at all (we’re using a little of the slack), so we can flatten this to zero.

But hey, let’s do some recreational math for fun! (We all do this in our spare time, right? It’s not just me?)

If we were paying per-byte and sending this from a smartphone, the first GB in Canada (where mobile data makes the most money for the service providers in the world) costs $30 per month. That’s about 3 thousandths of a cent per kilobyte.

The data collection is a number, which is about 4 bytes of data. We send it about three times per day and individual profiles are in use by Firefox on average 12 days a month (engagement ratio of 0.4). (If you’re interested, this is due to a bunch of factors including users having multiple profiles at school, work, and home… but that’s another blog post).

4 bytes x 3 per day x 12 days in a month ~= 144 bytes per month

Thus a more accurate cost estimate of user bandwidth for this data would be 4 ten-thousandths of a cent (in Canadian dollars). It would take over 200 years of reporting this figure to cost the user a single penny. So let’s call it 0 for our purposes here.

Though… however close the cost is to 0, It isn’t 0. This means that, over time and over enough data points and over our full Firefox population, there is a measurable cost. Though its weight is light when it is but a single data point sent infrequently by each of our users, put together it is still hefty enough that we shouldn’t ignore it.

Bandwidth (Mozilla) – Cost: $0

Internet Service Providers have a nice gig: they can charge the user when the bytes leave their machine and charge Mozilla when the bytes enter their machine. However, cloud data platform providers (Amazon’s AWS, Google’s GCP, Microsoft’s Azure, etc) don’t charge for bandwidth for the data coming into their services.

You do get charged for bandwidth _leaving_ their systems. And for anything you do _on_ their systems. If I were feeling uncharitable I guess I’d call this a vendor lock-in data roach motel.

At any rate, the cost for this step is 0.

Pipeline Processing – Cost: $15.12

Once our Live Bookmarks data reaches the pipeline, there’s a few steps the data needs to go through. It needs to be decompressed, examined for adherence to the data schema (malformed data gets thrown out), and a response written to the client to tell it that we received it all okay. It needs to be processed, examined, and funneled to the correct storage locations while also being made available for realtime analysis if we need it.

For our little 4-byte number that shouldn’t be too bad, right?

Well, now that we’re on Mozilla’s side of the operation we need to consider the scale. Just how many Firefox profiles are sending how many of these numbers at us? About 250M of them each month. (At time of writing this isn’t up-to-date beyond EOY2019. Sorry about that. We’re working on it). With an engagement ratio of about 0.4, data being sent about thrice a day, and each count of Live Bookmarks taking up 4 bytes of space, we’re looking at 12GB of data per month.

At our present levels, ingestion and processing costs about $90 per TB. This comes out to $1.08 of cost for this step, each month. Multiplied by 14 “months”, that’s $15.12.

About Months

In saying “14 months” for how long the pipeline needs to put up with the collection coming from the entire Firefox population I glossed over quite a lot of detail. The main piece of information is that the default expiry for new data collections in Firefox is five or six Firefox versions (which should come out to about six months).

However, as I’ve mentioned before, updates don’t all happen at once. Though we have about 90% of the Firefox population within 3 versions of up-to-date at any one time, there’s a long tail of Firefox profiles from ancient versions sending us data.

To calculate 14 months I looked at the total data collection volumes for five versions of Firefox: Firefox 69-73 (inclusive). This avoids Firefox ESR 68 gumming up the works (its support lifetime is much longer than a normal release, and we’re aiming for a best-case cost estimate) and is both far enough in the past that Firefox 69 ought to be winding down around now _and_ is recent enough that we’ll not have thrown out the data yet (more on retention periods later) and it is closer in behaviour to releases we’re doing this year.

Here’s what that looks like:time series plot showing data volumes from five Firefox versions

So I said this was far enough in the past that Firefox 69 ought to be winding down around now? Well, if you look really closely at the bottom-right you might be able to see that we’re still receiving data from users still on that Firefox version. Lots of them.

But this is where we are in history, and I’m not running this query again (it only cost 15 cents, but it took half an hour), so let’s do the math. The total amount of data received from these five releases so far divided by the amount of data I said above that the user population would be sending each month (12GB) comes out to about 13.7 months.

To account for the seriously-annoying number of pings from those five versions that we presumably will continue receiving into the future, I rounded up to 14.

Storage (Mozilla) – Cost: $84

Once the data has been processed it needs to live somewhere. This costs us 2 cents per gigabyte stored, per month we decide to store it. 12GB per month means $0.24, right?

Well, no. We don’t have a way to only store this data for a period of time, so we need to store it for as long as the other stuff we store. For year-over-year forecasting we retain data for two years plus one month: 25 months. (Well, we presently retain data a bit longer than that, but we’re getting there.) So we need to take the 12GB we get each month and store it for 25 months. When we do that for each of the 14 “months” of data we get:

12GB/”month” x 14 “months” x $0.02 per GB per month x 25 months retention = $84

Now if you think this “2 cents per GB” figure is a little high: it is! We should be able to take advantage of lower storage costs for data we don’t write to any more. Unfortunately, we do write to it all the time servicing Deletion Requests (which I’ll get to in a little bit).

Analysis (Mozilla) – Time: 30min, Cost: $25.55

Data stored on some server someplace is of no use. Its value is derived through interrogating it, faceting its aggregations across interesting dimensions, picking it apart and putting it back together.

If this sounds like processing time Mozilla needs to pay for, you are correct!

On-demand analyses in Google’s BigQuery cost $5 per TB of data scanned. Mozilla’s spent some decent time thinking about query patterns to arrange data in a way that minimizes the amount of data we need to look at in any given analysis… but it isn’t perfect. To deliver us a count of the number of Live Bookmarks across our user base we’re going to have to scan more than the 12GB per month.

But this is a Best Case Estimate so let’s figure out how much a perfect query (one that only had to scan the data we wanted to get out of it) would cost:

12GB / 1000GB/TB * 5 $/TB = $0.06

That gives you back a sum of all the Live Bookmarks reported from all the Firefox profiles in a month. The number might be 5, or 5 million, or 5 trillion.

In other words, the number is useless. The real question you want to answer is “How much is this feature used?” which is less about the number of Live Bookmarks reported than it is Live Bookmarks stored per Firefox profile. If the 5 million Live Bookmarks are five thousand reports of 1000 Live Bookmarks all from one fellow named Nick, then we shouldn’t be investing in a feature used by one person, however much that one person uses it.

If the 5 million Live Bookmarks are one hundred thousand profiles reporting various handfuls of times a moderate number of bookmarks, then Live Bookmarks is more likely a broadly-used feature and might just need a little kick to be used even more.

So we need to aggregate the counts per-client and then look at the distribution. We can ask, over all the reports of Live Bookmarks from this one Firefox profile, give us the maximum number reported. Then show us a graph (like this). A perfect query of a month’s data will not only need to look at the 12GB of the month’s Live Bookmarks count, but also the profile identifier (client_id) so we can deduplicate reports. That id is a UUID and is represented as a 36-byte string. This adds another 8x data to scan compared to the 4B Live Bookmarks count we were previously looking at, ballooning our query to 108GB and our cost to $0.54.

But wait! We’re doing two steps: one to crunch these down to the 250M profiles that reported data that month and then a second to count the counts (to make our graph). That second step needs to scan the 250M 4B “maximum counts”, which adds another half a cent.

So our Best Case Estimate for querying the data to get the answer to our question is: $0.55 cents (I rounded up the half cent).

But don’t forget you need an analyst to perform this analysis! Assuming you have a mature suite of data analysis tooling, some rigorous documentation, and a well-entrenched culture of everyone helping everyone, this shouldn’t take longer than a half-hour of a single person’s time. Which is another $25, coming to a grand total of $25.55.

Deletion – Cost: $21

The data’s journey is not complete because any time a user opts their Firefox profile out of data collection we receive an order to delete what data we’ve previously received from that profile. To delete we need to copy out all the not-deleted data into new partitions and drop the old ones. This is a processing cost that is currently using the ad hoc $5/TB rate every time we process a batch of deletions (monthly).

Our Live Bookmarks count is adding 4 bytes of data per row that needs to be copied over. Each of those counts (excepting the ones that are deleted) needs to be copied over 25 times (retention period of 25 months). The amount of deleted data is small (Firefox’s data collection is very specifically designed to only collect what is necessary, so you shouldn’t ever feel as though you need to opt out and trigger deletion) so we’ll ignore its effect on the numbers for the purposes of making this easier to calculate.

12 GB/”month” x 14 “months” x 25 deletions / 1000GB/TB x 5 $/TB = $21

The total lifetime cost of all the deletion batches we process for the Live Bookmarks counts we record is $21. We’re hoping to knock this down a few pegs in cost, but it’ll probably remain in the “some dollars” order of magnitude.

The bigger share of this cost is actually in Storage, above. If we didn’t have to delete our data then, after 90 days, storage costs drop by half per month. This means that, if you want to assign the dollars a little more like blame, Storage costs are “only” $52.08 (full price for 3 months, half for 22) and Deletion costs are $52.92.

Grand Total: $245.67

In the best case, a collection of a single number from the wide Firefox user base will cost Mozilla almost $246 over the collection’s lifetime, split about 50% between labour and data computing platform costs.

So that’s it? Call it a wrap? Well… no. There are some cautionary tales to be learned here.


0) Lean Data Practices save money. Our Data Collection Review Request form ensures that we aren’t adding these costs to Mozilla and our users without justifying that the collection is necessary. These practices were put into place to protect our users’ privacy, but they do an equally good job of reducing costs.

1) The simplest permanent data collection costs $228 its first year and $103 every year afterwards even if you never look at it again. It costs $25 (30min) to expire a collection, which pays for itself in a maximum of 2.9 months (the payback period is much shorter if the data collection is bigger than 4B (like a Histogram) because the yearly costs are higher). The best time to have expired that collection was ages ago: the second-best time is now.

2) Spending extra time thinking about a data collection saves you time and money. Even if you uplift a quick expiry patch for a mis-measured collection, the nature of Firefox releases is such that you would still end up paying nearly all of the same $245.67 for a useless collection as you would for a correct one. Spend the time ahead of time to save the expense. Especially for permanent collections.

3) Even small improvements in documentation, process, and tooling will result in large savings. Half of this cost is labour, and lesson #2 is recommending you spend more time on it. Good documentation enables good decisions to be made confidently. Process protects you from collecting the wrong thing. Tooling catches mistakes before they make their way out into the wild. Even small things like consistent naming and language will save time and protect you from mistakes. These are your force multipliers.

4) To reduce costs, efficient data representations matter, and quickly-expiring data collections matter more.

5) Retention periods should be set as short as possible. You shouldn’t have to store Live Bookmarks counts from 2+ years ago.

Where Does Glean Fit In

Glean‘s focus on high-level metric types, end-to-end-testable data collections, and consistent naming makes mistakes in instrumentation easier to find during development. Rather than waiting for instrumentation code to reach release before realizing it isn’t correct, Glean is designed to help you catch those errors earlier.

Also, Glean’s use of per-application identifiers and emphasis on custom pings allows for data segregation that allows for different retention periods per-application or per-feature (e.g. the “metrics” ping might not need to be retained for 25 months even if the “baseline” ping does. And Firefox Desktop’s retention periods could be configured to be of a different length than Firefox Lockwise‘s) and reduces data scanned per analysis. And a consistent ping format and continued involvement of Data Science through design and development reduces analyst labour costs.

Basically the only thing we didn’t address was efficient data transfer encodings, and since Glean controls its ping format as an internal detail (unlike Telemetry) we could decide to address that later on without troubling Product Developers or Data Science.

There’s no doubt more we could do (and if you come up with something, do let us know!), but already I’m confident Glean will be worth its weight in Canadian Dollars.


(( Special thanks to :jason and :mreid for helping me nail down costs for the pipeline pieces and for the broader audience of Data Engineers, Data Scientists, Telemetry Engineers, and other folks who reviewed the draft. ))

This Week in Glean: A Distributed Team Echoes Distributed Workflow

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

Last Week: Extending Glean: build re-usable types for new use-cases by Alessio

I was recently struck by a realization that the position of our data org’s team members around the globe mimics the path that data flows through the Glean Ecosystem.

Glean Data takes this five-fold path (corresponding to five teams):

  1. Data is collected in a client using the Glean SDK (Glean Team)
  2. Data is transmitted to the Structured Ingestion pipeline (Data Platform)
  3. Data is stored and maintained in our infrastructure (Data Operations)
  4. Data is presented in our tools (Data Tools)
  5. Data is analyzed and reported on (Data Science)

The geographical midpoint of the Glean Team is about halfway across the north atlantic. For Data Platform it’s on the continental US, anchored by three members in the midwestern US. Data Ops is further West still, with four members in the Pacific timezone and no Europeans. Data Tools breaks the trend by being a bit further East, with fewer westcoasters. Data Science (for Firefox) is centred farther west still, with only two members East of the Rocky Mountains.

Or, graphically:

(approximate) Team Geocentres

Given the rotation of the Earth, the sun rises first on the Glean Team and the data collected by the Glean SDK. Then the data and the sun move West to the Data Platform where it is ingested. Data Tools gets the data from the Platform as morning breaks over Detroit. Data Operations keeps it all running from the midwest. And finally, the West Coast Centre of Firefox Data Science Excellence greets the data from a mountaintop, to try and make sense of it all.

(( Lying orthogonal to the data organization is the secret Sixth Glean Data “Team”: Data Stewardship. They ensure all Glean Data is collected in accordance with Mozilla’s Privacy Promise. The sun never sets on the Stewardship’s global coverage, and it’s a volunteer effort supplied from eight teams (and growing!), so I’ve omitted them from this narrative. ))

Bad metaphors about sunlight aside, I wonder whether this is random or whether this is some sort of emergent behaviour.

Conway’s Law suggests that our system architecture will tend to reflect our orgchart (well, the law is a bit more explicit about “communication structure” independent of organizational structure, but in the data org’s case they’re pretty close). Maybe this is a specific example of that: data architecture as a reflection of orgchart geography.

Or perhaps five dots on a globe that are only mostly in some order is too weak of a coincidence to even bear thinking about? Nah, where’s the blog post in that thinking…

If it’s emergent, it then becomes interesting to consider the “chicken and egg” point of view: did the organization beget the system or the system beget the organization? When I joined Mozilla some of these teams didn’t exist. Some of them only kinda existed within other parts of the company. So is the situation we’re in today a formalization by us of a structure that mirrors the system we’re building, or did we build the system in this way because of the structure we already had?



Data Science is Hard: Validating Data for Glean

Glean is a new library for collecting data in Mozilla products. It’s been shipping in Firefox Preview for a little while and I’d like to take a minute to talk about how I validated that it sends what we think it’s sending.

Validating new data collections in an existing system like Firefox Desktop Telemetry is a game of comparing against things we already know. We know that some percentage of data we receive is just garbage: bad dates, malformed records, attempts at buffer overflows and SQL injection. If the amount of garbage in the new collection is within the same overall amount of garbage we see normally, we count it as “good enough” and move on.

With new data collection from a new system like Glean coming from new endpoints like the reference browser and Firefox Preview, we’re given an opportunity to compare against the ideal. Maybe the correct number of failures is 0?

But what is a failure? What is acceptable behaviour?

We have an “events” ping in Glean: can the amount of time covered by the events’ timestamps ever exceed the amount of time covered by the ping? I didn’t think so, but apparently it’s an expected outcome when the events were restored from a previous session.

So how do you validate something that has unknown error states?

I started with a list of things any client-based network-transmitted data collection system had to have:

  • How many pings (data transmissions) are there?
  • How many measurements are in those pings?
  • How many clients are sending these pings?
  • How often?
  • How long do they take to get to our servers?
  • How many poorly-structured pings are sent? By how many clients? How often?
  • How many pings with bad values are sent? By how many clients? How often?

From there we can dive into validating specifics about the data collections:

  • Do the events in the “events” ping have timestamps with reasonable separations? (What does reasonable mean here? Well, it could be anything, but if the span between two timestamps is measured in years, and the app has only been available for some number of weeks, it’s probably broken.)
  • Are the GUIDs in the pings actually globally unique? Are we seeing duplicates? (We are, but not many)
  • Are there other ping fields that should be unique, but aren’t? (For Glean no client should ever send the same ping type with the same sequence number. But that kind of duplicate appears, too)

Once we can establish confidence in the overall health of the data reporting mechanism we can start using it to report errors about itself:

  • Ping transmission should be quick (because they’re small). Assuming the ping transmission time is 0, how far away are the clients’ clocks from the server’s clock? (AKA “Clock skew”. Turns out that mobile clients’ clocks are more reliable than desktop clients’ clocks (at least in the prerelease population. We’ll see what happens when we start measuring non-beta users))
  • How many errors are reported by internal error-reporting metrics? How many send failures? How many times did the app try to record a string that was too long?
  • What measurements are in the ping? Are they only the ones we expect to see? Are they showing in reasonable proportions relative to each other and the number of clients and pings reporting them?

All these attempts to determine what is reasonable and what is correct depend on a strong foundation of documentation. I can read the code that collects the data and sends the pings… but that tells me what is correct relative to what is implemented, not what is correct relative to what is intended.

By validating to the documentation, to what is intended, we can not only find bugs in the code, we can find bugs in the docs. And a data collection system lives and dies on its documentation: it is in many ways a more important part of the “product” than the code.

At this point, aside from the “metrics” ping which is awaiting validation after some fixes reach saturation in the population, Glean has passed all of these criteria acceptably. It still has a bit of a duplicate ping problem, but its clock skew and latency are much lower than Firefox Desktop’s. There are some outrageous clients sending dozens of pings over a period that they should be sending a handful, but that might just be a test client whose values will disappear into the noise when the user population grows.


Data Science is Festive: Christmas Light Reliability by Colour

This past weekend was a balmy 5 degrees Celsius which was lucky for me as I had to once again climb onto the roof of my house to deal with my Christmas lights. The middle two strings had failed bulbs somewhere along their length and I had a decent expectation that it was the Blue ones. Again.

Two years ago was our first autumn at our new house. The house needed Christmas lights so we bought four strings of them. Over the course of their December tour they suffered devastating bulb failures rendering alternating strings inoperable. (The bulbs are wired in a single parallel strand making a single bulb failure take down the whole string. However, connectivity is maintained so power flows through the circuit.)


Last year I tested the four strings and found them all faulty. We bought two replacement strings and I scavenged all the working bulbs from one of the strings to make three working strings out of the old four. All five (four in use, one in reserve) survived the season in working order.


This year in performing my sanity check before climbing the ladder I had to replace lamps in all three of the original strings to get them back to operating condition. Again.

And then I had an idea. A nerdy idea.

I had myself a wonderful nerdy idea!

“I know just what to do!” I laughed like an old miser.

I’ll gather some data and then visualize’er!

The strings are penta-colour: Red, Orange, Yellow, Green, and Blue. Each string has about an equal number of each colour of bulb and an extra Red and Yellow replacement bulb. Each bulb is made up of an internal LED lamp and an external plastic globe.

The LED lamps are the things that fail either from corrosion on the contacts or from something internal to the diode.

So I started with 6N+12 lamps and 6N+12 globes in total: N of each colour with an extra 1 Red and 1 Yellow per string. Whenever a lamp died I kept its globe. So the losses over time should manifest themselves as a surplus of globes and a defecit of lamps.

If the losses were equal amongst the colours we’d see a equal surplus of Green, Orange, and Blue globes and a slightly lower surplus of Red and Yellow globes (because of the extras). This is not what I saw when I lined them all up, though:

An image of christmas lightbulb globes and LED lamps in a histogram fashion. The blue globes are the most populous followed by yellow, green, then red. Yellow LED lamps are the most populous followed by red and green.

Instead we find ourselves with no oranges (I fitted all the extra oranges into empty blue spots when consolidating), an equal number of lamps and globes of yellow (yellow being one of the colours adjacent to most broken bulbs and, thus, less likely to be chosen for replacement), a mild surplus of red (one red lamp had evidently failed at one point), a larger surplus of green globes (four failed green lamps isn’t great but isn’t bad)…

And 14 excess blue globes.

Now, my sampling frequency isn’t all that high. And my knowledge of confidence intervals is a little rusty. But that’s what I think I can safely call a statistical outlier. I’m pretty sure we can conclude that, on my original set of strings of Christmas lights, Blue LEDs are more likely to fail than any other colour. But why?

I know from my LED history that high-luminance blue LEDs took the longest to be invented (patents filed in 1993 over 30 years after the first red LED). I learned from my friend who works at a display company that blue LEDs are more expensive. If I take those together I can suppose that perhaps the manufacturers of my light strings cheaped out on their lot of blue LEDs one year and stuck me, the consumer, with substandard lamps.

Instead of bringing joy, it brought frustration. But also predictive power because, you know what? On those two broken strings I had to climb up to retrieve this past, unseasonably-warm Saturday two of the four failed bulbs were indeed, as I said at the top, the Blue ones. Again.



Data Science is Hard: Counting Users

Screenshot_2018-08-29 User Activity Firefox Public Data Report

Counting is harder than you think. No, really!

Intuitively, as you look around you, you think this can’t be true. If you see a parking lot you can count the cars, right?

But do cars that have left the parking lot count? What about cars driving through it without stopping? What about cars driving through looking for a space? (And can you tell the difference between those two kinds from a distance?)

These cars all count if you’re interested in usage. It’s all well and good to know the number of cars using your parking lot right now… but is it lower on weekends? Holidays? Are you measuring on a rainy day when fewer people take bicycles, or in the Summer when more people are on vacation? Do you need better signs or more amenities to get more drivers to stop? Are you going to have expand capacity this year, or next?

Yesterday we released the Firefox Public Data Report. Go take a look! It is the culmination of months of work of many mozillians (not me, I only contributed some early bug reports). In it you can find out how many users Firefox has, the most popular addons, and how quickly Firefox users update to the latest version. And you can choose whether to look at how these plots look for the worldwide user base or for one of the top ten (by number of Firefox users) countries individually.

It’s really cool.

The first two plots are a little strange, though. They count the number of Firefox users over time… and they don’t agree. They don’t even come close!

For the week including August 17, 2018 the Yearly Active User (YAU) count is 861884770 (or about 862M)… but the Monthly Active User (MAU) count is 256092920 (or about 256M)!

That’s over 600M difference! Which one is right?

Well, they both are.

Returning to our parking lot analogy, MAU is about counting how many cars use the parking lot over a 28-day period. So, starting Feb 1, count cars. If someone you saw earlier returns the next day or after a week, don’t count them again: we only want unique cars. Then, at the end of the 28-day period, that was the MAU for Feb 28. The MAU for Mar 1 (on non-leap-years) is the same thing, but you start counting on Feb 2.

Similarly for YAU, but you count over the past 365 days.

It stands to reason that you’ll see more unique cars over the year than you will over the month: you’ll see visitors, tourists, people using the lot just once, and people who have changed jobs and haven’t been back in four months.

So how many of these 600M who are in the YAU but not in the MAU are gone forever? How many are coming back? We don’t know.

Well, we don’t know _precisely_.

We’ve been at the browser game for long enough to see patterns in the data. We’re in the Summer slump for MAU numbers, and we have a model for how much higher the numbers are likely to be come October. We have surveyed people of varied backgrounds and have some ideas of why people change browsers to or away from Firefox.

We have the no-longer users, the lapsed users, the lost-and-regained users, the tried-us-once users, the non-human users, … we have categories and rough proportions on what we think we know about our population, and how that influences how we can better make the internet better for them.

Ultimately, to me, it doesn’t matter too much. I work on Firefox, a product that hundreds of millions of people use. How many hundreds of millions doesn’t matter: we’re above the threshold that makes me feel like I’m making the world better.

(( Well… I say that, but it is actually my job to understand the mechanisms behind these  numbers and why they can’t be exact, so I do have a bit of a vested interest. And there are a myriad of technological and behavioural considerations to account for in code and in documentation and in analysis which makes it an interesting job. But, you know. Hundreds of millions is precise enough for my job satisfaction index. ))

But once again we reach the inescapable return to the central thesis. Counting is harder than you think: one of the leading candidates for the Data Team’s motto. (Others include “Well, it depends.” and “¯\_(ツ)_/¯”). And now we’re counting in the open, so you get to experience its difficulty firsthand. Go have another look.



When do All Firefox Users Update?

Last time we talked about updates I wrote about all of what goes into an individual Firefox user’s ability to update to a new release. We looked into how often Firefox checks for updates, and how we sometimes lie and say that there isn’t an update even after release day.

But how does that translate to a population?

Well, let’s look at some pictures. First, the number of “update” pings we received from users during the recent Firefox 61 release:update_ping_volume_release61

This is a real-time look at how many updates were received or installed by Firefox users each minute. There is a plateau’s edge on June 26th shortly after the update went live (around 10am PDT), and then a drop almost exactly 24 hours later when we turned updates off. This plot isn’t the best for looking into the mechanics of how this works since it shows volume of all types of “update” pings from all versions, so it includes users finally installing Firefox Quantum 57 from last November as well as users being granted the fresh update for Firefox 61.

Now that it’s been a week we can look at our derived dataset of “update” pings and get a more nuanced view (actually, latency on this dataset is much lower than a week, but it’s been at least a week). First, here’s the same graph, but filtered to look at only clients who are letting us know they have the Firefox 61 update (“update” ping, reason: “ready”) or they have already updated and are running Firefox 61 for the first time after update (“update” ping, reason: “success”):update_ping_volume_61only_wide

First thing to notice is how closely the two graphs line up. This shows how, during an update, the volume of “update” pings is dominated by those users who are updating to the most recent version.

And it’s also nice validation that we’re looking at the same data historically that we were in real time.

To step into the mechanics, let’s break the graph into its two constituent parts: the users reporting that they’ve received the update (reason: “ready”) and the users reporting that they’re now running the updated Firefox (reason: “success”).


The first graph shows the two lines stacked for maximum similarity to the graphs above. The second unstacks the two so we can examine them individually.

It is now much clearer to see how and when we turned updates on and off during this release. We turned them on June 26, off June 27, then on again June 28. The blue line also shows us some other features: the Canada Day weekend of lower activity June 30 and July 1, and even time-of-day effects where our sharpest peaks are (EDT) 6-8am, 12-2pm, and a noticeable hook at 1am.

(( That first peak is mostly made up of countries in the Central European Timezone UTC+1 (e.g. Germany, France, Poland). Central Europe’s peak is so sharp because Firefox users, like the populations of European countries, are concentrated mostly in that one timezone. The second peak is North and South America (e.g. United States, Brazil). It is broader because of how many timezones the Americas span (6) and how populations are dense in the Eastern Timezone and Pacific Timezone which are 3 hours apart (UTC-5 to UTC-8). The noticeable hook is China and Indonesia. China has one timezone for its entire population (UTC+8), and Indonesia has three. ))

This blue line shows us how far we got in the last post: delivering the updates to the user.

The red line shows why that’s only part of the story, and why we need to look at populations in addition to just an individual user.

Ultimately we want to know how quickly we can reach our users with updated code. We want, on release day, to get our fancy new bells and whistles out to a population of users small enough that if something goes wrong we’re impacting the fewest number of them, but big enough that we can consider their experience representative of what the whole Firefox user population would experience were we to release to all of them.

To do this we have a two levers: we can change how frequently Firefox asks for updates, and we can change how many Firefox installs that ask for updates actually get it right away. That’s about it.

So what happens if we change Firefox to check for updates every 6 hours instead of every 12? Well, that would ensure more users will check for updates during periods they’re offered. It would also increase the likelihood of a given user being offered the update when their Firefox asks for one. It would raise the blue line a bit in those first hours.

What if we change the %ge of update requests that are offers? We could tune up or down the number of users who are offered the updates. That would also raise the blue line in those first hours. We could offer the update to more users faster.

But neither of these things would necessarily increase the speed at which we hit that Goldilocks number of users that is both big enough to be representative, and small enough to be prudent. Why? The red line is why. There is a delay between a Firefox having an update and a user restarting it to take advantage of it.

Users who have an update won’t install it immediately (see how the red line is lower than the blue except when we turn updates off completely), and even if we turn updates off it doesn’t stop users who have the update from installing it (see how the red line continues even when the blue line is floored).

Even with our current practices of serving updates, more users receive the update than install it for at least the first week after release.

If we want to accelerate users getting update code, we need to control the red line. Which we don’t. And likely can’t.

I mean, we can try. When an update has been pending for long enough, you get a little arrow on your Firefox menu: update_arrow.png

If you leave it even longer, we provide a larger piece of UI: a doorhanger.


We can tune how long it takes to show these. We can show the doorhanger immediately when the update is ready, asking the user to stop what they’re doing and–


…maybe we should just wait until users update their own selves, and just offer some mild encouragement if they’ve put it off for, say, four days? We’ll give the user eight days before we show them anything as invasive as the doorhanger. And if they dismiss that doorhanger, we’ll just not show it again and trust they’ll (eventually) restart their browser.

…if only because Windows restarted their whole machines when they weren’t looking.

(( If you want your updates faster and more frequent, may I suggest installing Firefox Beta (updates every week) or Firefox Nightly (updates twice a day)? ))

This means that if the question is “When do all Firefox users update?” the answer is, essentially, “When they want to.” We can try and serve the updates faster, or try to encourage them to restart their browsers sooner… but when all is said and done our users will restart their browsers when they want to restart their browsers, and not a moment before.

Maybe in the future we’ll be able to smoothly update Firefox when we detect the user isn’t using it and when all the information they have entered can be restored. Maybe in the future we’ll be able to update seamlessly by porting the running state from instance to instance. But for now, given our current capabilities, we don’t want to dictate when a user has to restart their browser.

…but what if we could ship new features more quickly to an even more controlled segment of the release population… and do it without restarting the user’s browser? Wouldn’t that be even better than updates?

Well, we might have answers for that… but that’s a subject for another time.


Faster Event Telemetry with “event” Pings

Screenshot_2018-07-04 New Query(1).pngEvent Telemetry is the means by which we can send ordered interaction data from Firefox users back to Mozilla where we can use it to make product decisions.

For example, we know from a histogram that the most popular way of opening the Developer Tools in Firefox Beta 62 is by the shortcut key (Ctrl+Shift+I). And it’s nice to see that the number of times the Javascript Debugger was opened was roughly 1/10th of the number of times the shortcut key was used.

…but are these connected? If so, how?

And the Javascript Profiler is opened only half as often as the Debugger. Why? Isn’t it easy to find that panel from the Debugger? Are users going there directly from the DOM view or is it easier to find from the Debugger?

To determine what parts of Firefox our users are having trouble finding or using, we often need to know the order things happen. That’s where Event Telemetry comes into play: we timestamp things that happen all over the browser so we can see what happens and in what order (and a little bit of how long it took to happen).

Event Telemetry isn’t new: it’s been around for about 2 years now. And for those two years it has been piggy-backing on the workhorse of the Firefox Telemetry system: the “main” ping.

The “main” ping carries a lot of information and is usually sent once per time you close your Firefox (or once per day, whichever is shorter). As such, Event Telemetry was constrained in how it was able to report this ordered data. It takes two whole days to get 95% of it (because that’s how long it takes us to get “main” pings), and it isn’t allowed to send more than one thousand events per process (lest it balloon the size of the “main” ping, causing problems).

This makes the data slow, and possibly incomplete.

With the landing of bug 1460595 in Firefox Nightly 63 last week, Event Telemetry now has its own ping: the “event” ping.

The “event” ping maintains the same 1000-events-per-process-per-ping limit as the “main” ping, but can send pings as frequently as one ping every ten minutes. Typically, though, it waits the full hour before sending as there isn’t any rush. A maximum delay of an hour still makes for low-latency data, and a minimum delay of ten minutes is unlikely to be overrun by event recordings which means we should get all of the events.

This means it takes less time to receive data that is more likely to be complete. This in turn means we can use less of it to get our answers. And it means more efficiency in our decision-making process, which is important when you’re competing against giants.

If you use Event Telemetry to answer your questions with data, now you can look forward to being able to do so faster and with less worry about losing data along the way.

And if you don’t use Event Telemetry to answer your questions, maybe now would be a good time to start.

The “event” ping landed in Firefox Nightly 63 (build id 20180627100027) and I hope to have it uplifted to Firefox Beta 62 in the coming days.

Thanks to :sunahsuh for her excellent work reviewing the proposal and in getting the data into the derived datasets so they can be easily queried, and further thanks to the Data Team for their support.


Some More Very Satisfying Graphs

I guess I just really like graphs that step downwards:

Screenshot_2018-06-27 Telemetry Budget Forecasting

Earlier this week :mreid noticed that our Nightly population suddenly started sending us, on average, 150 fewer kilobytes (uncompressed) of data per ping. And they started doing this in the middle of the previous week.

Step 1 was to panic that we were missing information. However, no one had complained yet and we can usually count on things that break to break loudly, so we cautiously-optimistically put our panic away.

Step 2 was to see if the number of pings changed. It could be we were being flooded with twice as many pings at half the size, for the same volume. This was not the case:

Screenshot_2018-06-27 Telemetry Budget Forecasting(2)

Step 3 was to do some code archaeology to try and determine the “culprit” change that was checked into Firefox and resulted in us sending so much less data. We quickly hit upon the removal of BrowserUITelemetry and that was that.

…except… when I went to thank :Standard8 for removing BrowserUITelemetry and saving us and our users so much bandwidth, he was confused. To the best of his knowledge, BrowserUITelemetry was already not being sent. And then I remembered that, indeed, back in March :janerik had been responsible for stopping many things like BrowserUITelemetry from being sent (since they were unmaintained and unused).

So I fired up an analysis notebook and started poking to see if I could find out what parts of the payload had suddenly decreased in size. Eventually, I generated a plot that showed quite clearly that it was the keyedHistograms section that had decreased so radically.

Screenshot_2018-06-27 main_ping_size - Databricks

Around the same time :janerik found the culprit in the list of changes that went into the build: we are no longer sending a couple of incredibly-verbose keyed histograms because their information is now much more readily available in profiles.

The power of cleaning up old code: removing 150kb from the average “main” ping sent multiple times per day by each and every Firefox Nightly user.

Very satisfying.


Perplexing Graphs: The Case of the 0KB Virtual Memory Allocations

Every Monday and Thursday around 3pm I check dev-telemetry-alerts to see if there have been any changes detected in the distribution of any of the 1500-or-so pieces of anonymous usage statistics we record in Firefox using Firefox Telemetry.

This past Monday there was one. It was a little odd.489b9ce7-84e6-4de0-b52d-e0179a9fdb1a

Generally, when you’re measuring continuous variables (timings, memory allocations…) you don’t see too many of the same value. Sure, there are common values (2GB of physical memory, for instance), but generally you don’t suddenly see a quarter of all reports become 0.

That was weird.

So I did what I always do when I find an alert that no one’s responded to, and triaged it. Mostly this involves looking at it on to see if it was still happening, whether it was caused by a change in submission volumes (could be that we’re suddenly hearing from a lot more users, and they all report just “0”, for example), or whether it was limited to a single operating system or architecture:


Hello, Windows.


Specifically: hello Windows 64-bit.

With these clues, :erahm was able to highlight for me a bug that might have contributed to this sudden change: enabling Control Flow Guard on Windows builds.

Control Flow Guard (CFG) is a feature of Windows 8.1 (Update 3) and 10 that inserts some runtime checks into your binary to ensure you only make sensible jumps. This protects against certain exploits where attackers force a binary to jump into strange places in the running program, causing Bad Things to happen.

I had no idea how a control flow integrity feature would result in 0-size virtual memory allowances, but when :erahm gives you a hint, you take it. I commented on the bug.

Luckily, I was taken seriously, so a new bug was filed and :tjr looked into it almost immediately. The most important clue came from :dmajor who had the smartest money in the room, and crucial help from :ted who was able to reproduce the bug.

It turns out that turning CFG on made our Virtual Memory allowances jump above two terabytes.

Now, to head off “Firefox iz eatang ur RAM!!!!111eleven” commentary: this is CFG’s fault, not ours. (Also: Virtual Memory isn’t RAM.)

In order to determine what parts of a binary are valid “indirect jump targets”, Windows needs to keep track of them all, and do so performantly enough that the jumps can still happen at speed. Windows does this by maintaining a map with a bit per possible jump location. The bit is 1 if it is a valid location to jump to, and 0 if it is not. On each indirect jump, Windows checks the bit for the jump location and interrupts the process if it was about to jump to a forbidden place.

When running this on a 64-bit machine, this bitmap gets… big. Really big. Two Terabytes big. And that’s using an optimized way of storing data about the jump availability of up to 2^64 (18 quintillion) addresses. Windows puts this in the process’ storage allocations for its own recordkeeping reasons, which means that every 64-bit process with CFG enabled (on CFG-aware Windows versions (8.1 Update 3 and 10)) has a 2TB virtual memory allocation.

So. We have an abnormally-large value for Virtual Memory. How does that become 0?

Well, those of you with CS backgrounds (or who clicked on the “smartest money” link a few paragraphs back), will be thinking about the word “overflow”.

And you’d be wrong. Ish.

The raw number :ted was seeing was the number 2201166503936. This number is the number of bytes in his virtual memory allocation and is a few powers of two above what we can fit in 32 bits. However, we report the number of kilobytes. The number of kilobytes is 2149576664, well underneath the maximum number you can store in an unsigned 32-bit integer, which we all know (*eyeroll*) is 4294967296. So instead of a number about 512x too big to fit, we get one that can fit almost twice over.


So we’re left with a number that should fit, being recorded as 0. So I tried some things and, sure enough, recording the number 2149576664 into any histogram did indeed record as 0. I filed a new bug.

Then I tried numbers plus or minus 1 around :ted’s magic number. They became zeros. I tried recording 2^31 + 1. Zero. I tried recording 2^32 – 1. Zero.

With a sinking feeling in my gut, I then tried recording 2^32 + 1. I got my overflow. It recorded as 1. 2^32 + 2 recorded as 2. And so on.

All numbers between 2^31 and 2^32 were being recorded as 0.


In a sensible language like Rust, assigning an unsigned value to a signed variable isn’t something you can do accidentally. You almost never want to do it, so why make it easy? And let’s make sure to warn the code author that they’re probably making a mistake while we’re at it.

In C++, however, you can silently convert from unsigned to signed. For values between 0 and 2^31 this doesn’t matter. For values between 2^31 and 2^32, this means you can turn a large positive number into a negative number somewhere between -2^31 and -1. Silently.

Telemetry Histograms don’t record negatives. We clamp them to 0. But something in our code was coercing our fancy unsigned 32-bit integer to a signed one before it was clamped to 0. And it was doing it silently. Because C++.

Now that we’ve found the problem, fixed the problem, and documented the problem we are collecting data about the data[citation] we may have lost because of the problem.

But to get there I had to receive an automated alert (which I had to manually check), split the data against available populations, become incredibly lucky and run it by :erahm who had an idea of what it might be, find a team willing to take me seriously, and then do battle with silent type coercion in a language that really should know better.

All in a day’s work, I guess?