Data Science is Hard: ALSA in Firefox

(( We’re overdue for another episode in this series on how Data Science is Hard. Today’s is a story from 2016 which I think illustrates many important things to do with data. ))

It’s story time. Gather ’round.

In July of 2016, Anthony Jones made the case that the Mozilla-built Firefox for Linux should stop supporting the ALSA backend (and also the WinXP WinMM backend) so that we could innovate on features for more modern audio backends.

(( You don’t need to know what an audio backend is to understand this story. ))

The code supporting ALSA would remain in tree for any Linux distribution that wished to maintain the backend and build it themselves, but Mozilla would stop shipping Firefox with that code in it.

But how could we ensure the number of Firefoxen relying on this backend was small enough that we wouldn’t be removing something our users desperately needed? Luckily :padenot had just added an audio backend measurement to Telemetry. “We’ll have data soon,” he wrote.

By the end of August we’d heard from Firefox Nightly and Firefox Developer Edition that only 3.5% and 2% (respectively) of Linux subsessions with audio used ALSA. This was small enough for the removal to move ahead.

Fast-forward to March of 2017. Seven months have passed. The removal has wound its way through Nightly, Developer Edition, and Beta, and has now reached the stable Release channel. Linux users on that channel update their Firefox and… suddenly the web grows silent for a large number of them.

Bugs are filed (thirteen of them). The mailing list thread with Anthony’s original proposal is revived with some very angry language. It seems as though far more than just a fraction of a fraction of users were using ALSA. There were entire Linux distributions that didn’t ship anything besides ALSA. How did Telemetry miss them?

It turns out that many of those same ALSA-only Linux distributions also turned off Telemetry when they repackaged Firefox for their users. And for any that shipped with Telemetry at all, many users disabled it themselves. Those users’ Firefoxen had no way to phone home to tell Mozilla how important ALSA was to them… and now it was too late.

Those Linux distributions started building ALSA support into the Firefox builds they distribute… and hopefully began enabling Telemetry by default to prevent this from happening again. I don’t know for sure whether they did (we don’t collect fine-grained information like that because we don’t need it).

But it serves as a cautionary tale: Mozilla can only support a finite number of things. Far fewer now than we did back in 2016. We prioritize what we support based on its simplicity and its reach. That first one we can see for ourselves, and for the second we rely on data collection like Telemetry to tell us.

Counting things is harder than it looks. Counting things that are invisible is damn near impossible. So if you want to be counted: turn Telemetry on (it’s in the Preferences) and leave it on.

:chutten

Data Science is Hard: Validating Data for Glean

Glean is a new library for collecting data in Mozilla products. It’s been shipping in Firefox Preview for a little while and I’d like to take a minute to talk about how I validated that it sends what we think it’s sending.

Validating new data collections in an existing system like Firefox Desktop Telemetry is a game of comparing against things we already know. We know that some percentage of the data we receive is just garbage: bad dates, malformed records, attempts at buffer overflows and SQL injection. If the amount of garbage in the new collection is in line with the amount of garbage we normally see, we count it as “good enough” and move on.
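To make “good enough” a little more concrete, here’s the shape of that comparison as a minimal Python sketch. The table and column names (a per-ping parse_status, for instance) are placeholders I’ve invented for illustration, not the real pipeline schema.

```python
import pandas as pd

# Hypothetical decoded-ping tables: one row per received ping, with a
# parse_status column of "ok" or some error label. These names are
# placeholders for illustration, not the real pipeline schema.
old = pd.read_parquet("desktop_telemetry_pings.parquet")
new = pd.read_parquet("glean_pings.parquet")

def garbage_rate(pings: pd.DataFrame) -> float:
    """Fraction of received pings that failed parsing or validation."""
    return (pings["parse_status"] != "ok").mean()

baseline = garbage_rate(old)
candidate = garbage_rate(new)

# "Good enough" just means: not noticeably worse than the garbage rate
# we already tolerate from the existing system. The 10% margin is arbitrary.
print(f"baseline garbage rate:  {baseline:.4%}")
print(f"candidate garbage rate: {candidate:.4%}")
print("within tolerance" if candidate <= baseline * 1.1 else "worth investigating")
```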

With new data collection from a new system like Glean coming from new endpoints like the reference browser and Firefox Preview, we’re given an opportunity to compare against the ideal. Maybe the correct number of failures is 0?

But what is a failure? What is acceptable behaviour?

We have an “events” ping in Glean: can the amount of time covered by the events’ timestamps ever exceed the amount of time covered by the ping? I didn’t think so, but apparently it’s an expected outcome when the events were restored from a previous session.
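The check itself is easy to express once you know to look for it. Here’s a rough sketch, again with invented column names (ping_start, ping_end, and a list of per-event millisecond offsets), so treat it as an illustration of the idea rather than the real events table:

```python
import pandas as pd

# One row per "events" ping. ping_start and ping_end stand in for the
# ping's start/end datetimes; event_offsets_ms is the list of per-event
# millisecond timestamps. All names are illustrative.
events_pings = pd.read_parquet("glean_events_pings.parquet")

ping_span_ms = (
    events_pings["ping_end"] - events_pings["ping_start"]
).dt.total_seconds() * 1000

event_span_ms = events_pings["event_offsets_ms"].apply(
    lambda offsets: max(offsets) - min(offsets) if len(offsets) > 1 else 0
)

# Naively this should be empty, but events restored from a previous session
# can legitimately cover more time than the ping that carries them.
suspicious = events_pings[event_span_ms > ping_span_ms]
print(f"{len(suspicious)} of {len(events_pings)} events pings exceed their ping's span")
```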

So how do you validate something that has unknown error states?

I started with a list of questions you’d ask of any client-based, network-transmitted data collection system (a rough sketch of how you might tally the answers follows the list):

  • How many pings (data transmissions) are there?
  • How many measurements are in those pings?
  • How many clients are sending these pings?
  • How often?
  • How long do they take to get to our servers?
  • How many poorly-structured pings are sent? By how many clients? How often?
  • How many pings with bad values are sent? By how many clients? How often?
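Here’s the tally sketch I promised above. It assumes a hypothetical decoded-ping table with one row per received ping and columns like client_id, submission_timestamp, client_timestamp, and parse_status; those names are placeholders, not the actual dataset schema.

```python
import pandas as pd

# One row per received ping; client_id, submission_timestamp,
# client_timestamp, and parse_status are placeholder column names.
pings = pd.read_parquet("glean_pings.parquet")

n_pings = len(pings)
n_clients = pings["client_id"].nunique()
print("pings:       ", n_pings)
print("clients:     ", n_clients)
print("pings/client:", n_pings / n_clients)

# How long do pings take to reach the servers?
latency = pings["submission_timestamp"] - pings["client_timestamp"]
print(latency.describe(percentiles=[0.5, 0.95, 0.99]))

# How much of what we receive is malformed, and from how many clients?
bad = pings[pings["parse_status"] != "ok"]
print("bad pings:   ", len(bad), "from", bad["client_id"].nunique(), "clients")
```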

From there we can dive into validating specifics about the data collections:

  • Do the events in the “events” ping have timestamps with reasonable separations? (What does reasonable mean here? Well, it could be anything, but if the span between two timestamps is measured in years, and the app has only been available for some number of weeks, it’s probably broken.)
  • Are the GUIDs in the pings actually globally unique? Are we seeing duplicates? (We are, but not many)
  • Are there other ping fields that should be unique, but aren’t? (For Glean, no client should ever send the same ping type with the same sequence number. But that kind of duplicate appears, too. A sketch of both duplicate checks follows this list.)
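And here’s the duplicate-check sketch promised in that last item, with the same caveat about invented column names (document_id for the ping’s GUID, seq for the per-ping-type sequence number):

```python
import pandas as pd

# Placeholder columns again: document_id is the ping's GUID, seq is the
# per-ping-type sequence number assigned on the client.
pings = pd.read_parquet("glean_pings.parquet")

# Are the ping GUIDs actually globally unique?
dupe_ids = pings["document_id"].duplicated().sum()
print(f"duplicate document_ids: {dupe_ids} of {len(pings)}")

# No client should ever send the same ping type with the same sequence number.
dupe_seqs = pings.duplicated(subset=["client_id", "ping_type", "seq"]).sum()
print(f"duplicate (client_id, ping_type, seq) triples: {dupe_seqs}")
```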

Once we can establish confidence in the overall health of the data reporting mechanism we can start using it to report errors about itself:

  • Ping transmission should be quick (because pings are small). Assuming the ping transmission time is 0, how far away are the clients’ clocks from the server’s clock? (AKA “clock skew”. It turns out that mobile clients’ clocks are more reliable than desktop clients’ clocks, at least in the prerelease population. We’ll see what happens when we start measuring non-beta users.)
  • How many errors are reported by internal error-reporting metrics? How many send failures? How many times did the app try to record a string that was too long?
  • What measurements are in the ping? Are they only the ones we expect to see? Are they showing up in reasonable proportions relative to each other and to the number of clients and pings reporting them? (A sketch of this check follows the list.)
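A sketch of that last check might look something like this, assuming an illustrative per-ping list of metric names and a documented registry to compare against. The real tables aren’t shaped like this, so take it as the idea rather than a working query:

```python
import pandas as pd

# metric_names is an illustrative per-ping list of which metrics the ping
# contains; expected_names would come from the documented metric registry.
pings = pd.read_parquet("glean_metrics_pings.parquet")
expected_names = set(pd.read_json("metric_registry.json")["name"])

# Anything we collect but never documented deserves a closer look.
seen_names = set(pings["metric_names"].explode().dropna())
print("undocumented metrics:", sorted(seen_names - expected_names))

# What share of pings and clients report each metric?
per_metric = (
    pings[["client_id", "metric_names"]]
    .explode("metric_names")
    .groupby("metric_names")
    .agg(ping_count=("client_id", "size"), client_count=("client_id", "nunique"))
)
per_metric["ping_share"] = per_metric["ping_count"] / len(pings)
per_metric["client_share"] = per_metric["client_count"] / pings["client_id"].nunique()
print(per_metric.sort_values("ping_share", ascending=False).head(20))
```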

All these attempts to determine what is reasonable and what is correct depend on a strong foundation of documentation. I can read the code that collects the data and sends the pings… but that tells me what is correct relative to what is implemented, not what is correct relative to what is intended.

By validating to the documentation, to what is intended, we can not only find bugs in the code, we can find bugs in the docs. And a data collection system lives and dies on its documentation: it is in many ways a more important part of the “product” than the code.

At this point, aside from the “metrics” ping which is awaiting validation after some fixes reach saturation in the population, Glean has passed all of these criteria acceptably. It still has a bit of a duplicate ping problem, but its clock skew and latency are much lower than Firefox Desktop’s. There are some outrageous clients sending dozens of pings over a period in which they should be sending a handful, but those might just be test clients whose values will disappear into the noise when the user population grows.

:chutten

Data Science is Festive: Christmas Light Reliability by Colour

This past weekend was a balmy 5 degrees Celsius which was lucky for me as I had to once again climb onto the roof of my house to deal with my Christmas lights. The middle two strings had failed bulbs somewhere along their length and I had a decent expectation that it was the Blue ones. Again.

Two years ago was our first autumn at our new house. The house needed Christmas lights, so we bought four strings of them. Over the course of their December tour they suffered devastating bulb failures, rendering alternating strings inoperable. (The bulbs in each string are wired in series along a single strand, so one failed bulb takes down the whole string. The pass-through connection is maintained, though, so power still flows to strings further along the chain.)


Last year I tested the four strings and found them all faulty. We bought two replacement strings and I scavenged all the working bulbs from one of the strings to make three working strings out of the old four. All five (four in use, one in reserve) survived the season in working order.


This year, in performing my sanity check before climbing the ladder, I had to replace lamps in all three of the original strings to get them back into operating condition. Again.

And then I had an idea. A nerdy idea.

I had myself a wonderful nerdy idea!

“I know just what to do!” I laughed like an old miser.

I’ll gather some data and then visualize’er!

The strings are penta-colour: Red, Orange, Yellow, Green, and Blue. Each string has about an equal number of each colour of bulb and an extra Red and Yellow replacement bulb. Each bulb is made up of an internal LED lamp and an external plastic globe.

The LED lamps are the things that fail either from corrosion on the contacts or from something internal to the diode.

So I started with 5N+12 lamps and 5N+12 globes in total: N of each of the five colours, plus the extra Red and Yellow per string. Whenever a lamp died I kept its globe. So the losses over time should manifest themselves as a surplus of globes and a deficit of lamps.

If the losses were equal amongst the colours we’d see an equal surplus of Green, Orange, and Blue globes and a slightly lower surplus of Red and Yellow globes (because of the extras). This is not what I saw when I lined them all up, though:

[Image: the globes and LED lamps lined up by colour, histogram-style. Blue globes are the most numerous, followed by yellow, green, then red; yellow LED lamps are the most numerous, followed by red and green.]

Instead we find ourselves with no oranges (I fitted all the extra oranges into empty blue spots when consolidating), an equal number of lamps and globes of yellow (yellow being one of the colours adjacent to most broken bulbs and, thus, less likely to be chosen for replacement), a mild surplus of red (one red lamp had evidently failed at one point), a larger surplus of green globes (four failed green lamps isn’t great but isn’t bad)…

And 14 excess blue globes.

Now, my sampling frequency isn’t all that high. And my knowledge of confidence intervals is a little rusty. But that’s what I think I can safely call a statistical outlier. I’m pretty sure we can conclude that, on my original set of strings of Christmas lights, Blue LEDs are more likely to fail than any other colour. But why?
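Before getting to the why: if you want to put a rough number on that hunch, here’s one way to do it. It’s a chi-square goodness-of-fit test against the “every colour fails equally” hypothesis, using the failure counts implied by the tally above (1 red, 0 yellow, 4 green, 14 blue; orange excluded because those bulbs got shuffled around during consolidation). With expected counts under five the chi-square approximation is rough, so treat the p-value as directional.

```python
from scipy.stats import chisquare

# Observed lamp failures per colour, from the tally above. Orange is excluded
# because those bulbs were redistributed when the strings were consolidated.
observed = {"red": 1, "yellow": 0, "green": 4, "blue": 14}

# Null hypothesis: every colour is equally likely to fail, so each of these
# four colours should account for roughly a quarter of the 19 failures.
stat, p_value = chisquare(list(observed.values()))

print(f"chi-square statistic: {stat:.2f}")  # ~25.8
print(f"p-value: {p_value:.1e}")            # on the order of 1e-5: hard to square with equal failure rates
```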

I know from my LED history that high-luminance blue LEDs took the longest to be invented (patents filed in 1993, over 30 years after the first red LED). I learned from my friend who works at a display company that blue LEDs are more expensive. Taking those together, I can suppose that perhaps the manufacturers of my light strings cheaped out on their lot of blue LEDs one year and stuck me, the consumer, with substandard lamps.

Instead of bringing joy, it brought frustration. But also predictive power, because, you know what? On the two broken strings I had to climb up to retrieve this past, unseasonably warm Saturday, two of the four failed bulbs were indeed, as I said at the top, the Blue ones. Again.

 

:chutten