This Week in Glean: Glean is Frictionless Data Collection

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

So you want to collect data in your project? Okay, it’s pretty straightforward.

  1. API: You need a way to combine the name of your data with the value that data has. Ideally you want it to be ergonomic for your developers to encourage them to instrument things without asking you for help, so it should include as many compile-time checks as you can and should be friendly to the IDEs and languages in use. Note the plurals.
  2. Persistent Storage: Keyed by the name of your data, you need some place to put the value. Ideally this will be common regardless of the instrumentation’s language or thread of execution. And since you really don’t want crashes or sudden application shutdowns or power outages to cause you to lose everything, you need to persist this storage. You can write it to a file on disk (if your platforms have such access), but be sure to write the serialization and deserialization functions with backwards-compatibility in mind because you’ll eventually need to change the format.
  3. Networking: Data stored with the product has its uses, but chances are you want this data to be combined with more data from other installations. You don’t need to write the network code yourself (there are libraries for HTTPS, after all), but you’ll need to write a protocol on top of it to serialize your data for transmission.
  4. Scheduling: Sending data each time a new piece of instrumentation comes in might be acceptable for some products whose nature is only-online. Messaging apps and MMOs send so much low-latency data all the time that you might as well send your data as it comes in. But chances are you aren’t writing something like that, or you respect the bandwidth of your users too much to waste it, so you’ll only want to be sending data occasionally. Maybe daily. Maybe when the user isn’t in the middle of something. Maybe regularly. Maybe when the stored data reaches a certain size. This could get complicated, so spend some time here and don’t be afraid to change it as you find new corners.
  5. Errors: Things will go wrong. Instrumentation will, despite your ergonomic API, do something wrong and write the wrong value or call stop() before start(). Your networking code will encounter the weirdness of the full Internet. Your storage will get full. You need some way to communicate the health of your data collection system to yourself (the owner who needs to adjust scheduling and persistence and other stuff to decrease errors) and to others (devs who need to fix their instrumentation, analysts who should be told if there’s a problem with the data, QA so they can write tests for these corner cases).
  6. Ingestion: You’ll need something on the Internet listening for your data coming in. It’ll need to scale to the size of your product’s user base and be resilient to Internet Attacks. It should speak the protocol you defined in #3, so you should probably have some sort of machine-readable definition of that protocol that product and ingestion can share. And you should spend some time thinking about what to do when an old product with an old version of the protocol wants to send data to your latest ingestion endpoint.
  7. Pipeline: Not all data will go to the same place. Some is from a different product. Some adheres to a different schema. Some is wrong but ingestion (because it needs to scale) couldn’t do the verification of it, so now you need to discard it more expensively. Thus you’ll be wanting some sort of routing infrastructure to take ingested data and do some processing on it.
  8. Warehousing: Once you receive all these raw payloads you’ll need a place to put them. You’ll want this place to be scalable, high-performance, and highly-available.
  9. Datasets: Performing analysis to gain insight from raw payloads is possible (even I have done it), but it is far more pleasant to consolidate like payloads with like, perhaps ordered or partitioned by time and by some dimensions within the payload that’ll make analyses quicker. Maybe you’ll want to split payloads into multiple rows of a tabular dataset, or combine multiple payloads into single rows. Talk to the people doing the analyses and ask them what would make their lives easier.
  10. Tooling: Democratizing data analysis is a good way to scale up the number of insights your organization can find at once, and it’s a good way to build data intuition. You might want to consider low-barrier data analysis tooling to encourage exploration. You might also want to consider some high-barrier data tooling for operational analyses and monitoring (good to know that the update is rolling out properly and isn’t bricking users’ devices). And some things for the middle ground of folks that know data and have questions, but don’t know SQL or Python or R.
  11. Tests: Don’t forget that every piece of this should be testable and tested in isolation and in integration. If you can manage it, a suite of end-to-end tests does wonders for making you feel good that the whole system will continue to work as you develop it.
  12. Documentation: You’ll need two types of documentation: User and Developer. The former is for the “user” of the piece (developers who wish to instrument back in #1, analysts who have questions that need answering in #10). The latter is for anyone going in trying to understand the “Why” and “How” of the pieces’ architecture and design choices.
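
To make a few of those items a little more concrete, here is a minimal sketch in Python of items 1, 2, 4, and 5: a tiny timing API, a versioned file on disk, a "should we send yet?" check, and error counting for the stop()-before-start() case. Everything in it (TimingStore, FORMAT_VERSION, should_submit, the JSON layout, the thresholds) is invented for illustration. It is not how Glean or any other real system does it, and it ignores thread safety entirely.

```python
import json
import os
import tempfile
import time

FORMAT_VERSION = 2  # hypothetical: bump whenever the on-disk layout changes


class TimingStore:
    """Toy store for timing metrics, persisted as versioned JSON."""

    def __init__(self, path):
        self.path = path
        self.values = {}    # metric name -> list of elapsed seconds
        self.errors = {}    # metric name -> count of misuse errors
        self._started = {}  # metric name -> start timestamp (not persisted)
        self._load()

    def _load(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: version 1 files had no "version" and no "errors" key.
        version = data.get("version", 1)
        self.values = data.get("values", {})
        self.errors = data.get("errors", {}) if version >= 2 else {}

    def _save(self):
        payload = {
            "version": FORMAT_VERSION,
            "values": self.values,
            "errors": self.errors,
        }
        # Write to a temporary file in the same directory, then rename it
        # over the old one, so a crash mid-write leaves the old file intact.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        os.replace(tmp, self.path)

    def start(self, name, now=None):
        self._started[name] = time.monotonic() if now is None else now

    def stop(self, name, now=None):
        if name not in self._started:
            # stop() before start(): count the mistake instead of raising,
            # so the health of the instrumentation can be reported later.
            self.errors[name] = self.errors.get(name, 0) + 1
        else:
            end = time.monotonic() if now is None else now
            elapsed = end - self._started.pop(name)
            self.values.setdefault(name, []).append(elapsed)
        self._save()


def should_submit(store, last_sent, now, max_age=86400, max_values=100):
    """Send when there is data and it is either old enough or big enough."""
    pending = sum(len(samples) for samples in store.values.values())
    if pending == 0:
        return False
    return (now - last_sent) >= max_age or pending >= max_values
```

Even this toy has to make decisions (what a version 1 file looked like, whether an error should raise or be counted, what "old enough" means) that a real system would need to document, test, and eventually change. That is rather the point of the list above.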

You get all that? Thread safety. File formats. Networking protocols. Scheduling using real wall-clock time. Schema validation. Open ports on the Internet. At scale. User-facing tools and documentation. All tested and verified.

Look, I said it’d be straightforward, not that it’d be easy. I’m sure it’ll only take you a few years and a couple tries to get it right.

Or, y’know, if you’re a Mozilla project you could just use Glean which already has all of these things…

  1. API: The Glean SDK API aims to be ergonomic and idiomatic in each supported language.
  2. Persistent Storage: The Glean SDK uses rkv as a persistent store for unsubmitted data, and a documented flat file format for submitted but not yet sent data.
  3. Networking: The Glean SDK provides an API for embedding applications to provide their own networking stack (useful when we’re embedded in a browser), and some default implementations if you don’t care to provide one. The payload protocol is built on Structured Ingestion and has a schema that generates and deploys new versions daily.
  4. Scheduling: Each Glean SDK payload has its own schedule to respect the character of the data it contains, from as frequently as the user foregrounds the app to, at most, once a day.
  5. Errors: The Glean SDK builds both error reporting for user metrics and internal health metrics into the SDK itself.
  6. Ingestion: The edge servers and schema validation are all documented and tested. We autoscale quite well and have a process for handling incidents.
  7. Pipeline: We have a pubsub system on GCP that handles a variety of different types of data.
  8. Warehousing: I can’t remember if we still call this the Data Lake or not.
  9. Datasets: We have a few. They are monitored. Our workflow software for deriving the datasets is monitored as well.
  10. Tooling: Quite a few of them are linked from the Telemetry Index.
  11. Tests: Each piece is tested individually. Adjacent pieces sometimes have integration suites. And Raphael recently spun up end-to-end tests that we’re very appreciative of. And if you’re just a dev wondering if your new instrumentation is working? We have the debug ping viewer.
  12. Documentation: Each piece has developer documentation. Some pieces, like the SDK, also have user documentation. And the system at large? Even more documentation.

Glean takes this incredibly complex problem, breaks it into pieces, solves each piece individually, then puts the solution together in a way that makes it greater than the sum of its parts.

All you need is to follow the six steps to integrate the Glean SDK and notify the Ecosystem that your project exists, and then your responsibilities shrink to just instrumentation and analysis.

If that isn’t frictionless data collection, I don’t know what is.

:chutten

(( If you’re not a Mozilla project, and thus don’t by default get to use the Data Platform (numbers 6-10) for your project, come find us on the #glean channel on Matrix and we’ll see what help we can get you. ))

Data Science is Hard: ALSA in Firefox

(( We’re overdue for another episode in this series on how Data Science is Hard. Today is a story from 2016 which I think illustrates many important things to do with data. ))

It’s story time. Gather ’round.

In July of 2016, Anthony Jones made the case that the Mozilla-built Firefox for Linux should stop supporting the ALSA backend (and also the WinXP WinMM backend) so that we could innovate on features for more modern audio backends.

(( You don’t need to know what an audio backend is to understand this story. ))

The code supporting ALSA would remain in tree for any Linux distribution that wished to maintain the backend and build it for themselves, but Mozilla would stop shipping Firefox with that code in it.

But how could we ensure the number of Firefoxen relying on this backend was small enough that we wouldn’t be removing something our users desperately needed? Luckily :padenot had just added an audio backend measurement to Telemetry. “We’ll have data soon,” he wrote.

By the end of August we’d heard from Firefox Nightly and Firefox Developer Edition that only 3.5% and 2% (respectively) of Linux subsessions with audio used ALSA. This was small enough for the removal to move ahead.

Fast-forward to March of 2017. Seven months have passed. The removal has wound its way through Nightly, Developer Edition, Beta, and now into the stable Release channel. Linux users following this update channel update their Firefox and… suddenly the web grows silent for a large number of users.

Bugs are filed (thirteen of them). The mailing list thread with Anthony’s original proposal is revived with some very angry language. It seems as though far more than just a fraction of a fraction of users were using ALSA. There were entire Linux distributions that didn’t ship anything besides ALSA. How did Telemetry miss them?

It turns out that many of those same ALSA-only Linux distributions also turned off Telemetry when they repackaged Firefox for their users. And for any that shipped with Telemetry at all, many users disabled it themselves. Those users’ Firefoxen had no way to phone home to tell Mozilla how important ALSA was to them… and now it was too late.
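
If you want to see how that kind of gap opens up, here is a small Python sketch of the arithmetic. The population sizes and rates are entirely made up for illustration (they are not Mozilla's numbers, and this counts users where the measurement above counted subsessions), and observed_share and true_share are hypothetical helper names. The only point is that a group that never reports contributes nothing to what Telemetry sees, no matter how heavily it relies on a feature.

```python
def observed_share(groups):
    """Share of *reporting* users who use the feature.

    groups: list of (users, share_using_feature, share_reporting) tuples.
    """
    reporting = sum(users * rep for users, _, rep in groups)
    reporting_with_feature = sum(users * use * rep for users, use, rep in groups)
    return reporting_with_feature / reporting


def true_share(groups):
    """Share of *all* users who use the feature, reporting or not."""
    total = sum(users for users, _, _ in groups)
    with_feature = sum(users * use for users, use, _ in groups)
    return with_feature / total


# Hypothetical population: most users are on distributions where the
# feature is rare and half of them report; a minority are on distributions
# where everyone uses the feature and nobody reports.
groups = [
    (900, 0.02, 0.5),  # feature is rare, Telemetry often on
    (100, 1.00, 0.0),  # feature is universal, Telemetry off
]

print(f"observed: {observed_share(groups):.1%}")
print(f"true:     {true_share(groups):.1%}")
```

With these invented numbers the reporting population says about 2% while the real figure is closer to 12%. The second group is invisible, and no amount of careful counting among the users who do report will reveal it.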

Those Linux distributions started building ALSA support into their distributed Firefox builds… and hopefully began reporting Telemetry by default to prevent this from happening again. I don’t know if they did for sure (we don’t collect fine-grained information like that because we don’t need it).

But it serves as a cautionary tale: Mozilla can only support a finite number of things. Far fewer now than we did back in 2016. We prioritize what we support based on its simplicity and its reach. That first one we can see for ourselves, and for the second we rely on data collection like Telemetry to tell us.

Counting things is harder than it looks. Counting things that are invisible is damn near impossible. So if you want to be counted: turn Telemetry on (it’s in the Preferences) and leave it on.

:chutten