(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).
At Mozilla we make, among other things, Web Browsers which we tend to call Firefox. The central activity in a Web Browser like Firefox is loading a web page. It gets done a lot by each and every one of our users, and so you can imagine that data about pageloads is of important business interest to us.
But exactly because this is done a lot and by every one of our users, this inspires concerns of scale and cost. How much does it cost us to learn more about pageloads?
As with all things in Data, the answer is the same: “Well, it depends.”
In this case it depends on how you record the data. How you record the data depends on what questions you hope to answer with it. We’re going to stick to the simplest of questions to make this (highly-suspect) comparison even remotely comparable.
Option 1: Just the Counts, Ma’am
I say page loads are done a lot, but how much is “a lot”? If that’s our only question, maybe the data we need is simply a count of pageloads. Glean already has a metric type for counting things, so it should be fairly quick to implement.
This should be cheap, right? Just a single number? Well, it depends.
Scale 1: Frequency
The count of pageloads is just a single number. One, maybe as many as eight, bytes to record, store, transmit, retain, and analyze. But Firefox has to report it more than once, so we need to first scale our cost of “one, maybe as many as eight, bytes” by the number of times we send this information.
When we first implemented Firefox’s pageload count in Glean, I wanted to send it on the builtin “metrics” ping which is sent once a day from anyone running Firefox that day. In an effort to gain more complete and timely data, we ended up adding it to the builtin “baseline” ping which is sent (on average for Firefox Desktop) 8 or more times per day.
For our frequency scale we thus use 8/day.
Scale 2: Population
These 8 recordings per day are sent by about 200M users over a month. Days and months aren’t easy to scale between as not all users use Firefox every day, and our population gains new users and loses old users at variable rates… so I recalculated the Frequency scale to be in terms of months and found that we get 68 pings per month from these roughly 200M users.
So the cost is pretty easy to calculate then? Whatever the cost is of storing and transmitting 200M x 68/month x eight bytes ~= 109 GB?
Not entirely. But until and unless those other costs are not comparable between options, we can just treat them as noise. This cost, rendered in the size of the data, of about 109GB? It’ll do.
Option 2: What an Event
Page loads are interesting not just in how many of them there are, but also about what type of load they are and how long the load took. The order of a page load in between other events might also be of interest: did it happen before or after some network trouble? Did a bunch of pageloads happen all at once, or spread across the day? We might wish to instrument page loads as Glean events.
Events are each more expensive than a count. They carry a timestamp (eight bytes) and repeat their names each time they’re recorded (some strings, say fifteen bytes).
(We are not counting the load type or how long the load took in our calculations of the size of an individual sample as we’re still trying to compare methods of answering the same “How many page loads are there?” question.)
Scale 3: Page Loads
“Each time they’re recorded”, huh. Guess that means we get to multiply by the number of page loads. Each Firefox Desktop user, over the course of a month, loads on average 1190 pages. This means instead of sending 68 numbers a month, we’re sending 1190 batches of strings a month.
So the comparable cost is whatever the cost is of storing and transmitting 200M x (eight bytes and fifteen bytes) x 1190 ~= 5.47TB..
We’ve jumped an order of magnitude here. And we’re not done.
Option 3: Custom Pings, and Custom Pings Only
What if the context we wish to record alongside the event of a page load cannot fit inside Glean’s prudent “event” metric type limits? What if the collected pageload data would benefit from a retention limit or access control list different from other counts or events? What if you want to submit this data to be uploaded as soon as it has been recorded? In that case, we could send a pageload as a Glean custom ping.
We’ve not (yet) done this in Firefox Desktop (at least partially because it complicates ordering amongst other events: the Glean SDK expends a lot of effort to ensure the timestamps between events are reliable. Ping times are client times which are subject to the whims of the user.), so I’m going to get even hand-wavier than before as I try to determine how large each individual data sample will be.
A Glean custom ping without any metrics in it comes to around 500 bytes. When our data platform ingests the ping and turns it into a row in a dataset, we add some metadata which adds another 300 bytes or so (which only affects storage inside the Data Platform and doesn’t add costs to client storage or client bandwidth).
We could go deeper and cost out the network headers, the costs of using TLS to ensure the integrity of the connection… but we’d be here all day. So I’m gonna call that 200 bytes to make it a nice round 1000 bytes per ping.
We’re sending these pings per pageload, so the cost is whatever the cost is of storing and transmitting 200M x 1190 x 1000 bytes = 238TB.
Rule of Thumb: 50x
There you have it: for each step up the cost ladder you’re adding an extra 50x multiplier to the cost of storing and transmitting the data. The reality’s actually much worse if it’s harder to analyze and reason about the data as it gets more complex (which it in most cases is) because, as you might remember from one of my previous explorations in costing out metrics: it’s the human costs of things (like analysis) that really getcha.
But you have to balance it out. If adding more context and information ensures your analyses only have to look in one place for its data instead of trying to tie together loosely-coupled concepts from multiple locations… if using a custom ping ensures you have everything you need and don’t have to form a committee to resource an engineer to add implementation which needs to be deployed and individually validated… if you’re willing to bet 50x or 250x the cost on getting it right the first time, then that could be a good price to pay.
But is this the case for you and your data?
Well, it depends.
: Avid readers of this blog may notice that this isn’t the first time I’ve written on the costs of data. And it likely won’t be the last!
: Yes there are some wild and wacky outliers included in the figure “an average of 1190 page loads” that I’m not bothering to clean up. You can Page Loads Georg to your hearts’ content.
: This is about how many characters the JSON-encoded ping payload comes to, uncompressed.