Never Look at the Data: Why did we start getting so many pings from Korea?

Something happened on January 5, 2023. We abruptly started receiving pings from Firefox Desktop clients in Korea at a volume equal to twice the size of the entire Korean Firefox Desktop population.

What happened? How did we notice it? What did we do about it?

Let’s back up.

I can’t remember where I learned it, but I’d already started reciting as dogma in my first year of University: “The most important part about any feature is the ability to turn it off”. It’s served me well through my studies and my career. I’ve also found it to be especially true for data collection systems where, for whatever reason, as a user you might decide you no longer want the software you’re using to continue to send data. In some places this is even enshrined in laws where you can request the deletion of data that has already been collected.

Law or not, Mozilla made it easy before, makes it easy now, and will always make it easy for you to decide whether to send data to Mozilla. We may not understand why you make that choice, and it definitely will make it harder for us to ensure our products meet your needs, but we’ll respect the heck out of your choice in our processes and in our products.

This is why, when Mozilla’s data collection system Glean is told the user went from allowing data upload to forbidding it, we send one final “deletion-request” ping before shutting down. The “deletion-request” ping contains all the internal identifiers we’ve used to longitudinally group data (if we receive ten crash reports it’s important to know whether it’s the same Firefox crashing ten times or if it’s ten Firefoxes crashing once), and we use those identifiers to (well) identify what data we’ve collected that we’re now going to delete.

For the purposes of this story you’ll need to know that there are two times when Glean notices the product’s gone from “data upload: on” to “data upload: off”: while Glean is running, and during Glean startup. If Glean’s running, then we just handle things – we were told the setting changed from “data upload: on” to “data upload: off” and away we go. But Glean knows that it isn’t always listening to the data upload setting, so if it starts up with “data upload: off” and the last time it shut down we were at “data upload: on”, we’ll send a specific “at_init”-reason “deletion-request” ping.
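
To make the two paths concrete, here’s a minimal sketch in Python of when each reason gets used. This is illustrative only – not the Glean SDK’s actual internals – and everything except the “at_init” reason string is a name I made up for the example.

def submit_deletion_request(reason: str) -> None:
    # Stand-in for building and uploading the real ping.
    print(f'submitting "deletion-request" ping (reason={reason})')

def set_upload_enabled(state: dict, enabled: bool) -> None:
    # Path 1: the setting flips while Glean is running.
    if state["upload_enabled"] and not enabled:
        submit_deletion_request(reason="set_upload_enabled")
    state["upload_enabled"] = enabled

def initialize(state: dict, enabled_now: bool) -> None:
    # Path 2: Glean starts up with "data upload: off" after having
    # shut down with "data upload: on".
    if state["upload_enabled"] and not enabled_now:
        submit_deletion_request(reason="at_init")
    state["upload_enabled"] = enabled_now

# e.g. the user (or some automation) flipped the setting between runs:
persisted = {"upload_enabled": True}  # what the last run left on disk
initialize(persisted, False)          # -> reason=at_init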

We in the Data Org monitor how Glean is behaving. One thing we’ve learned about how Glean behaves is that the number of “deletion-request” pings is roughly constant over time. And the proportion of “deletion-request” pings that have the “at_init” reason should remain fairly fixed, too.
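
As a sketch of what that kind of monitoring can look like (hypothetical Python, not our actual alerting), you can flag any day whose ping volume or “at_init” proportion jumps well above a trailing baseline:

def flag_anomalies(total, at_init, window=28, threshold=2.0):
    # total/at_init: per-day counts of all (and "at_init"-reason)
    # "deletion-request" pings, oldest first.
    flagged = []
    for i in range(window, len(total)):
        base_total = sum(total[i - window:i]) / window
        base_prop = sum(at_init[i - window:i]) / max(sum(total[i - window:i]), 1)
        prop = at_init[i] / max(total[i], 1)
        if total[i] > threshold * base_total or prop > threshold * base_prop:
            flagged.append(i)
    return flagged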

What shouldn’t happen is for Firefox Desktop-sent “at_init”-reason “deletion-request” pings to spike like this on January 5:

[Figure: time-series plot of ping volumes from December 2022 until mid-January 2023, showing abnormal, abrupt increases in volume starting on January 5.]

What we do when we notice things like this is file a bug. As the one responsible for Glean’s integration in Firefox Desktop, and as someone with a long history of looking into anomalies, I took a look. At this initial point I was pretty sure it’d be a single actor (a single user, a single company, a single internet cafe) doing something odd… but alas, the evidence was inconclusive:

Evidence consistent with a single actor being responsible for it all:

  • All the pings were coming from the same internet provider. Korea Telecom is responsible for a bare majority of Firefox Desktop data delivery from Korea, but the spikes were entirely from that ISP.
  • The Mozilla Community in Korea could offer no explanation of any wide-spread computer or software event that matched the timeline.
  • “at_init”-reason “deletion-request” pings could be a result of automation changing the files on disk to read “data upload: off” between runs of Firefox Desktop.

Evidence inconsistent with a single actor being responsible for it all:

  • The data came from a mix of Firefox Desktop versions: versions 101.0.1, 104.0, and 108.0.2.
  • The data came from a range of different regions, more or less following the population density of Korea itself.
  • “at_init”-reason “deletion-request” pings could instead be the result of users changing the setting to “data upload: off” early enough during Firefox Desktop startup that Glean hasn’t yet been initialized.

Regardless of why it was happening, it quickly became more important that we learn what we needed to do about it. We spun up an Incident, which is how we organize ourselves when there’s something happening that requires cross-functional collaboration and isn’t getting better on its own. Once there we ascertained that we could respond very quickly and decisively and do

Nothing at all.

The volume of these pings vastly eclipsed any other “deletion-request” pings we would otherwise have received, so you’d be forgiven for thinking that it was terribly expensive to receive, store, and process them all. In reality, we batch these requests. And even before this spike, every batch of requests required editing every partition of every table. Adding another list of identifiers to delete equal in size to two times the peak Firefox Desktop population in Korea just doesn’t matter all that much.

The pressure was off. Even if it got worse… which it did:

[Figure: time-series plot of “deletion-request” pings isolated to just those from Korea. Spikes begin January 25 and dwarf other reports; a plateau begins March 26 and continues to the right edge of the plot around April 10.]

On March 26, when it reached and maintained a peak of five times the volume of the Firefox Desktop population in Korea, it still wasn’t harming our data platform’s ability to serve business needs or costing us all that much in operational spend. We didn’t need to invest effort into running down the source, so we didn’t.

And so I just kept an occasional eye on it until, just as suddenly but not quite as abruptly as it began, on April 12 the ping volumes began to decrease. By April 18, we were back to normal levels.

[Figure: time-series plot of “deletion-request” pings isolated to just those from Korea. Very similar to the previous plot, but continuing until April 18: spikes begin January 25 and dwarf other reports, and a plateau begins March 26 and stays up there until April 12, when it falls away to nothing over the course of five days or so.]

We had successfully ignored it until it went away.

So what happened to Korean Firefox Desktop users from Jan 5 to April 12, 2023? We never figured it out. If you know about something happening across those dates in Korea: please get in touch. As little as it needed solving for the sake of business needs, it still needs solving for the sake of my curiosity. 

:chutten

This Week in Glean: Page Load Data, Three Ways (Or, How Expensive Are Events?)

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we make, among other things, Web Browsers which we tend to call Firefox. The central activity in a Web Browser like Firefox is loading a web page. It gets done a lot by each and every one of our users, and so you can imagine that data about pageloads is of real business interest to us.

But exactly because this is done a lot and by every one of our users, this inspires concerns of scale and cost. How much does it cost us to learn more about pageloads?[0]

As with all things in Data, the answer is the same: “Well, it depends.”

In this case it depends on how you record the data. How you record the data depends on what questions you hope to answer with it. We’re going to stick to the simplest of questions to make this (highly-suspect) comparison even remotely comparable.

Option 1: Just the Counts, Ma’am

I say page loads are done a lot, but how much is “a lot”? If that’s our only question, maybe the data we need is simply a count of pageloads. Glean already has a metric type for counting things, so it should be fairly quick to implement.

This should be cheap, right? Just a single number? Well, it depends.

Scale 1: Frequency

The count of pageloads is just a single number. One, maybe as many as eight, bytes to record, store, transmit, retain, and analyze. But Firefox has to report it more than once, so we need to first scale our cost of “one, maybe as many as eight, bytes” by the number of times we send this information.

When we first implemented Firefox’s pageload count in Glean, I wanted to send it on the builtin “metrics” ping which is sent once a day from anyone running Firefox that day[1]. In an effort to gain more complete and timely data, we ended up adding it to the builtin “baseline” ping which is sent (on average for Firefox Desktop) 8 or more times per day.

For our frequency scale we thus use 8/day.

Scale 2: Population

These 8 recordings per day are sent by about 200M users over a month. Days and months aren’t easy to scale between, as not all users use Firefox every day, and our population gains new users and loses old users at variable rates… so I recalculated the Frequency scale in terms of months and found that we get 68 pings per user per month from these roughly 200M users.

So the cost is pretty easy to calculate then? Whatever the cost is of storing and transmitting 200M x 68/month x eight bytes ~= 109 GB?

Not entirely. But until and unless those other costs stop being comparable between options, we can just treat them as noise. This cost, rendered in the size of the data, of about 109GB? It’ll do.

Option 2: What an Event

Page loads are interesting not just in how many of them there are, but also about what type of load they are and how long the load took. The order of a page load in between other events might also be of interest: did it happen before or after some network trouble? Did a bunch of pageloads happen all at once, or spread across the day? We might wish to instrument page loads as Glean events.

Events are each more expensive than a count. They carry a timestamp (eight bytes) and repeat their names each time they’re recorded (some strings, say fifteen bytes).

(We are not counting the load type or how long the load took in our calculations of the size of an individual sample as we’re still trying to compare methods of answering the same “How many page loads are there?” question.)

Scale 3: Page Loads

“Each time they’re recorded”, huh. Guess that means we get to multiply by the number of page loads. Each Firefox Desktop user, over the course of a month, loads on average 1190 pages[2]. This means instead of sending 68 numbers a month, we’re sending 1190 batches of strings a month.

So the comparable cost is whatever the cost is of storing and transmitting 200M x (eight bytes and fifteen bytes) x 1190 ~= 5.47 TB.

We’ve jumped an order of magnitude here. And we’re not done.

Option 3: Custom Pings, and Custom Pings Only

What if the context we wish to record alongside the event of a page load cannot fit inside Glean’s prudent “event” metric type limits? What if the collected pageload data would benefit from a retention limit or access control list different from other counts or events? What if you want to submit this data to be uploaded as soon as it has been recorded? In that case, we could send a pageload as a Glean custom ping.

We’ve not (yet) done this in Firefox Desktop (at least partially because it complicates ordering amongst other events: the Glean SDK expends a lot of effort to ensure the timestamps between events are reliable. Ping times are client times which are subject to the whims of the user.), so I’m going to get even hand-wavier than before as I try to determine how large each individual data sample will be.

A Glean custom ping without any metrics in it comes to around 500 bytes[3]. When our data platform ingests the ping and turns it into a row in a dataset, we add some metadata which adds another 300 bytes or so (which only affects storage inside the Data Platform and doesn’t add costs to client storage or client bandwidth).

We could go deeper and cost out the network headers, the costs of using TLS to ensure the integrity of the connection… but we’d be here all day. So I’m gonna call that 200 bytes to make it a nice round 1000 bytes per ping.

We’re sending these pings per pageload, so the cost is whatever the cost is of storing and transmitting 200M x 1190 x 1000 bytes = 238TB.
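
Since all three options use the same shape of arithmetic, here’s the whole napkin in one short Python sketch. The constants are the ones from above; the per-sample byte sizes are rough figures, not exact wire formats.

USERS = 200_000_000  # ~200M Firefox Desktop users per month

# Option 1: a counter on the "baseline" ping (~68 pings/user/month).
count_bytes = USERS * 68 * 8
print(f"Counts:       ~{count_bytes / 1e9:.0f} GB/month")   # ~109 GB

# Option 2: an event per page load (~1190 loads/user/month), each
# carrying an 8-byte timestamp and ~15 bytes of name strings.
event_bytes = USERS * 1190 * (8 + 15)
print(f"Events:       ~{event_bytes / 1e12:.2f} TB/month")  # ~5.47 TB

# Option 3: a custom ping per page load, ~1000 bytes apiece.
ping_bytes = USERS * 1190 * 1000
print(f"Custom pings: ~{ping_bytes / 1e12:.0f} TB/month")   # ~238 TB

print(f"Step-ups: {event_bytes / count_bytes:.0f}x, "
      f"{ping_bytes / event_bytes:.0f}x")                   # ~50x, ~43x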

Rule of Thumb: 50x

There you have it: for each step up the cost ladder you’re adding an extra 50x multiplier to the cost of storing and transmitting the data. The reality’s actually much worse if it’s harder to analyze and reason about the data as it gets more complex (which, in most cases, it is) because, as you might remember from one of my previous explorations in costing out metrics: it’s the human costs of things (like analysis) that really getcha.

But you have to balance it out. If adding more context and information ensures your analysis only has to look in one place for its data instead of trying to tie together loosely-coupled concepts from multiple locations… if using a custom ping ensures you have everything you need and don’t have to form a committee to resource an engineer to add implementation which needs to be deployed and individually validated… if you’re willing to bet 50x or 250x the cost on getting it right the first time, then that could be a good price to pay.

But is this the case for you and your data?

Well, it depends.

:chutten

[0]: Avid readers of this blog may notice that this isn’t the first time I’ve written on the costs of data. And it likely won’t be the last!

[1]: How often a “metrics” ping is sent is a little more complicated than “once a day”, but it averages out to about that much so I’m sticking with it for this napkin.

[2]: Yes there are some wild and wacky outliers included in the figure “an average of 1190 page loads” that I’m not bothering to clean up. You can Page Loads Georg to your heart’s content.

[3]: This is about how many characters the JSON-encoded ping payload comes to, uncompressed.

Seven-Year Moziversary

Seven years ago today I began working at Mozilla.

What have I been up to this year? Not blogging, that’s for sure. I’m not sure if I can lay the entire blame of this at the feet of *gestures at everything*, but with the retirement of the This Week in Glean rotation, I’ve gone from infrequently blogging to never blogging.

Which is weird. I like doing it. It can be very fun. It isn’t usually too difficult. Seems like the intersection of all the things that would make it not only something I could do but something I want to do.

And yet. Here we are with barely a post to show for it. Alas.

If blogging is what I’ve not been doing, then what have I been not not doing? More Firefox on Glean stuff. Spent a lot of time and tears trying to get a proper Migration from Legacy Telemetry to Glean moving. Alas, it is not to be. However, we’ve crested over 100 Glean metrics being sent from Firefox Desktop, and the number isn’t going down… so 2022 has been the year of Glean on the Desktop, whether it was a flagship Platform Initiative or not.

In other news, we just got back from Hawaii where there was the first real All Hands in absolutely forever (since January 2020). It was stressful and weird and emotional and quite fun and necessary and I wanna have another one and I don’t want to have to fly so far for it and and and…

Predictions for the next year of Moz Work:

  • There’ll be another All Hands
  • Glean will continue to gain ground in Firefox Desktop
  • “FOG Migration” will not happen

:chutten

So I’ve Finished: Later Alligator

A very quick review for a very quick game: it’s all about the vibe, this one, isn’t it? From the kooky hyperkinetic characters to their individual quirks and styles. The music is a standout wonder (though any track you sit on for a while could probably tone itself down and become more ambient after the first loop). The controls never fail (though on the Switch they’re a tad clunky)… it’s good?

And yet. There is no mystery. I was promised a mystery.

There is endless delight, moment by moment, though: from how each minigame is titled to what absolutely wild pronouncement the next resident of Alligator City will casually just throw out there. But there’s really nothing holding it together.

Something that really exemplifies things is the Family Tree. As you talk to family members and complete their minigames, you get their token to slot into the family tree. But you can’t do this from anywhere: you need to find the Mom and slot them in there. It’s nice to see how everyone relates, and it encourages me to pay even more attention to the hilarious dialogue for hints about how everyone is related… but not only do you have to find Mom, you have to spend 30 minutes of in-game time to arrange these tokens. In-game time feels like the scarcest resource (especially on the first playthrough when you’re accumulating most of the tokens), so it both encourages and discourages collecting and arranging these tokens, which both encourages and discourages me from paying close attention to the dialogue.

I think this game could’ve used a couple of polishing passes. Especially at its twenty dollabux price.

Still worth it, though.

This Week in Glean: What If I Want To Collect All The Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

Mozilla’s approach to data is “as little as necessary to get the job done” as espoused in our Firefox Privacy Promise and put in a shape you can import into your own organization in Mozilla’s Lean Data Practices. If you didn’t already know, you’d find out very quickly by using it that Glean is a Mozilla project. All of its systems are designed with the idea that you’ve carefully considered your instrumentation ahead of time, and you’ve done some review to ensure that the collection aligns with your values.

(This happens to have some serious knock-on benefits for data democratization and tooling that allows Mozilla’s small Data Org to offer some seriously-powerful insights on a shoestring budget, which you can learn more about in a talk I gave to Ubisoft at their Data Summit in 2021.)

Less Data, as the saying goes, implies Greater Data and Greatest Data. Or in a less memetic way, Mozilla wants to collect less data… but less than what?

Less than more, certainly. But how much more? How much is too much?

How much is “all”?

Since my brain’s weird I decided to pursue this thought experiment of “What is the _most_ data you could collect from a software project being used?”.

Well, looking at Firefox, every button press and page load and scroll and click and and and… all of that matters. As does the state of Firefox when it’s being clicked and scrolled and so forth. Typing in the urlbar is different if you already have a page loaded. Opening your first tab is different from opening your nine-thousand-two-hundred-and-fiftieth.

And, underneath it all, is the code. How fast is it running? How much memory are we using? All these performance questions that Firefox Telemetry was originally built to answer. Is code on line 123 of file XYZ.cpp running? Is it running well? What do we run next?

For software this means that, to record all of the data, we’d need to know the full state of the program at every expression it runs in every line of code. At every advancement of the Program Counter, we’d need to dump the entire Stack and Heap.

Yikes! That’s gigabytes of data per clock cycle.

Well, maybe we can be cleverer than this. Another one of those projects Mozilla incubated that now has a whole community of contributors and users (like Rust) is a lightweight record-and-replay debugger called rr. The rr debugger collects traces of a running piece of software and can deterministically replay it over and over again (backwards, even!), meaning it has all the information we need in it.

So a decent size estimate for “all the data” might be the size of one of these trace recordings. They’re big, but not “full heap and stack at every program counter” big. A short test run of Firefox was about 2GB for a one minute run (albeit without any user interaction or graphics).
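
Extrapolating naively from that 2GB-per-minute figure (my arithmetic, not a measurement – and remember that trace was without user interaction or graphics, so treat these as loose lower bounds):

GB_PER_MINUTE = 2  # from the short Firefox test run above

for label, minutes in [("1 hour", 60), ("8-hour day", 8 * 60), ("24 hours", 24 * 60)]:
    print(f"{label}: ~{GB_PER_MINUTE * minutes / 1000:.1f} TB")
# 1 hour: ~0.1 TB; 8-hour day: ~1.0 TB; 24 hours: ~2.9 TB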

Could Glean collect traces like these? Or bigger ones after, say, a full day’s use? Not easily. Not without modification.

Let’s say we did those modifications. Let’s push this thought experiment further. What does that mean for analysis? Well, we’d have all these recordings we could spin up a VM to replay for us. If we want the number of open tabs, we could replay it and sample that count whenever we wanted.

This would be a seismic shift in how instrumentation interacted with analysis. We’d no longer have to ship code to instrument Firefox, we could “simply” (in quotes because using rr requires you to be a programming nerd) replay existing traces and extract the new data we needed.

It would also be absolutely horrible. We’d have to store every possible metric just in case we wanted _one_ of them. And there’s so much data in these traces that Mozilla doesn’t want to store: pictures you looked at, videos you watched, videos you uploaded… good grief. We don’t want any of that.

(( I’d like to take a second to highlight that this is a thought experiment: Mozilla doesn’t do this. We don’t have plans to do this. In fact, Mozilla’s Data Privacy Principles (specifically “Limited Data”) and Mozilla’s Manifesto (specifically Principle 4 “Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.”) pretty clearly state how we think about data like this. ))

And processing these traces into a useful form for analysis to be performed would take the CPU processing power of a small country, over and over again.

(( And rr introduces a 20% performance penalty which really wouldn’t ingratiate us to our users. And it only works on Linux meaning the data we’d have access to wouldn’t be representative of our user base anyway. ))

And what was the point of this again? Right. We’re here to quantify what “less data” means. But how can we do that, even knowing as we do now what the size of “all data” is? Is the string value of the profile directory’s random portion comparable to the url the user visits the most? Are those both 1 piece of data that we can compare to the N pieces of data we get in a full rr trace? Mozilla doesn’t think they’re the same, since we categorize (and thus treat) these collections differently.

All in all, maybe figuring out the maximum amount of data you could collect, in order to contextualize how much less of it you are collecting, isn’t meaningful.

Oh well.

I guess this means that the only way Mozilla (and you!) can continue to quantify “less data” is by comparing it to “no data” – the least possible amount of data.

:chutten

This Week in Glean: How Long Must I Wait Before I Can See My Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

You’ve heard about this cool Firefox on Glean thing and wish to instrument a part of Firefox Desktop. So you add a metrics.yaml definition for a new Glean metric, commit a piece of instrumentation to mozilla-central, and then Lando lands it for you.

When can you expect the data collected when users’ Firefoxes trigger that instrumentation to show up in a queryable place like BigQuery?

The answer is one of the more common phrases I say when it comes to data: Well, it depends.

In the broadest sense, we’re looking at two days:

1) A new Nightly build will be generated within 12h of your commit.

2) Users will pick up the new Nightly build fairly quickly after that, and start triggering the instrumentation.

3) The following 4am, a “metrics” ping containing data from your instrumentation will be submitted (or some time later, if Firefox isn’t running at 4am).

4) A new schema, generated to include your new metric definition, will have been deployed overnight.

5) The following 12am UTC a new partition of our user-facing stable views will have the previous day’s submissions available.
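
For the curious, here’s how that tallies up to roughly two days. Only the 12-hour build cadence is stated above; the other per-step hours are my rough guesses, so take this as a napkin, not an SLA:

steps = [
    ("commit -> next Nightly build", 12),
    ("Nightly uptake, instrumentation triggers", 12),  # guess
    ("next 4am: 'metrics' ping submitted", 12),        # guess; varies by client
    ("overnight schema deploy", 6),                    # guess
    ("next 00:00 UTC: stable-view partition", 12),     # guess
]
for name, hours in steps:
    print(f"{name}: ~{hours}h")
total = sum(hours for _, hours in steps)
print(f"total: ~{total}h (~{total / 24:.0f} days)")    # ~2 days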

And then you commence querying! Easy as that.

Any questions?

The Questions:

What if I added a new metrics.yaml file?

That file needs to land in gecko-dev (the github mirror of mozilla-central) first. Only then can we (and by “we” I here mean the Data Team, by means of a bug you file) update the data pipeline. Then you get to wait until the next weekday’s midnight UTC (ish) for a schema deploy as per Step 4.

Generally this doesn’t add too much delay, but if landing the file happens after the pipeline folks have gone home, we get to wait until the next weekday’s midnight UTC.

The Nightly population is small and weird. How long until we get data from release?

Uptake of code to release takes a while longer. Code lands in mozilla-central, and gets into the next Nightly within 12h. Getting to Beta from Nightly means waiting until the next merge day (these days that’s on the first Monday of the month, or thereabouts). Getting to Release from Beta means waiting until the merge day after that.

If you’re unlucky, you’ll be waiting over two months for your instrumentation to be in a Release build that users can pull down.

And then you get to wait for enough Release users to update that you’re getting a representative sample. (This could take a week or so.)

So… nine weeks?

That sounds really bad! Is there anything we can do?

Why yes.

The first thing we can do is adjust our expectations. There’s a four-week sway from the worst-case to best-case on this slow path. It isn’t likely that you’ll always be landing instrumentation immediately after a merge day and get to wait the whole month until Nightly merges to Beta.

Your average wait for that is only two weeks. And the best case is a matter of a day or two.

So cross your fingers, and hope your luck is with you.

Secondly, instrumentation is (by itself) very low-risk, so you can “uplift” the instrumentation change directly to Beta without waiting for merge day.

This can cut your route to release down to _two weeks_, by (e.g.) landing in Nightly on Monday Nov 22, verifying that it works on Tuesday, requesting uplift on Wednesday, getting uplifted in the last Beta on Thursday Nov 25, then making the merge from Beta to Release on Dec 6.

(You do still get to wait a third week for the release population to update to the latest version.)

Thirdly, what are the chances that your instrumentation is measuring a feature you just built or just turned on? You want that feature to benefit from the slow-roll exposure to the more tolerant audiences of Nightly and Beta before it reaches Release, right? Automated testing is great, but nothing can simulate the wild variety of use cases and hardware combinations your feature will experience in the Real World.

So what’s the point of getting your instrumentation into Release before the feature under instrumentation reaches it? Instead of measuring the interval between landing instrumentation and beginning analysis, perhaps measure the interval between the release of the feature you wish to instrument and beginning analysis?

That interval is only a day: gotta wait for that partition in the stable view. Sounds much better, doesn’t it?

Still, can I get data any faster?

The fastest time from Point A) Landing a metric, to Point B) Performing preliminary analysis on a metric, is about 12h:

1) Land your code just before a new Nightly is cut.

2) Hope that the number of Nightly users that update to the latest build over the next twelve hours is enough for your purposes.

If you didn’t luck out and have a schema deploy, you’ll need to dig your data out of the additional_properties JSON column. If you are lucky, you can use the friendly columns instead.

To get to the data before the nightly copy-deduplicate to stable views, you’ll be querying the live tables instead. You need to fully-qualify that table name. You need to realize that we haven’t deduped anything in here. And you need to take narrow slices, because we can’t cluster the data effectively here, so querying can get expensive, fast.
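
Here’s a hedged sketch of what querying a live table can look like from Python. The dataset and table names follow the pipeline’s usual `<namespace>_live.<ping>_v1` convention, and `my_new_metric` is a stand-in for whatever you landed – double-check both against your own tables before running anything:

# Querying the not-yet-deduplicated live table directly.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  document_id,
  ANY_VALUE(metrics.counter.my_new_metric) AS my_new_metric
FROM `moz-fx-data-shared-prod.firefox_desktop_live.metrics_v1`
WHERE submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR)
GROUP BY document_id  -- live tables aren't deduped; do it yourself
"""
for row in client.query(query).result():
    print(row.document_id, row.my_new_metric)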

Can I get data that quickly from release?

Not yet.

I’ve seen a proposal internally for dynamically-defined metrics which get pushed to running Firefox instances (talk to :esmyth if you’re interested). Though its present form is proposing the process and possibility, not the technology, there’s a version of this I can see that would (for a subset of data collection) take the time from “I wish to instrument this” to “I am performing analysis on data received from (a subset of) the release Firefox population” down to within a business day.

Which is neat! But that speed brings risk, so it’ll take a while to design a system that doesn’t expose our users to that risk.

Don’t expect this for Christmas, is I guess what I mean : )

:chutten

Adventures in Water Softening

I took some time off recently and, as I’m too foolish to allow myself to spend my time on worthwhile rest activities like reading, watching TV, or playing video games, I worked my way through a self-assembled list of “Things I Never Have Enough Time To Deal With”.

One was replacing the Moen cartridge in the upstairs bath since it appears to be the cause of a (excruciatingly-)slow drip. A visit to Lowe’s later, I had a no-charge replacement cartridge in-hand. Lifetime warranties plus customer service: nice.

As with most baths about which I’ve had the misfortune to learn their plumbing, there are no fixture-side shutoff valves so I waited until I had the house to myself and turned off the water to the whole house. (Thank goodness the most recent visit by a plumber included replacing the ancient screw valve with a 90deg valve. So much nicer to work with).

Alas, the Moen cart was the wrong size (I think I need the 1225, the helpful person at Lowe’s presumed it was a 1222b), so no joy there. Thus I turned the water back on. This was about a quarter past four in the afternoon.

Next morning around 9am my wife and I detect an intermittent beeping. Never a good sign.

We check the freezer, fridge, garage, laundry room, dehumidifier… nothing. But it’s coming from the basement.

Good news! The emergency “there’s water on my basement floor” alarm works.

Bad news! There’s water on my basement floor.

A slow (but quickening) leak had developed on my water softener, to do with the parts illustrated below. We have the large assembly which I call the bypass assembly (it connects to the softener via the two horizontal tubes), and two identical adaptors which adapt from the house’s copper (at least in my case) piping to the top two sockets of the assembly. Inlet’s on the right, outlet’s on the left.

[Image: a large plastic assembly, four retaining clips, and two adaptor tubes, all in plastic, for connecting a water softener to a house water system.]

The leak was coming from the outlet socket between it and the adaptor. Oh no, I thought, there’s a crack in this large custom-made piece of plastic. And since the large piece of plastic is the bypass for the softener, the usual path for bypassing the fault for diagnosis and repair is no good. The leak doesn’t care whether the bypass assembly is set to Service or set to Bypass, so the bypass cannot bypass the leak.

Luckily, the previous softener didn’t have a single-valve bypass and so had a three-valve bypass in the inlet, outlet, and bridging copper. Open the bridge, close the outlet, close the inlet, good to go. (I’m not sure if that’s the correct order, but it seemed to work).

[Image: diagram of a 3-valve plumbing bypass system.]

Unfortunately all these valves are screw valves and are decades old, likely not having been used in as long, so they leak when not fully closed or fully open and were stiff as heck to get moving. I’ll need to have those replaced at some point, but then we should probably also look into rearranging the whole utility room, because the plumbing (gas and water and coolant) is a mess. (Ah, the joys of home ownership. The only thing worse is anything else.)

(( I’d usually include a digression here about water softening and why it’s so dang important in my part of the world. I’ll just leave you with this Wikipedia link on water softening for the former, and this map of water hardness in the Region of Waterloo (plus this link to the USGS saying that anything over 180 mg/L (of CaCO3) is “Very Hard”, which translates to anything over 10.57gpg. Note the map starts at 17gpg and goes up from there.) for the latter. Conclusions are left to the reader. ))

Clearly I was going to have to take it apart to see what was going on.

Unfortunately, a fluid-filled closed system like that is subject to certain pressures that made absolutely everything to do with this job a trial. Just getting the pieces apart involved 1) removing the retaining clips (easy), then 2) separating the O-ring-having adaptor tubes from the bypass assembly (difficult). I _think_ I had to overcome the resistance to vacuum in the pipes to force the first one apart, which of course dislodged the second one, and they both dumped their contents exactly adjacent to the bucket I had placed. The water alarm went off again. I put it on a shelf.

My luck seemed to turn, though, as a visual inspection of the assembly and adaptors showed no sign of splits, tears, wear (it’s only been in place for 3 years (installed March 2018)), or other damage. The inlet socket was lousy with rust, but not only was the outlet socket intact, it was clean.

So I put it all back together and reopened the valves: close the bridge, open the outlet, open the inlet. Doing it this way caused some backwash into the softener that I didn’t like, and it introduced a lot of air that would make itself explosively known at every fixture throughout the house (it almost blew the lid off the upstairs toilet. How?), but it all came together.

And then the inlet socket on the bypass valve began to leak.

Le sigh.

Turn it all back off again: outlet closed, bridge open, inlet closed.

This time I was prepared for the pressure differential and the location of the bucket when I pulled the inlet pipe out of the bypass. What I didn’t account for was the outlet pipe’s water backwashing through the bypass and bubbling out of the inlet socket. Note to self: If you leave the bypass on “Service” the softener will resist the flow for you.

Again, no damage or wear on the inlet, but there was still a smear of rust. I cleaned that out and reseated the adaptor.

Checking a hunch, I noticed that the retaining clips were not bilaterally symmetric. They had an up and a down. So I replaced the clip with the up side up, and opened the inlet valve of the three-valve bypass.

Turns out you can create a pressure bomb if you allow mains pressure to push an air bubble against a closed valve all of a sudden. The outlet adaptor popped out of the outlet socket with a bang. Everything got wet (including the erstwhile plumber penning these words). It was only luck that I hadn’t seated the retaining clips sufficiently and so the pieces only came apart and didn’t actually break.

It was exciting in exactly the wrong sorts of way.

But it gave me an inkling that maybe being indelicate about closing and opening the mains shutoff for the Moen cartridge replacement resulted in some water hammer that spread the softener’s outlet adaptor apart from the socket, allowing a slow leak to begin. It doesn’t really make sense, since there’s the softener in the way which would dampen such effects, but I’m at a loss for understanding the leak at all, let alone why it happened then.

Anyway, there’s full supply pressure pouring on your floor, you can think later. Switch the three-valve bypass back to bypass, reseat the pieces, ensure the retaining clips are the right way up, dry everything off so you can see leaks if they happen. Good? Good. Let’s try again. Close the bridge slowly, open the outlet slowly, open the inlet slowly, and run a downstream faucet to try and release the captured air.

And the leak mysteriously disappeared without anything having been repaired or replaced, just disassembled (one time forcibly) and reassembled.

Still had pops and booms from every fixture and faucet in the house as they were used the first time after the “fix”, but otherwise everything is (so far) okay. I put the emergency “there’s water on my basement floor” alarm back on the floor.

This serves as record of what happened and what I did. May it help you and future me should anything like this happen again.

This Week in Glean: The Three Roles of Data Engagements

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

I’ve just recently started my sixth year working at Mozilla on data and data-adjacent things. In those years I’ve started to notice some patterns in how data is approached, so I thought I’d set them down in a TWiG because Glean’s got a role to play in them.

Data Engagements

A Data Engagement is when there’s a question that needs to engage with data to be answered. Something like “How many bookmarks are used by Firefox users?”.

(No one calls these Data Engagements but me, and I only do because I need to call them _something_.)

I’ve noticed three roles in Data Engagements at Mozilla:

  1. Data Consumer: The Question-Asker. The Temperature-Taker. This is the one who knows what questions are important, and is frustrated without an answer until and unless data can be collected and analysed to provide it. “We need to know how many bookmarks are used to see if we should invest more in bookmark R&D.”
  2. Data Analyst: The Answer-Maker. The Stats-Cruncher. This is the one who can use Data to answer a Consumer’s Question. “Bookmarks are used by Canadians more than Mexicans most of the time, but only amongst profiles that have at least one bookmark.”
  3. Data Instrumentor: The Data-Digger. The Code-Implementor. This one can sift through product code and find the correct place to collect the right piece of data. “The Places database holds many things, we’ll need to filter for just bookmarks to count them.”

(diagrams courtesy of :brizental)

It’s through these three working in concert — The Consumer having a question that the Instrumentor instruments to generate data the Analyst can analyse to return an answer back to the Consumer — that a Data Engagement succeeds.

At Mozilla, Data Engagements succeed very frequently in certain circumstances. The Graphics team answers many deeply-technical questions about Firefox running in the wild to determine how well WebRender is working. The Telemetry team examines the health of the data collection system as a whole. Mike Conley’s old Tab Switcher Dashboard helped find and solve performance regressions in (unsurprisingly) Tab Switching. These go well, and there’s a common thread here that I think is the secret of why: 

In these and the other high-success-rate Data Engagements, all three roles (Consumer, Analyst, and Instrumentor) are embodied by the same person.

It’s a common problem in the industry. It’s hard to build anything at all, but it’s least hard to build something for yourself. When you are, yourself, the Question-Asker, Answer-Maker, and Data-Digger, you don’t often mistakenly dig the wrong data and create an answer to a question other than the one you had in mind. And when you accidentally do make a mistake (because, remember, this is hard), you can go back in and change the instrumentation, update the analysis, or reword the question.

But when these three roles are in different parts of the org, or different parts of the planet, things get harder. Each role is now trying to speak the others’ languages and infer enough context to do their jobs independently.

In comes the Data Org at Mozilla, which has had great successes to date on the theme of “Making it easier for anyone to be their own Analyst”. Data Democratization. When you’re your own Analyst, there are fewer situations where the roles are disparate: Instrumentors who are their own Analysts know when data won’t be the right shape to answer their own questions, and Consumers who are their own Analysts know when their questions aren’t well-formed.

Unfortunately we haven’t had as much success in making the other roles more accessible. Everyone can theoretically be their own Consumer: curiosity in a data-rich environment is as common as lanyards at an industry conference[1]. Asking _good_ questions is hard, though. Possible, but hard. You could just about imagine someone in a mature data organization becoming able to tell the difference between questions that are important and questions that are just interesting through self-serve tooling and documentation.

As for being your own Instrumentor… that is something that only a small fraction of folks have the patience to do. I (and Mozilla’s Community Managers) welcome you to try: it is possible to download and build Firefox yourself. It’s possible to find out which part of the codebase controls which pieces of UI. It’s… well, it’s more than possible, it’s actually quite pleasant to add instrumentation using Glean… but on the whole, if you are someone who _can_ Instrument Firefox Desktop you probably already have a copy of the source code on your hard drive. If you check right now and it’s not there, then there’s precious little likelihood that will change.

(Unless you come and work for Mozilla, that is.)

So let’s assume for now that democratizing instrumentation is impossible. Why does it matter? Why should it matter that the Consumer is a separate person from the Instrumentor?

Communication

Each role communicates with each other role with a different language:

  • Consumers talk to Instrumentors and Analysts in units of Questions and Answers. “How many bookmarks are there? We need to know whether people are using bookmarks.”
  • Analysts speak Data, Metadata, and Stats. “The median number of bookmarks is, according to a representative sample of Firefox profiles, twelve (confidence interval 99.5%).”
  • Instrumentors speak Data and Code. “There’s a few ways we delete bookmarks, we should cover them all to make sure the count’s correct when the next ping’s sent”

Some more of the Data Org and Mozilla’s greatest successes involve supplying context at the points in a Data Engagement where they’re most needed. We’ve gotten exceedingly good at loading context about data (metadata) to facilitate communication between Instrumentors and Analysts with tools like Glean Dictionary.

Ah, but once again the weak link appears to be the communication of Questions and Answers between Consumers and Instrumentors. Taking the above example, does the number of bookmarks include folders?

The Consumer knows, but the further away they sit from the Instrumentor, the less likely that the data coming from the product and fueling the analysis will be the “correct” one.

(Either including or excluding folders would be “correct” for different cases. Which one do you think was “more correct”?)

So how do we improve this?

Glean

Well, actually, Glean doesn’t have a solution for this. I don’t actually know what the solutions are. I have some ideas. Maybe we should share more context between Consumers and Instrumentors somehow. Maybe we should formalize the act of question-asking. Maybe we should build into the Glean SDK a high-enough level of metric abstraction that instead of asking questions, Consumers learn to speak a language of metrics.

The one thing I do know is that Glean is absolutely necessary to making any of these solutions possible. Without Glean, we have too many systems that are fractally complex for any context to be relevantly shared. How can we talk about sharing context about bookmark counts when we aren’t even counting things consistently[2]?

Glean brings that consistency. And from there we get to start solving these problems.

Expect me to come back to this realm of Engagements and the Three Roles in future posts. I’ve been thinking about:

  • how tooling affects the languages the roles speak amongst themselves and between each other,
  • how the roles are distributed on the org chart,
  • which teams support each role,
  • how Data Stewardship makes communication easier by adding context and formality,
  • how Telemetry and Glean handle the same situations in different ways, and
  • what roles Users play in all this. No model about data is complete without considering where the data comes from.

I’m not sure how many I’ll actually get to, but at least I have ideas.

:chutten

[1] Other rejected similes include “as common as”: maple syrup on Canadian breakfast tables, frustration in traffic, sense isn’t.

[2] Counting is harder than it looks.

Six-Year Moziversary

I’ve been working at Mozilla for six years today. Wow.

Okay, so what’s happened… I’ve been promoted to Staff Software Engineer. Georg and I’d been working on that before he left, and then, well *gestures at everything*. This means it doesn’t really _feel_ that different to be a Staff instead of a Senior, since I’ve been operating at that level for over a year now, but it’s nice that the title caught up. Next stop: well, actually, I think Staff’s a good place for now.

Firefox On Glean did indeed take my entire 2020 at work, and did complete on time and on budget. Glean is now available to be used in Firefox Desktop.

My efforts towards getting folks to actually _use_ Glean instead of Firefox Telemetry in Firefox Desktop have been mixed. The Background Update Task work went exceedingly well… but when there’s 2k pieces of instrumentation, you need project management and I’m trying my best. Now to “just” get buy-in from the powers that be.

I delivered a talk to Ubisoft (yeah, the video game folks) earlier this year. That was a blast and I’m low-key looking for another opportunity like it. If you know anyone who’d like me to talk their ears off about Data and Responsibility, do let me know.

Blogging’s still low-frequency. I rely on the This Week in Glean rotation to give me the kick to actually write long-form ideas down from time-to-time… but it’s infrequent. Look forward to an upcoming blog post about the Three Roles in Data Engagements.

Predictions for the future time:

  • There will be at least one Work Week planned if not executed by this time next year. Vaccines work.
  • Firefox Desktop will have at least started migrating its instrumentation to Glean.
  • I will still be spending a good chunk of my time coding, though I expect this trend of spending ever more time writing proposals and helping folks on chat will continue.

And that’s it for me for now.

:chutten

This Week in Glean: Data Reviews are Important, Glean Parser makes them Easy

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we put a lot of stock in Openness. Source? Open. Bug tracker? Open. Discussion Forums (Fora?)? Open (synchronous and asynchronous).

We also have an open process for determining if a new or expanded data collection in a Mozilla project is in line with our Privacy Principles and Policies: Data Review.

Basically, when a new piece of instrumentation is put up for code review (or before, or after), the instrumentor fills out a form and asks a volunteer Data Steward to review it. If the instrumentation (as explained in the filled-in form) is obviously in line with our privacy commitments to our users, the Data Steward gives it the go-ahead to ship.

(If it isn’t _obviously_ okay then we kick it up to our Trust Team to make the decision. They sit next to Legal, in case you need to find them.)

The Data Review Process and its forms are very generic. They’re designed to work for any instrumentation (tab count, bytes transferred, theme colour) being added to any project (Firefox Desktop, mozilla.org, Focus) and being collected by any data collection system (Firefox Telemetry, Crash Reporter, Glean). This is great for the process as it means we can use it and rely on it anywhere.

It isn’t so great for users _of_ the process. If you only ever write Data Reviews for one system, you’ll find yourself answering the same questions with the same answers every time.

And Glean makes this worse (better?) by including in its metrics definitions almost every piece of information you need in order to answer the review. So now you get to write the answers first in YAML and then in English during Data Review.

But no more! Introducing glean_parser data-review and mach data-review: command-line tools that will generate for you a Data Review Request skeleton with all the easy parts filled in. It works like this:

  1. Write your instrumentation, providing full information in the metrics definition.
  2. Call python -m glean_parser data-review <bug_number> <list of metrics.yaml files> (or mach data-review <bug_number> if you’re adding the instrumentation to Firefox Desktop).
  3. glean_parser will parse the metrics definitions files, pull out only the definitions that were added or changed in <bug_number>, and then output a partially-filled-out form for you.

Here’s an example. Say I’m working on bug 1664461 and add a new piece of instrumentation to Firefox Desktop:

fog.ipc:
  replay_failures:
    type: counter
    description: |
      The number of times the ipc buffer failed to be replayed in the
      parent process.
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_sensitivity:
      - technical
    notification_emails:
      - chutten@mozilla.com
      - glean-team@mozilla.com
    expires: never

I’m sure to fill in the `bugs` field correctly (because that’s important on its own _and_ it’s what glean_parser data-review uses to find which data I added), and have categorized the data_sensitivity. I also included a helpful description. (The data_reviews field currently points at the bug I’ll attach the Data Review Request for. I’d better remember to come back before I land this code and update it to point at the specific comment…)

Then I can simply use mach data-review 1664461 and it spits out:

!! Reminder: it is your responsibility to complete and check the correctness of
!! this automatically-generated request skeleton before requesting Data
!! Collection Review. See https://wiki.mozilla.org/Data_Collection for details.

DATA REVIEW REQUEST
1. What questions will you answer with this data?

TODO: Fill this in.

2. Why does Mozilla need to answer these questions? Are there benefits for users?
   Do we need this information to address product or business requirements?

TODO: Fill this in.

3. What alternative methods did you consider to answer these questions?
   Why were they not sufficient?

TODO: Fill this in.

4. Can current instrumentation answer these questions?

TODO: Fill this in.

5. List all proposed measurements and indicate the category of data collection for each
   measurement, using the Firefox data collection categories found on the Mozilla wiki.

Measurement Name | Measurement Description | Data Collection Category | Tracking Bug
---------------- | ----------------------- | ------------------------ | ------------
fog_ipc.replay_failures | The number of times the ipc buffer failed to be replayed in the parent process.  | technical | https://bugzilla.mozilla.org/show_bug.cgi?id=1664461


6. Please provide a link to the documentation for this data collection which
   describes the ultimate data set in a public, complete, and accurate way.

This collection is Glean so is documented
[in the Glean Dictionary](https://dictionary.telemetry.mozilla.org).

7. How long will this data be collected?

This collection will be collected permanently.
**TODO: identify at least one individual here** will be responsible for the permanent collections.

8. What populations will you measure?

All channels, countries, and locales. No filters.

9. If this data collection is default on, what is the opt-out mechanism for users?

These collections are Glean. The opt-out can be found in the product's preferences.

10. Please provide a general description of how you will analyze this data.

TODO: Fill this in.

11. Where do you intend to share the results of your analysis?

TODO: Fill this in.

12. Is there a third-party tool (i.e. not Telemetry) that you
    are proposing to use for this data collection?

No.

As you can see, this Data Review Request skeleton comes partially filled out. Everything you previously had to mechanically fill out has been done for you, leaving you more time to focus on only the interesting questions like “Why do we need this?” and “How are you going to use it?”.

Also, this saves you from having to remember the URL to the Data Review Request Form Template each time you need it. We’ve got you covered.

And since this is part of Glean, this means this is already available to every project you can see here. This isn’t just a Firefox Desktop thing. 

Hope this saves you some time! If you can think of other time-saving improvements we could add once to Glean so that every Mozilla project can take advantage of them, please tell us on Matrix.

If you’re interested in how this is implemented, glean_parser’s part of this is over here, while the mach command part is here.

:chutten