Two Days, or How Long Until The Data Is In

Two days.

It doesn’t seem like long, but that is how long you need to wait before looking at a day’s Firefox data and being sure that 95% of it has been received.

There are some caveats, of course. This only applies to current versions of Firefox (55 and later). This will very occasionally be wrong (like, say, immediately after Labour Day when people finally get around to waking up their computers that have been sleeping for quite some time). And if you have a special case (like trying to count nearly everything instead of just 95% of it) you might want to wait a bit longer.

But for most cases: Two Days.

As part of my 2017 Q3 Deliverables I looked into how long it takes clients to send their anonymous usage statistics to us using Telemetry. This was a culmination of earlier ponderings on client delay, previous work in establishing Telemetry client health, and an eighteen-month (or more!) push to actually look at our data from a data perspective (meta-data).

This led to a meeting in San Francisco where :mreid, :kparlante, :frank, :gfritzsche, and I settled upon a list of metrics that we ought to measure to determine how healthy our Telemetry system is.

Number one on that list: latency.

It turns out there’s a delay between a user doing something (opening a tab, for instance) and their Firefox sending that information to us. This is client delay, and it is broken into two smaller pieces: recording delay (how long from when the user does something until we’ve put it in a ping for transport), and submission delay (how long it takes that ready-for-transport ping to get to Mozilla).
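To make those two pieces concrete, here is a minimal sketch (not pipeline code; the field names and timestamps are made up for illustration) of how you could compute them for a single ping, given the time of the last recorded event, the time the ping was assembled, and the time our servers received it:

```python
from datetime import datetime, timezone

# Hypothetical timestamps for one "main" ping, all in UTC.
event_time = datetime(2017, 9, 5, 23, 40, tzinfo=timezone.utc)       # the user opened a tab
creation_time = datetime(2017, 9, 6, 0, 0, tzinfo=timezone.utc)      # the ping was assembled
submission_time = datetime(2017, 9, 6, 14, 25, tzinfo=timezone.utc)  # our servers received it

recording_delay = creation_time - event_time        # event happened -> packaged into a ping
submission_delay = submission_time - creation_time  # ping ready -> received by Mozilla
client_delay = recording_delay + submission_delay   # total delay contributed by the client

print(recording_delay, submission_delay, client_delay)
```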

If you want to know how many tabs were opened on Tuesday, September the 5th, 2017, you can’t tell on the day itself. All the tabs people open late at night won’t even be in pings yet, and anyone who puts their computer to sleep won’t send their pings until they wake it on the morning of the 6th.

This is where “Two Days” comes in: On Thursday the 7th you can be reasonably sure that we have received 95% of all pings containing data from the 5th. In fact, by the 7th, you should even have that data in some scheduled datasets like main_summary.

How do we know this? We measured it:

[Figure: Client “main” Ping Delay for Latest Version]

(Remember what I said about Labour Day? That’s the exceptional case on beta 56.)

Most data, most days, comes in within a single day. Add a day to get it into your favourite dataset, and there you have it: Two Days.

Why is this such a big deal? Currently the only information circulating in Mozilla about how long you need to wait for data is received wisdom from a pre-Firefox-55 (pre-pingsender) world. Some teams wait up to ten full days (!!) before trusting that the data they see is complete enough to make decisions about.

This slows Mozilla down. If we are making decisions based on data, that data needs to arrive quickly and reliably.

It just so happens that, since Firefox 55, it has been.

Now comes the hard part: communicating that it has changed and changing those long-held rules of thumb and idées fixes to adhere to our new, speedy reality.

Which brings us to this blog post. Consider this your notice that we have looked into the latency of Telemetry data, and it looks pretty darn quick these days. If you want to know what happened on a particular day, you don’t need to wait ten days any more.

Just Two Days. Then you can have your answers.

:chutten

(Much thanks to :gsvelto and :Dexter’s work on pingsender and using it for shutdown pings, :Dexter’s analyses on ping delay that first showed these amazing improvements, and everyone in the data teams for keeping the data flowing while I poked at SQL and rearranged words in documents.)

 


Data Science is Hard: Client Delays

Delays suck, but unmeasured delays suck more. So let’s measure them.

I’ve previously talked about delays as they relate to crash pings. This time we’re looking at the core of Firefox Telemetry data collection: the “main” ping. We’ll be looking at a 10% sample of all “main” pings submitted on Tuesday, January 10th[1].

In my previous post on delays, I defined five types of delay: recording, submission, aggregation, migration, and query scheduling. This post is about delays on the client side of the equation, so we’ll be focusing on the first two: recording, and submission.

Recording Delay

How long does it take from something happening, to having a record of it happening? We count HTTP response codes (as one does), so how much time passes from that first HTTP response to the time when that response’s code is packaged into a ping to be sent to our servers?

[Figure: Recording Delay CDF by channel]

This is a Cumulative Distribution Function, or CDF. The ones in this post show you what proportion (0% – 100%) of the “main” pings we’re looking at arrived with data that falls within a certain timeframe (0 – 96 hours). So in this case, look at the red, “aurora”-branch line. It crosses the 0.9 line on the y-axis at about 8 on the x-axis. This means 90% of the pings had a recording delay of 8 hours or less.
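If you’d like to build a curve like this yourself, here is a minimal sketch. The delay values below are randomly generated stand-ins; in the real analysis they come from the pings themselves:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in recording delays in hours, keyed by channel (real values come from pings).
delays_by_channel = {
    "release": np.random.exponential(scale=1.5, size=10_000),
    "beta":    np.random.exponential(scale=2.5, size=10_000),
    "aurora":  np.random.exponential(scale=4.0, size=10_000),
    "nightly": np.random.exponential(scale=2.0, size=10_000),
}

for channel, delays in delays_by_channel.items():
    xs = np.sort(delays)                      # delay, in hours, along the x-axis
    ys = np.arange(1, len(xs) + 1) / len(xs)  # cumulative proportion of pings on the y-axis
    plt.plot(xs, ys, label=channel)

plt.xlim(0, 96)
plt.xlabel("Delay (hours)")
plt.ylabel("Proportion of pings")
plt.legend()
plt.show()
```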

Which is fantastic news, especially since every other channel (release and beta vying for fastest speeds) gets more of its pings in even faster. 90% of release pings have a recording delay of at most 4 hours.

And notice that shelf at 24 hours, where every curve basically jumps to 100%? If users leave their browsers open for longer than a day, we cut a fresh ping at midnight. Glad to see evidence that it’s working.

All in all, we can expect recording delays of under 30 minutes for most pings across all channels. This is not a huge source of delay.

Submission Delay

With all that data finally part of a “main” ping, how long before the servers are told? For now, Telemetry has to wait for the user to restart their Firefox before it is able to send its pings. How long can that take?

[Figure: Submission Delay CDF by channel]

Ouch.

Now we see that aurora is no longer the slowest; its submission delays are very similar to release’s. The laggard is now beta… and I really can’t figure out why. If Beta users are leaving their browsers open longer, we’d expect to see them on the slower side of the Recording Delay CDF plot. If Beta users are leaving their browsers closed longer, we’d expect them to show up lower on Engagement Ratio plots (which they don’t).

A mystery.

Not a mystery is that nightly has the fastest submission times. It receives updates every day so users have an incentive to restart their browsers often.

Comparing Submission Delay to Recording Delay, you can see why this is where we’re focusing most of our “Get More Data, Faster” attention. If we wait for 90% of “main” pings to arrive, then we have to wait at least 17 hours for nightly data, 28 hours for release and aurora… and over 96 hours for beta.
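Those “wait this long for 90%” figures are just per-channel 0.9 quantiles of the delay distribution. Here is a sketch of how you might pull them out, assuming a pandas DataFrame with hypothetical channel and submission_delay_hours columns (the numbers are invented):

```python
import pandas as pd

# Hypothetical sample of pings: channel plus submission delay in hours.
pings = pd.DataFrame({
    "channel": ["nightly", "nightly", "release", "release", "beta", "beta"],
    "submission_delay_hours": [3.0, 20.0, 5.0, 30.0, 12.0, 110.0],
})

# How many hours you'd wait for 90% of each channel's pings to arrive.
wait_for_90_percent = (
    pings.groupby("channel")["submission_delay_hours"]
         .quantile(0.9)
         .sort_values()
)
print(wait_for_90_percent)
```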

And that’s just Submission Delay. What if we measured the full client -> server delay for data?

Combined Client Delay

[Figure: Combined Client Delay CDF by channel]

With things the way they were on 2017-01-10, to get 90% of “main” pings we need to wait a minimum of 22 hours (nightly) and a maximum of… you know what, I don’t even know. I can’t tell where beta might cross the 0.9 line, but it certainly isn’t within 96 hours.

If we limit ourselves to 80% we’re back to a much more palatable 11 hours (nightly) to 27 hours (beta). But that’s still pretty horrendous.

I’m afraid things are actually even worse than I’m making them out to be. We rely on getting counts out of “main” pings. To count something, you need to count every single individual something. This means we need 100% of these pings, or as near as we can get. Even nightly pings take longer than 96 hours to get us more than 95% of the way there.

What do we use “main” pings to count? Amongst other things, “usage hours”: how long Firefox has been open. This is essential for normalizing crash information properly so we can determine the health and stability of a release.
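As a toy example of why complete counts matter (the numbers here are invented, not real Firefox figures): crash rates are reported per thousand usage hours, so missing “main” pings shrink the denominator and make a release look crashier than it is.

```python
# Invented numbers, purely illustrative.
crashes = 1_200        # crash pings counted for the day
usage_hours = 850_000  # usage hours summed from "main" pings

rate = crashes / (usage_hours / 1_000)
print(f"{rate:.2f} crashes per 1,000 usage hours")

# If only 90% of "main" pings have arrived, the same crashes look about 11% worse:
early_rate = crashes / (usage_hours * 0.9 / 1_000)
print(f"{early_rate:.2f} crashes per 1,000 usage hours (with 10% of usage missing)")
```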

As you can imagine, we’re interested in knowing this as fast as possible. And as things stood a couple of Tuesdays ago, we have a lot of room for improvement.

For now, expect more analyses like this one (and more blog posts like this one) examining how slowly or quickly we can possibly get our data from the users who generate it to the Mozillians who use it to improve Firefox.

:chutten

[1]: Why did I look at pings from 2017-01-10? It was a recent Tuesday (less weekend effect) well after Gregorian New Year’s Day, well before Chinese New Year’s Day, and even a decent distance from Epiphany. Also, 01-10 is a mirror, which I thought was neat.