What’s the First Firefox Crash a User Sees?

Growth is going to be a big deal across Mozilla in 2017. We spent 2016 solidifying our foundations, and now we’re going to use that to spring into action and grow our influence and user base.

So this got me thinking about new users. We’re constantly getting new users: people who, for one reason or another, choose to install and run Firefox for the first time today. They run it and… well, then what?

Maybe they like it. They open a new tab. Then they open a staggeringly unbelievable number of tabs. They find and install an addon. Or two.

Fresh downloads and installs of Firefox continue at an excellent pace. New people, every day, are choosing Firefox.

So with the number of new users we already see, the key to Growth may not lie in attracting more of them… it might be that we need to keep the ones we already see.

So what might stop a user from using Firefox? Maybe after they open the seventy-first tab, Firefox crashes. It just disappears on them. They open it again, browse for a little while… but can’t forget that the browser, at any time, could just decide to disappear and let them down. So they migrate back to something else, and we lose them.

It is with these ideas in my head that I wondered “Are there particular types of crashes that happen to new users? Are they more likely to crash because of a plugin, their GPU misbehaving, running out of RAM… What is their first crash, and how might it compare to the broader ecosystem of crashes we see and fix every day?”

With the new data available to me thanks to Gabriele Svelto’s work on client-side stack traces, I figured I could maybe try to answer them.

My full analysis is here, but let me summarize: sadly there’s too much noise in the process to make a full diagnosis. There are some strange JSON errors I haven’t tracked down… and even if I do, there are too many “mystery” stack frames that we just don’t have a mechanism to figure out yet.

And this isn’t even covering how we need some kind of service or algorithm to detect signatures in these stacks or at least cluster them in some meaningful way.
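(If you’re wondering what even a crude version of that might look like: here’s a minimal sketch of one naive approach, in Python, that groups crash stacks by their topmost “interesting” frames. This is nothing we actually run; the frame names and the list of uninteresting frames are made up for illustration.)

```python
from collections import Counter, defaultdict

# Frames too generic to identify a crash on their own (a made-up list).
UNINTERESTING = {"RtlUserThreadStart", "BaseThreadInitThunk", "__libc_start_main"}

def naive_signature(stack, depth=3):
    """Build a crude 'signature' from the top few interesting frames."""
    interesting = [frame for frame in stack if frame not in UNINTERESTING]
    return " | ".join(interesting[:depth]) or "(empty stack)"

def cluster_crashes(crash_stacks):
    """Group crash stacks by naive signature and count each cluster."""
    clusters = defaultdict(list)
    for stack in crash_stacks:
        clusters[naive_signature(stack)].append(stack)
    return Counter({sig: len(stacks) for sig, stacks in clusters.items()})

# Two crashes sharing their top frames land in the same cluster.
crashes = [
    ["mozilla::dom::DoThing", "js::RunScript", "BaseThreadInitThunk"],
    ["mozilla::dom::DoThing", "js::RunScript", "RtlUserThreadStart"],
    ["gfxContext::Paint", "nsDisplayList::PaintRoot"],
]
print(cluster_crashes(crashes).most_common())
# [('mozilla::dom::DoThing | js::RunScript', 2), ('gfxContext::Paint | nsDisplayList::PaintRoot', 1)]
```

A real solution would need proper symbolication and much smarter signature generation, but even a crude grouping like this shows the shape of the problem.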

Work is ongoing, so I hope to have a more definite answer in the future. But for now, all I can do is invite you to tell me what you think causes users to stop using Firefox. You can find me on Twitter, or through my @mozilla.com email address.

:chutten

Canadian Holiday Inbound! (Sunday December 25th, Observed on Monday the 26th)

Hello team!

A Canadian Holiday is once again at our chimneys as Christmas Day is approaching! The holiday itself is on the 25th, but because that falls on a weekend, it’ll be on the 26th that you’ll not be finding us in the office. Please be understanding if your meetings are a little underpopulated, and take the opportunity to run all the try builds you can think of.

Now, I know what you’re thinking. (Well, I don’t, but since I’ve turned off comments you can’t correct me.) You are thinking that Christmas isn’t a Canadian holiday… or at least not uniquely Canadian.

And you wouldn’t be too wrong. Other parts of the world certainly celebrate Christmas. Japan does it up amazingly, for instance, if you’re ever in that corner of the globe in December. And in still other parts of the world, people celebrate all sorts of Winter festivities.

And it’s not as though we’re spending the day dipping our All-Dressed Chips[1] into Stanley Cups[2] of Maple Syrup[3] while taking our Coffee Crisp[4] and wearing our Toques[5], Jeans, Jean Shirts[6], and A Boot[7], eh?

No, but the Canadian way of celebrating Christmas _is_ unique… in that it’s usually done through celebrating everyone else’s Winter celebrations. Canadians are more than happy to adopt and support any culture or festival that involves food, fun, friends, and family.

My family tends to observe Polish Wigilia by eating pierogies, white fish, and bigos. My wife’s family has a Christmas Eve Feast of crab dip, fourteen types of frozen hors d’oeuvres, cheese, crackers, and smoked oysters eaten on TV trays in front of Log: The Christmas Special. Earlier this month we ate latkes with sour cream and applesauce with pfeffernusse for afters at the Christkindle Markt. Last year we went to Sir John A. MacDonald’s birthplace at Woodside for soft gingerbread and roasted chestnuts.

Then there’s turkey with the trimmings for the more traditional, sushi for the deliberately anti-traditional, and everything in-between.

So no matter if or how you celebrate Canadian Christmas, know that we are (and are not) celebrating it too, with you, in the Great White North.

Because anything else just wouldn’t be polite.

( :bwinton reminds me to tell you that we will also be off on the 27th for Boxing Day. Our most famous pugilists will be hard at work discouraging (in effigy) the normally-docile moose herds from invading the United States once again. So we’ll be busy cheering them on, sorry. )

:chutten

[1] Tastes like… actually, I’m not really sure. Tasty, though.
[2] Named after Lord Stanley
[3] Probably harvested back in March in Quebec
[4] mocha-flavoured Nestle chocolate bar
[5] knit caps, often with pom-poms on top
[6] AKA “The Canadian Tuxedo”
[7] “about” pronounced in Canadianese is actually closer to “aboat” than “aboot”, eh

Privileged to be a Mozillian

Mike Conley noticed a bug. There was a regression on a particular Firefox Nightly build he was tracking down. It looked like this:

[Image: a time series plot with a noticeable regression at November 6]

A pretty big difference… only there was a slight problem: there were no relevant changes between the two builds. Being the kind of developer he is, :mconley looked elsewhere and found a probe that was only included in builds starting November 16.

The plot showed him data starting from November 15.

He brought it up on irc.mozilla.org#telemetry. Roberto Vitillo was around and tried to reproduce, without success. For :mconley the regression was on November 5 and the data on the other probe started November 15. For :rvitillo the regression was on November 6 and the data started November 16. After ruling out addons, they assumed it was the dashboard’s fault and roped me into the discussion. This is what I had to say:

Hey, guess what's different between rvitillo and mconley? About 5 hours.

You see, :mconley is in the Toronto (as in Canada) Mozilla office, and Roberto is in the London (as in England) Mozilla office. There was a bug in how dates were being calculated that made it so the data displayed differently depending on your timezone. If you were on or East of the Prime Meridian you got the right answer. West? Everything looks like it happens one day early.
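The dashboard itself is JavaScript, but here’s a minimal Python sketch of the same class of bug, assuming each day’s data is keyed to UTC midnight and then rendered in the viewer’s local timezone (the offsets are illustrative):

```python
from datetime import datetime, timedelta, timezone

# A day of data keyed to UTC midnight (a simplified stand-in for how the
# aggregates label their days).
build_day_utc = datetime(2016, 11, 6, tzinfo=timezone.utc)

# Render that same instant in each viewer's local timezone
# (fixed offsets for illustration; November means no DST to worry about).
toronto = timezone(timedelta(hours=-5))  # :mconley, west of the Prime Meridian
london = timezone(timedelta(hours=0))    # :rvitillo, on the Prime Meridian

print(build_day_utc.astimezone(toronto).date())  # 2016-11-05 -- a day early!
print(build_day_utc.astimezone(london).date())   # 2016-11-06 -- correct
```

Same instant, different calendar date, depending on where you’re sitting.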

I hammered out a quick fix, which means the dashboard is now correct… but in thinking back over this bug in a post-mortem-kind-of-way, I realized how beneficial working in a distributed team is.

Having team members in multiple timezones not only provided us with a quick test location for diagnosing and repairing the issue, it equipped us with the mindset to think of timezones as a problematic element in the first place. Working in a distributed fashion has conferred upon us a unique and advantageous set of tools, experiences, thought processes, and mechanisms that allow us to ship amazing software to hundreds of millions of users. You don’t get that from just any cube farm.

#justmozillathings

:chutten

Data Science is Hard – Case Study: How Do We Normalize Firefox Crashes?

When we use Firefox Crashes to determine the quality of a Firefox release, we don’t just use a count of the number of crashes:

[Image: raw crash counts for Aurora 51a2]

We instead look at crashes divided by the number of thousands of hours Firefox was running normally:

[Image: Aurora crashes per thousand usage hours]

I’ve explained this before as a way to account for how crash volumes can change depending on Firefox usage on that particular day, even though Firefox quality likely hasn’t changed.
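If you like seeing the arithmetic spelled out, here’s a tiny sketch with made-up numbers showing how the raw count can mislead while the normalized rate doesn’t:

```python
# Made-up numbers: raw crash counts differ a lot between a quiet Saturday and a
# busy Tuesday, but the normalized rate comes out the same.
days = [
    {"day": "Saturday", "crashes": 700,  "usage_hours": 350_000},
    {"day": "Tuesday",  "crashes": 1000, "usage_hours": 500_000},
]

for d in days:
    rate = d["crashes"] / (d["usage_hours"] / 1000)  # crashes per 1000 usage hours
    print(f"{d['day']}: {d['crashes']} crashes, {rate:.2f} per 1000 usage hours")
# Saturday: 700 crashes, 2.00 per 1000 usage hours
# Tuesday: 1000 crashes, 2.00 per 1000 usage hours
```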

But “thousands of usage hours” is one of many possible normalization denominators we could have chosen. To explain our choice, we’ll need to explore our options.

Fans of Are We Stable Yet? may be familiar with a crash rate normalized by “hundreds of Active Daily Instances (ADI)”. This is a valid denominator as Firefox usage does tend to scale linearly with the number of unique Firefox instances running that day. It is also very timely, as ADI comes to us from a server that Firefox instances contact at the beginning of their browsing sessions each day.

From across the fence, I am told that Google Chrome uses a crash rate normalized by “millions of pageloads”. This is a valid denominator as “loading a page” is one of the primary actions users take with their browsers. It is not any more or less timely than “thousands of usage hours” but with Google properties being primary page load destinations, this value could potentially be estimated server-side while waiting for user data to trickle in.

Denominators that would probably work, but that I haven’t heard of anyone using, include: the number of times the user opens the browser, the number of times the user scrolls, memory use… generally, anything that increases at the same rate crashes do on a given browser version could be used.

So why choose “thousands of usage hours” when ADI comes in faster and pageloads are more closely related to the actions users take in the browser?

Compared to ADI, thousands of usage hours has proven to be a more reasonable and stable measure. In crashes-per-100-ADI there are odd peaks and valleys that don’t reflect decreases or increases in build quality. And though crashes scale proportionally with the number of Firefox instances running, they scale more closely with how heavily those instances are being used.

As for why we don’t use pageloads… Well, the first reason is that “thousands of usage hours” is something we already have kicking around. A proper count of pageloads is something we’re adding at present. It will take a while for users to start sending us these numbers, and a little development effort to get that number into the correct dataset for analysis. Then we will evaluate its suitability. It won’t be faster or slower than “thousands of usage hours” (since it will use the same reporting mechanism), but I’ve heard no compelling evidence that it will result in a more stable or reasonable measure. So I’ll do what I always try to do: let the data decide.

So, for the present, that leaves us with crashes per thousands of usage hours which, aside from latency issues we have yet to overcome, seems to be doing fairly well.

:chutten

Data Science is Hard – Case Study: What is a Firefox Crash?

In the past I’ve gone on at length about the challenge of getting timely data to determine Firefox release quality with respect to how often Firefox crashes. Comparatively I’ve spent essentially no time at all on what a crash actually is.

A crash (broadly) is what happens when a computer process encounters an error it cannot recover from. Since it cannot recover, the system it is running within ends the process abruptly.

Not all crashes are equal. Not all crashes mean the same thing to users and to release managers and to computer programmers.

If you are in the middle of drafting an email and the web page content suddenly goes blank and says “Sorry, this tab has crashed.” then that’s a big deal. It’s even worse if the entire browser disappears without warning.

But what if Firefox crashes, but only after it has mostly shut down? Everything’s been saved properly, but we didn’t clean up after ourselves well. This is a crash (technically), but does it really matter to a user?

What if the process that contains Flash crashes and web advertisements stop working? It can be restarted without too much trouble, and no one likes ads, so is it really that bad of a thing?

And on top of these families of events, there are other horrible things that can happen to users we might want to call “crashes” even though they aren’t. For instance: what if the browser becomes completely unresponsive and the user has no recourse but to close it? The process didn’t encounter a fatal error, but that user’s situation is the same: Something weird happened, and now their data is gone.

Generally speaking, I look at four classes of crash: Main Crashes (M), Content Crashes (C), Content Shutdown Crashes (S), and Plugin Crashes (P).

In my opinion, the most reliable indicator of Firefox’s stability and quality is M + C – S. In plain English, it is the sum of the events where the whole Browser goes poof or the Web Content inside the browser goes poof, ignoring the times when the Web Content goes poof after the user has decided to shut down the browser.

It doesn’t include Plugin crashes, as those are less obtrusive and more predicted by the plugin code, not Firefox code. It does include some events where Firefox became unresponsive (or “hangs” for short) and had to be terminated.
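The arithmetic itself is nothing fancy; here’s a sketch with made-up daily counts:

```python
# Hypothetical daily crash counts, one per class described above.
crash_counts = {
    "main": 1_200,              # M: the whole browser went poof
    "content": 4_500,           # C: web content went poof
    "content_shutdown": 1_800,  # S: content went poof while shutting down anyway
    "plugin": 9_000,            # P: deliberately left out of the metric
}

# The quality measure: M + C - S. Plugin crashes don't count.
stability_crashes = (
    crash_counts["main"]
    + crash_counts["content"]
    - crash_counts["content_shutdown"]
)
print(stability_crashes)  # 3900
```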

This, to my mind, most accurately encompasses a measure of Firefox quality. If the number of these crashes goes up, that means there are more times where more users are having less fun with Firefox. If the number of these crashes goes down, that means there are fewer times that fewer people are having less fun with Firefox.

It doesn’t tell the whole story. What good is a not-crashing browser if it doesn’t scroll when you ask it to? What good is a stable piece of web content if half of it is missing because we don’t support it? What good is a Firefox that is open all the time if it takes twice as long to load the web pages you care about?

But it gives us one very important part of the Firefox Quality story, and that’s good enough for me.

:chutten

Data Science is Hard – Case Study: Latency of Firefox Crash Rates

Firefox crashes sometimes. This bothers users, so a large amount of time, money, and developer effort is devoted to keeping it from happening.

That’s why I like that image of Firefox Aurora’s crash rate from my post about Firefox’s release model. It clearly demonstrates the result of those efforts to reduce crash events:

[Image: Aurora crash rate over time]

So how do we measure crashes?

That picture I like so much comes from this dashboard I’m developing, and uses a very specific measure of both what a crash is, and what we normalize it by so we can use it as a measure of Firefox’s quality.

Specifically, we count the number of times Firefox or the web page content disappears from the user’s view without warning. Unfortunately, this simple count of crash events doesn’t give us a full picture of Firefox’s quality, unless you think Firefox is miraculously 30% less crashy on weekends:

[Image: raw crash counts for Aurora 51a2]

So we need to normalize it based on some measure of how much Firefox is being used. We choose to normalize it by thousands of “usage hours” where a usage hour is one hour Firefox was open and running for a user without crashing.

Unfortunately, this choice of crashes per thousand usage hours as our metric, and how we collect data to support it, has its problems. Most significant amongst these problems is the delay between when a new build is released and when this measure can tell you if it is a good build or not.

Crashes tend to come in quickly. Generally speaking, when a user’s Firefox disappears out from under them, they are quick to restart it. This means this new Firefox instance is able to send us information about that crash usually within minutes of it happening. So for now, we can ignore the delay between a crash happening and our servers being told about it.

The second part is harder: when should users tell us that everything is fine?

[Image: everything is OK]

We can introduce code into Firefox that would tell us every minute that nothing bad happened… but could you imagine the bandwidth costs? Even every hour might be too often. Presently we record this information when the user closes their browser (or if the user doesn’t close their browser, at the user’s local midnight).

The difference between the user experiencing an hour of un-crashing Firefox and that data being recorded is the reporting delay. This tends not to exceed 24 hours.

If the user shuts down their browser for the day, there isn’t an active Firefox instance to send us the data for collection. This means we have to wait for the next time the user starts up Firefox to send us their “usage hours” number. If this was a Friday’s record, it could easily take until Monday to be sent.

The difference between the data being recorded and the data being sent is the submission delay. This can take an arbitrary length of time, but we tend to see a decent chunk of the data within two days’ time.

This data is being sent in throughout each and every day. Somewhere at this very moment (or very soon) a user is starting up Firefox and that Firefox will send us some Telemetry. We have the facilities to calculate at any given time the usage hours and the crash numbers for each and every part of each and every day… but this would be a wasteful approach. Instead, a scheduled task performs an aggregation of crash figures and usage hour records per day. This happens once per day and the result is put in the CrashAggregates dataset.
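The real job runs on a cluster against much richer data, but a toy sketch of the same per-day roll-up (using pandas, with made-up records and column names) looks something like this:

```python
import pandas as pd

# Made-up per-ping records; the real CrashAggregates job works over far richer data.
pings = pd.DataFrame([
    {"submission_date": "2016-12-01", "crashes": 1, "usage_hours": 2.5},
    {"submission_date": "2016-12-01", "crashes": 0, "usage_hours": 8.0},
    {"submission_date": "2016-12-02", "crashes": 2, "usage_hours": 5.5},
])

# One row per day: total crashes, total usage hours, and the normalized rate.
daily = pings.groupby("submission_date")[["crashes", "usage_hours"]].sum()
daily["crashes_per_1k_hours"] = daily["crashes"] / (daily["usage_hours"] / 1000)
print(daily)
```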

The difference between a crash or usage hour record being submitted and it being present in this daily derived dataset is aggregation delay. This can be anywhere from 0 to 23 hours.

This dataset is stored in one format (parquet), but queried in another (prestodb fronted by re:dash). This migration task is performed once per day some time after the dataset is derived.

The difference between the aggregate dataset being derived and its appearance in the query interface is migration delay. This is roughly an hour or two.

Many queries run against this dataset and are scheduled sequentially or on an ad hoc basis. The query that supplies the data to the telemetry crash dashboard runs once per day at 2pm UTC.

The difference between the dataset being present in the query interface and the query running is query scheduling delay. This is about an hour.

This provides us with a handy equation:

latency = reporting delay + submission delay + aggregation delay + migration delay + query scheduling delay

With typical values, we’re seeing:

latency = 6 hours + 24 hours + 12 hours + 1 hour + 1 hour

latency ≈ 2 days
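Or, summed as timedeltas, just to check the arithmetic:

```python
from datetime import timedelta

# Typical values from above; submission delay in particular is really a range.
delays = [
    timedelta(hours=6),   # reporting delay
    timedelta(hours=24),  # submission delay
    timedelta(hours=12),  # aggregation delay
    timedelta(hours=1),   # migration delay
    timedelta(hours=1),   # query scheduling delay
]

print(sum(delays, timedelta()))  # 1 day, 20:00:00 -- about 44 hours, call it 2 days
```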

And since submission delay is unbounded (and tends to be longer than 24 hours on weekends and over holidays), the latency is actually a range of probable values. We’re never really sure when we’ve heard from everyone.

So what’s to blame, and what can we do about it?

The design of Firefox’s Telemetry data reporting system is responsible for reporting delay and submission delay: two of the worst offenders. submission delay could be radically improved if we devoted engineering resources to submitting Telemetry (both crash numbers and “usage hour” reports) without an active Firefox running (using, say, a small executable that runs as soon as Firefox crashes or closes). reporting delay will probably not be adjusted very much as we don’t want to saturate our users’ bandwidth (or our own).

We can improve aggregation delay simply by running the aggregation, migration, and query multiple times a day, as information is coming in. Proper scheduling infrastructure can remove all the non-processing overhead from migration delay and query scheduling delay which can bring them easily down below a single hour, combined.

In conclusion, even given a clear and specific metric and a data collection mechanism with which to collect all the data necessary to measure it, there are still problems when you try to use it to make timely decisions. There are technical solutions to these technical problems, but they require a focused approach to improve the timeliness of reported data.

:chutten