Data Science is Hard: Dangerous Data

I sit next to a developer at my coworking location (I’m one of the many Mozilla staff who work remotely) who recently installed the new Firefox Quantum Beta on his home and work machines. I showed him what I was working on at the time (that graph below showing how nicely our Nightly population has increased in the past six months), and we talked about how we count users.

Screenshot-2017-9-28 Desktop Nightly DAU MAU for the Last Six Months by Version

=> “But of course we’ll be counting you twice, since you started a fresh profile on each Beta you installed. Actually four times, since you used Nightly to download and install those builds.” This, among other reasons, is why counting users is hard.

<= “Well, you just have to link it to my Firefox Account and then I’ll only count as one.” He figured it’d be a quick join and then we’d have better numbers for some users.

=> “Are you nuts?! We don’t link your Firefox Account to Telemetry! Imagine what an attacker could do with that!”

In a world with adversarial trackers, advertising trackers, and ever more additional trackers, it was novel to this pseudo-coworker of mine that Mozilla would specifically not integrate its systems.

Wouldn’t it be helpful to ourselves and our partners to know more about our users? About their Firefox Accounts? About their browsing history…

Mozilla doesn’t play that game. And our mission, our policies, and our practices help keep us from accidentally providing “value” of this kind for anyone else.

We know the size of users’ history databases, but not what’s in them.

We know you’re the same user when you close and reopen Firefox, but not who you are.

We know whether users have a Firefox Account, but not which ones they are.

We know how many bookmarks users have, but not what they’re for.

We know how many tabs users have open, but not why. (And for those users reporting over 1000 tabs: WHY?!)

And even this much we only know when you let us:


Why? Why do we hamstring our revenue stream like this? Why do we compromise on the certainty that having complete information would provide? Why do we allow ourselves to wonder and move cautiously into the unknown when we could measure and react with surety?

Why do we make Data Science even harder by doing this?

Because we care about our users. We think about what a Bad Actor could do if they had access to the data we collect. Before we okay a new data collection we think of all the ways it could be abused: Can it identify the user? Does it link to another dataset? Might it reveal something sensitive?

Yes, we have confidence in our security, our defenses in depth, our privacy policies, and our motivations to work for users and their interests.

But we are also confident that others have motivations and processes and policies that don’t align with ours… and might be given either the authority or the opportunity to gain access in the future.

This is why Firefox Send doesn’t know your encryption key for the files you share with your friends. This is why Firefox Accounts only knows six things (two of them optional) about you, and why Firefox Sync cannot read the data it’s storing for you.

And this is why Telemetry doesn’t know your Firefox Account id.



One thought on “Data Science is Hard: Dangerous Data

  1. Pingback: Anatomy of a Firefox Update – chuttenblog

Comments are closed.