Data Science is Hard – Part 1: Data

You’d think that categorizing and measuring populations would be pretty simple. You count them all up, divide them into groups… simple, arithmetic-sounding stuff.

To a certain extent, that’s all it is. You want to know how many people contribute to Firefox in a day? Add ’em up. Want to know what fraction of them are from Europe? Compare the subset from Europe against the entire population.

But that’s where it gets squishy:

  • “in a day?” Which day? Did you choose a weekend day? A statutory holiday? A religious holiday? That’ll change the data. Which 24 hours are you counting? Midnight to midnight, sure, but in which timezone?
  • “from Europe?” What is Europe? Just the EU? How do you tell if a contributor is from Europe? Are you running a geolocation query against their IP? If their IP changes over the day, are we going to double-count that user? Are we asking contributors where they are from? What if they lie?
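To make the “which 24 hours?” question concrete, here is a minimal Python sketch. The timestamps and the `count_by_local_day` helper are made up for illustration and have nothing to do with how Telemetry actually buckets data. It counts the same four events per day twice, once with days cut at UTC midnight and once at a fixed UTC-8 offset.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_by_local_day(timestamps_utc, utc_offset_hours):
    """Count events per calendar day, as seen from a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return Counter(ts.astimezone(tz).date().isoformat() for ts in timestamps_utc)


# Hypothetical contribution timestamps, all recorded in UTC.
events = [
    datetime(2016, 3, 12, 23, 30, tzinfo=timezone.utc),
    datetime(2016, 3, 13, 0, 15, tzinfo=timezone.utc),
    datetime(2016, 3, 13, 6, 0, tzinfo=timezone.utc),
    datetime(2016, 3, 13, 22, 45, tzinfo=timezone.utc),
]

utc_days = count_by_local_day(events, 0)        # days cut at UTC midnight
pacific_days = count_by_local_day(events, -8)   # days cut at UTC-8 midnight

print("UTC:  ", dict(utc_days))
print("UTC-8:", dict(pacific_days))
```

Same events, different “days”: with these toy timestamps, three of the four fall on March 13 in UTC but on March 12 at UTC-8.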

So that leads us to Part 1 of “Data Science is Hard”: Data is Hard.

In a recent 1-on-1, my manager :bsmedberg and I thought that it could be interesting to look into Firefox users whose Telemetry reports come from different parts of the world at different times. Maybe we could identify users who travel (Firefox Users Who Travel: Where do they travel to/from?). Maybe they can help us understand the differing needs of Firefox users who are on vacation as opposed to being at home. Maybe they’ll show us Tor Browser users, or users using other anonymizing techniques and technologies: and maybe we should see if there’s some special handling we could provide for them and their data.

I used this topic as a way to learn our new re:dash dashboard, which sits on top of the prestodb instance of the Longitudinal Dataset and lets me run SQL queries against a 1% random sample of Firefox users’ Telemetry data from the past 180 days.

Immediately I ran into problems. First, with remembering all the SQL I had forgotten in the *mumblesomething* years since I last had to write interesting queries.

But then I quickly ran into problems with the data. I ran a query to boil down how many (and which) unique countries each client had reported Telemetry from:

SELECT
    cardinality(array_distinct(geo_country)) AS country_count
    , array_distinct(geo_country) AS countries
FROM longitudinal_v20160314
ORDER BY country_count DESC
LIMIT 5

country_count | countries
35 | ["CN","MX","GB","HU","JP","US","RU","IN","HK","??","CA","KR","TW","CM","DK","CH","ZA","PH","DE","VN","NL","CO","KZ","MA","TR","FR","AU","GR","IE","AR","BY","AT","TN","BR","AM"]
34 | ["DE","RU","LT","UA","MA","GB","GI","AE","FR","CN","AM","NG","NL","PT","TH","PL","ES","NO","CH","IL","ZA","BY","US","UZ","HK","TW","JP","PK","LU","SG","FI","EU","IN","ID"]
34 | ["US","BR","KR","NZ","RO","JP","ES","GB","TW","CN","UA","AU","NL","FR","FI","??","NO","CA","ZA","CL","IT","SE","SG","CH","RU","DE","MY","IN","ID","VN","PL","PH","KE","EG"]
34 | ["GB","CN","??","DE","US","RU","AL","ES","NL","FR","KR","FI","IR","CA","JP","HK","AU","CH","RO","CO","IE","BR","SE","GR","IN","MX","RS","AR","TW","IT","SA","ID","VN","TN"]
34 | ["US","GI","??","GB","DE","SA","KR","AR","ZA","CN","IN","AT","CA","KE","IQ","VN","TR","KZ","JP","BR","FR","TW","IT","ID","SG","RU","CL","BA","NL","AU","BE","LT","PT","ES"]

35 unique countries visited? Wow.

The “Countries” column is in order of when they first appeared in the data, so we know that the first user was reporting from China then Mexico then Great Britain then Hungary then Japan then the US then Russia…
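For anyone who (like me) has forgotten their SQL, here is roughly what that first query does, sketched in Python on made-up data. The client ids and country lists are hypothetical, and the sketch assumes the distinct list keeps first-appearance order, as described above.

```python
def distinct_in_order(values):
    """Drop repeats while keeping first-appearance order."""
    return list(dict.fromkeys(values))


# Hypothetical per-client lists of reported country codes, one entry per subsession.
clients = {
    "client-a": ["CA", "CA", "US", "CA", "US"],
    "client-b": ["DE", "DE", "DE"],
    "client-c": ["CN", "MX", "CN", "GB", "MX", "HU"],
}

rows = [
    (len(distinct_in_order(countries)), distinct_in_order(countries), client_id)
    for client_id, countries in clients.items()
]
rows.sort(key=lambda row: row[0], reverse=True)  # like ORDER BY country_count DESC

for country_count, countries, _client_id in rows[:5]:  # like LIMIT 5
    print(country_count, countries)
```

The toy client with the most distinct countries sorts to the top, with its countries listed in the order they first showed up.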

Either this is a globetrotting super spy, or we’re looking at some sort of VPN/Tor/anonymizing framework at play here.

( Either way I think it best to say, “Thank you for using Firefox, Ms. Super Spy!” )

Or maybe this is a sign that the geolocation service is unreliable, or that the data intake services are buggy, or something else that would be less than awesome.

Regardless: this data is hugely messy. But, 35 countries over 180 days? That’s just about doable in real life… except that it wasn’t over 180 days, but 2:

SELECT
    -- distinct countries this client reported from
    cardinality(array_distinct(geo_country)) AS country_count
    -- one geo_country entry per subsession
    , cardinality(geo_country) AS subsession_count
    -- integer division: subsessions divided by days spanned (inclusive)
    , cardinality(geo_country) / (date_diff('DAY', from_iso8601_timestamp(array_min(subsession_start_date)), from_iso8601_timestamp(array_max(subsession_start_date))) + 1) AS subsessions_per_day
    -- days spanned, counting both the first and the last day
    , date_diff('DAY', from_iso8601_timestamp(array_min(subsession_start_date)), from_iso8601_timestamp(array_max(subsession_start_date))) + 1 AS duration
FROM longitudinal_v20160314
ORDER BY country_count DESC
LIMIT 1

country_count | subsession_count | subsessions_per_day | duration
35 | 169 | 84 | 2

This client reported from 35 countries over 2 days. That averages out to more than 17 distinct countries per day (duplicates aren’t counted).

Also of note to Telemetry devs, this client was reporting 84 subsessions per day.

(Subsessions happen at a user’s local midnight and whenever some aspect of the Environment block of Telemetry changes (your locale, your multiprocess setting, how many addons you have installed). If your Firefox is registering that many subsession edges per day, there might be something wrong with your install. Or there might be something wrong with our data intake or aggregation.)
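The arithmetic behind that second query’s columns can be sketched in Python. Everything here is made up to reproduce the shape of the row above: the placeholder country codes, the start dates, and the `summarize_client` helper are all hypothetical. The sketch counts calendar days between the first and last subsession, which may not match `date_diff` exactly in every edge case, and it uses integer division because that is what dividing two integers in the query does.

```python
from datetime import datetime


def summarize_client(countries, start_dates):
    """Mimic the query's columns for one client (a sketch, not the real pipeline)."""
    starts = [datetime.fromisoformat(d) for d in start_dates]
    # Days spanned, counting both the first and the last day.
    duration = (max(starts).date() - min(starts).date()).days + 1
    subsession_count = len(countries)
    return {
        "country_count": len(set(countries)),
        "subsession_count": subsession_count,
        "subsessions_per_day": subsession_count // duration,  # integer division
        "duration": duration,
    }


# Hypothetical client: 169 subsessions over two days, cycling through 35 placeholder codes.
countries = [f"C{i % 35:02d}" for i in range(169)]
start_dates = (
    ["2016-03-10T00:00:00+00:00"] * 85
    + ["2016-03-11T00:00:00+00:00"] * 84
)

summary = summarize_client(countries, start_dates)
print(summary)
```

Note the integer division: 169 subsessions over 2 days is really 84.5 per day, which gets truncated to 84.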

I still plan on poking around this idea of Firefox Users Who Travel. As I do so I need to remember that the data we collect is only useful for looking at Populations. Knowing that there’s one user visiting 35 countries in 2 days doesn’t help us decide whether or not we should release a special Globetrotter Edition of Firefox… since that’s just 1 of 4 million clients of a dataset representing only 1% of Firefox users.

Knowing that about a dozen users reported days with over 250 subsessions might result in some evaluation of that code, but without something linking these high-subsession-rate users together into a Population (maybe they’re machines running automated testing?), there’s nothing much we can do about it.

Instead I should focus on how, in a 4M user dataset, 112k (2.7%) users report from exactly 2 countries over the duration of the dataset. There are only 44k that report from more than 2, and the other 3.9M or so report exactly 1.
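As a rough illustration of that Population-level view, here is a Python sketch that buckets clients by how many countries they reported from and turns the buckets into shares. The toy list of per-client country counts is made up (it is not the real distribution), and `bucket` and `population_shares` are hypothetical helpers.

```python
from collections import Counter


def bucket(country_count):
    """Label a client by its number of distinct countries (assumes at least 1)."""
    if country_count == 1:
        return "exactly 1"
    if country_count == 2:
        return "exactly 2"
    return "more than 2"


def population_shares(country_counts):
    """Fraction of clients that falls into each bucket."""
    buckets = Counter(bucket(n) for n in country_counts)
    total = sum(buckets.values())
    return {name: count / total for name, count in buckets.items()}


# Hypothetical toy population of 100 clients, including one 35-country outlier.
toy_counts = [1] * 95 + [2] * 3 + [5, 35]

shares = population_shares(toy_counts)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The 35-country outlier is just one more client in the “more than 2” bucket, which is the point: the buckets are something we can reason about, the outlier is not.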

2.7% is a sliver of 1% of the Firefox population, but it is a Population. A Population is something we can analyse and speak meaningfully about, as the noise and mess of individual points of data has been smoothed out by the sheer weight of the Firefox user base.

It’s nice having a user base large enough to speak meaningfully about.

:chutten
