Data Science is Hard: Anomalies Part 3

So what do you do when you have a duplicate data problem and it just keeps getting worse?

You detect and discard.

Specifically, since we already have a few billion copies of pings with identical document ids (which are extremely unlikely to collide), there is no benefit in continuing to store them. So what we do is write a short report about what the incoming duplicate looked like (so that we can continue to analyze trends in duplicate submissions), then toss out the data without even parsing it.
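As a rough illustration of detect-and-discard, here is a minimal Python sketch. It is not the actual pipeline code: the names `Deduplicator`, `DuplicateReport`, and `handle` are hypothetical, the in-memory set stands in for whatever store really tracks seen ids, and it assumes the document id is available without parsing the payload (for example, from the submission path).

```python
import json
from dataclasses import dataclass, field


@dataclass
class DuplicateReport:
    """The short note we keep about a discarded duplicate (hypothetical fields)."""
    doc_id: str
    size_bytes: int
    received_at: float


@dataclass
class Deduplicator:
    """Hypothetical detect-and-discard filter keyed on document id."""
    seen_ids: set = field(default_factory=set)
    reports: list = field(default_factory=list)

    def handle(self, doc_id, raw_payload, received_at):
        """Return the parsed ping, or None if doc_id was already seen."""
        if doc_id in self.seen_ids:
            # Duplicate: record what it looked like, then drop it unparsed.
            self.reports.append(
                DuplicateReport(doc_id, len(raw_payload), received_at)
            )
            return None
        # First sighting: remember the id and only now pay for parsing.
        self.seen_ids.add(doc_id)
        return json.loads(raw_payload)
```

The check happens before `json.loads`, so a duplicate costs a set lookup and one small report rather than a full parse and a stored copy. The reports are what keep the duplicate trends analyzable after the data itself is gone.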

As before, I’ll leave finding out the time the change went live as an exercise for the reader:

[Plot: newplot(1)]