The Problem of Personally Identifiable Information and Large Datasets

By Deane Barker on January 14, 2010

I’ve read a couple of interesting things lately about how large datasets of supposedly anonymous data can be reverse-engineered to reveal people’s identities and other information about them.

A lesbian woman has sued Netflix because she believes the data they released for their now-famous contest was enough to identify and “out” her.

[The] suit argues that the information is personal data protected by Netflix’s privacy policy, and that NetFlix should have known that people would be able to identify users based on that data alone.

This article mentions the other big case in this space: when AOL accidentally released search logs. Each search query had a user ID, which meant you could tie multiple queries to the same person. Some of those queries, when pieced together, were enough to identify the searcher.

While none of the records on the file are personally identifiable per se, certain keywords contain personally identifiable information as a result of the original searcher typing in his or her own name (ego-searching), as well as address, social security number, and other personal information. And since each user is identified on this list by a unique sequential key, it enables a researcher to compile a given user’s search history.
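To make the linking step concrete, here is a minimal sketch of what "compiling a given user's search history" amounts to. The rows, user IDs, and the group_queries_by_user helper are all made up for illustration; this is not the AOL file format or anyone's actual tooling, just the general idea of grouping by a shared key.

```python
from collections import defaultdict

# Hypothetical, invented log rows: (user_id, query). Not real AOL data.
LOG = [
    (101, "landscapers in springfield"),
    (102, "cheap flights"),
    (101, "jane q example"),
    (101, "homes sold on elm street springfield"),
    (102, "weather tomorrow"),
]


def group_queries_by_user(rows):
    """Collect every query issued under the same user ID, in log order."""
    history = defaultdict(list)
    for user_id, query in rows:
        history[user_id].append(query)
    return dict(history)


histories = group_queries_by_user(LOG)
for user_id, queries in histories.items():
    print(user_id, queries)
```

No single row above names anyone, but once the rows for one ID sit side by side (a town, a street, a name typed into the search box), the combination starts to point at one person.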

The Wired article about the Netflix suit includes this fairly scary quote:

[…] if a data set reveals a person’s ZIP code, birthdate and gender, there’s an 87 percent chance that the person can be uniquely identified.
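You can check this kind of claim against any dataset you hold. Below is a rough sketch, assuming each record has been reduced to a (ZIP code, birthdate, gender) tuple; the sample records and the unique_fraction function are invented for illustration, and the toy result has nothing to do with the 87 percent figure quoted above.

```python
from collections import Counter

# Hypothetical, invented records: (zip_code, birthdate, gender).
RECORDS = [
    ("55401", "1975-03-02", "F"),
    ("55401", "1975-03-02", "F"),
    ("55401", "1980-11-17", "M"),
    ("55102", "1968-07-30", "F"),
    ("55102", "1991-01-05", "M"),
]


def unique_fraction(records):
    """Return the share of records whose field combination appears only once."""
    if not records:
        return 0.0
    counts = Counter(records)
    unique = sum(1 for record in records if counts[record] == 1)
    return unique / len(records)


print(unique_fraction(RECORDS))
```

Any record counted as unique here could be matched to a single person by someone who already knows those three facts about them from another source, such as a public voter roll.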

Now, over on Reddit, a credit card fraud “agent” is answering questions, and he describes how predictive models can use a dataset to predict your future behavior:

Some years ago, someone wrote a paper claiming he could get the age, gender and race only from the credit card purchase history. It worked very well. Today, with your full purchase information, we can even “guess” your income range, number of dependents and even weight. We have a statistical profile of every customer. We can even calculate the odds you eat at McDonald’s today, considering you ate there once every X day. In 98% of the time, this model is very accurate.

This is all under the rubric of Personally Identifiable Information, which isn’t as obvious as it seems.  Sure, your social security number identifies you, but your identity is often a puzzle, and you often don’t need all the pieces to figure out what the big picture is.

Gadgetopia
