CS Bits & Bytes is a bi-weekly newsletter highlighting innovative computer science research. It is our hope that you will use CS Bits & Bytes to engage in the multi-faceted world of computer science to become not just a user, but also a creator of technology. Please visit our website at: www.nsf.gov/cise/csbytes/.

March 11, 2013
Volume 2, Issue 13

Privacy in the Information Age

To protect ourselves against identity theft and to maintain our privacy, we know better these days than to reveal too much personal information online. But how much information is too much?

Image: fingerprint. Image credit: ThinkStock.

Simple demographic information (such as gender, birthdate or race) is commonly linked to identity (people’s names) in public records such as voter registration databases or birth records. This means that other, seemingly anonymous records containing private information about individuals, such as medical histories, may be traceable back to an individual even if his or her name and Social Security number have been removed from the data. This creates a problem when access to data about individuals is required but privacy must be protected, for example, in datasets necessary for financial accounting or scientific research.
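To see how such tracing can work, here is a minimal sketch of linking two datasets on shared demographics. All names, records and field names below are made up for illustration; they are assumptions, not real data or a real attack tool.

```python
from collections import defaultdict

# Hypothetical public records (e.g., a voter list): names plus demographics.
public_records = [
    {"name": "Alice", "gender": "F", "birthdate": "1990-03-11", "zip": "02138"},
    {"name": "Bob", "gender": "M", "birthdate": "1985-07-04", "zip": "02139"},
    {"name": "Carol", "gender": "F", "birthdate": "1990-03-11", "zip": "02139"},
]

# Hypothetical "anonymous" records: names removed, demographics kept.
medical_records = [
    {"gender": "F", "birthdate": "1990-03-11", "zip": "02138", "diagnosis": "asthma"},
    {"gender": "M", "birthdate": "1985-07-04", "zip": "02139", "diagnosis": "flu"},
]

# The demographic fields that appear in both datasets.
KEYS = ("gender", "birthdate", "zip")


def link(public, private, keys=KEYS):
    """Pair each private record with the names that share its demographics."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in private:
        candidates = names_by_key.get(tuple(record[k] for k in keys), [])
        matches.append((record, candidates))
    return matches


for record, candidates in link(public_records, medical_records):
    print(record["diagnosis"], "->", candidates)
```

In this toy data, each medical record matches exactly one name, so removing the names did not protect anyone. A record is only hidden when several people share its demographics.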

Data is made more anonymous by altering quasi-identifiers: pieces of data, such as gender or birthdate (day, month and year of birth), that can be used to identify an individual. This is done by generalizing (making less specific), suppressing (removing), or distorting (changing) pieces of information. Such alterations result in a trade-off between privacy and the precision, completeness or accuracy of the data. How can we ensure that data is useful and minimally distorted while protecting the privacy of individuals?
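As a rough illustration of generalizing and suppressing, the sketch below rewrites a birthdate at different levels of detail. The function name and the level names are hypothetical, chosen only for this example; distortion is not shown.

```python
from datetime import date


def generalize_birthdate(birthdate, level):
    """Return a birthdate at the requested level of detail.

    level is one of (hypothetical names for this sketch):
      "full"       - day, month and year
      "month_year" - generalized: day removed
      "year"       - generalized: only the year kept
      "suppressed" - birthdate removed entirely
    """
    if level == "full":
        return birthdate.isoformat()
    if level == "month_year":
        return f"{birthdate.year:04d}-{birthdate.month:02d}"
    if level == "year":
        return f"{birthdate.year:04d}"
    if level == "suppressed":
        return "*"
    raise ValueError(f"unknown level: {level}")


birthday = date(1998, 3, 11)
for level in ("full", "month_year", "year", "suppressed"):
    print(level, "->", generalize_birthdate(birthday, level))
```

Each step down the list gives up some precision in exchange for making the record harder to single out.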

Simple demographics often identify people uniquely. Source: Latanya Sweeney, Harvard University.

Latanya Sweeney, a computer scientist at Harvard University, decided to tackle this problem. She found that just a few pieces of simple demographic data are often enough to identify a specific individual. For example, 87 percent of Americans are uniquely identifiable by their gender, birthdate and ZIP code!

As a solution, Professor Sweeney created a computer algorithm that optimizes the generalization and suppression of quasi-identifiers to guarantee a minimum level of anonymity, called k-anonymity. A dataset meets the k-anonymity standard when the quasi-identifiers of every record are identical to (and thus indistinguishable from) those of at least k − 1 other records, where k is a user-defined parameter. This Preferred Minimal Generalization Algorithm, or MinGen for short, provides k-anonymity protection with minimal distortion of data.
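The k-anonymity condition itself is simple to check. The sketch below is not MinGen (it does not search for the best generalization); it only measures the k that an already-altered dataset achieves. The records, field names and function names are assumptions made up for this example.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record matches at least k - 1 other records."""
    return k_of(records, quasi_identifiers) >= k


# Hypothetical, already-generalized records: birth year only, partial ZIP code.
released = [
    {"gender": "F", "birth_year": "1990", "zip": "021**", "condition": "asthma"},
    {"gender": "F", "birth_year": "1990", "zip": "021**", "condition": "flu"},
    {"gender": "M", "birth_year": "1985", "zip": "021**", "condition": "flu"},
    {"gender": "M", "birth_year": "1985", "zip": "021**", "condition": "allergy"},
]

QUASI_IDENTIFIERS = ("gender", "birth_year", "zip")

print(k_of(released, QUASI_IDENTIFIERS))
print(is_k_anonymous(released, QUASI_IDENTIFIERS, 3))
```

Here every combination of gender, birth year and partial ZIP code appears twice, so the toy dataset is 2-anonymous but not 3-anonymous.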

Image: Latanya Sweeney.

Who thinks of this stuff? Latanya Sweeney is the head of the Data Privacy Lab at Harvard University, where she solves real-world problems through research in computer science and public policy. Dr. Sweeney has created a variety of computational tools to protect individual privacy, including facial de-identification software and surveillance technology that operates with a customizable level of identifiability. She is also the creator of Scrub, a program that successfully identifies and replaces 99-100% of personally identifiable information about patients contained in notes and letters shared between physicians without inhibiting effective consultation on patient care. She has testified on re-identifiability of data to the Department of Homeland Security, the Department of Defense and the United States Senate. When she’s not working to find new ways to protect online privacy or prevent identity theft, you might spot Dr. Sweeney riding her motorcycle around Cambridge, Massachusetts.

Links:

Read more about Latanya Sweeney’s computer science and policy research, including k-anonymity, on her website (http://latanyasweeney.org/) and on the website for Harvard’s Data Privacy Lab (http://dataprivacylab.org/).

Check out Dr. Sweeney's most recent work on Discrimination in Online Ad Delivery at: http://dataprivacylab.org/projects/onlineads/.

Watch for a new interactive website (aboutmyinfo.org) from the Data Privacy Lab that will tell you how many people match your characteristics after you enter some basic demographic information.

In honor of Women’s History Month, read more about women in computer science at the Anita Borg Institute (http://anitaborg.org/news/profiles-of-technical-women/famous-women-in-computer-science/; http://anitaborg.org/news/archive/senior-technical-women-profiles-of-success/) and at the National Center for Women & Information Technology (https://www.ncwit.org/itnews).

Activity: De-identifying your classmates

In this activity, you will create a master list of “public” identifying information (name and birthdate) for everyone in the class. Then, each student will create a “private record” of their favorite food along with the same quasi-identifier (birthdate). You will determine the k-value (the number of records that share the same quasi-identifiers) for each record at four levels of disclosure: (1) full disclosure, (2) generalizing the birthdate by removing the day of birth, (3) generalizing the birthdate to the birth year only, and (4) suppressing the birthdate altogether. The higher the k-value, the better protected the personal information.

Terms

public identifying information - information that anyone can access about a certain person. In this activity, this is represented by the list of student names and birthdates.

quasi-identifier - a piece of information that, by itself or taken with other information, could be used to determine an individual’s identity. In this activity, student birthdates are the quasi-identifiers.

private record - information about an individual for which individual identity should remain anonymous. In this activity, the slips of paper with student birthdates and favorite foods represent private records.

Steps

  1. As a class, make a list of all students’ names and birthdates on the board. This list represents publicly available information about the students.
  2. Then, have each student write his or her birthdate and favorite food on a slip of paper and collect them in a hat. These represent a private record for each student.
  3. Have a student select one record at random and read the information aloud. Can you figure out whose favorite food is written on the paper from its accompanying information using the master birthdate list? What if you generalize the student’s date of birth to include year and month but not day? What if you generalize it to include just the year? What if you suppress the birthdate information altogether? Repeat for each record.
  4. Form small teams to determine the k-value (the number of records with the same quasi-identifier, here the birthdate) for each favorite-food record at each disclosure level. Use the chart below:
k-values for each record for different disclosure levels

Record number | Complete record | Generalized: month and year only | Generalized: year only | Suppressed: birthdate omitted
1             |                 |                                  |                        |
2             |                 |                                  |                        |
3             |                 |                                  |                        |
4             |                 |                                  |                        |
5             |                 |                                  |                        |
...           |                 |                                  |                        |
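
If you would like to check your chart with a computer, the steps above can be sketched in a few lines of Python. The birthdates and foods below are invented sample slips, and the function and variable names are assumptions for this example only.

```python
from collections import Counter

# Hypothetical slips of paper: (birthdate as YYYY-MM-DD, favorite food).
records = [
    ("2001-03-11", "pizza"),
    ("2001-03-25", "tacos"),
    ("2001-09-02", "sushi"),
    ("2002-03-11", "pasta"),
    ("2001-03-11", "curry"),
]

# The four disclosure levels from the chart, in column order.
LEVELS = {
    "complete record": lambda birthdate: birthdate,
    "month and year only": lambda birthdate: birthdate[:7],
    "year only": lambda birthdate: birthdate[:4],
    "birthdate omitted": lambda birthdate: "",
}


def k_values(records):
    """Return one row per record: its k-value at each disclosure level."""
    rows = [[] for _ in records]
    for transform in LEVELS.values():
        disclosed = [transform(birthdate) for birthdate, _food in records]
        counts = Counter(disclosed)
        for row, value in zip(rows, disclosed):
            row.append(counts[value])
    return rows


print("record", *LEVELS, sep=" | ")
for number, row in enumerate(k_values(records), start=1):
    print(number, *row, sep=" | ")
```

With these sample slips, record 4 has a k-value of 1 until the birthdate is omitted entirely, because no one else in the sample was born in 2002.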

 

Additional Discussion Questions

Which level of information provides the highest k-value (the most privacy)? Is this enough information to be useful (enough to look for patterns or trends)?

What k-value would you be most comfortable with for keeping your favorite food private in this experiment? What if you wanted to look for trends in the recorded data about favorite food?

How would you expect this experiment to change if you had a much larger (or smaller) sample size (more or fewer students)?

What k-value would you be comfortable with in keeping your real-world personal information (such as medical records or financial history) private?