
NSF Org: CNS Division Of Computer and Network Systems
Recipient:
Initial Amendment Date: August 28, 2012
Latest Amendment Date: August 28, 2012
Award Number: 1223825
Award Instrument: Standard Grant
Program Manager: Nan Zhang, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: January 15, 2013
End Date: December 31, 2016 (Estimated)
Total Intended Award Amount: $499,996.00
Total Awarded Amount to Date: $499,996.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: MAIN CAMPUS, WASHINGTON, DC, US 20057, (202) 625-0100
Sponsor Congressional District:
Primary Place of Performance: 37th & O St NW, Washington, DC, US 20057-1789
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Secure & Trustworthy Cyberspace
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
This research project studies a new area of research - exposure detection - that is at the intersection of data mining, security, and natural language processing. Exposure detection refers to discovering components/attributes of a user's public profile that reduce the user's privacy. To help the public understand the privacy risks of sharing certain information on the web, this research project focuses on developing efficient algorithms for modeling how an adversary learns information using incomplete and schemaless public data sources. Theoretically sound and efficient techniques for identifying accurate web footprints are introduced, including: new methods for data matching using a novel probabilistic join operator on multi-granular data, automated approaches for generating inference rules, and new solutions for identifying missing information and unifying mismatched vocabulary using lightweight natural language processing and text mining. The research activities also investigate methods for quantifying and adjusting exposure and risk, facilitating a better understanding of individuals' vulnerability on the web. These techniques not only advance the state of the art in re-identification, probabilistic reasoning and inference logic, and natural language understanding, but also serve as a foundation for exposure detection.
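One way to picture the probabilistic join on multi-granular data mentioned above is sketched below in Python: attribute values recorded at different granularities (a full birth date on one site versus only a birth year on another) are scored per attribute and combined into an overall link probability. This is an illustrative assumption only, not the project's operator; the function names, the prefix-based granularity heuristic, and the 0.5 threshold are invented for the example.

def match_probability(value_a: str, value_b: str) -> float:
    """Rough probability that two attribute values describe the same fact.

    Exact matches score 1.0; a coarser value that is a prefix of a finer one
    (e.g. a year vs. a full date) scores lower, proportional to its length.
    """
    if value_a == value_b:
        return 1.0
    if value_a.startswith(value_b) or value_b.startswith(value_a):
        shorter, longer = sorted((value_a, value_b), key=len)
        return len(shorter) / len(longer)
    return 0.0


def probabilistic_join(profile_a: dict, profile_b: dict, threshold: float = 0.5) -> float:
    """Link two public profiles by averaging per-attribute match probabilities."""
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        return 0.0
    scores = [match_probability(profile_a[attr], profile_b[attr]) for attr in shared]
    overall = sum(scores) / len(scores)
    return overall if overall >= threshold else 0.0


# Example: a full birth date on one site, only the birth year on another.
site_a = {"name": "J. Smith", "birthdate": "1985-06-12", "location": "Washington, DC"}
site_b = {"name": "J. Smith", "birthdate": "1985", "location": "Washington, DC"}
print(probabilistic_join(site_a, site_b))  # roughly 0.8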
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
With the emergence of online social networks and the growing popularity of digital communication, more and more information about individuals is becoming available on the Internet. While much of this information is not sensitive, it is not uncommon for users to publish some sensitive information on social networking sites. The availability of this publicly accessible and potentially sensitive data can (and does) lead to abuse, exposing users to stalking and identity theft.
To help users better understand the potential risks associated with publishing certain data on the web, as well as the quantity and sensitivity of information that can be obtained by combining data from various online sources, we developed a multifaceted framework and prototype system that generates and analyzes web footprints. Web footprints are the traces of one's social activities, represented by a set of attributes that are known or can be inferred with high probability by an adversary who has basic information about a user and access to publicly available information from online sources. This research project focused on mitigating such privacy threats by constructing a framework whose algorithms model how an adversary learns information from incomplete and schema-less public data sources, enabling web users to better understand their public profiles.
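One minimal way to represent a web footprint as defined above is a set of attributes, each carrying a confidence (1.0 when directly observed, lower when inferred) and a record of where it came from. The sketch below is an assumption for illustration; the class and field names are not taken from the project's software.

from dataclasses import dataclass, field


@dataclass
class FootprintAttribute:
    value: str
    confidence: float  # 1.0 when directly observed, lower when inferred
    source: str        # e.g. the site or inference rule it came from


@dataclass
class WebFootprint:
    user_id: str
    attributes: dict = field(default_factory=dict)  # attribute name -> FootprintAttribute

    def add(self, name: str, value: str, confidence: float, source: str) -> None:
        """Keep only the highest-confidence value seen for each attribute."""
        current = self.attributes.get(name)
        if current is None or confidence > current.confidence:
            self.attributes[name] = FootprintAttribute(value, confidence, source)

    def exposed(self, threshold: float = 0.8) -> dict:
        """Attributes known or inferable with high probability."""
        return {n: a for n, a in self.attributes.items() if a.confidence >= threshold}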
Our framework includes three types of inference: pattern-based inference, probabilistic dependency inference, and population-based inference. Pattern-based inference uses bootstrapped patterns found in a corpus to extract structured attributes from text, e.g., identifying a birthday from a tweet. This increases the amount of usable information for web footprint construction. In addition to observable data, probabilistic inference logic is applied to supplement web footprints with probable attribute-value pairs learned from algebraic dependencies between attribute values in user profiles on different sites. Finally, we use site-level population data to further infer the user's attribute values. To allow for population-level comparison, our framework also quantifies a user's level of public information exposure relative to others with similar traits as well as to the population at large.
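The following sketch illustrates, under simplifying assumptions, the three inference styles described above: a single regular expression stands in for bootstrapped extraction patterns, a hand-written dependency rule stands in for dependencies learned across sites, and a frequency count over peer profiles stands in for population-based inference. The pattern, rule, and probabilities are invented for the example and are not the project's models.

import re
from collections import Counter

# 1. Pattern-based inference: extract a structured attribute from free text,
#    e.g. a birthday mentioned in a tweet.
BIRTHDAY_PATTERN = re.compile(r"my birthday is (\w+ \d{1,2})", re.IGNORECASE)

def extract_birthday(tweet):
    match = BIRTHDAY_PATTERN.search(tweet)
    return match.group(1) if match else None

# 2. Probabilistic dependency inference: supplement a footprint with probable
#    attribute-value pairs implied by dependencies between attribute values
#    seen in profiles on different sites. Each rule is
#    (known attribute, known value) -> (inferred attribute, inferred value, probability).
RULES = [
    (("high_school", "Wilson High"), ("hometown", "Washington, DC", 0.85)),
]

def apply_rules(footprint):
    inferred = {}
    for (attr, value), (new_attr, new_value, prob) in RULES:
        if footprint.get(attr) == value and new_attr not in footprint:
            inferred[new_attr] = (new_value, prob)
    return inferred

# 3. Population-based inference: fall back on the most common value for an
#    attribute among peer profiles that resemble the user's.
def population_infer(attr, peers):
    values = [p[attr] for p in peers if attr in p]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)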
The final part of our framework focuses on helping users improve their privacy. We developed risk reduction recommendation algorithms that suggest removal or modification of a small number of attributes, thereby reducing the overall number of attributes that can be discovered with high confidence using inference methods. While we developed a number of different strategies, the most novel focuses on suggesting modifications to a user's public profile to directly match a persona, where a persona is a set of attribute-value pairs that occur together in a population with a frequency above a predefined threshold. A profile is deemed safe if it matches a persona, because that profile's particular set of attributes occurs enough times in the population to allow it to blend into a homogeneous group. The concept of blending into a crowd is similar to the idea of k-anonymity, but we extend it to the realm of public social network data containing shared attributes across websites.
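A simplified reading of the persona-based recommendation idea is sketched below: mine attribute-value combinations that co-occur in at least a threshold fraction of the population, then suggest the smallest set of attributes a user could remove so that the remaining profile matches one of those personas. The mining and recommendation steps here are stand-ins for the project's algorithms, not a reproduction of them.

from collections import Counter
from itertools import combinations

def mine_personas(population, min_freq, size=2):
    """Attribute-value combinations that co-occur in at least min_freq of profiles."""
    counts = Counter()
    for profile in population:
        for combo in combinations(sorted(profile.items()), size):
            counts[combo] += 1
    n = len(population)
    return [dict(combo) for combo, c in counts.items() if c / n >= min_freq]

def recommend_removals(profile, personas):
    """Smallest set of attributes to drop so the remaining profile matches a persona."""
    best = None
    for persona in personas:
        # Removal alone can reach this persona only if the profile already
        # agrees with it on every persona attribute.
        if any(profile.get(a) != v for a, v in persona.items()):
            continue
        to_drop = set(profile) - set(persona)
        if best is None or len(to_drop) < len(best):
            best = to_drop
    return best  # None means removal alone cannot reach any persona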
The project resulted in over 20 publications (including a best-paper runner-up), a prototype privacy application that students can use to monitor their web footprints, research training for seven undergraduates, three Master's students, and two PhD students, and half a dozen outreach talks on web privacy to high school and college students as well as researchers and practitioners. Our work on web footprinting, the ethics of using big data, privacy in information retrieval, dynamic search algorithms, voice authentication privacy, and network traffic privacy is pioneering in the respective communities. Research is being advanced in data mining, privacy, security, and information retrieval. Ideas related to this project have been shared through papers, presentations, invited panels, and workshop organization. We hope that the algorithms and software will be extended and used by different universities to help their students better understand the privacy risks associated with sharing so much data publicly.
Last Modified: 03/31/2017
Modified by: Lisa Singh