Award Abstract # 1223825
TWC: Small: Assessing Online Information Exposure Using Web Footprints

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: GEORGETOWN UNIVERSITY
Initial Amendment Date: August 28, 2012
Latest Amendment Date: August 28, 2012
Award Number: 1223825
Award Instrument: Standard Grant
Program Manager: Nan Zhang
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: January 15, 2013
End Date: December 31, 2016 (Estimated)
Total Intended Award Amount: $499,996.00
Total Awarded Amount to Date: $499,996.00
Funds Obligated to Date: FY 2012 = $499,996.00
History of Investigator:
  • Lisa Singh (Principal Investigator)
    singh@cs.georgetown.edu
  • Micah Sherr (Co-Principal Investigator)
  • Grace Hui Yang (Co-Principal Investigator)
Recipient Sponsored Research Office: Georgetown University
MAIN CAMPUS
WASHINGTON
DC  US  20057
(202)625-0100
Sponsor Congressional District: 00
Primary Place of Performance: Georgetown University
37th & O St N W
Washington
DC  US  20057-1789
Primary Place of Performance Congressional District: 00
Unique Entity Identifier (UEI): TF2CMKY1HMX9
Parent UEI: TF2CMKY1HMX9
NSF Program(s): Secure & Trustworthy Cyberspace
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7434, 7923, 9102
Program Element Code(s): 806000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This research project studies a new area of research, exposure detection, that lies at the intersection of data mining, security, and natural language processing. Exposure detection refers to discovering components or attributes of a user's public profile that reduce the user's privacy. To help the public understand the privacy risks of sharing certain information on the web, this research project focuses on developing efficient algorithms for modeling how an adversary learns information using incomplete and schemaless public data sources. Theoretically sound and efficient techniques for identifying accurate web footprints are introduced, including new methods for data matching using a novel probabilistic join operator on multi-granular data, automated approaches for generating inference rules, and new solutions for identifying missing information and unifying mismatched vocabulary using lightweight natural language processing and text mining. The research activities also investigate methods for quantifying and adjusting exposure and risk, facilitating a better understanding of individuals' vulnerability on the web. These techniques not only advance the state of the art in re-identification, probabilistic reasoning and inference logic, and natural language understanding, but also serve as a foundation for exposure detection.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Adam Bates, Kevin Butler, Micah Sherr, Clay Shields, Patrick Traynor and Dan Wallach "Accountable Wiretapping -or- I Know They Can Hear You Now" Journal of Computer Security, v.23, 2015, p.167-195
Andrew Jie Zhou, Hui Yang, Hongkai Wu "Minerva II: A Novel Entity Discovery Tool" Conference on Human Factors in Computing Systems (CHI 2016), 2016, doi:10.1145/2851581.2892520
Hui Yang "Browsing Hierarchy Construction by Minimum Evolution" ACM Transactions on Information Systems (TOIS), v.33, 2015
Hui Yang, Dongyi Guan, Sicong Zhang "The Query-Change Model: Modeling Session Search as a Markov Decision Process" ACM Transactions on Information Systems (TOIS), v.33, 2015
Janet Zhu, Sicong Zhang, Lisa Singh, Grace Hui Yang and Micah Sherr "Generating Risk Reduction Recommendations to Decrease Vulnerability of Public Online Profiles" The 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2016
Marc Sloan, Hui Yang and Jun Wang "A Term-Based Methodology for Query Reformulation Understanding" Information Retrieval Journal (IRJ), v.18, 2015
Sicong Zhang, Grace Hui Yang, Lisa Singh "Anonymizing Query Logs by Differential Privacy" Proceedings of the 39th Annual ACM SIGIR Conference (SIGIR 2016), 2016
Sicong Zhang, Grace Hui Yang, Lisa Singh, Li Xiong "Safelog: Supporting Web Search and Mining by Differentially-Private Query Logs" AAAI Fall Symposium on Privacy and Language Technologies, 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

With the emergence of online social networks and the growing popularity of digital communication, more and more information about individuals is becoming available on the Internet. While much of this information is not sensitive, it is not uncommon for users to publish some sensitive information on social networking sites. The availability of this publicly accessible and potentially sensitive data can (and does) lead to abuse, exposing users to stalking and identity theft.

To help users better understand the potential risks associated with publishing certain data on the web, as well as the quantity and sensitivity of information that can be obtained by combining data from various online sources, we developed a multifaceted framework and prototype system that generates and analyzes web footprints. A web footprint is the trace of one’s social activities, represented by a set of attributes that are known, or can be inferred with high probability, by an adversary who has basic information about a user and access to publicly available information from online sources. This research project focused on mitigating such privacy threats by constructing a framework whose algorithms model how an adversary learns information from incomplete and schema-less public data sources, enabling web users to better understand their public profiles.

Our framework includes three types of inference: pattern-based inference, probabilistic dependency inference, and population-based inference. Pattern-based inference uses bootstrapped patterns found in a corpus to extract structured attributes from text, e.g., identifying a birthday from a tweet. This increases the amount of usable information for web footprint construction. In addition to observable data, probabilistic inference logic is applied to supplement web footprints with probable attribute-value pairs learned using algebraic dependencies between attribute values in user profiles on different sites. Finally, we use site-level population data to further infer the user’s attribute values. To allow for population-level comparison, our framework also quantifies a user’s level of public information exposure relative to others with similar traits as well as relative to the population as a whole.
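
As a rough illustration of two of these inference types, the sketch below extracts a birthday from free text with a single pattern and guesses a missing attribute from population records. It is not the project's implementation: the hand-written regular expression stands in for the bootstrapped patterns, and the function names, the record layout (plain dictionaries), and the 0.6 confidence cutoff are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical hand-written pattern; the project learns its patterns
# from a corpus rather than hard-coding them.
BIRTHDAY_PATTERN = re.compile(
    r"\bmy (?:birthday|bday) is (?:on )?"
    r"(?P<month>January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (?P<day>\d{1,2})\b",
    re.IGNORECASE,
)


def extract_birthday(text):
    """Pattern-based inference: return (month, day) or None."""
    match = BIRTHDAY_PATTERN.search(text)
    if match is None:
        return None
    return match.group("month").capitalize(), int(match.group("day"))


def infer_from_population(known, target, population, min_confidence=0.6):
    """Population-based inference (assumed form): among records that agree
    with every known attribute, return the most common value of `target`
    and its share, or None if that share is below `min_confidence`."""
    matching = [
        record[target]
        for record in population
        if target in record
        and all(record.get(key) == value for key, value in known.items())
    ]
    if not matching:
        return None
    value, count = Counter(matching).most_common(1)[0]
    confidence = count / len(matching)
    return (value, confidence) if confidence >= min_confidence else None


if __name__ == "__main__":
    print(extract_birthday("So excited, my birthday is on March 3!"))
    population = [
        {"employer": "X", "city": "DC"},
        {"employer": "X", "city": "DC"},
        {"employer": "X", "city": "NYC"},
        {"employer": "Y", "city": "LA"},
    ]
    print(infer_from_population({"employer": "X"}, "city", population))
```

An inferred value that clears the cutoff would be added to the web footprint alongside the directly observed attributes; where to set that cutoff is a design choice this sketch does not settle.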

The final part of our framework focuses on helping users improve their privacy. We developed risk reduction recommendation algorithms that suggest removing or modifying a small number of attributes, thereby reducing the overall number of attributes that can be discovered with high confidence using inference methods. While we developed a number of different strategies, the most novel one suggests modifications to a user’s public profile so that it directly matches a persona, where a persona is a set of attribute-value pairs that occur together in a population with a frequency above a predefined threshold. A profile is deemed safe if it matches a persona because that profile’s particular set of attributes occurs enough times in the population to allow it to blend into a homogeneous group. The concept of blending into a crowd is similar to the idea of k-anonymity, but we extend it to the realm of public social network data containing shared attributes across websites.
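
The persona idea can be sketched in a few lines. The example below is a simplified reading of the description above, not the published algorithm: it treats a persona as a full combination of values over a fixed list of attributes, counts how often each combination occurs, and recommends the smallest set of modifications that makes a profile match some persona. The function names, the dictionary-based profiles, and the restriction to modifications (no removals) are assumptions made for this example.

```python
from collections import Counter


def find_personas(population, attributes, threshold):
    """Return every combination of values over `attributes` that occurs
    in at least `threshold` population profiles."""
    counts = Counter(
        frozenset((attr, profile[attr]) for attr in attributes)
        for profile in population
        if all(attr in profile for attr in attributes)
    )
    return {persona for persona, n in counts.items() if n >= threshold}


def recommend_changes(profile, personas):
    """Return the smallest set of attribute modifications that makes
    `profile` match some persona. An empty dict means the profile already
    matches; None means no persona exists."""
    best = None
    for persona in personas:
        changes = {
            attr: value for attr, value in persona if profile.get(attr) != value
        }
        if best is None or len(changes) < len(best):
            best = changes
    return best


if __name__ == "__main__":
    population = (
        [{"city": "DC", "job": "student"}] * 3
        + [{"city": "NYC", "job": "engineer"}] * 2
        + [{"city": "DC", "job": "pilot"}]
    )
    personas = find_personas(population, ["city", "job"], threshold=2)
    print(recommend_changes({"city": "DC", "job": "pilot"}, personas))
```

Here the threshold plays a role loosely comparable to k in k-anonymity: a profile that matches a persona is shared by at least that many members of the population.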

The project resulted in over 20 publications, including a best paper runner-up; a prototype privacy application that students can use to monitor their web footprints; research training of seven undergraduates, three Master’s students, and two PhD students; and half a dozen outreach talks on web privacy to high school and college students, as well as researchers and practitioners. Our work in the areas of web footprinting, the ethics of using big data, privacy in information retrieval, dynamic search algorithms, voice authentication privacy, and network traffic privacy is pioneering in the respective communities. Research is being advanced in data mining, privacy, security, and information retrieval. Ideas related to this project have been shared through papers, presentations, invited panels, and workshop organization. We hope that the algorithms and software will be extended and used by different universities to help their students better understand the privacy risks associated with sharing so much data publicly.


Last Modified: 03/31/2017
Modified by: Lisa Singh

