
NSF Org: |
OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | September 10, 2014 |
Latest Amendment Date: | September 10, 2014 |
Award Number: | 1443014 |
Award Instrument: | Standard Grant |
Program Manager: | Amy Walton, awalton@nsf.gov, (703)292-4538; OAC Office of Advanced Cyberinfrastructure (OAC); CSE Directorate for Computer and Information Science and Engineering |
Start Date: | January 1, 2015 |
End Date: | December 31, 2018 (Estimated) |
Total Intended Award Amount: | $1,498,683.00 |
Total Awarded Amount to Date: | $1,498,683.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
2200 W MAIN ST DURHAM NC US 27705-4640 (919)684-3030 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
NC US 27705-4010 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Data Cyberinfrastructure, Cybersecurity Innovation |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This research project will develop a pilot of an integrated system for disseminating large-scale data about people. The project will address critical challenges that have inhibited the widespread dissemination of large-scale databases that can advance basic social, behavioral, and economic science research and that offer enormous potential benefits to society. Chief among these challenges is the unintended disclosure of data subjects' identities and sensitive attributes, which violates promises, and sometimes laws, designed to protect data subjects' privacy and confidentiality. The products of this project will facilitate the development and dissemination of safe and useful large-scale datasets. The project will result in extensible, open-source products that constitute a proof of concept and that will provide valuable information for future larger-scale implementations of the system. The project therefore will lay the groundwork for a potential transformation in data dissemination, providing data stewards with the infrastructure they need to release data products that advance social science, policy making, and training. The project also will provide education and training opportunities for a post-doctoral researcher as well as graduate and undergraduate students.
The investigators will create new methodology and broadly applicable tools for meeting data dissemination challenges. From a technical perspective, they will advance methodology for generating synthetic datasets via nonparametric methods capable of handling high-dimensional data. They will advance methodology for providing feedback on the quality of inferences from heavily redacted data, and they will develop methods for in-depth assessment and characterization of the disclosure risks inherent in releasing large-scale synthetic data with and without verification servers. From an infrastructure perspective, the investigators will develop systems and architecture for integrating the three core tools (synthetic data, verification servers, and remote access) in ways that result in secure, scalable access to data. The pilot system will be built with the goal of disseminating a version of a dataset on the work histories of federal government employees.
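To make the idea of a synthetic dataset concrete, the following is a minimal sketch of sequential synthesis for two categorical variables. It is a toy illustration only: the project's nonparametric methods for high-dimensional longitudinal data are not reproduced here, and the variable names (`grade`, `pay_band`), the example records, and all function names are hypothetical. Sampling from empirical frequencies like this carries no formal privacy guarantee by itself.

```python
import random
from collections import Counter, defaultdict


def fit_sequential_model(records):
    """Estimate p(grade) and p(pay_band | grade) from (grade, pay_band) pairs."""
    marginal = Counter(grade for grade, _ in records)
    conditional = defaultdict(Counter)
    for grade, band in records:
        conditional[grade][band] += 1
    return marginal, conditional


def draw(counter, rng):
    """Draw one value with probability proportional to its observed count."""
    values = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def synthesize(model, n, seed=0):
    """Simulate n records: first draw grade, then pay_band given grade."""
    marginal, conditional = model
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        grade = draw(marginal, rng)
        band = draw(conditional[grade], rng)
        out.append((grade, band))
    return out


# Hypothetical stand-in for a confidential file (invented values).
confidential = [
    ("GS-9", "low"), ("GS-9", "low"), ("GS-9", "mid"),
    ("GS-12", "mid"), ("GS-12", "mid"), ("GS-12", "high"),
    ("GS-14", "high"), ("GS-14", "high"),
]

model = fit_sequential_model(confidential)
synthetic = synthesize(model, 1000, seed=42)
```

The simulated records are drawn from the fitted model rather than copied from the file, so the synthetic file can be any size while its category frequencies approximate those of the source data.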
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
In this project, we developed methodology and tools for providing researchers, students, and other members of the public access to confidential social science data. The developments focus on three main concepts.
The first concept is called synthetic data. A synthetic data file consists entirely of simulated data, generated so that the statistical properties of the simulated data mimic those of the confidential data. This data file can have low risks of unintended disclosures, since the released records do not correspond to actual individuals in the confidential data file. We developed new methods for generating synthetic data for longitudinal data files with many variables. We used these methods to develop a synthetic dataset comprising work histories of employees in the federal government.
The second concept is called a verification server. This is a computer server that holds both the synthetic and the confidential data files. Users of the synthetic data can query the server for feedback on the quality of results obtained from the synthetic data; for example, is the value of a regression coefficient estimated from the synthetic data similar to the estimate from the confidential data? We developed verification measures that satisfy differential privacy, so that releasing verification measures to users does not leak too much information about the confidential data. We developed software for implementing these measures, as well as a user interface that could serve as the front end of verification servers developed in other contexts.
The third concept is called secure remote access. Vetted users can log in to a server at a host institution to access the confidential data. We developed software that allows the host institution to provide multi-factor authenticated access to users outside the host institution, without having to create accounts for those users.
We demonstrated how to integrate synthetic data, verification, and remote access in a single system.
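As an illustration of how a verification measure can satisfy differential privacy, the sketch below answers the question "does the regression slope in the confidential data exceed a threshold?" The report does not specify the mechanism, so this is an assumed approach: split the confidential data into disjoint partitions, let each partition vote, and add Laplace noise to the vote count. Changing one record alters at most one partition's vote, so the count has sensitivity 1 and Laplace noise with scale 1/epsilon gives epsilon-differential privacy. All function names, parameters, and data here are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx if sxx > 0 else 0.0


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_agreement_count(xs, ys, threshold, n_parts, epsilon, rng):
    """Noisy number of partitions whose slope exceeds `threshold`.

    Partition membership depends only on record position, not on data values,
    so replacing one record changes at most one vote (sensitivity 1).
    """
    votes = 0
    for k in range(n_parts):
        part_x = xs[k::n_parts]
        part_y = ys[k::n_parts]
        if ols_slope(part_x, part_y) > threshold:
            votes += 1
    return votes + laplace_noise(1.0 / epsilon, rng)


# Invented "confidential" data with a true slope of 2.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

# A user who found a positive slope in the synthetic data asks the server
# how many of 20 confidential partitions agree.
answer = noisy_agreement_count(xs, ys, threshold=0.0, n_parts=20,
                               epsilon=1.0, rng=random.Random(1))
```

A noisy count near the number of partitions suggests that the finding from the synthetic data also holds in the confidential data; a smaller epsilon gives stronger privacy at the cost of a noisier answer.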
We used the synthetic data on federal employees to estimate pay differentials by gender and race in the federal government, and we found differences by both race and gender. We verified these results using the differentially private measures. Finally, we ran the analyses on the confidential data and found that the disparities exist in those data as well. The integrated system performed as intended. The synthetic data allowed users to draw reasonable inferences about pay disparities. The verification server confirmed findings from the synthetic data that also hold in the confidential data, and it flagged findings from the synthetic data that do not hold up in the confidential data. All software from the project is available in a free, public repository on GitHub. The project trained one post-doctoral associate, two PhD students, two master's students, and two undergraduate students, all of whom finished their degrees.
Last Modified: 02/21/2019
Modified by: Jerome P Reiter