Award Abstract # 1443014
CIF21 DIBBs: An Integrated System for Public/Private Access to Large-Scale, Confidential Social Science Data

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: DUKE UNIVERSITY
Initial Amendment Date: September 10, 2014
Latest Amendment Date: September 10, 2014
Award Number: 1443014
Award Instrument: Standard Grant
Program Manager: Amy Walton
awalton@nsf.gov
 (703)292-4538
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2015
End Date: December 31, 2018 (Estimated)
Total Intended Award Amount: $1,498,683.00
Total Awarded Amount to Date: $1,498,683.00
Funds Obligated to Date: FY 2014 = $1,498,683.00
History of Investigator:
  • Jerome Reiter (Principal Investigator)
    jreiter@duke.edu
  • John de Figueiredo (Co-Principal Investigator)
  • Ashwin Machanavajjhala (Co-Principal Investigator)
Recipient Sponsored Research Office: Duke University
2200 W MAIN ST
DURHAM
NC  US  27705-4640
(919)684-3030
Sponsor Congressional District: 04
Primary Place of Performance: Duke University
NC  US  27705-4010
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): TP7EK8DZV6N5
Parent UEI:
NSF Program(s): Data Cyberinfrastructure, Cybersecurity Innovation
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 7434, 7726, 8027, 8048
Program Element Code(s): 772600, 802700
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This research project will develop a pilot of an integrated system for disseminating large-scale data about people. This project will address critical challenges that have inhibited the widespread dissemination of large-scale databases that can advance basic social, behavioral, and economic science research and that offer enormous potential benefits to society. Among the challenges that disseminating these data has posed are unintended disclosures of data subjects' identities and sensitive attributes, which violate promises and sometimes laws designed to protect data subjects' privacy and confidentiality. The products of this project will facilitate the development and dissemination of safe and useful large-scale datasets. The project will result in extensible and open-source products that constitute a proof of concept and that will provide valuable information for future larger-scale implementations of the system. The project therefore will lay the groundwork for a potential transformation in data dissemination, providing data stewards with the infrastructure they need to release data products that advance social science, policy making, and training. The project also will provide education and training opportunities for a post-doctoral researcher as well as graduate and undergraduate students.

The investigators will create new methodology and broadly applicable tools for meeting data dissemination challenges. From a technical perspective, they will advance methodology for generating synthetic datasets via nonparametric methods capable of handling high-dimensional data. They will advance methodology for providing feedback on the quality of inferences from heavily redacted data, and they will develop methods for in-depth assessment and characterization of disclosure risks inherent in releasing large-scale synthetic data with and without verification servers. From an infrastructure perspective, the investigators will develop systems and architecture for integrating the three core tools (synthetic data, verification servers, and remote access) in ways that result in secure, scalable access to data. The pilot system will be built with the goal of disseminating a version of a dataset on the work histories of federal government employees.
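
As a rough illustration of what generating a synthetic dataset means, the sketch below fits a smoothed joint distribution to a small table of categorical records and then samples entirely new records from it. This is a toy stand-in, not the project's nonparametric synthesizer: the function names (`fit_cell_probabilities`, `synthesize`), the smoothing constant `alpha`, and the example fields are all hypothetical, and the approach would not scale to high-dimensional data. It also carries no formal privacy guarantee on its own.

```python
import itertools
import random
from collections import Counter


def fit_cell_probabilities(records, alpha=0.5):
    """Estimate smoothed probabilities for every combination of observed levels.

    `records` is a list of equal-length tuples of categorical values.
    `alpha` is an assumed additive smoothing constant (hypothetical choice).
    """
    n_cols = len(records[0])
    # Observed levels of each variable, in a stable order.
    levels = [sorted({r[j] for r in records}) for j in range(n_cols)]
    # Every cell of the full cross-classification, including unobserved ones.
    cells = list(itertools.product(*levels))
    counts = Counter(records)
    total = len(records) + alpha * len(cells)
    probs = [(counts[c] + alpha) / total for c in cells]
    return cells, probs


def synthesize(records, n_synthetic, alpha=0.5, seed=None):
    """Draw simulated records from the fitted joint distribution."""
    rng = random.Random(seed)
    cells, probs = fit_cell_probabilities(records, alpha)
    return rng.choices(cells, weights=probs, k=n_synthetic)
```

For example, calling `synthesize(records, 1000, seed=0)` on a list of `(gender, grade)` tuples would return 1000 simulated tuples whose cell frequencies roughly track those of the input.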

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 19)
Y Chen, A Machanavajjhala, M Hay, G Miklau "PeGaSus: Data-adaptive differentially private stream processing" ACM CCS 2017 , 2017
A Bolton, J de Figueiredo, D Lewis "Elections, ideology, and turnover in the U.S. federal government (NBER Working Paper #22932)" Journal of Law, Economics, and Organization , 2018
AF Barrientos, A Bolton, T Balmat, JP Reiter, JM de Figueiredo, A Machanavajjhala, Y Chen, C Kneifel, M DeLong "Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government" Annals of Applied Statistics , 2018 , p.1124
AF Barrientos, A Jara, C Wehrhahn "Posterior convergence rate of a class of Dirichlet process mixture model for compositional data" Statistics and Probability Letters , v.120 , 2017 , p.45
AF Barrientos, A Jara, F Quintana "Fully nonparametric regression for bounded data using dependent Bernstein polynomials" Journal of the American Statistical Association , v.112 , 2017 , p.806
C Zhang, JM de Figueiredo "Are recessions good for government hires? The effect of unemployment on public sector human capital" Economics Letters , v.170 , 2018 , p.1
D Zhang, R McKenna, I Kotsogiannis, M Hay, A Machanavajjhala, G Miklau "Ektelo: A framework for defining differentially-private computations" SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data , 2018
G Amitai, JP Reiter "Differentially private posterior summaries for linear regression coefficients" Journal of Privacy and Confidentiality , v.8 , 2018
H Yu, JP Reiter "Differentially private verification of regression predictions from synthetic data" Transactions on Data Privacy , v.11 , 2018 , p.279
I Kotsogiannis, A Machanavajjhala, M Hay, G Miklau "Pythia: Differentially private algorithm selection" ACM SIGMOD 2017 , 2017
JP Reiter "Differential privacy and federal data releases" Annual Review of Statistics and Its Application , 2019

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In this project, we developed methodology and tools for providing researchers, students, and other members of the public access to confidential social science data. The developments focus on three main concepts. The first concept is called synthetic data. A synthetic data file consists entirely of simulated data, generated so that the statistical properties of the simulated data mimic those of the confidential data. This data file can have low risks of unintended disclosures, since the released records do not correspond to actual individuals in the data file. We developed new methods for generating synthetic data for longitudinal data files with many variables. We used these methods to develop a synthetic dataset comprising work histories of employees in the federal government.
The second concept is called a verification server. This is a computer server that holds the synthetic and confidential data files. Users of the synthetic data can query the server for feedback on the quality of results obtained from the synthetic data; for example, is the value of a regression coefficient estimated from the synthetic data similar to the estimate from the confidential data? We developed verification measures that satisfy differential privacy, so that releasing verification measures to users leaks only a limited amount of information about the confidential data. We developed software for implementing these measures, as well as a user interface, that could serve as the front end of verification servers developed in other contexts.
The third concept is called secure remote access. Vetted users can log in to a server at a host institution to access the confidential data. We developed software that allows the host institution to provide multi-factor authenticated access to users outside the host institution, without having to create accounts for those users. We demonstrated how to integrate synthetic data, verification, and remote access in a single system.
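
The verification idea described above, checking whether a regression coefficient from the synthetic data is close to the one from the confidential data without revealing the confidential estimate, can be illustrated with a minimal sketch. It is not the project's software. It assumes one row per person, splits the confidential rows into disjoint random partitions, counts the partitions whose slope falls within a tolerance of the synthetic slope, and adds Laplace noise to that count. Because changing one person's row can alter at most one partition's result, the count has sensitivity 1, so Laplace noise with scale `1/epsilon` gives epsilon-differential privacy for the released number. All function and parameter names here are hypothetical, and Python's `random` module is not a secure noise source for production use.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def slope(xs, ys):
    """Ordinary least squares slope of y on x, or None if x has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def private_agreement_count(rows, synthetic_slope, tolerance,
                            n_parts, epsilon, seed=None):
    """Noisy count of partitions whose slope is within `tolerance` of the
    synthetic-data slope. `rows` is a list of (x, y) pairs, one per person."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)  # data-independent random assignment to partitions
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    agree = 0
    for part in parts:
        b = slope([r[0] for r in part], [r[1] for r in part])
        if b is not None and abs(b - synthetic_slope) <= tolerance:
            agree += 1
    # Sensitivity of the count is 1, so the Laplace scale is 1 / epsilon.
    return agree + laplace_noise(1.0 / epsilon, rng)
```

A noisy count near `n_parts` suggests the synthetic estimate agrees with the confidential data, while a count near zero suggests it does not. Smaller values of `epsilon` give stronger privacy but noisier answers.
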
We used the synthetic data on federal employees to estimate differentials in pay by gender and race in the federal government, and we found differences along both dimensions. We verified these results using the differentially private measures. Finally, we ran the analyses on the confidential data and found that the disparities exist in those data as well. The integrated system performed as intended. The synthetic data allow users to get reasonable inferences about pay disparities. The verification server confirms findings from the analysis of the synthetic data that also exist in the confidential data, and it reveals findings from the synthetic data that do not hold up in the confidential data. All software from the project is available in a free, public repository on GitHub. The project trained one post-doctoral associate, two PhD students who finished their degrees, two master's students who finished their degrees, and two undergraduate students who finished their degrees.


Last Modified: 02/21/2019
Modified by: Jerome P Reiter
