
NSF Org: CCF Division of Computing and Communication Foundations
Initial Amendment Date: January 23, 2015
Latest Amendment Date: July 17, 2018
Award Number: 1453432
Award Instrument: Continuing Grant
Program Manager: Phillip Regalia, pregalia@nsf.gov, (703) 292-2981, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2015
End Date: June 30, 2021 (Estimated)
Total Intended Award Amount: $540,000.00
Total Awarded Amount to Date: $556,200.00
Funds Obligated to Date: FY 2016 = $8,200.00; FY 2017 = $218,356.00; FY 2018 = $113,779.00
Recipient Sponsored Research Office: 3 Rutgers Plz, New Brunswick, NJ 08901-8559, US, (848) 932-0150
Primary Place of Performance: 94 Brett Road, Piscataway, NJ 08854-8058, US
NSF Program(s): Special Projects - CCF; Comm & Information Foundations; Secure & Trustworthy Cyberspace
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Medical technologies such as imaging and sequencing make it possible to gather massive amounts of information at increasingly lower cost. Sharing data from studies can advance scientific understanding and improve healthcare outcomes. Concern about patient privacy, however, can preclude open data sharing, thus hampering progress in understanding stigmatized conditions such as mental health disorders. This research seeks to understand how to analyze and learn from sensitive data held at different sites (such as medical centers) in a way that quantifiably and rigorously protects the privacy of the data.
The framework used in this research is differential privacy, a recently-proposed model for measuring privacy risk in data sharing. Differentially private algorithms provide approximate (noisy) answers to protect sensitive data, involving a tradeoff between privacy and utility. This research studies how to combine private approximations from different sites to improve the overall quality or utility of the result. The main goals of this research are to understand the fundamental limits of private data sharing, to design algorithms for making private approximations and rules for combining them, and to understand the consequences of sites having more complex privacy and sharing restrictions. The methods used to address these problems are a mix of mathematical techniques from statistics, computer science, and electrical engineering.
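For reference, the standard definition of (epsilon, delta)-differential privacy that this framework refers to can be stated as follows (the notation here is added for the reader and does not appear in the award text):

```latex
% A randomized algorithm M is (\epsilon,\delta)-differentially private if, for
% every pair of datasets D and D' differing in one individual's record and
% every measurable set S of possible outputs,
\[
  \Pr\bigl[ M(D) \in S \bigr] \;\le\; e^{\epsilon}\, \Pr\bigl[ M(D') \in S \bigr] + \delta .
\]
% Smaller \epsilon and \delta mean the output distribution can depend only
% weakly on any one person's data, i.e. lower privacy risk.
```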
The educational component of this research will involve designing introductory university courses and material on data science, undergraduate research projects, curricular materials for graduate courses, and outreach to the growing data-hacker community via presentations, tutorial materials, and open-source software.
The primary aim of this research is to bridge the gap between theory and practice by developing algorithmic principles for practical privacy-preserving algorithms. These algorithms will be validated on neuroimaging data used to understand and diagnose mental health disorders. Implementing the results of this research will create a blueprint for building practical privacy-preserving learning systems for research in healthcare and other fields. The tradeoffs between privacy and utility in distributed systems lead naturally to more general questions of cost-benefit tradeoffs for learning problems, and the same algorithmic principles will shed light on information processing and machine learning in general distributed systems where messages may be noisy or corrupted.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Medical technologies such as imaging and sequencing make it possible to gather massive amounts of information at increasingly lower cost. Sharing data from studies can advance scientific understanding and improve healthcare outcomes. Concerns about patient privacy, however, can preclude open data sharing, thus hampering progress in understanding stigmatized conditions such as mental health disorders. Suppose several research groups are studying a brain disorder (e.g., early-onset Alzheimer's disease). They want to collaborate to see whether they can learn common patterns in the MRI images from their research subjects. However, due to privacy concerns, each site wants to protect against the risk that someone could identify one of its research subjects.
This research seeks to understand how to analyze and learn from sensitive data held at different sites in a way that quantifiably and rigorously protects the privacy of the data. The framework used for this work is differential privacy, which measures the privacy risk when publishing information derived from sensitive data. We can guarantee privacy by providing only approximate answers, which means we trade off accuracy (or some other measure of utility) with privacy risk. In this project we looked at several different aspects of applying differential privacy to scenarios similar to the collaborative research example above.
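As a concrete, purely illustrative example of this tradeoff, the sketch below shows one common way a single site could release a differentially private mean using the Laplace mechanism, and one simple rule a coordinator could use to combine the noisy answers from several sites. The function names and parameter choices are hypothetical and are not the project's algorithms.

```python
import numpy as np

def private_mean(x, epsilon, lo=0.0, hi=1.0, rng=None):
    """Release a differentially private mean of values assumed to lie in
    [lo, hi], using the Laplace mechanism; the sensitivity of the mean of
    n bounded values is (hi - lo) / n."""
    rng = rng or np.random.default_rng()
    x = np.clip(x, lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

def combine(private_answers, sizes):
    """Combine per-site private answers, weighting by sample size so that
    larger (and therefore less noisy) sites count more."""
    return np.average(private_answers, weights=np.asarray(sizes, dtype=float))

rng = np.random.default_rng(0)
sites = [rng.uniform(size=n) for n in (200, 500, 1000)]        # toy data at three sites
answers = [private_mean(x, epsilon=0.5, rng=rng) for x in sites]
print("combined private estimate:", combine(answers, [len(x) for x in sites]))
print("non-private pooled mean:  ", np.concatenate(sites).mean())
```

Weighting by sample size reflects the fact that the noise added at a site shrinks with the number of records it holds, so answers from larger sites are more reliable.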
We first looked at the negative impact of differentially private approximations on utility. This impact is mitigated by more data: a larger sample size can enable better utility at lower privacy risk. Turning the question around, we asked: for a desired level of privacy risk and utility, how much data do we need to meet both criteria? Comparing this to the amount of data needed when privacy is not a concern lets us quantify the cost of privacy in terms of the additional data required. We studied this cost in several simple but illustrative examples under the local model of differential privacy, which allowed us to understand whether modeling assumptions can help improve the privacy-utility tradeoff.
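A toy illustration of the local model (not one of the estimators studied in the project) is randomized response for a single binary attribute: each individual perturbs their own bit before sharing it, and the analyst debiases the aggregate. The variance of the debiased estimate grows as the privacy parameter shrinks, which is one way the "cost of privacy in additional data" shows up.

```python
import numpy as np

def randomized_response(bit, epsilon, rng):
    """Local differential privacy for one bit: report the true value with
    probability e^eps / (e^eps + 1), otherwise report the flipped value."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_true else 1 - bit

def estimate_proportion(reports, epsilon):
    """Debias the noisy reports to get an unbiased estimate of the true
    proportion of ones."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1 - p_true)) / (2 * p_true - 1)

rng = np.random.default_rng(1)
truth = (rng.random(50_000) < 0.3).astype(int)   # 30% of individuals have the attribute
print("non-private estimate:", truth.mean())
for eps in (0.1, 1.0, 5.0):                      # smaller eps = stronger privacy, noisier estimate
    reports = [randomized_response(b, eps, rng) for b in truth]
    print(f"eps={eps}: private estimate = {estimate_proportion(reports, eps):.3f}")
```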
A second branch of this work involved developing algorithms for decentralized learning and estimation under differential privacy. Here we focused on algorithms that estimate or factorize matrices characterizing the statistical structure of the underlying data. Designing these methods raises several challenges: the potentially large number of messages that need to be exchanged, the total privacy risk, and the quality of the result. Communication cost thus adds a third dimension to the privacy-utility tradeoff. For collaborative research projects, there are two natural points of comparison for evaluating these tradeoffs: one in which a site simply uses its own data to perform the estimation (no collaboration, or a local solution) and one in which the data from all sites are collected in one place (a global solution). These baselines let us evaluate the cost of privacy, the cost of decentralization, and the benefits of collaboration.
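A minimal sketch of this kind of comparison, under the simplifying assumptions that each site's data rows have Euclidean norm at most one and that each site perturbs its local second-moment matrix with symmetric Gaussian noise before a coordinator averages the results, is shown below. It is an illustration of the general setup, not the project's algorithms; all names and parameter choices are ours.

```python
import numpy as np

def private_second_moment(X, epsilon, delta, rng):
    """Release a site's local second-moment matrix X^T X / n with symmetric
    Gaussian noise; rows are assumed to have norm at most 1, so the Frobenius
    sensitivity of the matrix under replacing one row is at most 2/n."""
    n, d = X.shape
    A = X.T @ X / n
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * (2.0 / n) / epsilon  # Gaussian mechanism scale
    E = rng.normal(scale=sigma, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T          # symmetric noise matrix
    return A + E

rng = np.random.default_rng(2)
d = 5
raw = [rng.normal(size=(n, d)) for n in (300, 400, 500)]
# Scale each row down to norm <= 1 so the sensitivity bound above holds.
sites = [X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0) for X in raw]

noisy = [private_second_moment(X, epsilon=1.0, delta=1e-5, rng=rng) for X in sites]
weights = np.array([len(X) for X in sites], dtype=float)
weights /= weights.sum()
decentralized = sum(w * A for w, A in zip(weights, noisy))       # combined private estimate

pooled = np.vstack(sites)
global_ref = pooled.T @ pooled / len(pooled)                     # non-private global reference
local_ref = sites[0].T @ sites[0] / len(sites[0])                # one site's non-private local answer
print("decentralized private vs global:", np.linalg.norm(decentralized - global_ref))
print("single-site local vs global:    ", np.linalg.norm(local_ref - global_ref))
```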
The last part of this research involved putting the principles and ideas developed in the first two parts into practice. We worked with researchers in neuroimaging to implement and evaluate several of our methods, designing decentralized algorithms for key tasks in neuroimaging data processing pipelines. Through this process we also examined how some additional resources could enhance the privacy-utility tradeoff. In particular, we saw that if the sites could generate correlated random numbers, independent of the data, then we could achieve much better tradeoffs, sometimes as good as those for the global solution. We demonstrated this in a neuroimaging application to show that reasonable privacy-utility tradeoffs are possible when the consortium collectively has enough data.
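The effect of correlated randomness can be illustrated with a toy sketch (again hypothetical, not the actual protocol used in the project): if the sites' noise terms are generated to sum to zero, each site's individual release is still perturbed, but the coordinator's aggregate is nearly noise-free and approaches the pooled (global) solution. A real protocol would also add a small amount of independent noise at each site to protect against the coordinator itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_per_site, sigma = 10, 1_000, 1.0
data = [rng.normal(loc=0.5, size=n_per_site) for _ in range(n_sites)]

# Independent noise: each site adds its own noise and the coordinator averages;
# the aggregate noise variance only shrinks like 1/(number of sites).
indep = np.mean([x.mean() + rng.normal(scale=sigma) for x in data])

# Correlated noise: the sites jointly generate noise terms (independent of the
# data) that sum to zero, so each per-site release is still perturbed but the
# coordinator's average sees almost no net noise.
corr_noise = rng.normal(scale=sigma, size=n_sites)
corr_noise -= corr_noise.mean()                     # enforce the zero-sum property
correlated = np.mean([x.mean() + e for x, e in zip(data, corr_noise)])

pooled = np.mean(np.concatenate(data))              # the "global" reference answer
print("independent-noise error:", abs(indep - pooled))
print("correlated-noise error: ", abs(correlated - pooled))
```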
Intellectual Merit: This project helped bridge the gap between the theory and practice of distributed privacy-preserving information processing. Overall, it has led to a number of insights into how privacy works in networked systems. In the intervening years, some of the models for decentralized estimation have been rebranded as "federated learning." While we take medical research as a canonical example, this project encompasses many other application domains in which data holders wish to collaborate but have an obligation to protect the privacy of their data. We hope that some of the results developed in this project can inform the design of future privacy-preserving federated learning algorithms that can be applied to scientific collaborations.
Broader Impact: This work has directly informed the design of a collaborative research system for neuroimaging research. Evaluating algorithms within this system will lead to new insights into the privacy-utility tradeoffs that are achievable for real data. The project supported several undergraduate students through diversity-focused REU programs, many of whom have gone on to successful graduate careers. Finally, this project produced tutorials for signal processing and machine learning audiences as well as a short course on differential privacy for statisticians, helping to bring differential privacy into statistical practice.
Last Modified: 11/18/2021
Modified by: Anand Sarwate