
NSF Org: CCF Division of Computing and Communication Foundations
Initial Amendment Date: January 23, 2015
Latest Amendment Date: July 17, 2018
Award Number: 1453432
Award Instrument: Continuing Grant
Program Manager: Phillip Regalia, pregalia@nsf.gov, (703) 292-2981, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2015
End Date: June 30, 2021 (Estimated)
Total Intended Award Amount: $540,000.00
Total Awarded Amount to Date: $556,200.00
Funds Obligated to Date: FY 2016 = $8,200.00; FY 2017 = $218,356.00; FY 2018 = $113,779.00
Recipient Sponsored Research Office: 3 Rutgers Plz, New Brunswick, NJ 08901-8559, US, (848) 932-0150
Primary Place of Performance: 94 Brett Road, Piscataway, NJ 08854-8058, US
NSF Program(s): Special Projects - CCF; Comm & Information Foundations; Secure & Trustworthy Cyberspace
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Medical technologies such as imaging and sequencing make it possible to gather massive amounts of information at increasingly lower cost. Sharing data from studies can advance scientific understanding and improve healthcare outcomes. Concern about patient privacy, however, can preclude open data sharing, thus hampering progress in understanding stigmatized conditions such as mental health disorders. This research seeks to understand how to analyze and learn from sensitive data held at different sites (such as medical centers) in a way that quantifiably and rigorously protects the privacy of the data.
The framework used in this research is differential privacy, a recently-proposed model for measuring privacy risk in data sharing. Differentially private algorithms provide approximate (noisy) answers to protect sensitive data, involving a tradeoff between privacy and utility. This research studies how to combine private approximations from different sites to improve the overall quality or utility of the result. The main goals of this research are to understand the fundamental limits of private data sharing, to design algorithms for making private approximations and rules for combining them, and to understand the consequences of sites having more complex privacy and sharing restrictions. The methods used to address these problems are a mix of mathematical techniques from statistics, computer science, and electrical engineering.
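For reference, the standard definition of (epsilon, delta)-differential privacy that this framework refers to can be stated as follows (the notation here is added for the reader and does not appear in the award text):

```latex
% A randomized algorithm M is (\epsilon,\delta)-differentially private if, for
% every pair of datasets D and D' differing in one individual's record and
% every measurable set S of possible outputs,
\[
  \Pr\bigl[ M(D) \in S \bigr] \;\le\; e^{\epsilon}\, \Pr\bigl[ M(D') \in S \bigr] + \delta .
\]
% Smaller \epsilon and \delta mean the output distribution can depend only
% weakly on any one person's data, i.e. lower privacy risk.
```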
The educational component of this research will involve designing introductory university courses and material on data science, undergraduate research projects, curricular materials for graduate courses, and outreach to the growing data-hacker community via presentations, tutorial materials, and open-source software.
The primary aim of this research is to bridge the gap between theory and practice by developing algorithmic principles for practical privacy-preserving algorithms. These algorithms will be validated on neuroimaging data used to understand and diagnose mental health disorders. Implementing the results of this research will create a blueprint for building practical privacy-preserving learning systems for research in healthcare and other fields. The tradeoffs between privacy and utility in distributed systems lead naturally to more general questions of cost-benefit tradeoffs for learning problems, and the same algorithmic principles will shed light on information processing and machine learning in general distributed systems where messages may be noisy or corrupted.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Medical technologies such as imaging and sequencing make it possible to gather massive amounts of information at increasingly lower cost. Sharing data from studies can advance scientific understanding and improve healthcare outcomes. Concerns about patient privacy, however, can preclude open data sharing, thus hampering progress in understanding stigmatized conditions such as mental health disorders. Suppose several research groups are studying a brain disorder (e.g., early-onset Alzheimer's disease). They want to collaborate to see whether they can learn common patterns in the MRI images from their research subjects. However, due to privacy concerns, each site wants to protect against the risk that someone could identify one of its research subjects.
This research seeks to understand how to analyze and learn from sensitive data held at different sites in a way that quantifiably and rigorously protects the privacy of the data. The framework used for this work is differential privacy, which measures the privacy risk when publishing information derived from sensitive data. We can guarantee privacy by providing only approximate answers, which means we trade off accuracy (or some other measure of utility) with privacy risk. In this project we looked at several different aspects of applying differential privacy to scenarios similar to the collaborative research example above.
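As a concrete, purely illustrative example of this tradeoff, the sketch below shows one common way a single site could release a differentially private mean using the Laplace mechanism, and one simple rule a coordinator could use to combine the noisy answers from several sites. The function names and parameter choices are hypothetical and are not the project's algorithms.

```python
import numpy as np

def private_mean(x, epsilon, lo=0.0, hi=1.0, rng=None):
    """Release a differentially private mean of values assumed to lie in
    [lo, hi], using the Laplace mechanism; the sensitivity of the mean of
    n bounded values is (hi - lo) / n."""
    rng = rng or np.random.default_rng()
    x = np.clip(x, lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

def combine(private_answers, sizes):
    """Combine per-site private answers, weighting by sample size so that
    larger (and therefore less noisy) sites count more."""
    return np.average(private_answers, weights=np.asarray(sizes, dtype=float))

rng = np.random.default_rng(0)
sites = [rng.uniform(size=n) for n in (200, 500, 1000)]        # toy data at three sites
answers = [private_mean(x, epsilon=0.5, rng=rng) for x in sites]
print("combined private estimate:", combine(answers, [len(x) for x in sites]))
print("non-private pooled mean:  ", np.concatenate(sites).mean())
```

Weighting by sample size reflects the fact that the noise added at a site shrinks with the number of records it holds, so answers from larger sites are more reliable.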
We first looked at the negative impact of differentially private approximations on utility. This impact is mitigated by more data: a larger sample size can enable better utility at lower privacy risk. Turning the question around, we asked: for a desired level of privacy risk and utility, how much data do we need to meet both criteria? Comparing this to the amount of data needed when privacy is not a concern lets us quantify the cost of privacy in terms of the additional data required. We studied this cost in several simple but illustrative examples under the local model of differential privacy, which allowed us to understand whether modeling assumptions can help improve the privacy-utility tradeoff.
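A toy illustration of the local model (not one of the estimators studied in the project) is randomized response for a single binary attribute: each individual perturbs their own bit before sharing it, and the analyst debiases the aggregate. The variance of the debiased estimate grows as the privacy parameter shrinks, which is one way the "cost of privacy in additional data" shows up.

```python
import numpy as np

def randomized_response(bit, epsilon, rng):
    """Local differential privacy for one bit: report the true value with
    probability e^eps / (e^eps + 1), otherwise report the flipped value."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_true else 1 - bit

def estimate_proportion(reports, epsilon):
    """Debias the noisy reports to get an unbiased estimate of the true
    proportion of ones."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1 - p_true)) / (2 * p_true - 1)

rng = np.random.default_rng(1)
truth = (rng.random(50_000) < 0.3).astype(int)   # 30% of individuals have the attribute
print("non-private estimate:", truth.mean())
for eps in (0.1, 1.0, 5.0):                      # smaller eps = stronger privacy, noisier estimate
    reports = [randomized_response(b, eps, rng) for b in truth]
    print(f"eps={eps}: private estimate = {estimate_proportion(reports, eps):.3f}")
```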
A second branch of this work involved developing algorithms for decentralized learning and estimation under differential privacy. Here we focused on algorithms that estimate or factorize matrices characterizing the statistical structure of the underlying data. Designing these methods raises several challenges: the potentially large number of messages that need to be exchanged, the total privacy risk, and the quality of the result. Communication cost thus adds a third dimension to the privacy-utility tradeoff. For collaborative research projects, there are two natural points of comparison for evaluating these tradeoffs: one in which a site simply uses its own data to perform the estimation (no collaboration, or a local solution) and one in which the data from all sites are collected in one place (a global solution). These baselines let us evaluate the cost of privacy, the cost of decentralization, and the benefits of collaboration.
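A minimal sketch of this kind of comparison, under the simplifying assumptions that each site's data rows have Euclidean norm at most one and that each site perturbs its local second-moment matrix with symmetric Gaussian noise before a coordinator averages the results, is shown below. It is an illustration of the general setup, not the project's algorithms; all names and parameter choices are ours.

```python
import numpy as np

def private_second_moment(X, epsilon, delta, rng):
    """Release a site's local second-moment matrix X^T X / n with symmetric
    Gaussian noise; rows are assumed to have norm at most 1, so the Frobenius
    sensitivity of the matrix under replacing one row is at most 2/n."""
    n, d = X.shape
    A = X.T @ X / n
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * (2.0 / n) / epsilon  # Gaussian mechanism scale
    E = rng.normal(scale=sigma, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T          # symmetric noise matrix
    return A + E

rng = np.random.default_rng(2)
d = 5
raw = [rng.normal(size=(n, d)) for n in (300, 400, 500)]
# Scale each row down to norm <= 1 so the sensitivity bound above holds.
sites = [X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0) for X in raw]

noisy = [private_second_moment(X, epsilon=1.0, delta=1e-5, rng=rng) for X in sites]
weights = np.array([len(X) for X in sites], dtype=float)
weights /= weights.sum()
decentralized = sum(w * A for w, A in zip(weights, noisy))       # combined private estimate

pooled = np.vstack(sites)
global_ref = pooled.T @ pooled / len(pooled)                     # non-private global reference
local_ref = sites[0].T @ sites[0] / len(sites[0])                # one site's non-private local answer
print("decentralized private vs global:", np.linalg.norm(decentralized - global_ref))
print("single-site local vs global:    ", np.linalg.norm(local_ref - global_ref))
```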
The last part of this research involved putting the principles and ideas developed in the first two parts into practice. We worked with researchers in neuroimaging to implement and evaluate several of our methods, designing decentralized algorithms for key tasks in neuroimaging data processing pipelines. Through this process we also examined how some additional resources could enhance the privacy-utility tradeoff. In particular, we saw that if the sites could generate correlated random numbers, independent of the data, then we could achieve much better tradeoffs, sometimes as good as those for the global solution. We demonstrated this in a neuroimaging application to show that reasonable privacy-utility tradeoffs are possible when the consortium collectively has enough data.
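The effect of correlated randomness can be illustrated with a toy sketch (again hypothetical, not the actual protocol used in the project): if the sites' noise terms are generated to sum to zero, each site's individual release is still perturbed, but the coordinator's aggregate is nearly noise-free and approaches the pooled (global) solution. A real protocol would also add a small amount of independent noise at each site to protect against the coordinator itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_per_site, sigma = 10, 1_000, 1.0
data = [rng.normal(loc=0.5, size=n_per_site) for _ in range(n_sites)]

# Independent noise: each site adds its own noise and the coordinator averages;
# the aggregate noise variance only shrinks like 1/(number of sites).
indep = np.mean([x.mean() + rng.normal(scale=sigma) for x in data])

# Correlated noise: the sites jointly generate noise terms (independent of the
# data) that sum to zero, so each per-site release is still perturbed but the
# coordinator's average sees almost no net noise.
corr_noise = rng.normal(scale=sigma, size=n_sites)
corr_noise -= corr_noise.mean()                     # enforce the zero-sum property
correlated = np.mean([x.mean() + e for x, e in zip(data, corr_noise)])

pooled = np.mean(np.concatenate(data))              # the "global" reference answer
print("independent-noise error:", abs(indep - pooled))
print("correlated-noise error: ", abs(correlated - pooled))
```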
Intellectual Merit: This project helped bridge the gap between the theory and practice of distributed privacy-preserving information processing. Overall, it has led to a number of insights into how privacy works in networked systems. In the intervening years, some of the models for decentralized estimation have been rebranded as "federated learning." While we take medical research as a canonical example, this project encompasses many other application domains in which data holders wish to collaborate but have an obligation to protect the privacy of their data. We hope that some of the results developed in this project can inform the design of future privacy-preserving federated learning algorithms that can be applied to scientific collaborations.
Broader Impact: This work has directly informed the design of a collaborative research system for neuroimaging research. Evaluating algorithms within this system will lead to new insights into the privacy-utility tradeoffs that are achievable for real data. The project supported several undergraduate students through diversity-focused REU programs, many of whom have gone on to successful graduate careers. Finally, this project produced tutorials for signal processing and machine learning audiences as well as a short course on differential privacy for statisticians, helping to bring differential privacy into statistical practice.
Last Modified: 11/18/2021
Modified by: Anand Sarwate