
NSF Org: | CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | August 26, 2014 |
Latest Amendment Date: | August 26, 2014 |
Award Number: | 1408874 |
Award Instrument: | Standard Grant |
Program Manager: | Shannon Beck, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2014 |
End Date: | September 30, 2019 (Estimated) |
Total Intended Award Amount: | $360,000.00 |
Total Awarded Amount to Date: | $360,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
107 S INDIANA AVE BLOOMINGTON IN US 47405-7000 (317)278-3473 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
919 E. 10th Street BLOOMINGTON IN US 47408-3912 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Secure & Trustworthy Cyberspace |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Research on distributed data sets is hampered by stakeholder requirements that limit sharing. Researchers need early-stage access to determine whether a data set is likely to contain the data they need. The Broker Leads project is developing privacy-enhancing technologies adapted to this discovery phase of data-driven research. Its approach is inspired by health information exchanges built on a broker system, in which data are held by healthcare providers and collected through distributed queries managed by the broker. Such systems have the potential to support public health and biomedical research. The project targets "similar patient queries", where the query is a patient medical record and the response is information about similar patients. Such queries have value for many applications, including developing cohorts and finding institutions for further discussions about joint research.
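A similar patient query of the kind described above can be sketched as follows. This is a minimal illustration, not the project's protocol: the function names (`jaccard`, `broker_rank_sites`), the representation of a record as a set of diagnosis codes, and the similarity threshold are all assumptions made for the example. It also compares raw records directly, which a real broker system would avoid by applying the privacy protections the project studies.

```python
from typing import Dict, List, Set, Tuple

Record = Set[str]  # hypothetical: a patient record as a set of diagnosis codes


def jaccard(a: Record, b: Record) -> float:
    """Jaccard similarity of two code sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def broker_rank_sites(
    query: Record, sites: Dict[str, List[Record]], threshold: float
) -> List[Tuple[str, int]]:
    """Count, per site, the records at least `threshold`-similar to the query.

    Sites are returned with the largest count first; ties are broken by name.
    """
    counts = [
        (name, sum(1 for rec in records if jaccard(query, rec) >= threshold))
        for name, records in sites.items()
    ]
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

The ranked list tells the researcher which institutions are worth approaching first, which is the discovery step the abstract describes.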
Broker Leads uses the concept of a "lead", in which data holders provide representative collections of non-identifiable real or synthetic data that meet strong privacy guarantees, e.g., differential privacy. Even though such data may be unsuitable for clinical decision making and scientific discovery because of the transformations applied for privacy protection, they guide a user of a broker lead system to the data sets most likely to be useful for addressing a given similar patient query. These data sets can then be used with other privacy-protecting strategies, such as secure multiparty computation or restrictive data use agreements that ensure adequate data protection. In addition to providing practical and well-analyzed strategies for the early stages of research on healthcare data, this project will provide new insights into practical issues with privacy technology in end-to-end applications.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project contributed models and techniques to protect the privacy of shared data. Many of the case studies were based on leads generated by brokers from biomedical data, but the results are applicable to all types of data and covered a wide range of techniques.
A key area of investigation was the ability to support early stages of research, which are often characterized by a need for exploration in which researchers do not yet know the details of the hypotheses they will find most interesting. The project developed techniques for measuring the privacy protection offered by synthetic data, which can be studied flexibly while still carrying a mathematical assurance of privacy for the subjects on whom the synthetic data was based. One new technique developed in the project used “seed-based” synthetic generation, which creates synthetic data from a mixture of the traits of subjects. Another new technique concerned how to measure membership privacy based on established privacy models and machine learning testing strategies.
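The idea of building a synthetic record from a mixture of subjects' traits can be illustrated with a toy sketch. It is not the project's generator: the function name `seed_based_synthetic`, the `keep_prob` parameter, and the attribute-by-attribute mixing rule are assumptions chosen for brevity, and the sketch carries no privacy guarantee on its own. A real system would add a privacy test or a formally analyzed release mechanism on top of the generation step.

```python
import random
from typing import Dict, List

Record = Dict[str, str]  # hypothetical: attribute name -> categorical value


def seed_based_synthetic(
    seed: Record, population: List[Record], keep_prob: float, rng: random.Random
) -> Record:
    """Build one synthetic record from a seed record.

    Each attribute keeps the seed's value with probability `keep_prob`;
    otherwise it takes the value of a randomly chosen donor record, so the
    output mixes traits of several subjects.
    """
    synthetic: Record = {}
    for attr, value in seed.items():
        if rng.random() < keep_prob:
            synthetic[attr] = value
        else:
            donor = rng.choice(population)
            synthetic[attr] = donor[attr]
    return synthetic
```

Lowering `keep_prob` moves the synthetic record further from its seed, trading fidelity for less resemblance to any single subject.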
Machine learning is a fundamental component of modern data analytics on biomedical data. The project carried out the first investigation of distributed, collaborative learning from the data privacy perspective. This multi-year study showed that (a) modern machine learning models may reveal the sensitive data used to train them, and (b) this leakage is exacerbated in collaborative learning scenarios. The project also demonstrated that these potential privacy violations are rooted in how today’s machine learning frameworks and pipelines operate on data, and proposed new methods for mitigating threats to individual privacy. These results open the road to secure, privacy-preserving, distributed machine learning.
The project also studied various techniques to support broker-lead-based, privacy-preserving data sharing and applied such protection to various kinds of biomedical data. More specifically, the project developed techniques for secure similar patient queries, using approximation to simplify otherwise complicated protection tasks. The project demonstrated weaknesses in beacon-based sharing and built more effective protection from leads by adding noise to achieve differential privacy in responses to queries from the data user. This more effective protection was shown to work on different kinds of biomedical data, not only human genomes but also DNA methylation data. It was also shown that the side effects introduced by the noise could be addressed using trusted execution environments, which offer an efficient and secure channel for evaluating the utility of data before sharing. The project demonstrated the great potential of lead-based secure data sharing by leveraging different confidential computing technologies. These results will continue to foster the community of biomedical data protection through the high-impact iDASH genome privacy competition.
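Adding noise to a query response to achieve differential privacy can be sketched with the standard Laplace mechanism applied to a counting query. This is a generic textbook illustration, not the project's implementation: the function names and the record layout are assumptions, and the floating-point sampling used here is not hardened against known precision attacks on naive Laplace implementations.

```python
import random
from typing import Callable, Dict, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(
    records: Iterable[Dict[str, bool]],
    predicate: Callable[[Dict[str, bool]], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Answer "how many records match?" with epsilon-differential privacy.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so Laplace noise of scale 1/epsilon suffices.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical beacon-style question: how many records carry a variant?
    rng = random.Random()
    data = [{"variant": True}] * 30 + [{"variant": False}] * 70
    print(noisy_count(data, lambda r: r["variant"], epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives stronger privacy but noisier answers, which is the loss of utility that the report says trusted execution environments can help evaluate before data are shared.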
Last Modified: 12/30/2019
Modified by: Xiaofeng Wang