Award Abstract # 1636766
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: August 26, 2016
Latest Amendment Date: January 4, 2021
Award Number: 1636766
Award Instrument: Standard Grant
Program Manager: Martin Halbert
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2016
End Date: December 31, 2021 (Estimated)
Total Intended Award Amount: $444,000.00
Total Awarded Amount to Date: $816,440.00
Funds Obligated to Date: FY 2016 = $444,000.00
FY 2019 = $372,440.00
History of Investigator:
  • Samuel Madden (Principal Investigator)
    madden@csail.mit.edu
  • Daniel Weitzner (Co-Principal Investigator)
Recipient Sponsored Research Office: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
(617)253-1000
Sponsor Congressional District: 07
Primary Place of Performance: Massachusetts Institute of Technology
77 Massachusetts Ave.
Cambridge
MA  US  02139-4307
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NYLCDML6V1
Parent UEI: E2NYLCDML6V1
NSF Program(s): BD Spokes -Big Data Regional I
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
01001920RB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 028Z, 7433, 8083
Program Element Code(s): 024Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Sharing of data sets can provide tremendous mutual benefits for industry, researchers and nonprofit organizations. For example, companies can profit when university researchers explore their data sets and make discoveries that help those companies improve their business. At the same time, researchers are always searching for real-world data sets to show that their newly developed techniques work in practice. Unfortunately, many attempts to share relevant data sets between different stakeholders in industry and academia fail or require a large investment to make data sharing possible. A major obstacle is that data often comes with prohibitive restrictions on how it can be used (e.g., requiring the enforcement of legal terms or other policies, handling data privacy issues, etc.). To enforce these requirements today, lawyers are usually involved in negotiating the terms of each contract. It is not unusual for this process of creating an individual contract for data sharing to end in protracted negotiations, as both sides struggle with the implications and possibilities of modern security, privacy, and data sharing techniques. Worse, fears of missing a loophole in how the data might be (mis)used often prevent data sharing efforts from even getting started. To address these challenges, our new data sharing spoke will enable data providers to easily share data while enforcing constraints on the use of the data. This effort has two key components: (1) creating a licensing model for data that facilitates sharing data that is not necessarily open or free between different organizations and (2) developing a prototype data sharing software platform, ShareDB, which enforces the terms and restrictions of the developed licenses. We believe these efforts will have a transformative impact on how data sharing takes place.
By moving data out of the silos of individuals and single organizations and into the hands of broader society, we can tackle many societally significant problems.

This new data sharing spoke will enable data providers to easily share data while enforcing constraints on the use of the data. Many services and platforms that provide access to data sets already exist today. However, these platforms generally promote completely open access and do not address the aforementioned issues that arise when dealing with proprietary data. Thus, the effort has three key components: (1) creating a licensing model for data that facilitates sharing data that is not necessarily open or free between different organizations, (2) developing a prototype data sharing software platform, ShareDB, which enforces the terms and restrictions of the developed licenses, and (3) developing and integrating relevant metadata that will accompany the datasets shared under the different licenses, making them easily searchable and interpretable. To ensure that the developed tools and licenses are useful, the project will form the Northeast Data Sharing Group, comprising many different stakeholders, to make the licensing model widely accepted and usable in many application domains (e.g., health and finance). The intellectual merit of this proposal lies in designing a licensing model and a data sharing platform that are widely accepted and usable as a template in many different domains. While other efforts to enable data sharing exist (e.g., Creative Commons), they focus on the case where the data owner is willing to openly share the data on the Internet. This licensing model and ecosystem differ in that they allow data owners to enforce certain requirements stated in a data sharing agreement (e.g., on who is allowed to access the data) and also provide tools to make sharing of sensitive information safe.
The licenses and software we propose to investigate will make it easier for organizations to open up their data to the appropriate organizations, while maintaining the ability to ensure that the data is protected, that access is revocable, and that access controls and audit logs are maintained.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cao, Lei and Xiao, Dongqing and Yan, Yizhou and Madden, Samuel and Li, Guoliang. "ATLANTIC: Making Database Differentially Private and Faster with Accuracy Guarantee." Proceedings of the International Conference on Very Large Data Bases, v.14, 2021. https://doi.org/10.14778/3476311.3476337
Fernandez, Raul Castro and Madden, Samuel. "Termite: a system for tunneling through heterogeneous data." Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, 2019. https://doi.org/10.1145/3329859.3329877
Hulsebos, Madelon and Hu, Kevin and Bakker, Michiel and Zgraggen, Emanuel and Satyanarayan, Arvind and Kraska, Tim and Demiralp, Çagatay and Hidalgo, César. "Sherlock: A Deep Learning Approach to Semantic Data Type Detection." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019. https://doi.org/10.1145/3292500.3330993
Rezig, El Kindi and Cao, Lei and Simonini, Giovanni and Schoemans, Maxime and Madden, Samuel and Tang, Nan and Ouzzani, Mourad and Stonebraker, Michael. "Dagger: A Data (not code) Debugger." CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings, 2020.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The Northeast Big Data SPOKE project “A Licensing Model and Ecosystem for Data Sharing” was a collaborative project involving researchers in Computer Science at the Massachusetts Institute of Technology (MIT) and the Metadata Research Center in the College of Computing and Informatics at Drexel University. The overall aim was to develop a data sharing system and an approach that addresses legal matters, policies, and privacy concerns, as well as a number of technical challenges that too frequently hold up the process of collaborating through data. Specific results included:

1) Creating a licensing model for data that facilitates sharing data that is not necessarily open or free between different organizations.

We collected a large number of data sharing agreements and conducted a survey of the types of licenses that are used in them. This allowed us to create a metadata taxonomy to classify and simplify sharing agreements.

2) Developing a prototype data sharing software platform, ShareDB, that enforces agreement terms and restrictions for the licenses developed, and that includes features for building processing pipelines over those shared data sets and finding errors and anomalies in that data.

Our prototype sharing system ShareDB included several different anonymization features, including differential privacy. 

3) Developing and integrating relevant metadata that accompany the datasets shared under the different licenses, making them easily searchable and interpretable.
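
As a rough illustration of the differential privacy feature mentioned under result 2, the sketch below applies the standard Laplace mechanism to a counting query. It is not taken from ShareDB: the function names, the example data, and the choice of the Laplace mechanism are all assumptions made for illustration, and a production system would also need to track the privacy budget across queries.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1 / epsilon gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example data (not from the project): ages in a shared data set.
ages = [23, 35, 41, 52, 67, 29, 74, 38]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(f"noisy count of records with age >= 40: {noisy:.2f}")
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger values return answers closer to the true count.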

Last Modified: 05/02/2022
Modified by: Samuel Madden

Please report errors in award information by writing to: awardsearch@nsf.gov.
