Award Abstract # 1761812
Spokes: MEDIUM: NORTHEAST: Collaborative Research: Data Science Foundry: A Collaborative Platform for Computational Social Science

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: July 31, 2018
Latest Amendment Date: July 31, 2018
Award Number: 1761812
Award Instrument: Standard Grant
Program Manager: Cheryl Eavey
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: August 31, 2021 (Estimated)
Total Intended Award Amount: $500,000.00
Total Awarded Amount to Date: $500,000.00
Funds Obligated to Date: FY 2018 = $500,000.00
History of Investigator:
  • Devavrat Shah (Principal Investigator)
    devavrat@mit.edu
  • Munther Dahleh (Co-Principal Investigator)
  • Alberto Abadie (Co-Principal Investigator)
  • Kalyan Veeramachaneni (Co-Principal Investigator)
Recipient Sponsored Research Office: Massachusetts Institute of Technology
77 MASSACHUSETTS AVE
CAMBRIDGE
MA  US  02139-4301
(617)253-1000
Sponsor Congressional District: 07
Primary Place of Performance: Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge
MA  US  02139-4301
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NYLCDML6V1
Parent UEI: E2NYLCDML6V1
NSF Program(s): BD Spokes -Big Data Regional I
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 028Z, 8083, 9102
Program Element Code(s): 024Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This research project will develop a collaborative data science platform for computational social science called the Data Science Foundry. The collection and management of large-scale data is currently a relatively unstructured process, with data-processing decisions made in an ad hoc fashion, even as society has started to rely on data-driven science to address policy-related questions. A collaborative platform that provides structure will allow social scientists to collaborate and validate each other's studies. This project has the potential to transform how studies are designed and how data are processed. The collaborative curation of study design, procedures, and validation will result in a higher level of trust in the studies conducted. The platform also will increase the number of studies that can be done in a short span of time. The platform will be developed as open-source software, thereby facilitating interactions with the community and enabling different institutions to install it.

This project will develop a collaborative platform that social scientists can use to collaborate and validate each other's studies. The investigative team will attempt to identify the best possible collaborative model for data-driven social science, determine how automation can most enhance the studies, and develop explicit and implicit mechanisms to establish trust in end-to-end data processing pipelines and the results they generate. To aid in the platform's development, the research team will focus on the prediction of outcomes from surveys, a specific yet widely applicable type of problem within computational social science. This class of problems involves much subjective assessment during the feature engineering stage as well as copious interpretation during the data transformation stage. These challenges will benefit both from a collaborative workflow and from mechanisms that enable trust in the eventual results. The project will bring together three distinct teams to develop this platform: computer scientists to develop abstractions, APIs, and systems; statisticians to help with methods and study design; and social scientists to help define the problems and workflow and to provide user feedback.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In this research project we developed a set of collaborative data science systems for computational social science. The management and processing of large-scale data to derive insights is currently a relatively unstructured process, with data-processing decisions made in an ad hoc fashion, even as society has started to rely on data-driven science to address policy-related questions. Collaborative data science systems provide structure and allow social scientists, data scientists, and domain experts to collaborate and validate each other's work. This project has the potential to transform how studies are designed and how insights are derived from data collaboratively. Our team included researchers from MIT, Princeton, and Columbia. We developed and tested three software systems for collaborative data science: Ballet, Sibyl, and Cardea. Ballet supports collaboration around feature engineering, a foundational step in data science. Sibyl enables domain experts to interact with machine learning output and to make decisions collaboratively by understanding that output. Cardea provides a software system to develop and test a wide variety of healthcare models in a low-code setting. The challenge the team tried to solve was to maximize automation of repetitive tasks while enhancing collaboration around tasks where it is known to lead to better results.

Our collaborative systems will result in a higher level of trust in the studies conducted, through the collaborative curation of study design, procedures, and validation. We piloted these systems on several problems of social significance, including child welfare screening, the Fragile Families Challenge, and predicting readmission risk in health care settings. All our systems are open-source and documented, and are already in use by data scientists and domain experts.

Last Modified: 03/22/2022
Modified by: Kalyan Veeramachaneni

Please report errors in award information by writing to: awardsearch@nsf.gov.
