Award Abstract # 1760052
Spokes: MEDIUM: NORTHEAST: Collaborative Research: Data Science Foundry: A Collaborative Platform for Computational Social Science

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: THE TRUSTEES OF PRINCETON UNIVERSITY
Initial Amendment Date: July 31, 2018
Latest Amendment Date: July 31, 2018
Award Number: 1760052
Award Instrument: Standard Grant
Program Manager: Cheryl Eavey
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: August 31, 2021 (Estimated)
Total Intended Award Amount: $250,000.00
Total Awarded Amount to Date: $250,000.00
Funds Obligated to Date: FY 2018 = $250,000.00
History of Investigator:
  • Matthew Salganik (Principal Investigator)
Recipient Sponsored Research Office: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
(609)258-3090
Sponsor Congressional District: 12
Primary Place of Performance: Princeton University
Wallace Hall
Princeton
NJ  US  08544-1005
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): NJ1YPQXQG7U5
Parent UEI:
NSF Program(s): BD Spokes -Big Data Regional I
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 028Z, 8083, 9102
Program Element Code(s): 024Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This research project will develop the Data Science Foundry, a collaborative data science platform for computational social science. The collection and management of large-scale data is currently a relatively unstructured process, with data-processing decisions made in an ad hoc fashion. Society, however, has started to rely on data-driven science to address policy-related questions. A collaborative platform that provides structure will allow social scientists to collaborate on and validate each other's studies. This project has the potential to transform how studies are designed and how data are processed. Through the collaborative curation of study design, procedures, and validation, the platform will raise the level of trust in the studies conducted. It will also increase the number of studies that can be done in a short span of time. The platform will be developed as open-source software, thereby facilitating interactions with the community and enabling different institutions to install it.

This project will develop a collaborative platform that social scientists can use to collaborate on and validate each other's studies. The investigative team will attempt to identify the best possible collaborative model for data-driven social science, determine how automation can most enhance the studies, and develop explicit and implicit mechanisms to establish trust in end-to-end data processing pipelines and the results they generate. To aid in the platform's development, the research team will focus on the prediction of outcomes from surveys, a specific yet widely applicable type of problem within computational social science. This class of problems involves much subjective assessment during the feature engineering stage as well as copious interpretation during the data transformation stage. These challenges will benefit both from a collaborative workflow and from mechanisms that enable trust in the eventual results. The project will bring together three distinct teams to develop this platform: computer scientists to develop abstractions, APIs, and systems; statisticians to help with methods and study design; and social scientists to help define the problems and workflow and to provide user feedback.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Aczel, Balazs and Szaszi, Barnabas and Nilsonne, Gustav and van den Akker, Olmo R and Albers, Casper J and van Assen, Marcel ALM and Bastiaansen, Jojanneke A and Benjamin, Daniel and Boehm, Udo and Botvinik-Nezer, Rotem and Bringmann, Laura F and Busch, N "Consensus-based guidance for conducting and reporting multi-analyst studies" eLife, v.10, 2021 https://doi.org/10.7554/eLife.72185
Hofman, Jake M. and Watts, Duncan J. and Athey, Susan and Garip, Filiz and Griffiths, Thomas L. and Kleinberg, Jon and Margetts, Helen and Mullainathan, Sendhil and Salganik, Matthew J. and Vazire, Simine and Vespignani, Alessandro and Yarkoni, Tal "Integrating explanation and prediction in computational social science" Nature, v.595, 2021 https://doi.org/10.1038/s41586-021-03659-0
Kindel, Alexander T. and Bansal, Vineet and Catena, Kristin D. and Hartshorne, Thomas H. and Jaeger, Kate and Koffman, Dawn and McLanahan, Sara and Phillips, Maya and Rouhani, Shiva and Vinh, Ryan and Salganik, Matthew J. "Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge" Socius: Sociological Research for a Dynamic World, v.5, 2019 https://doi.org/10.1177/2378023118817378
Liu, David M. and Salganik, Matthew J. "Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge" Socius: Sociological Research for a Dynamic World, v.5, 2019 https://doi.org/10.1177/2378023119849803
Lundberg, Ian and Narayanan, Arvind and Levy, Karen and Salganik, Matthew J. "Privacy, Ethics, and Data Access: A Case Study of the Fragile Families Challenge" Socius: Sociological Research for a Dynamic World, v.5, 2019 https://doi.org/10.1177/2378023118813023
Salganik, Matthew and Maffeo, Lauren and Rudin, Cynthia "Prediction, Machine Learning, and Individual Lives: an Interview with Matthew Salganik" Harvard Data Science Review, 2020 https://doi.org/10.1162/99608f92.eecdfa4e
Salganik, Matthew J. and Lundberg, Ian and Kindel, Alexander T. and Ahearn, Caitlin E. and Al-Ghoneim, Khaled and Almaatouq, Abdullah and Altschul, Drew M. and Brand, Jennie E. and Carnegie, Nicole Bohme and Compton, Ryan James and Datta, Debanjan and Dav "Measuring the predictability of life outcomes with a scientific mass collaboration" Proceedings of the National Academy of Sciences, v.117, 2020 https://doi.org/10.1073/pnas.1915006117
Salganik, Matthew J. and Lundberg, Ian and Kindel, Alexander T. and McLanahan, Sara "Introduction to the Special Collection on the Fragile Families Challenge" Socius: Sociological Research for a Dynamic World, v.5, 2019 https://doi.org/10.1177/2378023119871580

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In this research project, we developed approaches to collaborative computational social science, and we used those approaches to study the predictability of life outcomes. As part of this project, we conducted the Fragile Families Challenge, a scientific mass collaboration involving more than 450 researchers from around the world (Salganik et al., 2020). These researchers attempted to predict six life outcomes, such as a child’s grade point average and whether a family would be evicted from their home. Researchers used machine learning methods optimized for prediction, and they drew on all the data collected during the Fragile Families and Child Wellbeing Study. However, no researchers were able to make very accurate predictions. For policymakers considering using predictive models in settings such as criminal justice and child-protective services, these results raise a number of concerns. Additionally, researchers must reconcile the idea that they understand life trajectories with the fact that none of the predictions were very accurate.

While conducting the mass collaboration, we developed approaches to address a number of methodological challenges that we encountered, related to privacy and ethics of data access (Lundberg et al., 2019), survey metadata (Kindel et al., 2019), and computational reproducibility (Liu and Salganik, 2019). We also contributed to reporting guidelines for future multi-analyst studies (Aczel et al., 2021). Collectively, these methodological contributions should make future mass collaborations more scientifically valuable and easier to conduct.

Finally, we shared our approach and results with the broader data science community both in written form (Salganik, Maffeo, and Rudin, 2020) and through presentations at universities, companies, and government agencies. We hope that our approach and results will lead to more scientific research and improved use of predictive models in high-stakes social settings.

Last Modified: 04/21/2022
Modified by: Matthew J Salganik