
NSF Org: | OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | July 31, 2018 |
Latest Amendment Date: | July 31, 2018 |
Award Number: | 1761810 |
Award Instrument: | Standard Grant |
Program Manager: | Cheryl Eavey, ceavey@nsf.gov, (703)292-7269, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2018 |
End Date: | August 31, 2021 (Estimated) |
Total Intended Award Amount: | $250,000.00 |
Total Awarded Amount to Date: | $250,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 615 W 131ST ST NEW YORK NY US 10027-7922 (212)854-6851 |
Sponsor Congressional District: |
|
Primary Place of Performance: | NY US 10027-7003 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Economics, Methodology, Measuremt & Stats |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This research project will develop a collaborative data science platform for computational social science called the Data Science Foundry. The collection and management of large-scale data currently is a relatively unstructured process, with data-processing decisions being made in an ad hoc fashion. Society has started to rely on data-driven science to address policy-related questions, however. The development of a collaborative platform that provides structure will allow social scientists to collaborate and validate each other's studies. This project has the potential to transform how studies are designed and how data will be processed. The collaborative platform will result in a higher level of trust in the studies conducted via the collaborative curation of study design, procedures, and validation. The collaborative platform also will increase the number of studies that can be done in a short span of time. The platform will be developed as open-source, thereby facilitating interactions with the community and enabling different institutions to install the program.
This project will develop a collaborative platform that social scientists can use to collaborate and validate each other's studies. The investigative team will attempt to identify the best possible collaborative model for data-driven social science, determine how automation can most enhance the studies, and develop explicit and implicit mechanisms to establish trust in end-to-end data processing pipelines and the results they generate. To aid in the platform's development, the research team will focus on the prediction of outcomes from surveys, a specific yet widely applicable type of problem within computational social science. This class of problems involves much subjective assessment during the feature engineering stage as well as copious interpretation during the data transformation stage. These unique challenges will benefit both from a collaborative workflow and from mechanisms that enable trust in the eventual results. The project will bring together three distinct teams to develop this platform: computer scientists to develop abstractions, APIs, and systems; statisticians to help with methods and study design; and social scientists to help define the problems and workflow and to provide user feedback.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Data pipelines are often a critical part of the infrastructure of interdisciplinary data science. They often go beyond a simple connecting tool and have a lasting impact on which questions are posed and how they are answered. Unfortunately, those pipelines are severely challenged as soon as we move away from a “common task” model, in which a single prediction task is attempted in parallel by multiple teams using data that are made publicly available. Many application areas of data science have more complex requirements: they involve collaboration between entities with different data access, and an application task that may evolve with input from both sides as the limitations of formulation and analysis techniques affect the results. At the same time, pipelines are required to become more rigorous and trustworthy, so that concerns about data bias affecting high-stakes decisions are met with solid guarantees, either local or global, on their operation.
Over the course of three years, the project team experienced the everyday effort of interdisciplinary data science across multiple domains: sociology, environmental science, psychology and public health, economics, and media studies. At the same time, it created tools and analyses aimed at documenting the opportunities offered by more complex data pipelines, on topics including but not limited to the theoretical limits of prediction and the validation or invalidation of observations from limited samples. It also identified opportunities, when data are reused among multiple parties, to deploy fairness guarantees faster using incentives.
In addition to research publications (in multiple journals and conferences, across computer science and other disciplines), we focused on delivering these research outcomes in a sustainable, repeatable way, in keeping with the open source movement, frameworks that address the reproducibility crisis, and the development of scalable, sustainable communities of software development and usage. Here we highlight the three axes along which we generated key outcomes.
Open source libraries: We have developed and delivered numerous “usable” open source libraries. These include Ballet, MLBlocks and MLPrimitives (both part of MLBazaar), and Cardea. By “usable” we mean that they all follow standard software engineering practices: multiple releases, continuous integration, documentation, and testing. This creates a sustainable ecosystem beyond the timeframe of this project. Most of these libraries have active community usage and have gone on to solve impactful problems in different domains.
Empirical studies: The teams conducted numerous empirical studies, both of collaborative data science (without the platforms) and of collaboration between data scientists and domain experts on the platforms.
Closing the loop: Studying how collaboration happens, solving a societal problem using data, and developing platforms informed by the former two is a difficult endeavor to balance. Research often covers only one or two of these axes. Our collaboration tied these three threads together into a set of coherent outcomes that we hope are informative and usable in future development and progress in this direction.
The project also offered multiple opportunities to broaden the base of leaders bringing data science and AI to multiple domains, by organizing a summer school on computational social science for graduate students and a high school program empowering diverse voices to join AI in its application to multiple fields.
Last Modified: 10/21/2022
Modified by: Augustin Chaintreau