
NSF Org: |
OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | June 1, 2015 |
Latest Amendment Date: | May 16, 2019 |
Award Number: | 1450277 |
Award Instrument: | Standard Grant |
Program Manager: |
Bogdan Mihaila
bmihaila@nsf.gov (703)292-8235 OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering |
Start Date: | June 1, 2015 |
End Date: | May 31, 2020 (Estimated) |
Total Intended Award Amount: | $1,422,728.00 |
Total Awarded Amount to Date: | $1,422,728.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
1033 MASSACHUSETTS AVE STE 3 CAMBRIDGE MA US 02138-5366 (617)495-5501 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
33 Oxford Street Cambridge MA US 02138-3846 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Special Projects - CCF, Software Institutes |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Reproducibility is the cornerstone of scientific progress. Historically, scientists made their work reproducible by including a formulaic description of the methodology used in an experiment. In an age of computational science, such descriptions no longer adequately capture scientific methodology. Instead, scientific reproducibility relies on a precise and actionable description of the data and programs used to conduct the research. Provenance is the name given to the description of how a digital artifact came to be in its present state. Provenance includes a precise specification of an experiment's input data and the programs or procedures applied to that data. Most computational platforms do not record such data provenance, making it difficult to ensure reproducibility. This project addresses this problem through the development of tools that transparently and automatically capture data provenance as part of a scientist's normal computational workflow.
An interdisciplinary team of computer scientists and ecologists has come together to develop tools to facilitate the capture, management, and query of data provenance, the history of how a digital artifact came to be in its present state. Such data provenance improves the transparency, reliability, and reproducibility of scientific results. Most existing provenance systems require users to learn specialized tools and jargon and are unable to integrate provenance from different sources; these are serious obstacles to adoption by domain scientists. This project includes the design, development, deployment, and evaluation of an end-to-end system (eeProv) that encompasses the range of activity from original data analysis by domain scientists to management and analysis of the resulting provenance in a common framework with common tools. This project leverages and integrates development efforts on (1) an emerging system for generating provenance from a computing environment that scientists actually use (the R statistical language) and (2) an emerging system that uses a library of language and database adapters to store and manage provenance from virtually any source. Accomplishing the goals of this proposal requires fundamental research in resolving the semantic gap between provenance collected in different environments, capturing detailed provenance at the level of a programming language, defining precisely the aspects of provenance required for different use cases, and making provenance accessible to scientists.
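As an illustration of what capturing provenance "at the level of a programming language" can mean, the following is a minimal sketch in Python. It is not the project's tooling or its schema: the `record_provenance` decorator, the `fingerprint` helper, the `PROVENANCE_LOG` list, and the record fields are all hypothetical names chosen for this example. The sketch assumes inputs and outputs are JSON-serializable and records, for each call, which procedure ran, digests of its inputs, and a digest of its output.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist records.
PROVENANCE_LOG = []


def fingerprint(value):
    """Return a short SHA-256 digest of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def record_provenance(func):
    """Wrap func so that each call appends one provenance record."""
    @functools.wraps(func)
    def wrapper(*args):
        result = func(*args)
        PROVENANCE_LOG.append({
            "procedure": func.__name__,                 # what was applied
            "inputs": [fingerprint(a) for a in args],   # to which data
            "output": fingerprint(result),              # producing what
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def mean(values):
    return sum(values) / len(values)


@record_provenance
def center(values, mu):
    return [v - mu for v in values]


data = [2.0, 4.0, 6.0]
mu = mean(data)
centered = center(data, mu)

for record in PROVENANCE_LOG:
    print(record["procedure"], record["inputs"], "->", record["output"])
```

Because the output digest of `mean` reappears as an input digest of `center`, the log is enough to reconstruct how `centered` was derived from `data`. That chain of derivations is the kind of history the abstract calls provenance; the analysis code itself is unchanged apart from the decorator.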
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Reproducibility is the cornerstone of scientific inquiry. Historically, experimental scientists kept detailed laboratory notebooks to document their research findings. As experimental science has become increasingly reliant on computation over the past two decades, an ever-growing fraction of experimental procedures has been documented via computer code and data. Unfortunately, the complexity of these codes and their dependence on the surrounding computational environment have led to a crisis in reproducibility.
This research addresses the reproducibility crisis through the use of data provenance, the formal and structured representation of a computation. Data provenance documents how a digital artifact or product came to be in its current state. The multidisciplinary team from Mount Holyoke College, the Harvard John A. Paulson School of Engineering and Applied Sciences, and the Harvard Forest has undertaken two parallel efforts to bring data provenance to computational scientists: 1) the design and development of tools that capture, store, and process data provenance and 2) the design and development of applications that use data provenance to make conducting computational science easier. The applications provide motivation and incentive for users to adopt provenance tools. Once users collect data provenance as part of their experimental workflow, providing replicability and/or reproducibility becomes significantly easier.
While many data provenance tools existed prior to this work, they faced three major obstacles: 1) there was little incentive for adoption, 2) many of them required that scientists learn a new programming language, workflow system, or computational platform, and 3) there was no way to integrate data provenance collected from different systems. This project addressed all three obstacles. To encourage adoption, the team built a suite of provenance-based tools that aid in debugging computational processes (provDebugR for R and ProvBuild for Python), make it easier for scientists to understand existing experimental workflows (provSummarizR, provExplainR, and Rclean), facilitate push-button reproduction or replication (encapsulator and containR), and detect system intrusions (Unicorn). The multilingual (R and Python) approach makes provenance accessible to scientists in a range of disciplines (e.g., ecologists, biologists, and statisticians frequently use R; computer scientists and many data scientists frequently use Python). Scientists can obtain provenance using the languages in which they are most comfortable, without making significant changes to their preferred workflow. The team has defined a schema in which to represent language-level provenance so that additional languages can be incorporated into the ecosystem by developing provenance capture tools that generate their output in the documented format. Through the use of whole-system provenance capture and a library accessible to any provenance capture tool, provenance can be integrated among different capture mechanisms, providing an end-to-end solution capable of documenting an entire experimental process.
Most of the R-based tools developed in this project are available for download from the Comprehensive R Archive Network (CRAN). The other tools are available via GitHub repositories, web sites, and pre-packaged virtual machines.
Education has been an important focus throughout the project. The team included two postdocs, one of whom is currently a professor at Bristol University, two Ph.D. candidates in computer science, and numerous undergraduate students at Mt. Holyoke, Harvard, and Harvard Forest. Ten students (including five women and one African American man) participated in the Harvard Forest REU program in Ecology, which has allowed computer science students to better understand the role of computation and provenance in ecological research. Additionally, four women undergraduates at Mt. Holyoke carried out independent study projects related to this research.
Last Modified: 06/09/2020
Modified by: Margo I Seltzer
Please report errors in award information by writing to: awardsearch@nsf.gov.