Award Abstract # 1450277
SI2-SSI: Collaborative Research: Bringing End-to-End Provenance to Scientists

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: PRESIDENT AND FELLOWS OF HARVARD COLLEGE
Initial Amendment Date: June 1, 2015
Latest Amendment Date: May 16, 2019
Award Number: 1450277
Award Instrument: Standard Grant
Program Manager: Bogdan Mihaila
bmihaila@nsf.gov
 (703)292-8235
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: June 1, 2015
End Date: May 31, 2020 (Estimated)
Total Intended Award Amount: $1,422,728.00
Total Awarded Amount to Date: $1,422,728.00
Funds Obligated to Date: FY 2015 = $1,422,728.00
History of Investigator:
  • Margo Seltzer (Principal Investigator)
    margo@eecs.harvard.edu
  • Aaron Ellison (Co-Principal Investigator)
  • Emery Boose (Co-Principal Investigator)
Recipient Sponsored Research Office: Harvard University
1033 MASSACHUSETTS AVE STE 3
CAMBRIDGE
MA  US  02138-5366
(617)495-5501
Sponsor Congressional District: 05
Primary Place of Performance: Harvard University
33 Oxford Street
Cambridge
MA  US  02138-3846
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): LN53LCFJFL45
Parent UEI:
NSF Program(s): Special Projects - CCF, Software Institutes
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8009
Program Element Code(s): 287800, 800400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Reproducibility is the cornerstone of scientific progress. Historically, scientists have made their work reproducible by including a formulaic description of the experimental methodology used in an experiment. In an age of computational science, such descriptions no longer adequately describe scientific methodology. Instead, scientific reproducibility relies on a precise and actionable description of the data and programs used to conduct the research. Provenance is the name given to the description of how a digital artifact came to be in its present state. Provenance includes a precise specification of an experiment's input data and the programs or procedures applied to that data. Most computational platforms do not record such data provenance, making it difficult to ensure reproducibility. This project addresses this problem through the development of tools that transparently and automatically capture data provenance as part of a scientist's normal computational workflow.

An interdisciplinary team of computer scientists and ecologists has come together to develop tools to facilitate the capture, management, and query of data provenance -- the history of how a digital artifact came to be in its present state. Such data provenance improves the transparency, reliability, and reproducibility of scientific results. Most existing provenance systems require users to learn specialized tools and jargon and are unable to integrate provenance from different sources; these are serious obstacles to adoption by domain scientists. This project includes the design, development, deployment, and evaluation of an end-to-end system (eeProv) that encompasses the range of activity from original data analysis by domain scientists to management and analysis of the resulting provenance in a common framework with common tools. This project leverages and integrates development efforts on (1) an emerging system for generating provenance from a computing environment that scientists actually use (the R statistical language) with (2) an emerging system that utilizes a library of language and database adapters to store and manage provenance from virtually any source. Accomplishing the goals of this proposal requires fundamental research in resolving the semantic gap between provenance collected in different environments, capturing detailed provenance at the level of a programming language, precisely defining the aspects of provenance required for different use cases, and making provenance accessible to scientists.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 14)
Lerner, B., Boose, E., Perez, L., "Using Introspection to Collect Provenance in R" Informatics, v.5, 2018 http://www.mdpi.com/2227-9709/5/1/12/htm
Chan, S., Cheney, J., Bhatotia, P., Pasquier, T., Gehani, A., Irshad, H., Carata, L., Seltzer, M., "ProvMark: A provenance expressiveness benchmarking system" Proceedings of the International Middleware Conference, 2019
Ellison, A., Boose, E., Lerner, B., Fong, E., Seltzer, M., "People of Data: The End-to-End Provenance Project" iScience Patterns, v.1, 2020 10.1016/j.patter.2020.100016
Han, X., Pasquier, T., Bates, A., Watson, R., Mickens, J., Seltzer, M., "Unicorn: Revisiting Host-Based Intrusion Detection in the Age of Data Provenance" Proceedings of the Network and Distributed System Security Symposium, 2020
Lau, M., Pasquier, T., Seltzer, M., "Rclean: A Tool for Writing Cleaner, More Transparent Code" JOSS: The Journal of Open Source Software, v.5, 2020 10.21105/joss.01312
Pasquier, T., Eyers, D., Seltzer, M., "From Here to Provtopia" Proceedings of the Workshop Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogeneous Data (POLY, co-located with VLDB 2019), 2019
Pasquier, T., Han, X., Moyer, T., Bates, A., Herman, O., Eyers, D., Bacon, J., Seltzer, M., "Runtime Analysis of Whole-System Provenance" Proceedings of the 2018 Conference on Computer and Communications Security (CCS '18), 2018
Pasquier, T., Han, X., Goldstein, M., Moyer, T., Eyers, D., Seltzer, M., Bacon, J., "Practical Whole-System Provenance Capture" Symposium on Cloud Computing (SoCC '17), 2017, p.405 10.1145/3127479.3129249
Pasquier, T., Lau, M., Han, X., Fong, E., Lerner, B., Boose, E., Crosas, M., Ellison, A., Seltzer, M., "Sharing and Preserving Computational Analyses for Posterity with encapsulator" IEEE Computing in Science and Engineering, v.20, 2018 10.1109/MCSE.2018.042781334
Pasquier, T., Singh, J., Powles, J., Eyers, D., Seltzer, M., Bacon, J., "Data provenance to audit compliance with privacy policy in the Internet of Things" Springer Personal and Ubiquitous Computing, 2018
Pasquier, T., Lau, M., Trisovic, A., Boose, E., Couturier, B., Ellison, A., Gibson, V., Jones, C., Seltzer, M., "If these data could talk" Nature Scientific Data, 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

 

Reproducibility is the cornerstone of scientific inquiry. Historically, experimental scientists kept detailed laboratory notebooks to document their research findings. As experimental science has become increasingly reliant on computation over the past two decades, an ever-growing fraction of experimental procedures has been documented via computer code and data. Unfortunately, the complexity of these codes and their dependence on their surrounding computational environment have led to a crisis in reproducibility.

This research addresses the reproducibility crisis through the use of data provenance, the formal and structured representation of a computation. Data provenance documents how a digital artifact or product came to be in its current state. The multidisciplinary team from Mount Holyoke College, the Harvard John A. Paulson School of Engineering and Applied Sciences, and the Harvard Forest has undertaken two parallel efforts to bring data provenance to computational scientists: 1) the design and development of tools that capture, store, and process data provenance and 2) the design and development of applications that use data provenance to make conducting computational science easier. The applications provide motivation and incentive for users to adopt provenance tools. Once users collect data provenance as part of their experimental workflow, providing replicability and/or reproducibility becomes significantly easier.

While many data provenance tools existed prior to this work, they suffered from three major obstacles: 1) there was little incentive for adoption, 2) many of them required that scientists learn a new programming language, workflow system, or computational platform, and 3) there was no way to integrate data provenance collected from different systems. This project addressed all three obstacles. To encourage adoption, the team built a suite of provenance-based tools that aid in debugging computational processes (provDebugR for R and ProvBuild for Python), make it easier for scientists to understand existing experimental workflows (provSummarizR, provExplainR, and Rclean), facilitate push-button reproduction or replication (encapsulator and containR), and detect system intrusions (Unicorn). The multi-lingual (R and Python) approach makes provenance accessible to scientists in a range of disciplines (e.g., ecologists, biologists, and statisticians frequently use R; computer scientists and many data scientists frequently use Python). The scientists can obtain provenance using the languages in which they are most comfortable, without making significant changes to their preferred workflow. The team has defined a schema in which to represent language-level provenance so that additional languages can be incorporated into the ecosystem via development of provenance capture tools that generate their output in the documented format. Through the use of whole-system provenance capture and a library accessible to any provenance capture tool, provenance can be integrated among different capture mechanisms, providing an end-to-end solution capable of documenting an entire experimental process.

Most of the R-based tools developed in this project are available for easy download from the Comprehensive R Archive Network (CRAN). The other tools are also available via GitHub repositories, web sites, and pre-packaged virtual machines.

Education has been an important focus throughout the duration of the project. The team included two postdocs, one of whom is currently a professor at Bristol University, two Ph.D. candidates in computer science, and numerous undergraduate students at Mt. Holyoke, Harvard, and Harvard Forest. Ten students (including five women and one African American man) participated in the Harvard Forest REU program in Ecology, which has allowed computer science students to better understand the role of computation and provenance in ecological research. Additionally, four women undergraduates at Mt. Holyoke carried out independent study projects related to this research.


Last Modified: 06/09/2020
Modified by: Margo I Seltzer

Please report errors in award information by writing to: awardsearch@nsf.gov.
