Award Abstract # 0960535
Filtered Push: Continuous Quality Control for Distributed Collections and Other Species-Occurrence Data.

NSF Org: DBI (Division of Biological Infrastructure)
Recipient: PRESIDENT AND FELLOWS OF HARVARD COLLEGE
Initial Amendment Date: July 27, 2010
Latest Amendment Date: December 22, 2010
Award Number: 0960535
Award Instrument: Standard Grant
Program Manager: Peter McCartney
DBI (Division of Biological Infrastructure)
BIO (Directorate for Biological Sciences)
Start Date: August 1, 2010
End Date: July 31, 2015 (Estimated)
Total Intended Award Amount: $1,640,289.00
Total Awarded Amount to Date: $1,640,289.00
Funds Obligated to Date: FY 2010 = $1,640,289.00
History of Investigator:
  • James Hanken (Principal Investigator)
    hanken@oeb.harvard.edu
  • Bertram Ludaescher (Co-Principal Investigator)
  • James Macklin (Co-Principal Investigator)
  • James Macklin (Former Principal Investigator)
Recipient Sponsored Research Office: Harvard University
1033 MASSACHUSETTS AVE STE 3
CAMBRIDGE
MA  US  02138-5366
(617)495-5501
Sponsor Congressional District: 05
Primary Place of Performance: Harvard University
1033 MASSACHUSETTS AVE STE 3
CAMBRIDGE
MA  US  02138-5366
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): LN53LCFJFL45
Parent UEI:
NSF Program(s): ADVANCES IN BIO INFORMATICS, Cross-BIO Activities
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1718, 6895, 9178, 9183, 9184, BIOT
Program Element Code(s): 116500, 727500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.074

ABSTRACT

Harvard University is awarded a grant to develop a networked solution that enables annotation of distributed biological collection data and the sharing of assertions about their quality or usability. Internet queries posed to multiple datasets may yield varying results depending on the suitability or quality of the targeted data. In some cases it may be possible to consult experts or software agents who can help determine fitness for use; in other cases such experts or agents may already have recorded an assessment of the data. However, that information is not typically available to the originator of the query. The proposed system will make these value-added assertions accessible to end users of biodiversity datasets.

The Filtered Push network uses natural science collections as a reference implementation for a cyberinfrastructure with which any community can render an expert opinion about the quality of data, and the fitness for use of a data set or a subset of records. The emergent knowledgebase of the Filtered Push network supports the ability of interested parties to get immediate or historical access to these annotations, filtered by criteria expressing constraints on their interests. The network can also provide for the automatic execution of scientific workflows triggered by expert commentary, by the introduction or discovery of new data, or by a change in scientific viewpoints. As with the annotations, the outputs of such workflows can be distributed to interested parties, software or human. Filtered Push networks therefore allow for continuous quality control by the scientific community, based on human expertise, statistical or logical machine reasoning or advances in the domain science itself. The Filtered Push project maintains a wiki at http://www.etaxonomy.org/mw/FilteredPush. This project is part of a 10-year effort to digitize and mobilize the scientific information associated with biological specimens held in U.S. research collections. The images and digitized data from this project will be integrated into the online national resource as outlined in the community strategic plan available at http://digbiocol.files.wordpress.com/2010/05/digistratplanfinaldraft.pdf.
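The core idea of the paragraph above, annotations "pushed" through a network and delivered only to parties whose registered interest criteria they match, can be sketched as a small publish/subscribe loop. This is a minimal illustration of the pattern, not the project's actual implementation; all class and field names here are hypothetical.

```python
# Minimal sketch of the "filtered push" idea: annotations about specimen
# records flow through a network, and each subscriber receives only those
# annotations matching its registered interest filter. Names are hypothetical.

class FilteredPushNetwork:
    def __init__(self):
        self.subscribers = []  # list of (filter_predicate, inbox) pairs

    def subscribe(self, predicate):
        """Register an interest filter; returns the inbox it feeds."""
        inbox = []
        self.subscribers.append((predicate, inbox))
        return inbox

    def push(self, annotation):
        """Deliver an annotation to every subscriber whose filter accepts it."""
        for predicate, inbox in self.subscribers:
            if predicate(annotation):
                inbox.append(annotation)

network = FilteredPushNetwork()
# A curator interested only in annotations concerning the genus Quercus:
quercus_inbox = network.subscribe(lambda a: a["genus"] == "Quercus")

network.push({"genus": "Quercus", "issue": "georeference outside stated country"})
network.push({"genus": "Rana", "issue": "collector name misspelled"})
print(len(quercus_inbox))  # only the Quercus annotation is delivered
```

The same dispatch point could also trigger a workflow run instead of an inbox append, which is how expert commentary or new data could launch automated re-analysis.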

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Dou, L., G. Cao, P. J. Morris, R. A. Morris, B. Ludäscher, J. A. Macklin and J. Hanken "Kurator: A Kepler package for data curation workflows" Procedia Computer Science , v.9 , 2012 , p.1614 10.1016/j.procs.2012.04.177
McPhillips, T., T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R. K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K. W. Kintigh, T. A. Kohler, D. Koop, J. A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei "YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts" Internat. J. Digit. Curat. , v.10 , 2015 , p.298 10.2218/ijdc.v10i1.370
R.A. Morris, L. Dou, J. Hanken, M. Kelly, D.B. Lowery, B. Ludäscher, J.A. Macklin, and P.J. Morris "Semantic annotation of mutable data" PLoS ONE , v.8 , 2013 , p.e76093 10.1371/journal.pone.0076093
Song T., S. Köhler, B. Ludäscher, J. Hanken, M. Kelly, D. Lowery, J. A. Macklin, P. J. Morris and R. A. Morris "Towards automated design, analysis and optimization of declarative curation workflows" Internat. J. Digit. Curat. , v.9 , 2014 , p.111 10.2218/ijdc.v9i2.337
Tschöpe, O., J.A. Macklin, R.A. Morris, L. Suhrbier, and W.G. Berendsohn "Annotating biodiversity data" Taxon , v.62 , 2013 , p.1248
Wang, Z. "Entropy on covers" Data Mining and Knowledge Discovery , v.24 , 2012 , p.288 10.1007/s10618-011-0230-1

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A typical natural-science collection specimen or observation is associated with information about who, what, where, when and how the specimen was collected and/or observed in nature. For hundreds of years this information was captured on handwritten or typed labels associated with the specimens, or in catalogs, but these data are now in high demand in digital form for use in scientific research. Digitization is typically performed by the data owners, who then freely distribute the data to relevant aggregators of information. This process inevitably generates errors, and it is typically limited in the types and quantity of information captured (Figure 1). Consequently, data quality assessment is essential to determine the applicability, or fitness, of the data for a specific research purpose. Data quality can also be enhanced with semi-automated tools that check for errors by comparing digitized data against authoritative sources, suggest corrections, and allow additions or improvements to data records based on expert opinion. A further complication is that such edits can be made by anyone, anywhere, yet this knowledge is rarely communicated back to the data owner, especially in a useful, standardized form. The FilteredPush (FP) project has begun to address these challenges by producing online tools for improving the fitness for use of distributed data through computer analysis and annotation, and through human review of data quality annotations.
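A semi-automated check of the kind described above might compare a digitized field against an authoritative vocabulary and, rather than silently editing the record, emit a suggested correction for expert review. The sketch below uses a tiny inlined country list as the authority; the project's actual tools (e.g. Kurator, FP-Akka) consult real external services, so everything here is an illustrative assumption.

```python
import difflib

# Hypothetical authority file. Real QC workflows would query an external
# service such as a gazetteer or a taxonomic name resolver.
AUTHORITATIVE_COUNTRIES = ["United States", "Canada", "Mexico", "Brazil"]

def check_country(record):
    """Validate the 'country' field; propose a correction instead of editing."""
    value = record.get("country", "")
    if value in AUTHORITATIVE_COUNTRIES:
        return {"status": "valid", "value": value}
    # Suggest the closest authoritative spelling for a human to accept/reject.
    match = difflib.get_close_matches(value, AUTHORITATIVE_COUNTRIES, n=1)
    if match:
        return {"status": "suggest", "value": value, "suggestion": match[0]}
    return {"status": "unresolved", "value": value}

print(check_country({"country": "Unted States"}))
```

Returning a structured suggestion rather than mutating the record mirrors the review step described in the report: the correction only takes effect once a curator accepts it.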

Intellectual Merit:

The FilteredPush team developed a novel suite of tools that use structured annotations to propose data edits. These tools extend those used for commenting on information in web pages. The annotations may be recorded in a variety of clients and “pushed” back to the data owners. Data curators serve as gatekeepers (“filters”) who evaluate proposed edits to records in their databases and update records accordingly. Annotations can be contributed by experts or non-experts; where multiple opinions exist, they become conversations. The concept of data annotation was embraced by the Biodiversity Information Standards (TDWG) body, which facilitated interoperability of FP with AnnoSys, a European implementation. Scientific workflow software for data quality control (Kepler-Kuration, FP-Akka) was developed to assess and recommend changes within natural-science collection datasets. These workflows include detailed provenance information that promotes transparency and reusability. The structured annotations also facilitate the application of semantic web technologies to biodiversity data management.
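The annotation-and-gatekeeper flow described above can be illustrated with a hypothetical structured annotation: a proposed edit that carries its target record, the affected term, the new value, and supporting evidence, which a curator then accepts or rejects. Field names below loosely echo Darwin Core term style, but the schema is an assumption for illustration, not the FP data model.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A structured proposed edit to one field of a specimen record."""
    target_record: str      # e.g. an occurrenceID identifying the record
    term: str               # the field being annotated, e.g. "dwc:country"
    proposed_value: str
    evidence: str           # justification shown to the curator
    replies: list = field(default_factory=list)  # multiple opinions become a conversation

def curate(record, annotation, accept):
    """The data curator is the gatekeeper: only accepted edits change the record."""
    if accept:
        record[annotation.term] = annotation.proposed_value
    return record

record = {"occurrenceID": "urn:uuid:1234", "dwc:country": "Unted States"}
ann = Annotation("urn:uuid:1234", "dwc:country", "United States",
                 evidence="matched against country authority list")
curate(record, ann, accept=True)
print(record["dwc:country"])  # -> United States
```

Because the annotation is a structured object rather than free text, it can be logged as provenance, routed through interest filters, or consumed by semantic-web tooling, which is the interoperability point TDWG and AnnoSys cared about.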

Broader Impacts:

The FilteredPush project provides both direct and indirect support for several Thematic Collections Networks (TCNs) in the NSF’s Advancing Digitization of Biological Collections program and its coordinating hub, iDigBio. Direct support was provided to the Southwest Collections of Arthropods Network (SCAN), InvertEBase, and New England Vascular Plants (NEVP) TCNs. This support includes FP functionality for creating annotations and for registering taxonomic groups of interest to experts, which is embedded in Symbiota, a community collection-management tool and data portal. Data quality reports were also produced for institutions participating in TCNs (Figure 2), and these institutions continue to use the tools. Indirect support was provided through further enhancements to Symbiota and through contributions to iDigBio hackathons, webinars and workshops. Training was provided to a broad spectrum of users through hands-on workshops and presentations. The project directly involved one undergraduate, three graduate students, and two postdoctoral fellows. Two graduate-level courses incorporated content on data curation workflows.

The project showed that a proposed web-annotation standard of the World Wide Web Consortium ...
