Award Abstract # 1443062
Beyond Data Discovery: Shared Services for Community Metadata Improvement

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: HDF GROUP
Initial Amendment Date: April 27, 2015
Latest Amendment Date: April 27, 2015
Award Number: 1443062
Award Instrument: Standard Grant
Program Manager: Amy Walton
awalton@nsf.gov
 (703)292-4538
OAC - Office of Advanced Cyberinfrastructure (OAC)
CSE - Directorate for Computer and Information Science and Engineering
Start Date: May 1, 2015
End Date: April 30, 2019 (Estimated)
Total Intended Award Amount: $1,498,604.00
Total Awarded Amount to Date: $1,498,604.00
Funds Obligated to Date: FY 2015 = $1,498,604.00
History of Investigator:
  • Ray Habermann (Principal Investigator)
    ted@metadatagamechangers.com
  • Matthew Jones (Co-Principal Investigator)
Recipient Sponsored Research Office: The HDF Group
410 E UNIVERSITY AVE STE 200
CHAMPAIGN
IL  US  61820-3871
(217)531-6100
Sponsor Congressional District: 13
Primary Place of Performance: The HDF Group
1800 S. Oak Street
Champaign
IL  US  61820-7059
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): HKAXEDY58N79
Parent UEI:
NSF Program(s): Data Cyberinfrastructure, EarthCube
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8048
Program Element Code(s): 772600, 807400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Science data and results must be well documented in order to be reproducible and reusable. Metadata -- ancillary contextual information such as science objectives, data provenance, and uncertainty estimates at each step -- is a fundamental part of the research documentation, reuse, and collaboration process.

This project develops flexible tools for evaluating metadata, using consistent measurement systems that encourage community engagement, integrate guidance for improvement, and serve as a critical element in cross-community metadata improvement efforts. Providing these new metadata and data evaluation services across communities will improve the ability to integrate and reuse trustworthy data for crosscutting synthesis and analysis across science communities. The emphasis on use metadata rather than discovery metadata is a significant shift: use metadata is a fundamental building block for effective scientific analysis workflows. The team is building a significant collaboration with several interdisciplinary partner organizations that provide guidance to the project.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Gordon, S.C. and Habermann, T., "Evaluating the Interdisciplinary Discoverability of Data," AGU Fall Meeting, 2017.
Habermann, T. and Robinson, E., "The Road to Independently Understandable Information," AGU Fall Meeting, 2017.
Gordon, S. and Habermann, T., "The influence of community recommendations on metadata completeness," Ecological Informatics, v.43, 2018, p.38. https://doi.org/10.1016/j.ecoinf.2017.09.005
Habermann, T., "Mapping ISO 19115-1 geographic metadata standards to CodeMeta," PeerJ Computer Science, 2019. https://doi.org/10.7717/peerj-cs.174
Habermann, T., "Metadata Life Cycles, Use Cases and Hierarchies," Geosciences, v.8, 2018, p.179. https://doi.org/10.3390/geosciences8050179

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Scientific research increasingly focuses on challenges that span large regions and long time periods and involve data from many disciplines.  These studies depend on our ability to reuse existing data that has been shared in data repositories along with high-quality metadata, i.e. documentation that describes the content, structure, and research context of the data in sufficient detail to enable discovery and correct interpretation.  In this project, we developed approaches and systems for measuring the completeness and effectiveness of the metadata that is used to preserve, discover, access, and understand data from the Earth, environmental, and life sciences.

Communities that create and use data provide recommendations for the metadata needed to address use cases that are important to them. Rather than making new recommendations, we focused on facilitating evaluations of metadata collections using existing and emerging community recommendations. Communities use many dialects for naming metadata elements and many formats for storing those elements. Rather than creating new dialects and formats, we focused on identifying concepts shared across communities and mapping between them to facilitate evaluations in many native dialects, i.e., using evaluation to encourage convergence among these communities of practice.
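The sketch below illustrates, in Python, the kind of crosswalk-based completeness evaluation described above: dialect-specific element names are mapped to shared concepts, and a record is scored against a community recommendation expressed in those concepts. The element names, crosswalk entries, and recommendation are illustrative placeholders, not the actual community specifications or the project's tooling.

```python
# Hypothetical sketch: score one metadata record against a community
# recommendation by first mapping dialect-specific element paths to
# shared concepts. All names below are illustrative.

# Crosswalk from dialect-specific element paths to shared concepts
CROSSWALK = {
    "eml": {
        "dataset/title": "Resource Title",
        "dataset/creator": "Resource Contact",
        "dataset/abstract": "Abstract",
        "dataset/coverage/temporalCoverage": "Temporal Extent",
    },
    "iso19115": {
        "identificationInfo/citation/title": "Resource Title",
        "identificationInfo/pointOfContact": "Resource Contact",
        "identificationInfo/abstract": "Abstract",
        "identificationInfo/extent/temporalElement": "Temporal Extent",
    },
}

# A recommendation is just the set of shared concepts a community expects
RECOMMENDATION = {"Resource Title", "Resource Contact", "Abstract",
                  "Temporal Extent", "Data Quality / Lineage"}


def completeness(record_fields, dialect):
    """Return (fraction of recommended concepts present, missing concepts).

    record_fields: dialect-specific element paths populated in the record.
    """
    mapping = CROSSWALK[dialect]
    found = {mapping[f] for f in record_fields if f in mapping}
    missing = RECOMMENDATION - found
    return len(found & RECOMMENDATION) / len(RECOMMENDATION), missing


if __name__ == "__main__":
    populated = ["dataset/title", "dataset/creator", "dataset/abstract"]
    score, missing = completeness(populated, "eml")
    print(f"completeness: {score:.0%}, missing concepts: {sorted(missing)}")
```

Because the check is phrased in shared concepts rather than element names, the same recommendation can be applied to records written in any dialect that appears in the crosswalk.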

In addition to a diverse set of existing recommendations and dialects, the metadata environment continues to evolve as new requirements and capabilities emerge. Tracking the response of actual metadata to those changes provides insight into how effectively we as a community are documenting and preserving data for reuse in cross-disciplinary, synthetic studies, and for supporting reproducible scientific results.

We initially described and demonstrated this approach by examining the influence of community recommendations on metadata in several dialects from the NSF Long-Term Ecological Research (LTER) Program. We continued our LTER collaboration with a detailed examination of changes in metadata from all LTER sites through time. In this case, a network-wide migration to a single software environment had the most significant effect on metadata completeness and consistency.

Developing tools that implement these metadata evaluation capabilities and integrating evaluations into repository workflows was an important goal of this project. We achieved this goal with the development of the Metadata Quality Engine, implemented for the NSF Arctic Data Center, the KNB Data Repository, and the DataONE network of over forty member data repositories. These DataONE repositories span many Earth Science disciplines, academic research institutions, and government agencies around the world. This is the first systematic metadata evaluation capability available for DataONE members, and the impact of these tools will grow over time.

We also extended this evaluation work to the metadata repositories at the center of international academic publishing and persistent identifier (PID) creation and management: CrossRef and DataCite. These repositories include metadata for tens of millions of research articles and datasets and are expanding into metadata for software, scientific instruments, samples, and other kinds of research objects. As these repositories grow, they can play important roles in identifying and connecting published papers to the people, institutions, data, and software that contributed to the research behind the paper. Those connections depend on identifiers and links in the metadata. We found that this kind of information is missing from many of the metadata records and that, in many cases, content is limited to the minimal required fields. The services and capabilities these repositories provide are changing rapidly. Consistent evaluation across providers in many disciplines can provide information about the metadata required to support these new capabilities, along with good examples that demonstrate usage and benefits.
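As one concrete illustration of this kind of connectivity check, the sketch below queries the public CrossRef REST API for a single DOI and reports which linking fields are populated. The particular fields inspected (author ORCIDs, funder entries, reference count, license) are our choice for illustration and are not the project's evaluation rubric; the DOI is simply an example from the publication list above.

```python
# Minimal sketch: check a CrossRef record for the identifiers and links
# that connect a paper to people, funders, and other research objects.
import requests


def connection_report(doi):
    """Report which linking fields are present in one CrossRef record."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    work = resp.json()["message"]

    authors = work.get("author", [])
    return {
        "doi": doi,
        "authors_with_orcid": sum(1 for a in authors if a.get("ORCID")),
        "authors_total": len(authors),
        "funders_listed": len(work.get("funder", [])),
        "references_listed": work.get("reference-count", 0),
        "license_present": bool(work.get("license")),
    }


if __name__ == "__main__":
    # Example DOI: the Ecological Informatics paper listed above
    print(connection_report("10.1016/j.ecoinf.2017.09.005"))
```

Run over a large sample of DOIs, counts like these make it easy to see how often records stop at the minimal required fields.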

We found similar limitations in over twenty metadata dialects that had been mapped to the CodeMeta vocabulary for software metadata. The CodeMeta vocabulary includes over 60 items covering discovery, access, use, and understanding use cases; the mapped dialects included only eleven of these items on average. The strong focus on discovery in many systems, tools, and recommendations needs to be overcome if these metadata are going to support interoperability and data reuse.
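A rough sketch of the coverage counting behind that "eleven of more than 60" figure is shown below. It assumes a CodeMeta-style crosswalk CSV in which a "Property" column lists the CodeMeta terms and each remaining column holds one dialect's equivalent term (blank where the dialect has no mapping); the column names and file path are assumptions, not the exact layout of the published crosswalk.

```python
# Sketch: count how many CodeMeta properties each dialect maps in a
# crosswalk CSV (one row per property, one column per dialect).
import csv


def dialect_coverage(crosswalk_path, property_column="Property"):
    """Return (total properties, {dialect: number of mapped properties})."""
    with open(crosswalk_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    dialects = [c for c in rows[0] if c != property_column]
    total = len(rows)  # number of CodeMeta properties
    coverage = {
        d: sum(1 for r in rows if (r.get(d) or "").strip()) for d in dialects
    }
    return total, coverage


if __name__ == "__main__":
    total, coverage = dialect_coverage("crosswalk.csv")
    mean = sum(coverage.values()) / len(coverage)
    print(f"{total} CodeMeta properties; "
          f"mean dialect coverage = {mean:.1f} properties")
    for dialect, n in sorted(coverage.items(), key=lambda kv: -kv[1]):
        print(f"  {dialect}: {n}/{total}")
```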

The tools we developed are now part of the DataONE infrastructure. They will continue to provide a systematic and quantitative means for data repository managers and researchers around the US and the world to evaluate and improve the extent to which their data are Findable, Accessible, Interoperable, and Reusable (FAIR), which in turn will increase the long-term impact of research data by accelerating cross-disciplinary, synthetic research with existing data.


 


Last Modified: 06/13/2019
Modified by: Ray Habermann
