Award Abstract # 1930645
Developing Evidence-based Data Sharing and Archiving Policies

NSF Org: SMA
SBE Office of Multidisciplinary Activities
Recipient: REGENTS OF THE UNIVERSITY OF MICHIGAN
Initial Amendment Date: July 31, 2019
Latest Amendment Date: July 31, 2019
Award Number: 1930645
Award Instrument: Standard Grant
Program Manager: Mary Feeney
SMA
 SBE Office of Multidisciplinary Activities
SBE
 Directorate for Social, Behavioral and Economic Sciences
Start Date: December 1, 2019
End Date: November 30, 2022 (Estimated)
Total Intended Award Amount: $498,643.00
Total Awarded Amount to Date: $498,643.00
Funds Obligated to Date: FY 2019 = $498,643.00
History of Investigator:
  • Libby Hemphill (Principal Investigator)
    libbyh@umich.edu
  • Elizabeth Yakel (Co-Principal Investigator)
  • Amy Pienta (Co-Principal Investigator)
  • Dharma Akmon (Co-Principal Investigator)
  • Andrea Thomer (Co-Principal Investigator)
Recipient Sponsored Research Office: Regents of the University of Michigan - Ann Arbor
1109 GEDDES AVE STE 3300
ANN ARBOR
MI  US  48109-1015
(734)763-6438
Sponsor Congressional District: 06
Primary Place of Performance: University of Michigan Ann Arbor
MI  US  48106-1248
Primary Place of Performance Congressional District: 06
Unique Entity Identifier (UEI): GNJ7BBP73WE9
Parent UEI:
NSF Program(s): SciSIP-Sci of Sci Innov Policy
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 062Z, 7626
Program Element Code(s): 762600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.075

ABSTRACT

Access to original research data supports innovative, interdisciplinary, and integrative research, and enables replication and review of prior work. Consequently, a growing number of funding agencies, journal publishers, and scientific societies now require that original research data be shared and archived promptly after collection or publication. However, there are still many unanswered questions about the best way to share and archive research data. For instance: how can data repositories best allocate their limited resources for different aspects of data archiving and processing? What is the most effective way of making data usable by the broadest audience? What data sharing policies most effectively achieve stakeholders' transparency and innovation goals? This project answers these questions by studying the impact of different "curatorial actions" (e.g., standardizing variables, improving documentation) on the reuse of data archived by the Inter-university Consortium for Political and Social Research (ICPSR). As one of the largest social science archives in the world and a leader in digital data curation practice, ICPSR is well-suited as a site for this project. ICPSR is also well-positioned to provide funding agencies and policy makers recommendations for data sharing policies that articulate the metrics needed in evaluating the appropriateness of data sharing and curation plans and their associated costs. This project achieves broader impacts by (1) recommending evidence-based data sharing policies to funders, repository staff, and researchers and (2) improving research data curation practices.

To determine the impact of various curatorial activities on data reuse, the project first defines the different kinds of "curatorial actions" and "impact," and then explains the relationships among actions and impact. To identify curatorial actions and other features of datasets and ICPSR services that influence reuse, the project examines ICPSR's legacy curation logs and use records (such as downloads and citations). Curation logs contain data about specific data transformations or preservation steps. By connecting curation logs to data usage records, the project identifies the actions that are associated with higher rates of reuse or access. The project examines the utility of two measures of impact--secondary impact and diversity--by comparing use logs to the ICPSR Bibliography of Data-Related Literature. The ICPSR Bibliography links over 80,000 research publications to the ICPSR data on which they are based. "Secondary impact" is a measure of how many times the reuse publications have been cited and is constructed by gathering citation data for all items in the bibliography that are not the original PI's publications. "Diversity" measures the breadth of disciplines that use the data and can similarly be constructed from the bibliography. The project employs multivariate regression analysis and structural equation modeling to determine the relationships among curatorial actions, metadata, the dataset itself, ICPSR services, and reuse and impact. This analysis enables the development of cost models and metrics that allow repository managers to evaluate the return on investment of specific curatorial actions. The project will use these models to inform evidence-based data sharing and archiving policies.
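
As an illustration only, the two impact measures described above could be computed from bibliography records along the following lines. This is a minimal sketch, not the project's implementation: the record fields (dataset, by_original_pi, discipline, times_cited), the sample values, and the choice to operationalize "diversity" as a count of distinct disciplines across all linked publications are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical bibliography records: each row links one publication to one dataset.
# Field names and values are illustrative assumptions, not ICPSR's actual schema.
BIBLIOGRAPHY = [
    {"dataset": "D1", "by_original_pi": True, "discipline": "sociology", "times_cited": 40},
    {"dataset": "D1", "by_original_pi": False, "discipline": "sociology", "times_cited": 12},
    {"dataset": "D1", "by_original_pi": False, "discipline": "economics", "times_cited": 5},
    {"dataset": "D2", "by_original_pi": False, "discipline": "public health", "times_cited": 3},
]


def secondary_impact(records):
    """Sum citations to reuse publications (those not by the original PI), per dataset."""
    totals = defaultdict(int)
    for record in records:
        if not record["by_original_pi"]:
            totals[record["dataset"]] += record["times_cited"]
    return dict(totals)


def diversity(records):
    """Count the distinct disciplines among publications linked to each dataset.

    Assumption: all linked publications count, including the original PI's.
    """
    disciplines = defaultdict(set)
    for record in records:
        disciplines[record["dataset"]].add(record["discipline"])
    return {dataset: len(found) for dataset, found in disciplines.items()}


if __name__ == "__main__":
    print(secondary_impact(BIBLIOGRAPHY))
    print(diversity(BIBLIOGRAPHY))
```

Per-dataset values like these could then serve as outcome variables in the regression analysis, with curatorial actions and dataset features as predictors.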

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hemphill, Libby and Pienta, Amy and Lafia, Sara and Akmon, Dharma and Bleckley, David A. "How do properties of data, their curation, and their funding relate to reuse?" Journal of the Association for Information Science and Technology, v.73, 2022 https://doi.org/10.1002/asi.24646
Lafia, S and Kuhn, W and Caylor, K and Hemphill, L "Mapping research topics at multiple levels of detail" Patterns, 2021 https://doi.org/10.31223/OSF.IO/523EX
Lafia, Sara and Fan, Lizhou and Hemphill, Libby "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature" Proceedings of the Association for Information Science and Technology, v.59, 2022 https://doi.org/10.1002/pra2.614
Lafia, Sara and Thomer, Andrea and Bleckley, David and Akmon, Dharma and Hemphill, Libby "Leveraging Machine Learning to Detect Data Curation Activities" eScience 2021, 2021 https://doi.org/10.1109/eScience51609.2021.00025
Thomer, Andrea K. and Akmon, Dharma and York, Jeremy J. and Tyler, Allison R. and Polasek, Faye and Lafia, Sara and Hemphill, Libby and Yakel, Elizabeth "The Craft and Coordination of Data Curation: Complicating Workflow Views of Data Science" Proceedings of the ACM on Human-Computer Interaction, v.6, 2022 https://doi.org/10.1145/3555139

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

We studied the impacts of data curation on data reuse and developed metrics for measuring those relationships. Data curation organizes, describes, and prepares data for preservation and public use. Curating data for long-term preservation is costly in terms of human time and effort. The intensity of data curation activities and features of datasets contribute to their long-term reuse.

Our findings fall into three areas: curation activities, their effects on data reuse, and ways to measure data reuse. First, data curators spend more time doing quality checks and communicating across their team than they do transforming data or planning their data activities.

Second, we found that datasets were more likely to be reused if they were intensely curated and represented larger, longitudinal studies associated with institutional investigators. The overall level of curation effort and individual curation actions, like attaching subject terms, were correlated with more data use. Institutional funding and improvements to metadata and findability also increased data use. Institutional datasets with more variables also attracted more users. 

Third, we developed a computational pipeline to identify data references in academic literature and described different structural positions data occupy in the network of scientific outputs. Sometimes data serve as connectors across disciplines (crossroads), and sometimes data are more valuable for narrower intellectual communities (subdivisions). 

Together, these findings advance theory in the fields of library and information science and scientometrics. We explain the return on various types of resource investment in data work and make it possible for funders, archives, and researchers to measure the likely impacts of their investments. 

We produced various software tools that (1) enable archives to automatically identify papers that use data in their collections and (2) allow researchers to recognize references to datasets in text. These software products improve the infrastructure for science by helping archives capture the impacts of their work (e.g., identify products the data help produce).

 


Last Modified: 03/15/2023
Modified by: Libby Hemphill

Please report errors in award information by writing to: awardsearch@nsf.gov.
