
NSF Org: | SMA SBE Office of Multidisciplinary Activities |
Recipient: |
|
Initial Amendment Date: | July 31, 2019 |
Latest Amendment Date: | July 31, 2019 |
Award Number: | 1930645 |
Award Instrument: | Standard Grant |
Program Manager: | Mary Feeney (SMA SBE Office of Multidisciplinary Activities; SBE Directorate for Social, Behavioral and Economic Sciences) |
Start Date: | December 1, 2019 |
End Date: | November 30, 2022 (Estimated) |
Total Intended Award Amount: | $498,643.00 |
Total Awarded Amount to Date: | $498,643.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
1109 GEDDES AVE STE 3300 ANN ARBOR MI US 48109-1015 (734)763-6438 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
MI US 48106-1248 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | SciSIP-Sci of Sci Innov Policy |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.075 |
ABSTRACT
Access to original research data supports innovative, interdisciplinary, and integrative research, and enables replication and review of prior work. Consequently, a growing number of funding agencies, journal publishers, and scientific societies now require that original research data be shared and archived promptly after collection or publication. However, many questions about the best way to share and archive research data remain unanswered. For instance: How can data repositories best allocate their limited resources across different aspects of data archiving and processing? What is the most effective way of making data usable by the broadest audience? What data sharing policies most effectively achieve stakeholders' transparency and innovation goals? This project answers these questions by studying the impact of different "curatorial actions" (e.g., standardizing variables, improving documentation) on the reuse of data archived by the Inter-university Consortium for Political and Social Research (ICPSR). As one of the largest social science archives in the world and a leader in digital data curation practice, ICPSR is well-suited as a site for this project. ICPSR is also well-positioned to provide funding agencies and policy makers with recommendations for data sharing policies that articulate the metrics needed to evaluate the appropriateness of data sharing and curation plans and their associated costs. This project achieves broader impacts by (1) recommending evidence-based data sharing policies to funders, repository staff, and researchers and (2) improving research data curation practices.
To determine the impact of various curatorial activities on data reuse, the project first defines the different kinds of "curatorial actions" and "impact," and then explains the relationships among actions and impact. To identify curatorial actions and other features of datasets and ICPSR services that influence reuse, the project examines ICPSR's legacy curation logs and use records (such as downloads and citations). Curation logs contain data about specific data transformations or preservation steps. By connecting curation logs to data usage records, the project identifies the actions associated with higher rates of reuse or access. The project examines the utility of two measures of impact, secondary impact and diversity, by comparing use logs to the ICPSR Bibliography of Data-Related Literature. The ICPSR Bibliography links over 80,000 research publications to the ICPSR data on which they are based. "Secondary impact" is a measure of how many times the reuse publications have been cited and is constructed by gathering citation data for all items in the bibliography that are not the original PI's publications. "Diversity" measures the breadth of disciplines that use the data and can similarly be constructed from the bibliography. The project employs multivariate regression analysis and structural equation modeling to determine the relationships among curatorial actions, metadata, the dataset itself, ICPSR services, and reuse and impact. This analysis enables the development of cost models and metrics that allow repository managers to evaluate the return on investment of specific curatorial actions. The project will use these models to inform evidence-based data sharing and archiving policies.
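The two impact measures described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the `Publication` record, its field names, the sample data, and the choice to count distinct disciplines as "diversity" are all assumptions made for illustration. The abstract states only that secondary impact counts citations to reuse publications that are not the original PI's, and that diversity reflects the breadth of disciplines using the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """Hypothetical bibliography entry linking one paper to one dataset."""
    dataset_id: str
    authors: frozenset      # author names (assumed already disambiguated)
    discipline: str         # assumed one discipline label per paper
    citation_count: int     # citations received by this paper


def secondary_impact(pubs, dataset_id, original_pis):
    """Sum citations to papers that reuse the dataset, excluding any
    paper that has an original PI among its authors."""
    return sum(
        p.citation_count
        for p in pubs
        if p.dataset_id == dataset_id and not (p.authors & original_pis)
    )


def diversity(pubs, dataset_id):
    """Count distinct disciplines among papers linked to the dataset.
    Assumption: breadth is measured as a simple distinct count."""
    return len({p.discipline for p in pubs if p.dataset_id == dataset_id})


# Invented sample data, for illustration only.
pubs = [
    Publication("study-A", frozenset({"PI One"}), "sociology", 40),
    Publication("study-A", frozenset({"R Two"}), "economics", 10),
    Publication("study-A", frozenset({"R Three", "R Four"}), "sociology", 5),
    Publication("study-B", frozenset({"R Two"}), "political science", 7),
]

impact_a = secondary_impact(pubs, "study-A", frozenset({"PI One"}))
diversity_a = diversity(pubs, "study-A")
print(impact_a, diversity_a)
```

In this invented example, the PI's own paper is excluded from secondary impact, so only the two reuse papers contribute. A real analysis would also need author disambiguation and a defined discipline taxonomy, neither of which the abstract specifies.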
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
We studied the impacts of data curation on data reuse and developed metrics for measuring those relationships. Data curation organizes, describes, and prepares data for preservation and public use. Curating data for long-term preservation is costly in terms of human time and effort. The intensity of data curation activities and features of datasets contribute to their long-term reuse.
Our findings fall into three areas: curation activities, their effects on data reuse, and ways to measure data reuse. First, we found that data curators spend more time doing quality checks and communicating across their team than they do transforming data or planning their data activities.
Second, we found that datasets were more likely to be reused if they were intensely curated and represented larger, longitudinal studies associated with institutional investigators. The overall level of curation effort and individual curation actions, like attaching subject terms, were correlated with more data use. Institutional funding and improvements to metadata and findability also increased data use. Institutional datasets with more variables also attracted more users.
Third, we developed a computational pipeline to identify data references in academic literature and described different structural positions data occupy in the network of scientific outputs. Sometimes data serve as connectors across disciplines (crossroads), and sometimes data are more valuable for narrower intellectual communities (subdivisions).
Together, these findings advance theory in the fields of library and information science and scientometrics. We explain the return on various types of resource investment in data work and make it possible for funders, archives, and researchers to measure the likely impacts of their investments.
We produced various software tools that (1) enable archives to automatically identify papers that use data in their collections and (2) allow researchers to recognize references to datasets in text. These software products improve the infrastructure for science by helping archives capture the impacts of their work (e.g., identify products the data help produce).
Last Modified: 03/15/2023
Modified by: Libby Hemphill