
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 4, 2018 |
Latest Amendment Date: | August 4, 2018 |
Award Number: | 1816325 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2018 |
End Date: | July 31, 2022 (Estimated) |
Total Intended Award Amount: | $515,770.00 |
Total Awarded Amount to Date: | $515,770.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
526 BRODHEAD AVE BETHLEHEM PA US 18015-3008 (610)758-3021 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
19 Memorial Dr. West Bethlehem PA US 18015-3006 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Today, the size of the Web is such that one cannot imagine finding much information without a web search engine. Similarly, the number of collections of public datasets now available has become so large that it is difficult for a researcher to track all of them within his or her discipline, and impossible to do so across disciplines. To help searchers find data in a discipline-agnostic manner, this project will investigate new, promising approaches to full-content dataset search. This research will provide the technology and develop the prototype of a tool that can ultimately assist many kinds of scientists to locate data that they can use to perform exploratory analysis and test hypotheses. Thus, this work will enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. A dataset search engine using these methods benefits society by helping researchers to accelerate their work and reduce duplicate efforts. It will also benefit others, such as data journalists, for whom data promises a new source of evidence and story discovery and a new means of storytelling and fact-checking, making reporting both meaningful and trustworthy. This work will help any data analyst locate relevant datasets. This project will impact the training of graduate students and undergraduates. This involvement will help broaden participation by underrepresented groups and support the development of educational materials. The researchers will incorporate results of this work in courses, including Data Science, Web Search Engines, Data Journalism, and Semantic Web Topics.
Existing dataset search services are cumbersome, focusing on searching descriptions, not data, and cater to searchers looking within their own discipline. The project's goal is to develop a prototype dataset search engine incorporating new techniques for full-content indexing to enable searchers to find data across the web, regardless of domain. The investigators will combine principles and novel methods from information retrieval, databases, and data mining. The design and development of the prototype will also take a user-centric approach, involving professionals and practitioners in observational, interview, and experimental studies to inform and guide this process. The outcomes of this work include: (1) The development of new principles, methods, and technologies for the construction of search indexes from hundreds of thousands of real-world public datasets: the researchers will create novel methods for a) full-content indexing and analysis, b) inferring additional metadata such as attribute names when the existing descriptors are lacking, and c) inferring additional descriptors that can be used to resolve schema and data heterogeneity. (2) The understanding of searchers' cognitive processes as they search for and consider use of datasets. A social cognitive model will be built to describe human-system interactions in dataset searches, and to predict the effectiveness of the system in various scenarios. (3) The development of novel interfaces to support the search, exploration, and presentation of datasets to such users. Through this process, the researchers will develop a set of instruments for evaluating the dataset search technology and interface from the user's perspective. Research results will be disseminated broadly by presenting and publishing at conferences and in journals, sharing on the web, giving talks, and making developed software open source.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Publicly supported research in the current era carries the expectation that the resulting data be made available to others. Differing processes for making many millions of datasets available, together with the ease of sharing online, have led to data being spread across a wide variety of websites. Search services are now needed to find datasets, especially datasets outside of one's area of expertise. Existing search services predominantly search only the descriptions (and metadata) of the datasets, and not the data itself. Our principal hypothesis was that the contents of datasets would provide valuable signals to find relevant results in dataset search.
This project examined what scientists and journalists need to find datasets, using interviews, surveys, and reviews of dataset search services. We found that dataset search remains a challenge for non-experts: little attention has been paid to non-experts' emerging data needs, which significantly constrains the design of tool features such as searchability, interactivity, and usability that support non-expert data search.
We created a prototype dataset search engine built upon Elasticsearch infrastructure, utilizing a novel full-content, cell-centric indexing scheme. We scaled the engine to index and search millions of tables from public datasets.
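To give a rough sense of what indexing table contents at the level of individual cells could look like, the following is a minimal sketch in plain Python. It is not the project's actual scheme, which this report does not describe in detail, and it makes no Elasticsearch calls. The function names `build_cell_index` and `search`, the exact-match scoring, and the two toy tables are all assumptions made purely for illustration.

```python
from collections import defaultdict


def build_cell_index(tables):
    """Map each normalized cell value to the (table_id, column, row) positions holding it.

    Hypothetical illustration: every cell is indexed on its own, so a query can
    match the data itself rather than only a dataset's description.
    """
    index = defaultdict(list)
    for table_id, table in tables.items():
        for row_no, row in enumerate(table["rows"]):
            for column, value in zip(table["header"], row):
                key = str(value).strip().lower()
                index[key].append((table_id, column, row_no))
    return index


def search(index, query):
    """Rank tables by the number of cells that exactly match any query term."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for table_id, _column, _row_no in index.get(term, []):
            scores[table_id] += 1
    # Highest score first; ties broken alphabetically by table id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy tables, invented for this example only.
tables = {
    "air_quality": {
        "header": ["city", "pollutant", "value"],
        "rows": [["Bethlehem", "PM2.5", 8.1], ["Allentown", "PM2.5", 9.4]],
    },
    "schools": {
        "header": ["city", "enrollment"],
        "rows": [["Bethlehem", 13000]],
    },
}

index = build_cell_index(tables)
results = search(index, "bethlehem pm2.5")
print(results)
```

Because each index entry keeps its column name and row number, a fuller system could also use that position to weight matches or to show the matching cells in a result preview. A production engine would additionally need tokenization, relevance scoring, and handling of numeric values, none of which this sketch attempts.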
During this project, we developed novel techniques to enable better retrieval and ranking of datasets. We explored new methods for table schema label generation, and showed the value of such generated labels to enhance retrieval. We published a series of papers showing better representations and content-sensitive retrieval approaches that leveraged an understanding of the implicit relationships of data in tables. These papers utilized table-specific embeddings, leveraged pretrained language models, constructed a knowledge graph, created a neural architecture incorporating both semantic and relevance matching, built graph neural networks, and designed a structure-aware language model to fuse the textual and structural information of a data table.
We explored how to present dataset search services to users. We learned about interactive features of value to scientists and data journalists, and performed experimental studies to determine how specific interface features, including dataset preview, customization, and computer-mediated communication, facilitate information processing and enhance user experience. We also implemented a novel interface to explore a dataset collection, interactively building the corresponding query and visualizing the contents of the matching data.
In the process of the above, we evaluated the developed systems and techniques. We also created and published the first benchmark for dataset retrieval and a larger benchmark for web table retrieval so that others can compare performance.
Beyond the scientific outcomes above, this effort has helped train and support five graduate students (two female) and seventeen undergraduates (ten female). Two doctoral dissertations were produced, in addition to three undergraduate theses, and more than a dozen peer-reviewed publications.
Last Modified: 02/12/2023
Modified by: Brian D Davison