
Award Abstract # 1816325
III: Small: Domain-Agnostic Dataset Search

NSF Org: IIS, Division of Information & Intelligent Systems
Recipient: LEHIGH UNIVERSITY
Initial Amendment Date: August 4, 2018
Latest Amendment Date: August 4, 2018
Award Number: 1816325
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
NSF Directorate: CSE, Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2018
End Date: July 31, 2022 (Estimated)
Total Intended Award Amount: $515,770.00
Total Awarded Amount to Date: $515,770.00
Funds Obligated to Date: FY 2018 = $515,770.00
History of Investigator:
  • Brian Davison (Principal Investigator)
    bdd3@lehigh.edu
  • Jeffrey Heflin (Co-Principal Investigator)
  • Haiyan Jia (Co-Principal Investigator)
Recipient Sponsored Research Office: Lehigh University
526 BRODHEAD AVE
BETHLEHEM
PA  US  18015-3008
(610)758-3021
Sponsor Congressional District: 07
Primary Place of Performance: Lehigh University
19 Memorial Dr. West
Bethlehem
PA  US  18015-3006
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E13MDBKHLDB5
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923, 9251
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Today, the Web is so large that one cannot imagine finding much information without a web search engine. Similarly, the number of collections of public datasets has grown so large that it is difficult for a researcher to track all of them within his or her discipline, and impossible to do so across disciplines. To help searchers find data in a discipline-agnostic manner, this project will investigate new, promising approaches to full-content dataset search. This research will provide the technology and develop the prototype of a tool that can ultimately help many kinds of scientists locate data that they can use to perform exploratory analysis and test hypotheses. Thus, this work will enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. A dataset search engine using these methods benefits society by helping researchers accelerate their work and reduce duplicated effort. It will also benefit others, such as data journalists, for whom data promises a new source of evidence and story discovery and a new means of storytelling and fact-checking, making reporting both meaningful and trustworthy. This work will help any data analyst locate relevant datasets. This project will also contribute to the training of graduate and undergraduate students. Their involvement will make it possible to broaden participation by underrepresented groups and to develop educational materials. The researchers will incorporate results of this work into courses, including Data Science, Web Search Engines, Data Journalism, and Semantic Web Topics.

Existing dataset search services are cumbersome: they focus on searching descriptions rather than data, and they cater to searchers looking within their own discipline. The project's goal is to develop a prototype dataset search engine incorporating new techniques for full-content indexing to enable searchers to find data across the web, regardless of domain. The investigators will combine principles and novel methods from information retrieval, databases, and data mining. The design and development of the prototype will also take a user-centric approach, involving professionals and practitioners in observational, interview, and experimental studies to inform and guide this process. The outcomes of this work include: (1) The development of new principles, methods, and technologies for the construction of search indexes from hundreds of thousands of real-world public datasets: the researchers will create novel methods for a) full-content indexing and analysis, b) inferring additional metadata, such as attribute names, when the existing descriptors are lacking, and c) inferring additional descriptors that can be used to resolve schema and data heterogeneity. (2) An understanding of searchers' cognitive processes as they search for and consider use of datasets. A social cognitive model will be built to describe human-system interactions in dataset searches, and to predict the effectiveness of the system in various scenarios. (3) The development of novel interfaces to support the search, exploration, and presentation of datasets to such users. Through this process, the researchers will develop a set of instruments for evaluating the dataset search technology and interface from the user's perspective. Research results will be disseminated broadly by presenting and publishing at conferences and in journals, sharing on the web, giving talks, and making developed software open source.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 14)
Chen, Zhiyu and Jia, Haiyan and Heflin, Jeff and Davison, Brian D. "Leveraging Schema Labels to Enhance Dataset Search" 42nd European Conference on Information Retrieval, LNCS, v.12035, 2020. https://doi.org/10.1007/978-3-030-45439-5_18
Chen, Zhiyu and Trabelsi, Mohamed and Heflin, Jeff and Xu, Yinan and Davison, Brian D. "Table Search Using a Deep Contextualized Language Model" 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020. https://doi.org/10.1145/3397271.3401044
Chen, Zhiyu and Trabelsi, Mohamed and Heflin, Jeff and Yin, Dawei and Davison, Brian D. "MGNETS: Multi-Graph Neural Networks for Table Search" Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM), 2021. https://doi.org/10.1145/3459637.3482140
Chen, Zhiyu and Zhang, Shuo and Davison, Brian D. "WTR: A Test Collection for Web Table Retrieval" Proceedings of 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. https://doi.org/10.1145/3404835.3463260
Chen, Zhiyul Trabelsi and Davison, Brian D and Heflin, Jeff "Towards Knowledge Acquisition of Metadata on AI Progress" CEUR Workshop Proceedings, v.2721, 2020.
Heflin, Jeff and Davison, Brian D. and Jia, Haiyan "Exploring Datasets via Cell-Centric Indexing" Proceedings of DESIRES 2021: Design of Experimental Search and Information REtrieval Systems, CEUR Workshop Proceedings, v.2950, 2021.
Jia, Haiyan and Miller, Larrisa I. and Hicks, Jessica and Moscot, Ethan and Landberg, Alissa and Heflin, Jeff and Davison, Brian D. "Truth in a sea of data: adoption and use of data search tools among researchers and journalists" Information, Communication & Society, 2022. https://doi.org/10.1080/1369118X.2022.2147398
Johnson, Drake and Register, Keith and Davison, Brian D. and Heflin, Jeff "An Exploratory Interface for Dataset Repositories Using Cell-Centric Indexing" IEEE International Conference on Big Data (IEEE BigData 2020), 2020. https://doi.org/10.1109/BigData50022.2020.9378057
Qiu, Lixuan and Jia, Haiyan and Davison, Brian D. and Heflin, Jeff "An Architecture for Cell-Centric Indexing of Datasets" CEUR Workshop Proceedings, v.2722, 2020.
Trabelsi, Mohamed and Chen, Zhiyu and Davison, Brian D. and Heflin, Jeff "A Hybrid Deep Model for Learning to Rank Data Tables" 2020 IEEE International Conference on Big Data (Big Data), 2020. https://doi.org/10.1109/BigData50022.2020.9378185
Trabelsi, Mohamed and Chen, Zhiyu and Davison, Brian D. and Heflin, Jeff "Relational Graph Embeddings for Table Retrieval" IEEE International Conference on Big Data: Seventh International Workshop on High Performance Big Graph Data Management, Analysis, and Mining (BigGraphs 2020), 2020. https://doi.org/10.1109/BigData50022.2020.9378239

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Publicly supported research in the current era carries the expectation that the resulting data be made available to others.  Differing processes for making many millions of datasets available, together with the ease of sharing online, have led to data being available across a wide variety of websites.  Search services are now needed to find datasets, especially datasets outside of one's area of expertise.  Existing search services predominantly search only the descriptions (and metadata) of the datasets, and not the data itself.  Our principal hypothesis was that the contents of datasets would provide valuable signals to find relevant results in dataset search.

This project examined what scientists and journalists need to find datasets, using interviews, surveys, and reviews of dataset search services.  We found that dataset search for non-experts remains a challenge: little attention has been paid to non-experts' emerging data needs, which significantly constrains the design of tool features such as searchability, interactivity, and usability that support non-expert data search.

We created a prototype dataset search engine built upon Elasticsearch infrastructure, utilizing a novel full-content, cell-centric indexing scheme.  We scaled the engine to index and search millions of tables from public datasets.
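
As an illustration only, the following Python sketch shows one way a full-content, cell-centric index could be organized: every table cell becomes its own indexed unit that remembers its dataset, column, and row, so a query can match the data itself rather than only a description.  It is a toy in-memory example, not the project's Elasticsearch implementation; the names `Cell`, `build_cell_index`, and `search`, the whitespace tokenizer, and the ranking by number of matching cells are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One indexed unit: a single cell plus the context needed to locate it."""
    dataset: str
    column: str
    row: int
    value: str


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_cell_index(datasets):
    """Build an inverted index from tokens to the cells that contain them.

    `datasets` maps a dataset name to a list of rows, each row being a
    dict of column name -> cell value.
    """
    index = defaultdict(set)
    for name, rows in datasets.items():
        for row_no, row in enumerate(rows):
            for column, value in row.items():
                cell = Cell(name, column, row_no, str(value))
                for token in tokenize(cell.value):
                    index[token].add(cell)
    return index


def search(index, query):
    """Return dataset names ranked by how many distinct cells match the query.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    hits = defaultdict(set)
    for token in tokenize(query):
        for cell in index.get(token, ()):
            hits[cell.dataset].add(cell)
    return sorted(hits, key=lambda name: (-len(hits[name]), name))
```

For example, indexing two small hypothetical tables and searching for a city name returns every dataset in which that value appears in some cell, even if neither dataset's description mentions it.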

During this project, we developed novel techniques to enable better retrieval and ranking of datasets.  We explored new methods for table schema label generation and showed the value of such generated labels for enhancing retrieval.  We published a series of papers showing better representations and content-sensitive retrieval approaches that leveraged an understanding of the implicit relationships of data in tables.  These papers utilized table-specific embeddings, leveraged pretrained language models, constructed a knowledge graph, created a neural architecture incorporating both semantic and relevance matching, built graph neural networks, and designed a structure-aware language model to fuse the textual and structural information of a data table.

We explored how to present dataset search services to users.  We learned about interactive features of value to scientists and data journalists, and performed experimental studies to determine how specific interface features, including dataset preview, customization, and computer-mediated communication, facilitate information processing and enhance user experience.  We also implemented a novel interface to explore a dataset collection, interactively building the corresponding query and visualizing the contents of the matching data.  

In the process of the above, we evaluated the developed systems and techniques.  We also created and published the first benchmark for dataset retrieval and a larger benchmark for web table retrieval so that others can compare performance.

Beyond the scientific outcomes above, this effort has helped train and support five graduate students (two female) and seventeen undergraduates (ten female).  Two doctoral dissertations were produced, in addition to three undergraduate theses, and more than a dozen peer-reviewed publications.

 


Last Modified: 02/12/2023
Modified by: Brian D Davison

Please report errors in award information by writing to: awardsearch@nsf.gov.
