NSF Award Search: Award # 1443070

Award Abstract # 1443070

CIF21 DIBBs: User Driven Architecture for Data Discovery

NSF Org:	OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:	CORPORATION FOR NATIONAL RESEARCH INITIATIVES
Initial Amendment Date:	August 18, 2014
Latest Amendment Date:	August 18, 2014
Award Number:	1443070
Award Instrument:	Standard Grant
Program Manager:	Amy Walton awalton@nsf.gov (703)292-4538 OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2014
End Date:	April 30, 2018 (Estimated)
Total Intended Award Amount:	$1,484,940.00
Total Awarded Amount to Date:	$1,484,940.00
Funds Obligated to Date:	FY 2014 = $1,484,940.00
History of Investigator:	Giridhar Manepalli (Principal Investigator) gmanepalli@cnri.reston.va.us Laurence Lannom (Co-Principal Investigator) Allison Powell (Co-Principal Investigator)
Recipient Sponsored Research Office:	Corporation for National Research Initiatives (NRI) 1895 PRESTON WHITE DR RESTON VA US 20191-5469 (703)620-8990
Sponsor Congressional District:	11
Primary Place of Performance:	Corporation for National Research Initiatives (NRI) 1895 Preston White Drive Reston VA US 20191-5434
Primary Place of Performance Congressional District:	11
Unique Entity Identifier (UEI):	WQK1UGJYNMD7
Parent UEI:
NSF Program(s):	Info Integration & Informatics, Data Cyberinfrastructure
Primary Program Source:	01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7433, 8048, 8083
Program Element Code(s):	736400, 772600
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The number, size, and availability of scientific datasets have grown enormously over the last few years. As scientific activity becomes more data intensive and collaborative, a key challenge for cross-disciplinary research will be discovery of diverse data sets, managed within distributed repositories and registries. Currently, discovery of information on the Internet is largely performed through automated approaches, characterized by web crawling and associated algorithms, or labor intensive indexing and categorization, such as the National Library of Medicine index for medical literature. Significant amounts of data are housed in repositories that only researchers with expertise in the specific field know about and access.

This project builds a user driven architecture for data discovery (UDADD), a capability that enhances discovery of scientific datasets by building a global index from diverse communities with minimal input. In the UDADD approach, user actions, such as dataset queries or downloads, drive the construction of a global index. These actions are recorded and gathered automatically, through cooperation with repository managers. Two software plugins are provided to help the repositories interact with the UDADD system. The architecture includes ranking techniques based on frequency and recency of use of the datasets.
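
The abstract does not specify how frequency and recency are combined. As a minimal sketch of one common option, the example below sums usage events per dataset and discounts each event exponentially by its age. The function name, the 30-day half-life, and the dataset identifiers are all assumptions made for illustration, not details of the UDADD system.

```python
from collections import defaultdict


def rank_datasets(events, half_life_days=30.0):
    """Rank datasets by a score that rewards frequent and recent use.

    `events` is an iterable of (dataset_id, age_in_days) pairs, one per
    recorded query or download. The half-life is an assumed parameter.
    """
    scores = defaultdict(float)
    for dataset_id, age_days in events:
        # A fresh event counts as 1.0 and loses half its weight
        # every `half_life_days` days.
        scores[dataset_id] += 0.5 ** (age_days / half_life_days)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical usage log: dataset identifier and age of the event in days.
events = [
    ("ocean-temp", 1),
    ("ocean-temp", 2),
    ("hydro-flow", 0),
    ("plant-genome", 90),
    ("plant-genome", 95),
    ("plant-genome", 100),
]
ranking = rank_datasets(events)
for dataset_id, score in ranking:
    print(f"{dataset_id}: {score:.3f}")
```

In this sketch a dataset with two recent events outranks one with three events from about three months earlier, which is the behavior a combined frequency and recency score is meant to capture.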

The pilot architecture will be demonstrated and evaluated using cooperating repositories within the DataNet Federation Consortium. Currently, six science and engineering communities participate in the consortium, including national scale projects in oceanography, social science, cognitive science, hydrology, engineering, and plant biology.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Challenge:

How do you enable discovery of scientific datasets that largely consist of numbers? Existing search engine capabilities rely on textual information to enable keyword searches, which are not useful for number-heavy datasets. Metadata associated with datasets becomes the key if we want to leverage those search engine capabilities. But metadata is not always available and, where available, may not be reliable.

Our Work:

In this project, we have addressed the dataset discovery challenge by relying on information held outside of the datasets: that is, in their usage. By identifying which datasets are used by whom, without using the dataset internals or semantics, we were able to create personalized recommendations. We have created several variations of recommendation approaches, some of which relied purely on usage while others relied additionally on any metadata associated with the datasets. We have designed approaches to create user taste profiles based on the datasets the users have previously interacted with. To create those taste profiles, we leveraged two distinct vector space models: one based on term frequency-inverse document frequency (TF-IDF), a vector space model used predominantly in the information retrieval community, and a second based on word vectors (Doc2Vec), used primarily by the natural language processing community. In addition to creating taste profile vectors, we have also created vectors for each of the datasets. Personalized recommendations are produced by identifying the dataset vectors that are closest, in the vector space, to the taste profile vectors. We have demonstrated that these approaches outperform widely known variations of recommendation solutions that are referred to as collaborative filtering techniques.
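
The report does not describe how the vectors are built, so the following is only a sketch of the general idea under stated assumptions: each dataset vector is a TF-IDF vector over a short metadata description, a taste profile is the average of the vectors of the datasets a user has already used, and closeness is measured by cosine similarity. The helper names, the smoothed IDF formula, and the sample descriptions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map each dataset id to a sparse TF-IDF vector (term -> weight)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    vectors = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[d] = {
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            t: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def taste_profile(used_ids, vectors):
    """Average the vectors of the datasets the user has interacted with."""
    profile = defaultdict(float)
    for d in used_ids:
        for t, w in vectors[d].items():
            profile[t] += w / len(used_ids)
    return dict(profile)


def recommend(used_ids, vectors, k=2):
    """Return the k unused datasets closest to the user's taste profile."""
    profile = taste_profile(used_ids, vectors)
    candidates = [d for d in vectors if d not in used_ids]
    candidates.sort(key=lambda d: cosine(profile, vectors[d]), reverse=True)
    return candidates[:k]


# Hypothetical metadata descriptions, keyed by dataset id.
docs = {
    "sea-surface": "ocean temperature buoy measurements",
    "salinity": "ocean salinity buoy profiles",
    "streamflow": "river discharge gauge measurements",
    "census": "household income survey responses",
}
vectors = tfidf_vectors(docs)
recommendations = recommend(["sea-surface"], vectors, k=2)
print(recommendations)
```

A Doc2Vec variant would follow the same pattern, with learned document embeddings in place of the TF-IDF vectors; only the vector construction step changes, not the profile averaging or the nearest-vector lookup.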

Separately from the recommendation algorithms, we have identified ways to retrieve usage information from any web service that makes datasets available. In particular, if those services are integrated with analytic frameworks such as Piwik and Google Analytics, usage information can be harvested and filtered to remove personally identifying information prior to feeding to recommendation algorithms.
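
As a rough illustration of the filtering step, the sketch below strips identifying fields from harvested usage records and replaces the visitor identifier with a salted hash, so that records from the same visitor can still be grouped. The record layout, field names, and salt are assumptions; they do not reflect the actual export formats of Piwik or Google Analytics, and a salted hash is pseudonymization rather than full anonymization.

```python
import hashlib

# Assumed names of fields that carry personally identifying information.
PII_FIELDS = {"ip", "email", "user_agent"}


def anonymize(record, salt):
    """Drop PII fields and replace the visitor id with a salted hash token."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in PII_FIELDS and key != "visitor_id"
    }
    digest = hashlib.sha256((salt + record["visitor_id"]).encode("utf-8")).hexdigest()
    # A shortened token is enough to group events by the same visitor.
    cleaned["user_token"] = digest[:16]
    return cleaned


# Hypothetical harvested records.
raw = [
    {"visitor_id": "v1", "ip": "192.0.2.10", "email": "a@example.org",
     "dataset_id": "ocean-temp", "action": "download"},
    {"visitor_id": "v1", "ip": "192.0.2.10",
     "dataset_id": "hydro-flow", "action": "view"},
    {"visitor_id": "v2", "ip": "198.51.100.7",
     "dataset_id": "ocean-temp", "action": "view"},
]
usage = [anonymize(record, salt="example-salt") for record in raw]
for row in usage:
    print(row)
```

The cleaned records keep only the dataset identifier, the action, and an opaque user token, which is the minimum a usage-based recommender of the kind described above would need.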

Broader Impact:

The impact of our work is amplified through these three activities:

1. The Vermont forestry community will integrate our solution into their service, which makes hundreds of geoscience-related datasets available to researchers.

2. The United Nations’ International Telecommunication Union (ITU) is evaluating our solutions to recommend global standards, documents, and related material to the public.

3. We have produced a software module that integrates with Cordra, software that is used by a wide variety of organizations, including those in the financial, entertainment, construction, and scientific research domains.

Last Modified: 07/30/2018
Modified by: Giridhar Manepalli

Please report errors in award information by writing to: awardsearch@nsf.gov.
