
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: | |
Initial Amendment Date: | August 15, 2019 |
Latest Amendment Date: | August 15, 2019 |
Award Number: | 1909875 |
Award Instrument: | Standard Grant |
Program Manager: | Hector Munoz-Avila, hmunoz@nsf.gov, (703)292-4481, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2019 |
End Date: | July 31, 2023 (Estimated) |
Total Intended Award Amount: | $443,183.00 |
Total Awarded Amount to Date: | $443,183.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 9500 GILMAN DR LA JOLLA CA US 92093-0021 (858)534-4896 |
Sponsor Congressional District: | |
Primary Place of Performance: | 9500 Gilman Drive La Jolla CA US 92093-0934 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Large-scale analysis of complex, heterogeneous datasets is now an integral part of various social and natural sciences, digital journalism, law, enterprises, and numerous other application domains. Users in such fields are increasingly grappling with the need to perform holistic integrated analytics spanning a variety of data models beyond just structured or semi-structured data to include graph data, text data, etc. Such multi-model data repositories are also growing in volume due to the widespread availability of online data sources such as social media and news media, which have opened up new avenues for insight in various domains. To take advantage of these opportunities, it is necessary to develop joint understanding and processing of at least three data models - relations, graphs, and text - including their evolution over time. This project aims to enable faster and scalable cross-model data analytics.
An emerging information architecture for such heterogeneous data problems is the "polystore" approach that uses multiple "uni-model" backend engines such as RDBMSs, graph DBMSs, Solr, etc., and provides a translation layer in the middle to farm out different parts of a cross-model query to different engines. This approach is gaining popularity because it allows us to exploit the full functionality and native performance of uni-model engines for the corresponding parts of the queries. Amongst polystores, there are loosely-coupled solutions that have a very thin processing layer whose task is to "stitch the parts" together, and that primarily provide support for data placement, movement and transformation. This project will focus on the query architecture and optimization principles for a tighter-coupled polystore. A usable, efficient, and scalable data analytics platform will be designed for queries that span three data models, viz., relations, graphs, and text (including temporal evolution), and that arise from social media and other sources. A cross-model dataflow optimizer will be created for this "tri-store" setting to study fundamental systems optimization principles and will be implemented within the AWESOME polystore system. Further, several novel cross-model query optimization techniques will be devised to exploit the semantics of these three data models. Special attention will be paid to the temporality of data such that the optimizations treat temporal evolution of the data as a first-class primitive and support such queries efficiently on top of the existing engines even though they may lack native support for temporal queries.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
A polystore system is a heterogeneous data management system that sits on top of a number of existing database systems, which typically support different data models, and offers a uniform query and analytics interface over them as if they together constituted a single multi-model information system. In this project, we studied different query processing and optimization strategies for polystore systems, with an application focus on workloads that arbitrarily interleave retrieval and analytical operations over heterogeneous data. We first fully implemented an end-to-end polystore system that supports relational, graph, and text data, all of which can be queried through a dataflow-style polystore language called ADIL. In this implementation, we studied physical and execution-level query optimization and showed that a combination of data parallelization, function fusion, and capability-based rewriting of execution plans achieves better performance than the alternatives. However, contrary to our expectations, we also discovered that pipelined parallelism could not be effectively exploited.
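To make the idea of function fusion concrete, here is a minimal Python sketch. It is not taken from the project's system: the helper names (`run_unfused`, `fuse`, `run_fused`) and the toy text-cleaning stages are assumptions chosen purely for illustration. The sketch contrasts applying each per-record function in its own pass, materializing an intermediate collection each time, with composing the functions so that the data is traversed once.

```python
from typing import Any, Callable, List, Sequence, Tuple

Stage = Callable[[Any], Any]


def run_unfused(records: Sequence[Any], stages: Sequence[Stage]) -> Tuple[List[Any], int]:
    """Apply each stage in a separate pass, materializing a list after every stage."""
    data = list(records)
    passes = 0
    for stage in stages:
        data = [stage(r) for r in data]  # one full traversal per stage
        passes += 1
    return data, passes


def fuse(stages: Sequence[Stage]) -> Stage:
    """Compose the per-record stages into one function (hypothetical helper)."""
    def fused(record: Any) -> Any:
        for stage in stages:
            record = stage(record)
        return record
    return fused


def run_fused(records: Sequence[Any], stages: Sequence[Stage]) -> Tuple[List[Any], int]:
    """Apply the fused function in a single traversal, with no intermediate lists."""
    fused = fuse(stages)
    return [fused(r) for r in records], 1


# Toy text pipeline: trim, lowercase, tokenize (illustrative stages only).
stages = [str.strip, str.lower, str.split]
docs = ["  Polystore Systems ", " Graph AND Text  "]

unfused_result, unfused_passes = run_unfused(docs, stages)
fused_result, fused_passes = run_fused(docs, stages)
```

Both variants produce the same tokens; the fused variant simply avoids the intermediate collections, which is the general motivation for fusing adjacent operators in an execution plan.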
The second part of the project investigated a new cross-model logical query planning strategy. We developed several graph-relational operations and a new logical planner that generates all valid plans for ADIL queries that span graphs and relations. This planner optimizes cross-model join operations and uses a neural learning algorithm to compare alternative plans.
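As a rough illustration of plan enumeration and comparison, the following Python sketch enumerates every join order over three hypothetical sources and ranks them. Everything here is an assumption for illustration: the `Source` class, the row estimates, and the integer selectivity are invented, and the hand-written `toy_cost` function stands in for the neural plan comparator mentioned above, which is not reproduced here.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Source:
    name: str      # hypothetical dataset name
    model: str     # "relation", "graph", or "text"
    est_rows: int  # assumed cardinality estimate


Plan = Tuple[Source, ...]


def enumerate_plans(sources: Sequence[Source]) -> List[Plan]:
    """Return every left-deep join order over the given sources."""
    return [tuple(p) for p in permutations(sources)]


def toy_cost(plan: Plan, selectivity_denominator: int = 100) -> int:
    """Sum of estimated intermediate result sizes (a stand-in cost model).

    Each join is assumed to keep 1/selectivity_denominator of the cross product.
    """
    rows = plan[0].est_rows
    cost = 0
    for nxt in plan[1:]:
        rows = rows * nxt.est_rows // selectivity_denominator
        cost += rows
    return cost


sources = [
    Source("users", "relation", 1_000),
    Source("follows", "graph", 50),
    Source("posts", "text", 10_000),
]

plans = enumerate_plans(sources)
best_plan = min(plans, key=toy_cost)
best_cost = toy_cost(best_plan)
```

Under these assumed estimates, the cheapest orders join the two smaller sources first and bring in the large text collection last. A learned comparator would replace `toy_cost` with a model that scores or ranks pairs of plans.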
Our work has resulted in two issued US patents - one on ADIL-based data ingestion into a polystore and the other on query processing on polystores using composite indices.
Our system is being used in several science disciplines. A Quantum Materials Project is using the system for storing materials synthesis procedures and experimental results. A Physiological Analytics Project is using it for combining time-series sensor data from patients. A Food and Nutrition Security Project is beginning to use it to construct a virtual Knowledge Graph over the polystore to feed interactive recommendations for small businesses.
Based on these practical and user-facing applications, we have identified new theoretical issues that we will pursue in the future.
Last Modified: 11/30/2023
Modified by: Amarnath Gupta