Award Abstract # 1909875
III:Small: Towards Cross-Model Query Optimizations for Multi-model Heterogeneous Data Analytics

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CALIFORNIA, SAN DIEGO
Initial Amendment Date: August 15, 2019
Latest Amendment Date: August 15, 2019
Award Number: 1909875
Award Instrument: Standard Grant
Program Manager: Hector Munoz-Avila
hmunoz@nsf.gov
(703) 292-4481
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: August 15, 2019
End Date: July 31, 2023 (Estimated)
Total Intended Award Amount: $443,183.00
Total Awarded Amount to Date: $443,183.00
Funds Obligated to Date: FY 2019 = $443,183.00
History of Investigator:
  • Amarnath Gupta (Principal Investigator)
    gupta@sdsc.edu
  • Arun Kumar (Co-Principal Investigator)
Recipient Sponsored Research Office: University of California-San Diego
9500 GILMAN DR
LA JOLLA
CA  US  92093-0021
(858)534-4896
Sponsor Congressional District: 50
Primary Place of Performance: University of California-San Diego
9500 Gilman Drive
La Jolla
CA  US  92093-0934
Primary Place of Performance Congressional District: 50
Unique Entity Identifier (UEI): UYTTZT6G9DT1
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7364
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Large-scale analysis of complex, heterogeneous datasets is now an integral part of various social and natural sciences, digital journalism, law, enterprises, and numerous other application domains. Users in such fields increasingly need to perform holistic, integrated analytics spanning a variety of data models, going beyond structured or semi-structured data to include graph data, text data, etc. Such multi-model data repositories are also growing in volume due to the widespread availability of online data sources such as social media and news media, which have opened up new avenues for insight in various domains. To take advantage of these opportunities, it is necessary to develop joint understanding and processing of at least three data models (relations, graphs, and text), including their evolution over time. This project aims to enable faster and more scalable cross-model data analytics.

An emerging information architecture for such heterogeneous data problems is the "polystore" approach, which uses multiple "uni-model" backend engines such as RDBMSs, graph DBMSs, Solr, etc., and provides a translation layer in the middle to farm out different parts of a cross-model query to different engines. This approach is gaining popularity because it allows us to exploit the full functionality and native performance of uni-model engines for the corresponding parts of the queries. Amongst polystores, there are loosely coupled solutions that have a very thin processing layer whose task is to "stitch the parts" together, and that primarily provide support for data placement, movement, and transformation. This project will focus on the query architecture and optimization principles for a more tightly coupled polystore. A usable, efficient, and scalable data analytics platform will be designed for queries spanning three data models, viz., relations, graphs, and text (including temporal evolution), that arise from social media and other sources. A cross-model dataflow optimizer will be created for this "tri-store" setting to study fundamental systems optimization principles and will be implemented within the AWESOME polystore system. Further, several novel cross-model query optimization techniques will be devised to exploit the semantics of these three data models. Special attention will be paid to the temporality of data, such that the optimizations treat temporal evolution of the data as a first-class primitive and support such queries efficiently on top of the existing engines even though they may lack native support for temporal queries.
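
As a rough illustration of the "translation layer" idea described above, the following Python sketch routes each part of a cross-model query to an engine chosen by its data model and concatenates the results. It only shows the loosely coupled "stitch the parts" pattern, not the cross-model optimizations this project targets. All names here (SubQuery, ENGINES, dispatch, and the three run_* functions) are hypothetical stand-ins and are not part of AWESOME or any real engine's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuery:
    model: str    # "relation", "graph", or "text"
    payload: str  # engine-specific query text (illustrative only)


# Hypothetical stand-ins for real backends (an RDBMS, a graph DBMS, a text index).
# Each one just echoes what it was asked to run.
def run_relational(q: str) -> List[dict]:
    return [{"engine": "rdbms", "query": q}]


def run_graph(q: str) -> List[dict]:
    return [{"engine": "graphdb", "query": q}]


def run_text(q: str) -> List[dict]:
    return [{"engine": "text-index", "query": q}]


# Assumed routing table: one uni-model engine per data model.
ENGINES: Dict[str, Callable[[str], List[dict]]] = {
    "relation": run_relational,
    "graph": run_graph,
    "text": run_text,
}


def dispatch(parts: List[SubQuery]) -> List[dict]:
    """Send each part to the engine for its data model and concatenate results."""
    results: List[dict] = []
    for part in parts:
        engine = ENGINES[part.model]
        results.extend(engine(part.payload))
    return results


plan = [
    SubQuery("text", "body:wildfire"),
    SubQuery("graph", "MATCH (u)-[:MENTIONS]->(t) RETURN u"),
    SubQuery("relation", "SELECT id FROM users"),
]
results = dispatch(plan)
print([r["engine"] for r in results])
```

A more tightly coupled design would place an optimizer between the plan and the dispatch step, so that the parts could be reordered or rewritten before any engine is called.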

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Zheng, Xiuwen and Gupta, Amarnath. "An Algebraic Approach for High-level Text Analytics." 32nd International Conference on Scientific and Statistical Database Management, 2020. DOI: 10.1145/3400903.3400926

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A polystore system is a heterogeneous data management system that sits on top of a number of existing database systems that typically support different data models, and offers a uniform query and analytics interface over them as if they together constituted a single multi-model information system. In this project, we studied different query processing and optimization strategies for polystore systems with an application focus on workloads that arbitrarily interleave retrieval and analytical operations for heterogeneous data. We first fully implemented an end-to-end polystore system that supports relational, graph, and text data that can be queried through a dataflow-style polystore language called ADIL. In this implementation, we studied physical and execution-level query optimization and showed that a combination of data parallelization, function fusion, and capability-based rewriting of execution plans achieves better performance than alternatives. However, we also discovered that, contrary to our expectations, pipelined parallelism could not be effectively exploited.
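
Function fusion, one of the techniques named above, can be illustrated in general terms with a small Python sketch: instead of running each operator as a separate pass that materializes an intermediate result, the operators are composed into one function and applied in a single pass. This is a generic illustration of the idea, not the ADIL or AWESOME implementation; the helper names (fuse, run_unfused, run_fused) and the sample data are assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence


def fuse(*funcs: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose element-wise functions, applied left to right, into one function."""
    def fused(x: Any) -> Any:
        for f in funcs:
            x = f(x)
        return x
    return fused


def run_unfused(data: Sequence[Any], funcs: Sequence[Callable[[Any], Any]]) -> List[Any]:
    """One full pass per operator; each pass builds an intermediate list."""
    current = list(data)
    for f in funcs:
        current = [f(x) for x in current]
    return current


def run_fused(data: Sequence[Any], funcs: Sequence[Callable[[Any], Any]]) -> List[Any]:
    """A single pass over the data using the composed function."""
    g = fuse(*funcs)
    return [g(x) for x in data]


# Hypothetical text-cleaning pipeline: trim, lowercase, then measure length.
tokens = ["  Graph ", "TEXT", " Relation"]
pipeline = [str.strip, str.lower, len]

print(run_unfused(tokens, pipeline))
print(run_fused(tokens, pipeline))
```

Both functions return the same result; the fused version avoids building one intermediate list per operator, which is where the saving would come from on large inputs.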

The second part of the project investigated a new cross-model logical query planning strategy. We developed several graph-relational operations and a new logical planner that generates all valid plans for ADIL queries that span graphs and relations. This planner optimizes cross-model join operations and uses a neural learning algorithm to compare alternative plans.
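
The "generate all valid plans, then compare them" pattern described above can be sketched with a toy Python example. Note the substitutions: the planner in this project compares plans with a neural learning algorithm, whereas this sketch uses a simple hand-written cost estimate in its place, and it models a plan as a linear ordering of operators rather than a full join tree. The operator names, costs, selectivities, and dependency table are all invented for illustration.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical operators with made-up per-row costs and selectivities.
OPS: Dict[str, Dict[str, float]] = {
    "filter_rel": {"cost_per_row": 1.0, "selectivity": 0.1},
    "graph_match": {"cost_per_row": 5.0, "selectivity": 0.5},
    "cross_join": {"cost_per_row": 2.0, "selectivity": 1.0},
}

# Assumed constraint: the cross-model join needs both inputs prepared first.
DEPENDS_ON: Dict[str, Set[str]] = {"cross_join": {"filter_rel", "graph_match"}}


def is_valid(order: Tuple[str, ...]) -> bool:
    """A plan is valid if every operator runs after all of its dependencies."""
    seen: Set[str] = set()
    for op in order:
        if not DEPENDS_ON.get(op, set()) <= seen:
            return False
        seen.add(op)
    return True


def valid_plans() -> List[Tuple[str, ...]]:
    """Enumerate every operator ordering and keep the valid ones."""
    return [p for p in permutations(OPS) if is_valid(p)]


def estimated_cost(order: Tuple[str, ...], input_rows: float = 1000.0) -> float:
    """Toy cost: rows entering each operator times that operator's per-row cost."""
    rows, total = input_rows, 0.0
    for op in order:
        total += rows * OPS[op]["cost_per_row"]
        rows *= OPS[op]["selectivity"]
    return total


best = min(valid_plans(), key=estimated_cost)
print(best, estimated_cost(best))
```

Exhaustive enumeration is only practical for small queries, since the number of orderings grows factorially with the number of operators; the point of the sketch is the separation between plan generation and plan comparison.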

Our work has resulted in two issued US patents - one on ADIL-based data ingestion into a polystore and the other on query processing on polystores using composite indices.

Our system is being used in several science disciplines. A Quantum Materials Project is using the system for storing materials synthesis procedures and experimental results. A Physiological Analytics Project is using it for combining time-series sensor data from patients. A Food and Nutrition Security Project is beginning to use it to construct a virtual Knowledge Graph over the polystore to feed interactive recommendations for small businesses.

Based on these practical and user-facing applications, we have identified new theoretical issues that we will pursue in the future.


Last Modified: 11/30/2023
Modified by: Amarnath Gupta
