Award Abstract # 1218524
III: Small: High-Performance Complex Processing of Continuous Uncertain Data

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF MASSACHUSETTS
Initial Amendment Date: August 30, 2012
Latest Amendment Date: May 6, 2013
Award Number: 1218524
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
(703)292-7347
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2012
End Date: August 31, 2017 (Estimated)
Total Intended Award Amount: $495,961.00
Total Awarded Amount to Date: $511,961.00
Funds Obligated to Date: FY 2012 = $495,961.00
FY 2013 = $16,000.00
History of Investigator:
  • Yanlei Diao (Principal Investigator)
    yanlei@cs.umass.edu
  • Anna Liu (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Massachusetts Amherst
101 COMMONWEALTH AVE
AMHERST
MA  US  01003-9252
(413)545-0698
Sponsor Congressional District: 02
Primary Place of Performance: University of Massachusetts Amherst
CompSci 140 Governors Drive
Amherst
MA  US  01003-9264
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): VGJHK59NMPK9
Parent UEI: VGJHK59NMPK9
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923, 9251
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The objective of this project is to design and develop a data management system that supports query processing on continuous uncertain data by returning a full probability distribution of query output and optimizes such processing for performance. This project includes four thrusts: (1) supporting continuous uncertain data processing using both the traditional relational model and the array model; (2) addressing complex correlation that arises in continuous uncertain data processing using new statistical graphical models; (3) supporting arbitrary user-defined functions, besides standard query operations, by exploring advanced techniques such as Gaussian processes and functional interpolation; and (4) developing a prototype system and evaluating it using real-world applications. Expected results include statistical models and techniques, data storage schemes, query processing and optimization techniques, and a publicly available prototype to fully support query processing on continuous uncertain data.

The results of the project can benefit applications such as severe weather monitoring and computational astrophysics, as well as the broader scientific community. Since applications such as tornado detection may trigger actions based on derived information, the ability to characterize uncertainty of output may result in significant social impacts. This project also integrates research and education with curriculum development and engaging women in research through college outreach and CRA's distributed mentor program. The results of the project are disseminated at the project web site: http://claro.cs.umass.edu.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao "AIDE: An Active Learning-based Approach for Interactive Data Exploration" IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016
Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao "Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration" ACM SIGMOD Conference, 2014, p.517-528
Liping Peng and Yanlei Diao "Supporting Data Uncertainty in Array Databases" ACM SIGMOD Conference, 2015, p.545-560
Olga Papaemmanouil, Yanlei Diao, Kyriaki Dimitriadou, and Liping Peng "Interactive Data Exploration via Machine Learning Models" IEEE Data Eng. Bull., v.39, 2016, p.38-49
Thanh T. L. Tran, Yanlei Diao, Charles Sutton, and Anna Liu "Supporting User-Defined Functions on Uncertain Data" Proceedings of the VLDB Endowment (PVLDB), v.6, 2013, p.469-480
Yanlei Diao, Kyriaki Dimitriadou, Zhan Li, Wenzhao Liu, Olga Papaemmanouil, Kemi Peng, and Liping Peng "AIDE: An Automatic User Navigation System for Interactive Data Exploration" Proceedings of the VLDB Endowment (PVLDB), v.8, 2015

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The overall research goal of our proposal was to design, develop, and evaluate a data management system that provides fundamental support for query processing on large sensor and scientific datasets, which often involve considerable uncertainty in the content of the data as well as in the query evaluation process. Our work differs from prior work in three key aspects: (1) It supports both relational algebra and array algebra for query processing, where the uncertainty in query processing arises from the uncertainty of data content. Supporting both algebras makes the system applicable to a broader range of scientific domains. (2) Our work provides efficient algorithms not only for algebraic operators, but also for user-defined functions, which are prevalent in real-world applications and hard to support. (3) We further broadened our project to support uncertainty in the user's data interest itself through interactive data exploration, which combines machine learning techniques and database optimizations.

Results of this project significantly advanced the state of the art with the following contributions: (1) Our work supports uncertain data management in both relational and array databases. For array databases, our project is the first to provide the formal semantics of array operations on uncertain data. We also provide efficient algorithms for these array operations, which can outperform existing methods by as much as one to two orders of magnitude in efficiency. (2) Besides algebraic operators, our work also supports user-defined functions (UDFs) on uncertain data. Our approach based on Gaussian processes (GPs) characterizes the UDF output using probability distributions and error bounds, which is the first result to quantify output distributions of Gaussian processes with error bounds. In addition, our optimization techniques allow our GP techniques to offer up to two orders of magnitude speedup over Monte Carlo (MC) sampling. (3) To support uncertainty in the user's data interest, our interactive data exploration techniques outperform traditional active learning and random sampling in both accuracy and interactive performance. Our user study results further reveal that, compared to the manual exploration approach, our system can reduce the user labeling effort by up to 87%, with an average reduction of 66%.
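
To make the MC sampling baseline mentioned in contribution (2) concrete, the following is a minimal Python sketch of how the output distribution of a UDF on an uncertain input can be estimated by Monte Carlo sampling. It is not the project's GP-based method. The function names, the example UDF, and the assumption of a Gaussian-distributed input are all hypothetical and chosen only for illustration.

```python
import bisect
import math
import random


def example_udf(x: float) -> float:
    """Hypothetical black-box UDF (an assumption, not from the project)."""
    return math.sin(x) + 0.1 * x * x


def mc_output_samples(f, mean, std, n_samples=10_000, seed=0):
    """Draw inputs from N(mean, std^2), apply f, and return sorted outputs.

    The sorted outputs form an empirical approximation of the full
    output distribution of f on the uncertain input.
    """
    rng = random.Random(seed)
    return sorted(f(rng.gauss(mean, std)) for _ in range(n_samples))


def empirical_cdf(sorted_outputs, y):
    """Estimate P(f(X) <= y) as the fraction of samples not exceeding y."""
    return bisect.bisect_right(sorted_outputs, y) / len(sorted_outputs)


if __name__ == "__main__":
    outputs = mc_output_samples(example_udf, mean=1.0, std=0.5)
    median = outputs[len(outputs) // 2]
    print("approximate median of UDF output:", median)
    print("estimated P(output <= 1.0):", empirical_cdf(outputs, 1.0))
```

Each sample requires one evaluation of the UDF, so the cost grows linearly with the number of samples; this is the kind of cost that a cheaper surrogate model of the UDF would aim to reduce.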

For the broader scientific community, our proposed techniques for uncertain data processing have the potential to add fundamental support for reasoning about the quality of results computed from uncertain data. Our work on supporting query uncertainty through interactive data exploration will significantly increase the utility of the database when users explore large scientific databases that have complex structure and content, and when those users have imprecise goals. As such, our project will increase both the quality of analytical results computed from uncertain data and the utility of the database when the user's data interest cannot be precisely stated upfront; both benefits will be of significant importance to the scientific community for data-driven discovery. Besides research activities, this project also involved a number of educational efforts, including an integrated undergraduate and graduate curriculum on data analytics and statistical analysis, and outreach and mentoring activities to engage women in research.

Last Modified: 12/15/2017
Modified by: Yanlei Diao

Please report errors in award information by writing to: awardsearch@nsf.gov.