Award Abstract # 1447549
BIGDATA: F: DKM: DKA: Big Data Modeling and Analysis with Depth and Scale

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK
Initial Amendment Date: August 27, 2014
Latest Amendment Date: August 27, 2014
Award Number: 1447549
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2014
End Date: July 31, 2020 (Estimated)
Total Intended Award Amount: $1,500,000.00
Total Awarded Amount to Date: $1,500,000.00
Funds Obligated to Date: FY 2014 = $1,500,000.00
History of Investigator:
  • C. Ramakrishnan (Principal Investigator)
    cram@cs.stonybrook.edu
  • Scott Smolka (Co-Principal Investigator)
  • IV Ramakrishnan (Co-Principal Investigator)
  • Maureen O'Leary (Co-Principal Investigator)
  • Yanhong Liu (Co-Principal Investigator)
Recipient Sponsored Research Office: SUNY at Stony Brook
W5510 FRANKS MELVILLE MEMORIAL LIBRARY
STONY BROOK
NY  US  11794-0001
(631)632-9949
Sponsor Congressional District: 01
Primary Place of Performance: SUNY at Stony Brook
Stony Brook
NY  US  11794-4400
Primary Place of Performance
Congressional District:
01
Unique Entity Identifier (UEI): M746VC6XMNH9
Parent UEI: M746VC6XMNH9
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

An important step in understanding large volumes of data is the construction of a model: a succinct but abstract representation of the phenomenon that produced the data. In order to understand a phenomenon, a data analyst needs to be able to propose a model, evaluate how the proposed model explains the data, and refine the model as new data becomes available. Statistical models, which specify relationships among random variables, have traditionally been used to understand large volumes of noisy data. Logical models have been used widely in databases and knowledge bases for organizing and reasoning with large and complex data sets. This project is aimed at developing a programming language and system for the creation, evaluation and refinement of combined statistical and logical models for the express purpose of understanding very large and complex data sets. Apart from their direct effect on model development for Big Data problems, the semantic foundations and scalable computing infrastructure resulting from this project are expected to directly impact the areas of system development and verification, planning, and optimization, with broad application in Science and Engineering. The tools developed in this project will facilitate the training of a new generation of scientists capable of transforming data into knowledge for use across disciplines. The project's education and outreach component is designed to train select undergraduate students on Big Data modeling and analysis via annual workshops and research mentorship, and graduate students via curriculum modifications, including a specialization in Data Science.

The project will develop Px, a language with well-defined declarative semantics, to support high-level model construction and analysis. Px will be capable of expressing generative and discriminative probabilistic and relational models, and the Px system will support complex queries over such models. The project will encompass three significant and complementary research directions, aimed at developing: (1) semantic foundations, including language constructs needed for succinct specification of complex models with rich logical and statistical structure; (2) scalable inference techniques combining exact and approximate methods, and query optimizations over combined logic/statistical models; and (3) programming extensions as well as static and dynamic analyses to support the creation and refinement of complex models. The Px language and system will be evaluated using two important and diverse application problems: (1) analysis and verification of infinite-state probabilistic systems, including parameterized systems, and (2) construction of phylogenetic trees from phenomic data, used in the Tree of Life project, for mapping the evolutionary history of organisms. The project is expected to make significant contributions towards creating a unifying framework combining probabilistic inference, logical inference, and constraint processing, with an emphasis on semantic clarity, efficiency, and scalability. The project will also demonstrate the practical utility of the proposed integrated framework by developing complex models from big data that take advantage of this technology in fundamental ways.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


(Showing: 1 - 10 of 56)
Arun Nampally, Timothy Zhang, C. R. Ramakrishnan. "Constraint-Based Inference in Probabilistic Logic Programs." International Conference on Logic Programming (ICLP), 2018
Arun Nampally, C. R. Ramakrishnan. "Inference in Probabilistic Logic Programs using Lifted Explanations." International Conference on Logic Programming (ICLP), 2016
A. Lukina, L. Esterle, C. Hirsch, E. Bartocci, J. Yang, A. Tiwari, S. A. Smolka, and R. Grosu. "ARES: Adaptive Receding-Horizon Synthesis of Optimal Plans." Proceedings of TACAS 2017: 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2017
A. Lukina, S. A. Smolka, A. Tiwari, R. Grosu. "Distributed Adaptive-Neighborhood Control for Stochastic Reachability in Multi-Agent Systems." Proceedings of SAC 2019, 34th ACM/SIGAPP Symposium on Applied Computing, Intelligent Robotics and Multi-Agent Systems (IRMAS) track, 2019
Andrii Soviak, Anatoliy Borodin, Vikas Ashok, Yevgen Borodin, Yury Puzis, I. V. Ramakrishnan. "Tactile Accessibility: Does Anyone Need a Haptic Glove?" 18th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '16), 2016
Arun Nampally, C. R. Ramakrishnan. "Constraint-Based Inference in Probabilistic Logic Programs." International Workshop on Probabilistic Logic Programming, Cork, Ireland, 2015, p. 46
A. Tiwari, S. A. Smolka, L. Esterle, A. Lukina, J. Yang, and R. Grosu. "Attacking the V: On the Resiliency of Adaptive-Horizon MPC." Proceedings of ATVA 2017, 15th International Symposium on Automated Technology for Verification and Analysis, Pune, India, 2017
A. Tiwari, S. A. Smolka, L. Esterle, A. Lukina, J. Yang, and R. Grosu. "Resilient Control for Cyber-Physical Systems." Proceedings of MT-CPS 2018, Third Workshop on Monitoring and Testing of Cyber-Physical Systems, 2018
C. Jegourel, A. Lukina, A. Legay, S. A. Smolka, R. Grosu, and E. Bartocci. "Feedback Control for Statistical Model Checking of Cyber-Physical Systems." Proceedings of ISoLA 2016, Eighth International Symposium on Leveraging Applications, 2016
D. Phan, J. Yang, M. Clark, R. Grosu, J. D. Schierman, S. A. Smolka, and S. D. Stoller. "A Component-Based Simplex Architecture for High-Assurance Cyber-Physical Systems." Proceedings of ACSD 2017: 17th International Conference on Application of Concurrency to System Design, 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

An important step in understanding large volumes of data is the construction of  a model: a succinct but abstract representation of the phenomenon that produced the data.   This project focused on a logic-based language for representing complex models, and procedures supporting expressive queries over the models.  

Intellectual Merit:  We developed the formal semantics and inference procedures in a high-level language capable of combining relational models with statistical models.  In particular, we developed a suite of techniques for scalable inference including approximate inference, constraint-based inference to abstractly represent subsets of large domains, and lifted inference for reasoning over large systems with conditionally independent components.  We extended this language to be capable of modeling agent-based systems, and developed a framework for verifying an expressive set of branching-time properties for such systems.  
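
The techniques above refine exact inference over a distribution defined by independent probabilistic facts. As a minimal illustrative sketch (a toy in Python, not the project's actual language or system), the following enumerates every possible world of two probabilistic facts and sums the weights of the worlds in which a query holds:

```python
import itertools

# Toy probabilistic logic model (hypothetical example):
# two independent probabilistic facts and one rule,
#   path :- edge_ab, edge_bc.
facts = {"edge_ab": 0.7, "edge_bc": 0.4}

def query_prob(holds):
    """Exact inference: enumerate every truth assignment (possible world),
    weight it by the probabilities of the facts, and sum the weights of
    the worlds in which the query holds."""
    names = list(facts)
    total = 0.0
    for world in itertools.product([True, False], repeat=len(names)):
        env = dict(zip(names, world))
        weight = 1.0
        for name, value in env.items():
            weight *= facts[name] if value else 1.0 - facts[name]
        if holds(env):
            total += weight
    return total

# P(path) = P(edge_ab) * P(edge_bc), since the facts are independent
p_path = query_prob(lambda env: env["edge_ab"] and env["edge_bc"])
# p_path == 0.28 (up to floating point)
```

Enumeration is exponential in the number of facts; constraint-based and lifted inference gain scalability by representing sets of worlds symbolically rather than one at a time. The sketch only fixes the semantics being computed.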

We studied the problem of analyzing high-level models for performing tasks other than traditional inference.  One such problem was to consider "what if" questions: characterizing the qualitative and quantitative improvements to query answers if some aspects of the model or certain pieces of data were better known.  This is related to the "Value of Information" (VoI) problem from decision theory.  While the VoI optimization problem is intractable in general, we developed a set of algorithms for it under different conditions, which also yield efficient solutions in the cases where the problem is tractable.
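
The core VoI idea can be illustrated with a textbook-style toy: the value of learning a piece of information is the expected gain in the best achievable payoff once that information is known. A hypothetical Python sketch (states and payoffs are made up for illustration):

```python
# Hypothetical decision problem: two possible states with prior
# probabilities, and two candidate actions with state-dependent payoffs.
prior = {"flu": 0.6, "cold": 0.4}
payoff = {
    "treat": {"flu": 100, "cold": 0},
    "wait":  {"flu": 20,  "cold": 80},
}

def expected_payoff(action):
    """Expected payoff of an action under the prior over states."""
    return sum(prior[s] * payoff[action][s] for s in prior)

# Best achievable acting on the prior alone
best_without_info = max(expected_payoff(a) for a in payoff)   # 60.0

# Best achievable if the true state were revealed before acting
best_with_info = sum(prior[s] * max(payoff[a][s] for a in payoff)
                     for s in prior)                          # 92.0

# Expected value of perfect information: what knowing the state is worth
evpi = best_with_info - best_without_info                     # 32.0
```

Optimizing VoI in a rich logical/statistical model is much harder than this two-state example, since the "information" to acquire ranges over structured pieces of the model and data; the sketch only pins down the quantity being optimized.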

We also introduced neural state classification for analyzing reachability properties of probabilistic and hybrid systems, posing the verification problem as the construction of a classifier using neural networks (NN).  We defined the Neural Simplex Architecture (NSA), a framework for incorporating NN-based controllers into a safety-critical system without violating its verified safe behavior.  In this architecture, a verified baseline controller (BC) is supplemented with an NN-based controller; the BC is called into service whenever the NN-based controller is expected to generate unsafe behavior, and during such times the NN-based controller learns from the BC's actions, improving its ability to keep the controlled system within safe regions.  To generate better training data for NN-based controllers, we developed a next-generation method for constructing Lagrangian reach tubes that estimate the future behaviors of a hybrid system.
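
The switching logic at the heart of a simplex-style architecture can be sketched in a few lines. This is a schematic illustration under assumed interfaces; the controller and safety-check functions here are hypothetical stand-ins, not the NSA implementation:

```python
def simplex_step(state, nn_controller, baseline_controller, is_safe):
    """One decision step of a simplex-style architecture (schematic).
    The NN controller's proposed action is used only if it passes the
    safety check; otherwise the verified baseline controller takes over.
    In NSA, the baseline's action during a takeover can also be logged
    as a training example for retraining the NN controller online."""
    action = nn_controller(state)
    if is_safe(state, action):
        return action, "nn"
    return baseline_controller(state), "baseline"

# Toy usage with stand-in controllers and a toy safe-region check
action, source = simplex_step(
    state=4,
    nn_controller=lambda s: s + 2,         # proposes action 6
    baseline_controller=lambda s: 0,       # verified fallback
    is_safe=lambda s, a: abs(s + a) < 5,   # next state must stay in (-5, 5)
)
# action == 0, source == "baseline": the NN proposal was rejected as unsafe
```

The design point is that only `baseline_controller` and `is_safe` need to be formally verified; the NN controller can be arbitrarily complex because its output never reaches the plant unchecked.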

Systems for logical reasoning (even without the addition of statistical reasoning) may be based on different, and sometimes incompatible, semantics.  We developed Founded Semantics as an approach to unifying the existing semantics, and extended it to a Constraint Semantics that supports unrestricted negation as well as unrestricted existential and universal quantification.  These unified approaches enable an analyst to build models whose components rely on assumptions stemming from different semantics.  We also developed a unified semantics for recursive rules with aggregation, extending the unified founded semantics and constraint semantics for recursive rules with negation.
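
A small illustration of why evaluation order matters for rules with negation: in a stratified program, negation is applied only after the stratum it depends on has been fully computed. A toy Python sketch (a hypothetical example of stratified evaluation, not the Founded or Constraint Semantics algorithms themselves):

```python
# Toy program (hypothetical), stratified into two layers:
#   reach(a).
#   reach(Y) :- reach(X), edge(X, Y).        -- stratum 1 (no negation)
#   unreachable(X) :- node(X), not reach(X). -- stratum 2 (uses negation)
nodes = {"a", "b", "c"}
edges = {("a", "b")}

# Stratum 1: least fixpoint of the positive rules
reach = {"a"}
changed = True
while changed:
    changed = False
    for (x, y) in edges:
        if x in reach and y not in reach:
            reach.add(y)
            changed = True

# Stratum 2: negation is evaluated only against the completed stratum 1
unreachable = {n for n in nodes if n not in reach}
# reach == {"a", "b"}, unreachable == {"c"}
```

Stratification breaks down when negation is used recursively or under quantifiers; handling such unrestricted programs with a single well-defined semantics is precisely what the unified approaches above address.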

Challenge applications for the project's fundamental developments in modeling and reasoning were formal methods and verification of complex systems, described earlier, and the construction of phylogenetic trees from phenomic data, used in the Tree of Life project for mapping the evolutionary history of organisms.  For the latter, we investigated the use of combined logical/statistical models for detecting similarities in taxonomic data, to assist in data cleaning and organization in MorphoBank, an open-source tool for storing and sharing phenotypic and morphological data of species.

Broader Impact:  This project led to several developments that are immediately relevant outside its primary research area.  We applied the fundamental research results to a number of problems in the Health IT domain.  Rapid response events (RREs) correspond to deteriorating conditions of a patient that, if not attended to immediately, can be fatal.  Based on our work on optimizing Value of Information, we are developing models to predict RREs as early as possible, potentially leading to better care and improved patient outcomes.  Such an "early warning" system was also developed to predict sepsis, a life-threatening condition brought on by the body's response to infection and one of the leading causes of death in hospitals.

We maintained MorphoBank with 24/7/365 uptime, fixed bugs, and implemented software upgrades, and we trained minority undergraduate computer science majors through summer internships.  MorphoBank retained a diverse staff with a female director and an African American lead software developer, and made strides towards sustainability through a collaboration with Phoenix Bioinformatics.

In terms of human resources, the project led to the training of more than 100 graduate students in the basics of probabilistic logic programs and in modeling complex systems using them.  Within its larger research community, developments from this project were the basis for the Applications of Logic Programming workshop held as a satellite event of the International Conference on Logic Programming 2016, and for two subsequent Logic and Practice of Programming workshops, held in 2018 and scheduled for November 2020.

 


Last Modified: 08/28/2020
Modified by: C. R Ramakrishnan

Please report errors in award information by writing to: awardsearch@nsf.gov.
