
NSF Org: IIS Division of Information & Intelligent Systems
Initial Amendment Date: August 27, 2014
Latest Amendment Date: August 27, 2014
Award Number: 1447549
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler, sspengle@nsf.gov, (703) 292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2014
End Date: July 31, 2020 (Estimated)
Total Intended Award Amount: $1,500,000.00
Total Awarded Amount to Date: $1,500,000.00
Recipient Sponsored Research Office: W5510 FRANKS MELVILLE MEMORIAL LIBRARY, STONY BROOK, NY, US 11794-0001, (631) 632-9949
Primary Place of Performance: Stony Brook, NY, US 11794-4400
NSF Program(s): Big Data Science & Engineering
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
An important step in understanding large volumes of data is the construction of a model: a succinct but abstract representation of the phenomenon that produced the data. In order to understand a phenomenon, a data analyst needs to be able to propose a model, evaluate how the proposed model explains the data, and refine the model as new data becomes available. Statistical models, which specify relationships among random variables, have traditionally been used to understand large volumes of noisy data. Logical models have been used widely in databases and knowledge bases for organizing and reasoning with large and complex data sets. This project is aimed at developing a programming language and system for the creation, evaluation, and refinement of combined statistical and logical models for the express purpose of understanding very large and complex data sets. Apart from its direct effect on model development for Big Data problems, the semantic foundations and scalable computing infrastructure resulting from this project are expected to directly impact the areas of system development and verification, planning, and optimization, with broad application in science and engineering. The tools developed in this project will facilitate the training of a new generation of scientists capable of transforming data into knowledge for use across disciplines. The project's education and outreach component is designed to train select undergraduate students on Big Data modeling and analysis via annual workshops and research mentorship, and graduate students via curriculum modifications, including a specialization in Data Science.
The project will develop Px, a language with well-defined declarative semantics, to support high-level model construction and analysis. Px will be capable of expressing generative and discriminative probabilistic and relational models, and the Px system will support complex queries over such models. The project will encompass three significant and complementary research directions, aimed at developing: (1) semantic foundations, including language constructs needed for succinct specification of complex models with rich logical and statistical structure; (2) scalable inference techniques combining exact and approximate methods, and query optimizations over combined logic/statistical models; and (3) programming extensions as well as static and dynamic analyses to support the creation and refinement of complex models. The Px language and system will be evaluated using two important and diverse application problems: (1) analysis and verification of infinite-state probabilistic systems, including parameterized systems, and (2) construction of phylogenetic trees from phenomic data, used in the Tree of Life project, for mapping the evolutionary history of organisms. The project is expected to make significant contributions towards creating a unifying framework combining probabilistic inference, logical inference, and constraint processing, with an emphasis on semantic clarity, efficiency, and scalability. The project will also demonstrate the practical utility of the proposed integrated framework by developing complex models from big data that take advantage of this technology in fundamental ways.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
An important step in understanding large volumes of data is the construction of a model: a succinct but abstract representation of the phenomenon that produced the data. This project focused on a logic-based language for representing complex models, and procedures supporting expressive queries over the models.
Intellectual Merit: We developed the formal semantics and inference procedures for a high-level language capable of combining relational models with statistical models. In particular, we developed a suite of techniques for scalable inference, including approximate inference, constraint-based inference to abstractly represent subsets of large domains, and lifted inference for reasoning over large systems with conditionally independent components. We extended this language to model agent-based systems, and developed a framework for verifying an expressive class of branching-time properties of such systems.
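To make the combination of logical and statistical modeling concrete, the following is a minimal sketch (not Px syntax, and not the project's actual inference engine) of exact inference over a tiny probabilistic logic program: independent probabilistic facts, a deterministic rule, and a query answered by summing the weights of the possible worlds in which it holds.

```python
from itertools import product

# Probabilistic facts: each is true independently with the given probability.
# The fact names and probabilities here are illustrative assumptions.
prob_facts = {"burglary": 0.1, "earthquake": 0.2}

def alarm(world):
    # Deterministic rule, Datalog-style: alarm :- burglary ; earthquake.
    return world["burglary"] or world["earthquake"]

def query_prob(query):
    """Exact inference: sum the weights of all worlds satisfying the query."""
    total = 0.0
    names = list(prob_facts)
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name, val in world.items():
            p = prob_facts[name]
            weight *= p if val else (1.0 - p)
        if query(world):
            total += weight
    return total

print(query_prob(alarm))  # P(alarm) = 1 - 0.9 * 0.8, i.e. about 0.28
```

Enumeration is exponential in the number of probabilistic facts; the constraint-based and lifted techniques mentioned above exist precisely to avoid this blow-up on large domains.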
We studied the problem of analyzing high-level models to perform tasks other than traditional inference. One such problem was to answer "what if" questions: characterizing the qualitative and quantitative improvements to query answers if some aspects of the model, or certain pieces of data, were better known. This is related to the "Value of Information" (VoI) problem from decision theory. While the VoI optimization problem is intractable in general, we developed a set of algorithms for it under different conditions, which also yield efficient solutions in the cases where the problem is tractable.
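A minimal worked example of the Value-of-Information idea, on a hypothetical two-state, two-action decision problem (the states, actions, and payoffs below are invented for illustration): the expected value of perfect information (EVPI) is the gap between deciding after learning the true state and deciding on the prior alone.

```python
# Prior belief over states, and state-dependent payoffs for each action.
prior = {"faulty": 0.3, "ok": 0.7}
payoff = {
    "repair": {"faulty": 100.0, "ok": -20.0},
    "ignore": {"faulty": -200.0, "ok": 0.0},
}

def expected_payoff(action, belief):
    return sum(belief[s] * payoff[action][s] for s in belief)

# Deciding on the prior alone: pick the action with the best expectation.
v_prior = max(expected_payoff(a, prior) for a in payoff)

# Perfect information: learn the state first, then act optimally in it.
v_perfect = sum(prior[s] * max(payoff[a][s] for a in payoff) for s in prior)

evpi = v_perfect - v_prior  # what knowing the state is worth (about 14 here)
print(v_prior, v_perfect, evpi)
```

The harder VoI questions studied in the project ask which of many possible (and costly, partial) observations to acquire, which is where the general intractability arises.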
We also introduced neural state classification for analyzing reachability properties of probabilistic and hybrid systems, posing the verification problem as one of constructing a classifier using neural networks (NNs). We defined the Neural Simplex Architecture (NSA), a framework for incorporating NN-based controllers into a safety-critical system without violating its verified safe behavior. In this architecture, a verified baseline controller (BC) is supplemented with an NN-based controller such that the BC is called into service whenever the NN-based controller is expected to generate unsafe behavior; during such times, the NN-based controller learns from the BC's actions, thereby increasing its ability to keep the controlled system within safe regions. To generate better training data for NN-based controllers, we developed a next-generation method for constructing Lagrangian reach tubes that estimate the future behaviors of a hybrid system.
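The switching logic at the heart of this architecture can be sketched as follows; the toy one-dimensional plant, the specific controllers, and the one-step-lookahead safety check are illustrative assumptions, not the NSA implementation itself.

```python
def step(x, u):
    """Discrete-time toy plant: position x, control input u."""
    return x + 0.1 * u

def safe(x):
    """Illustrative safe region: |x| <= 1."""
    return abs(x) <= 1.0

def baseline_controller(x):
    """Verified baseline controller (BC): steers toward the origin, so
    from any safe state the next state |0.9 * x| <= 0.9 is also safe."""
    return -x

def nn_controller(x):
    """Stand-in for a learned NN policy that is sometimes unsafe."""
    return 5.0

def nsa_step(x, training_log):
    """One step of the Neural Simplex pattern: use the NN action only if a
    lookahead says it stays safe; otherwise fall back to the BC and log
    the BC's action as a training example for retraining the NN."""
    u_nn = nn_controller(x)
    if safe(step(x, u_nn)):
        return step(x, u_nn)
    u_bc = baseline_controller(x)
    training_log.append((x, u_bc))  # NN later learns from BC's behavior
    return step(x, u_bc)

log = []
x = 0.9
for _ in range(50):
    x = nsa_step(x, log)
    assert safe(x)  # safety is never violated under the switch
print(len(log))  # number of BC interventions recorded as training data
```

The key design point is that safety rests only on the verified BC and the switching check, never on the unverified NN, so the NN can be retrained freely without re-verification.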
Systems for logical reasoning (even without the addition of statistical reasoning) may be based on different (sometimes incompatible) semantics. We developed the Founded Semantics, as an approach to unifying the existing semantics. We extended this to a Constrained Semantics that supports unrestricted negation, as well as unrestricted existential and universal quantifications. These unified approaches enable an analyst to build models where different components rely on assumptions stemming from different semantics. We have also developed a unified semantics for recursive rules with aggregation, extending the unified founded semantics and constraint semantics for recursive rules with negation.
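As an illustration of a recursive rule with aggregation (not Px or founded-semantics syntax; the edge facts are hypothetical), the shortest-path rule dist(S, Y) = min over direct edges edge(S, Y, C) and over extensions dist(S, X) + C for edge(X, Y, C) can be evaluated by a naive least-fixpoint loop:

```python
import math

# Hypothetical facts: edge(source, target, cost).
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 10.0)]

def shortest_paths(edges):
    """Naive least-fixpoint evaluation of a recursive min-aggregation rule.
    With positive costs, each pass can only improve finitely many entries
    by finitely many simple-path costs, so the iteration terminates."""
    dist = {}
    changed = True
    while changed:
        changed = False
        new = dict(dist)
        for (x, y, c) in edges:
            # Base rule: the edge itself is a path from x to y.
            candidates = {(x, y): c}
            # Recursive rule: extend every known path that ends at x.
            for (s, t), d in dist.items():
                if t == x:
                    key = (s, y)
                    candidates[key] = min(candidates.get(key, math.inf), d + c)
            # Min aggregation: keep only strict improvements.
            for key, val in candidates.items():
                if val < new.get(key, math.inf):
                    new[key] = val
                    changed = True
        dist = new
    return dist

print(shortest_paths(edges))  # the a->c entry improves from 10.0 to 3.0
```

Naive evaluation recomputes everything each pass; the semantic subtlety addressed by the unified semantics is what such rules should mean when negation or aggregation interacts with recursion in less well-behaved ways than this monotone example.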
Challenge applications for the project's fundamental developments in modeling and reasoning were in formal methods and verification of complex systems, described earlier, and in the construction of phylogenetic trees from phenomic data, used in the Tree of Life project for mapping the evolutionary history of organisms. For the latter, we investigated the use of combined logical/statistical models for detecting similarities in taxonomic data to assist in data cleaning and organization in MorphoBank, an open-source tool for storing and sharing phenotypic and morphological data of species.
Broader Impact: This project led to several developments that are immediately relevant outside its primary research area. We applied the fundamental research results to a number of problems in the health IT domain. Rapid response events (RREs) correspond to deteriorating conditions of a patient that, if not attended to immediately, can be fatal. Based on our work on optimizing the Value of Information, we are developing models to predict RREs as early as possible, potentially leading to better care and improved patient outcomes. A similar "early warning" system was developed to predict sepsis, a life-threatening condition brought on by the body's response to infection and considered one of the leading causes of death in hospitals.
We maintained MorphoBank with 24/7/365 uptime and trained minority undergraduate computer science majors through summer internships. We fixed bugs and implemented software upgrades on MorphoBank. MorphoBank retained a diverse staff with a female director and an African American lead software developer. MorphoBank also made strides toward sustainability through collaboration with Phoenix Bioinformatics.
In terms of human resources, the project led to the training of more than 100 graduate students on the basics of probabilistic logic programs and on modeling complex systems using them. In the context of its larger research community, developments in this project were the basis for the Applications of Logic Programming workshop, held as a satellite event of the International Conference on Logic Programming 2016, and for two subsequent Logic and Practice of Programming workshops, held in 2018 and scheduled for November 2020.
Last Modified: 08/28/2020
Modified by: C. R Ramakrishnan