
NSF Org: |
DMS Division Of Mathematical Sciences |
Recipient: |
|
Initial Amendment Date: | February 23, 2017 |
Latest Amendment Date: | June 21, 2021 |
Award Number: | 1651995 |
Award Instrument: | Continuing Grant |
Program Manager: | Yong Zeng, yzeng@nsf.gov, (703)292-7299, DMS Division Of Mathematical Sciences, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | July 1, 2017 |
End Date: | June 30, 2023 (Estimated) |
Total Intended Award Amount: | $400,000.00 |
Total Awarded Amount to Date: | $400,000.00 |
Funds Obligated to Date: | FY 2018 = $77,558.00; FY 2019 = $80,206.00; FY 2020 = $81,982.00; FY 2021 = $84,795.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: | 77 MASSACHUSETTS AVE, CAMBRIDGE, MA, US 02139-4301, (617)253-1000 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 77 Massachusetts Avenue, Cambridge, MA, US 02139-4301 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | STATISTICS, Division Co-Funding: CAREER |
Primary Program Source: | 01001819DB NSF RESEARCH & RELATED ACTIVIT; 01001920DB NSF RESEARCH & RELATED ACTIVIT; 01002021DB NSF RESEARCH & RELATED ACTIVIT; 01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
Technological advances and the information era allow the collection of massive amounts of data at unprecedented resolution. Making use of this data to gain insight into complex phenomena requires characterizing the relationships among a large number of variables. Graphical models explicitly capture the statistical relationships between the variables of interest in the form of a network. Such a representation, in addition to enhancing the interpretability of the model, enables computationally efficient inference. The investigator develops methodology to infer undirected and directed networks among a large number of variables from observational data. This research has broad societal impact, as it affects application domains ranging from weather forecasting to phylogenetics to personalized medicine. In addition, the PI is one of the initial faculty hires in a new MIT-wide effort in statistics. As such, the PI plays a major role in creating new undergraduate and PhD programs in statistics to train the next generation in big data analytics, which is crucial for taking on challenging roles in this data-rich world.
The goal of this project is to study probabilistic graphical models using an integrated approach that combines ideas from applied algebraic geometry, convex optimization, mathematical statistics, and machine learning, and to apply these models to novel, scientifically important problems. The research agenda is structured into three projects. In the first project, the investigator develops methods to infer causal relationships between variables from observational data using the framework of directed Gaussian graphical models combined with tools from optimization and algebraic geometry. The end goal is to apply this new methodology to learn tissue- and person-specific gene regulatory networks from gene expression data such as those collected by the Genotype-Tissue Expression (GTEx) project. In the second project, the investigator develops scalable methods for maximum likelihood estimation in Gaussian models with linear constraints on the covariance matrix or its inverse. Such models are important for the inference of phylogenetic trees or cellular differentiation trees. The third project is an application of graphical models to weather forecasting: the investigator develops new parametric methods based on Gaussian copulas, as well as non-parametric methods, for the post-processing of numerical weather prediction models that take into account the complicated dependence structure of weather variables in space and time.
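As a rough illustration of how a directed Gaussian graphical model encodes relationships among variables, the following sketch builds the inverse covariance (precision) matrix of a three-variable chain X1 -> X2 -> X3. The edge weights, the unit noise variances, and the helper names are assumptions made for this example; they are not taken from the project. The point is only that the missing edge between X1 and X3 appears as a zero entry in the precision matrix.

```python
# Illustrative sketch only: a linear Gaussian chain X1 -> X2 -> X3.
# Edge weights a, b and unit noise variances are hypothetical choices.

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*A)]

def precision_from_dag(B):
    """Precision matrix K = (I - B)^T (I - B) for X = B X + e, e ~ N(0, I)."""
    n = len(B)
    i_minus_b = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(n)]
                 for i in range(n)]
    return matmul(transpose(i_minus_b), i_minus_b)

a, b = 0.8, -0.5
# B[i][j] holds the weight of the directed edge j -> i.
B = [[0.0, 0.0, 0.0],
     [a,   0.0, 0.0],
     [0.0, b,   0.0]]

K = precision_from_dag(B)

# K[0][2] is zero: X1 and X3 are conditionally independent given X2.
for row in K:
    print([round(v, 3) for v in row])
```

The entries K[0][1] and K[1][2] are nonzero because those pairs of variables are joined by an edge, while K[0][2] vanishes because X1 influences X3 only through X2.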
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
A central problem in biology and biomedical discovery is the inference of gene regulatory networks, that is, causal networks that allow predicting the effect of any intervention in the system. Key outcomes of this project were the development of theory, algorithms, and computational methods for: 1) causal structure discovery, i.e., learning the underlying cause-and-effect relationships, such as which gene up- or down-regulates which other gene, from a mix of observational and interventional data; 2) optimal experimental design of interventions, a key problem given that the space of possible perturbations that can be performed in biology (e.g. all combinations of genetic perturbations, any combination of drugs) is huge and cannot be fully explored experimentally; 3) predicting the effect of untested intervention-context pairs, such as a drug in a new disease context.
With respect to causal structure discovery, we developed methods that can deal with key issues that make this problem challenging in practice, including: the presence of latent (unmeasured) confounders, off-target intervention effects (knock-outs may also target other genes with similar sequences), measurement error in the data collection process (single-cell RNA-seq data is highly zero-inflated), and data that come from unknown disease subtypes and hence from a mixture of causal models. In particular, we developed methods for causal structure discovery that are provably consistent under strictly weaker assumptions than previous algorithms and that scale to the large graph sizes needed for applications to gene regulation.
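To illustrate why interventional data help with causal structure discovery, the following toy simulation uses a two-gene linear model in which gene X regulates gene Y. It is a minimal sketch under assumed values (the effect size, the Gaussian noise, and all names are hypothetical) and is not one of the algorithms developed in this project. Fixing the cause by intervention shifts the effect, whereas fixing the effect leaves the cause unchanged, and this asymmetry reveals the direction of the edge.

```python
import random

# Illustrative sketch only: gene X regulates gene Y in a linear model.
# EFFECT and the standard normal noise are hypothetical choices.
EFFECT = 1.5  # assumed strength of the X -> Y edge

def sample(rng, do_x=None, do_y=None):
    """Draw one (x, y) pair; do_x / do_y fix a variable by intervention."""
    x = rng.gauss(0.0, 1.0) if do_x is None else do_x
    y = (EFFECT * x + rng.gauss(0.0, 1.0)) if do_y is None else do_y
    return x, y

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000

# Intervening on the cause X shifts the mean of the effect Y ...
y_under_do_x = mean([sample(rng, do_x=2.0)[1] for _ in range(n)])
# ... but intervening on the effect Y leaves the mean of X near zero.
x_under_do_y = mean([sample(rng, do_y=2.0)[0] for _ in range(n)])

print("mean of Y under do(X=2):", round(y_under_do_x, 2))
print("mean of X under do(Y=2):", round(x_under_do_y, 2))
```

With purely observational samples from this model, the two directions X -> Y and Y -> X cannot be told apart from the joint Gaussian distribution alone, which is why a mix of observational and interventional data is informative.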
With respect to experimental design, we developed methods for identifying interventions that are optimal in different senses: with respect to the amount of information they carry about the underlying causal graph (e.g. the gene regulatory network), and with respect to moving the distribution from a given state to a desired state via interventions.
With respect to predicting the effect of untested interventions from a set of tested interventions, we viewed this causal transportability problem as a tensor completion problem and developed novel algorithms for it, based on infinitely wide neural networks, that are fast, flexible, and effective. We also benchmarked these algorithms on the problem of virtual drug screening.
Throughout the project, the principal investigator (PI) has actively engaged in activities related to education and research by building a diverse research group that attracted women and members of underrepresented groups, and by training its graduate students, undergraduate students, and postdoctoral fellows at the intersection of statistics, machine learning, and the biomedical sciences. The training spanned theory, method development, and applications to important biological and medical problems. To help build the research area and community at large, the PI initiated and organized various conferences, including introducing a new machine learning conference focused solely on causal inference, "Causal Learning and Reasoning (CLeaR)", as well as co-organizing the semester-long program on causality at the Simons Institute at UC Berkeley.
The research resulting from this project has been widely disseminated. Several open-source software packages have been developed; in particular, all causal inference algorithms resulting from this project are implemented and freely available in the group's causaldag Python library, which can be found in the group's GitHub repository (https://github.com/uhlerlab). In addition, the PI disseminated the research results through keynote presentations at various conferences, conference presentations by the involved PhD students, and many publications, which were made freely available before publication on the arXiv or bioRxiv preprint repositories to enable accelerated scientific discovery. Finally, material developed as part of this project has been integrated into two courses that the PI teaches at MIT.
Last Modified: 09/28/2023
Modified by: Caroline Uhler