Award Abstract # 1223057
ATD Collaborative Research: New theorems and algorithms for comprehensive analysis of metagenomic data via statistical phylogenetics

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: FRED HUTCHINSON CANCER RESEARCH CENTER
Initial Amendment Date: August 28, 2012
Latest Amendment Date: August 28, 2012
Award Number: 1223057
Award Instrument: Standard Grant
Program Manager: Leland Jameson
DMS
 Division Of Mathematical Sciences
MPS
 Directorate for Mathematical and Physical Sciences
Start Date: September 1, 2012
End Date: August 31, 2018 (Estimated)
Total Intended Award Amount: $496,702.00
Total Awarded Amount to Date: $496,702.00
Funds Obligated to Date: FY 2012 = $496,702.00
History of Investigator:
  • Frederick Matsen (Principal Investigator)
    matsen@fhcrc.org
  • Aaron Darling (Co-Principal Investigator)
Recipient Sponsored Research Office: Fred Hutchinson Cancer Research Center
1100 FAIRVIEW AVE N J6-300
SEATTLE
WA  US  98109-4433
(206)667-4868
Sponsor Congressional District: 07
Primary Place of Performance: Fred Hutchinson Cancer Research Center
1100 Fairview Avenue North
Seattle
WA  US  98109-1204
Primary Place of Performance
Congressional District:
07
Unique Entity Identifier (UEI): HMSNCM57QNR5
Parent UEI: HMSNCM57QNR5
NSF Program(s):
Primary Program Source: 01001213RB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 6877
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

Current whole-metagenome analysis tools are primarily based on sequence similarity (assembly, BLAST) and taxonomies ("binning" approaches); while useful, these approaches have the following limitations. Assembly methods require read overlap and thus only reconstruct the most abundant organisms in a mixed sample. Sequence similarity approaches such as BLAST cannot relate reads to ancestral organisms and do not indicate the evolutionary significance of mutations. Taxonomic methods are too coarse to reflect the subtle DNA sequence changes that may characterize a biological threat. The investigators propose to overcome these limitations by developing the theoretical underpinnings of methods to: reconstruct the cellular compartmentalization of DNA in environmental samples, even when read counts are small, detect synthetic genomes and evidence of directed evolution within a metagenomic sample by performing a phylogenetic comparison with extant genomes, detect combinations of genetic material that are anomalous given their location or time of observation, statistically distinguish meaningful shifts in microbial community composition from noise, even when those shifts happen at a level below that detectable using currently available methods.

The tools of genetic engineering are in the hands of scientists of many countries; these tools can be used to synthesize biological weapons. Prevention of casualties from these weapons depends on their prompt detection and identification. Although high-throughput DNA sequencing could be used to monitor biological threats, the currently available tools for analyzing the wealth of information it generates are insufficient to statistically analyze threat risk. A biodefense monitoring approach informed by a statistical analysis of evolutionary signal could yield a means to detect genetic anomalies and threats directly from "metagenomic" data: high throughput shotgun sequencing data from environmental samples.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Alex Gavryushkin, Christopher Whidden, and Frederick A. Matsen IV "The combinatorics of discrete time-trees: theory and open problems" Journal of Mathematical Biology , 2017 10.1007/s00285-017-1167-9
Claywell, Dinh, McCoy, and Matsen IV "A surrogate function for one-dimensional phylogenetic likelihoods" Molecular Biology and Evolution , 2017
Dinh, Darling, and Matsen IV "Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo" Systematic Biology , 2016
Fourment, Claywell, Dinh, McCoy, Matsen IV, and Darling "Effective Online Bayesian Phylogenetics Via Sequential Monte Carlo With Guided Proposals" Systematic Biology , 2017
Sara C. Billey, Matjaz Konvalinka, and Frederick A. Matsen IV "On the enumeration of tanglegrams and tangled chains" Journal of Combinatorial Theory, Series A , v.146 , 2017 , p.239 10.1016/j.jcta.2016.10.003
Vu Dinh and Frederick A. Matsen IV "The shape of the one-dimensional phylogenetic likelihood function" The Annals of Applied Probability , v.27 , 2017 , p.1646 10.1214/16-AAP1240
Vu Dinh, Arman Bilge, Cheng Zhang, Frederick A. Matsen IV "Probabilistic Path Hamiltonian Monte Carlo" International Conference on Machine Learning, PMLR , v.70 , 2017 , p.1009
Whidden, C., & Matsen IV, F. A.. "Quantifying MCMC exploration of phylogenetic tree space" Systematic Biology , v.64 , 2015 , p.472 PMC4395846

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Phylogenetics is the practice of reconstructing evolutionary history from molecular sequences such as DNA. It represents the inferred relationships between the sequences as an evolutionary tree, with splits in the tree representing evolutionary divergences.  
Our work advanced phylogenetics in both practical and theoretical directions. The primary intellectual merit of this work was a collection of foundational theoretical and algorithmic work that gives us new insight into the essential structures of phylogenetic inference, as well as foundations for new and efficient algorithms. These algorithms are for comparing collections of trees, understanding phylogenetic likelihoods, and for doing Bayesian phylogenetic inference in new and efficient ways.  
INTELLECTUL CONTRIBUTIONS  
Tree comparison: 
  • Whidden & Matsen (Systematic Biology, 2015). Quantifying MCMC exploration of phylogenetic tree space: we use graph-theoretical tools to show that tree samplers get stuck in local peaks. 
  • Whidden & Matsen (Theoretical Computer Science, 2017). Ricci Ollivier curvature of the rooted phylogenetic subtree “prune“ regraft graph: we show that "tree space" is quite inhomogeneous in terms of its curvature, meaning some parts are inherently easier to move around than others
  • Gavryushkin, Whidden, & Matsen. (Journal of Mathematical Biology, 2017). The combinatorics of discrete time-trees, theory and open problems: we analyze the space of "timetrees", which are trees with time assigned to the internal nodes, and show that it's actually simpler than one might expect. 
  • Billey, Konvalinka, & Matsen (Journal of Combinatorial Theory. Series A, 2017). On the enumeration of tanglegrams and tangled chains: we discover new integer series counting "tanglegrams", which are pairs of phylogenetic trees with connections between their leaves that are useful as means of tree comparison. 
  • Matsen, Billey, Kas, & Konvalinka (Transactions on Computational Biology and Bioinformatics, 2018). Tanglegrams, A Reduction Tool for Mathematical Phylogenetics: we describe how these tanglegrams can reduce the amount of work require for computational biologists. 
  • Whidden, & Matsen (Workshop on Analytic Algorithmics and Combinatorics, 2018). Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees: we develop time-optimal algorithms for constructing the sorts of graphs that proved essential for understanding MCMC behavior in our 2015 paper. 
  • Whidden & Matsen (unpublished; arXiv). Chain Reduction Preserves the Unrooted Subtree Prune-and-Regraft Distance: we solve a long-standing conjecture establishing a key component of efficient tree-comparison methods. 

Phylogenetic likelihood function: 

  • Dinh & Matsen (Annals of Applied Probability, 2017). The shape of the one-dimensional phylogenetic likelihood function: we analyze phylogenetic likelihood functions, showing that under simple models they are well-behaved but under more complex models they can have strange behavior, including a arbitrary number of local maxima. 
  • Claywell, Dinh, Fourment, McCoy, & Matsen. (Molecular Biology and Evolution, 2017). A surrogate function for one-dimensional phylogenetic likelihoods: we show that a simple family of functions does a good job of approximating likelihood curves, and develop a library to use this surrogate family of functions for optimization and sampling.

Novel phylogenetic inference algorithms: 

  • Dinh, Bilge, Zhang, & Matsen (International Conference on Machine Learning, 2017). Probabilistic Path Hamiltonian Monte Carlo: we extend the cutting-edge Hamiltonian Monte Carlo sampler to non-Euclidean spaces for the first time, resulting in a novel Bayesian phylogenetics algorithm. 
  • Dinh, Ho, Suchard, & Matsen (Annals of Statistics, 2018). Consistency and convergence rate of phylogenetic inference via regularization: we extend and formalize a statistical framework for gene tree inference in the presence of external tree structure information.
  • Dinh, Darling, & Matsen (Systematic Biology, 2017). Online Bayesian phylogenetic inference, theoretical foundations via Sequential Monte Carlo: we develop the theoretical foundations for the first online phylogenetic inference algorithm, which can progressively update phylogenetic inferences with new sequences as they arrive.
  • Fourment, Claywell, Dinh, McCoy, Matsen & Darling (Systematic Biology, 2017). Effective online Bayesian phylogenetics via sequential Monte Carlo with guided proposals: we show sequential Monte Carlo to be an effective tool for online inference, namely that it can outperform classical batch algorithms for updating trees.
  • Whidden, Claywell, Fisher, Magee, Fourment, & Matsen (submitted, 2018). Systematic exploration of the high likelihood set of phylogenetic tree topologies: we develop a new approach to Bayesian phylogenetics via systematic exploration, in which we explore out from local maxima. 
  • Fourment, Magee, Whidden, Bilge, Matsen, Minin (submitted, 2018). 19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology: we develop a component needed for this alternate direction for Bayesian phylogenetics, namely fast means of integrating out the continuous parameters of trees. 

BROADER IMPACTS:  

Our work on online phylogenetic inference stimulated special interest in the pathogen outbreak community, where two groups are working on incorporating it into the two main forks of the BEAST software package that is used by infectious disease researchers. This allows them to track pathogen evolution in real time. Our work on understanding the underpinnings of phylogenetic inference has allowed us to develop completely new means of doing Bayesian phylogenetic inference.  During the granting period, our group hosted 6 high school interns and 2 undergraduate summer students.

 


Last Modified: 11/30/2018
Modified by: Frederick A Matsen

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page