Award Abstract # 1535977
AitF: Full: Collaborative Research: Graph-theoretic algorithms to improve phylogenomic analyses

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: UNIVERSITY OF ILLINOIS
Initial Amendment Date: August 12, 2015
Latest Amendment Date: August 12, 2015
Award Number: 1535977
Award Instrument: Standard Grant
Program Manager: A. Funda Ergun
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2015
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $360,000.00
Total Awarded Amount to Date: $360,000.00
Funds Obligated to Date: FY 2015 = $360,000.00
History of Investigator:
  • Tandy Warnow (Principal Investigator)
    warnow@illinois.edu
  • Chandra Chekuri (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Illinois at Urbana-Champaign
506 S WRIGHT ST
URBANA
IL  US  61801-3620
(217)333-2187
Sponsor Congressional District: 13
Primary Place of Performance: University of Illinois at Urbana-Champaign
IL  US  61820-7473
Primary Place of Performance
Congressional District:
13
Unique Entity Identifier (UEI): Y8CWNJRCNN91
Parent UEI: V2PHZ2CSCH63
NSF Program(s): Algorithms in the Field
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 012Z
Program Element Code(s): 723900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Understanding the history of life on earth ? how species evolved from their common ancestor ? is a major goal of biological research. These evolutionary trees are very hard to construct with high accuracy, because nearly all of the most accurate approaches require the solution to computationally hard optimization problems. Furthermore, research has shown that the evolutionary tree for a single gene can be different from the evolutionary tree for the species, and current methods do not provide adequate accuracy on genome-scale data. As a result, large evolutionary trees, covering big portions of ?The Tree of Life?, are very difficult to compute with high accuracy. This project will develop methods that can enable highly accurate species tree estimation. The key approach is the development of novel divide-and-conquer strategies, whereby a dataset is divided into overlapping subsets, species trees are constructed on the subsets, and then the subset species trees are merged together into a tree on the full dataset. These approaches will be combined with powerful statistical estimation methods, to potentially transform the capability of evolutionary biologists to analyze their data. This project will also provide open source software for the new methods that are developed, and provide training in the use of the software to biologists at national meetings. The project will also contribute to interdisciplinary training for two doctoral students, one at Illinois and one at Berkeley, and course materials for computational biology will be made available online.

Understanding evolution, and how it has operated on species and on genes, is a major part of biological data analysis. Statistical estimation approaches often provide the best accuracy, but cannot scale to dataset sizes that are required for modern biology. In addition, species tree estimation is challenged by the heterogeneity of evolutionary trees across the genome, and no current methods are able to provide highly accurate species trees for genome-scale data. These challenges make it essential that new methods be developed in order to make highly accurate large-scale evolutionary tree estimation possible under these complex evolutionary scenarios. This project will develop novel algorithmic strategies to address three key problems: supertree estimation, species tree estimation in the presence of gene tree heterogeneity, and scaling statistical methods to large datasets. In addition to developing graph-theoretic algorithms, the project team will establish mathematical guarantees for these methods using chordal graph theory and probabilistic analysis, under stochastic models of gene and sequence evolution.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 24)
E. Molloy and T. Warnow "Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge" Algorithms for Molecular Biology, (special issue of extended abstracts of selected papers from RECOMB-CG 2018) , v.14 , 2019 10.1186/s13015-019-0151-x
E. Molloy and T. Warnow "TreeMerge: A new method for improving the scalability of species tree estimation methods" Bioinformatics (special issue for ISMB 2019) , v.35 , 2019 , p.i417 10.1093/bioinformatics/btz344
E. Molloy and T. Warnow. . In: Blanchette M., Ouangraoua A. (eds) Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science, vol 11183. Springer, Cham. Proceedings of RECOMB-CG (Satellite conference for Comparative Genomics). "NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees" Lecture Notes in Computer Science, Springer, Cham. Proceedings of RECOMB-CG (Satellite conference for Comparative Genomics) , v.11183 , 2018 , p.260 10.1007/978-3-030-00834-5_15
Erin K Molloy, Tandy Warnow "FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models" Bioinformatics , v.36 , 2020 , p.i57 10.1093/bioinformatics/btaa444
Erin Molloy and Tandy Warnow "To include or not to include: the impact of gene filtering on species tree estimation" Sytematic Biology , 2017 10.1093/sysbio/syx077
Erin Molloy and Tandy Warnow "To include or not to include: the impact of gene filtering on species tree estimation" Sytematic Biology , 2017
Legried B., Molloy E.K., Warnow T., Roch S. "Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss." RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. , 2020 10.1007/978-3-030-45257-5_8
Md. Shamsuzzoha Bayzid and Tandy Warnow "Gene Tree Parsimony for Incomplete Gene Trees" WABI and LIPIcs-Leibniz International Proceedings in Informatics , v.88 , 2017 10.4230/LIPIcs.WABI.2017.2
Md. Shamsuzzoha Bayzid and Tandy Warnow "Gene tree parsimony for incomplete gene trees: addressing true biological loss" Algorithms for Molecular Biology , v.13 , 2018 , p.1 10.1186/s13015-017-0120-1
Michael Nute, Jed Chou, Erin Molly, and Tandy Warnow "The performance of coalescent-basedspecies tree estimation methods under modelsof missing data" BMC Genomics , v.19 , 2018 , p.286 10.1186/s12864-018-4619-8
Pranjal Vachaspati and Tandy Warnow "Enhancing Searches for Optimal Trees Using SIESTA" RECOMB-Comparative Genomics, Lecture Notes in Computer Science (LNCS) , v.10562 , 2017 , p.232 10.1007/978-3-319-67979-2_13
(Showing: 1 - 10 of 24)

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Intellectual Merit. The main focus of this project was the development of new algorithms with provable theoretical guarantees for estimating evolutionary trees (also called "phylogenies"). We specifically examined methods that can address genome-scale phylogeny estimation, which is complicated by biological processes, such as incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that can cause the evolutionary trees for single genes to be different from that of the species tree.  Current evolutionary biology projects often seek to estimate species trees with hundreds of species and thousands or tens of thousands of genes, and standard methods, such as using maximum likelihood methods applied to the concatenated gene sequences, do not have good theoretical performance guarantees.  For this reason, new methods are needed that have theoretical performance guarantees (such as being statistically consistent, and so proven to recover the true species tree with high probability given enough data) and have good accuracy in practice. This was a main goal of our project. 

For species tree estimation addressing ILS, our research group developed several new methods with excellent theoretical and empirical advantages over current leading methods. Among the highlights of this effort, we developed ASTRID (Vachaspati and Warnow, BMC Genomics 16.S10 (2015): S3), a polynomial time method for species tree estimation that addresses incomplete lineage sorting (ILS) and is faster than any of the current available methods on large and ultra-large datasets, with 10,000 species and genes.  We proved the surprising result that partitioned analyses using maximum likelihood on multi-locus datasets can be positively misleading and so converge to the wrong species tree as the number of genes increases (Roch et al., Systematic Biology 68.2 (2019): 281-297).  In Legried et al., RECOMB 2020, we proved that ASTRAL-multi (Rabiee et al., Molecular Phylogenetics and Evolution 130 (2019): 286-296), a species tree estimation method to address ILS, is statistically consistent in the presence of gene duplication and loss (GDL); this was the first method proven statistically consistent under GDL.

Another major effort was to develop general purpose methods for scaling computationally intensive methods to large datasets. This is important because many of the most accurate methods are based on hard optimization problems, and so can be computationally intensive on large datasets (and infeasible on even moderately large datasets).  Therefore, we developed approaches based on divide-and-conquer. These "meta-methods" apply computationally intensive methods on small subsets of the species, and then combine the subset trees together. For the case where the subset trees are overlapping, we developed new supertree methods, with FastRFS (Vachaspati and Warnow, Bioinformatics  33.5 (2017): 631-639) a particularly powerful technique: it finds an exact solution to a hard optimization problem within a constrained search space, and does so in polynomial time. FastRFS improves on the prior best methods for the popular Robinson-Foulds Supertree Problem. For the case where the subset trees are disjoint, we developed methods that use auxiliary information to combine these disjoint subset trees, and are proven to maintain the desirable statistical guarantees of their base methods and to run in polynomial time. Two of the major advances include TreeMerge (Molloy and Warnow, Bioinformatics 35.14 (2019): i417-i426) and the Guide Tree Merger (Smirnov and Warnow, BMC Genomics 21 (2020): 1-17).  We showed that GTM and TreeMerge maintain accuracy and greatly reduce the running time of existing methods for species tree estimation, especially on large datasets.  

Broader Impacts: We provided open-source software for new methods in phylogeny estimation and trained biologists in the use of these software packages at national and international meetings. Three PhD students and two MS students graduated and were supported by this project.  The methods we developed are now being used in phylogenomic projects. 

 


Last Modified: 11/01/2020
Modified by: Tandy Warnow

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page