Award Abstract # 1902892
Scalable Model-Based Reconstruction of Network Evolution

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: UNIVERSITY OF WISCONSIN SYSTEM
Initial Amendment Date: May 20, 2019
Latest Amendment Date: July 8, 2021
Award Number: 1902892
Award Instrument: Continuing Grant
Program Manager: Zhilan Feng
zfeng@nsf.gov
(703)292-7523
DMS Division Of Mathematical Sciences
MPS Directorate for Mathematical and Physical Sciences
Start Date: June 1, 2019
End Date: May 31, 2023 (Estimated)
Total Intended Award Amount: $724,239.00
Total Awarded Amount to Date: $724,239.00
Funds Obligated to Date: FY 2019 = $235,227.00
FY 2020 = $182,662.00
FY 2021 = $306,350.00
History of Investigator:
  • Cecile Ane (Principal Investigator)
    cecile.ane@wisc.edu
  • Curtis Brandt (Co-Principal Investigator)
  • Sebastien Roch (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Wisconsin-Madison
21 N PARK ST STE 6301
MADISON
WI  US  53715-1218
(608)262-3822
Sponsor Congressional District: 02
Primary Place of Performance: University of Wisconsin-Madison
Madison
WI  US  53706-1510
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): LCLSJAGTNZQ7
Parent UEI:
NSF Program(s): NIGMS
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s): 804700
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

The availability of full genomes across individuals from many populations and many species offers rich information about past evolutionary history. By comparing the genes of different individuals, one can detect which individuals are most closely related and reconstruct the history of population splits and speciation, as visualized in a phylogenetic tree. Challenges arise, however, because of genealogical differences between individuals within each species, current or ancestral. This project focuses on the detection of species convergences: when species hybridize, when individuals from one species migrate to another, or when strains recombine. The history of a group of species is then best described by a network, where a backbone tree represents speciation and extra branches describe gene flow from one population into another. Current methods to estimate phylogenetic networks cannot analyze data sets with more than a few dozen species. Based on novel theoretical foundations, the PIs will develop statistical methods and software that will scale to hundreds of species and thousands of genetic loci. These new methods will also be particularly valuable to advance knowledge in bacterial and viral evolution, where recombination is prevalent. The project will support graduate and undergraduate students, who will gain training beyond traditional disciplinary boundaries with involvement in the larger community of campus researchers interested in networks in data science.

Through the mathematical analysis of coalescent processes on phylogenetic networks, the PIs will determine the maximal substructures of these networks that can be theoretically identified from various data types, such as gene trees or genetic distances between pairs of individuals, using one or more individuals per population. Theory will also be developed to determine the amount of data necessary to reconstruct the phylogenetic network with accuracy. These theoretical findings will guide the development of new statistical methods and software to estimate phylogenetic networks from data, with a focus on the use of genetic distances to devise fast algorithms that can handle hundreds of species. These fast reconstruction methods will allow the deployment of a cross-validation method to learn from data the appropriate complexity of the network, that is, the appropriate number of gene flow events. The proposed research will advance knowledge of the evolutionary history in many groups where gene flow and recombination are suspected, such as the early radiation of mammals or land plants, and the evolutionary history of the herpes virus family.
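
As an illustration of what "genetic distances between pairs of individuals" can mean in practice, the sketch below computes the proportion of differing sites (p-distance) between two aligned DNA sequences and applies the standard Jukes-Cantor correction. This is an assumption made for illustration only: the abstract does not say which distance measure the project uses, and the function names and example sequences are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, counting only positions where
    both aligned sequences carry an unambiguous base (A, C, G, or T)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance: -3/4 * ln(1 - 4p/3).
    Returns infinity when the sequences are saturated (p >= 3/4)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise corrected distances for a dict of name -> aligned sequence."""
    names = sorted(seqs)
    return {(x, y): jukes_cantor_distance(seqs[x], seqs[y])
            for x in names for y in names}


# Hypothetical toy alignment, one sequence per individual.
example = {"ind1": "ACGTACGT", "ind2": "ACGTACGA", "ind3": "AC-TACGT"}
matrix = distance_matrix(example)
```

A matrix of this kind is the sort of input that distance-based reconstruction methods take; methods that work from distances avoid estimating a full gene tree per locus, which is one reason they can be fast.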

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 19)
Tabatabaee, Yasamin and Roch, Sebastien and Warnow, Tandy "QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent" Journal of Computational Biology, v.30, 2023
Xu, Jingcheng and Ané, Cécile "Identifiability of local and global features of phylogenetic networks from average distances" Journal of Mathematical Biology, v.86, 2023 https://doi.org/10.1007/s00285-022-01847-8
Hill, Max and Roch, Sebastien "Inconsistency of Triplet-Based and Quartet-Based Species Tree Estimation under Intralocus Recombination" Journal of Computational Biology, v.29, 2022 https://doi.org/10.1089/cmb.2022.0265
Kolb, Aaron W. and Brandt, Curtis R. "Genomic nucleotide-based distance analysis for delimiting old world monkey derived herpes simplex virus species" BMC Genomics, v.21, 2020 https://doi.org/10.1186/s12864-020-06847-w
Legried, Brandon and Molloy, Erin K. and Warnow, Tandy and Roch, Sebastien "Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss" International Conference on Research in Computational Molecular Biology (RECOMB 2020), 2020
Legried, Brandon and Molloy, Erin K. and Warnow, Tandy and Roch, Sébastien "Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss" Journal of Computational Biology, v.28, 2021 https://doi.org/10.1089/cmb.2020.0424
Legried, Brandon and Roch, Sebastien "Pairwise sequence alignment at arbitrarily large evolutionary distance" The Annals of Applied Probability, v.34, 2024 https://doi.org/10.1214/23-AAP2009
Roch, Sebastien "Expanding the Class of Global Objective Functions for Dissimilarity-Based Hierarchical Clustering" Journal of Classification, 2023 https://doi.org/10.1007/s00357-023-09447-x
Tabatabaee, Y. and Roch, S. and Warnow, T. "Statistically Consistent Rooting of Species Trees Under the Multispecies Coalescent Model" Research in Computational Molecular Biology. RECOMB 2023. Lecture Notes in Computer Science. Springer, v.13976, 2023 https://doi.org/10.1007/978-3-031-29119-7_3
Teo, Benjamin and Rose, Jeffrey and Bastide, Paul and Ané, Cécile "Accounting for Within-Species Variation in Continuous Trait Evolution on a Phylogenetic Network" Bulletin of the Society of Systematic Biologists, v.2, 2023 https://doi.org/10.18061/bssb.v2i3.8977
Ané, Cécile and Fogg, John and Allman, Elizabeth S and Baños, Hector and Rhodes, John A "Anomalous networks under the multispecies coalescent: theory and prevalence" Journal of Mathematical Biology, v.88, 2024 https://doi.org/10.1007/s00285-024-02050-7

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This research was concerned with the discovery of past events that brought separate populations and species together, such as when individuals from one population migrate to another, or when hybridization creates a new species, or when two virus strains recombine with each other. Phylogenetic networks can represent these reticulation events, where each edge represents a population evolving over time. Edges meet at nodes where a population splits into descendant populations, or where genetic material flows from one population into another. Biologists have access to full genomes across individuals from many populations and many species, yet new methods are needed to discover the phylogenetic network describing the past history of populations, and questions still remain on what is discoverable about ancient reticulations from data on present-day populations.

Our team proved theoretical results about which local and global structures of the phylogenetic network are discoverable using data on genetic distances. We proved that the network's backbone tree-like structure is identifiable from average genetic distances, and that the precise structure of a local reticulate subgraph is not identifiable if it contains more than one reticulation. If this subgraph contains a single reticulation and if population sampling is sufficient, then we proved that this local reticulate subgraph can be reconstructed from various types of distance measures. In situations when the phylogenetic network (or tree) is discoverable, we proved new requirements on the amount of data (the number of genes and the number of DNA sites in each gene) that is necessary for an accurate inference of the network. Moreover, in the easier tree case (i.e., in the absence of reticulation), we showed that existing reconstruction methods relying on gene tree topologies require an amount of data that cannot be improved when rates of evolution vary across genes, which commonly holds in real datasets.
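
To make "average genetic distances" on a network concrete, here is a minimal sketch under one common definition, assumed here rather than taken from this report: each reticulation picks one of its parent edges with its inheritance probability, which yields a displayed tree, and the average distance between two taxa is the probability-weighted mean of their path lengths across the displayed trees. The toy network, its edge lengths, and the inheritance probabilities 0.7 and 0.3 are hypothetical.

```python
from itertools import product

# Hypothetical network: child -> list of (parent, edge_length, inheritance_prob).
# Tree nodes have one parent (probability 1.0); the hybrid node "h" has two.
network = {
    "x": [("r", 1.0, 1.0)],
    "y": [("r", 1.0, 1.0)],
    "A": [("x", 1.0, 1.0)],
    "C": [("y", 1.0, 1.0)],
    "h": [("x", 0.5, 0.7), ("y", 0.5, 0.3)],
    "B": [("h", 0.5, 1.0)],
}


def displayed_trees(net):
    """Yield (weight, tree) for every way of keeping one parent per hybrid node.
    Each tree maps child -> (parent, edge_length)."""
    hybrids = [c for c, parents in net.items() if len(parents) > 1]
    for choice in product(*(net[h] for h in hybrids)):
        tree = {c: ps[0][:2] for c, ps in net.items() if len(ps) == 1}
        weight = 1.0
        for h, (parent, length, gamma) in zip(hybrids, choice):
            tree[h] = (parent, length)
            weight *= gamma
        yield weight, tree


def depth_to_ancestors(tree, leaf):
    """Map each ancestor of `leaf` (and the leaf itself) to its path length from the leaf."""
    depths = {leaf: 0.0}
    node, total = leaf, 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        depths[parent] = total
        node = parent
    return depths


def tree_distance(tree, a, b):
    """Path length between two leaves through their most recent common ancestor."""
    da = depth_to_ancestors(tree, a)
    db = depth_to_ancestors(tree, b)
    return min(da[n] + db[n] for n in da if n in db)


def average_distance(net, a, b):
    """Weighted mean of tree distances over all displayed trees."""
    return sum(w * tree_distance(t, a, b) for w, t in displayed_trees(net))
```

In this toy example the distance between A and C is the same in both displayed trees, while the distances involving B are mixtures of two different path lengths. Only taxa whose paths cross the reticulation carry a signal about it, which gives some intuition for why sufficient population sampling matters for reconstructing a local reticulate subgraph.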

Based on this theory, we developed statistical methods to analyze genomic data for network inference, including a test to decide if a candidate network fits genomic data adequately, methods to analyze the evolution of traits on a network, and methods to simulate gene trees from networks. These methods were tested and implemented in open-source software with extensive documentation, for wide distribution and use by the research community. Network visualization software was also developed to support other analyses.

We applied these methods and pre-existing methods to genetic variation in herpesviruses. We developed a novel nucleotide distance-based criterion, to be used alongside other criteria, for species delimitation in viruses. Using this criterion, we discovered that the two monkey alpha herpes viruses are separate species. When conducting extensive analyses of simulated data and of empirical data on bovine herpesviruses, we uncovered a lack of robustness in various methods that infer phylogenetic networks and detect gene flow or hybridization. Some methods were found to be very sensitive to violations of their assumptions, or to the choice of tuning parameters or prior distributions. These results are significant because they inform method choice and best practices for discovering past recombination events.

The project supported the interdisciplinary training of seven undergraduate students and nine graduate students in mathematics, statistics, software development and genomic research. Results were presented to professional scientists in multiple peer-reviewed publications (13 already published), research talks and online resources on computational methods. During the funding period the investigators also participated in mentoring for middle school students from underrepresented groups. New course materials on relevant mathematical techniques and their applications were developed and are freely available.


Last Modified: 06/30/2023
Modified by: Cecile M Ane

Please report errors in award information by writing to: awardsearch@nsf.gov.