
NSF Org: |
DMS Division Of Mathematical Sciences |
Recipient: |
|
Initial Amendment Date: | May 20, 2019 |
Latest Amendment Date: | July 8, 2021 |
Award Number: | 1902892 |
Award Instrument: | Continuing Grant |
Program Manager: |
Zhilan Feng, zfeng@nsf.gov, (703)292-7523, DMS Division Of Mathematical Sciences, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | June 1, 2019 |
End Date: | May 31, 2023 (Estimated) |
Total Intended Award Amount: | $724,239.00 |
Total Awarded Amount to Date: | $724,239.00 |
Funds Obligated to Date: |
FY 2020 = $182,662.00 FY 2021 = $306,350.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
21 N PARK ST STE 6301 MADISON WI US 53715-1218 (608)262-3822 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Madison WI US 53706-1510 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | NIGMS |
Primary Program Source: |
01001920DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): | |
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
The availability of full genomes across individuals from many populations and many species offers rich information about past evolutionary history. By comparing the genes of different individuals, one can detect which individuals are most closely related and reconstruct the history of population splits and speciation, as visualized in a phylogenetic tree. Challenges arise, however, because of genealogical differences between individuals within each species, current or ancestral. This project focuses on the detection of species convergences: when species hybridize, when individuals from one species migrate to another, or when strains recombine. The history of a group of species is then best described by a network, where a backbone tree represents speciation and extra branches describe gene flow from one population into another. Current methods to estimate phylogenetic networks cannot analyze data sets with more than a few dozen species. Based on novel theoretical foundations, the PIs will develop statistical methods and software that will scale to hundreds of species and thousands of genetic loci. These new methods will also be particularly valuable for advancing knowledge of bacterial and viral evolution, where recombination is prevalent. The project will support graduate and undergraduate students, who will gain training beyond traditional disciplinary boundaries through involvement in the larger community of campus researchers interested in networks in data science.
Through the mathematical analysis of coalescent processes on phylogenetic networks, the PIs will determine the maximal substructures of these networks that can be theoretically identified from various data types, such as gene trees or genetic distances between pairs of individuals, using one or more individuals per population. Theory will also be developed to determine the amount of data necessary to reconstruct the phylogenetic network accurately. These theoretical findings will guide the development of new statistical methods and software to estimate phylogenetic networks from data, with a focus on the use of genetic distances to devise fast algorithms that can handle hundreds of species. These fast reconstruction methods will allow the deployment of a cross-validation method to learn from data the appropriate complexity of the network, that is, the appropriate number of gene flow events. The proposed research will advance knowledge of the evolutionary history of many groups where gene flow and recombination are suspected, such as the early radiation of mammals or land plants, and the evolutionary history of the herpes virus family.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This research was concerned with the discovery of past events that brought separate populations and species together, such as when individuals from one population migrate to another, or when hybridization creates a new species, or when two virus strains recombine with each other. Phylogenetic networks can represent these reticulation events, where each edge represents a population evolving over time. Edges meet at nodes where a population splits into descendant populations, or where genetic material flows from one population into another. Biologists have access to full genomes across individuals from many populations and many species, yet new methods are needed to discover the phylogenetic network describing the past history of populations, and questions still remain on what is discoverable about ancient reticulations from data on present-day populations.
Our team proved theoretical results about which local and global structures of the phylogenetic network are discoverable using data on genetic distances. We proved that the network's backbone tree-like structure is identifiable from average genetic distances, and that the precise structure of a local reticulate subgraph is not identifiable if it contains more than one reticulation. If this subgraph contains a single reticulation and if population sampling is sufficient, then we proved that this local reticulate subgraph can be reconstructed from various types of distance measures. In situations where the phylogenetic network (or tree) is discoverable, we proved new requirements on the amount of data (number of genes and number of DNA sites in each gene) that are necessary for an accurate inference of the network. Moreover, in the easier tree case (i.e., in the absence of reticulation), we showed that existing reconstruction methods relying on gene tree topologies require an amount of data that cannot be improved when rates of evolution vary across genes, which commonly holds in real datasets.
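As background for the statement that tree-like structure can be read from pairwise distances, the classical four-point condition gives a simple check: a distance matrix fits a tree exactly when, for every four taxa, the two largest of the three pairwise sums are equal. The sketch below illustrates that textbook condition only; it is not the project's method, and the function name, taxon labels, and toy distances are all hypothetical.

```python
from itertools import combinations


def four_point_violation(dist, taxa):
    """Return the largest four-point violation over all quartets of taxa.

    dist is a nested dict with dist[x][y] == dist[y][x].
    A result of 0 means the distances are exactly tree-like (additive).
    """
    worst = 0.0
    for a, b, c, d in combinations(taxa, 4):
        # The three ways to pair up four taxa.
        sums = sorted([
            dist[a][b] + dist[c][d],
            dist[a][c] + dist[b][d],
            dist[a][d] + dist[b][c],
        ])
        # On a tree, the two largest sums coincide.
        worst = max(worst, sums[2] - sums[1])
    return worst


def symmetric(pairs):
    """Build a symmetric nested dict from {(x, y): distance} (hypothetical helper)."""
    dist = {}
    for (x, y), value in pairs.items():
        dist.setdefault(x, {})[y] = value
        dist.setdefault(y, {})[x] = value
    return dist


taxa = ["A", "B", "C", "D"]

# Toy distances from the tree ((A,B),(C,D)): pendant edges of length 1,
# internal edge of length 2.
tree_dist = symmetric({
    ("A", "B"): 2.0, ("C", "D"): 2.0,
    ("A", "C"): 4.0, ("A", "D"): 4.0,
    ("B", "C"): 4.0, ("B", "D"): 4.0,
})

# The same distances with one entry perturbed, so no tree fits exactly.
perturbed = symmetric({
    ("A", "B"): 2.0, ("C", "D"): 2.0,
    ("A", "C"): 5.0, ("A", "D"): 4.0,
    ("B", "C"): 4.0, ("B", "D"): 4.0,
})

tree_gap = four_point_violation(tree_dist, taxa)
perturbed_gap = four_point_violation(perturbed, taxa)
```

With real data the violation is never exactly zero because of sampling noise, so any cutoff for "approximately tree-like" would be an additional assumption not covered by this sketch.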
Based on this theory, we developed statistical methods to analyze genomic data for network inference, including a test to decide if a candidate network fits genomic data adequately, methods to analyze the evolution of traits on a network, and methods to simulate gene trees from networks. These methods were tested and implemented in open-source software with extensive documentation, for wide distribution and use by the research community. Network visualization software was also developed to support other analyses.
We applied these methods and pre-existing methods to genetic variation in herpesviruses. We developed a novel nucleotide distance-based criterion, to be used alongside other criteria, for species delimitation in viruses. Using this criterion, we discovered that the two monkey alphaherpesviruses are separate species. When conducting extensive analyses of simulated data and of empirical data on bovine herpesviruses, we uncovered a lack of robustness in various methods that infer phylogenetic networks and detect gene flow or hybridization. Some methods were found to be very sensitive to violations of their assumptions, or to the choice of tuning parameters or prior distributions. These results are significant because they inform method choice and best practices for discovering past recombination events.
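The report does not specify the distance-based criterion, so the following is only a sketch of the simplest nucleotide distance such a criterion could build on: the uncorrected p-distance, i.e. the proportion of differing sites between two aligned sequences. The function name, the choice to skip gaps and ambiguous bases, and the toy sequences are assumptions made for illustration.

```python
def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned nucleotide sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption of this sketch; other conventions exist).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    unambiguous = set("ACGT")
    compared = 0
    different = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in unambiguous and y in unambiguous:
            compared += 1
            if x != y:
                different += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    return different / compared


# Toy aligned sequences (hypothetical, not real viral data).
d_one_mismatch = p_distance("ACGTACGT", "ACGTACGA")  # 1 of 8 sites differs
d_with_gaps = p_distance("AC-TN", "ACGTA")           # gap and N sites skipped
```

A delimitation rule would compare such distances within and between candidate species; the thresholds and any correction for multiple substitutions would have to come from the published method.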
The project supported the interdisciplinary training of seven undergraduate students and nine graduate students in mathematics, statistics, software development and genomic research. Results were presented to professional scientists in multiple peer-reviewed publications (13 already published), research talks and online resources on computational methods. During the funding period the investigators also participated in mentoring for middle school students from underrepresented groups. New course materials on relevant mathematical techniques and their applications were developed and are freely available.
Last Modified: 06/30/2023
Modified by: Cecile M Ane