
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | February 14, 2019 |
Latest Amendment Date: | May 19, 2021 |
Award Number: | 1845967 |
Award Instrument: | Continuing Grant |
Program Manager: |
Sylvia Spengler sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | February 15, 2019 |
End Date: | January 31, 2024 (Estimated) |
Total Intended Award Amount: | $549,239.00 |
Total Awarded Amount to Date: | $549,239.00 |
Funds Obligated to Date: |
FY 2020 = $106,414.00 FY 2021 = $67,025.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
9500 GILMAN DR LA JOLLA CA US 92093-0021 (858)534-4896 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
9500 Gilman Drive San Diego CA US 92093-0407 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Info Integration & Informatics, Systematics & Biodiversity Sci |
Primary Program Source: |
01002021DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating, but more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken) but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem and, as it turns out, a difficult one. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree-of-life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to model adequately and hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to the limits of their scalability. Thus, an improved understanding of the tree-of-life requires not just more data but also better algorithms. Interestingly, as data sciences permeate many areas of science, the issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.
This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, which has two sources: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from the complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science among undergraduate and K-12 students, emphasizing for them both the excitement and the challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This CAREER project resulted in significant advances in the field of phylogenomics (inference of evolutionary histories of species from genomic data) and related fields. Regarding Aim 1, the field has identified the ability to handle errors introduced by the analytical pipelines used to infer species trees as important. This project created several algorithms aimed at detecting and ameliorating such errors. Some of these methods (e.g., TAPER) directly look for such errors. Others (e.g., wASTRAL) reduce their impact by changing how downstream analyses are done. Yet others (e.g., ASTRAL-Pro) eliminate the need for some of the error-prone steps by using alternative methods. The second aim of the project, to improve the accuracy and scalability of methods that infer both gene trees and species trees (perhaps jointly), was addressed using several methods. A theme of these methods was that many (though not all) were non-parametric, using quartet frequencies to infer evolutionary histories in ways that are statistically consistent under several models. Overall, the research side of the project led to 26 papers, 17 software tools, and several public datasets. These tools have already been adopted by many biologists in their analyses and will help advance the state-of-the-art in phylogenomics.
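To give a sense of what "quartet frequencies" means, the following is a minimal illustrative sketch and not the algorithm used by ASTRAL, wASTRAL, or any other tool from this project. It assumes gene trees are given as nested Python tuples over the same set of taxon names, and the function names (`clades`, `quartet_topology`, `quartet_frequencies`) are hypothetical. For every set of four taxa, it counts how often each of the three possible unrooted groupings appears across the gene trees.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of `tree`, list of leaf sets of all its internal nodes)."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(clade_list, quartet):
    """Return the grouping (e.g. 'AB|CD') the tree induces on four taxa, or None."""
    a = quartet[0]
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:
            # This clade separates two of the four taxa from the other two.
            pair = inside if a in inside else q - inside
            partner = next(iter(pair - {a}))
            rest = sorted(q - pair)
            return f"{a}{partner}|{rest[0]}{rest[1]}"
    return None                        # unresolved for this quartet


def quartet_frequencies(trees):
    """Count induced quartet groupings across gene trees (same taxa assumed)."""
    taxa = sorted(clades(trees[0])[0])
    clade_lists = [clades(t)[1] for t in trees]
    counts = {}
    for quartet in combinations(taxa, 4):
        tally = Counter()
        for clade_list in clade_lists:
            topology = quartet_topology(clade_list, quartet)
            if topology is not None:
                tally[topology] += 1
        counts[quartet] = tally
    return counts


# Three toy gene trees (made-up data for illustration only).
gene_trees = [
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
    ((("A", "C"), "B"), "D"),
]
freqs = quartet_frequencies(gene_trees)
print(freqs[("A", "B", "C", "D")].most_common())
```

In this toy input, two of the three gene trees group A with B, so that grouping has the highest count. Quartet-based species tree methods build on tallies of this kind, although real tools use far more efficient algorithms than this exhaustive enumeration.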
The educational side of this project was integrated with the research. Some of the theories behind the methods developed were integrated into the PI's graduate course. In addition, the tools developed were demonstrated to biologists in various workshops and software schools. Undergraduate researchers were mentored in the process and published refereed articles. Thus, the project helped advance both research and education.
Last Modified: 06/02/2024
Modified by: Siavash Mir Arabbaygi