NSF Award Search: Award # 1845967

Award Abstract # 1845967

CAREER: Robust and scalable genome-wide phylogenetics

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF CALIFORNIA, SAN DIEGO
Initial Amendment Date:	February 14, 2019
Latest Amendment Date:	May 19, 2021
Award Number:	1845967
Award Instrument:	Continuing Grant
Program Manager:	Sylvia Spengler, sspengle@nsf.gov, (703)292-7347; IIS Division of Information & Intelligent Systems; CSE Directorate for Computer and Information Science and Engineering
Start Date:	February 15, 2019
End Date:	January 31, 2024 (Estimated)
Total Intended Award Amount:	$549,239.00
Total Awarded Amount to Date:	$549,239.00
Funds Obligated to Date:	FY 2019 = $375,800.00; FY 2020 = $106,414.00; FY 2021 = $67,025.00
History of Investigator:	Siavash Mir Arabbaygi (Principal Investigator), smirarab@ucsd.edu
Recipient Sponsored Research Office:	University of California-San Diego, 9500 GILMAN DR, LA JOLLA, CA, US 92093-0021, (858)534-4896
Sponsor Congressional District:	50
Primary Place of Performance:	UC San Diego, 9500 Gilman Drive, San Diego, CA, US 92093-0407
Primary Place of Performance Congressional District:	50
Unique Entity Identifier (UEI):	UYTTZT6G9DT1
Parent UEI:
NSF Program(s):	Info Integration & Informatics, Systematics & Biodiversity Sci
Primary Program Source:	01001920DB NSF RESEARCH & RELATED ACTIVIT; 01002021DB NSF RESEARCH & RELATED ACTIVIT; 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7364, 1045
Program Element Code(s):	736400, 737400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating, but more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken) but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem, and as it turns out, is a difficult problem. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree-of-life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to adequately model and are hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to their limits of scalability. Thus, an improved understanding of the tree-of-life requires not just more data but also better algorithms. Interestingly, as data sciences permeate many areas of science, issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.

This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, and there are two sources of data heterogeneity: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science for undergraduate and K-12 students, emphasizing for them both the excitement and challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 26)

Zhang, Chao and Zhao, Yiming and Braun, Edward L. and Mirarab, Siavash "TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution" Methods in Ecology and Evolution, v.12, 2021 https://doi.org/10.1111/2041-210X.13696

Zhang, Chao and Scornavacca, Celine and Molloy, Erin K and Mirarab, Siavash "ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy" Molecular Biology and Evolution, 2020 https://doi.org/10.1093/molbev/msaa139

Zhang, Chao and Mirarab, Siavash (Schwartz, Russell, ed.) "ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees" Bioinformatics, 2022 https://doi.org/10.1093/bioinformatics/btac620

Zhang, Chao and Mirarab, Siavash "Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees" Molecular Biology and Evolution, v.39, 2022 https://doi.org/10.1093/molbev/msac215

Balaban, Metin and Mirarab, Siavash "Phylogenetic double placement of mixed samples" Bioinformatics, v.36, 2020 https://doi.org/10.1093/bioinformatics/btaa489

Balaban, Metin and Moshiri, Niema and Mai, Uyen and Jia, Xingfan and Mirarab, Siavash and Bozdag, Serdar "TreeCluster: Clustering biological sequences using phylogenetic trees" PLOS ONE, v.14, 2019 https://doi.org/10.1371/journal.pone.0221068

Arasti, Shayesteh and Mirarab, Siavash "Median quartet tree search algorithms using optimal subtree prune and regraft" Algorithms for Molecular Biology, v.19, 2024 https://doi.org/10.1186/s13015-024-00257-3

Balaban, Metin and Bristy, Nishat Anjum and Faisal, Ahnaf and Bayzid, Md Shamsuzzoha and Mirarab, Siavash (Lengauer, Thomas, ed.) "Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model" Bioinformatics Advances, v.2, 2022 https://doi.org/10.1093/bioadv/vbac055

Tilic, Ekin and Sayyari, Erfan and Stiller, Josefin and Mirarab, Siavash and Rouse, Greg W. "More is needed - Thousands of loci are required to elucidate the relationships of the flowers of the sea (Sabellida, Annelida)" Molecular Phylogenetics and Evolution, v.151, 2020 https://doi.org/10.1016/j.ympev.2020.106892

Tabatabaee, Yasamin and Zhang, Chao and Warnow, Tandy and Mirarab, Siavash "Phylogenomic branch length estimation using quartets" Bioinformatics, v.39, 2023 https://doi.org/10.1093/bioinformatics/btad221

Sayyari, Erfan and Kawas, Ban and Mirarab, Siavash "TADA: phylogenetic augmentation of microbiome samples enhances phenotype classification" Bioinformatics, v.35, 2019 https://doi.org/10.1093/bioinformatics/btz394

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This CAREER project resulted in significant advances in the field of phylogenomics (inference of evolutionary histories of species from genomic data) and related fields. The first aim concerned errors introduced by the analytical pipelines used to infer species trees; the field has identified the ability to handle such errors as important. This project created several algorithms designed to detect and ameliorate such errors. Some of these methods (e.g., TAPER) look for such errors directly. Others (e.g., wASTRAL) reduce their impact by changing how downstream analyses are done. Yet others (e.g., ASTRAL-Pro) eliminate the need for some of the error-prone steps by using alternative methods. The second aim of the project, to improve the accuracy and scalability of methods to infer both gene trees and species trees (perhaps jointly), was addressed using several methods. A theme of these methods was that many (though not all) were non-parametric, using quartet frequencies to infer evolutionary histories in ways that are statistically consistent under several models. Overall, the research side of the project led to 26 papers, 17 software tools, and several public datasets. These tools have already been adopted by many biologists in their analyses and will help advance the state-of-the-art in phylogenomics.

The educational side of this project was integrated with the research. Some of the theories behind the methods developed were integrated into the PI's graduate course. In addition, the tools were demonstrated to biologists in various workshops and software schools. Undergraduate researchers were mentored in the process and published refereed articles. Thus, the project helped advance both research and education.
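
As an illustration of the "quartet frequencies" mentioned in the report, the following is a minimal sketch of how the relative frequency of each four-taxon topology could be tallied across a set of gene trees. It is not the algorithm used by ASTRAL or any other tool from this project. The nested-tuple tree encoding, the function names (clades, quartet_topology, quartet_frequencies), and the toy gene trees are all assumptions made for this example.

```python
from collections import Counter

def clades(tree):
    """Return (all leaves of tree, leaf sets of every internal node).

    A tree is a hypothetical nested-tuple encoding: a leaf is a string,
    an internal node is a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topology(tree, quartet):
    """Return the split induced on four taxa, e.g. 'A,B|C,D', or None."""
    q = frozenset(quartet)
    all_leaves, internal = clades(tree)
    if len(q) != 4 or not q <= all_leaves:
        return None  # quartet is malformed or a taxon is missing
    for clade in internal:
        side = clade & q
        if len(side) == 2:
            # This clade separates two of the taxa from the other two.
            pairs = sorted(",".join(sorted(p)) for p in (side, q - side))
            return "|".join(pairs)
    return None  # unresolved (star-like) for these four taxa

def quartet_frequencies(gene_trees, quartet):
    """Relative frequency of each resolved topology for one quartet."""
    counts = Counter()
    for tree in gene_trees:
        topology = quartet_topology(tree, quartet)
        if topology is not None:
            counts[topology] += 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}

# Toy gene trees (invented for illustration only).
gene_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
freqs = quartet_frequencies(gene_trees, ("A", "B", "C", "D"))
print(freqs)
```

In this toy input, three of the four gene trees group A with B, so that topology receives the highest frequency. Quartet-based methods build on tallies of this kind; how they combine them into a full species tree is beyond this sketch.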

Last Modified: 06/02/2024
Modified by: Siavash Mir Arabbaygi

Please report errors in award information by writing to: awardsearch@nsf.gov.