Award Abstract # 1845967
CAREER: Robust and scalable genome-wide phylogenetics

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CALIFORNIA, SAN DIEGO
Initial Amendment Date: February 14, 2019
Latest Amendment Date: May 19, 2021
Award Number: 1845967
Award Instrument: Continuing Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
(703)292-7347
NSF Directorate: CSE
Directorate for Computer and Information Science and Engineering
Start Date: February 15, 2019
End Date: January 31, 2024 (Estimated)
Total Intended Award Amount: $549,239.00
Total Awarded Amount to Date: $549,239.00
Funds Obligated to Date: FY 2019 = $375,800.00
FY 2020 = $106,414.00
FY 2021 = $67,025.00
History of Investigator:
  • Siavash Mir Arabbaygi (Principal Investigator)
    smirarab@ucsd.edu
Recipient Sponsored Research Office: University of California-San Diego
9500 GILMAN DR
LA JOLLA
CA  US  92093-0021
(858)534-4896
Sponsor Congressional District: 50
Primary Place of Performance: UC San Diego
9500 Gilman Drive
San Diego
CA  US  92093-0407
Primary Place of Performance Congressional District: 50
Unique Entity Identifier (UEI): UYTTZT6G9DT1
Parent UEI:
NSF Program(s): Info Integration & Informatics,
Systematics & Biodiversity Sci
Primary Program Source: 01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045, 7364
Program Element Code(s): 736400, 737400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating and, more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken), but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem and, as it turns out, a difficult one. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree of life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to model adequately and hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to the limits of their scalability. Thus, an improved understanding of the tree of life requires not just more data but also better algorithms. Interestingly, as data science permeates many areas of science, the issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.

This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, which has two sources: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from the complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science among undergraduate and K-12 students, emphasizing for them both the excitement and the challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 26)
Arasti, Shayesteh and Mirarab, Siavash "Median quartet tree search algorithms using optimal subtree prune and regraft" Algorithms for Molecular Biology, v.19, 2024 https://doi.org/10.1186/s13015-024-00257-3
Balaban, Metin and Bristy, Nishat Anjum and Faisal, Ahnaf and Bayzid, Md Shamsuzzoha and Mirarab, Siavash and Lengauer, Thomas (ed.) "Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model" Bioinformatics Advances, v.2, 2022 https://doi.org/10.1093/bioadv/vbac055
Balaban, Metin and Jiang, Yueyu and Roush, Daniel and Zhu, Qiyun and Mirarab, Siavash "Fast and accurate distance-based phylogenetic placement using divide and conquer" Molecular Ecology Resources, 2021 https://doi.org/10.1111/1755-0998.13527
Balaban, Metin and Jiang, Yueyu and Zhu, Qiyun and McDonald, Daniel and Knight, Rob and Mirarab, Siavash "Generation of accurate, expandable phylogenomic trees with uDance" Nature Biotechnology, v.42, 2024 https://doi.org/10.1038/s41587-023-01868-8
Balaban, Metin and Mirarab, Siavash "Phylogenetic double placement of mixed samples" Bioinformatics, v.36, 2020 https://doi.org/10.1093/bioinformatics/btaa489
Balaban, Metin and Moshiri, Niema and Mai, Uyen and Jia, Xingfan and Mirarab, Siavash and Bozdag, Serdar "TreeCluster: Clustering biological sequences using phylogenetic trees" PLOS ONE, v.14, 2019 https://doi.org/10.1371/journal.pone.0221068
Chen, Lei and Qiu, Qiang and Jiang, Yu and Wang, Kun and Lin, Zeshan and Li, Zhipeng and Bibi, Faysal and Yang, Yongzhi and Wang, Jinhuan and Nie, Wenhui and Su, Weiting and Liu, Guichun and Li, Qiye and Fu, Weiwei and Pan, Xiangyu and Liu, Chang and Yang "Large-scale ruminant genome sequencing provides insights into their evolution and distinct traits" Science, v.364, 2019 https://doi.org/10.1126/science.aav6202
Elghraoui, Afif and Mirarab, Siavash and Swenson, Krister M. and Valafar, Faramarz and Schwartz, Russell (ed.) "Evaluating impacts of syntenic block detection strategies on rearrangement phylogeny using Mycobacterium tuberculosis isolates" Bioinformatics, v.39, 2023 https://doi.org/10.1093/bioinformatics/btad024
Jiang, Yueyu and Balaban, Metin and Zhu, Qiyun and Mirarab, Siavash and Solis-Lemus, Claudia (ed.) "DEPP: Deep Learning Enables Extending Species Trees using Single Genes" Systematic Biology, v.72, 2022 https://doi.org/10.1093/sysbio/syac031
Mai, Uyen and Mirarab, Siavash "Log Transformation Improves Dating of Phylogenies" Molecular Biology and Evolution, 2020 https://doi.org/10.1093/molbev/msaa222
Mai, Uyen and Mirarab, Siavash and Schwartz, Russell (ed.) "Completing gene trees without species trees in sub-quadratic time" Bioinformatics, v.38, 2022 https://doi.org/10.1093/bioinformatics/btab875

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This CAREER project resulted in significant advances in the field of phylogenomics (inference of evolutionary histories of species from genomic data) and related fields. Under Aim 1, the project addressed errors introduced by the analytical pipelines used to infer species trees, a problem the field has identified as important. The project created several algorithms that aim to detect and ameliorate such errors. Some of these methods (e.g., TAPER) look for such errors directly. Others (e.g., wASTRAL) reduce their impact by changing how downstream analyses are done. Yet others (e.g., ASTRAL-Pro) eliminate the need for some of the error-prone steps by using alternative methods. The second aim of the project, to improve the accuracy and scalability of methods that infer both gene trees and species trees (perhaps jointly), was addressed using several methods. A theme of these methods was that many (though not all) were non-parametric, using quartet frequencies to infer evolutionary histories in ways that are statistically consistent under several models. Overall, the research side of the project led to 26 papers, 17 software tools, and several public datasets. These tools have already been adopted by many biologists in their analyses and will help advance the state of the art in phylogenomics.
The educational side of this project was integrated with the research. Some of the theories behind the methods developed were integrated into the PI's graduate course. In addition, the tools were demonstrated to biologists in various workshops and software schools. Undergraduate researchers were mentored in the process and published refereed articles. Thus, the project helped advance both research and education.


Last Modified: 06/02/2024
Modified by: Siavash Mir Arabbaygi
