NSF Award Search: Award # 1553421 - CAREER: Algorithms for Domain-Level Analysis of Gene Family Evolution

Award Abstract # 1553421

CAREER: Algorithms for Domain-Level Analysis of Gene Family Evolution

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF CONNECTICUT
Initial Amendment Date:	January 29, 2016
Latest Amendment Date:	January 29, 2019
Award Number:	1553421
Award Instrument:	Continuing Grant
Program Manager:	Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date:	February 1, 2016
End Date:	January 31, 2023 (Estimated)
Total Intended Award Amount:	$499,576.00
Total Awarded Amount to Date:	$499,576.00
Funds Obligated to Date:	FY 2016 = $198,388.00; FY 2017 = $93,774.00; FY 2018 = $101,102.00; FY 2019 = $106,312.00
History of Investigator:	Mukul Bansal (Principal Investigator), mukul.bansal@uconn.edu
Recipient Sponsored Research Office:	University of Connecticut, 438 WHITNEY RD EXTENSION UNIT 1133, STORRS CT US 06269-9018, (860)486-3622
Sponsor Congressional District:	02
Primary Place of Performance:	University of Connecticut, 371 Fairfield Way, Storrs CT US 06269-4155
Primary Place of Performance Congressional District:	02
Unique Entity Identifier (UEI):	WNTPS995QBM7
Parent UEI:
NSF Program(s):	Info Integration & Informatics
Primary Program Source:	01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT; 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1045, 7364
Program Element Code(s):	736400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The genome of an organism helps to determine its biology. Understanding how different genes evolve and acquire new functions is a fundamental biological problem, and many computational methods have been developed to study how gene families evolve and change over time in different organisms. These existing methods assume that the gene is the basic unit of evolution and that evolutionary processes such as gene duplication, gene loss, and horizontal gene transfer act on entire genes, rather than on parts of genes. However, it is well known that most genes consist of one or more "protein domains," well-characterized functional units that can be independently lost or gained during evolution, and that domain shuffling is one of the primary mechanisms through which genes evolve and gain new functions. Proper inference and accounting of domain-level evolutionary events is therefore crucial to understanding how genes evolve and function. The proposed research will lay the methodological and algorithmic foundations for a novel computational framework that addresses this critical problem. The new computational framework and algorithms will enable more powerful comparative genomic techniques for understanding gene function and biology, and may also contribute to improvements in human health and agriculture. The proposed research will shape future computational advances in the study of domain, gene, and genome evolution for many years to come, and will also spur the development of more comprehensive computational models in other areas of molecular evolution. The algorithms developed as part of this research will be implemented in a user-friendly software package and made freely available. The project will directly involve two graduate and up to ten undergraduate students, introduce several high-school students to computer science, bioinformatics, and research, and provide training to many high-school science teachers on the role of computer science in biology.

This project will lead to the development of the first "three-tree" model of domain evolution that explicitly captures the interdependence of domain-, gene-, and species-level evolution. The proposed three-tree computational framework is based on phylogenetic reconciliation, where the goal is to find a most parsimonious joint reconciliation of the given gene trees with the species tree and of the given domain trees with the gene trees. The resulting optimization problems will be solved using various algorithmic techniques, including dynamic programming, branch and bound, enumeration and sampling, and local search. The framework will decouple domain-level events from gene-level events and provide a fine-grained view of gene family and domain family evolution that is both more accurate and much easier to interpret. Specific aims include: (i) development of the three-tree computational framework and corresponding algorithms, (ii) enhancement of inference accuracy by accounting for multiple optima and domain tree errors, and (iii) extension of the three-tree framework to microbial gene families by allowing for horizontal gene transfer.
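To make the idea of phylogenetic reconciliation concrete, the sketch below shows the classical two-tree special case only: a single binary gene tree reconciled with a binary species tree under a duplication-loss model, using the standard lowest-common-ancestor (LCA) mapping. It is not the three-tree framework described above and does not handle horizontal gene transfer or domain trees. The tree encoding (nested tuples), the function names, and the example data are all assumptions made for illustration.

```python
# Illustrative sketch: duplication-loss reconciliation of ONE binary gene tree
# with a binary species tree via LCA mapping. Trees are nested 2-tuples whose
# leaves are strings (a hypothetical encoding chosen for brevity).

def index_species_tree(tree):
    """Return parent and depth dictionaries for every node of the species tree."""
    parent, depth = {}, {}
    stack = [(tree, None, 0)]
    while stack:
        node, par, d = stack.pop()
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            for child in node:
                stack.append((child, node, d + 1))
    return parent, depth


def lca(u, v, parent, depth):
    """Lowest common ancestor of species-tree nodes u and v."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def reconcile(gene_tree, species_tree, species_of):
    """Return (duplications, losses) implied by the LCA mapping."""
    parent, depth = index_species_tree(species_tree)
    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        if not isinstance(g, tuple):          # gene leaf: map to its species
            return species_of[g]
        left, right = (visit(child) for child in g)
        m = lca(left, right, parent, depth)   # species node this gene node maps to
        is_dup = m in (left, right)           # a child maps to the same node
        dups += is_dup
        for child_map in (left, right):
            gap = depth[child_map] - depth[m]
            # Each skipped species-tree edge implies one loss; a speciation
            # itself accounts for one edge, a duplication for none.
            losses += gap if is_dup else gap - 1
        return m

    visit(gene_tree)
    return dups, losses


species_tree = (("A", "B"), "C")
gene_tree = (("a1", "b1"), ("a2", "c1"))
species_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
print(reconcile(gene_tree, species_tree, species_of))
```

In this example the root of the gene tree maps to the root of the species tree, as does its right child, so the root is inferred to be a duplication; each of the two resulting gene copies is missing one species, which gives two losses.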

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Keegan Yao and Mukul S. Bansal "Optimal Completion and Comparison of Incomplete Phylogenetic Trees Under Robinson-Foulds Distance" 32nd Annual Symposium on Combinatorial Pattern Matching; Leibniz International Proceedings in Informatics (LIPIcs), v.191, 2021, p.1 https://doi.org/10.4230/LIPIcs.CPM.2021.25

Lei Li and Mukul S. Bansal "An Integer Linear Programming Solution for the Domain-Gene-Species Reconciliation Problem" ACM-BCB 2018: 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018, p.386 https://doi.org/10.1145/3233547.3233603

Lei Li and Mukul S. Bansal "An Integrated Reconciliation Framework for Domain, Gene, and Species Level Evolution" IEEE/ACM Transactions on Computational Biology and Bioinformatics, v.16, 2019, p.63 https://doi.org/10.1109/TCBB.2018.2846253

Lei Li and Mukul S. Bansal "Simultaneous Multi-Domain-Multi-Gene Reconciliation Under the Domain-Gene-Species Reconciliation Model" International Symposium on Bioinformatics Research and Applications (ISBRA 2019), Lecture Notes in Computer Science, v.11490, 2019, p.73 https://doi.org/10.1007/978-3-030-20242-2_7

Misagh Kordi and Mukul S. Bansal "Exact Algorithms for Duplication-Transfer-Loss Reconciliation with Non-Binary Gene Trees" ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB) 2016, 2016, p.297 http://dx.doi.org/10.1145/2975167.2975198

Misagh Kordi and Mukul S. Bansal "Exact Algorithms for Duplication-Transfer-Loss Reconciliation with Non-Binary Gene Trees" IEEE/ACM Transactions on Computational Biology and Bioinformatics, v.16, 2019, p.1077 https://doi.org/10.1109/TCBB.2017.2710342

Misagh Kordi, Soumya Kundu, and Mukul S. Bansal "On Inferring Additive and Replacing Horizontal Gene Transfers Through Phylogenetic Reconciliation" 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB '19), 2019, p.514 https://doi.org/10.1145/3307339.3342168

Mukul S. Bansal "Linear-Time Algorithms for Phylogenetic Tree Completion Under Robinson-Foulds Distance" Algorithms for Molecular Biology, v.15, 2020 https://doi.org/10.1186/s13015-020-00166-1

Mukul S. Bansal "Linear-Time Algorithms for some Phylogenetic Tree Completion Problems Under Robinson-Foulds Distance" RECOMB Comparative Genomics Conference (RECOMB-CG) 2018; Lecture Notes in Computer Science, v.11183, 2018, p.209 http://dx.doi.org/10.1007/978-3-030-00834-5_12

Mukul S. Bansal, Manolis Kellis, Misagh Kordi, Soumya Kundu "RANGER-DTL 2.0: Rigorous Reconstruction of Gene-Family Evolution by Duplication, Transfer, and Loss" Bioinformatics, v.34, 2018, p.3214 https://doi.org/10.1093/bioinformatics/bty314

Soumya Kundu and Mukul S. Bansal "On the Impact of Uncertain Gene Tree Rooting on Duplication-Transfer-Loss Reconciliation" BMC Bioinformatics, v.19, 2018, p.290 https://doi.org/10.1186/s12859-018-2269-0

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The overarching goal of this project was to develop novel computational techniques and algorithms for integrating domain-level, gene-level, and species-level evolution, with the aim of transforming biologists' ability to systematically study the evolution of gene families and protein domains. The project specifically focused on the development of a novel, three-tree phylogenetic reconciliation framework, called the Domain-Gene-Species (DGS) reconciliation framework, and on laying its methodological and algorithmic foundations.

To achieve the specific research goals of this project, we developed many new algorithms and computational methods for analyzing gene and subgene/domain evolution, released several new open-source software packages, and applied the new methods to large-scale simulated and real datasets to demonstrate their impact in practice. Some of the significant methodological and algorithmic innovations enabled by this project include (i) development of the first integrated DGS reconciliation framework that explicitly models the interdependence of domain-, gene-, and species-level evolution, (ii) development of effective exact and heuristic algorithms for computing optimal DGS reconciliations, (iii) extensions of the DGS reconciliation framework to account for multiple domains and to allow for horizontal transfer of genes and domains, along with development of associated algorithms, (iv) development of novel computational techniques for handling gene tree error and uncertainty, (v) development of the first phylogenetic simulation framework for simulating subgene/domain level evolution within gene families, (vi) development of a novel computational approach for identifying gene families affected by domain-level or partial gene transfer, and (vii) development of new computational techniques for improving gene family sequence alignment and gene tree accuracy by accounting for domain gain/loss and rearrangement.

The research supported by this project has resulted in 15 peer-reviewed journal and conference publications, one book chapter, and one manuscript that is currently under review for journal publication. Two of the conference publications received best paper awards. In addition, 2 manuscripts are currently in preparation for journal submission. The project also enabled the development of 9 freely available open-source software packages: RANGER-DTL v2.0 for phylogenetic reconciliation, SEADOG, SEADOG-ILP, SEADOG-MD, and SEADOG-Gen for computing Domain-Gene-Species reconciliations, SaGePhy for probabilistic phylogenetic simulation of gene and subgene evolution, RF+ for phylogenetic tree comparison, trippd for identification of gene families affected by partial gene transfer, and virDTL for viral recombination inference. All software packages are freely available from the PI’s lab website: https://compbio.engr.uconn.edu/software/

Many of the methods and software tools mentioned above are the first of their kind and make it possible to study those aspects of domain and gene family evolution that were either difficult or impossible to study previously. The open-source and easy-to-use software packages developed through this project will allow biologists and other researchers to easily apply the new methods to their own datasets.

This project provided invaluable research experience to 5 PhD students and 7 undergraduate students. Among the 5 PhD students, 3 graduated with their PhDs during this project, with one starting a postdoctoral research position, one returning to their home country to work as a senior algorithm engineer, and one joining the US technology industry. Among the 7 undergraduate students (3 female), 3 are currently pursuing their PhDs in computer science at Stanford, UCLA, and Duke, and a fourth will be joining the PhD program in computer science at the University of Maryland in Fall 2023. The project also enabled the training of 24 middle- and high-school science teachers (of which 19 were female and 6 were African American) from across the US as part of the intensive week-long workshop titled "Bioinformatics: Using computer science to understand life" taught by the PI in the summers of 2016, 2017, 2018, 2019, 2021, and 2022. Participating teachers were trained in the basic principles and techniques of bioinformatics and were provided with simple bioinformatics exercises for use in their own classrooms. Finally, this project also provided research experiences to several Connecticut high-school students.

Last Modified: 05/25/2023
Modified by: Mukul Bansal

