Award Abstract # 1838344
RoL: FELS: EAGER: Design Rules for Multidomain Proteins Across the Tree of Life

NSF Org: DBI
Division of Biological Infrastructure
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: July 9, 2018
Latest Amendment Date: July 9, 2018
Award Number: 1838344
Award Instrument: Standard Grant
Program Manager: Peter McCartney
DBI Division of Biological Infrastructure
BIO Directorate for Biological Sciences
Start Date: August 1, 2018
End Date: July 31, 2021 (Estimated)
Total Intended Award Amount: $299,853.00
Total Awarded Amount to Date: $299,853.00
Funds Obligated to Date: FY 2018 = $299,853.00
History of Investigator:
  • Marie Dannie Durand (Principal Investigator)
    durand@cmu.edu
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3815
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie Mellon University
4400 Fifth Avenue
Pittsburgh
PA  US  15213-2683
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): Cross-BIO Activities
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 068Z, 7916
Program Element Code(s): 727500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.074

ABSTRACT

This project seeks to discover how the biology of the cell shapes the design rules of multidomain proteins. Multidomain proteins are mosaics of sequence fragments that encode structural or functional modules, called domains. The modular nature of a multidomain protein is integral to its function because different constituent domains play different functional roles. For example, in signaling proteins, some domains are responsible for the recognition and others for the transmission of an environmental signal. These modular proteins allow cells to interact with their world, via cell-cell signaling, cellular adhesion, and cellular migration. In human health, multidomain families are fundamental to apoptosis, innate immunity, inflammation response, and tissue repair. The multidomain architectures that are observed in nature represent a tiny fraction of possible domain combinations. These domain combinations are the product of the mutational processes that give rise to new sequence mosaics and the selective forces that promote or discourage their retention. In a given species, mutation and selection are both dependent on genome organization, mechanisms of DNA replication, transcription and repair, and the interaction of the cell with its environment. Multidomain architectures vary substantially across species, as do genomic and cellular properties. This project exploits this comparative framework to investigate how the biology of the cell shapes the processes of multidomain evolution. This research has the potential to transform our understanding of protein evolution by identifying multidomain design rules that may provide a foundation for predictive models linking evolution and function, with concrete applications for human health and protein engineering. This project advances research infrastructure through the development and distribution of computational tools that may contribute to national scientific resources. 
This project also contributes to building a broadly inclusive scientific work force through research experiences for women in Carnegie Mellon's undergraduate program in computational biology.

This project uses a three-pronged approach to investigate the universal and lineage-specific design rules of multidomain proteins. First, computational tools will be developed to infer evolution on the domain, gene, and species levels, by modeling a multidomain family as a set of domains that are co-evolving with the associated genes and species. Each entity is represented by an evolutionary tree. The history of evolutionary events is inferred using topological comparison of the domain, gene, and species trees. Combining information from three levels of biological organization reveals when domain events occurred relative to events in gene, genome, and organismal evolution, providing the information required to investigate how changes in domain architecture correlate with changes in genomic and cellular properties. Second, these methods are applied to reconstruct multidomain evolution in vertebrate and proteobacterial genomes, revealing shared and lineage-specific evolutionary patterns. Third, comparison of these evolutionary patterns with differences in genome organization and cellular machinery in vertebrate and proteobacterial cells will support the inference of design rules for multidomain evolution across the tree of life. The resulting data and computational tools will be available at http://www.cs.cmu.edu/~durand/Lab/multidomain.html.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Cui, Xiaoyue and Stolzer, Maureen and Durand, Dannie "Evidence for exon shuffling is sensitive to model choice" Journal of Bioinformatics and Computational Biology, v.19, 2021 https://doi.org/10.1142/S0219720021400138

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Multidomain proteins are characterized by a mosaic of sequence fragments that encode structural or functional modules, called domains.  A protein's domain architecture (i.e., its constituent domains in N-terminal to C-terminal order) is integral to its function, because different constituent domains are responsible for different subfunctions, for example, recognition and transmission of an environmental signal.  Multidomain architectures are the product of mutational processes that generate new combinations of domains, which in turn depend on gene and genome organization. Multidomain architectures vary substantially across different taxonomic lineages, as do genomic and cellular properties.  This project examined how gene architecture shapes multidomain protein evolution and whether it does so in a lineage-dependent manner.

The exon shuffling hypothesis states that new domain combinations arise through intronic recombination. More generally, this hypothesis posits a link between gene and protein architecture.  If the exon shuffling hypothesis is true, then introns should co-occur with domain boundaries more often than expected by chance.  Prior studies tested this hypothesis on a broad range of eukaryotic genomes and concluded that gene structure contributes to the formation of new domain architectures and does so throughout the eukaryotes.  However, those studies did not examine the relationship between gene architecture and domain architecture in different eukaryotic lineages. 
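
The quantity being tested, the number of domain boundaries that coincide with an intron, can be sketched in a few lines of Python. This is only an illustration and not the project's software: the function name, the shared coordinate system (positions along the coding sequence), and the tolerance `window` are all assumptions introduced here.

```python
from bisect import bisect_left

def count_cooccurrences(intron_positions, domain_boundaries, window=5):
    """Count domain boundaries with at least one intron within `window` positions.

    Both inputs are assumed to be integer coordinates on the same coding
    sequence; `window` is a hypothetical tolerance, not a value from the study.
    """
    introns = sorted(intron_positions)
    hits = 0
    for boundary in domain_boundaries:
        # First intron at or after the left edge of the tolerance window.
        i = bisect_left(introns, boundary - window)
        if i < len(introns) and introns[i] <= boundary + window:
            hits += 1
    return hits
```

For example, `count_cooccurrences([100, 250, 400], [98, 300, 405])` counts the boundaries at 98 and 405 but not the one at 300. The observed count only becomes evidence once it is compared with the count expected under a null model of intron placement.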

We investigated domain-intron co-occurrence in 16 fungal and 5 metazoan genomes. Fungal genomes possess a broad range of intron sizes and frequencies, making this a good comparative system.  We assessed the evidence for exon shuffling in the data set as a whole, and in the animal and fungal genomes separately. 

Prior studies estimated the expected number of intron-domain boundary co-occurrences under the assumption that introns are uniformly distributed across genes.  While computationally tractable, this null model is biologically unrealistic.  The positions of introns in genuine genes are not even approximately uniformly distributed. For this project, we designed and implemented two strategies for randomizing intron positions that provide good approximations of the genuine intron positioning without excessive computational costs. We carried out an explicit comparison of the uniform null model and our two new models with respect to the implied exon length distributions and the propensity to reject the null hypothesis.  We observed that uniformly distributed intron positioning results in a skewed exon length distribution with an excess of short exons. In contrast, exon lengths obtained with our randomization strategies closely approximate the genuine exon length distribution. 
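
To make the role of the null model concrete, the following Python sketch contrasts uniform intron placement with one simple alternative that preserves a gene's observed exon lengths by permuting their order. The report does not describe the project's two randomization strategies, so the exon-permutation sampler is a stand-in chosen for illustration, and every name, the tolerance `window`, and the example coordinates are assumptions.

```python
import random

def uniform_introns(gene_length, n_introns, rng):
    """Uniform null: draw distinct intron positions from 1..gene_length-1."""
    return sorted(rng.sample(range(1, gene_length), n_introns))

def shuffled_exon_introns(intron_positions, gene_length, rng):
    """Illustrative alternative null: permute the observed exon lengths.

    The exon length distribution of the gene is preserved exactly; only the
    order of the exons (and hence the intron positions) changes.
    """
    bounds = [0] + sorted(intron_positions) + [gene_length]
    exon_lengths = [b - a for a, b in zip(bounds, bounds[1:])]
    rng.shuffle(exon_lengths)
    positions, pos = [], 0
    for length in exon_lengths[:-1]:  # the last exon ends at gene_length
        pos += length
        positions.append(pos)
    return positions

def hits(introns, boundaries, window):
    """Number of domain boundaries with an intron within `window` positions."""
    return sum(any(abs(i - b) <= window for i in introns) for b in boundaries)

def empirical_p(observed_introns, boundaries, sampler, window=5,
                n_iter=1000, seed=0):
    """One-sided empirical p-value for the observed co-occurrence count."""
    rng = random.Random(seed)
    observed = hits(observed_introns, boundaries, window)
    at_least = sum(hits(sampler(rng), boundaries, window) >= observed
                   for _ in range(n_iter))
    return (at_least + 1) / (n_iter + 1)

# Hypothetical single-gene example (coordinates invented for illustration).
introns, boundaries, gene_length = [120, 300, 510], [118, 305], 600
p_uniform = empirical_p(
    introns, boundaries,
    lambda rng: uniform_introns(gene_length, len(introns), rng))
p_shuffled = empirical_p(
    introns, boundaries,
    lambda rng: shuffled_exon_introns(introns, gene_length, rng))
```

Passing the sampler as a function keeps the test statistic fixed while the null model is swapped, which mirrors the comparison described above: the same observed data can look more or less significant depending on how the randomized intron positions are generated.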

We further observed that the three models differ substantially in their ability to reject the null hypothesis.  Relative to our methods, the uniform model dramatically underestimates the expected frequency of agreement between introns and domain boundaries.  This, in turn, exaggerates the significance of domain-intron co-occurrence in the genuine data.  In animal genomes, the association between domains and introns was significant with all null models, suggesting that this association is so strong that it exceeds chance expectation even when the null model of intron positioning is inaccurate.  In contrast, only half of the fungal genomes tested displayed a significant association with all three models.  Moreover, even when this association is significant, the magnitude of the association is small. In the vertebrate genomes tested, 21% of all domain instances are flanked by introns at both ends. In contrast, only 7% of invertebrate and 3% of fungal domains satisfy this criterion.

In summary, the results of this project challenge the idea that exon shuffling has played an important role throughout the eukaryotic tree.  More accurate statistical tests developed for this project reveal that the genome-scale statistical association between gene architecture and domain architecture in fungi is weak and the effect size is small, suggesting that the contribution of exon shuffling to multidomain protein evolution in fungi is minor. This suggests that the evolutionary processes that shape multidomain protein evolution differ in different eukaryotic lineages.  The evidence for exon shuffling outside of metazoa should be revisited using more sensitive statistical models.

This project contributed to national scientific resources in the form of new, publicly available software, including a tree-based method for reconstruction of domain shuffling events and randomization heuristics that support realistic, conservative statistical tests without excessive computational costs.  This project has provided research training for undergraduate and doctoral students and postdoctoral fellows. 

 


Last Modified: 11/30/2022
Modified by: Marie Dannie Durand

Please report errors in award information by writing to: awardsearch@nsf.gov.
