
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 31, 2021 |
Latest Amendment Date: | August 31, 2021 |
Award Number: | 2125218 |
Award Instrument: | Standard Grant |
Program Manager: | Sylvia Spengler, sspengle@nsf.gov, (703)292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2021 |
End Date: | September 30, 2025 (Estimated) |
Total Intended Award Amount: | $500,000.00 |
Total Awarded Amount to Date: | $500,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 910 WEST FRANKLIN ST RICHMOND VA US 23284-9005 (804)828-6772 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 401 West Main Street Richmond VA US 23298-0568 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Proteins are remarkable biological machines. Hundreds of millions of protein sequences have been decoded over the last two decades, creating a significant knowledge gap: we do not know what most of these proteins do. A common way to decipher protein function relies on the sequence-to-structure-to-function paradigm, in which function is inferred from the protein structure that is in turn derived from the sequence. However, recent research has identified a large family of intrinsically disordered proteins that lack a stable structure under physiological conditions and therefore cannot be characterized using structure-based approaches. These proteins are particularly abundant in eukaryotes and are involved in the pathogenesis of numerous human diseases. The discovery of intrinsically disordered proteins has prompted the development of a new generation of computational methods that predict the presence of intrinsic disorder directly from protein sequences. The recently completed Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment has shown that these methods are fast and provide accurate results. However, while intrinsic disorder can be readily and accurately identified in protein sequences, its function remains a mystery. This project will conceptualize, design, implement, test and deploy an innovative machine learning method that provides highly accurate and integrated predictions of disorder and disorder functions directly from protein sequences. The team will use this method to produce functional annotations of disorder on an unprecedented scale of tens of millions of proteins, addressing the knowledge gap for this protein family. In the long run, this project will advance the understanding of fundamental biological processes and related human health issues in the context of intrinsically disordered proteins.
This project will also train STEM students and researchers via high-school outreach and multidisciplinary teaching and mentoring of undergraduate and graduate students and postdoctoral researchers, producing highly skilled researchers who are sought after by industry and academia.
The team addresses a challenging interdisciplinary problem, the characterization of intrinsically disordered proteins, at the intersection of bioinformatics and machine learning. Building on expertise in the computational analysis of intrinsic disorder and with a focus on technical innovation, this project will deliver a novel deep sequential multi-label transformer architecture that provides accurate predictions of disorder and disorder functions. The solution will be designed to accommodate the biological underpinnings of protein data, such as inherently multi-label outcomes, imbalanced labels and the sequential nature of protein data. Moreover, this architecture will feature a modular design to facilitate transfer to other areas of protein and nucleic acid bioinformatics. The resulting method will be extensively benchmarked and disseminated to maximize impact. The code will be deposited into relevant public repositories, and pre-computed functional annotations of intrinsic disorder will be made available through modern online resources, such as data repositories and webservers, to meet the needs of a broad spectrum of users including biologists, biochemists, biophysicists and bioinformaticians.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.