
NSF Org: | DBI Division of Biological Infrastructure |
Recipient: |
|
Initial Amendment Date: | August 19, 2015 |
Latest Amendment Date: | August 4, 2020 |
Award Number: | 1458359 |
Award Instrument: | Standard Grant |
Program Manager: | Peter McCartney, DBI Division of Biological Infrastructure, BIO Directorate for Biological Sciences |
Start Date: | September 1, 2015 |
End Date: | August 31, 2021 (Estimated) |
Total Intended Award Amount: | $506,490.00 |
Total Awarded Amount to Date: | $506,490.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
1350 BEARDSHEAR HALL AMES IA US 50011-2103 (515)294-5225 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
1138 Pearson Hall Ames IA US 50011-2207 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | ADVANCES IN BIO INFORMATICS |
Primary Program Source: |
|
Program Reference Code(s): | |
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.074 |
ABSTRACT
Biologists are deluged with sequence data yet have derived comparatively little biological information from it. The accurate annotation of protein function is key to understanding life, but experimentally determining what each protein does is costly and difficult, and cannot scale up to accommodate the vast amount of sequence data already available. Therefore, discovering protein function by computational rather than experimental means is of primary importance. Genomic sequence data are available from thousands of species, and those are coupled with massive high-throughput experimental data. Together, these data have created new opportunities as well as challenges for computational function prediction. As a result, many computational annotation methods have been developed by research groups worldwide, but their accuracy and applicability need to be improved. The mission of the Automated Function Prediction Special Interest Group (AFP-SIG) is to bring together computational biologists, experimental biologists and biocurators who are dealing with the important problem of predicting protein function, to share ideas, and to create collaborations. To improve computational function prediction methods, the Critical Assessment of protein Function Annotation algorithms (CAFA) was established as an ongoing experiment. CAFA was designed to provide a large-scale assessment of computational methods dedicated to predicting protein function. By challenging dozens of research groups worldwide to develop and provide their best software for function prediction, the researchers involved in the AFP-SIG will improve the ability of biologists to understand life at the molecular level. The AFP-SIG researchers will also generate experimental data from fruit flies, fungi and bacteria, which will serve as benchmarks to test the software participating in CAFA and will deepen understanding of these model organisms.
It is now possible to collect data that comprehensively profile many different states of complex biological systems. Using these data it should be possible to understand and explain the underlying systems, but significant challenges remain. One of the primary challenges is that, as researchers collect more data from many different organisms in many different systems, they discover more and different genes. Assigning functions to these newly discovered genes represents a key step towards interpretation of high-throughput data. This leads to a critical need to assess the quality of the function prediction methods that researchers have developed in recent years. The mission of the Automated Function Prediction Special Interest Group (AFP-SIG), founded in 2005, is to bring together bioinformaticians and biologists who are addressing this key challenge of gene function prediction. In addition to sharing ideas and creating collaborations, AFP-SIG has created CAFA: the Critical Assessment of (protein) Function Annotation. CAFA is a community-driven challenge to assess the performance of protein function prediction software, and it has been carried out twice since 2010. The investigators will provide the following outcomes: (1) robust open-source software to be used in function prediction and assessment of function prediction methods, incorporated into the high-profile annotation pipelines of UniProt-GOA; (2) expansion of the AFP community by engaging bioinformaticians, biocurators and experimentalists, thereby improving the quality and relevance of function prediction methods; (3) large-scale experimental screens in Drosophila, Candida and Pseudomonas for novel associations of targeted functional terms with genes; (4) an expanded CAFA event, incorporating both the curated annotations from the literature and our own experimental screens, in the last two years of the project. The progress of the AFP-SIG and CAFA will be available from http://BioFunctionPrediction.org.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Cheap and fast genome sequencing has given us a plethora of genomic data. But while we have the ability to sequence full genomes on a massive scale, our understanding of the functions of genes still lags woefully behind. This phenomenon is akin to a library acquiring dozens of new books each day while its patrons are able to comprehend only 0-70% of the words in each book.
Understanding the meaning of the words, or the function of the genes, is one of the main challenges we face in the genomic era.
INTELLECTUAL MERIT: The chief scientific advance of this project is a dramatic improvement in our ability, as a scientific community, to computationally predict protein function, together with providing these tools to experimentalists for quickly creating falsifiable hypotheses that can be experimentally verified. One of the metrics used to gauge method performance is Fmax, which is normalized on a scale of 0-1. For prediction of certain aspects of protein function, Fmax has improved from 0.4 to 0.7. Furthermore, new machine-learning-based methods have been developed within the framework of CAFA for predicting protein function. It is unlikely that such a broad effort to develop so many new methods would have taken place without the incentive of the CAFA competition. Leveraging the framework of CAFA, we have also applied predictions to plant phenotypes, and explored the utility of classification by citizen scientists and Amazon MTurkers to create training data for prediction algorithms. Another project involved predicting genes involved in long-term memory in fruit flies, and then verifying that they do play such a role using an mRNA knock-down assay. We have also developed a novel method for classifying antibiotic resistance genes, with excellent accuracy, using an adaptation of a machine learning method commonly used for document classification (Word2vec), and we intend to continue this line of research with experimental verifications. Additionally, 10 trainees have been involved in biocuration and in creating prediction targets from whole-genome experiments.
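As a rough illustration of the Fmax metric mentioned above, the following is a minimal sketch of the usual protein-centric definition: for each score threshold, precision is averaged over proteins with at least one prediction at that threshold, recall is averaged over all benchmark proteins, and Fmax is the largest harmonic mean of the two across thresholds. The function name `fmax`, the dictionary layout, and the 100-step threshold grid are assumptions made for this example; it is not the official CAFA evaluation code, and it omits steps such as propagating terms through the ontology.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Illustrative Fmax: best F-measure over a grid of score thresholds.

    predictions maps protein -> {term: confidence score in [0, 1]}.
    truth maps protein -> non-empty set of experimentally supported terms.
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []      # one entry per protein with >= 1 prediction
        recall_sum = 0.0     # summed over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Under this definition a method that recovers every benchmark term with no false positives at some threshold scores 1.0, and a method with no correct predictions scores 0.0, which matches the 0-1 scale described in the report.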
BROADER IMPACTS: The main impact from this award was the initiation and growth of a large international community of computational function predictors, biocurators, and experimental biologists all working together under the umbrella of CAFA and the Function COSI (Community of Special Interest) to improve computational function prediction methods. We estimate this community to consist of some 70 groups, with the mailing list containing 200 active members. Members of this community have been gathering at the annual Function COSI meeting to share and debate the merits of methods, assessment metrics, and the best ways to improve and expand community activities in the field.
The creation and growth of this community, and the resulting progress via the CAFA challenge, have transformed the way the broader life-science community addresses computational function prediction. One illustrative example is that the journal Nucleic Acids Research now requires CAFA-like assessment results to publish function prediction methods. The three papers reporting the results of the three CAFA challenges have been collectively cited over 1,300 times, attesting to the high interest in the broader biological community. A literature survey provided us with an estimate of at least 70 trainees, mostly graduate students along with postdocs, who, in part or in whole, have been working on developing CAFA-competitive methods over the funding period of this grant. This is a large impact not only on training, but also on the ecosystem and availability of different software methods for computational function prediction.
In the PI's lab, funding from this award was used to train (in full or in part) three PhD students, two programmers who were trained in bioinformatics, and one postdoc. The PI also taught two undergraduate courses in computational genomics and one graduate course in the same discipline, where he also introduced Gene Ontology and computational function prediction to the curriculum.
Last Modified: 12/14/2021
Modified by: Iddo Friedberg