
NSF Org: | IOS Division Of Integrative Organismal Systems |
Recipient: | |
Initial Amendment Date: | August 19, 2016 |
Latest Amendment Date: | August 30, 2018 |
Award Number: | 1546838 |
Award Instrument: | Continuing Grant |
Program Manager: | Gerald Schoenknecht, gschoenk@nsf.gov, (703)292-5076, IOS Division Of Integrative Organismal Systems, BIO Directorate for Biological Sciences |
Start Date: | August 15, 2016 |
End Date: | July 31, 2022 (Estimated) |
Total Intended Award Amount: | $2,193,335.00 |
Total Awarded Amount to Date: | $2,193,335.00 |
Funds Obligated to Date: | FY 2017 = $1,104,329.00; FY 2018 = $546,490.00 |
History of Investigator: | |
Recipient Sponsored Research Office: | 5241 BROAD BRANCH RD NW WASHINGTON DC US 20015-1305 (202)387-6400 |
Sponsor Congressional District: | |
Primary Place of Performance: | 260 Panama Street Stanford CA US 94305-4101 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | Plant Genome Research Project, Cross-BIO Activities |
Primary Program Source: | 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.074 |
ABSTRACT
It has been estimated that agricultural productivity needs to be increased to meet the demands imposed by population growth and climate change. Changing the metabolism of crop species is one way to improve productivity. Thus, increasing our knowledge of plant metabolism can significantly accelerate crop improvement efforts. New DNA sequencing technologies have produced an enormous amount of data. However, it has been difficult to obtain useful metabolic information from those DNA sequences. The plant research community needs efficient tools that can extract information related to metabolism from those DNA sequences. This project will produce the tools and datasets that will be used to systematically characterize the components of metabolism: enzymes, transporters, and pathways. These tools will make it easy to compare the metabolic genetic potential of two or more species, and enable the identification of targets for crop improvement. This project will also offer training opportunities in biochemistry and computer sciences to postdoctoral associates and students. In addition, workshops will be offered at professional meetings to train members of the plant research community on the use of the tools developed by the project. Finally, the tools developed by this project will be made available to the scientific community through a web portal.
Accurate and rapid annotation of metabolic enzymes and transporters from sequenced genomes, together with the corresponding metabolic network reconstructions, are essential resources for interpreting 'omics' data systematically and generating new hypotheses. This proposal aims to meet these needs by developing a computational pipeline that rapidly and accurately predicts the genome-scale metabolic complement of any sequenced plant, based on the large pool of experimentally characterized information. First, the team will improve the accuracy of enzyme function prediction by adding new classifiers and features to a redesigned machine-learning framework. New classifiers, such as phylogenomics-based function prediction, and new features, such as conserved protein domain architecture and conserved residues, should reduce false-positive predictions for proteins that share high sequence similarity with known enzymes but catalyze distinct reactions. The team will also develop a new learning-based algorithm to predict subcellular locations of enzymes and reactions for any plant species. The algorithm will combine two sources of evidence: the localization likelihoods of enzymes, derived from the experimentally determined localization of their orthologs, and the localization of neighboring reactions in the metabolic network. It will then propagate the localization likelihoods among all the reactions in the network. Another new algorithm will be developed to predict transporters and their substrates. All data generated from this project will be integrated into the Plant Metabolic Network (PMN) databases. In addition, a pipeline will be packaged so that users can submit their genome sequences online and obtain the prediction results through a web server. Finally, innovative, integrated views of metabolic pathways with gene co-expression, transporters, and subcellular compartments will be developed.
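The propagation step described above can be illustrated with a generic label-propagation sketch. This is not the project's algorithm, which the abstract does not specify in detail. The function names, the `alpha` mixing weight, the fixed iteration count, and the simple neighbor averaging are all assumptions made for illustration.

```python
from typing import Dict, List

Likelihoods = Dict[str, float]  # compartment name -> likelihood


def normalize(scores: Likelihoods) -> Likelihoods:
    """Scale scores to sum to 1; fall back to uniform when there is no evidence."""
    total = sum(scores.values())
    if total == 0:
        return {c: 1.0 / len(scores) for c in scores}
    return {c: v / total for c, v in scores.items()}


def propagate_localization(
    priors: Dict[str, Likelihoods],   # reaction -> ortholog-derived likelihoods
    neighbors: Dict[str, List[str]],  # reaction -> adjacent reactions in the network
    compartments: List[str],
    alpha: float = 0.5,               # assumed weight on a reaction's own prior
    iterations: int = 20,             # assumed fixed number of sweeps
) -> Dict[str, Likelihoods]:
    """Blend each reaction's prior with the average of its neighbors, repeatedly."""
    reactions = sorted(
        set(priors) | set(neighbors) | {n for ns in neighbors.values() for n in ns}
    )
    base = {
        r: normalize({c: priors.get(r, {}).get(c, 0.0) for c in compartments})
        for r in reactions
    }
    current = dict(base)
    for _ in range(iterations):
        updated = {}
        for r in reactions:
            nbrs = neighbors.get(r, [])
            if not nbrs:
                # Isolated reactions keep their prior unchanged.
                updated[r] = base[r]
                continue
            avg = {c: sum(current[n][c] for n in nbrs) / len(nbrs) for c in compartments}
            updated[r] = normalize(
                {c: alpha * base[r][c] + (1 - alpha) * avg[c] for c in compartments}
            )
        current = updated
    return current


if __name__ == "__main__":
    # Hypothetical three-reaction chain; r2 has no ortholog evidence of its own.
    priors = {
        "r1": {"plastid": 0.9, "cytosol": 0.1},
        "r3": {"plastid": 0.8, "cytosol": 0.2},
    }
    neighbors = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
    print(propagate_localization(priors, neighbors, ["plastid", "cytosol"])["r2"])
```

In this toy network, a reaction with no direct evidence inherits a preference for the compartment favored by its neighbors, which is the intuition behind using network context.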
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
During the course of this grant, we accomplished a massive expansion of the Plant Metabolic Network (PMN, https://plantcyc.org), an online database of plant metabolism. We published six major releases, bringing the number of single-species databases from 21 to 126. Each database contains enzymes predicted from the genome using our in-house pipeline E2P2, and reactions, pathways, and compounds implied by that enzyme set, along with varying degrees of curated information from the literature. PMN has added 161 metabolic pathways, 3,654 reactions, 2,777 compounds, and 1,065,384 proteins since the start of the grant. We have also added 8,705 proteins to our reference protein sequence dataset (RPSD) used to make enzyme function predictions.
PMN has been an invaluable tool for many plant biologists, with more than 1,100 unique visitors per month, 1,200 users registered for full-database downloads, and 429 literature citations. Common uses include transforming omics data, interrogating hypotheses about the evolution of plant metabolism, and annotating new genomes using the PMN BLAST feature. Our own group has published 28 papers on plant metabolism with support from this grant, including two major PMN update papers. Twenty-nine people were trained on this grant, including 6 postdocs, 6 postbac research assistants, 3 biocurators, and 14 undergraduate interns. This cohort included 15 women (52%), 15 people of color (52%), and 2 URMs (7%). All are still in STEM fields and many have moved on to the next stage of their careers, including 4 in PhD programs, 2 in MS programs, 4 in industry, 1 in government, and 1 in an academic lab position.
We made significant improvements to our database-generation pipeline. The Ensemble Enzyme Prediction Pipeline (E2P2) and the associated RPSD that it uses to make predictions have been kept up to date with new information from the literature. Numerous classifiers have been tested for potential addition to E2P2, and two (the neural network-based DeepEC and the structure-based AlphaFold+TMAlign) have been selected for integration into E2P2 for future releases. We also developed a natural language processing (NLP) machine learning model to identify and assess papers with enzyme function information and to distinguish papers whose enzyme classifications are based on experimental data from those based on computational prediction. We used this tool to filter enzyme function information from the BRENDA database for inclusion in the RPSD, to prevent computational predictions from being used to make more computational predictions. The semi-automated validation infrastructure (SAVI) is software that lets biocurators enter rules for inclusion or exclusion of specific pathways from the plant databases based on their phylogenetic placement. We added rules for more than 300 pathways to SAVI, bringing the total number of pathways in SAVI to 1,352.
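As a rough illustration of rule-based pathway inclusion and exclusion by phylogenetic placement, the following sketch filters predicted pathways against a species lineage. It does not reflect SAVI's actual rule format or code. The `PathwayRule` class, the function names, and the pathway identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class PathwayRule:
    """Hypothetical curator rule: where a pathway may or may not be predicted."""
    pathway: str
    include_clades: FrozenSet[str] = frozenset()  # empty means no restriction
    exclude_clades: FrozenSet[str] = frozenset()


def pathway_allowed(rule: PathwayRule, lineage: Iterable[str]) -> bool:
    """Return True if the species lineage satisfies the rule."""
    taxa = set(lineage)
    if taxa & rule.exclude_clades:
        return False
    if rule.include_clades and not (taxa & rule.include_clades):
        return False
    return True


def filter_pathways(
    predicted: Iterable[str], rules: Iterable[PathwayRule], lineage: Iterable[str]
) -> List[str]:
    """Keep predicted pathways that have no rule or whose rule the lineage passes."""
    lineage = list(lineage)
    by_pathway = {rule.pathway: rule for rule in rules}
    kept = []
    for pathway in predicted:
        rule = by_pathway.get(pathway)
        if rule is None or pathway_allowed(rule, lineage):
            kept.append(pathway)
    return kept


if __name__ == "__main__":
    # Pathway IDs and rules below are placeholders, not curated PMN rules.
    rules = [
        PathwayRule("PWY-A", include_clades=frozenset({"Poaceae"})),
        PathwayRule("PWY-B", exclude_clades=frozenset({"Poaceae"})),
    ]
    grass_lineage = ["Viridiplantae", "Poaceae"]
    print(filter_pathways(["PWY-A", "PWY-B", "PWY-C"], rules, grass_lineage))
```

Pathways without a rule pass through unchanged in this sketch; whether unruled pathways should be kept or flagged for review is a design choice the report does not describe.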
Several new website features have been implemented. Pathways that involve transport between cells or cellular compartments now show the membrane and compartments on the pathway display. Virtual PlantCyc is a new feature that allows users to select up to 10 PMN databases (genomes); the predicted enzymes for those genomes are pulled in on the fly and displayed on the PlantCyc pathway diagrams. Virtual PlantCyc also displays a stack of boxes next to each reaction of the pathway, one box per genome, to indicate the presence (colored) or absence (white) of enzyme annotations from that genome.
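The presence/absence boxes can be thought of as a small matrix with one row per reaction and one column per selected genome. The sketch below shows that idea only. It is not the Virtual PlantCyc implementation, and the data layout, function name, and identifiers are assumptions.

```python
from typing import Dict, List

MAX_DATABASES = 10  # selection limit stated in the report

# Assumed layout: database name -> reaction ID -> list of predicted enzyme IDs
Annotations = Dict[str, Dict[str, List[str]]]


def presence_matrix(
    pathway_reactions: List[str], annotations: Annotations, selected: List[str]
) -> Dict[str, List[bool]]:
    """For each reaction, flag which selected databases have at least one enzyme."""
    if len(selected) > MAX_DATABASES:
        raise ValueError(f"select at most {MAX_DATABASES} databases")
    return {
        rxn: [bool(annotations.get(db, {}).get(rxn)) for db in selected]
        for rxn in pathway_reactions
    }


if __name__ == "__main__":
    # Placeholder database names, reaction IDs, and enzyme IDs.
    annotations = {
        "db1": {"RXN-1": ["E1"]},
        "db2": {"RXN-1": [], "RXN-2": ["E2", "E3"]},
    }
    matrix = presence_matrix(["RXN-1", "RXN-2"], annotations, ["db1", "db2"])
    for rxn, flags in matrix.items():
        # True would render as a colored box, False as a white box.
        print(rxn, flags)
```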
Several new web applications have also been developed and published. The Co-Expression Viewer displays co-expression data for all the genes in a given PMN pathway, drawing data from the ATTED-II plant co-expression database. It supports all nine of the ATTED-II plant species and is accessible from PMN pathway views. Another new site shows the status of genome function annotation for several important model organisms and crops (https://genomeannotation.rheelab.org). A third new site presents plant metabolic clusters for 8 plant species computed using our PlantClusterFinder software (https://metabolicclusterviewer.dpb.carnegiescience.edu).
The project has conducted substantial outreach to the general public. We designed a program to introduce middle-school students to plant biology and implemented it with Thomas R. Pollicita Middle School in Daly City, CA, a school whose student body is 98% Black, Indigenous, and People of Color. More than 800 students across 6 classrooms learned about plant biology and the scientific process through hands-on experiments dissecting and regrowing vegetables. We have also partnered with the Canopy blog (https://canopy.org/blog) to publish 29 tree stories: posts about specific tree species, their history, and their importance to both humans and the environment. The blog has had 75,000 readers over the last 12 months, and the Canopy TreEnews eblast is sent to 4,559 subscribers. The tree stories authored by Rhee Lab trainees have received 41,588 views since the partnership began in 2018.
Last Modified: 10/07/2022
Modified by: Seung Rhee