NSF Award Search: Award # 1513629

Award Abstract # 1513629

III: AF: Medium: Collaborative Research: Scalable and Highly Accurate Methods for Metagenomics

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF ILLINOIS
Initial Amendment Date:	August 12, 2015
Latest Amendment Date:	August 27, 2019
Award Number:	1513629
Award Instrument:	Continuing Grant
Program Manager:	Sylvia Spengler, sspengle@nsf.gov, (703)292-7347; IIS Division of Information & Intelligent Systems; CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2015
End Date:	August 31, 2020 (Estimated)
Total Intended Award Amount:	$626,711.00
Total Awarded Amount to Date:	$626,711.00
Funds Obligated to Date:	FY 2015 = $108,674.00; FY 2016 = $518,037.00
History of Investigator:	Tandy Warnow (Principal Investigator), warnow@illinois.edu; William Gropp (Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Illinois at Urbana-Champaign, 506 S WRIGHT ST, URBANA, IL, US 61801-3620, (217)333-2187
Sponsor Congressional District:	13
Primary Place of Performance:	University of Illinois at Urbana-Champaign IL US 61820-7473
Primary Place of Performance Congressional District:	13
Unique Entity Identifier (UEI):	Y8CWNJRCNN91
Parent UEI:	V2PHZ2CSCH63
NSF Program(s):	Info Integration & Informatics
Primary Program Source:	01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7364, 7924, 9102
Program Element Code(s):	736400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Metagenomic studies of microbial communities can generate millions to billions of sequencing reads. The assignment of accurate taxonomic labels to these sequences is a critical component in many analyses, but is complicated by the fact that the majority of the organisms found in environmental or host-associated communities cannot be easily cultured in a laboratory. Even among the organisms that can be cultured, relatively few have been sequenced, even partially. Thus, many commonly encountered organisms are largely absent from existing databases of known genomes and genes. Providing taxonomic labels to metagenomic sequences therefore requires extrapolating the knowledge contained in sequence databases to previously unseen DNA strings. Simple similarity-based approaches (e.g., picking the best database hit as the best guess at the taxonomic label) have been shown to be insufficiently accurate, leading to the development of more sophisticated methods. Further developments are necessary to handle the characteristics of emerging sequencing technologies, such as high error rates with large numbers of insertions and deletions. To date, metagenomic taxon identification methods have been evaluated with respect to their ability to estimate the distribution of bacterial taxa (species, genera, families, etc.) within a metagenomic sample. Yet different scientific and clinical settings may require specific types of analyses, and this one type of evaluation may not be the most appropriate for all settings. For example, in a clinical setting the most important question may be whether a specific pathogen is present, while in a scientific setting the most interesting question may be whether an observed read comes from a previously unseen species. New evaluation strategies must be developed that target the specific needs of each application domain.
All the methods developed in the project will be made into open-source software that is freely available to the scientific public. Researchers will provide training activities each year with funds available to students and postdocs from around the country, and an outreach program to minority serving institutions and women's colleges. A summer REU program will also be provided at the University of Maryland, College Park.

First, the team will develop a new framework for integrating the formal definition of biological use-cases with evaluation datasets and metrics in order to ensure the software being developed adequately addresses the needs of the end-users. Second, they will develop new approaches for marker-based taxon identification and abundance profiling that can leverage multiple sources of information (e.g., multiple markers) as well as handle the high error rates of third-generation sequencing technologies. These approaches will build upon experience developing TIPP, a taxonomic profiling package recently published by the team that outperforms the leading metagenomic taxonomic profiling software, in particular for novel sequences or for longer, high-error sequences. Finally, they plan to develop high-performance computing implementations of these methods in order to enable rapid analysis of samples. Speed of analysis is particularly important in clinical settings where medical treatments may depend on the rate at which the method can return an analysis. Speed is also important in non-medical applications where faster analyses enable researchers to perform deeper or broader analyses of microbial communities.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 15)

B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson "Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola)" Systematic Biology , 2017 doi: 10.1093/sysbio/syx027

Christensen, S., Molloy, E.K., Vachaspati, P., Yamanuru, A., and Warnow, T. "Non-parametric correction of estimated gene trees using TRACTION." Algorithms Mol Biol , v.15 , 2020 10.1186/s13015-019-0161-8

Erin K Molloy and Tandy Warnow "FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models," Bioinformatics , v.36 , 2020 , p.i57 10.1093/bioinformatics/btaa444

Katherine R Amato, Jon G Sanders, Se Jin Song, Michael Nute, Jessica L Metcalf, Luke R Thompson, James T Morton, Amnon Amir, Valerie J McKenzie, Gregory Humphrey, Grant Gogul, James Gaffney, Andrea L Baden, Gillian AO Britton, Frank P Cuozzo, Anthony Di F "Evolutionary trends in host physiology outweigh dietary niche in structuring primate gut microbiomes" The ISME Journal , v.13 , 2019 , p.576 10.1038/s41396-018-0175-0

Legried B., Molloy E.K., Warnow T., Roch S. "Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss" Research in Computational Molecular Biology (RECOMB 2020), Lecture Notes in Computer Science , v.12074 , 2020 , p.120 10.1007/978-3-030-45257-5_8

Molloy, E.K. and Warnow, T. "Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge." Algorithms Mol Biol , v.14 , 2019 , p.14 10.1186/s13015-019-0151-x

N. Nguyen, M. Nute, S. Mirarab, and T. Warnow "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics , v.17 , 2016 , p.765 doi:10.1186/s12864-016-3097-0

N. Nguyen, T. Warnow, M. Pop, and B. White "A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity." npj Biofilms and Microbiomes , v.2 , 2016 doi:10.1038/npjbiofilms.2016.4.

N. Shah, M. Nute, T. Warnow, and M. Pop "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows" Bioinformatics , v.35 , 2019 , p.1613 10.1093/bioinformatics/bty833

T. Hansen, S. Mollerup, N. Nguyen, L. Vinner, N. White, M. Coghlan, D. Alquezar-Planas, T. Joshi, R. Jensen, H. Fridholm, K. Kjaransdottir, T. Mourier, T. Warnow, G. Belsham, T. Gilbert, L. Orlando, M. Bunce, E. Willerslev, L. Nielsen, and A. Hansen "High diversity of picornaviruses in rats from different continents revealed by deep sequencing" Emerging Microbes & Infections , v.5 , 2016 doi:10.1038/emi.2016.90

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Metagenomic studies of microbial communities can generate millions to billions of sequencing reads. The assignment of accurate taxonomic labels to these sequences is a critical component in many analyses, but is complicated by the fact that the majority of the organisms found in environmental or host-associated communities cannot be easily cultured in a laboratory. Even among the organisms that can be cultured, relatively few have been sequenced, even partially. Thus, many commonly encountered organisms are largely absent from existing databases of known genomes and genes. Providing taxonomic labels to metagenomic sequences therefore requires extrapolating the knowledge contained in sequence databases to previously unseen DNA strings. Simple similarity-based approaches (e.g., picking the best database hit as the best guess at the taxonomic label) have been shown to be insufficiently accurate, leading to the development of more sophisticated methods.

The main goal of this project was to improve taxonomic identification of reads generated in these metagenomic studies and enable highly accurate estimates of abundance profiles. The main contribution of the effort is the TIPP2 software, which includes a collection of reference alignments and taxonomies for 40 marker genes (i.e., genes that are believed to be single-copy and universal). TIPP2 is based on a machine learning model called an "Ensemble of Hidden Markov Models" and improves accuracy compared to other methods, including recently developed ones. TIPP2 is available as open-source software.

The other main contribution of this project is HIPPI, a method for classifying protein sequences into protein families, which also uses the Ensemble of Hidden Markov Models approach. HIPPI improves on the use of a single HMM and also on BLAST, with respect to both precision and recall.

The Broader Impacts of this project include annual software schools teaching software and bioinformatics methods relevant to metagenomics and open source software. Two PhD students were trained on the grant and graduated with their doctorates.

Last Modified: 11/02/2020
Modified by: Tandy Warnow
