
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | December 1, 2014 |
Latest Amendment Date: | December 1, 2014 |
Award Number: | 1507998 |
Award Instrument: | Standard Grant |
Program Manager: |
Mitra Basu, mbasu@nsf.gov, (703)292-8649, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | December 1, 2014 |
End Date: | November 30, 2016 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $200,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
110 INNER CAMPUS DR AUSTIN TX US 78712-1139 (512)471-6424 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
101 East 27th St., Suite 5.300 Austin TX US 78712-1532 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Information Technology Researc, Algorithmic Foundations |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Broader Significance:
Ebola is an RNA virus characterized by a high mutation rate. The genetic diversity of RNA viruses enables them to adapt to varying conditions over the course of infection and keep proliferating. Estimating viral genetic diversity is essential for understanding the origin and mutation patterns of these viruses, and for developing effective drug treatments. A viral population is characterized by the sequences and frequencies of the genomes that comprise it. High-throughput DNA sequencing technologies enable fast and affordable analysis of viral genomes. However, the sequencing errors and limited read lengths of high-throughput platforms make estimating viral genetic diversity challenging.
Technical Description:
The aim of this research is to develop novel algorithms for determining and analyzing the genetic diversity of RNA viruses and to apply them to the analysis of the Ebola virus. The investigator specifically aims to: (1) Develop a correlation clustering framework and computationally efficient methods for estimating viral genetic diversity from high-throughput sequencing data. In this line of research, reconstruction of viral genomes is cast as the max-k-cut problem and efficiently solved using semidefinite programming. (2) Design graphical models and belief propagation algorithms for inferring the viral genomes in a diverse population analyzed with high-throughput sequencing technologies. The focus of this research thrust is on scalable message-passing methods for estimating viral genetic diversity. (3) Relying on the developed methods, analyze the diversity of the Ebola virus using publicly available high-throughput sequencing data. The results of the outlined work are expected to have an immediate impact on the understanding of the Ebola outbreak mechanisms and virus mutation patterns.
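To make the max-k-cut formulation in aim (1) concrete, the following is a minimal sketch and not the investigator's method. It assumes reads are already aligned to a common coordinate system, with '-' marking uncovered positions, and it replaces the semidefinite programming solver named above with a simple greedy local search. The edge weights (mismatches minus matches at polymorphic sites) and all function names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def variable_columns(reads):
    """Indices of columns where the covering reads disagree (candidate variant sites)."""
    cols = []
    for idx, col in enumerate(zip(*reads)):
        if len({b for b in col if b != '-'}) > 1:
            cols.append(idx)
    return cols


def edge_weight(a, b, cols):
    """Assumed weight: mismatches minus matches over variant sites both reads cover."""
    w = 0
    for c in cols:
        if a[c] != '-' and b[c] != '-':
            w += 1 if a[c] != b[c] else -1
    return w


def max_k_cut_local_search(reads, k, max_passes=100):
    """Greedy stand-in for an SDP solver: move a read whenever that increases the cut."""
    n = len(reads)
    cols = variable_columns(reads)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = edge_weight(reads[i], reads[j], cols)
    labels = [i % k for i in range(n)]  # arbitrary starting partition
    for _ in range(max_passes):
        moved = False
        for i in range(n):
            # Weight that would stay inside each cluster if read i joined it.
            # Maximising the cut is the same as minimising this quantity.
            inside = [0] * k
            for j in range(n):
                if j != i:
                    inside[labels[j]] += w[i][j]
            best = min(range(k), key=lambda c: inside[c])
            if inside[best] < inside[labels[i]]:
                labels[i] = best
                moved = True
        if not moved:
            break
    return labels


def consensus_and_frequencies(reads, labels, k):
    """Per-cluster majority-vote sequence and the fraction of reads assigned to it."""
    result = []
    for c in range(k):
        members = [r for r, lab in zip(reads, labels) if lab == c]
        if not members:
            continue
        seq = ''
        for col in zip(*members):
            bases = [b for b in col if b != '-']
            seq += Counter(bases).most_common(1)[0][0] if bases else '-'
        result.append((seq, len(members) / len(reads)))
    return result


# Toy data: error-free reads drawn from two hypothetical strains.
reads = [
    "ACGTAC--", "--GTACGT", "ACGTACGT",
    "ACCTAC--", "--CTACTT", "ACCTACTT",
]
labels = max_k_cut_local_search(reads, k=2)
strains = consensus_and_frequencies(reads, labels, k=2)
print(labels)
print(strains)
```

The local search only guarantees a local optimum of the cut. A semidefinite relaxation, as described in the abstract, is the kind of tool one would use to obtain solutions with approximation guarantees on larger and noisier data.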
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
RNA viruses (e.g., Ebola, HIV, SARS) replicate with high mutation rates, creating populations of closely related viral strains. These heterogeneous populations, referred to as viral quasispecies, rapidly adapt to environmental changes, thereby reducing the efficacy of antiviral drugs and vaccines. Therefore, understanding the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided an invaluable opportunity for uncovering the structure of quasispecies populations (i.e., reconstructing the viral sequences and discovering their relative frequencies). However, accurate reconstruction of viral quasispecies remains difficult due to limited read lengths and the presence of sequencing errors.
As part of this project, we first developed a novel correlation clustering framework for viral quasispecies reconstruction that relies on semidefinite programming to accurately estimate the sub-species and their frequencies in a viral population. Extensive comparisons with existing methods on both synthetic and real data demonstrated the efficacy and superior performance of the developed scheme.
Viral quasispecies reconstruction is arguably most challenging when the strains in a population are highly similar, i.e., when the constituent sequences are characterized by low mutual genetic distances. The difficulty is further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. Motivated by this observation, we next developed a novel viral quasispecies reconstruction method that combines ideas from hierarchical clustering and Bayesian inference to enable highly accurate reconstruction of closely related viral strains, i.e., of quasispecies characterized by low diversity.
Accurate methods for the reconstruction of viral quasispecies are expected to play a significant role in the development of effective therapeutic treatments. Therefore, the methods developed in this project have the potential to make a broad impact that goes well beyond the academic world.
Last Modified: 02/28/2017
Modified by: Haris Vikalo