Award Abstract # 1507998
RAPID: Methods for Estimating Genetic Diversity of the Ebola Virus

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: UNIVERSITY OF TEXAS AT AUSTIN
Initial Amendment Date: December 1, 2014
Latest Amendment Date: December 1, 2014
Award Number: 1507998
Award Instrument: Standard Grant
Program Manager: Mitra Basu
mbasu@nsf.gov
 (703)292-8649
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: December 1, 2014
End Date: November 30, 2016 (Estimated)
Total Intended Award Amount: $200,000.00
Total Awarded Amount to Date: $200,000.00
Funds Obligated to Date: FY 2015 = $200,000.00
History of Investigator:
  • Haris Vikalo (Principal Investigator)
    hvikalo@ece.utexas.edu
Recipient Sponsored Research Office: University of Texas at Austin
110 INNER CAMPUS DR
AUSTIN
TX  US  78712-1139
(512)471-6424
Sponsor Congressional District: 25
Primary Place of Performance: University of Texas at Austin
101 East 27th St., Suite 5.300
Austin
TX  US  78712-1532
Primary Place of Performance Congressional District: 25
Unique Entity Identifier (UEI): V6AFQPN18437
Parent UEI:
NSF Program(s): Information Technology Researc, Algorithmic Foundations
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1640, 7931, 7914, 001Z
Program Element Code(s): 164000, 779600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Broader Significance:

Ebola is an RNA virus characterized by a high mutation rate. The genetic diversity of RNA viruses enables them to adapt to varying conditions over the course of infection and keep proliferating. Estimating viral genetic diversity is essential for understanding the origin and mutation patterns of these viruses, and for developing effective drug treatments. A viral population is characterized by the sequences and frequencies of the genomes that comprise it. High-throughput DNA sequencing technologies enable fast and affordable analysis of viral genomes. However, the errors and limited read lengths of high-throughput sequencing platforms make estimating viral genetic diversity challenging.

Technical Description:

The aim of this research is to develop novel algorithms for determining and analyzing the genetic diversity of RNA viruses and to apply them to the analysis of the Ebola virus. The investigator specifically aims to: (1) Develop a correlation clustering framework and computationally efficient methods for estimating viral genetic diversity from high-throughput sequencing data. In this line of research, reconstruction of viral genomes is cast as the max-k-cut problem and efficiently solved using semidefinite programming. (2) Design graphical models and belief propagation algorithms for inferring the viral genomes in a diverse population analyzed with high-throughput sequencing technologies. The focus of this research thrust is on scalable message-passing methods for estimating viral genetic diversity. (3) Relying on the developed methods, analyze the diversity of the Ebola virus using publicly available high-throughput sequencing data. The results of the outlined work are expected to have an immediate impact on the understanding of the Ebola outbreak mechanisms and virus mutation patterns.
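
To make the max-k-cut formulation in aim (1) concrete, the following is a minimal sketch and not the project's algorithm. It assumes each read has already been reduced to its bases at candidate variant sites and is stored as a hypothetical (start, sequence) pair. Pairs of reads receive a weight equal to mismatches minus matches over their overlap, and reads are split into k groups so that the total weight between groups is large. A simple greedy local search stands in for the semidefinite programming relaxation named in the abstract; the function names are illustrative only.

```python
def overlap_weight(a, b):
    """Mismatches minus matches over the overlap of two (start, sequence) reads.

    A positive weight suggests different strains, a negative one the same strain.
    """
    (sa, xa), (sb, xb) = a, b
    lo, hi = max(sa, sb), min(sa + len(xa), sb + len(xb))
    w = 0
    for p in range(lo, hi):
        w += 1 if xa[p - sa] != xb[p - sb] else -1
    return w


def local_search_max_k_cut(reads, k, max_sweeps=100):
    """Greedy heuristic for max-k-cut (an assumption, used here in place of SDP).

    Maximizing the weight cut between groups equals minimizing the weight kept
    inside groups, so each read moves to the group where its within-group
    weight is strictly lowest, until no move improves the partition.
    """
    n = len(reads)
    w = [[overlap_weight(reads[i], reads[j]) for j in range(n)] for i in range(n)]
    label = [i % k for i in range(n)]  # deterministic starting partition
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            cost = [0] * k  # within-group weight if read i joined each group
            for j in range(n):
                if j != i:
                    cost[label[j]] += w[i][j]
            best = min(range(k), key=lambda g: (cost[g], g))
            if cost[best] < cost[label[i]]:
                label[i] = best
                changed = True
        if not changed:
            break
    return label


# Toy data: three error-free reads from each of two hypothetical strains.
reads = [(0, "ACGTACGT")] * 3 + [(0, "TGCATGCA")] * 3
labels = local_search_max_k_cut(reads, k=2)
```

On this toy input the search separates the two groups of reads. Local search can stall in poor local optima on realistic data, which is one reason a relaxation with stronger guarantees is attractive.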

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

RNA viruses (e.g., Ebola, HIV, SARS) replicate with high mutation rates, creating populations of closely related viral strains. These heterogeneous virus populations, referred to as viral quasispecies, rapidly adapt to environmental changes, thus adversely affecting the efficiency of antiviral drugs and vaccines. Therefore, understanding the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided an invaluable opportunity for uncovering the structure of quasispecies populations (i.e., reconstructing the viral sequences and discovering their relative frequencies). However, accurate reconstruction of viral quasispecies remains difficult due to limited read lengths and the presence of sequencing errors.

As part of this project, we first developed a novel correlation clustering framework for viral quasispecies reconstruction that relies on semidefinite programming to accurately estimate the sub-species and their frequencies in a viral population. Extensive comparisons with existing methods on both synthetic and real data demonstrated the efficacy and superior performance of the developed scheme.
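
As an illustration of what "estimating the sub-species and their frequencies" can mean once reads have been grouped, here is a minimal sketch under stated assumptions; it is not the scheme developed in the project. It assumes reads are hypothetical (start, sequence) pairs, that a cluster label per read is already available from some clustering step, and that coverage is roughly uniform so the fraction of reads in a cluster approximates the strain frequency. Each strain is taken as the per-position majority vote of its cluster.

```python
from collections import Counter


def consensus_and_frequencies(reads, labels, length):
    """Return {cluster: (consensus sequence, estimated frequency)}.

    reads  : list of (start, sequence) pairs (hypothetical representation)
    labels : cluster index for each read
    length : length of the genome region being reconstructed
    """
    clusters = {}
    for read, g in zip(reads, labels):
        clusters.setdefault(g, []).append(read)

    result = {}
    for g, members in clusters.items():
        # Count the bases observed at every position within this cluster.
        columns = [Counter() for _ in range(length)]
        for start, seq in members:
            for offset, base in enumerate(seq):
                columns[start + offset][base] += 1
        # Majority vote per position; "N" marks positions with no coverage.
        consensus = "".join(
            c.most_common(1)[0][0] if c else "N" for c in columns
        )
        result[g] = (consensus, len(members) / len(reads))
    return result


# Toy data: cluster 0 contains one read with a sequencing error at the last base.
reads = [(0, "ACGT"), (0, "ACGA"), (2, "GT"), (0, "TTTT")]
labels = [0, 0, 0, 1]
estimate = consensus_and_frequencies(reads, labels, length=4)
```

The majority vote absorbs the isolated error in the second read, which shows why correct grouping of reads matters more than any single base call.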

Viral quasispecies reconstruction is particularly challenging when the strains in a population are highly similar, i.e., when the constituent sequences are characterized by low mutual genetic distances. The difficulty is further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. Motivated by this observation, we next developed a novel viral quasispecies reconstruction method that combines ideas from hierarchical clustering and Bayesian inference to enable highly accurate reconstruction of closely related viral strains, i.e., of quasispecies characterized by low diversity.
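
To show why similar and rare strains are hard to tell apart, the following toy calculation applies Bayes' rule to a single read. It is a sketch under assumptions and not the method developed in the project: the strains and their frequencies are taken as known, sequencing errors are independent substitutions at a hypothetical rate error_rate, and an error is equally likely to produce any of the three other bases.

```python
def strain_posteriors(read, start, strains, freqs, error_rate):
    """Posterior probability that a read originated from each candidate strain.

    The prior is the strain frequency; the likelihood multiplies
    (1 - error_rate) for every matching base and error_rate / 3 for
    every mismatching base.
    """
    scores = []
    for strain, freq in zip(strains, freqs):
        likelihood = 1.0
        for offset, base in enumerate(read):
            if strain[start + offset] == base:
                likelihood *= 1.0 - error_rate
            else:
                likelihood *= error_rate / 3.0
        scores.append(freq * likelihood)
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical strains that differ at one site; the second one is rare.
strains = ["ACGTACGTAC", "ACGTTCGTAC"]
freqs = [0.95, 0.05]
read = "ACGTTCGTAC"  # matches the rare strain exactly

clean = strain_posteriors(read, 0, strains, freqs, error_rate=0.01)
noisy = strain_posteriors(read, 0, strains, freqs, error_rate=0.10)
```

With the lower error rate the read is attributed to the rare strain with high probability, while the higher error rate shifts noticeable probability toward the dominant strain. A single distinguishing site is weak evidence, so low diversity combined with low frequency leaves little margin for sequencing errors.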

Accurate methods for the reconstruction of viral quasispecies are expected to play a significant role in the development of effective therapeutic treatments. Therefore, the methods developed in this project have the potential to make a broad impact that goes well beyond the academic world.


Last Modified: 02/28/2017
Modified by: Haris Vikalo

Please report errors in award information by writing to: awardsearch@nsf.gov.
