Award Abstract # 1618427
AF: Small: Reconstructing Mixtures of DNA Sequences from High-Throughput Sequencing Data

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: UNIVERSITY OF TEXAS AT AUSTIN
Initial Amendment Date: June 3, 2016
Latest Amendment Date: June 3, 2016
Award Number: 1618427
Award Instrument: Standard Grant
Program Manager: Mitra Basu
mbasu@nsf.gov
 (703)292-8649
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2016
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $400,000.00
Total Awarded Amount to Date: $400,000.00
Funds Obligated to Date: FY 2016 = $400,000.00
History of Investigator:
  • Haris Vikalo (Principal Investigator)
    hvikalo@ece.utexas.edu
Recipient Sponsored Research Office: University of Texas at Austin
110 INNER CAMPUS DR
AUSTIN
TX  US  78712-1139
(512)471-6424
Sponsor Congressional District: 25
Primary Place of Performance: University of Texas at Austin
101 East 27th St., Suite 5.300
Austin
TX  US  78712-1532
Primary Place of Performance Congressional District: 25
Unique Entity Identifier (UEI): V6AFQPN18437
Parent UEI:
NSF Program(s): Algorithmic Foundations, Comm & Information Foundations
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7931
Program Element Code(s): 779600, 779700
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In a wide range of problems in genomics and personalized medicine, it is critically important to accurately reconstruct the distinct nucleotide sequences present in a heterogeneous mixture. Examples include viral quasispecies reconstruction, mapping the repertoire of immune cells, and haplotyping. While recent advances in high-throughput DNA sequencing have enabled affordable studies of genetic variation, technological limitations of sequencing platforms, together with potentially non-uniform frequencies of the sequences in a mixture, make the analysis of heterogeneous mixtures a challenging and computationally intensive task.

This research aims to develop fast and accurate algorithms for reconstruction and frequency estimation of sequences in diverse mixtures that will assist practitioners in pharmacogenomics and personalized medicine. The project includes a focus on fostering diversity, disseminating new interdisciplinary research, and enriching the educational experience of participating engineering students.

The project has three specific goals. The first is the design and analysis of matrix factorization methods for accurate and efficient reconstruction of the distinct sequences present in a heterogeneous mixture and for estimation of their frequencies. In the proposed framework, sequence reconstruction is formulated as the problem of factorizing structured, partially observed low-rank matrices and is solved efficiently by exploiting salient features of high-throughput sequencing data. The second is the development of a methodology for the analysis of dynamically evolving mixtures of sequences that are temporally sampled by means of high-throughput sequencing. This research thrust will lead to novel sequence reconstruction methods capable of tracking the evolution of sequences over time and accurately identifying their frequencies. The third and final goal is the development of algorithmic solutions to specific sequence diversity analysis problems that fully exploit structural features of the respective applications and thus enable superior performance.
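
As an illustration only, the factorization view of sequence reconstruction can be written as a constrained low-rank fitting problem. The notation below is an assumption introduced for this sketch and is not taken from the award text: R is an m-by-n read matrix with alleles encoded as +1 or -1, Omega is the set of observed (read, site) entries, P_Omega zeroes out every entry outside Omega, k is the number of distinct sequences, each row of U assigns one read to one sequence, and the columns of V are the sequences themselves. The last expression is one simple frequency estimate (the fraction of reads assigned to sequence l).

```latex
\min_{U,\,V}\ \bigl\| P_{\Omega}\!\left( R - U V^{\top} \right) \bigr\|_F^2
\quad \text{s.t.} \quad
U \in \{0,1\}^{m \times k},\quad U \mathbf{1}_k = \mathbf{1}_m,\quad
V \in \{-1,+1\}^{n \times k},
\qquad
\hat{f}_{\ell} = \frac{1}{m} \sum_{i=1}^{m} U_{i\ell}, \quad \ell = 1, \dots, k .
```

Under this encoding the product U V^T has rank at most k, which is what makes the problem one of factorizing a structured, partially observed low-rank matrix.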

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 12)
Abolfazl Hashemi, Banghua Zhu and Haris Vikalo "Sparse tensor decomposition for haplotype assembly of diploids and polyploids" The 4th International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC) , 2017
A. Hashemi, B. Zhu and H. Vikalo "Sparse tensor decomposition for haplotype assembly of diploids and polyploids" BMC Genomics , v.19 , 2018 , p.191 https://doi.org/10.1186/s12864-018-4551-y
A. Sankararaman, H. Vikalo and F. Baccelli "ComHapDet: A spatial community detection algorithm for haplotype assembly" BMC Genomics , v.21 , 2020 https://doi.org/10.1186/s12864-020-06935-x
S. Ahn and H. Vikalo "aBayesQR: A Bayesian method for reconstruction of viral populations characterized by low diversity" Journal of Computational Biology , v.25 , 2018 , p.637-648 https://doi.org/10.1089/cmb.2017.0249
S. Ahn, Z. Ke and H. Vikalo "Viral quasispecies reconstruction via tensor factorization" 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , 2018 , p.1236-1239
S. Ahn, Z. Ke and H. Vikalo "Viral quasispecies reconstruction via tensor factorization with successive removal" Bioinformatics (The 26th Annual Conference on Intelligent Systems for Molecular Biology Special Issue) , v.34 , 2018 , p.i23-i31 https://doi.org/10.1093/bioinformatics/bty291
S. Barik and H. Vikalo "Matrix completion and performance guarantees for single individual haplotyping" IEEE Transactions on Signal Processing , v.67 , 2019 , p.4782 - 47 https://doi.org/10.1109/TSP.2019.2931207
S. Barik, S. Das, and H. Vikalo "QSdpR: Viral quasispecies reconstruction via correlation clustering" Genomics , 2017 https://doi.org/10.1016/j.ygeno.2017.12.007
S. Barik, S. Das, and H. Vikalo "Viral quasispecies reconstruction via correlation clustering" Genomics , v.110 , 2018 , p.375-381 https://doi.org/10.1016/j.ygeno.2017.12.007
S. Consul and H. Vikalo "Reconstructing intra-tumor heterogeneity via convex optimization and branch-and-bound search" The 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB) , 2019 , p.524-529 https://doi.org/10.1145/3307339.3342178
Soyeon Ahn and Haris Vikalo "aBayesQR: A Bayesian method for reconstruction of viral populations characterized by low diversity" The 21st Annual International Conference on Research in Computational Molecular Biology (RECOMB), Hong Kong. , 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A number of problems in genomics and personalized medicine require accurate reconstruction of the distinct nucleotide sequences present in a heterogeneous mixture. Examples include haplotype assembly and reconstruction of viral populations from high-throughput sequencing data. However, technological limitations of sequencing platforms as well as the inherent diversity of the considered genomic populations render the analysis of heterogeneous genomic mixtures challenging.

Motivated by concepts from machine learning and statistics, this project introduced fast and accurate novel methods for haplotype assembly and reconstruction of viral communities, and provided rigorous performance guarantees for the developed techniques. Highlights of the project results include: (a) A matrix factorization formulation of single individual haplotyping, and an analysis of the convergence properties of efficient alternating minimization algorithms for solving it. The established sample complexity requirements have important implications for experimental design; namely, they suggest the sequencing coverage needed for successful completion of the haplotype assembly task. (b) A tensor factorization framework for viral quasispecies reconstruction. Such reconstruction is particularly challenging in settings where a viral population is characterized by highly uneven frequencies of its components. The proposed framework enables highly accurate discovery of rare strains in a population, and comes with performance guarantees. (c) The introduction of ideas from community detection on graphs to the problems of haplotype assembly and viral quasispecies reconstruction. This line of research has enabled rapid haplotype assembly from massive sets of sequencing data.
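
To make highlight (a) concrete, the following is a minimal sketch of alternating minimization for diploid haplotype assembly viewed as a rank-one factorization. It is not the project's released software, and everything in it is an assumption made for illustration: the function names `assemble_diploid` and `mec_score`, the +1/-1 allele encoding with 0 marking an unobserved site, the restriction to heterozygous sites (so the second haplotype is the negation of the first), and the naive initialization from the read with the highest coverage. Algorithms with convergence guarantees typically rely on more careful initialization; this simple version can stall in a poor local optimum when coverage is low.

```python
import numpy as np


def sign_no_zero(x):
    """Map each entry to +1 or -1; ties (zeros) are sent to +1."""
    return np.where(x >= 0, 1, -1)


def assemble_diploid(R, n_iter=20):
    """Alternating minimization for a rank-one model R ~ u h^T on observed entries.

    R      : (reads x sites) array with entries in {-1, 0, +1}; 0 = unobserved.
    returns: h (haplotype, +/-1 per site) and u (+1 if a read is assigned to h,
             -1 if it is assigned to the complementary haplotype -h).
    """
    R = np.asarray(R)
    # Naive initialization (assumption): start from the read covering most sites.
    h = sign_no_zero(R[np.argmax(np.abs(R).sum(axis=1))])
    for _ in range(n_iter):
        # Fix h, update read memberships. Zeros in R drop out of the sums,
        # so only observed entries contribute.
        u = sign_no_zero(R @ h)
        # Fix u, update the haplotype by a signed majority vote per site.
        h_new = sign_no_zero(R.T @ u)
        if np.array_equal(h_new, h):
            break
        h = h_new
    u = sign_no_zero(R @ h)  # memberships consistent with the final h
    return h, u


def mec_score(R, h, u):
    """Count observed entries that disagree with the fitted rank-one model."""
    R = np.asarray(R)
    observed = R != 0
    return int(np.sum(observed & (R != np.outer(u, h))))


if __name__ == "__main__":
    # Toy data: six reads over six sites; the first read has one sequencing error.
    R = [
        [1, -1, 1, 1, -1, 1],
        [-1, 1, -1, -1, 0, 0],
        [0, 0, 1, 1, -1, -1],
        [0, 0, 0, -1, 1, 1],
        [1, -1, 1, 0, 0, 0],
        [0, 0, 0, -1, 1, 1],
    ]
    h, u = assemble_diploid(R)
    print("haplotype:", h.tolist())
    print("read assignment:", u.tolist())
    print("mismatches:", mec_score(R, h, u))
```

The recovered haplotype is only defined up to a global sign flip, since swapping h with -h and negating u describes the same pair of haplotypes. The mismatch count gives a rough sense of how coverage and error rate interact: sites covered by few reads are decided by a small, easily tied vote.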

The project outcomes have been disseminated to communities of interest via publications in journals, presentations at conferences and workshops, and the public release of software that resulted from the research effort. The project further served as an educational springboard for training students interested in working at the juncture of computational biology, statistical signal processing, and machine learning.


Last Modified: 01/05/2021
Modified by: Haris Vikalo

Please report errors in award information by writing to: awardsearch@nsf.gov.
