
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | June 3, 2016 |
Latest Amendment Date: | June 3, 2016 |
Award Number: | 1618427 |
Award Instrument: | Standard Grant |
Program Manager: |
Mitra Basu, mbasu@nsf.gov, (703)292-8649, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2016 |
End Date: | August 31, 2020 (Estimated) |
Total Intended Award Amount: | $400,000.00 |
Total Awarded Amount to Date: | $400,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
110 INNER CAMPUS DR AUSTIN TX US 78712-1139 (512)471-6424 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
101 East 27th St., Suite 5.300 Austin TX US 78712-1532 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): |
Algorithmic Foundations, Comm & Information Foundations |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In a wide range of problems in genomics and personalized medicine, it is critically important to accurately reconstruct the distinct nucleotide sequences present in a heterogeneous mixture. Examples include viral quasispecies reconstruction, mapping the repertoire of immune cells, and haplotyping. While recent advances in high-throughput DNA sequencing have enabled affordable studies of genetic variation, technological limitations of sequencing platforms, together with potentially non-uniform frequencies of the sequences in a mixture, render the analysis of heterogeneous mixtures a challenging and computationally intensive task.
This research aims to develop fast and accurate algorithms for reconstruction and frequency estimation of sequences in diverse mixtures that will assist practitioners in pharmacogenomics and personalized medicine. The project includes a focus on fostering diversity, disseminating new interdisciplinary research, and enriching the educational experience of participating engineering students.
Specific goals of the project include: First, the design and analysis of matrix factorization methods for accurate and efficient reconstruction of distinct sequences present in a heterogeneous mixture and for estimation of their frequencies. In the proposed framework, sequence reconstruction is formulated as the problem of factorizing structured, partially observed low-rank matrices and efficiently solved by exploiting salient features of high-throughput sequencing data. Second, the development of a methodology for the analysis of dynamically evolving mixtures of sequences temporally sampled by means of high-throughput sequencing. This research thrust will lead to novel sequence reconstruction methods capable of tracking the evolution of sequences over time and accurate identification of their frequencies. The third and final goal is the development of algorithmic solutions to specific sequence diversity analysis problems that fully exploit structural features of the respective applications and thus enable superior performance.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
A number of problems in genomics and personalized medicine require accurate reconstruction of distinct nucleotide sequences present in a heterogeneous mixture. Examples include haplotype assembly and reconstruction of viral populations from high-throughput sequencing data. However, technological limitations of sequencing platforms, as well as the inherent diversity of the genomic populations considered, render the analysis of heterogeneous genomic mixtures challenging.
Motivated by concepts from machine learning and statistics, this project introduced novel, fast, and accurate methods for haplotype assembly and reconstruction of viral communities, and provided rigorous performance guarantees for the developed techniques. Highlights of the project results include: (a) A matrix factorization formulation of single individual haplotyping, and an analysis of the convergence properties of efficient alternating minimization algorithms for solving it. The established sample complexity requirements have important implications for experimental design; namely, they suggest the sequencing coverage needed to complete the haplotype assembly task successfully. (b) A tensor factorization framework for viral quasispecies reconstruction. Such reconstruction is particularly challenging in settings where a viral population is characterized by highly uneven frequencies of its components. The proposed framework enables highly accurate discovery of rare strains in a population, and comes with performance guarantees. (c) The introduction of ideas from community detection on graphs to the problems of haplotype assembly and viral quasispecies reconstruction. This line of research has enabled rapid haplotype assembly from massive sets of sequencing data.
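To make item (a) concrete, the following is a minimal illustrative sketch of alternating minimization on a rank-one model of diploid haplotyping. It is not the project's released software. It assumes a hypothetical encoding in which each read is a row with entries +1 or -1 at covered variant sites and 0 at uncovered sites, so that the read matrix is approximately u v^T, where v is one haplotype (the other is its negation) and u assigns each read to a haplotype. The function names, the all-ones initialization, and the tie-breaking rule are assumptions made for illustration; a practical method would likely need a more careful initialization.

```python
def sign(x):
    # Break ties toward +1 so every entry stays in {-1, +1}.
    return 1 if x >= 0 else -1


def count_mismatches(reads, u, v):
    # Number of observed entries that disagree with the rank-one model u[i] * v[j].
    return sum(
        1
        for i, row in enumerate(reads)
        for j, r in enumerate(row)
        if r != 0 and r != u[i] * v[j]
    )


def alternating_minimization(reads, max_iters=50):
    """Fit reads[i][j] ~ u[i] * v[j] over the observed (nonzero) entries.

    reads: list of equal-length rows with entries in {+1, -1, 0}; 0 = not covered.
    Returns (u, v) with entries in {-1, +1}.
    """
    n_reads, n_sites = len(reads), len(reads[0])
    u = [1] * n_reads
    v = [1] * n_sites  # naive initialization (an assumption of this sketch)
    for _ in range(max_iters):
        # Fix v, update each read's haplotype membership u[i].
        new_u = [sign(sum(r * vj for r, vj in zip(row, v))) for row in reads]
        # Fix u, update each haplotype site v[j].
        new_v = [
            sign(sum(reads[i][j] * new_u[i] for i in range(n_reads)))
            for j in range(n_sites)
        ]
        if new_u == u and new_v == v:
            break  # no change: a fixed point has been reached
        u, v = new_u, new_v
    return u, v


# Toy data: haplotype [1, 1, -1, 1, 1]; the third read carries one
# sequencing error and the fourth read does not cover the first site.
reads = [
    [1, 1, -1, 1, 1],
    [-1, -1, 1, -1, -1],
    [1, 1, 1, 1, 1],
    [0, -1, 1, -1, -1],
]
u, v = alternating_minimization(reads)
print("haplotype:", v)
print("read assignment:", u)
print("mismatches:", count_mismatches(reads, u, v))
```

Each half-step minimizes the squared error over one factor with the other held fixed, so the mismatch count never increases from one iteration to the next; the iteration can nevertheless stall at a poor local optimum when coverage is sparse.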
The project outcomes have been disseminated to communities of interest through journal publications, presentations at conferences and workshops, and the public release of software that resulted from the research effort. The project further served as an educational springboard for training students interested in working at the juncture of computational biology, statistical signal processing, and machine learning.
Last Modified: 01/05/2021
Modified by: Haris Vikalo