
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | June 22, 2020 |
Latest Amendment Date: | June 22, 2020 |
Award Number: | 2007597 |
Award Instrument: | Standard Grant |
Program Manager: |
Phillip Regalia
pregalia@nsf.gov (703)292-2981 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2020 |
End Date: | June 30, 2024 (Estimated) |
Total Intended Award Amount: | $484,140.00 |
Total Awarded Amount to Date: | $484,140.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
506 S. Wright Street Urbana IL US 61801-3620 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Comm & Information Foundations |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The demand for data storage continues to increase at a rapid pace, posing significant challenges to current data centers and spurring significant interest in the development of new storage technologies. DNA, the molecule that carries the genetic information of all living matter, has become a promising medium for long-term archival data storage due to its longevity and very high information density. This new approach to data storage presents unique challenges. Unlike typical hard drives, where data bits are stored in a well-ordered linear fashion, storing data on DNA requires the synthesis of a large number of DNA molecules that are then mixed out of order in a liquid solution. This makes the process of reliably reading the data after storage significantly more expensive and computationally complex. The goal of this project is to understand the fundamental limitations and capabilities of DNA as a storage medium. In particular, this research seeks to characterize basic tradeoffs between cost, information density, reading and writing speeds, and reliability, aiming to develop new coding strategies that can unlock the full potential of this innovative approach to data storage.
The project will investigate the fundamental limits of DNA storage systems by focusing on three main objectives. The first objective is to develop an information theory framework to formally analyze these systems. DNA storage systems will be modeled via the abstraction of a shuffling channel, which captures the fact that, in DNA-based storage, many blocks of data are shuffled out of order. The capacity of these channels will be characterized under different noise models and properties of optimal coding schemes will be studied. A particular question of interest is how to design capacity-optimal indexing strategies that allow the proper reordering of the data. Since the cost of synthesizing long DNA strands is the main obstacle to practical DNA storage, the second research objective will deal with systems that store data on many very short DNA strands, each of which is too short to encode any meaningful information. For that reason, new strategies to encode information in the concentration of different DNA molecules in the solution will be proposed, and their fundamental capabilities established. The third research objective will focus on the computational challenges associated with the joint processing of a large set of DNA sequences. In particular, basic tradeoffs between storage capacity and computational requirements will be established, and recent algorithmic advances in large-scale sequence alignment will be leveraged towards the development of computationally efficient decoding algorithms.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project investigated the fundamental limitations and capabilities of DNA-based data storage, a potentially transformative emerging technology. The project focused on several information-theoretic questions regarding achievable storage data rates and basic tradeoffs between data rates, DNA synthesis costs, and computational efficiency.
The first major goal of the project was to develop new information-theoretic models that capture key distinguishing aspects of DNA data storage. We began this effort by proposing the noisy shuffling channel illustrated in Figure 1. This channel model captures the fact that the data is written onto short DNA molecules that are stored out-of-order in a solution (shuffling) and read via random sampling in a noisy fashion (noisy channel). We characterized the capacity (maximum number of bits that can be reliably stored) for this channel under several choices of sampling distribution (e.g., Bernoulli and Poisson) and noise (e.g., substitutions and erasures). We developed optimal coding strategies and showed that, in many scenarios, a simple index-based coding scheme is capacity optimal.
Another distinguishing aspect of DNA (and, more generally, macromolecular) data storage is that molecules may be subject to random breaks when stored over long periods of time. To study how this impacts the storage capacity, we proposed the torn-paper channel (TPC) model. This channel breaks the input message into fragments of random sizes, which are then read out of order. In addition, to study the impact of using short-read shotgun sequencing technologies in DNA storage systems, we proposed the shotgun sequencing channel (SSC) model. The output of this channel are randomly sampled fixed-length substrings from the input message. The TPC and SSC are illustrated in Figure 2. We developed new achievability and converse techniques to characterize the capacity of these two channel models exactly under several distributional assumptions. The capacity expressions describe how the maximum storage rate varies as a function of fragment lengths and number of observed fragments. We developed a unified coding strategy to deal with the out-of-order nature of these two channels, based on the idea of spreading synchronization bits throughout the codewords. This coding strategy is computationally efficient and achieves rates close to capacity in several regimes.
The second major goal of the project was to understand how short the stored DNA molecules can be while still allowing for a significant amount of data to be stored. In particular, we explored the very-short-molecule regime, where the asymptotic capacity is technically zero, but it is still possible to store non-negligible amounts of information in the concentration of the different molecules. As illustrated in Figure 3, in this setting one seeks to store information by choosing a “histogram” of DNA molecules, which is corrupted by the synthesis and sequencing processes. We developed a connection between this storage problem and the Poisson channel, previously studied in the context of optical communications. This connection allowed us to develop a coding scheme for this regime with optimality guarantees. While this coding strategy is nearly optimal for the very-short molecules, the actual storage rates achieved turn out to be too small for large-scale data storage and are more likely useful in applications where few data bits need to be stored (e.g., metadata storage or fingerprinting).
The third major focus of the project was related to the computational challenges that arise when having to process the output of a large-scale DNA data storage system. Practically, one needs to (1) cluster the output sequences to identify clusters of reads corresponding to the same input molecule and (2) align the sequences in each cluster to each other to produce a consensus (error-corrected) sequence. For the clustering task, we developed a new algorithm called BanditPAM (and its optimized version, BanditPAM++). BanditPAM utilizes a multi-armed bandit algorithm to adaptively and efficiently identify groups of similar reads (or other objects), which allows clustering to be performed while avoiding a full pairwise comparison of all reads. We also developed a multi-armed bandit approach to speed up the alignment task and released a tool called J-bandit that can efficiently perform multiple-sequence alignment on a large number of sequences.
The theoretical framework that was developed in this project led to the publication of a long-form monograph on Information-Theoretic Foundations of DNA Data Storage. This monograph and, more generally, the outcomes from this project have been influential in turning this topic into an active research area. The PI also organized and participated in several invited sessions and workshops on information theory and coding for macromolecular storage. The computational tools that were developed have applications that are broader than DNA data storage and are useful for the processing of more general genomic data. Finally, results from this project have been included as modern applications in a graduate-level Information Theory course and have contributed to the development of a new advanced course on Fundamental Limits in Data Science.
Last Modified: 08/09/2024
Modified by: Ilan Shomorony
Please report errors in award information by writing to: awardsearch@nsf.gov.