NSF Award Search: Award # 2007597 - CIF: Small: Fundamental Limits of DNA-Based Storage

Award Abstract # 2007597

CIF: Small: Fundamental Limits of DNA-Based Storage

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	UNIVERSITY OF ILLINOIS
Initial Amendment Date:	June 22, 2020
Latest Amendment Date:	June 22, 2020
Award Number:	2007597
Award Instrument:	Standard Grant
Program Manager:	Phillip Regalia pregalia@nsf.gov (703)292-2981 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering
Start Date:	July 1, 2020
End Date:	June 30, 2024 (Estimated)
Total Intended Award Amount:	$484,140.00
Total Awarded Amount to Date:	$484,140.00
Funds Obligated to Date:	FY 2020 = $484,140.00
History of Investigator:	Ilan Shomorony (Principal Investigator) ilans@illinois.edu
Recipient Sponsored Research Office:	University of Illinois at Urbana-Champaign 506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187
Sponsor Congressional District:	13
Primary Place of Performance:	Board of Trustees of the University of Illinois 506 S. Wright Street Urbana IL US 61801-3620
Primary Place of Performance Congressional District:	13
Unique Entity Identifier (UEI):	Y8CWNJRCNN91
Parent UEI:	V2PHZ2CSCH63
NSF Program(s):	Comm & Information Foundations
Primary Program Source:	01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7923, 7935
Program Element Code(s):	779700
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The demand for data storage continues to increase at a rapid pace, posing significant challenges to current data centers and spurring significant interest in the development of new storage technologies. DNA, the molecule that carries the genetic information of all living matter, has become a promising medium for long-term archival data storage due to its longevity and very high information density. This new approach to data storage presents unique challenges. Unlike typical hard drives, where data bits are stored in a well-ordered linear fashion, storing data on DNA requires the synthesis of a large number of DNA molecules that are then mixed out of order in a liquid solution. This makes the process of reliably reading the data after storage significantly more expensive and computationally complex. The goal of this project is to understand the fundamental limitations and capabilities of DNA as a storage medium. In particular, this research seeks to characterize basic tradeoffs between cost, information density, reading and writing speeds, and reliability, aiming to develop new coding strategies that can unlock the full potential of this innovative approach to data storage.

The project will investigate the fundamental limits of DNA storage systems by focusing on three main objectives. The first objective is to develop an information theory framework to formally analyze these systems. DNA storage systems will be modeled via the abstraction of a shuffling channel, which captures the fact that, in DNA-based storage, many blocks of data are shuffled out of order. The capacity of these channels will be characterized under different noise models and properties of optimal coding schemes will be studied. A particular question of interest is how to design capacity-optimal indexing strategies that allow the proper reordering of the data. Since the cost of synthesizing long DNA strands is the main obstacle to practical DNA storage, the second research objective will deal with systems that store data on many very short DNA strands, each of which is too short to encode any meaningful information. For that reason, new strategies to encode information in the concentration of different DNA molecules in the solution will be proposed, and their fundamental capabilities established. The third research objective will focus on the computational challenges associated with the joint processing of a large set of DNA sequences. In particular, basic tradeoffs between storage capacity and computational requirements will be established, and recent algorithmic advances in large-scale sequence alignment will be leveraged towards the development of computationally efficient decoding algorithms.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 22)

Show All

Baharav, Tavor and Kamath, Govinda and Shomorony, Ilan "Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment" Advances in Neural Information Processing Systems 33 (NeurIPS 2020) , 2020 Citation Details

Gerzon, Yuval and Shomorony, Ilan and Weinberger, Nir "Capacity of Frequency-based Channels: Encoding Information in Molecular Concentrations" Proceedings of IEEE International Symposium on Information Theory , 2024 Citation Details

Greenberg, Grant and Ravi, Aditya Narayan and Shomorony, Ilan "LexicHash: sequence similarity estimation via lexicographic comparison of hashes" Bioinformatics , v.39 , 2023 https://doi.org/10.1093/bioinformatics/btad652 Citation Details

Greenberg, Grant and Shomorony, Ilan "The Metagenomic Binning Problem: Clustering Markov Sequences" IEEE Transactions on Molecular, Biological, and Multi-Scale Communications , v.10 , 2024 https://doi.org/10.1109/TMBMC.2023.3336254 Citation Details

Levick, Kel and Heckel, Reinhard and Shomorony, Ilan "Achieving the Capacity of a DNA Storage Channel with Linear Coding Schemes" 56th Annual Conference on Information Sciences and Systems (CISS) , 2022 https://doi.org/10.1109/CISS53076.2022.9751151 Citation Details

Levick, Kel and Shomorony, Ilan "Fundamental Limits of Multiple Sequence Reconstruction from Substrings" IEEE International Symposium on Information Theory , 2023 Citation Details

Mazooji, Kayvon and Shomorony, Ilan "Fast multiple sequence alignment via multi-armed bandits" Bioinformatics , v.40 , 2024 https://doi.org/10.1093/bioinformatics/btae225 Citation Details

Mazooji, Kayvon and Shomorony, Ilan "Substring Density Estimation from Traces" IEEE International Symposium on Information Theory , 2023 Citation Details

Mazooji, Kayvon and Shomorony, Ilan "Substring Density Estimation From Traces" IEEE Transactions on Information Theory , v.70 , 2024 https://doi.org/10.1109/TIT.2024.3418377 Citation Details

Nassirpour, Sajjad and Shomorony, Ilan and Vahid, Alireza "DNA Merge-Sort: A Family of Nested Varshamov-Tenengolts Reassembly Codes for Out-of-Order Media" IEEE Transactions on Communications , v.72 , 2024 https://doi.org/10.1109/TCOMM.2023.3335409 Citation Details

Ravi, Aditya Narayan and Vahid, Alireza and Shomorony, Ilan "An Information Theory for Out-of-Order Media With Applications in DNA Data Storage" IEEE Transactions on Molecular, Biological, and Multi-Scale Communications , v.10 , 2024 https://doi.org/10.1109/TMBMC.2024.3403759 Citation Details

(Showing: 1 - 10 of 22)

Show All

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project investigated the fundamental limitations and capabilities of DNA-based data storage, a potentially transformative emerging technology. The project focused on several information-theoretic questions regarding achievable storage data rates and basic tradeoffs between data rates, DNA synthesis costs, and computational efficiency.

The first major goal of the project was to develop new information-theoretic models that capture key distinguishing aspects of DNA data storage. We began this effort by proposing the noisy shuffling channel illustrated in Figure 1. This channel model captures the fact that the data is written onto short DNA molecules that are stored out-of-order in a solution (shuffling) and read via random sampling in a noisy fashion (noisy channel). We characterized the capacity (maximum number of bits that can be reliably stored) for this channel under several choices of sampling distribution (e.g., Bernoulli and Poisson) and noise (e.g., substitutions and erasures). We developed optimal coding strategies and showed that, in many scenarios, a simple index-based coding scheme is capacity optimal.

Another distinguishing aspect of DNA (and, more generally, macromolecular) data storage is that molecules may be subject to random breaks when stored over long periods of time. To study how this impacts the storage capacity, we proposed the torn-paper channel (TPC) model. This channel breaks the input message into fragments of random sizes, which are then read out of order. In addition, to study the impact of using short-read shotgun sequencing technologies in DNA storage systems, we proposed the shotgun sequencing channel (SSC) model. The output of this channel are randomly sampled fixed-length substrings from the input message. The TPC and SSC are illustrated in Figure 2. We developed new achievability and converse techniques to characterize the capacity of these two channel models exactly under several distributional assumptions. The capacity expressions describe how the maximum storage rate varies as a function of fragment lengths and number of observed fragments. We developed a unified coding strategy to deal with the out-of-order nature of these two channels, based on the idea of spreading synchronization bits throughout the codewords. This coding strategy is computationally efficient and achieves rates close to capacity in several regimes.

The second major goal of the project was to understand how short the stored DNA molecules can be while still allowing for a significant amount of data to be stored. In particular, we explored the very-short-molecule regime, where the asymptotic capacity is technically zero, but it is still possible to store non-negligible amounts of information in the concentration of the different molecules. As illustrated in Figure 3, in this setting one seeks to store information by choosing a “histogram” of DNA molecules, which is corrupted by the synthesis and sequencing processes. We developed a connection between this storage problem and the Poisson channel, previously studied in the context of optical communications. This connection allowed us to develop a coding scheme for this regime with optimality guarantees. While this coding strategy is nearly optimal for the very-short molecules, the actual storage rates achieved turn out to be too small for large-scale data storage and are more likely useful in applications where few data bits need to be stored (e.g., metadata storage or fingerprinting).

The third major focus of the project was related to the computational challenges that arise when having to process the output of a large-scale DNA data storage system. Practically, one needs to (1) cluster the output sequences to identify clusters of reads corresponding to the same input molecule and (2) align the sequences in each cluster to each other to produce a consensus (error-corrected) sequence. For the clustering task, we developed a new algorithm called BanditPAM (and its optimized version, BanditPAM++). BanditPAM utilizes a multi-armed bandit algorithm to adaptively and efficiently identify groups of similar reads (or other objects), which allows clustering to be performed while avoiding a full pairwise comparison of all reads. We also developed a multi-armed bandit approach to speed up the alignment task and released a tool called J-bandit that can efficiently perform multiple-sequence alignment on a large number of sequences.

The theoretical framework that was developed in this project led to the publication of a long-form monograph on Information-Theoretic Foundations of DNA Data Storage. This monograph and, more generally, the outcomes from this project have been influential in turning this topic into an active research area. The PI also organized and participated in several invited sessions and workshops on information theory and coding for macromolecular storage. The computational tools that were developed have applications that are broader than DNA data storage and are useful for the processing of more general genomic data. Finally, results from this project have been included as modern applications in a graduate-level Information Theory course and have contributed to the development of a new advanced course on Fundamental Limits in Data Science.

Last Modified: 08/09/2024
Modified by: Ilan Shomorony

Images (1 of 3)

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error