Award Abstract # 2112643
AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: THE UNIVERSITY OF CENTRAL FLORIDA BOARD OF TRUSTEES
Initial Amendment Date: August 24, 2021
Latest Amendment Date: March 1, 2022
Award Number: 2112643
Award Instrument: Standard Grant
Program Manager: Peter Brass
pbrass@nsf.gov
 (703)292-2182
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2021
End Date: April 30, 2023 (Estimated)
Total Intended Award Amount: $449,763.00
Total Awarded Amount to Date: $465,763.00
Funds Obligated to Date: FY 2021 = $67,729.00
FY 2022 = $16,000.00
History of Investigator:
  • Sharma Thankachan (Principal Investigator)
    svalliy@ncsu.edu
Recipient Sponsored Research Office: The University of Central Florida Board of Trustees
4000 CENTRAL FLORIDA BLVD
ORLANDO
FL  US  32816-8005
(407)823-0387
Sponsor Congressional District: 10
Primary Place of Performance: The University of Central Florida Board of Trustees
4000 CNTRL FLORIDA BLVD
Orlando
FL  US  32816-8005
Primary Place of Performance
Congressional District:
10
Unique Entity Identifier (UEI): RD7MXJV7DKT9
Parent UEI:
NSF Program(s): Algorithmic Foundations
Primary Program Source: 01002223DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7926, 9251
Program Element Code(s): 779600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Being able to store, search, and analyze massive data sets efficiently is one of today's pressing challenges. To that end, this project will study a collection of problems under text compression and indexing with tremendous current relevance, owing to a specific characteristic prevalent in many modern text data sets, called high repetitiveness. This characteristic makes the data highly compressible using some specialized schemes. However, the theoretical understanding of those schemes is still in a nascent stage. This project will address some of the fundamental open problems on the effectiveness of several schemes that are popular in practice. This project will also introduce new ideas for indexing such data in a highly space-efficient manner and quickly support various (application-specific) queries. The techniques developed in this project will apply to a broad class of algorithmic problems and applications; therefore, they will have a lasting impact on related fields. The main results stem from this research will be disseminated through major conferences, journals, and workshop tutorials. The participation of undergraduates and minorities, including women, will be ensured.

Suffix trees and suffix arrays are two fundamental data structures for text indexing with many applications in bioinformatics. However, they are notorious for their space complexity. The FM-index index encodes suffix arrays in space close to the entropy-based lower bound using the Burrows-Wheeler Transformation (BWT). But, the entropy model of compression is less effective in capturing repetitiveness. Therefore, modern applications urge even more space frugal encodings that exploit repetitiveness. Although the community has made some recent progress in exact pattern matching, solutions to many other (more complex) matching problems are still open. To that end, this project will attempt to design repetition-aware indexes for pattern matching under mismatches, edits, and wildcards, along with efficient construction algorithms. This project also aims to develop a unified framework to compress several advanced suffix-tree variants (quasi-suffix trees) that support even more complex matching paradigms such as parameterized and order-isomorphic matching. Novel techniques and concepts that go beyond the existing BWT-based methods will be sought. Additionally, the project will attempt a deeper study on various aspects of BWT-runs and the relative Lempel-Ziv compression scheme from the hardness side.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Abedin, Paniz and Hooshmand, Sahar and Ganguly, Arnab and Thankachan, Sharma V. "The Heaviest Induced Ancestors Problem: Better Data Structures and Applications" Algorithmica , v.84 , 2022 https://doi.org/10.1007/s00453-022-00955-7 Citation Details
Gibney, Daniel and Thankachan, Sharma V. "On the Complexity of Recognizing Wheeler Graphs" Algorithmica , v.84 , 2022 https://doi.org/10.1007/s00453-021-00917-5 Citation Details
Gibney, Daniel and Thankachan, Sharma V. and Aluru, Srinivas "The Complexity of Approximate Pattern Matching on de Bruijn Graphs" Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022 , 2022 https://doi.org/10.1007/978-3-031-04749-7_16 Citation Details
Jain, Chirag and Gibney, Daniel and Thankachan, Sharma V. "Co-linear Chaining with Overlaps and Gap Costs" Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022 , 2022 https://doi.org/10.1007/978-3-031-04749-7_15 Citation Details
Thankachan, Sharma V. "Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)" 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 , 2022 https://doi.org/10.4230/LIPIcs.CPM.2022.3 Citation Details

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page