
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | August 24, 2021 |
Latest Amendment Date: | March 1, 2022 |
Award Number: | 2112643 |
Award Instrument: | Standard Grant |
Program Manager: |
Peter Brass
pbrass@nsf.gov (703)292-2182 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2021 |
End Date: | April 30, 2023 (Estimated) |
Total Intended Award Amount: | $449,763.00 |
Total Awarded Amount to Date: | $465,763.00 |
Funds Obligated to Date: |
FY 2022 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
4000 CENTRAL FLORIDA BLVD ORLANDO FL US 32816-8005 (407)823-0387 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
4000 CNTRL FLORIDA BLVD Orlando FL US 32816-8005 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Algorithmic Foundations |
Primary Program Source: |
01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Being able to store, search, and analyze massive data sets efficiently is one of today's pressing challenges. To that end, this project will study a collection of problems under text compression and indexing with tremendous current relevance, owing to a specific characteristic prevalent in many modern text data sets, called high repetitiveness. This characteristic makes the data highly compressible using some specialized schemes. However, the theoretical understanding of those schemes is still in a nascent stage. This project will address some of the fundamental open problems on the effectiveness of several schemes that are popular in practice. This project will also introduce new ideas for indexing such data in a highly space-efficient manner and quickly support various (application-specific) queries. The techniques developed in this project will apply to a broad class of algorithmic problems and applications; therefore, they will have a lasting impact on related fields. The main results stem from this research will be disseminated through major conferences, journals, and workshop tutorials. The participation of undergraduates and minorities, including women, will be ensured.
Suffix trees and suffix arrays are two fundamental data structures for text indexing with many applications in bioinformatics. However, they are notorious for their space complexity. The FM-index index encodes suffix arrays in space close to the entropy-based lower bound using the Burrows-Wheeler Transformation (BWT). But, the entropy model of compression is less effective in capturing repetitiveness. Therefore, modern applications urge even more space frugal encodings that exploit repetitiveness. Although the community has made some recent progress in exact pattern matching, solutions to many other (more complex) matching problems are still open. To that end, this project will attempt to design repetition-aware indexes for pattern matching under mismatches, edits, and wildcards, along with efficient construction algorithms. This project also aims to develop a unified framework to compress several advanced suffix-tree variants (quasi-suffix trees) that support even more complex matching paradigms such as parameterized and order-isomorphic matching. Novel techniques and concepts that go beyond the existing BWT-based methods will be sought. Additionally, the project will attempt a deeper study on various aspects of BWT-runs and the relative Lempel-Ziv compression scheme from the hardness side.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
Please report errors in award information by writing to: awardsearch@nsf.gov.