
Award Abstract # 1855441
CRII: SHF: HPC Solutions to Big NGS Data Compression

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: FLORIDA INTERNATIONAL UNIVERSITY
Initial Amendment Date: October 22, 2018
Latest Amendment Date: October 22, 2018
Award Number: 1855441
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
CCF, Division of Computing and Communication Foundations
CSE, Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: January 31, 2020 (Estimated)
Total Intended Award Amount: $7,708.00
Total Awarded Amount to Date: $7,708.00
Funds Obligated to Date: FY 2016 = $7,708.00
History of Investigator:
  • Fahad Saeed (Principal Investigator)
    FSAEED@FIU.EDU
Recipient Sponsored Research Office: Florida International University
11200 SW 8TH ST
MIAMI
FL  US  33199-2516
(305)348-2494
Sponsor Congressional District: 26
Primary Place of Performance: Florida International University
FL  US  33199-0001
Primary Place of Performance Congressional District: 26
Unique Entity Identifier (UEI): Q3KCVK5S9CP1
Parent UEI: Q3KCVK5S9CP1
NSF Program(s): Software & Hardware Foundations
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7942, 9251
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Sequencing of genomes for numerous species, including humans, has become increasingly affordable due to next-generation high-throughput genome sequencing (NGS) technologies. This opens up new possibilities for the diagnosis and treatment of genetic diseases and is increasingly effective for conducting systems-biology studies. However, many computational challenges must be addressed before these technologies find their way into everyday health care. One such daunting challenge is the volume of sequencing data, which can reach the petabyte level for comprehensive systems-biology studies.
Genomic data compression is needed to reduce storage requirements, to increase transfer speed, and to reduce the I/O bandwidth cost of transmitting such data. However, existing genomic compression solutions perform poorly on Big Genomic Data, and existing state-of-the-art tools require the user to decompress the data before it can be used for further analysis. This project focuses on compressing genomic information and on developing a framework that allows analysis of the data in its compressed form. The project develops HPC solutions for fast compression of Big NGS Data sets using ubiquitous architectures such as GPUs and multicore processors. HPC techniques are used to compute essential functions such as alignment and mapping directly on the compressed form of the NGS data. More efficient encoding of NGS data for better network utilization is also being investigated.
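The project's published algorithms are not reproduced on this page. As a minimal sketch of the kind of fixed-width baseline encoding that NGS compressors commonly build on, the Python snippet below packs each DNA base into 2 bits, a 4x reduction over 8-bit ASCII characters. All names here are illustrative assumptions, and a real tool would also need to handle ambiguous bases (such as N) and quality scores.

    # Minimal sketch (not the project's algorithm): baseline 2-bit packing of
    # DNA reads, the kind of fixed-width encoding many NGS compressors build on.
    # This toy version rejects bases outside {A, C, G, T}.

    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BITS_TO_BASE = "ACGT"

    def pack_read(read):
        """Pack a DNA read into 2 bits per base (4 bases per byte)."""
        buf = bytearray()
        acc, nbits = 0, 0
        for base in read:
            acc = (acc << 2) | BASE_TO_BITS[base]   # KeyError on non-ACGT
            nbits += 2
            if nbits == 8:
                buf.append(acc)
                acc, nbits = 0, 0
        if nbits:                                   # flush a partial final byte
            buf.append(acc << (8 - nbits))
        return bytes(buf)

    def unpack_read(packed, length):
        """Recover the original read given its base count."""
        bases = []
        for byte in packed:
            for shift in (6, 4, 2, 0):
                bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    read = "ACGTACGTGGCA"
    packed = pack_read(read)
    assert unpack_read(packed, len(read)) == read
    print(f"{len(read)} bases -> {len(packed)} bytes")  # 4x smaller than ASCII

Entropy coders and reference-based methods go well beyond this fixed 4x ratio; the sketch only fixes the representation that such methods, and the compressed-form analysis described above, can operate on.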

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Fahad Saeed. "Towards quantifying psychiatric diagnosis using machine learning algorithms and big fMRI data." BMC Big Data Analytics, v.3, 2018. https://doi.org/10.1186/s41044-018-0033-0
Mohammed Aledhari, Marianne Di Pierro, and Fahad Saeed. "A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets." 2018 IEEE International Congress on Big Data (BigData Congress), 2018. 10.1109/BigDataCongress.2018.00024
Mohammed Aledhari, Marianne Di Pierro, Mohamed Hefeida, and Fahad Saeed. "A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets." IEEE Transactions on Big Data, 2018. 10.1109/TBDATA.2018.2805687
Muaaz Gul Awan and Fahad Saeed. "MaSS-Simulator: A Highly Configurable Simulator for Generating MS/MS Datasets for Benchmarking of Proteomics Algorithms." Proteomics (Wiley), 2018. 10.1002/pmic.201800206
Muhammad Haseeb and Fahad Saeed. "Efficient Shared Peak Counting in Database Peptide Search Using Compact Data Structure for Fragment-Ion Index." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2019, pp. 275-278. 10.1109/BIBM47256.2019.8983152
Muhammad Haseeb, Fatima Afzali, and Fahad Saeed. "LBE: A Computational Load Balancing Algorithm for Speeding up Parallel Peptide Search in Mass-Spectrometry Based Proteomics." Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019, pp. 191-198. 10.1109/IPDPSW.2019.00040
Taban Eslami and Fahad Saeed. "Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data - fMRI Study." MDPI High-Throughput, v.7, 2018. 10.3390/ht7020011

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Analysis of high-throughput omics data is an essential task in experimental and computational biology. A single run of a high-throughput next-generation sequencing machine generates millions of short DNA reads. The resulting data sets are so large that conventional techniques, however well engineered, cannot keep up with the rate at which the data are produced; managing them requires fundamentally new approaches.

The outcome of this project was the design, development, and testing of high-performance computing algorithms for compression and processing of omics data. The proposed techniques allowed us to compress massive amounts of data on memory-distributed clusters, and novel data structures allowed us to analyze genomics and proteomics data in their compressed form. Further, we developed novel algorithms for lossy compression of mass-spectrometry-based proteomics data sets and demonstrated that these data sets can be processed using graphics processing units. We also demonstrated that the compressed data sets can be transmitted in much less time owing to their smaller memory footprint. We expect these fundamental contributions to have significant impact for domain systems-biology scientists: HPC algorithms will allow them to perform much more complex and accurate analyses than were previously possible, and the efficiency and portability of the proposed techniques will have lasting impact on precision and personalized medicine.
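The compact data structures referenced above (for example, the fragment-ion index in the BIBM 2019 paper) are beyond the scope of this summary. As a hedged, self-contained sketch of the general idea of querying data in compressed form, the snippet below locates a k-mer directly in a 2-bit packed read using integer bit arithmetic, without decoding back to ASCII; the encoding and function names are assumptions for illustration only, not the project's API.

    # Hedged sketch (assumed 2-bits-per-base encoding, not the project's
    # published data structure): find a k-mer in a packed read without
    # decompressing it first.

    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def find_kmer_packed(packed, read_len, kmer):
        """Return 0-based positions where `kmer` occurs in a 2-bit packed read."""
        k = len(kmer)
        target = 0
        for base in kmer:                      # pack the query the same way
            target = (target << 2) | BASE_TO_BITS[base]
        mask = (1 << (2 * k)) - 1
        total_bits = 2 * read_len
        # Treat the packed read as one big integer; drop any padding bits
        # carried by a partially filled final byte.
        word = int.from_bytes(packed, "big") >> (8 * len(packed) - total_bits)
        hits = []
        for pos in range(read_len - k + 1):
            shift = total_bits - 2 * (pos + k)  # window ending at base pos+k
            if (word >> shift) & mask == target:
                hits.append(pos)
        return hits

    # "ACGTACGT" packed at 2 bits/base: 00 01 10 11 repeated -> 0x1B, 0x1B
    packed = bytes([0b00011011, 0b00011011])
    print(find_kmer_packed(packed, 8, "GTAC"))  # [2]

Because the query is compared in its packed form, the scan touches a quarter of the bytes an ASCII search would, the same memory-footprint advantage the report cites for transmission.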

Dr. Saeed's group has made fundamental advances in processing these large data sets. One of the most interesting aspects of the research was the compression of large omics data sets and the ability to process them without first decompressing them.

The NSF CRII award has partially supported 7 PhD students, 2 MS students, and 4 undergraduate students, has enabled numerous research talks, and has resulted in more than 12 peer-reviewed publications. The software resulting from this high-performance computing research is available on the PI's lab webpage at: https://saeedlab.cs.fiu.edu/software/


Last Modified: 02/17/2020
Modified by: Fahad Saeed

