Award Abstract # 1925960
CAREER: Towards Fast and Scalable Algorithms for Big Proteogenomics Data Analytics

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: FLORIDA INTERNATIONAL UNIVERSITY
Initial Amendment Date: March 28, 2019
Latest Amendment Date: March 28, 2019
Award Number: 1925960
Award Instrument: Standard Grant
Program Manager: Juan Li
jjli@nsf.gov
 (703)292-2625
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: September 30, 2023 (Estimated)
Total Intended Award Amount: $415,950.00
Total Awarded Amount to Date: $415,950.00
Funds Obligated to Date: FY 2017 = $415,948.00
History of Investigator:
  • Fahad Saeed (Principal Investigator)
    FSAEED@FIU.EDU
Recipient Sponsored Research Office: Florida International University
11200 SW 8TH ST
MIAMI
FL  US  33199-2516
(305)348-2494
Sponsor Congressional District: 26
Primary Place of Performance: Florida International University
FL  US  33199-0001
Primary Place of Performance Congressional District: 26
Unique Entity Identifier (UEI): Q3KCVK5S9CP1
Parent UEI: Q3KCVK5S9CP1
NSF Program(s): CAREER: FACULTY EARLY CAR DEV,
Software & Hardware Foundation,
Computational Biology
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045, 7931, 7942, 9102
Program Element Code(s): 104500, 779800, 793100
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Proteogenomics studies require the combination and integration of mass spectrometry (MS) data for proteomics and next-generation sequencing (NGS) data for genomics. This integration drastically increases the size of the data sets that must be analyzed to draw biological conclusions. However, existing tools yield low accuracy and exhibit poor scalability on big proteogenomics data. This CAREER grant is expected to lay a foundation for fast algorithmic and high-performance computing solutions suitable for analyzing big proteogenomics data sets. Accurate computational algorithms suitable for peta-scale data sets will be designed, and the software implementations will run on massively parallel supercomputers and graphics processing units. The direction of this CAREER proposal is toward designing and building infrastructure useful to the broadest biological and ecological community. A comprehensive interdisciplinary education plan will be executed for K-12, undergraduate, and graduate students to help ensure that the US retains its global leadership position in STEM fields. This project thus serves the national interest, as stated by NSF's mission: to promote the progress of science and to advance the national health, prosperity, and welfare.

The goal of this CAREER grant is to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. The integration of MS and NGS data sets required for proteogenomics studies exhibits enormous data volume and velocity: NGS technologies such as ChIP-Seq can generate terabytes of DNA/RNA data, and mass spectrometers can generate millions of spectra (with thousands of peaks per spectrum). Current systems for analyzing MS data are mainly driven by heuristic practices and do not scale well. This CAREER proposal will explore a new class of reductive algorithms for MS data analysis that allow peptide deductions in sublinear time, compression algorithms that operate in sublinear space, and de novo algorithms that operate on a lossy, reduced form of the MS data. Novel low-complexity sampling and reductive algorithms that exploit the sparsity of MS data, such as non-uniform FFT-based convolution kernels, can lead to superior similarity metrics that are not prone to spurious correlations. The bottleneck in large systems-biology studies is the poor scalability of coarse-grained parallel algorithms that do not exploit MS-specific data characteristics and lead to unbalanced loads due to the non-uniform compute time required for peptide deductions. This project aims to explore the design and implementation of scalable algorithms for both NGS and MS data on multicore and GPU platforms using domain-decomposition techniques based on spectral clustering, MS-specific hybrid load balancing based on workload estimates, HPC dimensionality-reduction strategies, and novel out-of-core sketching and streaming fine-grained parallel algorithms.
These HPC solutions can enable previously impractical proteogenomics projects and allow biologists to perform computational experiments without needing expensive hardware. All of the implemented algorithms will be made available as open-source code interfaced with the Galaxy framework to ensure maximum impact in systems-biology labs. The designed techniques will then be integrated so that matching spectra to RNA-Seq data can be accomplished without a reconstructed transcriptome. The proposed tools aim to reveal new biological insights, such as novel genes, proteins, and PTMs, and are crucial steps toward understanding the genomic, proteomic, and evolutionary aspects of species across the tree of life.
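To make the FFT-based convolution idea concrete, here is a minimal sketch (our own illustration, not the project's actual kernel): two sparse peak lists are binned onto a fixed m/z grid, and their cross-correlation over every mass shift is computed in O(n log n) via the FFT, yielding a shift-tolerant similarity score.

```python
# Illustrative sketch of FFT-based cross-correlation scoring of two mass
# spectra. Function names and parameters are hypothetical, not from the
# project's software.
import numpy as np

def bin_spectrum(peaks, n_bins=1024, bin_width=1.0):
    """Bin (m/z, intensity) pairs onto a fixed grid vector."""
    vec = np.zeros(n_bins)
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec

def xcorr_score(spec_a, spec_b):
    """Max circular cross-correlation over all shifts, via FFT."""
    fa = np.fft.rfft(spec_a)
    fb = np.fft.rfft(spec_b)
    corr = np.fft.irfft(fa * np.conj(fb), n=len(spec_a))
    return corr.max()
```

Because the correlation at every shift comes from one forward and one inverse FFT, the cost is O(n log n) rather than the O(n^2) of a direct sliding comparison, which is the advantage the abstract alludes to for sparse MS data.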

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH



(Showing: 1 - 10 of 34)
Ahmed, Khandaker Mamun and Eslami, Taban and Saeed, Fahad and Amini, M. Hadi "DeepCOVIDNet: Deep Convolutional Neural Network for COVID-19 Detection from Chest Radiographic Images" IEEE International Conference on Bioinformatics and Biomedicine (BIBM) , 2021 https://doi.org/10.1109/BIBM52615.2021.9669767 Citation Details
Aledhari, Mohammed and Di Pierro, Marianne and Hefeida, Mohamed and Saeed, Fahad "A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets" IEEE Transactions on Big Data , 2019 https://doi.org/10.1109/TBDATA.2018.2805687 Citation Details
Aledhari, Mohammed and Di Pierro, Marianne and Saeed, Fahad "A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets" IEEE International Congress on Big Data (BigData Congress) , 2018 https://doi.org/10.1109/BigDataCongress.2018.00024 Citation Details
Aledhari, Mohammed and Joji, Shelby and Hefeida, Mohamed and Saeed, Fahad "Optimized CNN-based Diagnosis System to Detect the Pneumonia from Chest Radiographs" Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM) , 2019 https://doi.org/10.1109/BIBM47256.2019.8983114 Citation Details
Aledhari, Mohammed and Razzak, Rehma and Parizi, Reza M. and Saeed, Fahad "Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications" IEEE Access , v.8 , 2020 https://doi.org/10.1109/ACCESS.2020.3013541 Citation Details
Aledhari, Mohammed and Razzak, Rehma and Qolomany, Basheer and Al-Fuqaha, Ala and Saeed, Fahad "Biomedical IoT: Enabling Technologies, Architectural Elements, Challenges, and Future Directions" IEEE Access , v.10 , 2022 https://doi.org/10.1109/ACCESS.2022.3159235 Citation Details
Almuqhim, Fahad and Saeed, Fahad "ASD-SAENet: A Sparse Autoencoder, and Deep-Neural Network Model for Detecting Autism Spectrum Disorder (ASD) Using fMRI Data" Frontiers in Computational Neuroscience , v.15 , 2021 https://doi.org/10.3389/fncom.2021.654315 Citation Details
Artiles, Oswaldo and Saeed, Fahad "A Multi-Factorial Assessment of Functional Human Autistic Spectrum Brain Network Analysis" IEEE International Conference on Bioinformatics and Biomedicine (BIBM) , 2021 https://doi.org/10.1109/BIBM52615.2021.9669679 Citation Details
Artiles, Oswaldo and Saeed, Fahad "GPU-SFFT: A GPU based parallel algorithm for computing the Sparse Fast Fourier Transform (SFFT) of k-sparse signals" Proceedings of IEEE International Conference on Big Data (Big Data) , 2019 https://doi.org/10.1109/BigData47090.2019.9006579 Citation Details
Artiles, Oswaldo and Saeed, Fahad "TurboBC: A Memory Efficient and Scalable GPU Based Betweenness Centrality Algorithm in the Language of Linear Algebra" ICPP Workshops '21: 50th International Conference on Parallel Processing Workshop , 2021 https://doi.org/10.1145/3458744.3474047 Citation Details
Artiles, Oswaldo and Saeed, Fahad "TurboBFS: GPU Based Breadth-First Search (BFS) Algorithms in the Language of Linear Algebra" IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , 2021 https://doi.org/10.1109/IPDPSW52791.2021.00084 Citation Details

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of this CAREER grant was to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. This CAREER grant enabled us to explore a new class of reductive algorithms for MS data analysis that allow peptide deductions in sublinear time, compression algorithms that operate in sublinear space, and HPC algorithms that operate on a lossy, reduced form of the MS data.

The following is a summary of our activities:

1) We investigated and established that current proteogenomic tools scale poorly as database size increases, as well as with an increasing number of species (Tariq and Saeed, IEEE Access 2021).

2) To design the building blocks needed for scalable HPC methods, we developed two tools: the MaSS-Simulator (Awan & Saeed, PROTEOMICS 2018) and benchmarking data sets for proteomics (Awan & Saeed 2021). These tools allowed us to generate, calibrate, and control the parameters used for data simulation and experimentation.

3) We then built the computational blocks necessary for scalable computing. These include a template-based strategy for CPU-GPU architectures (Awan & Saeed, Computers in Biology and Medicine, 2018), BFS algorithms for CPU-GPU architectures (Artiles & Saeed, IEEE IPDPS 2021), and algorithms that run FFT-like computations on CPU-GPU architectures (Artiles & Saeed, IEEE BigData 2019). These building blocks enabled large-scale proteogenomics data analysis on a variety of homogeneous and heterogeneous architectures.

4) We also designed and developed two strategies for load balancing over large-scale databases: a compression method that allows us to compress-and-compute on MS data without any decompression (Haseeb & Saeed, IEEE BIBM 2019), and the LBE algorithm (Haseeb & Saeed, IEEE IPDPS 2019), which enables effective load balancing based on the number of computations per unit of the database. We demonstrated that using both methods massively reduces I/O on memory-distributed architectures and results in a fairly balanced load.

5) We developed a theoretical framework, verified by experimental results, showing that on modern memory-distributed architectures the communication and I/O costs far exceed the computation costs, regardless of the scoring mechanism used. This led to the development of a high-performance computing framework that uses minimal communication to process large proteogenomics data sets in a reasonable time. We also demonstrated that our proposed method achieves 10x speedups compared to existing, established parallel computing methods.
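The idea behind balancing by estimated work rather than item count can be sketched as follows (a minimal illustration, not the published LBE algorithm): database partitions carry a per-partition cost estimate, and each is assigned greedily to the currently least-loaded node, so nodes holding expensive peptides receive fewer partitions.

```python
# Hypothetical sketch of workload-estimate load balancing: greedy
# longest-processing-time assignment of cost-weighted partitions.
import heapq

def balance(partitions, n_nodes):
    """Assign (name, estimated_cost) partitions to n_nodes nodes.

    Returns a dict mapping node id -> list of partition names.
    """
    # Min-heap of (current load, node id); the least-loaded node is on top.
    heap = [(0.0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    # Place the most expensive partitions first for a tighter balance.
    for name, cost in sorted(partitions, key=lambda p: -p[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (load + cost, node))
    return assignment
```

Sorting by descending cost before the greedy assignment is the classic longest-processing-time heuristic; it guarantees the heaviest node finishes within a small constant factor of the optimal makespan, which is why estimated per-partition cost matters more than partition count.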

The principal outcome of the project was the design and development of novel communication-avoiding parallel algorithms, a new paradigm for MS-based omics that allowed us to scale proteogenomic data analysis of MS-based experiments. Using theoretical results and published results from existing HPC algorithms for MS-based omics data, we demonstrated that communication is the dominant cost when processing MS omics data sets, a factor that has been neglected in all existing HPC tools for MS-based omics data analysis. This theoretical and empirical result opened a new direction for research and excited the parallel computing community to work on these high-impact problems. Using our HPC framework, we showed that the proposed techniques enable 10x speedups over existing parallel computing frameworks. The speedups matter most when the data is large (i.e., for proteogenomics), since a serial or sub-optimal parallel design results in significant memory paging. As a result, where existing parallel computing techniques may need weeks of computation, we are able to process terabytes of data within hours on a memory-distributed supercomputer. We expect that our frameworks will be used to investigate biological and chemical samples from complex environments and microbiomes, leading to advances in laboratory and clinical treatment and drug discovery.
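Why communication dominates at scale can be seen with a toy cost model (our own illustration, with made-up numbers, not the project's published analysis): computation time shrinks as 1/p with node count p, while all-to-all style communication grows with p, so past some node count adding nodes slows the run down unless communication is avoided.

```python
# Toy runtime model: T(p) = compute/p + communication*p.
# All constants are illustrative placeholders, not measured values.
def runtime(p, work=1e12, rate=1e9, msg_bytes=1e9, bandwidth=1e8):
    compute = work / (p * rate)              # perfectly parallel compute
    communicate = p * msg_bytes / bandwidth  # communication grows with p
    return compute + communicate
```

Under these placeholder constants the model is minimized near p = 10 nodes; beyond that, communication swamps the shrinking compute term, which is the regime communication-avoiding algorithms target.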

Dr. Saeed's group has made fundamental advances in processing these large data sets on homogeneous and heterogeneous supercomputing architectures. One of the most interesting aspects of the research was the compression of these large omics data sets and the ability to process them without decompression.

The NSF CAREER award has partially supported more than 6 PhD students, 2 MS students, and 4 undergraduate students, numerous research talks, and more than 34 peer-reviewed publications. The software resulting from this high-performance computing research is available on the PI's lab webpage at: https://saeedlab.cs.fiu.edu/software/

 


Last Modified: 10/03/2023
Modified by: Fahad Saeed

