
NSF Org: | OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: | |
Initial Amendment Date: | March 28, 2019 |
Latest Amendment Date: | March 28, 2019 |
Award Number: | 1925960 |
Award Instrument: | Standard Grant |
Program Manager: | Juan Li, jjli@nsf.gov, (703)292-2625, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2018 |
End Date: | September 30, 2023 (Estimated) |
Total Intended Award Amount: | $415,950.00 |
Total Awarded Amount to Date: | $415,950.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 11200 SW 8TH ST MIAMI FL US 33199-2516 (305)348-2494 |
Sponsor Congressional District: | |
Primary Place of Performance: | FL US 33199-0001 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | CAREER: FACULTY EARLY CAR DEV, Software & Hardware Foundation, Computational Biology |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Proteogenomics studies require the combination and integration of mass spectrometry (MS) data for proteomics and next-generation sequencing (NGS) data for genomics. This integration drastically increases the size of the data sets that must be analyzed to draw biological conclusions. However, existing tools yield low accuracy and exhibit poor scalability for big proteogenomics data. This CAREER grant is expected to lay a foundation for fast algorithmic and high-performance computing solutions suitable for analyzing big proteogenomics data sets. Accurate computational algorithms suitable for peta-scale data sets will be designed, and the software implementations will run on massively parallel supercomputers and graphics processing units. This CAREER proposal is directed towards designing and building infrastructure useful to the broadest biological and ecological community. A comprehensive interdisciplinary education plan will be executed for K-12, undergraduate, and graduate students to help ensure that the US retains its global leadership position in STEM fields. This project thus serves the national interest, as stated by NSF's mission: to promote the progress of science and to advance the national health, prosperity and welfare.
The goal of the proposed CAREER grant is to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. Integration of the MS and NGS data sets required for proteogenomics studies exhibits enormous data volume and velocity: NGS technologies such as ChIP-Seq can generate terabytes of DNA/RNA data, and mass spectrometers can generate millions of spectra (with thousands of peaks per spectrum). The current systems for analyzing MS data are mainly driven by heuristic practices and do not scale well. This CAREER proposal will explore a new class of reductive algorithms for analysis of MS data that allow peptide deductions in sublinear time, compression algorithms that operate in sublinear space, and de novo algorithms that operate on lossy, reduced forms of the MS data. Novel low-complexity sampling and reductive algorithms that exploit the sparsity of MS data, such as non-uniform FFT-based convolution kernels, can lead to superior similarity metrics that are not prone to spurious correlations. The bottleneck in large systems-biology studies is the low scalability of coarse-grained parallel algorithms that do not exploit MS-specific data characteristics and that lead to unbalanced loads due to the non-uniform compute time required for peptide deductions. This project aims to explore the design and implementation of scalable algorithms for both NGS and MS data on multicore and GPU platforms using domain-decomposition techniques based on spectral clustering, MS-specific hybrid load balancing based on workload estimates, HPC dimensionality-reduction strategies, and novel out-of-core sketching and streaming fine-grained parallel algorithms. These HPC solutions can enable previously impractical proteogenomics projects and allow biologists to perform computational experiments without needing expensive hardware. All of the implemented algorithms will be made available as open-source code interfaced with the Galaxy framework to ensure maximum impact in systems-biology labs. The designed techniques will then be integrated so that matching of spectra to RNA-Seq data can be accomplished without a reconstructed transcriptome. The proposed tools aim to reveal new biological insights such as novel genes, proteins, and post-translational modifications (PTMs), and are crucial steps towards understanding the genomic, proteomic, and evolutionary aspects of species in the tree of life.
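As an illustration of the kind of FFT-based convolution kernel mentioned above, the sketch below computes a cross-correlation similarity between two sparse spectra by binning peaks onto a uniform m/z grid and correlating the grids via NumPy's FFT. This is only a minimal, hedged example under assumed inputs: the bin width, grid size, spectra, and function names are hypothetical, and uniform binning is a simplification of the non-uniform FFT approach named in the abstract, not the project's actual implementation.

```python
import numpy as np

def bin_spectrum(mz, intensity, bin_width=1.0, max_mz=2000.0):
    """Place sparse (m/z, intensity) peaks onto a uniform grid."""
    grid = np.zeros(int(max_mz / bin_width) + 1)
    idx = np.clip((np.asarray(mz) / bin_width).astype(int), 0, len(grid) - 1)
    np.add.at(grid, idx, intensity)            # accumulate peaks sharing a bin
    return grid

def xcorr_similarity(spec_a, spec_b, **kw):
    """Cross-correlate two binned spectra via FFT; return the peak correlation."""
    a = bin_spectrum(*spec_a, **kw)
    b = bin_spectrum(*spec_b, **kw)
    a -= a.mean()                              # remove baseline to damp
    b -= b.mean()                              # spurious correlation
    n = 2 * len(a)                             # zero-pad to avoid circular wrap-around
    corr = np.fft.irfft(np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n)), n)
    return corr.max() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Example: two hypothetical spectra that share most of their peaks
s1 = ([300.2, 450.7, 980.1], [10.0, 25.0, 5.0])
s2 = ([300.2, 450.7, 700.3], [12.0, 22.0, 8.0])
print(xcorr_similarity(s1, s2))
```

The FFT turns the all-lags correlation into an O(n log n) operation on the grid, which is the reason convolution-style kernels can stay cheap even when many spectrum pairs must be compared.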
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites, whose policies may differ from those of this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The goal of the proposed CAREER grant is to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. This CAREER grant enabled us to explore a new class of reductive algorithms for analysis of MS data that allow peptide deductions in sublinear time, compression algorithms that operate in sublinear space, and HPC algorithms that operate on lossy, reduced forms of the MS data.
The following is a summary of our activities:
1) We investigated and established that current proteogenomic tools scale inadequately with increasing database size, as well as with an increasing number of species (Tariq and Saeed, IEEE Access 2021).
2) To create the building blocks needed for developing scalable HPC methods, we produced the MaSS-Simulator (Awan & Saeed, PROTEOMICS 2018) and benchmarking data sets for proteomics (Awan & Saeed 2021). Both of these resources allowed us to generate, calibrate, and control the parameters used for data simulation and experimentation.
3) We then started building the computational blocks necessary for scalable computing. These include a template-based strategy for CPU-GPU architectures (Awan & Saeed, Computers in Biology and Medicine, 2018), BFS algorithms that run on CPU-GPU architectures (Artiles & Saeed, IEEE IPDPS 2021), and algorithms that perform FFT-like computations on CPU-GPU architectures (Artiles & Saeed, IEEE BigData 2019). These building blocks enabled us to perform large-scale proteogenomics data analysis on a variety of homogeneous and heterogeneous architectures (a frontier-based BFS sketch illustrating this style of building block appears after this list).
4) Two other strategies that we designed and developed related to load balancing of large-scale databases. To this end, we developed a compression method that allows us to compress and compute on the MS data without any decompression (Haseeb & Saeed, IEEE BIBM 2019), and the LBE algorithm (Haseeb & Saeed, IEEE IPDPS 2019), which enables effective load balancing based on the number of computations per unit of database. We demonstrated that using both of these methods yields a massive reduction in I/O on memory-distributed architectures and results in a fairly balanced load (a workload-based assignment sketch appears after this list).
5) We also developed a theoretical framework, and verified with experimental results, that on modern memory-distributed architectures the communication and I/O costs far exceed the computation costs, regardless of the scoring mechanism used (a back-of-the-envelope cost sketch appears after this list). This led to the development of a high-performance computing framework that uses minimal communication cost to compute over large proteogenomics data sets in a reasonable time. We also demonstrated that our proposed method can give 10x speedups compared to existing and established parallel computing methods.
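The BFS building block mentioned in item 3 is the kind of graph traversal that maps naturally onto data-parallel hardware when expressed level by level. The sketch below is a minimal illustration of a frontier-based (level-synchronous) BFS over a CSR graph in NumPy; it is not the published CPU-GPU implementation, and the array layout, function name, and example graph are assumptions for illustration only.

```python
import numpy as np

def frontier_bfs(indptr, indices, source, n_nodes):
    """Level-synchronous BFS over a CSR graph; each level is processed as a
    whole frontier, the access pattern that maps well to GPU kernels."""
    level = np.full(n_nodes, -1, dtype=np.int64)
    level[source] = 0
    frontier = np.array([source], dtype=np.int64)
    depth = 0
    while frontier.size:
        # Gather all neighbours of the current frontier in one vectorized step
        starts, ends = indptr[frontier], indptr[frontier + 1]
        neighbours = np.concatenate([indices[s:e] for s, e in zip(starts, ends)])
        # Keep only unvisited vertices; they form the next frontier
        nxt = np.unique(neighbours[level[neighbours] < 0])
        depth += 1
        level[nxt] = depth
        frontier = nxt
    return level

# Tiny example graph with edges 0-1, 0-2, 1-3 (undirected, stored as CSR)
indptr  = np.array([0, 2, 4, 5, 6])
indices = np.array([1, 2, 0, 3, 0, 1])
print(frontier_bfs(indptr, indices, source=0, n_nodes=4))   # -> [0 1 1 2]
```

Processing an entire frontier at once, rather than one vertex at a time, is what lets the same logic be expressed as bulk GPU kernels.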
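The LBE-style load balancing in item 4 assigns database partitions to workers by an estimate of the computation each partition will require, rather than by partition count alone. The following greedy longest-processing-time sketch only illustrates that idea under assumed inputs; it is not the LBE algorithm from the cited paper, and the cost values and function name are hypothetical.

```python
import heapq

def balance_by_workload(partition_costs, n_workers):
    """Greedy LPT assignment: give each partition (heaviest first) to the
    currently least-loaded worker so per-worker compute time stays even."""
    heap = [(0.0, w) for w in range(n_workers)]     # (current_load, worker_id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    order = sorted(range(len(partition_costs)),
                   key=lambda i: partition_costs[i], reverse=True)
    for i in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + partition_costs[i], w))
    return assignment

# Hypothetical per-partition cost estimates (e.g., candidate peptides x spectra)
costs = [9.0, 7.5, 7.0, 3.0, 2.5, 1.0]
print(balance_by_workload(costs, n_workers=3))      # loads end up 10 / 10 / 10
```

Estimating cost per unit of database before assignment is what prevents a few peptide-rich partitions from stalling the whole run.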
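One way to see the communication-versus-computation argument in item 5 is with a back-of-the-envelope cost model: when P nodes each score their share of spectra against a shuffled database, the time to move each node's bytes across the network can dominate its floating-point work. The rates, flop counts, and function below are illustrative assumptions, not measurements or formulas from the project.

```python
def node_costs(data_bytes, flops_per_byte, n_nodes,
               net_bw=12.5e9,     # assumed ~100 Gb/s link, in bytes/s
               flop_rate=1e12):   # assumed 1 Tflop/s of useful compute per node
    """Rough per-node time (seconds) to communicate vs. to compute,
    assuming the data set is shuffled across n_nodes once."""
    bytes_per_node = data_bytes / n_nodes
    t_comm = bytes_per_node / net_bw                      # time to move this node's share
    t_comp = bytes_per_node * flops_per_byte / flop_rate  # time to score it
    return t_comm, t_comp

# 1 TB of spectra, a light scoring kernel (~10 flops per byte), 64 nodes
t_comm, t_comp = node_costs(1e12, flops_per_byte=10, n_nodes=64)
print(f"communication ~{t_comm:.1f}s vs computation ~{t_comp:.2f}s per shuffle")
```

Under these assumed rates, moving the data costs several times more than scoring it, so halving communication buys far more than speeding up the scoring kernel; that is the intuition behind the communication-avoiding design described below.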
The outcome of the project was the design and development of novel communication-avoiding parallel algorithms, which introduced a new paradigm for MS-based omics. They allowed us to scale proteogenomic data analysis from MS-based experiments. Using theoretical results together with published results from existing HPC algorithms for MS-based omics data, we demonstrated that communication is the dominant cost when processing MS omics data sets. This has been a neglected factor in all existing HPC tools for MS-based omics data analysis. This theoretical and empirical result opened up a new direction for research and excited the parallel computing community to work on these high-impact problems. Using our HPC framework, we were able to show that the proposed techniques enable 10x speedups compared to existing parallel computing frameworks. The speedups are more pronounced when the data is large (i.e., for proteogenomics), since a serial or sub-optimal parallel computing design results in significant paging of memory. Therefore, the end result is that, compared to existing parallel computing techniques which may need weeks of computation, we are able to process terabytes of data within hours on a memory-distributed supercomputer. We expect that our frameworks will be used for the investigation of biological and chemical samples from complex environments and microbiomes, leading to advances in laboratory and clinical treatment and drug discovery.
Dr. Saeed's group has made fundamental advances in processing these large data sets on homogeneous and heterogeneous supercomputing architectures. One of the most interesting aspects of the research was the compression of these large omics data sets and the ability to process them without the need to decompress them.
The NSF CAREER award has partially supported more than 6 PhD students, 2 MS students, and 4 undergraduate students, as well as numerous research talks, and has resulted in more than 34 peer-reviewed publications. The software resulting from this novel high-performance computing work is available on the PI's lab webpage at: https://saeedlab.cs.fiu.edu/software/
Last Modified: 10/03/2023
Modified by: Fahad Saeed
Please report errors in award information by writing to: awardsearch@nsf.gov.