Award Abstract # 1718479
SHF:Small: Reproducibility and Comprehensive Assessment of Next Generation Sequencing Bioinformatics Software

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: GEORGIA TECH RESEARCH CORP
Initial Amendment Date: July 14, 2017
Latest Amendment Date: June 22, 2022
Award Number: 1718479
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: July 15, 2017
End Date: June 30, 2023 (Estimated)
Total Intended Award Amount: $499,984.00
Total Awarded Amount to Date: $499,984.00
Funds Obligated to Date: FY 2017 = $499,984.00
History of Investigator:
  • Srinivas Aluru (Principal Investigator)
    aluru@cc.gatech.edu
Recipient Sponsored Research Office: Georgia Tech Research Corporation
926 DALNEY ST NW
ATLANTA
GA  US  30318-6395
(404)894-4819
Sponsor Congressional District: 05
Primary Place of Performance: Georgia Institute of Technology
225 North Avenue
Atlanta
GA  US  30332-0002
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): EMW9FC8J3HN4
Parent UEI: EMW9FC8J3HN4
NSF Program(s): Information Technology Researc,
Software & Hardware Foundation
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 040Z, 7923, 7942
Program Element Code(s): 164000, 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Next-generation sequencing refers to a collection of high-throughput DNA sequencing technologies that originated about a decade ago and are now the de facto equipment underpinning all modern genomics studies, owing to their cost-effectiveness, ubiquity, and versatility. This project is conducting comprehensive reproducibility and assessment experiments to characterize the state of the art in the field and to make the findings publicly visible and accessible. The project results are expected to become a valuable resource for practitioners, researchers, and the large community of users of next-generation sequencing bioinformatics. The project involves several undergraduate students and raises awareness of research integrity and reproducibility issues among young researchers.

The project is establishing benchmark datasets to evaluate bioinformatics software for multiple next-generation sequencers, multiple types of biological organisms, multiple application contexts, and multiple problem scales. The research spans assessment of software products for read error correction, read mapping to target genomes and reference databases, and assembly of genomes and transcriptomes. Reproducibility experiments are conducted to independently verify the results of important software products, based on results and datasets published in the literature. The software products are also evaluated on a range of metrics: quality of results, robustness and sensitivity to parameter values, run-time performance, memory usage, and ability to process real-world datasets. The project work will result in comprehensive recommendations for practitioners and will establish the state of the art so that future research efforts can be channeled appropriately.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Aluru, Maneesha and Shrivastava, Harsh and Chockalingam, Sriram P. and Shivakumar, Shruti and Aluru, Srinivas (Martelli, Pier Luigi, ed.) "EnGRaiN: a supervised ensemble learning method for recovery of large-scale gene regulatory networks" Bioinformatics, v.38, 2021 https://doi.org/10.1093/bioinformatics/btab829
Jammula, Nagakishore and Aluru, Srinivas "ParRefCom: Parallel Reference-based Compression of Paired-end Genomics Read Datasets" Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB), 2019 https://doi.org/10.1145/3307339.3342171
Pan, Tony C. and Chockalingam, Sriram P. and Aluru, Maneesha and Aluru, Srinivas (Cowen, Lenore, ed.) "MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework" Bioinformatics, v.39, 2023 https://doi.org/10.1093/bioinformatics/btad373
Srivastava, Ankit and Chockalingam, Sriram P. and Aluru, Maneesha and Aluru, Srinivas "Parallel construction of module networks" Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021 https://doi.org/10.1145/3458817.3476207
Srivastava, Ankit and Chockalingam, Sriram P. and Aluru, Srinivas "A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery" IEEE Transactions on Parallel and Distributed Systems, v.34, 2023 https://doi.org/10.1109/TPDS.2023.3244135
Zhang, Haowen and Jain, Chirag and Aluru, Srinivas "A comprehensive evaluation of long read error correction methods" BMC Genomics, v.21, 2020 https://doi.org/10.1186/s12864-020-07227-0

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The project supported research and training in scientific reproducibility, an area of increasing prominence. The goal of the project was to conduct reproducibility experiments and a comprehensive assessment of bioinformatics software for next-generation sequencing data, specifically for the problems of error correction, read mapping, gene expression, gene network construction, and multimodal data integration. The types of data studied under the project include both short- and long-read sequencing, and both DNA sequencing to study genomes and RNA sequencing to study gene expression. Software for several research tasks was analyzed in the context of applications drawn from pangenomics, systems biology, and single-cell biology.

Intellectual Merits: For each problem area studied, work carried out under the project established benchmark datasets and evaluated the software on a range of metrics, including reproducibility, quality of results, robustness and sensitivity to parameter values, run-time performance, memory usage, and ability to process real-world datasets. These results inform practitioners of the capabilities, limitations, and appropriate use of the software programs studied. They also show researchers in the respective areas where future efforts are needed.

Publications resulting from this work themselves earned reproducibility badges, a feature now adopted by some important conferences and journals. One publication resulting from the project was a finalist for the Best Reproducibility Advancement Award at the Supercomputing 2021 conference and was selected for the Student Cluster Competition Reproducibility Challenge.

Research into the comprehensive assessment of bioinformatics software, and the resulting understanding of their limitations, naturally led the project team to develop new approaches, algorithms, and software to overcome current limitations and bottlenecks.

Broader Impacts: The project led to peer-reviewed publications in conferences and journals, the establishment of benchmark datasets, and open-source software, both for evaluation and for the new methods developed under the project. Software products are available on GitHub, and datasets are available on Zenodo.

The project supported the training of many undergraduate and graduate students on the important topic of scientific reproducibility. It contributed to the Reproducibility Challenge of the Student Cluster Competition, in which student teams representing universities from around the world compete.


Last Modified: 05/12/2024
Modified by: Srinivas Aluru
