Award Abstract # 1447711
BIGDATA: F: DKA: DKM: Novel Out-of-core and Parallel Algorithms for Processing Biological Big Data

NSF Org: IIS, Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CONNECTICUT
Initial Amendment Date: August 25, 2014
Latest Amendment Date: August 25, 2014
Award Number: 1447711
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
IIS, Division of Information & Intelligent Systems
CSE, Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount: $1,200,000.00
Total Awarded Amount to Date: $1,200,000.00
Funds Obligated to Date: FY 2014 = $1,200,000.00
History of Investigator:
  • Sanguthevar Rajasekaran (Principal Investigator)
    rajasek@engr.uconn.edu
  • Sartaj Sahni (Co-Principal Investigator)
  • Joerg Graf (Co-Principal Investigator)
  • George Weinstock (Co-Principal Investigator)
  • Jinbo Bi (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Connecticut
438 WHITNEY RD EXTENSION UNIT 1133
STORRS
CT  US  06269-9018
(860)486-3622
Sponsor Congressional District: 02
Primary Place of Performance: University of Connecticut
257 ITEB, 371 Fairfield Way
Storrs
CT  US  06269-4155
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): WNTPS995QBM7
Parent UEI:
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

We live in an era when vast amounts of data are being generated at low cost in several domains of science and engineering. However, advances in analytics tools have not kept pace with data generation. In particular, existing tools take too much time. A main reason is that the core memory of a computer cannot hold all the data to be analyzed, so most of the data have to be kept in secondary storage (SS) such as solid-state drives and rotating disks. Data access times for SS are several orders of magnitude longer than for core memory, so large speedups can be obtained by minimizing the number of SS accesses. Also, although there has been much recent research on multicore and GPU algorithms for biological problems, for many of these problems only sequential in-core algorithms are known.

This project develops novel out-of-core algorithms for biological big data (BBD) analytics. The proposed parallel algorithms employ various architectures, including heterogeneous clusters of multicores and GPUs, to solve BBD problems. The scalable algorithms developed can handle petabytes of data and beyond and apply to data mining over varied datasets. This interdisciplinary project provides a new computation suite for mining voluminous biological and other data. It also gives graduate and undergraduate students first-hand research experience in the computational aspects of biological data analysis.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 41)
A.-A. Mamun and S. Rajasekaran "An efficient Minimum Spanning Tree algorithm" Proc. IEEE Symposium on Computers and Communications (ISCC) , 2016 , p.1047
A.-A. Mamun, R. Aseltine, and S. Rajasekaran "Efficient Record Linkage Algorithms Using Complete Linkage Clustering" PLoS ONE , v.11 , 2016 , p.e0154446 doi:10.1371/journal.pone.0154446
A.-A. Mamun, S. Pal, and S. Rajasekaran "KCMBT: a k-mer Counter based on Multiple Burst Trees" Bioinformatics , 2016 doi:10.1093/bioinformatics/btw345
C. Chu, J. Pei and Y. Wu "An improved approach for reconstructing consensus repeats from short sequence reads" BMC Genomics , v.19 , 2018
C. Chu, R. Nielsen and Y. Wu "REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads" PLoS ONE , v.11 , 2016 , p.e0150719 doi:10.1371/journal.pone.0150719
C. Chu, X. Li, and Y. Wu "SpliceJumper: a classification-based approach for calling splicing junctions from RNA-seq data" BMC Bioinformatics , v.16 , 2015 , p.1 doi:10.1186/1471-2105-16-S17-S10
C. Zhao and S. Sahni "Cache and energy efficient algorithms for Nussinov RNA folding" 6th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) , 2016
C. Zhao and S. Sahni "Cache and energy efficient algorithms for Nussinov RNA folding" BMC Bioinformatics , v.18 , 2017 , p.518
C. Zhao and S. Sahni "Efficient alignment of very long sequences" Advances in Science, Technology and Engineering Systems Journal , v.3 , 2018
C. Zhao and S. Sahni "Efficient computation of the Damerau-Levenshtein distance between biological sequences" IEEE International Conference on Computational Bio- and Medical Sciences (ICCABS) , 2017
C. Zhao and S. Sahni "Efficient RNA folding using Zuker's method" IEEE International Conference on Computational Bio- and Medical Sciences (ICCABS) , 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Background: We live in an era of big data, in which large amounts of data are generated in every walk of life. A major challenge posed by this growth is creating methods to analyze these datasets and extract useful information from them. The major goal of this project was to develop efficient computational techniques for analyzing big data, especially data arising in biology. The algorithms should be suitable for running on a single machine as well as on multiple machines. The use of multiple machines (i.e., parallel computing) is vital for processing voluminous datasets.

Another important aspect of big data analytics is that the data may be too large to fit in the main memory of a computer. In that case the data have to be stored on slower storage devices such as disks, and we have to ensure that the number of data accesses from the disks is minimized. Algorithms that minimize these data accesses are referred to as out-of-core algorithms. In this project we have focused on developing novel parallel and out-of-core algorithms for numerous fundamental problems in biological big data analytics.
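
To make the out-of-core idea concrete, here is a minimal Python sketch of one classic technique, external merge sort. It is an illustration only and is not one of the algorithms developed in this project. The function names and the max_lines_in_memory parameter are hypothetical, and the sketch assumes a text file with one record per line and few enough sorted runs that all of them can be open at the same time.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(sorted_lines, tmp_dir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as run:
        run.writelines(sorted_lines)
    return path


def external_sort(in_path, out_path, max_lines_in_memory=100_000):
    """Sort the lines of in_path into out_path using bounded memory.

    Pass 1 reads the input once in chunks, sorts each chunk in memory,
    and writes it to disk as a sorted run. Pass 2 streams all runs
    through a k-way merge. Returns the number of runs created.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = []
        with open(in_path) as src:
            while True:
                # Hold at most max_lines_in_memory records at a time.
                chunk = list(islice(src, max_lines_in_memory))
                if not chunk:
                    break
                # Make sure every record ends with a newline before merging.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                run_paths.append(_write_run(chunk, tmp_dir))

        runs = [open(path) for path in run_paths]
        try:
            with open(out_path, "w") as dst:
                # heapq.merge keeps only one pending line per run in memory.
                dst.writelines(heapq.merge(*runs))
        finally:
            for run in runs:
                run.close()
    return len(run_paths)
```

In this sketch every record is read from disk twice and written twice, however large the input is, and all of that traffic is sequential. That is the kind of access pattern an out-of-core algorithm aims for, in contrast to an in-memory sort that would page randomly once the data outgrow main memory.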

Intellectual Merit: We have developed parallel and out-of-core algorithms for a number of fundamental problems, including motif search, string and sequence analysis, data compression, data linkage, correlational study between phenotypes and genotypes, finding the closest pair of points, feature selection, making a neural network sparse, and error correction in sequence data. These algorithms outperform the prior algorithms for the respective problems in terms of run times and accuracy. Our research results have been published in leading journals and conferences such as ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Nucleic Acids Research, Bioinformatics, Journal of the American Medical Informatics Association (JAMIA), IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), International Conference on Neural Information Processing Systems (NeurIPS), ACM International Conference on Information and Knowledge Management (CIKM), SIAM International Conference on Data Mining (SDM), and IEEE International Conference on Data Mining (ICDM). So far we have published 19 journal papers and 33 conference papers. In addition, we have submitted 4 journal papers and 2 conference papers for publication.

Broader Impacts: Given the generic nature of the algorithms we have developed, they can be applied in different domains. Examples include Materials Science, Business, Physics, etc. For instance, our record linkage algorithms should be of use to healthcare providers. We expect that society at large could soon benefit from the outcomes of our research. All the software developed in this project is open source and has been freely released to the public through appropriate forums such as GitHub. Sixteen graduate students (four of them women) worked on this project over its duration, along with several undergraduate students. These students have received excellent training in big data and in conducting research. Some of them have graduated with a Ph.D. and joined industry (Google, Amazon, etc.). They are expected to apply the knowledge they have gained to solve real-life problems. We have also introduced the findings of this project in relevant courses, so more students have been exposed to the topic of our research.

Last Modified: 09/13/2019
Modified by: Sanguthevar Rajasekaran

Please report errors in award information by writing to: awardsearch@nsf.gov.
