
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 25, 2014 |
Latest Amendment Date: | August 25, 2014 |
Award Number: | 1447711 |
Award Instrument: | Standard Grant |
Program Manager: | Almadena Chtchelkanova, achtchel@nsf.gov, (703)292-7498, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2014 |
End Date: | August 31, 2019 (Estimated) |
Total Intended Award Amount: | $1,200,000.00 |
Total Awarded Amount to Date: | $1,200,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 438 WHITNEY RD EXTENSION UNIT 1133, STORRS, CT, US 06269-9018, (860)486-3622 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 257 ITEB, 371 Fairfield Way, Storrs, CT, US 06269-4155 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Big Data Science & Engineering |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
We live in an era when vast amounts of data are being generated at low cost in many domains of science and engineering. However, advances in analytics tools have not kept pace with data generation. In particular, existing tools take too much time. A main reason is that the main memory of a computer cannot hold all the data to be analyzed; most of the data have to be kept in secondary storage (SS) such as solid-state drives and (rotating) disks. Data access times for SS are several orders of magnitude longer than for main memory, so tremendous speedups can be obtained by minimizing the number of data accesses to SS. In addition, although there has been much recent research on multicore and GPU algorithms for biological problems, for many of these problems only sequential in-core algorithms are known.
This project develops novel out-of-core algorithms for biological big data (BBD) analytics. The proposed parallel algorithms target various architectures, including heterogeneous clusters of multicores and GPUs, to solve BBD problems. The algorithms are designed to scale to petabytes of data and beyond, and to support data mining over varied datasets. This interdisciplinary project provides a new computational suite for mining voluminous biological and other data. It also gives graduate and undergraduate students first-hand research experience in the computational aspects of biological data analysis.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Project Outcomes Report
Background: We live in an era of big data where large amounts of data are generated in every walk of life. One of the major challenges posed by this growth lies in creating methods to analyze these datasets and extract useful information from them. The major goal of this project was to develop efficient computational techniques for analyzing big data, especially data arising in biology. These algorithms were to be suitable for running on a single machine as well as on multiple machines. The use of multiple machines (i.e., parallel computing) is vital for processing voluminous datasets.
Another important aspect of big data analytics is that the data may be very large and it may not fit in the main memory of a computer. In other words, the data may have to be stored in slower storage devices such as disks. In this case we have to ensure that the number of data accesses from the disks is minimized. Algorithms that minimize these data accesses are referred to as out-of-core algorithms. In this project we have focused on the development of novel parallel and out-of-core algorithms for solving numerous fundamental problems in biological big data analytics.
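To make the idea of minimizing disk accesses concrete, here is a minimal sketch of a classic out-of-core technique, external merge sort. It is a textbook illustration and not one of the algorithms developed in this project; the function name `external_sort` and the parameter `max_lines_in_memory` are hypothetical names chosen for this example. The input is read in memory-sized chunks, each chunk is sorted and written to disk as a run, and the runs are then merged in a single streaming pass, so every record is read from disk only twice.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(input_path, output_path, max_lines_in_memory=100_000):
    """Sort a text file line by line, holding at most
    max_lines_in_memory lines in RAM during the chunking pass.

    Returns the number of sorted runs written to disk.
    """
    run_paths = []
    try:
        # Pass 1: read the input once, in memory-sized chunks, and write
        # each chunk back to disk as a sorted run.
        with open(input_path, "r", encoding="utf-8") as src:
            while True:
                chunk = list(islice(src, max_lines_in_memory))
                if not chunk:
                    break
                # Ensure every line ends with a newline so runs merge cleanly.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: stream all runs through a k-way merge. heapq.merge is
        # lazy, so only one pending line per run is held in memory.
        runs = [open(p, "r", encoding="utf-8") for p in run_paths]
        try:
            with open(output_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for f in runs:
                f.close()
    finally:
        # Remove the temporary run files whether or not the sort succeeded.
        for p in run_paths:
            os.remove(p)
    return len(run_paths)
```

Calling `external_sort("reads.txt", "reads.sorted.txt")` (both file names are placeholders) sorts the file using two sequential passes over the data. A larger `max_lines_in_memory` produces fewer runs, which keeps the merge pass cheap; in this sketch the value would have to be tuned to the memory actually available.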
Intellectual Merit: We have developed parallel and out-of-core algorithms for solving a number of fundamental problems, including motif search, string and sequence analysis, data compression, data linkage, correlational study between phenotypes and genotypes, finding the closest pair of points, feature selection, making a neural network sparse, and error correction in sequence data. These algorithms outperform the prior algorithms for the respective problems in terms of run time and accuracy. Our research results have been published in leading journals and conferences such as the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Nucleic Acids Research, Bioinformatics, the Journal of the American Medical Informatics Association (JAMIA), IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), the Conference on Neural Information Processing Systems (NeurIPS), the ACM International Conference on Information and Knowledge Management (CIKM), the SIAM International Conference on Data Mining (SDM), and the IEEE International Conference on Data Mining (ICDM). We have thus far published 19 journal papers and 33 conference papers. In addition, we have submitted 4 journal papers and 2 conference papers for publication.
Broader Impacts: Because the algorithms we have developed are generic, they can be applied in other domains, such as materials science, business, and physics. For instance, our record linkage algorithms should be of use to healthcare providers. We expect that society at large could soon benefit from the outcomes of our research. All the software developed in this project is open source and has been freely released to the public through appropriate forums such as GitHub. Sixteen graduate students (4 of them women) worked on this project over its duration, as did several undergraduate students. These students have received excellent training in big data and in conducting research. Some of them have graduated with a Ph.D. and joined industry (at Google, Amazon, etc.). They are expected to apply the knowledge they have gained to solve real-life problems. We have also introduced the findings of this project in relevant courses, exposing more students to the topic of our research.
Last Modified: 09/13/2019
Modified by: Sanguthevar Rajasekaran