Award Abstract # 1447861
BIGDATA: F: DKM: Collaborative Research: Scalable Middleware for Managing and Processing Big Data on Next Generation HPC Systems

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CALIFORNIA, SAN DIEGO
Initial Amendment Date: August 25, 2014
Latest Amendment Date: August 25, 2014
Award Number: 1447861
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: August 31, 2017 (Estimated)
Total Intended Award Amount: $359,999.00
Total Awarded Amount to Date: $359,999.00
Funds Obligated to Date: FY 2014 = $359,999.00
History of Investigator:
  • Amitava Majumdar (Principal Investigator)
    majumdar@sdsc.edu
  • Mahidhar Tatineni (Co-Principal Investigator)
Recipient Sponsored Research Office: University of California-San Diego
9500 GILMAN DR
LA JOLLA
CA  US  92093-0021
(858)534-4896
Sponsor Congressional District: 50
Primary Place of Performance: University of California-San Diego
9500 Gilman Drive
La Jolla
CA  US  92093-0934
Primary Place of Performance Congressional District: 50
Unique Entity Identifier (UEI): UYTTZT6G9DT1
Parent UEI:
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Managing and processing large volumes of data and gaining meaningful insights is a significant challenge facing the Big Data community. Thus, it is critical that the data-intensive computing middleware (such as Hadoop, HBase, and Spark) used to process such data be diligently designed for high performance and scalability, in order to meet the growing demands of Big Data applications. While Hadoop, Spark, and HBase are gaining popularity for processing Big Data applications, these middleware and the associated Big Data applications are not able to take advantage of the advanced features of modern High Performance Computing (HPC) systems widely deployed all over the world, including many of the multi-Petaflop systems in the XSEDE environment. Modern HPC systems and the associated middleware (such as MPI and parallel file systems) have been exploiting the advances in HPC technologies (multi-/many-core architectures, RDMA-enabled networking, NVRAMs, and SSDs) during the last decade. However, Big Data middleware (such as Hadoop, HBase, and Spark) have not embraced such technologies. These disparities are taking HPC and Big Data processing into "divergent trajectories."

The proposed research, undertaken by a team of computer and application scientists from OSU and SDSC, aims to bring HPC and Big Data processing into a "convergent trajectory." The investigators will specifically address the following challenges: 1) designing novel communication and I/O runtimes for Big Data processing that exploit the features of modern multi-/many-core, networking, and storage technologies; 2) redesigning Big Data middleware (such as Hadoop, HBase, and Spark) to deliver performance and scalability on modern and next-generation HPC systems; and 3) demonstrating the benefits of the proposed approach for a set of driving Big Data applications on HPC systems. The proposed work targets four major classes of workloads and applications in the Big Data community (namely data analytics, query, interactive, and iterative) using the popular Big Data middleware (Hadoop, HBase, and Spark). The proposed framework will be validated on a variety of Big Data benchmarks and applications, and the proposed middleware and runtimes will be made publicly available to the community. The research also enables curricular advancements, via research in pedagogy, for key courses in the new data analytics programs at Ohio State and SDSC -- among the first of their kind nationwide.
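
To illustrate the kind of low-level capability such runtimes build on, the sketch below is a minimal example (not code from this project) of registering a buffer with the NIC using libibverbs, assuming an RDMA-capable InfiniBand or RoCE adapter. Registration is what later allows the NIC to read or write the buffer directly, bypassing the kernel and avoiding extra copies -- the basic mechanism behind RDMA-accelerated shuffle and file-system traffic.

    /* Minimal sketch: register a buffer for RDMA access with libibverbs.
     * Assumes an RDMA-capable NIC and libibverbs installed; compile with:
     *   gcc rdma_reg.c -libverbs
     * Illustrative only -- not this project's actual runtime code. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        if (!ctx || !pd) {
            fprintf(stderr, "failed to open device / allocate PD\n");
            return 1;
        }

        /* A shuffle/DFS buffer that the NIC may read and write directly,
         * bypassing the kernel and avoiding intermediate copies. */
        size_t len = 1 << 20;               /* 1 MiB */
        void *buf = malloc(len);
        memset(buf, 0, len);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        /* lkey/rkey are the keys peers exchange so that one side can
         * RDMA-read or RDMA-write the other's buffer with no remote CPU
         * involvement. */
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, (unsigned)mr->lkey, (unsigned)mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }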

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Advances in technology have enabled us to collect large amounts of
data from all walks of life.  Managing and processing such large
volumes of data, or "Big Data", and gaining meaningful insights is a
significant challenge facing the Big Data community. This has
a significant impact on a wide range of domains, including health
care, biomedical research, Internet search, finance and business
informatics, and scientific computing. As data-gathering technologies
and data sources witness an explosion in the amount of input data,
massive quantities of data, on the order of hundreds or thousands of
petabytes, are expected to need processing in the future. Thus, it is
critical that the data-intensive computing middleware (such as
Hadoop, Spark, and HBase) used to process such data be diligently
designed for high performance and scalability, in order to meet the
growing demands of Big Data applications.

While Hadoop, Spark, and HBase are gaining popularity for processing
Big Data applications, these middleware and the associated Big Data
applications are not able to fully take advantage of the advanced
features of modern High Performance Computing (HPC) systems widely
deployed all over the world, including many of the multi-Petaflop
systems in the XSEDE environment. Modern HPC systems and the
associated middleware (such as MPI and parallel file systems) have
been exploiting the advances in HPC technologies (multi-/many-core
architectures, RDMA-enabled networking, NVRAMs, and SSDs) during the
last decade. However, Big Data middleware (such as Hadoop, Spark, and
HBase) have not embraced such technologies. These disparities are
taking HPC and Big Data processing into "divergent trajectories".
This leads to the following broad challenge: "Can novel runtimes be
used to redesign Big Data middleware (such as Hadoop, Spark, and
HBase) to deliver performance and scalability on modern and
next-generation HPC systems?"

In this project, we have proposed new runtimes and re-designed the Big
Data middleware stacks to take advantage of modern HPC technologies
and systems. Challenges have been addressed in the following three
directions: 1) Designed novel communication and I/O runtimes for Big
Data processing while exploiting the features of modern
multi-/many-core, networking, and storage technologies; 2) Redesigned
Big Data middleware (such as Hadoop, Spark, and HBase) to deliver
performance and scalability on modern and next-generation HPC systems;
and 3) Demonstrated the benefits of the proposed approach for a set of
driving Big Data benchmarks and applications on HPC systems.  The
proposed designs have brought HPC and Big Data processing into a
"convergent trajectory".

Contributions were made in all three areas and evaluated with a range
of Big Data benchmarks and applications, including TeraGen, Sort,
TeraSort, TestDFSIO, GroupBy, PUMA, YCSB, CCIndex, HiBench,
CloudBurst, and deep learning and astronomy applications. Some
highlights of these results are:

* The new RDMA-HDFS design delivers a 2.3x speedup for TeraGen on a
  200 GB dataset.

* The new MapReduce design over Lustre improves 120 GB Sort
  performance by 25%.

* The RDMA-Spark design on 1,536 cores improves HiBench PageRank
  time by 43%.

* An astronomy application using a 65 GB dataset is accelerated by
  21% with the proposed RDMA-Spark design.

* The RDMA-based design for HBase delivers up to a 2.4x speedup for
  YCSB Workload A.

The results of this research (new designs, performance results,
benchmarks, etc.) have been made available to the community through
multiple versions of the RDMA-Hadoop, RDMA-Spark, and RDMA-HBase
libraries and the OSU HiBD (High-Performance Big Data) benchmark
suite. The latest versions of these libraries are currently running
on many large-scale InfiniBand and RoCE systems, including SDSC
Comet, and are being used by more than 275 organizations in 34
countries to accelerate Big Data applications.
 
In each of these releases, information about the tuned designs for
various components (HDFS, MapReduce, Spark, HBase, etc.) has been
shared with the HiBD user community through mailing lists. The
applications-based results have been made available to the community
through the "Performance" link of the HiBD project web page.  In
addition to the software distribution, the results have been presented
at various conferences and events through talks, tutorials, and
hands-on sessions. Multiple Ph.D. and Master's students have
performed research and received their Ph.D. and M.S. degrees as part
of this project.


Last Modified: 12/26/2017
Modified by: Amitava Majumdar
