Award Abstract # 1747447
SHF: EAGER: HI-HDFS - Holistic I/O optimizations for the Hadoop distributed filesystem

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: OHIO STATE UNIVERSITY, THE
Initial Amendment Date: August 29, 2017
Latest Amendment Date: August 29, 2017
Award Number: 1747447
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
 achtchel@nsf.gov
 (703)292-7498
 CCF Division of Computing and Communication Foundations
 CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2017
End Date: August 31, 2018 (Estimated)
Total Intended Award Amount: $150,000.00
Total Awarded Amount to Date: $150,000.00
Funds Obligated to Date: FY 2017 = $150,000.00
History of Investigator:
  • Spyros Blanas (Principal Investigator)
    blanas.2@osu.edu
  • Srinivasan Parthasarathy (Co-Principal Investigator)
  • Yang Wang (Co-Principal Investigator)
Recipient Sponsored Research Office: Ohio State University
1960 KENNY RD
COLUMBUS
OH  US  43210-1016
(614)688-8735
Sponsor Congressional District: 03
Primary Place of Performance: Ohio State University
1960 Kenny Road
Columbus
OH  US  43210-1016
Primary Place of Performance Congressional District: 03
Unique Entity Identifier (UEI): DLWBSLWAJWR1
Parent UEI: MN4MDDMN8529
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7916, 7942
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

File systems and their outdated POSIX "byte stream" interface suffer from an impedance mismatch with the versatile I/O requirements of today's applications. Specifically, the I/O path from the application to the raw storage device is growing longer and involves an intricate interplay of software and hardware components. The result is complex aggregate I/O patterns that application developers (often subject matter experts with limited knowledge of how massive concurrency creates I/O bottlenecks) cannot optimize on intuition alone. File systems that tout high scalability, such as the Hadoop distributed file system, largely achieve it by limiting applications to sequential access patterns. Whether the I/O performance of the Hadoop distributed file system can be accelerated for analytical applications whose complex data models cannot readily serialize data contiguously for fast sequential access remains an open question.

This project seeks to address this question and build HI-HDFS -- a framework that automatically collects and manages semantically richer I/O metadata to guide object placement in the Hadoop distributed file system. The HI-HDFS framework synthesizes the I/O activity across software components throughout the datacenter in a navigable graph structure to identify application-agnostic motifs in I/O activity. A novel I/O forecasting technique identifies and ameliorates bottlenecks at large scale by inspecting I/O activity from small-scale runs. Overall, the HI-HDFS framework challenges the I/O optimization mantra that manual data placement is the cornerstone of I/O performance and paves the way towards next-generation object-centric storage systems for high-performance computers. The efficacy of this automated approach will be examined on a complex data processing workload from the domain of emergency response which exhibits I/O patterns that are characteristic of modern analytical applications. The broader impacts of this work are expected to include open-source prototype implementations as well as pedagogical impact on a cloud computing course for both Computer Science and Data Analytics undergraduate majors at Ohio State.
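
To make the notion of a navigable I/O-activity graph concrete, the following minimal Python sketch records I/O events from different software components as edges in a directed graph and then runs a simple query over it. This is a hypothetical illustration of the idea, not the HI-HDFS API; every name in it (add_io_event, the component labels) is invented for exposition.

    # Hypothetical sketch: components and storage targets are graph nodes;
    # edges accumulate byte counts so heavily used I/O paths stand out.
    import networkx as nx

    def add_io_event(g, component, target, operation, nbytes):
        if not g.has_edge(component, target):
            g.add_edge(component, target, operation=operation, bytes=0)
        g[component][target]["bytes"] += nbytes

    g = nx.DiGraph()
    add_io_event(g, "spark-executor-3", "hdfs-datanode-7", "write", 128 * 2**20)
    add_io_event(g, "spark-executor-5", "hdfs-datanode-7", "write", 64 * 2**20)
    add_io_event(g, "hdfs-datanode-7", "/dev/sdb", "write", 128 * 2**20)

    # One possible "motif" query: which target absorbs the most written bytes?
    hottest = max(g.nodes, key=lambda n: sum(d["bytes"] for _, _, d in g.in_edges(n, data=True)))
    print(hottest)  # -> "hdfs-datanode-7"

Because the graph spans software layers, a query like this can attribute pressure on one storage target to the set of upstream components that generated it, which is the kind of cross-component view a single layer's logs cannot provide.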

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Kang, Donghe and Patel, Vedang and Khandrika, Kalyan and Blanas, Spyros and Wang, Yang and Parthasarathy, Srinivasan. "Characterizing I/O optimization opportunities for array-centric applications on HDFS." IEEE High Performance Extreme Computing Conference (HPEC), 2018. DOI: 10.1109/HPEC.2018.8547529
Quoc, Do Le and Akkus, Istemi Ekin and Bhatotia, Pramod and Blanas, Spyros and Chen, Ruichuan and Fetzer, Christof and Strufe, Thorsten. "ApproxJoin: Approximate Distributed Joins." ACM Symposium on Cloud Computing (SoCC), 2018. DOI: 10.1145/3267809.3267834
Shi, Rong and Gan, Yifan and Wang, Yang. "Evaluating Scalability Bottlenecks by Workload Extrapolation." IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2018. DOI: 10.1109/MASCOTS.2018.00039
Sun, Jiankai and Bandyopadhyay, Bortik and Bashizade, Armin and Liang, Jiongqian and Sadayappan, P. and Parthasarathy, Srinivasan. "ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation." Proceedings of the AAAI Conference on Artificial Intelligence, v.33, 2019. DOI: 10.1609/aaai.v33i01.3301265
Xing, Haoyuan and Floratos, Sofoklis and Blanas, Spyros and Byna, Suren and Prabhat, M. and Wu, Kesheng and Brown, Paul. "ArrayBridge: Interweaving Declarative Array Processing in SciDB with Imperative HDF5-Based Programs." IEEE 34th International Conference on Data Engineering (ICDE), 2018. DOI: 10.1109/ICDE.2018.00092

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

An impedance mismatch exists between the increasing sophistication of array-centric analytics and the rudimentary bytestream-based I/O interface of parallel file systems. In particular, scientific data analytics pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. The adoption of scalable data analytics infrastructure further exacerbates I/O bottlenecks for many applications, as processing small files is notoriously inefficient. Such I/O bottlenecks commonly arise in problems as diverse as detecting supernovae and post-processing computational fluid dynamics simulations.

The major goal of this project was to characterize unexplored I/O optimization opportunities that arise during the analysis of array-centric datasets, often with machine learning techniques. The project focused on common analytical methods, in particular sequence learning using a long short-term memory (LSTM) network and image classification using a convolutional neural network (CNN), and analyzed their respective I/O patterns on the Hadoop Distributed File System (HDFS), a parallel file system popular in cloud environments for its ability to scale I/O to thousands of nodes. The main thrusts of the project were (1) identifying I/O patterns and bottlenecks at runtime at various levels of the I/O stack, (2) predicting their performance impact, and (3) optimizing I/O to alleviate bottlenecks and improve performance. The major activities of this project were (1) the development of PatternMiner, a tool that analyzes I/O traces from small experiments and extrapolates them to larger scales to understand I/O bottlenecks, and (2) the development of ASHWHIN, an I/O library that exposes an array-centric data access interface to applications but seamlessly stores data in the Hadoop Distributed File System.
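
The report does not reproduce PatternMiner's actual algorithm; as a rough, hypothetical illustration of the extrapolation idea, the Python sketch below fits each operation's count from small-scale runs to a power law in cluster size and predicts the count at a larger scale. The function name and the sample trace are invented for exposition.

    # Hypothetical sketch of trace extrapolation: fit count ~ a * nodes^b
    # in log space from small runs; an exponent b > 1 hints that the
    # operation grows superlinearly and may become a bottleneck at scale.
    import numpy as np

    def extrapolate(op_counts_by_scale, target_nodes):
        predictions = {}
        for op, samples in op_counts_by_scale.items():
            nodes = np.array([n for n, _ in samples], dtype=float)
            counts = np.array([c for _, c in samples], dtype=float)
            b, log_a = np.polyfit(np.log(nodes), np.log(counts), 1)
            predictions[op] = np.exp(log_a) * target_nodes ** b
        return predictions

    # Invented sample: RPC counts observed on 2-, 4-, and 8-node runs.
    trace = {"namenode_getBlockLocations": [(2, 410), (4, 1680), (8, 6900)]}
    print(extrapolate(trace, target_nodes=128))

Superlinear growth in metadata RPCs of this kind is one way a small-scale trace can flag a component such as the NameNode as a future bottleneck before the large run is ever attempted.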

The key outcome of this project is that it has demonstrated that object consolidation has the potential to equalize performance across systems with diverse I/O characteristics. Although the research community has extensively studied data placement algorithms for complex I/O hierarchies, the results from this project show that data consolidation mechanisms have an equally significant role to play. Research results have been disseminated through publication in venues that focus on database systems (IEEE ICDE), cloud computing (ACM SoCC), performance analysis (IEEE MASCOTS), and high-performance computing (IEEE HPEC). Additional papers are currently under review. The prototype implementation of the developed system is available under an open-source license at http://code.osu.edu/arraybridge.
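
As a hedged illustration of what object consolidation can look like (not the mechanism ASHWHIN actually implements), the Python sketch below packs many small serialized arrays into one large blob plus an offset index, so a file system tuned for large sequential objects stores one file instead of millions. All names here are hypothetical.

    # Hypothetical sketch: consolidate small arrays into one blob + index,
    # then serve random reads of individual arrays via the index.
    import io
    import numpy as np

    def consolidate(arrays):
        buf, index, offset = io.BytesIO(), {}, 0
        for name, arr in arrays.items():
            payload = arr.tobytes()
            index[name] = (offset, len(payload), arr.dtype.str, arr.shape)
            buf.write(payload)
            offset += len(payload)
        return buf.getvalue(), index

    def fetch(blob, index, name):
        off, length, dtype, shape = index[name]
        return np.frombuffer(blob[off:off + length], dtype=dtype).reshape(shape)

    blob, idx = consolidate({f"img{i}": np.zeros((8, 8), np.float32) for i in range(3)})
    print(fetch(blob, idx, "img1").shape)  # random access without small files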

Broader impacts from this project include curriculum development activities and professional development opportunities for students. From a curricular standpoint, the I/O patterns of the scientific applications identified in this project guided the development of lab assignments for the "Data Management in the Cloud" course based on various Hadoop technologies (HDFS, Impala, Spark, TileDB). From a pedagogical standpoint, the project directly contributed to the professional development of two Ph.D. students, one M.Sc. student, and two B.Sc. students.


Last Modified: 01/26/2019
Modified by: Spyros Blanas
