
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: The Ohio State University
Initial Amendment Date: August 29, 2017
Latest Amendment Date: August 29, 2017
Award Number: 1747447
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova, achtchel@nsf.gov, (703) 292-7498, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2017
End Date: August 31, 2018 (Estimated)
Total Intended Award Amount: $150,000.00
Total Awarded Amount to Date: $150,000.00
Funds Obligated to Date:
History of Investigator: Spyros Blanas (Principal Investigator)
Recipient Sponsored Research Office: 1960 Kenny Rd, Columbus, OH 43210-1016, US; (614) 688-8735
Sponsor Congressional District:
Primary Place of Performance: 1960 Kenny Road, Columbus, OH 43210-1016, US
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
File systems and their outdated POSIX "byte stream" interface suffer from an impedance mismatch with the versatile I/O requirements of today's applications. The I/O path from the application to the raw storage device is growing longer, and it involves the interplay of intricate software and hardware components. This produces complex aggregate I/O patterns that application developers (often subject matter experts with limited knowledge of how massive concurrency creates I/O bottlenecks) cannot optimize based on intuition alone. File systems that tout high scalability, such as the Hadoop Distributed File System, largely achieve it by limiting applications to sequential access patterns. Whether the I/O performance of the Hadoop Distributed File System can be accelerated for analytical applications whose complex data models cannot readily be serialized contiguously for fast sequential access remains an open question.
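To make the byte-stream mismatch concrete, the minimal Python sketch below (an illustration, not part of the award) flattens a two-dimensional array through a byte-stream view: a single logical column read becomes several small, scattered reads, which is exactly the kind of aggregate pattern that is hard to optimize by intuition.

```python
import numpy as np

# An application-level 2-D array (rows x columns).
a = np.arange(12, dtype=np.int32).reshape(3, 4)

# A POSIX-style byte stream sees only the flattened, row-major bytes.
stream = a.tobytes()

# Reading one logical *column* now needs one small, strided read per row
# (assuming a little-endian host for the byte decoding).
item = a.itemsize                 # 4 bytes per int32
row_bytes = a.shape[1] * item     # 16 bytes per row
col0 = [
    int.from_bytes(stream[r * row_bytes : r * row_bytes + item], "little")
    for r in range(a.shape[0])
]
print(col0)  # [0, 4, 8] -- three scattered reads for one logical column
```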
This project seeks to answer this question by building HI-HDFS, a framework that automatically collects and manages semantically richer I/O metadata to guide object placement in the Hadoop Distributed File System. The HI-HDFS framework synthesizes the I/O activity of software components throughout the datacenter into a navigable graph structure to identify application-agnostic motifs in I/O activity. A novel I/O forecasting technique identifies and ameliorates bottlenecks at large scale by inspecting I/O activity from small-scale runs. Overall, the HI-HDFS framework challenges the I/O optimization mantra that manual data placement is the cornerstone of I/O performance, and paves the way towards next-generation object-centric storage systems for high-performance computers. The efficacy of this automated approach will be examined on a complex data processing workload from the domain of emergency response that exhibits I/O patterns characteristic of modern analytical applications. The broader impacts of this work are expected to include open-source prototype implementations as well as pedagogical impact on a cloud computing course for both Computer Science and Data Analytics undergraduate majors at Ohio State.
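The abstract does not specify how HI-HDFS represents the synthesized I/O activity. The hypothetical sketch below shows one way per-component I/O events could be combined into a navigable graph in which recurring component-to-file edges surface application-agnostic motifs; the `IOEvent` record, the edge annotations, and the motif criterion are all illustrative assumptions.

```python
from collections import defaultdict, namedtuple

# Hypothetical trace record: which component touched which file, and how.
IOEvent = namedtuple("IOEvent", ["component", "op", "path", "nbytes"])

events = [
    IOEvent("mapper-1",  "read",  "/data/block_0001", 4096),
    IOEvent("mapper-2",  "read",  "/data/block_0001", 4096),
    IOEvent("mapper-1",  "read",  "/data/block_0002", 131072),
    IOEvent("reducer-1", "write", "/out/part-00000",  65536),
]

# Navigable graph: component -> file edges, annotated with ops and volume.
graph = defaultdict(lambda: defaultdict(lambda: {"ops": set(), "nbytes": 0}))
for ev in events:
    edge = graph[ev.component][ev.path]
    edge["ops"].add(ev.op)
    edge["nbytes"] += ev.nbytes

# One illustrative "motif": a file read by several components with small
# requests, a pattern that often signals a shared-input hotspot.
readers = defaultdict(list)
for comp, files in graph.items():
    for path, edge in files.items():
        if "read" in edge["ops"]:
            readers[path].append(comp)

for path, who in readers.items():
    if len(who) > 1:
        print(f"hotspot candidate: {path} read by {len(who)} components")
```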
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
An impedance mismatch exists between the increasing sophistication of array-centric analytics and the rudimentary byte-stream I/O interface of parallel file systems. In particular, scientific data analytics pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. The adoption of scalable data analytics infrastructure further exacerbates I/O bottlenecks for many applications, as processing small files is notoriously inefficient. Such I/O bottlenecks commonly arise in problems as diverse as detecting supernovae and post-processing computational fluid dynamics simulations.
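As a rough local-filesystem stand-in for the small-file problem (the project's measurements targeted HDFS; this sketch only demonstrates the per-file overhead that makes millions of small objects expensive), the code below writes the same bytes as many small files and as one consolidated file and compares the wall-clock times.

```python
import os
import tempfile
import time

PAYLOAD = os.urandom(4096)   # one small "object"
N = 2000                     # modest count; real workloads reach millions

with tempfile.TemporaryDirectory() as d:
    # Many small files: one open/write/close (plus metadata work) per object.
    t0 = time.perf_counter()
    for i in range(N):
        with open(os.path.join(d, f"obj_{i:06d}"), "wb") as f:
            f.write(PAYLOAD)
    small = time.perf_counter() - t0

    # One consolidated file: the same bytes as a single stream of writes.
    t0 = time.perf_counter()
    with open(os.path.join(d, "consolidated"), "wb") as f:
        for i in range(N):
            f.write(PAYLOAD)
    big = time.perf_counter() - t0

print(f"{N} small files: {small:.3f}s   one consolidated file: {big:.3f}s")
```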
The major goal of this project was to characterize unexplored I/O optimization opportunities that arise during the analysis of array-centric datasets, often with machine learning techniques. The project focused on common analytical methods, in particular sequence learning using a long short-term memory (LSTM) model and image classification using a convolutional neural network (CNN), and analyzed their respective I/O patterns on the Hadoop Distributed File System (HDFS), a parallel file system that is popular in cloud environments for its ability to scale I/O to thousands of nodes. The main thrusts of the project were (1) identifying I/O patterns and bottlenecks at runtime at various levels of the I/O stack, (2) predicting their performance impact, and (3) optimizing I/O to alleviate bottlenecks and improve performance. The major activities of this project were (1) the development of PatternMiner, a tool that analyzes I/O traces from small-scale experiments and extrapolates them to larger scales to understand I/O bottlenecks, and (2) the development of ASHWHIN, an I/O library that exposes an array-centric data access interface to applications while seamlessly storing data in HDFS.
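The report does not describe ASHWHIN's actual interface, so the sketch below is a hypothetical illustration of the general idea of an array-centric facade over a file store: the application addresses data by array name and chunk index, while the library hides serialization and maps each chunk to a file. A local directory stands in for HDFS, and the `ArrayStore` class and its method names are invented for this example.

```python
import tempfile
from pathlib import Path

import numpy as np

class ArrayStore:
    """Hypothetical array-centric facade over a chunk-per-file layout."""

    def __init__(self, root, dtype=np.float64):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.dtype = np.dtype(dtype)

    def _chunk_path(self, name, index):
        return self.root / f"{name}.chunk{index:06d}"

    def write_chunk(self, name, index, chunk):
        # The application hands over an array; serialization is hidden here.
        data = np.ascontiguousarray(chunk, dtype=self.dtype).tobytes()
        self._chunk_path(name, index).write_bytes(data)

    def read_chunk(self, name, index, shape):
        data = self._chunk_path(name, index).read_bytes()
        return np.frombuffer(data, dtype=self.dtype).reshape(shape)

# Usage: the application never sees byte offsets, only arrays and chunks.
store = ArrayStore(tempfile.mkdtemp())
store.write_chunk("temperature", 0, np.linspace(0.0, 1.0, 16).reshape(4, 4))
print(store.read_chunk("temperature", 0, (4, 4))[2, 3])
```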
The key outcome of this project is the demonstration that object consolidation can equalize performance across systems with diverse I/O characteristics. Although the research community has extensively studied data placement algorithms for complex I/O hierarchies, the results from this project show that data consolidation mechanisms have an equally significant role to play. Research results have been disseminated through publication in venues that focus on database systems (IEEE ICDE), cloud computing (ACM SoCC), performance analysis (IEEE MASCOTS), and high-performance computing (IEEE HPEC). Additional papers are currently under review. The prototype implementation of the developed system is available under an open-source license at http://code.osu.edu/arraybridge.
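The report does not detail the consolidation mechanism itself. As a hedged sketch of the general technique, the code below packs many small objects into one data file together with an offset index, so that any object can later be fetched with a single seek and read instead of a per-object open; the file layout and names are illustrative.

```python
import json
import os
import tempfile

objects = {f"obj_{i}": os.urandom(100 + i) for i in range(50)}

datafile = os.path.join(tempfile.mkdtemp(), "packed.bin")
index = {}

# Pack: append each object and remember its (offset, length).
with open(datafile, "wb") as f:
    for name, payload in objects.items():
        index[name] = (f.tell(), len(payload))
        f.write(payload)

# The index itself is tiny and can live beside the data file.
with open(datafile + ".idx", "w") as f:
    json.dump(index, f)

# Fetch one object with a single seek + read.
def fetch(name):
    offset, length = index[name]
    with open(datafile, "rb") as f:
        f.seek(offset)
        return f.read(length)

assert fetch("obj_7") == objects["obj_7"]
```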
Broader impacts from this project include curriculum development activities and professional development opportunities for students. From a curricular standpoint, the I/O patterns of the scientific applications identified in this project guided the development of lab assignments for the "Data Management in the Cloud" course, based on several Hadoop-ecosystem technologies (HDFS, Impala, Spark, TileDB). From a professional development standpoint, this project has directly contributed to the training of two Ph.D., one M.Sc., and two B.Sc. students.
Last Modified: 01/26/2019
Modified by: Spyros Blanas