Award Abstract # 0746832
CAREER: A Scalable Hierarchical Framework for High-Performance Data Storage

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: VIRGINIA POLYTECHNIC INSTITUTE & STATE UNIVERSITY
Initial Amendment Date: January 14, 2008
Latest Amendment Date: April 12, 2012
Award Number: 0746832
Award Instrument: Continuing Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703) 292-7498
CCF - Division of Computing and Communication Foundations
CSE - Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2008
End Date: July 31, 2014 (Estimated)
Total Intended Award Amount: $400,000.00
Total Awarded Amount to Date: $476,000.00
Funds Obligated to Date: FY 2008 = $124,242.00
FY 2009 = $85,725.00
FY 2010 = $88,867.00
FY 2011 = $92,197.00
FY 2012 = $84,969.00
History of Investigator:
  • Ali Butt (Principal Investigator)
    butta@cs.vt.edu
Recipient Sponsored Research Office: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
(540)231-5281
Sponsor Congressional District: 09
Primary Place of Performance: Virginia Polytechnic Institute and State University
300 TURNER ST NW
BLACKSBURG
VA  US  24060-3359
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): QDE5UHE5XD16
Parent UEI: X6KEFGLHSJX7
NSF Program(s): ADVANCED COMP RESEARCH PROGRAM,
COMPUTING PROCESSES & ARTIFACT,
Software & Hardware Foundation,
HIGH-PERFORMANCE COMPUTING
Primary Program Source: 01000809DB NSF RESEARCH & RELATED ACTIVIT
01000910DB NSF RESEARCH & RELATED ACTIVIT
01001011DB NSF RESEARCH & RELATED ACTIVIT
01001112DB NSF RESEARCH & RELATED ACTIVIT
01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045, 7942, 9216, 9218, 9251, HPCC
Program Element Code(s): 408000, 735200, 779800, 794200
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Modern scientific applications, such as analyzing information from large-scale distributed sensors, climate monitoring, and forecasting environmental impacts, require powerful computing resources and entail managing an ever-growing amount of data. While high-end computer architectures comprising tens of thousands or more processors are becoming the norm in modern High Performance Computing (HPC) systems supporting such applications, this growth in computational power has not been matched by a corresponding improvement in storage and I/O systems. Consequently, there is a widening gap between storage system performance and the computational power of clusters, which poses critical challenges, especially in supporting emerging petascale scientific applications. This research develops a framework for bridging this performance gap and supporting efficient and reliable data management for HPC. Through the innovation, design, development, and deployment of the framework, the investigators improve the I/O performance of modern HPC setups.
The target HPC environments present unique research challenges, namely maintaining I/O performance with increasing storage capacity, administering a large number of resources at low cost, supporting high-volume long-distance data transfers, and adapting to the varying I/O demands of applications. This research addresses these challenges in storage management by employing a Scalable Hierarchical Framework for HPC data storage. The framework provides high-performance, reliable storage within HPC cluster sites via hierarchical organization of storage resources; decentralized interactions between sites to support high-speed, high-volume data exchange and strategic data placement; and system-wide I/O optimizations. The overall goal is a data storage framework attuned to the needs of modern HPC applications that mitigates the underlying performance gap between compute resources and the I/O system. This research adopts a holistic approach in which all system components interact to yield an efficient data management system for HPC.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 31)
Benjamin A. Schmidt and Ali R. Butt "Facilitating Intermediate Node Discovery for Decentralized Offloading in High Performance Computing Centers" Proceedings of IEEE SoutheastCon , 2009
Chreston Miller, Ali R. Butt, and Patrick Butler "On Utilization of Contributory Storage in Desktop Grids" Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS '08) , 2008
Guanying Wang, Henry Monti, Ali R. Butt, and Karan Gupta "Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem" Proceedings of the 19th Annual Meeting of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) , 2011
Henry Monti, Ali R. Butt, and Sudharshan S. Vazhkudai "CATCH: A Cloud-based Adaptive Data Transfer Service for HPC" Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) , 2011
Henry Monti, Ali R. Butt, and Sudharshan S. Vazhkudai "Just-in-time Staging of Large Input Data for Supercomputing Jobs" Proceedings of the ACM Petascale Data Storage Workshop @SC'08 , 2008
Henry Monti, Ali R. Butt, and Sudharshan S. Vazhkudai "On Timely Staging of HPC Job Input Data" IEEE Transactions on Parallel and Distributed Systems , v.24 , 2013 , p.1841-1851 http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.279
Henry Monti, Ali R. Butt, and Sudharshan S. Vazhkudai "Timely Offloading of Result-Data in HPC Centers" Proceedings of the ACM International Conference on Supercomputing (ICS '08) , 2008
Henry Monti, Ali R. Butt, and Sudharshan S. Vazhkudai. "Reconciling Scratch Space Consumption, Exposure, and Volatility to Achieve Timely Staging of Job Input Data" Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS) , 2010
Krish K.R., Guanying Wang, Puranjoy Bhattacharjee, Ali R. Butt, and Chris Gniady "On Reducing Energy Management Delays in Disks" Journal of Parallel and Distributed Computing , v.73 , 2013 , p.823-835 http://dx.doi.org/10.1016/j.jpdc.2013.02.011

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

High performance computing (HPC) systems face a deluge of data from state-of-the-art and emerging petascale scientific computing applications. The goal of this project is to address the storage and I/O challenges arising from such data-intensive operations. We have developed a framework to bridge the performance gap between storage and compute components and to support efficient and reliable data management for HPC. We adopted a two-pronged approach: providing high-performance, reliable storage within HPC cluster sites via hierarchical organization of distributed storage resources, and enabling decentralized interactions between sites to support high-speed, high-volume data exchange.

A key contribution of the project is the design and development of tools to optimize large-volume data transfers in HPC workflows. First, we developed a contributory-storage-based solution that enables HPC centers to offload data to user-provided distributed storage sites. We also developed cloud-enabled techniques for seamless data transfer between HPC centers and users, and for offloading data-intensive workloads from HPC centers to the cloud. Our offloading approaches exploit the orthogonal bandwidth available between the users and the HPC center and relieve the center from handling I/O-intensive tasks, allowing it to focus on the compute-intensive components for which it is better provisioned. Evaluation of our approach using both real deployments and simulations demonstrates the feasibility of decentralized offloading; we observed improvements in data transfer times of as much as 81.1% for typical HPC workloads.

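To make the offloading idea concrete, the following minimal sketch (in Python, with hypothetical node names and figures; it is not the project's actual implementation) splits a job's result data across user-contributed intermediate nodes, preferring nodes whose slowest link is fastest so that the center's scratch space drains quickly over the orthogonal center-to-user paths.

from dataclasses import dataclass

@dataclass
class Intermediate:
    name: str
    bw_center_mbps: float   # bandwidth of the HPC-center-to-node link
    bw_user_mbps: float     # bandwidth of the node-to-end-user link
    free_gb: float          # contributed storage still available

def plan_offload(result_gb, candidates):
    """Greedily assign chunks of result data to intermediate nodes,
    fastest bottleneck link first, until all data is placed."""
    plan, remaining = [], result_gb
    for node in sorted(candidates,
                       key=lambda n: min(n.bw_center_mbps, n.bw_user_mbps),
                       reverse=True):
        if remaining <= 0:
            break
        chunk = min(remaining, node.free_gb)
        if chunk > 0:
            plan.append((node.name, chunk))
            remaining -= chunk
    if remaining > 0:
        raise RuntimeError("contributed storage is insufficient for the result data")
    return plan

# Example: offload 150 GB of result data to three hypothetical intermediates.
nodes = [Intermediate("labA", 400, 100, 50),
         Intermediate("labB", 200, 200, 120),
         Intermediate("home1", 50, 20, 500)]
print(plan_offload(150, nodes))   # [('labB', 120), ('labA', 30)]
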
Second, we explored the use of solid-state storage devices (SSDs) to design a novel multi-tiered data staging area that can then be seamlessly integrated with our offloading system, with the traditional HPC storage stack (e.g., Lustre) serving as secondary storage. The novelty of our approach is that we employ SSDs only in the limited number of participants that are expected to observe the peak load, thus ensuring economic feasibility. Our evaluation showed that the staging area absorbs application checkpoint data and seamlessly drains it from the various storage tiers to the parallel file system, thereby improving overall I/O performance. We also extended the work to use adaptive data placement, both across the storage layers of an HPC site and within individual nodes of a site. The evaluation yielded a better understanding of how the storage layers are used and insights into how to incorporate SSDs into the storage hierarchy.

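The staging pattern described above can be illustrated with a minimal, hypothetical sketch: bursty checkpoint writes land on a small SSD tier while it has room and are drained in the background to the parallel file system. The class, directories, and policy below are illustrative assumptions, not the project's actual code.

import shutil
from collections import deque
from pathlib import Path

class StagingArea:
    """Two-tier staging: a small SSD tier absorbs write bursts, and a background
    drain step moves staged files to the parallel file system (PFS)."""

    def __init__(self, ssd_dir: Path, pfs_dir: Path, ssd_capacity_bytes: int):
        self.ssd_dir, self.pfs_dir = ssd_dir, pfs_dir
        self.ssd_capacity = ssd_capacity_bytes
        self.ssd_used = 0
        self.drain_queue = deque()

    def write_checkpoint(self, name: str, payload: bytes) -> Path:
        """Absorb a checkpoint on the SSD tier if it fits, else spill to the PFS."""
        if self.ssd_used + len(payload) <= self.ssd_capacity:
            target = self.ssd_dir / name
            target.write_bytes(payload)
            self.ssd_used += len(payload)
            self.drain_queue.append(target)
        else:
            target = self.pfs_dir / name   # SSD tier full: write through to the PFS
            target.write_bytes(payload)
        return target

    def drain(self) -> None:
        """Background step: move staged checkpoints from the SSD tier to the PFS."""
        while self.drain_queue:
            src = self.drain_queue.popleft()
            size = src.stat().st_size
            shutil.move(str(src), str(self.pfs_dir / src.name))
            self.ssd_used -= size

# Example with hypothetical mount points:
# stage = StagingArea(Path("/ssd/stage"), Path("/lustre/scratch"), 64 * 2**30)
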
Finally, we explored the use of emerging technologies such as accelerators and low-power micro-servers in supporting HPC I/O stack operations. Specifically, we examined these components for supporting I/O-intensive workloads both in HPC applications and in the widely used cloud programming model, Hadoop. To this end, we used low-cost GPUs to build a flexible, fault-tolerant, and high-performance RAID-6 solution for a parallel file system. We couple capabilities provided by the file system, such as striping individual files over multiple disks, with the computational power of a GPU to provide flexible and fast parity computation for encoding and for rebuilding degraded RAID arrays. The results demonstrate that leveraging GPUs for I/O support functions, i.e., RAID parity computation, is feasible and can provide an efficient alternative to specialized-hardware-based solutions. The effect is to reduce the cost of HPC I/O systems and improve the overall efficiency of the system.

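As background for the parity work, the sketch below shows the arithmetic a RAID-6 encoder performs: a P parity block computed as the XOR of the data blocks and a Q parity block computed over the Galois field GF(2^8). This pure-Python version only illustrates the computation that can be offloaded to a GPU; it is not the project's GPU kernel or its parallel file system integration.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using the RAID-6 polynomial
    x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def raid6_parity(blocks):
    """Compute the P (plain XOR) and Q (weighted by powers of the generator
    g = 2) parity blocks for one stripe of equal-sized data blocks."""
    length = len(blocks[0])
    p, q = bytearray(length), bytearray(length)
    g_i = 1                                # g^0 for the first data block
    for block in blocks:
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(g_i, byte)
        g_i = gf_mul(g_i, 2)               # advance to g^(i+1) for the next block
    return bytes(p), bytes(q)

# Example stripe of three data blocks; P is all zero bytes here and Q is 0x99 repeated.
data = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]
p, q = raid6_parity(data)
print(p.hex(), q.hex())
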
The work on designing a robus...
