
NSF Org: CNS Division of Computer and Network Systems
Recipient:
Initial Amendment Date: July 7, 2011
Latest Amendment Date: April 17, 2012
Award Number: 1115665
Award Instrument: Standard Grant
Program Manager: Marilyn McClure (mmcclure@nsf.gov, (703) 292-5197), CNS Division of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2011
End Date: August 31, 2015 (Estimated)
Total Intended Award Amount: $374,973.00
Total Awarded Amount to Date: $390,973.00
Funds Obligated to Date: FY 2012 = $16,000.00
History of Investigator:
Recipient Sponsored Research Office: 4000 CENTRAL FLORIDA BLVD ORLANDO FL US 32816-8005, (407) 823-0387
Sponsor Congressional District:
Primary Place of Performance: 4000 CENTRAL FLORIDA BLVD ORLANDO FL US 32816-8005
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Special Projects - CNS, CSR-Computer Systems Research
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
This project is motivated by the successful deployment of eScience applications on clouds and asks how HPC analytics applications can likewise be deployed on the cloud. Both eScience applications and HPC analytics applications manipulate tera-scale or peta-scale data and require access to expensive computing resources. However, HPC analytics applications bear several distinct characteristics, such as complex data access patterns and interest locality, which pose new challenges to their adoption in clouds.
The goal of this project is to develop a data-semantics-aware framework that enables HPC analytics in clouds. The framework is composed of three components: 1) a MapReduce API with data-semantics awareness, used to develop high-performance analysis applications; 2) a translation layer equipped with data-semantics-aware HPC interfaces; and 3) a data-affinity-aware data placement scheme. It is anticipated that cost-effective scientific data processing will significantly improve productivity and broaden economic impact. Delivering open-source software to the community will speed up the 21st-century scientific discovery process in HPC analytics areas such as cosmology, astrophysics, chromodynamics, and bioinformatics. Numerous educational benefits are expected from collaboration with several UCF educational projects, and external collaboration and community ties will be strengthened through integration with FutureGrid and the scientific computing cloud at the Department of Energy.
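As a rough illustration of the framework's first component, the sketch below shows how a MapReduce-style API might carry data-semantics hints (which variables, regions, and access patterns a job will touch) so that a runtime could group and place input data before map tasks run. This is a minimal sketch under assumed names; SemanticHint and SemanticJob are hypothetical and do not correspond to the project's actual interface.

```python
# Hypothetical sketch of a data-semantics-aware MapReduce-style API.
# SemanticHint / SemanticJob are illustrative names, not the project's real interface.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SemanticHint:
    """Describes which parts of the dataset a job will actually touch."""
    variables: List[str]              # e.g. ["temperature", "pressure"]
    region: tuple                     # e.g. a bounding box in the simulation grid
    access_pattern: str = "strided"   # e.g. "contiguous", "strided", "random"


@dataclass
class SemanticJob:
    """A MapReduce-style job that carries semantic hints for the scheduler."""
    mapper: Callable[[bytes], Iterable[tuple]]
    reducer: Callable[[str, Iterable], object]
    hints: SemanticHint

    def run(self, input_paths: List[str]) -> dict:
        # In a real system, `hints` would drive chunk grouping and placement so
        # that co-accessed data lands on the same nodes before map tasks start.
        # Here the map and reduce phases simply run locally for illustration.
        results = {}
        for path in input_paths:
            with open(path, "rb") as f:
                for key, value in self.mapper(f.read()):
                    results.setdefault(key, []).append(value)
        return {k: self.reducer(k, vs) for k, vs in results.items()}
```

In such a design the hints, rather than the analysis code itself, are what a translation layer and a data placement scheme could use to exploit interest locality.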
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full-text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
those of this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project is motivated by the successful deployment of eScience applications on clouds and asks how scientific (HPC) analytics applications can likewise be deployed on the cloud. Many scientists are exploring the possibility of deploying applications with large-scale data on cloud computing platforms such as Amazon EC2 and Windows Azure, and the success of eScience applications on clouds motivated us to bring HPC analytics applications to the cloud as well. The reason lies in the fact that eScience applications and HPC analytics applications share important features: both manipulate tera-scale or peta-scale data, and both are costly to run on one or several supercomputers or other large platforms, requiring access to expensive computing resources. However, HPC analytics applications bear several distinct characteristics, such as complex data access patterns and interest locality, which pose new challenges to their adoption in clouds.
In order to optimize the performance of HPC analytics on cloud or Hadoop infrastructure, we constructed a private cloud platform on the Marmot cluster. Marmot is part of PRObE, an NSF-sponsored project that provides a large-scale, low-level systems research facility. In this private cloud, we employ virtual machines (VMs) created with various virtualization technologies, such as Xen, KVM, and Linux Containers. From the application user's perspective, each VM acts as an independent computing node whose resources (CPU cores, memory, block I/O, etc.) can be adjusted at runtime on a pay-as-you-go basis. In addition to the VMs, the network facilities are also virtualized with state-of-the-art techniques: the VMs are connected by Open vSwitch (OVS), an open-source implementation of a distributed virtual multilayer switch. On top of this private cloud, we deployed two distinct distributed file systems (DFSs), the Hadoop Distributed File System (HDFS) and the Lustre file system, which provide the storage services for running HPC analytics on the private cloud.
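The following is a minimal sketch, under assumed names, of how such a testbed might be modeled: each VM exposes resources that can be resized at runtime, and both file systems are visible to analytics jobs through fixed mount points. VMNode, CloudTestbed, and the mount paths are illustrative assumptions, not PRObE/Marmot tooling.

```python
# Illustrative model of the private-cloud testbed described above.
# VMNode / CloudTestbed and the mount paths are hypothetical, not real tooling.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VMNode:
    name: str
    hypervisor: str     # "xen", "kvm", or "lxc"
    vcpus: int
    memory_gb: int

    def resize(self, vcpus: int = None, memory_gb: int = None) -> None:
        """Adjust resources at runtime, mirroring the pay-as-you-go model."""
        if vcpus is not None:
            self.vcpus = vcpus
        if memory_gb is not None:
            self.memory_gb = memory_gb


@dataclass
class CloudTestbed:
    nodes: List[VMNode] = field(default_factory=list)
    # Both DFSs are available to analytics jobs; the paths are illustrative.
    storage_mounts: Dict[str, str] = field(
        default_factory=lambda: {"hdfs": "hdfs://namenode:9000/", "lustre": "/mnt/lustre"}
    )

    def total_vcpus(self) -> int:
        return sum(n.vcpus for n in self.nodes)


# Example: a 4-VM slice (interconnected by an OVS bridge, not modeled here).
testbed = CloudTestbed(nodes=[VMNode(f"vm{i}", "kvm", vcpus=4, memory_gb=8) for i in range(4)])
testbed.nodes[0].resize(vcpus=8)    # scale one VM up at runtime
```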
Facilitated by the private cloud and storage infrastructure, the investigators developed a set of frameworks and middleware to enable fast execution of HPC analytics applications on the cloud. First, we developed a translation-layer framework based on a Unified I/O System (UNIO) to avoid data-movement overhead on public cloud infrastructures. Our main idea is to let both HPC simulation programs and analytics programs run atop one cloud file system, e.g., the Hadoop file system, a data-intensive file system (DIFS for short). Second, we developed a new Data-gRouping-AWare (DRAW) data placement scheme for cloud storage and data-intensive file systems to address the interest-locality issue. DRAW dynamically scrutinizes data accesses from system log files, extracts optimal data groupings, and re-organizes data layouts to achieve the maximum parallelism per group, subject to load balance; a simplified sketch of this grouping idea appears below. Third, we proposed Opass, a novel method to optimize parallel data access on distributed file systems, which reduces remote parallel data accesses and achieves a better balance of data read requests across cluster nodes.
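As referenced above, here is a simplified sketch of the grouping idea behind DRAW under assumed inputs: co-access pairs are mined from an access log, files that are frequently read together are merged into groups, and each group is spread across distinct nodes so it can be read in parallel while node load stays roughly balanced. The log format and function names are assumptions for illustration, not the published DRAW implementation.

```python
# Sketch of a DRAW-style data-grouping-aware placement (illustrative, simplified).
from collections import Counter
from itertools import combinations
from typing import Dict, List


def mine_coaccess(log: List[List[str]]) -> Counter:
    """Count how often pairs of files are accessed within the same job/request."""
    pairs = Counter()
    for accessed_files in log:
        for a, b in combinations(sorted(set(accessed_files)), 2):
            pairs[(a, b)] += 1
    return pairs


def form_groups(pairs: Counter, threshold: int = 2) -> List[set]:
    """Greedily merge files whose co-access count meets a threshold."""
    groups: List[set] = []
    for (a, b), count in pairs.most_common():
        if count < threshold:
            break
        home = next((g for g in groups if a in g or b in g), None)
        if home is None:
            groups.append({a, b})
        else:
            home.update({a, b})
    return groups


def place_groups(groups: List[set], nodes: List[str]) -> Dict[str, str]:
    """Spread each group's files across distinct, lightly loaded nodes so the
    group can be read in parallel while overall load stays roughly balanced."""
    load = {n: 0 for n in nodes}
    placement: Dict[str, str] = {}
    for group in groups:
        ranked = sorted(nodes, key=lambda n: load[n])
        for i, f in enumerate(sorted(group)):
            node = ranked[i % len(ranked)]
            placement[f] = node
            load[node] += 1
    return placement


# Example: the log reveals that a.dat/b.dat and c.dat/d.dat are read together.
log = [["a.dat", "b.dat"], ["a.dat", "b.dat", "c.dat"], ["c.dat", "d.dat"], ["c.dat", "d.dat"]]
print(place_groups(form_groups(mine_coaccess(log)), ["node1", "node2", "node3"]))
```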
In conclusion, this research project delivers key outcomes for accelerating the execution of eScience applications and HPC analytics in cloud environments. The outcomes include: 1) To solve the data-migration problem in small- to medium-sized HPC clusters, we propose a side I/O path, named SideIO, that explicitly directs analysis data to the Hadoop file system, which co-locates computation with data; checkpoint data, by contrast, may never be read back and is therefore written to the dedicated parallel file system to maximize I/O throughput (a sketch of this routing idea follows below). 2) We develop a new HPC analytics framework, called NOHAA, to provide a semantics-aware intelligent data upload interface and a locality-aware hierarch...
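A minimal sketch of the SideIO routing idea under assumed names and paths: output that will be analyzed is directed to the data-intensive file system, where computation can be co-located with data, while write-once checkpoint data goes to the dedicated parallel file system for raw write throughput. The routing function and mount points are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative SideIO-style output routing (names and paths are assumptions).
from enum import Enum


class OutputKind(Enum):
    ANALYSIS = "analysis"       # will be read back by analytics jobs
    CHECKPOINT = "checkpoint"   # write-once data kept for fault tolerance


# Assumed mount points for the two storage back ends.
DIFS_ROOT = "hdfs://namenode:9000/analysis"   # data-intensive FS: co-locates compute with data
PFS_ROOT = "/mnt/lustre/checkpoints"          # parallel FS: maximizes write throughput


def route_output(kind: OutputKind, relative_path: str) -> str:
    """Pick the I/O path for a simulation's output based on how it will be used."""
    root = DIFS_ROOT if kind is OutputKind.ANALYSIS else PFS_ROOT
    return f"{root}/{relative_path}"


# Example: analysis snapshots go to the DIFS, checkpoints to the parallel FS.
print(route_output(OutputKind.ANALYSIS, "step_0100/density.h5"))
print(route_output(OutputKind.CHECKPOINT, "step_0100/restart.chk"))
```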
Please report errors in award information by writing to: awardsearch@nsf.gov.