Award Abstract # 1320226
CSR: Small: Reliability as a Service (RaaS) in Cloud Computing

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: GEORGE WASHINGTON UNIVERSITY (THE)
Initial Amendment Date: August 26, 2013
Latest Amendment Date: August 26, 2013
Award Number: 1320226
Award Instrument: Standard Grant
Program Manager: Marilyn McClure
mmcclure@nsf.gov
 (703)292-5197
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2013
End Date: August 31, 2017 (Estimated)
Total Intended Award Amount: $407,968.00
Total Awarded Amount to Date: $407,968.00
Funds Obligated to Date: FY 2013 = $407,968.00
History of Investigator:
  • Tian Lan (Principal Investigator)
    tlan@gwu.edu
  • Suresh Subramaniam (Co-Principal Investigator)
  • H. Howie Huang (Co-Principal Investigator)
Recipient Sponsored Research Office: George Washington University
1918 F ST NW
WASHINGTON
DC  US  20052-0042
(202)994-0728
Sponsor Congressional District: 00
Primary Place of Performance: George Washington University
DC  US  20052-0058
Primary Place of Performance Congressional District: 00
Unique Entity Identifier (UEI): ECR5E2LU5BL6
Parent UEI:
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 735400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Despite a projected shift to cloud computing, concerns over cloud reliability remain paramount in both the private and government sectors and call for innovative solutions to meet the growing challenge of disparate reliability requirements. While existing techniques allow cloud providers to offer a fixed level of reliability to all customers, that level may be either inadequate or too expensive for a given customer's requirements. This project aims to develop a novel framework for providing reliability as an elastic, transparent service that can be customized and accessed by all cloud computing customers.

The goals of this project are: (1) holistic integration of two reliability approaches (viz., checkpointing and replication) with utility optimization and their adaptation to a distributed cloud environment with heterogeneous user demands, and (2) the development of pricing schemes for cloud providers to put their “resource white spaces” to profitable use. Together, these two research directions enable the realization of Reliability as a Service (RaaS). With the introduction of pay-per-use reliability services, cloud customers could choose the reliability components they require on a feature-by-feature basis. Achieving a desired reliability level could be a single check box away. For cloud service providers, RaaS presents an additional source of revenue and adds value to their services.

By constructing realistic models and developing algorithms for resource allocation, optimization, and pricing, the proposed research is expected to advance the state of the art of cloud computing. The project also includes an implementation and experimental component that will yield valuable knowledge on best practices and the main obstacles to transitioning the results into the commercial world. This project will also carry out a number of educational activities involving K-12, undergraduate, and graduate students, and make strong outreach efforts for recruiting and mentoring under-represented students.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Maotong Xu, Sultan Alamro, Tian Lan, and Suresh Subramaniam "LASER: A Deep Learning Approach for Speculative Execution and Replication of Deadline-Critical Jobs in Cloud" IEEE ICCCN , 2017
Maotong Xu, Sultan Alamro, Tian Lan, and Suresh Subramaniam "Optimizing Speculative Execution of Deadline-Sensitive Jobs in Cloud" ACM Sigmetrics (Poster) , 2017
Shijing Li, Tian Lan, Moo-Ryong Ra, and Rajesh Panta "Background Traffic Optimization for Meeting Deadlines in Data Center Storage" CISS 2016 , 2016
Shijing Li, Tian Lan, Moo-Ryong Ra, and Rajesh Panta "S3: Joint Scheduling and Source Selection for Background Traffic in Erasure-Coded Storage" ICDCS 2017 , 2017
Sultan Alamro, Maotong Xu, Tian Lan, and Suresh Subramaniam "CRED: Cloud Right-sizing to Meet Execution Deadlines and Data Locality" IEEE CLOUD 2016 , 2016
Vaneet Aggarwal, Jingxian Fan, and Tian Lan "Taming Tail Latency for Erasure-coded, Distributed Storage Systems" INFOCOM 2017 , 2017
Yu Xiang, Juzi Zhao, Tian Lan, Howie Huang, and Suresh Subramaniam "Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing" IEEE Transactions on Parallel and Distributed Systems , v.28 , 2017
Yu Xiang, Tian Lan, Vaneet Aggarwal, and Yih-Farn Robin Chen "Optimizing Differentiated Latency in Multi-Tenant, Erasure-Coded Storage" IEEE Transactions on Network and Service Management , v.14 , 2017 , p.204
Zhe Huang, Bharath Balasubramanian, Michael Wang, Tian Lan, Mung Chiang, and Danny H.K. Tsang "RUSH: A RobUst ScHeduler to Manage Uncertain Completion-Times in Shared Clouds" ICDCS 2016 , 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project created a new framework to enable Reliability as a Service (RaaS) in cloud computing. It harnessed checkpointing and replication techniques with utility optimization and dynamic cloud resource management to provide reliability as an elastic service, where flexible service-level agreements (SLAs) are negotiated through a joint assessment of users' reliability demands and the total cloud resources available in a data center. A holistic RaaS framework has been developed that jointly optimizes reliability and cost/pricing over a number of entangled “control knobs”: reliability, checkpointing schedule, data replication factor, bandwidth allocation, dynamic scheduling of tasks/requests, storage/execution cost, latency, and data locality. The framework provides an additional source of revenue to cloud providers by exploiting under-utilized resources and offering RaaS to cloud customers. In solving these problems, the project developed novel models for service reliability, speculative execution, replication/erasure coding, and storage service latency, as well as new distributed algorithms for the proposed RaaS optimization.

The project also investigated the practical and systems aspects of RaaS and utility-based optimization. In particular, the proposed RaaS framework and optimization algorithms have been prototyped and integrated with several popular cloud and distributed computing systems, such as Amazon EC2, MapReduce, Tahoe, Ceph, and Cassandra. This work resulted in a number of resource managers and task schedulers that jointly optimize reliability and performance metrics. Our evaluation using real-world workloads validates significant reliability improvements on these systems and demonstrates the ability to provide elastic reliability that fits each application’s requirements.

The results of the project were published at peer-reviewed conferences; the resulting tools' source code and the software and hardware designs have been made openly available online. By jointly optimizing reliability, performance, and cost objectives, the resulting technologies will not only lead to new cloud infrastructure and management algorithms, but also promote awareness of reliability and new practices such as usage-based RaaS through pricing and new business models. As cloud and distributed computing has become an important way of delivering network-based services and of giving people, especially those from underserved communities and developing regions, access to information technology, this project will have a broader impact on global society and the economy. Notably, technologies resulting from this project apply not only to mobile devices but also to edge computing and mobile networks. Inspired by this research, new teaching lab facilities and interdisciplinary curriculum modules for teaching both theory and systems have been developed.

 


Last Modified: 10/30/2017
Modified by: Tian Lan

Please report errors in award information by writing to: awardsearch@nsf.gov.
