Award Abstract # 1617773
NeTS: Small: Collaborative Research: Enabling Application-Level Performance Predictability in Public Clouds

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: REGENTS OF THE UNIVERSITY OF MICHIGAN
Initial Amendment Date: August 29, 2016
Latest Amendment Date: August 29, 2016
Award Number: 1617773
Award Instrument: Standard Grant
Program Manager: Darleen Fisher
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2016
End Date: September 30, 2019 (Estimated)
Total Intended Award Amount: $238,500.00
Total Awarded Amount to Date: $238,500.00
Funds Obligated to Date: FY 2016 = $238,500.00
History of Investigator:
  • Mosharaf Chowdhury (Principal Investigator)
    mosharaf@umich.edu
Recipient Sponsored Research Office: Regents of the University of Michigan - Ann Arbor
1109 GEDDES AVE STE 3300
ANN ARBOR
MI  US  48109-1015
(734)763-6438
Sponsor Congressional District: 06
Primary Place of Performance: University of Michigan Ann Arbor
2260 Hayward
Ann Arbor
MI  US  48109-2121
Primary Place of Performance Congressional District: 06
Unique Entity Identifier (UEI): GNJ7BBP73WE9
Parent UEI:
NSF Program(s): Networking Technology and Syst
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 736300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

State-of-the-art resource sharing mechanisms in today's datacenters and compute clouds are agnostic to application-level performance requirements, resulting in unpredictable performance. This is especially true for the network: unlike CPU, memory, or disk, cloud operators do not provide any guarantees for the network. Many tenants rely on over-provisioning and static allocation for performance isolation, which results in low utilization and increased cost and environmental impact. This project aims to build a set of solutions to achieve short- and long-term performance predictability with high resource utilization. The goal is to enable coexisting applications from different tenants to meet a variety of performance objectives, including obtaining timely responses and minimizing variance across successive responses, while adhering to organizational hierarchies of individual tenants. The key technical challenges in this project include developing short- and long-term resource allocation algorithms, accurate demand estimation, and fast, efficient enforcement, all of which are compounded by the multi-resource and shared nature of the network. Two key techniques guide the proposed work: (i) temporal scheduling ensures predictable performance through short- and long-term performance isolation, and (ii) spatial placement ensures higher utilization through initial placement and periodic migration of tenants' virtual machines.

Predictable, efficient data analytics will have significant socio-economic ramifications. It will also enable mission-critical applications (e.g., anomaly detection, fraud protection, autonomous vehicles, and robotics) that require a highly consistent and reliable level of performance to coexist with less sensitive ones. Algorithms and software from the project will be incorporated into existing open-source big data stacks for public reuse. By leveraging ongoing relationships with industry, artifacts from this project will be rapidly transitioned from research into practice. The project has significant educational and outreach components, which include introducing new graduate- and undergraduate-level courses based on the outcomes of this project, as well as arranging cloud computing boot camps aimed at high school students, including women and under-represented minorities.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

State-of-the-art resource sharing mechanisms in today's cloud datacenters are agnostic to application-level performance requirements, resulting in unpredictable performance. This is as true for the network as it is for CPU, memory, or disk. Many tenants rely on over-provisioning and static allocation for performance isolation, which results in low utilization and increased cost and environmental impacts.

The overarching goal of this project was to build a set of solutions to achieve short- and long-term performance predictability with high resource utilization and improved application-level performance. As part of this project, we built a suite of solutions, spanning the entire cloud stack, to enable coexisting applications from different tenants to meet a variety of performance objectives, including obtaining timely responses and minimizing variance across successive responses, while adhering to organizational hierarchies of individual tenants.

In terms of networking, we have designed and developed a datacenter load balancing solution (Hermes) as well as an end-host latency optimization mechanism (Leap). Hermes is a transport-layer-aware load balancing solution that detects path conditions via comprehensive sensing using transport-level signals such as ECN and RTT. To further improve visibility, Hermes employs active probing, guided by the power-of-two-choices technique, which can effectively increase the scope of sensing at minimal probing cost. Overall, it outperforms the state of the art by more than 30%. Leap, in contrast, focuses on improving the data path inside the Linux kernel by removing unnecessary queueing and by proposing a new data prefetching algorithm. It improves the performance of low-latency, memory-intensive applications by up to 10X.
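The power-of-two-choices idea that guides Hermes' probing can be sketched in a few lines. The following is a minimal illustration of the general technique, not the Hermes implementation; the path names and congestion values are hypothetical:

```python
import random

def pick_path(path_loads, rng=random):
    """Power-of-two choices: sample two candidate paths uniformly at
    random and send on the less loaded of the two. Sensing only two
    paths per decision keeps probing cost low, yet avoids hot spots
    far better than a single uniformly random choice."""
    a, b = rng.sample(sorted(path_loads), 2)
    return a if path_loads[a] <= path_loads[b] else b

# Hypothetical per-path congestion estimates (e.g., derived from ECN/RTT)
loads = {"path-1": 0.9, "path-2": 0.2, "path-3": 0.6, "path-4": 0.4}
chosen = pick_path(loads)
```

Because the most congested path loses every pairwise comparison it appears in, traffic naturally shifts away from it without ever probing all paths.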

In terms of resource management, we have designed two solutions: one that handles heterogeneous requests with throughput and latency constraints (BoPF) and one that handles heterogeneous compute resources such as CPUs and GPUs (AlloX). BoPF relies on the observation that throughput-sensitive tenants do not care about allocated resources in the short term, as long as their long-term average resource share remains the same. This temporal flexibility allows us to accommodate bursts from latency-sensitive tenants without hurting throughput-sensitive ones. It outperforms existing solutions by more than 5X while providing the same level of fairness. AlloX focuses on carefully picking CPU and GPU combinations for deep learning training jobs. The key idea is to leverage the interchangeability of computation resources. By judiciously combining both, AlloX can improve average training time by as much as 95% in comparison to existing solutions that ignore resource interchangeability.
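The interchangeability idea behind AlloX can be sketched as picking, per job, the resource type with the lowest estimated completion time. This is a simplified illustration with hypothetical runtime profiles and queue lengths; AlloX itself solves a more general min-cost scheduling problem across all pending jobs:

```python
def assign_job(job_runtimes, queue_lengths):
    """Pick the resource type ('cpu' or 'gpu' here) that minimizes the
    job's estimated completion time: the work already queued on that
    resource plus the job's own runtime there. A job that is faster on
    GPU may still finish sooner on CPU if the GPU queue is long."""
    def completion_time(res):
        return queue_lengths[res] + job_runtimes[res]
    return min(job_runtimes, key=completion_time)

# Hypothetical profile: the job runs 4x faster on GPU, but GPUs are busy.
runtimes = {"cpu": 40.0, "gpu": 10.0}
queues = {"cpu": 5.0, "gpu": 50.0}
assign_job(runtimes, queues)  # -> "cpu" (5 + 40 = 45 beats 50 + 10 = 60)
```

The point of the sketch is that ignoring interchangeability (always sending deep learning jobs to GPUs) leaves the faster overall option on the table whenever one resource pool is congested.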

Finally, at the application level, we explored how cloud applications are designed to interact with their data: via a key-value (KV) store or via remote procedure calls (RPCs). We observed that the root cause behind this choice, and the subsequent dichotomy between the two views, is the shifting balance between the CPU and network bottlenecks of a workload. One can improve throughput while satisfying service-level objectives (SLOs) by judiciously employing both at the same time. We have proposed a fully decentralized algorithm (Kayak) that runs only on the client side and dynamically chooses the better option based on this shifting balance. Overall, Kayak can improve throughput by 35% by combining the best of both worlds.
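The CPU-versus-network trade-off can be sketched as a bottleneck calculation: each mode's throughput is limited by its most contended resource, and the better mode flips as the balance shifts. This is an illustrative model with hypothetical per-request costs, not Kayak's actual estimator (which adapts online and mixes both modes):

```python
def choose_mode(costs, capacity):
    """Choose between shipping compute (RPC: executes at the server,
    heavier on server CPU) and shipping data (KV: fetches data to the
    client, heavier on the network). Each mode's throughput is capped
    by its bottleneck resource; pick the mode with the higher cap."""
    def throughput(mode):
        return min(capacity[r] / costs[mode][r] for r in capacity)
    return max(costs, key=throughput)

# Hypothetical per-request resource costs for each mode.
costs = {"rpc": {"cpu": 4.0, "net": 1.0},
         "kv":  {"cpu": 1.0, "net": 6.0}}
capacity = {"cpu": 100.0, "net": 100.0}
choose_mode(costs, capacity)  # -> "rpc" (25 req/s vs ~16.7 req/s)
```

If server CPU capacity drops (say, to 20.0), the RPC bottleneck falls to 5 req/s and the same calculation flips to "kv", which is exactly the shifting balance the paragraph describes.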

All software developed as part of this project is based on established open-source systems such as Kubernetes, and we have open-sourced, and continue to open-source, our work at https://github.com/symbioticlab. Research papers summarizing our work have been published or are under submission at top venues in networking and systems, including SIGCOMM and NSDI. Some of the work has been incorporated into course content in graduate- and undergraduate-level systems and networking courses at the University of Michigan. Last but not least, three PhD students at the University of Michigan have worked on different pieces of these contributions, and this grant has partly supported their education and training.


Last Modified: 02/06/2020
Modified by: Mosharaf Chowdhury

