NSF Award Search: Award # 1563095

Award Abstract # 1563095

NeTS: CSR: Medium: Collaborative Research: Enabling Flexible and High Performance Big Data Analytics Over Geo-Distributed Clouds

NSF Org:	CNS Division Of Computer and Network Systems
Recipient:	REGENTS OF THE UNIVERSITY OF MICHIGAN
Initial Amendment Date:	May 17, 2016
Latest Amendment Date:	April 25, 2019
Award Number:	1563095
Award Instrument:	Continuing Grant
Program Manager:	Marilyn McClure mmcclure@nsf.gov (703)292-5197 CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	June 1, 2016
End Date:	May 31, 2020 (Estimated)
Total Intended Award Amount:	$400,000.00
Total Awarded Amount to Date:	$400,000.00
Funds Obligated to Date:	FY 2016 = $214,067.00 FY 2018 = $82,191.00 FY 2019 = $103,742.00
History of Investigator:	Mosharaf Chowdhury (Principal Investigator) mosharaf@umich.edu
Recipient Sponsored Research Office:	Regents of the University of Michigan - Ann Arbor 1109 GEDDES AVE STE 3300 ANN ARBOR MI US 48109-1015 (734)763-6438
Sponsor Congressional District:	06
Primary Place of Performance:	University of Michigan Ann Arbor 2260 Hayward Ann Arbor MI US 48109-2121
Primary Place of Performance Congressional District:	06
Unique Entity Identifier (UEI):	GNJ7BBP73WE9
Parent UEI:
NSF Program(s):	CSR-Computer Systems Research
Primary Program Source:	01001617DB NSF RESEARCH & RELATED ACTIVIT 01001819DB NSF RESEARCH & RELATED ACTIVIT 01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7924
Program Element Code(s):	735400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Large organizations and small enterprises alike leverage datacenters across the globe to offer Internet services to their users. These sites routinely gather data pertaining to end user activities to provide better services, and they collect server monitoring logs and performance counters to ensure uninterrupted service. Although fast, efficient, and cost-effective analyses of these large datasets can significantly improve users' quality of experience and enable novel applications, the wide area network (WAN) that connects the datacenters poses a considerable challenge: because WAN bandwidth is limited and expensive, and WAN latency is high and variable, both the performance and timeliness of analytics are affected by the WAN.

This project aims to build a new WAN-aware big data stack customized for flexible geo-distributed data analytics. The project will not impose any constraints on the set of queries that can be issued, and it will support a variety of performance objectives including obtaining timely responses, minimizing batch completion times, or using minimal bandwidth. To account for unpredictable and fine-timescale changes to WAN conditions and to enable coordination among the actions taken by different layers of the analytics stack, this project will enable holistic, cross-layer visibility and optimizations. It will incorporate awareness of the geo-distributed setting in the stack's upper layers (e.g., query optimization) and of application-level objectives in the lower layers (e.g., networking). This will result in a radical re-factoring of the API and interfaces between query optimization, query execution, resource negotiation, wide-area storage, and network routing/scheduling.

Software artifacts from this project will be incorporated into existing open source big data stacks, making the research outcomes broadly available for public reuse. The experimental harnesses will be made available to ensure repeatability and to foster follow up research. The research outcomes will guide industry evolution as the industry slowly shifts from single-datacenter to geo-distributed settings. The project has a substantial educational component involving the introduction of new courses on big data systems at both graduate and undergraduate levels that will involve hands-on exercises with state-of-the-art big data software, and it will reach out to high-school students, women, and underrepresented minorities through big data boot camps.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

A. P. Iyer, L. E. Li, M. Chowdhury, I. Stoica "Towards Mitigating the Latency-Accuracy Tradeoff in Mobile Data Analytics Systems" MobiCom , 2018

F. Lai, J. You, X. Zhu, H. V. Madhyastha, M. Chowdhury "Sol: Fast Distributed Computation Over Slow Networks" USENIX NSDI , 2020

Kshiteej Mahajan, Mosharaf Chowdhury, Aditya Akella, Shuchi Chawla "Dynamic Query Re-Planning Using QOOP" OSDI , 2018

K. V. Rashmi, M. Chowdhury, J. Kosaian, I. Stoica, K. Ramchandran "EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding" OSDI , 2016

M. Chowdhury, S. Khuller, M. Purohit, S. Yang, J. You "Near Optimal Coflow Scheduling in Networks" SPAA , 2019

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

To cope with the increasing number of Internet users as well as Internet-of-Things (IoT) and edge devices, large organizations leverage tens to hundreds of datacenters and edge sites. Collecting data related to end-user sessions, monitoring logs, and performance counters, and then analyzing and personalizing this data can significantly improve overall user experience. The key challenge, however, is posed by the wide-area network (WAN) that connects the distributed sites. Unlike datacenter networks, WAN bandwidth is significantly lower and latency is higher; both are highly variable too. The overarching goal of the project was to redesign big data analytics stacks to support timely and efficient analysis, with a specific focus on analysis over geo-distributed sites. As part of this project, we built a suite of solutions to rethink storage, query planning, execution engine, and networking for geo-distributed analytics (GDA).

To enable general-purpose, interactive GDA across distributed sites, we designed a federated execution engine, Sol, that is API-compatible with Apache Spark. Meaning, any existing jobs can now run across multiple datacenters in an interactive fashion. Sol relies on two key changes from traditional datacenter-based execution engines. First, it pushes task instead of waiting for workers to be ready, which saves latency. Second, it decouples the provisioning of resources for communication and computation to achieve high resource utilization. As a result, it speeds up SQL and machine learning jobs by 4.9X and 16.4X compared to Apache Spark in resource-constrained networks. Moreover, even in datacenter environments, Sol outperforms Spark by 1.3X to 3.9X.

To manage the large data transfers between multiple datacenters for large, non-interactive GDA jobs, we designed the best-to-date communication scheduling-routing solution (NOCS). Specifically, we provided a tight 2-approximation algorithm when all release times and demands are polynomially sized, and a (2+ϵ)-approximation when the release times and demands can be super-polynomial. We have also developed variations of solutions that are computationally faster but may not always have well-analyzed approximation ratios. Evaluations of these algorithms show that the performance improvement in practice is even better than that predicted in theory.

We also made important progress in memory and storage management (EC-Cache) as well as dynamic query planning (QOOP) as part of this project. Specifically, EC-Cache is a low-latency erasure-coded (in-memory) storage that can improve efficiency and load balancing without sacrificing fault tolerance and availability for both datacenter and GDA jobs. We show that a small bandwidth overhead of at most 10% can provide more than 50% reduction in the median and tail latencies. EC-Cache's latency reductions increase as objects grow larger: for example, 1.33X for 1 MB objects and 5.5X for 100 MB objects. QOOP focuses on dynamic query (re)planning in changing conditions, such as WAN bandwidth variability. When resource availability changes, QOOP evaluates the progress of a query and picks a better query plan to switch to (if one exists). Dynamically reacting to such changing conditions can offer a median performance improvement of 1.47X compared to state-of-the-art alternatives.

Finally, we have explored two specific use case of GDA applications in CellScope and System H. CellScope enables geo-distributed machine learning for radio access network (RAN) performance diagnostics by trading of timeliness and accuracy. Our implementation on Apache Spark and evaluation shows that CellScope is able to achieve accuracy improvements up to 4.4X without incurring the latency overhead associated with the existing approaches. System H focuses on hybrid multi-cloud analytics (HMCA), where an enterprise wants to run analytics across public and private clouds owned by multiple entities. By taking WAN topology into account, we show that it can outperform the state-of- the-art by up to 3.43X for the smallest and up to 32.04X for the largest deployment on industry-standard TPC benchmarks.

All software developed as part of this project are based on established open-source systems such as Apache Spark and Apache YARN, and we have and continue to open-source our works at https://github.com/symbioticlab. Research papers summarizing our works have been published or are under submission in top venues in networking and systems including OSDI, NSDI, MobiCom, and SPAA. Some of the works have been incorporated into course contents in graduate- and undergraduate-level systems and networking courses at the University of Michigan. Last but not the least, two PhD students at the University of Michigan have worked on different pieces of our contributions, and this grant has helped in partly supporting their education and training.

Last Modified: 09/07/2020
Modified by: Mosharaf Chowdhury

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error