
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: University of Michigan
Initial Amendment Date: September 2, 2016
Latest Amendment Date: September 2, 2016
Award Number: 1629397
Award Instrument: Standard Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2016
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $825,000.00
Total Awarded Amount to Date: $825,000.00
Recipient Sponsored Research Office: 1109 Geddes Ave Ste 3300, Ann Arbor, MI 48109-1015, US, (734) 763-6438
Primary Place of Performance: 2260 Hayward, Ann Arbor, MI 48109-2121, US
NSF Program(s): Exploiting Parallelism and Scalability (XPS)
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Although many modern applications, e.g., exploratory analytics and scientific visualization, come with stringent latency requirements, today's in-memory and scale-out solutions often provide only best-effort services. A root cause of this unpredictability lies in the traditional design principle of minimizing I/O operations. With the advent of faster storage and networks in rack-scale computing, however, I/O may no longer be scarce. This project revisits the tradeoffs and design principles of scale-out, low-latency applications in this emerging context. Bounded response times will reduce over-provisioning and foster new applications (e.g., business intelligence, robotics, and intensive care units) that require consistent performance. Project findings will be integrated into undergraduate and graduate curricula, and software artifacts will be open-sourced for the wider community across academia and industry.
This project aims to leverage the influx of new hardware capabilities to enable applications that treat bounded response times as their primary design criterion. Specifically, the project leverages approximation, speculation, and scheduling to mask latency variability in latency-sensitive applications. The key technical challenge in realizing this vision lies in making a set of tradeoffs different from the norm: (i) rather than striving for less I/O, this project trades I/O for better memory locality and speculates aggressively to reduce response times; (ii) when needed, it resorts to approximation techniques to guarantee bounded response times; and finally, (iii) it develops new approximation- and speculation-aware schedulers to increase resource efficiency. The project also investigates theoretical and empirical boundaries of approximate and speculative processing as well as new spatiotemporal scheduling techniques in rack-scale computing.
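To make the speculation tradeoff concrete, the sketch below hedges a slow read by duplicating it to a second replica after a short wait and returning whichever copy answers first, spending extra I/O to cut tail latency. This is a minimal illustration of the general technique only, not code from this project; the replica names, fetch function, and hedge delay are all hypothetical.

```python
import concurrent.futures
import random
import time

REPLICAS = ["replica-a", "replica-b"]  # hypothetical replica endpoints

def fetch(replica, key):
    """Simulate a replica read whose latency occasionally spikes."""
    delay = 0.010 if random.random() < 0.9 else 0.200  # rare straggler
    time.sleep(delay)
    return f"{key}@{replica}"

def hedged_read(key, hedge_after=0.020):
    """Read from one replica; if no answer within hedge_after seconds,
    speculatively duplicate the read to a second replica and return
    whichever copy finishes first (extra I/O buys lower tail latency)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fetch, REPLICAS[0], key)]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
    if not done:  # primary is straggling: fire the speculative copy
        futures.append(pool.submit(fetch, REPLICAS[1], key))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = done.pop().result()
    pool.shutdown(wait=False)  # don't block on the losing copy
    return result

print(hedged_read("row:42"))
```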
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Although modern applications come with stringent performance requirements, existing solutions often provide only best-effort services. A root cause of this unpredictability lies in the traditional design principle of minimizing I/O operations. With the advent of faster storage and networking hardware, however, I/O capacity is no longer as scarce. The overarching goal of this project was to rethink the tradeoffs and design principles of modern applications in this emerging context. To this end, we built a set of solutions that married advances in hardware capabilities with battle-tested software optimization techniques to enable resource disaggregation for big data and AI/ML workloads.
To enable efficient and resilient memory disaggregation over fast networks, we created the first practical memory disaggregation solution (Infiniswap) as part of this project. We made it resilient without incurring large memory overhead by designing an erasure-coded remote-memory solution (Hydra), and we built decentralized locking on RDMA primitives (DSLR) to enable concurrent access to remote memory objects. Overall, our solutions took the first steps toward practical memory disaggregation, to the point that memory-intensive applications can run without any performance loss even when 50% of their memory resides in remote machines.
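The sketch below illustrates why erasure coding achieves resilience at a fraction of replication's memory overhead: a page is split into K data chunks plus one XOR parity and spread across K+1 remote machines, so losing any one machine costs only 1/K extra memory instead of a full copy. This single-parity toy is only an illustration of the idea; Hydra's actual coding scheme and data path are more sophisticated, and the chunk count K here is an arbitrary choice.

```python
from functools import reduce

K = 4  # data chunks per page (arbitrary choice for this sketch)

def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(page):
    """Split a page into K equal data chunks plus one XOR parity chunk;
    each of the K+1 chunks would live on a different remote machine."""
    size = len(page) // K
    chunks = [page[i * size:(i + 1) * size] for i in range(K)]
    return chunks + [reduce(xor, chunks)]

def decode(chunks):
    """Rebuild the page even if any single chunk (data or parity) is
    lost, because the XOR of all K+1 chunks is zero."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) <= 1, "single parity tolerates one loss"
    if missing and missing[0] < K:  # recompute the lost data chunk
        chunks[missing[0]] = reduce(xor, [c for c in chunks if c is not None])
    return b"".join(chunks[:K])

page = bytes(range(256)) * 16  # a 4 KB "page"
remote = encode(page)          # 1/K extra memory vs. 1x for replication
remote[2] = None               # one remote machine fails
assert decode(remote) == page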
We also focused on high-performance big data analytics by enabling so-called infinite-scale analytics (VerdictDB), whereby any existing analytics engine can leverage approximate query processing to speed up query performance by 57X on average (and up to 841X). We also designed a new cluster scheduler (Carbyne) that takes the DAG of a job and altruistically exchanges resources with other jobs to improve average job completion times. In deployments, Carbyne provides 1.26X better efficiency and 1.59X lower average completion time than the state of the art, while ensuring fair resource sharing.
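The core tradeoff behind such approximate-query-processing speedups can be shown in a few lines: answer an aggregate on a small uniform sample and attach a confidence interval, instead of scanning every row. The synthetic column and sampling rate below are made up for the example; VerdictDB itself rewrites SQL against pre-built samples inside existing engines rather than sampling at query time.

```python
import math
import random

random.seed(1)
table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]  # full column

rate = 0.01                                    # scan only ~1% of the rows
sample = [v for v in table if random.random() < rate]

n = len(sample)
mean = sum(sample) / n                         # approximate AVG
var = sum((v - mean) ** 2 for v in sample) / (n - 1)
ci = 1.96 * math.sqrt(var / n)                 # ~95% confidence interval

exact = sum(table) / len(table)
print(f"approx AVG = {mean:.2f} +/- {ci:.2f} (exact {exact:.2f})")
```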
Another key direction we explored is resource management in AI/ML clusters. To this end, we worked on GPU cluster management (Tiresias) and GPU resource management (Salus) for training, as well as on hyperparameter tuning (FluidExec). In addition, we looked beyond GPUs to optimize CPU resource management in distributed AI training, especially in the parameter-server setting. Overall, our solutions yielded up to 5.5X improvement in cluster-level resource efficiency and up to 7X at the level of individual GPUs, reducing the cost of AI for the masses.
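As one illustration of scheduling GPU jobs without knowing their durations, the toy scheduler below follows a least-attained-service policy in the spirit of Tiresias: each time quantum goes to the job that has consumed the least GPU-time (GPUs × time) so far, so short jobs finish early even though no durations are known in advance. Tiresias itself uses discretized priority queues and additional machinery; the job mix and quantum here are hypothetical.

```python
import heapq

def las_schedule(jobs, quantum=1.0):
    """jobs: {name: (num_gpus, remaining_seconds)}; returns finish order.
    Attained service = num_gpus * executed_time (a 2D GPU-time metric)."""
    heap = [(0.0, name) for name in jobs]  # (attained_service, job)
    heapq.heapify(heap)
    remaining = {name: rem for name, (_, rem) in jobs.items()}
    order = []
    while heap:
        attained, name = heapq.heappop(heap)  # least-served job runs next
        gpus, _ = jobs[name]
        run = min(quantum, remaining[name])
        remaining[name] -= run
        if remaining[name] <= 0:
            order.append(name)
        else:
            heapq.heappush(heap, (attained + gpus * run, name))
    return order

demo = {"short": (4, 2.0), "long": (2, 50.0), "medium": (8, 5.0)}
print(las_schedule(demo))  # short jobs complete ahead of the long one
```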
Finally, on the theoretical front, we explored several techniques to improve approximate query processing in the context of maximum inner-product search (BOUNDEDME) and joins on sampled data (SUBS), improving on the state of the art by an order of magnitude. At the same time, we made progress on the learning-theory side by enabling projection-free optimization and selectivity learning with mixture models (QuickSel). QuickSel is 34.0X–179.4X faster than state-of-the-art query-driven techniques for selectivity learning.
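A toy version of query-driven selectivity learning conveys the idea: model a column's distribution as a mixture of uniform components and fit the mixture weights so that predicted selectivities match the selectivities observed from past range queries. QuickSel solves a constrained quadratic program over data-driven components; this sketch substitutes plain least squares over a fixed grid, with a clipped-and-normalized stand-in for the constraints, and every name and parameter below is made up for illustration.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 17)  # 16 uniform mixture components on [0, 1]

def overlap(lo, hi):
    """Fraction of each component's interval covered by predicate
    [lo, hi); predicted selectivity is overlap(lo, hi) @ weights."""
    left, right = GRID[:-1], GRID[1:]
    inter = np.clip(np.minimum(hi, right) - np.maximum(lo, left), 0.0, None)
    return inter / (right - left)

# Training signal: selectivities observed from a past query workload,
# generated here against a synthetic skewed column.
rng = np.random.default_rng(0)
column = rng.beta(2, 5, size=100_000)
queries = list(zip(rng.random(50) * 0.8, rng.random(50) * 0.2))
A = np.array([overlap(lo, lo + w) for lo, w in queries])
s = np.array([np.mean((column >= lo) & (column < lo + w))
              for lo, w in queries])

w, *_ = np.linalg.lstsq(A, s, rcond=None)  # fit weights to query feedback
w = np.clip(w, 0.0, None)
w /= w.sum()                               # crude stand-in for QP constraints

lo, hi = 0.1, 0.3                          # estimate an unseen predicate
print("predicted:", float(overlap(lo, hi) @ w))
print("actual:   ", float(np.mean((column >= lo) & (column < hi))))
```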
All software developed as part of this project is based on established open-source systems such as Apache Spark, Apache YARN, TensorFlow, and MySQL, and we have open-sourced, and continue to open-source, our work at https://github.com/symbioticlab. Research papers summarizing our work have been published or are under submission at top venues in networking, systems, databases, and AI, including OSDI, NSDI, SIGMOD, VLDB, and AAAI. Some of this work has been incorporated into graduate- and undergraduate-level networking and databases courses at the University of Michigan. Last but not least, several PhD students at the University of Michigan worked on different pieces of these contributions, and this grant partly supported their education and training.
Last Modified: 12/02/2020
Modified by: Mosharaf Chowdhury