
NSF Org: | CNS Division Of Computer and Network Systems |
Recipient: | University of Massachusetts Amherst |
Initial Amendment Date: | July 23, 2019 |
Latest Amendment Date: | July 23, 2019 |
Award Number: | 1908536 |
Award Instrument: | Standard Grant |
Program Manager: | Marilyn McClure, mmcclure@nsf.gov, (703)292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2019 |
End Date: | September 30, 2024 (Estimated) |
Total Intended Award Amount: | $500,000.00 |
Total Awarded Amount to Date: | $500,000.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | University of Massachusetts Amherst, 101 COMMONWEALTH AVE, AMHERST, MA, US 01003-9252, (413)545-0698 |
Sponsor Congressional District: | |
Primary Place of Performance: | Amherst, MA, US 01003-9292 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | CSR-Computer Systems Research |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The availability of large-scale data sets in many domains has driven the growth of large-scale distributed machine learning (ML) workloads on cloud platforms to derive insights from this data. To reduce the cost of executing these workloads, cloud platforms have begun to offer transient servers at a highly discounted price. Unfortunately, cloud platforms may revoke transient servers at any time, which can degrade distributed ML performance and eliminate any cost benefit. High revocation rates are especially problematic for distributed ML workloads that use synchronous processing, since revoked servers block the others from continuing past predefined synchronization barriers until a replacement server can reach the barrier. While asynchronous processing eliminates this blocking and improves performance, it does not maintain the algorithmic properties of synchronous algorithms, resulting in slower algorithmic convergence or possibly preventing convergence altogether. To maintain performance on low-cost transient servers, this project proposes re-designing traditional distributed ML algorithms to use looser forms of synchrony. Such loose synchronization bridges the gap between synchronous and asynchronous processing by maintaining the algorithmic convergence properties of synchronous processing, while enabling some asynchronous processing to avoid blocking. The project combines this loose synchronization approach with adaptive policies for selecting transient servers based on their performance, cost, and volatility to significantly reduce the cost of executing large-scale distributed ML workloads on cloud platforms.
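To make the synchronization trade-off concrete, the following is a minimal Python sketch of one common form of loose synchronization, a bounded-staleness barrier (in the spirit of the stale-synchronous-parallel model). The class, method, and parameter names are illustrative assumptions for this write-up, not the project's actual implementation.

```python
# Minimal sketch of a bounded-staleness ("loose synchronization") rule.
# staleness_bound = 0 recovers a fully synchronous barrier; a very large
# bound approaches fully asynchronous processing.

class LooseBarrier:
    def __init__(self, num_workers: int, staleness_bound: int):
        self.clocks = [0] * num_workers   # last completed iteration per worker
        self.staleness_bound = staleness_bound

    def can_proceed(self, worker_id: int) -> bool:
        # A worker may start its next iteration only if it is at most
        # `staleness_bound` iterations ahead of the slowest worker.
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness_bound

    def finish_iteration(self, worker_id: int) -> None:
        self.clocks[worker_id] += 1


barrier = LooseBarrier(num_workers=4, staleness_bound=2)
barrier.finish_iteration(0)
barrier.finish_iteration(0)
assert barrier.can_proceed(0)       # 2 ahead of the slowest: still allowed
barrier.finish_iteration(0)
assert not barrier.can_proceed(0)   # 3 ahead: blocks until others catch up
```

Under such a rule, healthy workers can make bounded progress while a revoked worker's replacement catches up, rather than stalling at every barrier, while the staleness bound preserves convergence guarantees that fully asynchronous processing forfeits.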
Distributed machine learning (ML) workloads that derive insights from large-scale data sets have become the foundation for numerous advances across multiple industry sectors. This project has the potential to accelerate these advances by significantly decreasing the cost and improving the efficiency of executing distributed ML workloads on cloud platforms using transient servers. To benefit the broader community, the project will publicly release its software artifacts as open source. The project will incorporate topics on transient servers and distributed ML into graduate and undergraduate courses on distributed and operating systems. The project will also involve undergraduates in research through related summer research experience projects and undergraduate theses.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Machine learning (ML) has emerged as an important workload for cloud platforms and data centers given its wide variety of compelling use cases, such as weather prediction, intelligent chatbots, self-driving cars, and coding and debugging, among many others. Training ML models is expensive, as it often requires processing large-scale datasets across many distributed servers for long periods. Thus, training large-scale ML models is often done on cloud platforms, which have access to the necessary resources. To reduce cost, cloud platforms offer transient servers, which they may revoke at any time, but which are 70-90% cheaper than traditional on-demand servers. Transience may also arise in other contexts, such as in edge devices with unreliable energy sources or network connectivity. Unfortunately, leveraging low-cost transient servers complicates distributed ML training, which often uses a bulk-synchronous processing model in which revoked servers block the others from continuing past pre-defined synchronization barriers until a newly provisioned replacement server reaches the barrier. While asynchronous processing can eliminate this blocking and improve system performance, it does not maintain the algorithmic properties of synchronous algorithms. To address this problem, the project developed models, algorithms, and frameworks for both understanding and optimizing distributed ML training on transient servers. Specifically, the project developed 1) models that quantify the impact of transient servers on distributed ML in various contexts as a function of revocation rate, degree of parallelism, synchronization method, and straggler-mitigation technique; 2) frameworks that use different synchronization models and fault-tolerance techniques, including asynchronous and flexible synchronous models, to mitigate the performance impact of transient server revocations; and 3) implementations of a number of common ML algorithms using these frameworks to demonstrate their performance improvement under transient server revocations. These techniques increased the understanding of how to optimize distributed ML when running on cheap transient servers in the cloud with frequent revocations, which can significantly reduce the cost of training ML models.
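As a purely illustrative aid, and an assumption of this summary rather than the project's published models, the short Python sketch below shows why revocation rate and degree of parallelism interact under bulk-synchronous training: as the number of transient workers grows, the probability that at least one is revoked in any interval (stalling every worker at the next barrier) grows quickly. All function names, prices, and rates here are hypothetical.

```python
# Back-of-the-envelope cost model for bulk-synchronous training on transient
# servers: n workers, each independently revoked with probability p per hour;
# any revocation stalls ALL workers for `recovery_hours` while a replacement
# is provisioned and reaches the barrier.

def expected_cost_per_useful_hour(n, p_revoke, price_per_hour, recovery_hours):
    # Probability that at least one of n workers is revoked in a given hour.
    p_any = 1.0 - (1.0 - p_revoke) ** n
    # Expected stall time incurred per hour of useful progress.
    overhead = p_any * recovery_hours
    return n * price_per_hour * (1.0 + overhead)

# Example: 16 transient workers at a 70% discount vs. 16 on-demand workers.
transient = expected_cost_per_useful_hour(16, p_revoke=0.05,
                                          price_per_hour=0.30,
                                          recovery_hours=0.25)
on_demand = 16 * 1.00
print(f"transient: ${transient:.2f}/useful-hour vs on-demand: ${on_demand:.2f}")
```

Even this toy model captures the qualitative effect the project's frameworks target: with strict synchronous barriers, revocation overhead compounds with scale, which is why looser synchronization and adaptive server selection can preserve the cost advantage of transient servers.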
Broader impacts of the project included advancing the state of the art in training ML models at low cost, which has become increasingly important given the accelerating demand for ML since the project's inception. The project's results were disseminated to the community through a large number of presentations and publications on its models, algorithms, and frameworks in multiple contexts. The project also supported the incorporation of topics on transient servers, distributed ML, and synchronization methods into undergraduate and graduate courses, as well as annual lectures on cloud computing for multiple high school summer programs for students interested in computer science and engineering. The project partially supported seven research assistants. One doctoral student supported through this award received the department's Outstanding Dissertation Award and joined industry as a software engineer after graduation.
Last Modified: 01/27/2025
Modified by: David Irwin