
NSF Org: | CNS Division Of Computer and Network Systems |
Recipient: | University of Massachusetts Amherst |
Initial Amendment Date: | July 23, 2019 |
Latest Amendment Date: | July 23, 2019 |
Award Number: | 1908536 |
Award Instrument: | Standard Grant |
Program Manager: | Marilyn McClure, mmcclure@nsf.gov, (703)292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2019 |
End Date: | September 30, 2024 (Estimated) |
Total Intended Award Amount: | $500,000.00 |
Total Awarded Amount to Date: | $500,000.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | University of Massachusetts Amherst, 101 COMMONWEALTH AVE, AMHERST, MA, US 01003-9252, (413)545-0698 |
Sponsor Congressional District: | |
Primary Place of Performance: | Amherst, MA, US 01003-9292 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | CSR-Computer Systems Research |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The availability of large-scale data sets in many domains has driven the growth of large-scale distributed machine learning (ML) workloads on cloud platforms to derive insights from this data. To reduce the cost of executing these workloads, cloud platforms have begun to offer transient servers at a highly discounted price. Unfortunately, cloud platforms may revoke transient servers at any time, which can degrade distributed ML performance and eliminate any cost benefit. High revocation rates are especially problematic for distributed ML workloads that use synchronous processing, since revoked servers block the others from continuing past predefined synchronization barriers until a replacement server can reach the barrier. While asynchronous processing eliminates this blocking and improves performance, it does not maintain the algorithmic properties of synchronous algorithms, resulting in slower algorithmic convergence or possibly preventing convergence altogether. To maintain performance on low-cost transient servers, this project proposes re-designing traditional distributed ML algorithms to use looser forms of synchrony. Such loose synchronization bridges the gap between synchronous and asynchronous processing by maintaining the algorithmic convergence properties of synchronous processing, while enabling some asynchronous processing to avoid blocking. The project combines this loose synchronization approach with adaptive policies for selecting transient servers based on their performance, cost, and volatility to significantly reduce the cost of executing large-scale distributed ML workloads on cloud platforms.
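To make the synchronization trade-off concrete, the following is a minimal Python sketch of one common form of loose synchronization, a bounded-staleness barrier (in the spirit of the stale-synchronous-parallel model). The class, method, and parameter names are illustrative assumptions for this write-up, not the project's actual implementation.

```python
# Minimal sketch of a bounded-staleness ("loose synchronization") rule.
# staleness_bound = 0 recovers a fully synchronous barrier; a very large
# bound approaches fully asynchronous processing.

class LooseBarrier:
    def __init__(self, num_workers: int, staleness_bound: int):
        self.clocks = [0] * num_workers   # last completed iteration per worker
        self.staleness_bound = staleness_bound

    def can_proceed(self, worker_id: int) -> bool:
        # A worker may start its next iteration only if it is at most
        # `staleness_bound` iterations ahead of the slowest worker.
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness_bound

    def finish_iteration(self, worker_id: int) -> None:
        self.clocks[worker_id] += 1


barrier = LooseBarrier(num_workers=4, staleness_bound=2)
barrier.finish_iteration(0)
barrier.finish_iteration(0)
assert barrier.can_proceed(0)       # 2 ahead of the slowest: still allowed
barrier.finish_iteration(0)
assert not barrier.can_proceed(0)   # 3 ahead: blocks until others catch up
```

Under such a rule, healthy workers can make bounded progress while a revoked worker's replacement catches up, rather than stalling at every barrier, while the staleness bound preserves convergence guarantees that fully asynchronous processing forfeits.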
Distributed machine learning (ML) workloads that derive insights from large-scale data sets have become the foundation for numerous advances across multiple industry sectors. This project has the potential to accelerate these advances by significantly decreasing the cost and improving the efficiency of executing distributed ML workloads on cloud platforms using transient servers. To benefit the broader community, the project will publicly release its software artifacts as open source. The project will incorporate topics on transient servers and distributed ML into graduate and undergraduate courses on distributed and operating systems. The project will also involve undergraduates in research through related summer research experience projects and undergraduate theses.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Machine learning (ML) has emerged as an important workload for cloud platforms and data centers given its wide variety of compelling use cases, such as weather prediction, intelligent chatbots, self-driving cars, and coding and debugging, among many others. Training ML models is expensive, as it often requires processing large-scale datasets across many distributed servers for long periods. Thus, training large-scale ML models is often done on cloud platforms, which have access to the necessary resources. To reduce cost, cloud platforms offer transient servers, which they may revoke at any time, but which are 70-90% cheaper than traditional on-demand servers. Transience may also arise in other contexts, such as in edge devices with unreliable energy sources or network connectivity. Unfortunately, leveraging low-cost transient servers complicates distributed ML training, which often uses a bulk-synchronous processing model in which revoked servers block the others from continuing past pre-defined synchronization barriers until a newly provisioned replacement server reaches the barrier. While asynchronous processing can eliminate this blocking and improve system performance, it does not maintain the algorithmic properties of synchronous algorithms. To address this problem, the project developed models, algorithms, and frameworks for both understanding and optimizing distributed ML training on transient servers. Specifically, the project developed 1) models that quantify the impact of transient servers on distributed ML in various contexts as a function of revocation rate, degree of parallelism, synchronization method, and straggler-mitigation technique; 2) frameworks that use different synchronization models and fault-tolerance techniques, including asynchronous and flexible synchronous models, to mitigate the performance impact of transient server revocations; and 3) implementations of a number of common ML algorithms using these frameworks to demonstrate their performance improvement under transient server revocations. These techniques increased the understanding of how to optimize distributed ML when running on cheap transient servers in the cloud with frequent revocations, which can significantly reduce the cost of training ML models.
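As a purely illustrative aid, and an assumption of this summary rather than the project's published models, the short Python sketch below shows why revocation rate and degree of parallelism interact under bulk-synchronous training: as the number of transient workers grows, the probability that at least one is revoked in any interval (stalling every worker at the next barrier) grows quickly. All function names, prices, and rates here are hypothetical.

```python
# Back-of-the-envelope cost model for bulk-synchronous training on transient
# servers: n workers, each independently revoked with probability p per hour;
# any revocation stalls ALL workers for `recovery_hours` while a replacement
# is provisioned and reaches the barrier.

def expected_cost_per_useful_hour(n, p_revoke, price_per_hour, recovery_hours):
    # Probability that at least one of n workers is revoked in a given hour.
    p_any = 1.0 - (1.0 - p_revoke) ** n
    # Expected stall time incurred per hour of useful progress.
    overhead = p_any * recovery_hours
    return n * price_per_hour * (1.0 + overhead)

# Example: 16 transient workers at a 70% discount vs. 16 on-demand workers.
transient = expected_cost_per_useful_hour(16, p_revoke=0.05,
                                          price_per_hour=0.30,
                                          recovery_hours=0.25)
on_demand = 16 * 1.00
print(f"transient: ${transient:.2f}/useful-hour vs on-demand: ${on_demand:.2f}")
```

Even this toy model captures the qualitative effect the project's frameworks target: with strict synchronous barriers, revocation overhead compounds with scale, which is why looser synchronization and adaptive server selection can preserve the cost advantage of transient servers.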
Broader impacts of the project included advancing the state of the art in training ML models at low cost, which has become increasingly important given the accelerating demand for ML since the project's inception. The project's results were disseminated to the community through a large number of presentations and publications on its models, algorithms, and frameworks in multiple contexts. The project also supported the incorporation of topics on transient servers, distributed ML, and synchronization methods into undergraduate and graduate courses, as well as annual lectures on cloud computing for multiple high school summer programs for students interested in computer science and engineering. The project partially supported seven research assistants. One doctoral student supported through this award received the department's Outstanding Dissertation Award and joined industry as a software engineer after graduation.
Last Modified: 01/27/2025
Modified by: David Irwin