
NSF Org: |
CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | August 11, 2016 |
Latest Amendment Date: | March 3, 2017 |
Award Number: | 1618923 |
Award Instrument: | Standard Grant |
Program Manager: |
Marilyn McClure
mmcclure@nsf.gov (703)292-5197 CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | October 1, 2016 |
End Date: | September 30, 2021 (Estimated) |
Total Intended Award Amount: | $485,504.00 |
Total Awarded Amount to Date: | $485,504.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
2550 NORTHWESTERN AVE # 1100 WEST LAFAYETTE IN US 47906-1332 (765)494-1055 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
IN US 47907-2107 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | CSR-Computer Systems Research |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The emergence of cloud computing is undoubtedly one of the major paradigm shifts of the last decade in information technology, and one with substantial economic impact. Indeed, the ability to rent computing resources on an as-needed basis (as opposed to acquiring and managing infrastructure provisioned for peak workloads that may occur only rarely) supports many businesses of different kinds and sizes. However, while cloud infrastructures allow computing resources to be allocated and released very dynamically, developing software that exploits this potential by automatically adjusting its resource usage at runtime to match its workload (e.g., the number of client connections) and performance goals is a hard task for software engineers. The goal of this project is thus to support programmers with a programming model and runtime environment for developing such elastic applications.
Devising such a generic programming model is, however, very challenging, as it must reconcile simplicity (for programmers) with scalability (by facilitating parallelism and distribution) and robustness (by handling partial failures). Unfortunately, these properties may conflict. This project addresses the challenges through the following contributions. (1) Programming model and language: a novel object-oriented programming model variant called Atomic Events and Ownership Network (AEON) is proposed. AEON combines a simplified object model for reasoning about units of application state with a novel type of multiple ownership to streamline interaction between these units, and a novel notion of events for atomic client-server interaction. (2) Distributed runtime environment: a highly scalable and decentralized runtime environment for AEON is implemented, with support for dynamically adding and removing computational units, as well as for restructuring their relationships without hampering consistency or, conversely, stalling progress. Heuristics to efficiently (re-)partition AEON applications are also proposed. (3) Resource management and fault tolerance: a resource management framework is leveraged to facilitate the mapping between application units and underlying resources; it is augmented to provide a notion of dependable resources that achieves fault tolerance. (4) Evaluation: the developed support is evaluated on a wide variety of applications and across different cloud infrastructures. All developments are based on open-source software.
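To make the combination of multiple ownership and atomic events in contribution (1) more concrete, the following is a minimal single-process sketch in Python. It is not AEON's actual language, API, or protocol: the names `Context`, `own`, `reachable`, and `run_event` are hypothetical, and the coarse strategy of locking the whole owned subgraph in a fixed global order is an assumption chosen only to keep the example short and deadlock-free.

```python
import threading


class Context:
    """Hypothetical unit of application state in an ownership network."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.owned = []               # contexts this context owns
        self.lock = threading.Lock()

    def reachable(self):
        """Return this context plus everything it transitively owns."""
        seen, stack = [], [self]
        while stack:
            ctx = stack.pop()
            if not any(ctx is s for s in seen):
                seen.append(ctx)
                stack.extend(ctx.owned)
        return seen

    def own(self, child):
        """Add an ownership edge; a child may have several owners."""
        if any(c is self for c in child.reachable()):
            raise ValueError("ownership network must stay acyclic")
        self.owned.append(child)


def run_event(target, handler):
    """Run handler atomically over target and all contexts it owns.

    Assumption: locks are taken in one global order (by id), so two
    overlapping events cannot deadlock. A real runtime would be finer-grained.
    """
    contexts = sorted(target.reachable(), key=id)
    for ctx in contexts:
        ctx.lock.acquire()
    try:
        return handler(target)
    finally:
        for ctx in reversed(contexts):
            ctx.lock.release()


# Illustrative network: a room owns two players, who share one item.
room, alice, bob, sword = (Context(n) for n in ("room", "alice", "bob", "sword"))
room.own(alice)
room.own(bob)
alice.own(sword)   # multiple ownership: the sword has two owners
bob.own(sword)
alice.state["gold"] = 100
bob.state["gold"] = 0


def transfer_one(room_ctx):
    """Event body: move one gold piece between the room's two players."""
    giver, taker = room_ctx.owned
    giver.state["gold"] -= 1
    taker.state["gold"] += 1


threads = [threading.Thread(target=run_event, args=(room, transfer_one))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each event holds every lock in the subgraph it may touch, the 50 concurrent transfers behave as if they had run one after another, which is the serializable behavior the abstract attributes to events; the price in this sketch is lost parallelism, which the actual project presumably avoids with more refined techniques.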
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project is concerned with supporting elasticity and fault tolerance of applications executing in third-party cloud data centers. The main tenet is to shield the programmer as much as possible from explicitly programming applications around specific mechanisms for achieving elasticity and fault tolerance, and instead to propose malleable systems that rely on at most high-level policies and simple configuration to achieve the best possible performance.
Concretely, the main outcomes of the project are threefold:
1. A programming language based on the popular actor model that leverages ownership and topological constraints observed in many common elastic applications to automatically achieve consistency, in a serializable way, among events executing concurrently across multiple actors. The language is shown to achieve much better performance than other approaches offering comparable consistency guarantees, across a large number of relevant benchmark applications.
2. A policy language for specifying high-level elasticity/scalability constraints for programs written in our language (item 1), which allows the runtime environment to autonomously place and migrate actors for best performance. Compared to prior, simpler approaches to achieving elasticity, the policy language is shown to easily save 25% of resources at the same performance, or to achieve 20% better performance with the same amount of resources.
3. A resource management system that achieves fault tolerance of resources via largely automated replication, using several heuristics to avoid the exorbitant overheads of naively replicating every resource component; the system can be configured for both batch and continuous processing applications. The replication is shown to incur a runtime overhead as low as 6%, while achieving up to 68% faster completion times in the presence of failures.
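As a rough illustration of how a high-level policy such as the one in outcome 2 could drive actor placement, the sketch below uses a single hypothetical constraint, a maximum load per server, and a first-fit-decreasing heuristic. The names `ElasticityPolicy`, `max_load_per_server`, and `place`, as well as the heuristic itself, are assumptions made for this example; the project's actual policy language and placement algorithm are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticityPolicy:
    # Hypothetical knob: load budget per server, in percent of one server.
    max_load_per_server: int


def place(actor_load, policy):
    """Assign actors to servers with first-fit decreasing bin packing.

    actor_load maps an actor name to its current load (integer percent).
    Returns one list of actor names per server that is needed.
    """
    servers = []  # each entry: {"actors": [...], "load": int}
    # Heaviest actors first; ties broken by name for determinism.
    ordered = sorted(actor_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, load in ordered:
        if load > policy.max_load_per_server:
            raise ValueError(f"actor {name!r} exceeds the per-server budget")
        for server in servers:
            if server["load"] + load <= policy.max_load_per_server:
                server["actors"].append(name)
                server["load"] += load
                break
        else:
            # No existing server has room: scale out by one server.
            servers.append({"actors": [name], "load": load})
    return [server["actors"] for server in servers]


policy = ElasticityPolicy(max_load_per_server=100)
busy = place({"a": 60, "b": 50, "c": 40, "d": 30}, policy)
quiet = place({"a": 30, "b": 20, "c": 10}, policy)
```

Re-running `place` as the measured loads change yields a new assignment; actors whose server differs between two assignments are the ones a runtime would migrate, and the number of returned servers grows or shrinks with the workload.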
Last Modified: 11/07/2021
Modified by: Patrick T Eugster