Award Abstract # 2106147
Collaborative Research: OAC Core: Simulation-driven runtime resource management for distributed workflow applications

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF SOUTHERN CALIFORNIA
Initial Amendment Date: August 30, 2021
Latest Amendment Date: October 14, 2022
Award Number: 2106147
Award Instrument: Standard Grant
Program Manager: Juan Li
jjli@nsf.gov
 (703)292-2625
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2021
End Date: September 30, 2025 (Estimated)
Total Intended Award Amount: $220,000.00
Total Awarded Amount to Date: $220,000.00
Funds Obligated to Date: FY 2021 = $220,000.00
History of Investigator:
  • Ewa Deelman (Principal Investigator)
    deelman@isi.edu
  • Loic Pottier (Former Principal Investigator)
  • Rafael Ferreira da Silva (Former Principal Investigator)
Recipient Sponsored Research Office: University of Southern California
3720 S FLOWER ST FL 3
LOS ANGELES
CA  US  90033
(213)740-7762
Sponsor Congressional District: 34
Primary Place of Performance: University of Southern California
4676 Admiralty Way, Ste 1001
Marina del Rey
CA  US  90292-6611
Primary Place of Performance
Congressional District:
36
Unique Entity Identifier (UEI): G88KLJR3KYT5
Parent UEI:
NSF Program(s): OAC-Advanced Cyberinfrast Core
Primary Program Source: 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 9102, 7923
Program Element Code(s): 090Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Many scientific breakthroughs in domains such as health, climate modeling, particle physics, seismology, etc., can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of the resources as measured by cost and energy usage, while making the processing as fast as possible or as needed. To this end, these systems must make decisions regarding which resources should be used to do what and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known to not make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.

Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution, and make resource management decision along several axes including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize both application performance and also a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts, both from theoreticians and practitioners. However, the challenge is far from being solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they not informed by theory. This project resolves this disconnect by obviating the need for developing effective resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. The transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Casanova, H. and Wong Y. C. and Pottier, L. and Ferreira da Silva, R. "On the Feasibility of Simulation-driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems" Job Scheduling Strategies for Parallel Processing , 2022 Citation Details
Safri, Hamza and Papadimitriou, George and Desprez, Frédéric and Deelman, Ewa "A Workflow Management System Approach To Federated Learning: Application to Industry 4.0" , 2024 https://doi.org/10.1109/DCOSS-IoT61029.2024.00047 Citation Details

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page