Award Abstract # 2106059
Collaborative Research: OAC Core: Simulation-driven runtime resource management for distributed workflow applications

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF HAWAII
Initial Amendment Date: August 30, 2021
Latest Amendment Date: August 30, 2021
Award Number: 2106059
Award Instrument: Standard Grant
Program Manager: Juan Li
jjli@nsf.gov
 (703)292-2625
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2021
End Date: September 30, 2025 (Estimated)
Total Intended Award Amount: $279,988.00
Total Awarded Amount to Date: $279,988.00
Funds Obligated to Date: FY 2021 = $279,988.00
History of Investigator:
  • Henri Casanova (Principal Investigator)
    henric@hawaii.edu
Recipient Sponsored Research Office: University of Hawaii
2425 CAMPUS RD SINCLAIR RM 1
HONOLULU
HI  US  96822-2247
(808)956-7800
Sponsor Congressional District: 01
Primary Place of Performance: University of Hawaii
1680 East-West Rd., POST 317
Honolulu
HI  US  96822-2327
Primary Place of Performance Congressional District: 01
Unique Entity Identifier (UEI): NSCKLFSSABF2
Parent UEI:
NSF Program(s): OAC-Advanced Cyberinfrast Core
Primary Program Source: 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 9150, 026Z, 7923
Program Element Code(s): 090Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Many scientific breakthroughs in domains such as health, climate modeling, particle physics, seismology, etc., can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of the resources as measured by cost and energy usage, while making the processing as fast as possible or as needed. To this end, these systems must make decisions regarding which resources should be used to do what and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known to not make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.

Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution and make resource management decisions along several axes, including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize both application performance and a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts, both from theoreticians and practitioners. However, the challenge is far from being solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they are not informed by theory. This project resolves this disconnect by obviating the need for developing effective resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. The transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.
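
The selection step described above can be illustrated with a minimal sketch. It is not the project's implementation: the workflow format, the two candidate policies, and the names `simulate`, `PORTFOLIO`, and `select_policy` are all hypothetical, and the toy list-scheduling loop merely stands in for a real simulator. The sketch simulates the same workflow once per candidate policy and keeps the policy with the smallest simulated makespan. An online variant would presumably repeat this selection during execution, over the tasks that remain.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy workload: task name -> (work in seconds, parent tasks).
Workflow = Dict[str, Tuple[float, List[str]]]
# A candidate policy ranks a ready task; a higher value is scheduled first.
Priority = Callable[[str, float], float]


def simulate(workflow: Workflow, num_workers: int, priority: Priority) -> float:
    """Toy list-scheduling simulation; returns the simulated makespan."""
    finish: Dict[str, float] = {}        # task -> simulated finish time
    worker_free = [0.0] * num_workers    # time at which each worker becomes idle
    pending = set(workflow)
    while pending:
        # A task is ready once all of its parents have been scheduled.
        ready = [t for t in pending
                 if all(p in finish for p in workflow[t][1])]
        # Take the ready task the policy ranks highest (name breaks ties).
        task = max(ready, key=lambda t: (priority(t, workflow[t][0]), t))
        work, parents = workflow[task]
        data_ready = max((finish[p] for p in parents), default=0.0)
        # Place the task on the worker that becomes idle first.
        w = min(range(num_workers), key=lambda i: worker_free[i])
        finish[task] = max(worker_free[w], data_ready) + work
        worker_free[w] = finish[task]
        pending.remove(task)
    return max(finish.values())


# Assumed portfolio of candidate decisions (illustrative only).
PORTFOLIO: Dict[str, Priority] = {
    "longest_first": lambda name, work: work,
    "shortest_first": lambda name, work: -work,
}


def select_policy(workflow: Workflow,
                  num_workers: int) -> Tuple[str, Dict[str, float]]:
    """Simulate every candidate and return the one with the lowest makespan."""
    results = {name: simulate(workflow, num_workers, prio)
               for name, prio in PORTFOLIO.items()}
    return min(results, key=results.get), results


if __name__ == "__main__":
    demo: Workflow = {"a": (4.0, []), "b": (1.0, []), "c": (1.0, []),
                      "d": (1.0, []), "e": (1.0, [])}
    print(select_policy(demo, num_workers=2))
```

In this sketch, adding a candidate only means adding an entry to `PORTFOLIO`, which mirrors the shift from designing one resource management algorithm to enumerating many decisions. The cost is one simulation per candidate, so the approach depends on each simulation being fast.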

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Casanova, Henri and Berney, Kyle and Chastel, Serge and Da Silva, Rafael Ferreira "WfCommons: Data Collection and Runtime Experiments using Multiple Workflow Systems" 1st IEEE International Workshop on Workflows in Distributed Environments, 2023 https://doi.org/10.1109/COMPSAC57700.2023.00290
Monniot, Julien and Tessier, François and Casanova, Henri and Antoniu, Gabriel "Simulation of Large-Scale HPC Storage Systems: Challenges and Methodologies", 2024 https://doi.org/10.1109/HiPC62374.2024.00031
McDonald, Jesse and Dobbs, John and Wong, Yick Ching and da Silva, Rafael Ferreira and Casanova, Henri "An exploration of online-simulation-driven portfolio scheduling in workflow management systems" Future Generation Computer Systems, 2024 https://doi.org/10.1016/j.future.2024.07.005
Horzela, Maximilian and Casanova, Henri and Giffels, Manuel and Gottmann, Artur and Hofsaess, Robin and Quast, Günter and Rossi Tisbeni, Simone and Streit, Achim and Suter, Frédéric "Modeling Distributed Computing Infrastructures for HEP Applications" EPJ Web of Conferences, v.295, 2024 https://doi.org/10.1051/epjconf/202429504032
H. Casanova, Y. C. "On the Feasibility of Simulation-driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems" Proceedings of the 25th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2022
Coleman, Tainã and Casanova, Henri and Ferreira da Silva, Rafael "Automated generation of scientific workflow generators with WfChef" Future Generation Computer Systems, v.147, 2023 https://doi.org/10.1016/j.future.2023.04.031
Casanova, Henri and Giersch, Arnaud and Legrand, Arnaud and Quinson, Martin and Suter, Frédéric "Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid" Parallel Computing, v.123, 2025 https://doi.org/10.1016/j.parco.2025.103125
McDonald, Jesse and Horzela, Maximilian and Suter, Frédéric and Casanova, Henri "Automated Calibration of Parallel and Distributed Computing Simulators: A Case Study", 2024 https://doi.org/10.1109/IPDPSW63119.2024.00173

Please report errors in award information by writing to: awardsearch@nsf.gov.
