NSF Award Search: Award # 2311830 - Collaborative Research: Frameworks: Performance Engineering Scientific Applications with MVAPICH and TAU using Emerging Communication Primitives

Award Abstract # 2311830

Collaborative Research: Frameworks: Performance Engineering Scientific Applications with MVAPICH and TAU using Emerging Communication Primitives

NSF Org:	OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:	OHIO STATE UNIVERSITY, THE
Initial Amendment Date:	May 3, 2023
Latest Amendment Date:	May 3, 2023
Award Number:	2311830
Award Instrument:	Standard Grant
Program Manager:	Varun Chandola vchandol@nsf.gov (703)292-2656 OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2023
End Date:	August 31, 2026 (Estimated)
Total Intended Award Amount:	$900,000.00
Total Awarded Amount to Date:	$900,000.00
Funds Obligated to Date:	FY 2023 = $900,000.00
History of Investigator:	Dhabaleswar Panda (Principal Investigator) panda.2@osu.edu; Hari Subramoni (Co-Principal Investigator); Aamir Shafi (Co-Principal Investigator)
Recipient Sponsored Research Office:	Ohio State University 1960 KENNY RD COLUMBUS OH US 43210-1016 (614)688-8735
Sponsor Congressional District:	03
Primary Place of Performance:	Ohio State University 1960 KENNY RD COLUMBUS OH US 43210-1016
Primary Place of Performance Congressional District:	03
Unique Entity Identifier (UEI):	DLWBSLWAJWR1
Parent UEI:	MN4MDDMN8529
NSF Program(s):	Software Institutes
Primary Program Source:	01002324DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	077Z, 4444, 8004
Program Element Code(s):	800400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Earthquake hazards pose potentially life-threatening risks to communities and cause significant economic damage. Numerical simulations of earthquakes on large-scale supercomputers are emerging as key to guiding the infrastructure and policy decisions that are informed by earthquake modeling. Seismic codes and other codes, including simulations that use the Fast Fourier Transform (FFT), distribute processing across a large number of compute nodes in a supercomputer. Optimizing the communication between nodes is key to achieving good performance, but it is a daunting task given the scale of execution. The MVAPICH communication library, which implements the Message Passing Interface (MPI), and the TAU Performance System, a profiling tool for observing communication, will be tightly coupled to assess the performance impact of tuning these codes during execution. These libraries will share key performance parameters and optimize the communication in these applications to improve the time to solution. Performance-engineered versions of these codes will help drive the next generation of earthquake forecasting and help improve our understanding of seismic events to reduce risks to population centers and the environment. The research will enable undergraduate and graduate curriculum advancements via research in pedagogy for High Performance Computing (HPC), Deep/Machine Learning, and Data Analytics courses. The results will also be disseminated to the collaborating organizations of the investigators to impact their HPC software applications.

Emerging HPC systems, driven by many-core processors and accelerator architectures, require innovations in existing infrastructure to deliver the best performance for science domains. The MPI 4.0 standard has also brought forward new opportunities for co-designing applications. These include partitioned point-to-point and collective operations, and neighborhood collectives. With these advances, there is a critical need to update the commonly used tools and libraries that form the basis for the NSF's HPC cyberinfrastructure. The research undertakes this challenge and pursues new performance engineering avenues in the MVAPICH2 and TAU libraries with scientific applications by exploiting a co-design approach using the MPI_T API. The project focuses on two popular HPC applications spanning multiple domains and representing various communication patterns: Anelastic Wave Propagation (AWP-ODC) and Highly efficient FFTs for Exascale (heFFTe). AWP-ODC is a highly scalable parallel finite-difference application with point-to-point operations that enables 3D earthquake calculations. HeFFTe, dominated by collective operations, is a massively parallel application that provides a scalable and efficient implementation of the widely used Fast Fourier Transform (FFT) operations. The research aims to investigate and develop the following innovations by co-designing MVAPICH2 and TAU libraries to scale driving science domains, including AWP-ODC and heFFTe: 1) Load-aware designs for MPI asynchronous communication, 2) Cross runtime coordination for MPI+X applications, 3) Partitioned point-to-point primitives, 4) Application-aware neighborhood collective communication, 5) Support for adaptive persistent collective communication, and 6) Coordinating communication kernels on GPUs.
Integrated development and evaluation are carried out to ensure proper integration of the proposed designs with the driving applications, and the investigators work closely with internal and external collaborators to facilitate wide deployment and adoption of the released software. The transformative impact of the proposed effort is to extract the performance and scalability of HPC applications on next-generation HPC architectures through intelligent performance engineering.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Anthony, Q and Michalowicz, B and Hatef, J and Xu, L and Abduljabbar, M and Shafi, A and Subramoni, H and Panda, D "Demystifying the Communication Characteristics for Distributed Transformer Models", 2024

Alnaasan, N and Huang, H and Shafi, A and Subramoni, H and Panda, D "Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models", 2024

Chen, Chen-Chun and Kuncham, Goutham Kalikrishna Reddy and Kousha, Pouya and Subramoni, Hari and Panda, Dhabaleswar K "Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs", 2024 https://doi.org/10.1145/3626203.3670549

Chen, Chen-Chun and Shafie Khorassani, Kawthar and Kousha, Pouya and Zhou, Qinghua and Yao, Jinghan and Subramoni, Hari and Panda, Dhabaleswar K "MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators", 2023 https://doi.org/10.1145/3624062.3624153

Xu, L and Anthony, Q and Zhou, Q and Alnaasan, N and Gulhane, R and Shafi, A and Subramoni, H and Panda, D "Accelerating Large Language Model Training with Hybrid GPU-based Compression", 2024

Zhou, Qinghua and Ramesh, Bharath and Shafi, Aamir and Abduljabbar, Mustafa and Subramoni, Hari and Panda, Dhabaleswar K "Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters", 2024 https://doi.org/10.23919/ISC.2024.10528931

Please report errors in award information by writing to: awardsearch@nsf.gov.
