
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: | |
Initial Amendment Date: | August 4, 2016 |
Latest Amendment Date: | August 25, 2017 |
Award Number: | 1565414 |
Award Instrument: | Standard Grant |
Program Manager: | Almadena Chtchelkanova, achtchel@nsf.gov, (703) 292-7498, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2016 |
End Date: | July 31, 2020 (Estimated) |
Total Intended Award Amount: | $1,171,893.00 |
Total Awarded Amount to Date: | $1,171,893.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 1960 KENNY RD, COLUMBUS, OH, US 43210-1016, (614) 688-8735 |
Sponsor Congressional District: | |
Primary Place of Performance: | OH, US 43210-1206 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | CI REUSE, Software & Hardware Foundation, CSR-Computer Systems Research |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This award was partially supported by the CIF21 Software Reuse Venture, whose goals are to support pathways toward sustainable software elements through their reuse and to emphasize the critical role of reusable software elements in a sustainable software cyberinfrastructure supporting computational and data-enabled science and engineering.
Parallel programming based on MPI (Message Passing Interface) is being used with increasing frequency in academia and government (for both defense and non-defense purposes), as well as in emerging areas such as scalable machine learning and big data analytics. The emergence of Dense Many-Core (DMC) architectures like Intel's Knights Landing (KNL) and accelerator/co-processor architectures like NVIDIA GPGPUs is enabling the design of systems with high compute density. This, coupled with the availability of Remote Direct Memory Access (RDMA)-enabled commodity networking technologies like InfiniBand, RoCE, and 10/40 GigE with iWARP, is fueling the growth of multi-petaflop and exaflop systems. These DMC architectures have the following unique characteristics: deeper levels of hierarchical memory; revolutionary network interconnects; and heterogeneous compute power and data movement costs (with heterogeneity at the chip level and node level).
For these emerging systems, combinations of MPI and other programming models, known as MPI+X (where X can be PGAS, Tasks, OpenMP, OpenACC, or CUDA), are being targeted. The current generation of communication protocols and mechanisms for MPI+X programming models cannot efficiently support the emerging DMC architectures. This leads to the following broad challenges: 1) How can high-performance and scalable communication mechanisms for next-generation DMC architectures be designed to support MPI+X (including Task-based) programming models? and 2) How can current and next-generation applications be designed/co-designed with the proposed communication mechanisms?
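To make the MPI+X idea concrete, the minimal sketch below (an illustrative example, not code from this project or the MVAPICH2 library) pairs MPI for communication between processes with OpenMP for threading within a process; the thread-support level requested from MPI_Init_thread is what marks it as a hybrid program.

/* Minimal MPI+OpenMP (one form of MPI+X) sketch: MPI ranks across nodes,
 * OpenMP threads within each rank. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request FUNNELED support: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* On-node parallelism handled by OpenMP threads. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (double)(i + 1 + rank);

    /* Inter-process communication handled by MPI. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum across %d ranks: %f\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}

Such a hybrid program would typically be launched with one MPI rank per node or per socket and the remaining cores given to OpenMP threads, though the best mapping depends on the system.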
A synergistic and comprehensive research plan, involving computer scientists from The Ohio State University (OSU) and the Ohio Supercomputer Center (OSC) and computational scientists from the Texas Advanced Computing Center (TACC), the San Diego Supercomputer Center (SDSC), and the University of California San Diego (UCSD), is proposed to address the above broad challenges with innovative solutions. The research will be driven by a set of applications from established NSF computational science researchers running large-scale simulations on Stampede, Comet, and other systems at OSC and OSU. The proposed designs will be integrated into the widely used MVAPICH2 library and made available for public use. Multiple graduate and undergraduate students will be trained under this project as future scientists and engineers in HPC. The established national-scale training and outreach programs at TACC, SDSC, and OSC will be used to disseminate the results of this research to XSEDE users. Tutorials will be organized at XSEDE, SC, and other conferences to share the research results and experience with the community.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Current generation multi-petascale systems are being powered by modern multi-core architectures like Intel Cascade Lake and AMD Rome, and accelerators from NVIDIA and AMD. These systems have tight integration with Remote Direct Memory Access (RDMA) enabled high-performance networking technologies like InfiniBand, Omni-Path, and RDMA over Converged Enhanced Ethernet (RoCE). Such architectures are being defined as Dense Many-Core (DMC) systems. These systems are starkly different from homogeneous clusters of the past. These evolving DMC systems are targeted for emerging exascale computing and are characterized by: 1) Deeper levels of hierarchical memory, 2) Revolutionary network interconnects, and 3) Heterogeneous compute power and data movement costs (with heterogeneity at chip-level and node-level).
The Message Passing Interface (MPI) has been the de facto parallel programming model for the past two decades and is very successful for implementing regular and iterative parallel algorithms with well-defined communication patterns. The Remote Memory Access (RMA) features of the MPI-3 standard have shown promise for expressing algorithms with irregular computation and communication patterns by enabling lightweight one-sided communication and synchronization operations. At the same time, owing to dramatic changes in architectures (high concurrency and low memory per core), hybrid programming models such as MPI+OpenMP and MPI+OpenACC/CUDA are being adopted as some of the primary programming models for High-Performance Computing (HPC) applications. The evolution and diversity of programming models and their hybrid usage modes for next-generation systems is being described generically in the community as the 'MPI+X' model.
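To illustrate the MPI-3 RMA point, the sketch below (illustrative only, not taken from the MVAPICH2 sources) exposes a small window with MPI_Win_allocate and writes into the right neighbor's window with MPI_Put, using MPI_Win_fence for synchronization; production codes may instead use passive-target locking, for which the fence model stands in here.

/* Illustrative MPI-3 RMA (one-sided) sketch: each rank puts its rank id
 * into a window exposed by its right neighbor. Not project code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    int *win_buf;          /* memory exposed for remote access */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Allocate window memory; the library may place it to favor RDMA. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;

    int right = (rank + 1) % nranks;

    MPI_Win_fence(0, win);                       /* open access epoch  */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                       /* close access epoch */

    printf("rank %d received %d from its left neighbor\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}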
On the other hand, task-based programming models and runtimes, such as the Asynchronous PGAS (APGAS) models, appear able to achieve efficient load balancing, fault tolerance, and latency hiding for highly irregular communication patterns. However, they may not be ideal for expressing global control flow and global communication. Thus, MPI+Task (as another form of X) has been gaining momentum in the community. However, designing a unified resource-progression mechanism that avoids resource starvation and/or deadlocks for the MPI+Task model opens up several research challenges due to the fundamental differences in control flow between the two models: MPI uses user-driven control flow, whereas APGAS/Task-based models rely on system/runtime-scheduler-driven control flow. These trends lead to the following broad challenges: 1) How can high-performance and scalable communication mechanisms for next-generation DMC architectures be designed to support MPI+X (including Task-based) programming models? and 2) How can current and next-generation applications be designed/co-designed with the proposed communication mechanisms?
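As a rough illustration of the MPI+Task usage mode (an assumed sketch using OpenMP tasks, not the project's runtime design), the code below lets tasks issue MPI calls concurrently, which is only legal when the library provides MPI_THREAD_MULTIPLE; the interaction between the task scheduler and MPI progress in exactly this kind of code is where the resource-progression challenges described above arise.

/* Illustrative MPI+Task (MPI + OpenMP tasks) sketch. Tasks issue MPI calls
 * concurrently, which requires MPI_THREAD_MULTIPLE. Run with >= 2 ranks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NTASKS 4

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_MULTIPLE) {
        if (rank == 0) printf("MPI_THREAD_MULTIPLE not available\n");
        MPI_Finalize();
        return 1;
    }

    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < NTASKS; t++) {
        #pragma omp task firstprivate(t)
        {
            int buf = t;
            if (rank == 0) {      /* rank 0's tasks each send one message */
                MPI_Send(&buf, 1, MPI_INT, 1, t, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Status st;
                /* Any-tag receive so task scheduling order cannot deadlock. */
                MPI_Recv(&buf, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                printf("rank 1, task %d: received %d (tag %d)\n",
                       t, buf, st.MPI_TAG);
            }
        }
    }

    MPI_Finalize();
    return 0;
}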
To address the challenges outlined above, we have adopted in this project a multi-year, multi-tiered approach to exploit emerging DMC architectures and design optimized runtimes for supporting MPI and MPI+X programming models. The challenges have been addressed along the following directions:
1. Designing and developing high-performance, contention-aware, and scalable point-to-point and collective communication protocols and algorithms for heterogeneous DMC systems with the latest generation of CPUs and GPUs.
2. Designing and developing dynamic and adaptive communication protocols for contiguous and non-contiguous data layouts in MPI (a background sketch of non-contiguous datatypes follows this list).
3. Designing efficient communication and synchronization schemes for MPI+PGAS and MPI+X programming models.
4. Carrying out in-depth study of the new designs with a range of computing and networking technologies.
5. Co-designing a set of applications with the new runtimes and studying their performance and scalability on a set of contemporary multi-petaflop systems.
6. Deploying the new frameworks and runtimes on various HPC systems at Ohio Supercomputer Center (OSC), Texas Advanced Computing Center (TACC), and San Diego Supercomputer Center (SDSC) and carrying out continuous engagement with their users to improve and optimize the designs and deliver better performance and scalability for a large number of applications.
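As background for item 2 above, non-contiguous layouts in MPI are usually described with derived datatypes. The sketch below (illustrative only; it does not show the project's dynamic or adaptive protocols) builds a strided column type with MPI_Type_vector so that a single send moves non-contiguous data; run with at least two ranks.

/* Illustrative non-contiguous communication with an MPI derived datatype:
 * send one column of a row-major NxN matrix as a single strided message. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank;
    double a[N][N];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* N blocks of 1 double, separated by a stride of N doubles: one column. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 10.0 * i + j;
        /* Send column 1 (non-contiguous in row-major storage) in one call. */
        MPI_Send(&a[0][1], 1, column, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double col[N];
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)
            printf("col[%d] = %.1f\n", i, col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}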
The results of this research (new designs, performance results, benchmarks, etc.) have been made available to the community through the MVAPICH2 MPI libraries. Multiple releases of these libraries were made during the project period, and more than 400,000 copies of the MVAPICH2 MPI libraries were downloaded from the project's web site over that time. With each release, features, performance numbers, and scalability information have been shared with the MVAPICH user community through mailing lists and the project's web site. In addition to the software distribution, the results have been presented at various conferences, journals, and events through keynote talks, invited talks, tutorials, and hands-on sessions. The research has also led to theses for several M.S. and Ph.D. students.
Last Modified: 11/28/2020
Modified by: Dhabaleswar K Panda