Award Abstract # 2321123
CyberTraining: Pilot: Cross-Layer Training of High-Performance Deep Learning Technologies and Applications for Research Workforce Development in Central Valley

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF CALIFORNIA, MERCED
Initial Amendment Date: August 20, 2023
Latest Amendment Date: August 20, 2023
Award Number: 2321123
Award Instrument: Standard Grant
Program Manager: Sharmistha Bagchi-Sen
shabagch@nsf.gov
(703)292-8104
OAC: Office of Advanced Cyberinfrastructure
CSE: Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2023
End Date: August 31, 2025 (Estimated)
Total Intended Award Amount: $299,587.00
Total Awarded Amount to Date: $299,587.00
Funds Obligated to Date: FY 2023 = $299,587.00
History of Investigator:
  • Xiaoyi Lu (Principal Investigator)
    xiaoyi.lu@ucmerced.edu
  • Yue Yu (Co-Principal Investigator)
Recipient Sponsored Research Office: University of California - Merced
5200 N LAKE RD
MERCED
CA  US  95343-5001
(209)201-2039
Sponsor Congressional District: 13
Primary Place of Performance: University of California - Merced
5200 N LAKE RD
MERCED
CA  US  95343-5001
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): FFM7VPAG8P92
Parent UEI:
NSF Program(s): CyberTraining - Training-based
Primary Program Source: 01002324DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 122Z, 9102, 7361, 7569
Program Element Code(s): 044Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

High-Performance Computing (HPC) has revolutionized various scientific fields, including climate research, wildlife health, agricultural sciences, and scientific simulations and modeling. With the emergence of HPC-accelerated deep learning (HPC-DL) systems and applications, there is a pressing need for comprehensive cross-layer training materials to educate the research workforce on these advanced technologies. The primary objective of this pilot project is to address this need by providing comprehensive cross-layer HPC-DL training to a wide range of cyberinfrastructure (CI) users. The target audience includes undergraduate and graduate students, postdocs, faculty, and research staff who can benefit from enhanced knowledge and skills in utilizing HPC-DL CI technologies and resources. By equipping them with the necessary training, the project aims to improve their research efficiency and maximize the potential of HPC-DL in their respective fields. In addition, the project has a specific focus on fostering inclusivity and expanding opportunities for underrepresented communities in the Central Valley area of California. This will contribute to the national interest by empowering individuals with the knowledge and skills necessary to excel in the HPC-DL field.

This project addresses the critical training needs of the converged HPC-DL field by developing comprehensive training materials, fostering peer consultant programs, conducting workshops, and building an inclusive learning culture. It integrates scientific applications, HPC technologies, and DL in a cross-layer approach. The training program covers several CI topics that are critical to achieving high performance for HPC-DL workloads, including Remote Direct Memory Access (RDMA), GPU-based distributed computing, Slurm, MPI, and NCCL. The training also covers distributed DL training frameworks such as PyTorch, TensorFlow, and Horovod, enabling participants to use these tools effectively in their research. Moreover, the training incorporates practical DL application case studies, offering real-world examples and insights. The short-term goal is to empower individuals with HPC-DL knowledge and cross-layer optimization skills to maximize the utilization of HPC-DL CI resources and improve research efficiency. This project will also examine the effectiveness of practice-central models and HPC-DL-centered workshops in promoting HPC-DL adoption in underrepresented communities. The project's long-term aim is to cultivate a robust research workforce with a deep understanding of HPC-DL CIs. By establishing a learning culture and targeting a significant number of CI users, this project addresses workforce shortages and extends its impact beyond the Central Valley. Through collaborations and the dissemination of open-source training materials, it will contribute to advancing compute- and data-intensive scientific simulations and knowledge discovery.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Dai, Liuyao and Qi, Hao and Chen, Weicong and Lu, Xiaoyi "High-Speed Data Communication With Advanced Networks in Large Language Model Training" IEEE Micro, v.44, 2024 https://doi.org/10.1109/MM.2024.3360081
Li, Yuke and Guo, Yanfei and Lu, Xiaoyi "Characterizing One-/Two-sided Designs in OpenSHMEM Collectives", 2023
Li, Yuke and Kashyap, Arjun and Chen, Weicong and Guo, Yanfei and Lu, Xiaoyi "Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures", 2024 https://doi.org/10.1109/IPDPS57955.2024.00040
Li, Yuke and Kashyap, Arjun and Guo, Yanfei and Lu, Xiaoyi "Compression Analysis for BlueField-2/-3 Data Processing Units: Lossy and Lossless Perspectives" IEEE Micro, v.44, 2024 https://doi.org/10.1109/MM.2023.3343636
Ng, Darren and Lin, Andrew and Kashyap, Arjun and Li, Guanpeng and Lu, Xiaoyi "NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support", 2024 https://doi.org/10.1109/IPDPS57955.2024.00052
Ng, Darren and Parkinson, Charles and Lin, Andrew and Kashyap, Arjun and Lu, Xiaoyi "An Early Case Study with Multi-Tenancy Support in SPDK's NVMe-over-Fabric Designs", 2023
Qi, Hao and Dai, Liuyao and Chen, Weicong and Lu, Xiaoyi "Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters", 2023
