Award Abstract # 1818253
Computation for the Endless Frontier

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF TEXAS AT AUSTIN
Initial Amendment Date: August 28, 2018
Latest Amendment Date: February 29, 2024
Award Number: 1818253
Award Instrument: Cooperative Agreement
Program Manager: Edward Walker
edwalker@nsf.gov
(703)292-4863
OAC  Office of Advanced Cyberinfrastructure (OAC)
CSE  Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: February 28, 2025 (Estimated)
Total Intended Award Amount: $60,000,000.00
Total Awarded Amount to Date: $78,999,136.00
Funds Obligated to Date: FY 2018 = $60,000,000.00
FY 2019 = $2,999,135.00
FY 2020 = $4,000,000.00
FY 2023 = $11,999,999.00
History of Investigator:
  • Daniel Stanzione (Principal Investigator)
    dan@tacc.utexas.edu
  • Dhabaleswar Panda (Co-Principal Investigator)
  • Omar Ghattas (Co-Principal Investigator)
  • Tommy Minyard (Co-Principal Investigator)
  • John West (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Texas at Austin
110 INNER CAMPUS DR
AUSTIN
TX  US  78712-1139
(512)471-6424
Sponsor Congressional District: 25
Primary Place of Performance: University of Texas at Austin
3925 West Braker Lane, Suite 156
Austin
TX  US  78759-5316
Primary Place of Performance Congressional District: 37
Unique Entity Identifier (UEI): V6AFQPN18437
Parent UEI:
NSF Program(s): CYBERINFRASTRUCTURE, Leadership-Class Computing
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002324DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 026Z, 7781, 097Z
Program Element Code(s): 723100, 778100
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Computation is critical to our nation's progress in science and engineering. Whether through simulation of phenomena where experiments are costly or impossible, large-scale data analysis to sift the enormous quantities of digital data that scientific instruments can produce, or machine learning to find patterns and suggest hypotheses from this vast array of data, computation is the universal tool upon which nearly every field of science and engineering relies to hasten its advance. This project will deploy a powerful new system, called "Frontera", that builds upon a design philosophy and operations approach proven by the success of the Texas Advanced Computing Center (TACC) in delivering leading instruments for computational science. Frontera provides a system of unprecedented scale in the NSF cyberinfrastructure that will yield productive science on day one, while also preparing the research community for the shift to much more capable systems in the future. Frontera is a hybrid system of conventional Central Processing Units (CPUs) and Graphics Processing Units (GPUs), with performance capabilities that significantly exceed prior leadership-class computing investments made by NSF. Importantly, the design of Frontera will support the seamless transition of current NSF leadership-class computing applications to the new system, as well as enable the new large-scale data-intensive and machine learning workloads that are expected in the future. Following deployment, the project will operate the system together with ten academic partners. In addition, the project will begin planning activities in collaboration with leading computational scientists and technologists from around the country, and will leverage strategic public-private partnerships to design a leadership-class computing facility with at least ten times the performance capability for science and engineering research, helping to ensure economic competitiveness and prosperity for the nation at large.

TACC, in partnership with Dell EMC and Intel, will deploy Frontera, a hybrid system offering 39 PF (double precision) of Intel Xeon processors, complemented by 11 PF (single precision) of GPU cards for machine learning applications. In addition to 3x the per-node memory of the primary compute nodes of NSF's prior leadership-class computing system, Frontera will have 2x the storage bandwidth in a storage hierarchy that includes 55 PB of usable disk-based storage and 3 PB of 'all-flash' storage, to enable next-generation data-intensive applications and support the data science community. Frontera will be deployed in TACC's state-of-the-art data center, which is configured to supply 30% of the system's power needs from renewable energy. Frontera will support science and engineering in virtually all disciplines through its software environment's support for application containers, as well as through its partnership with ten academic institutions providing deep computational science expertise in support of users on the system. The project planning effort for a Phase 2 system with at least 10x the performance will incorporate a community-driven process that includes leading computational scientists and technologists from around the country and leverages strategic public-private partnerships. This process will ensure that the design of a future NSF leadership-class computing facility incorporates the most productive near-term technologies and anticipates the most likely future technological capabilities for all fields of science and engineering requiring leadership-class computational and data-analytics capabilities. Furthermore, the project is expected to develop new expertise and techniques for leadership-class computing and data-driven applications that will benefit future users worldwide through publications, training, and consulting. The project will leverage the team's unique approach to education, outreach, and training activities to encourage, educate, and develop the next generation of leadership-class computational science researchers. The team includes leaders in campus bridging, minority-serving institution (MSI) outreach, and data technologies who will oversee efforts to use Frontera to increase the diversity of groups using leadership-class computing for traditional and data-driven applications.
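
As a rough illustration of where an aggregate figure like the 39 PF (double precision) above comes from, the short Python sketch below computes a back-of-envelope peak for a Cascade Lake-class CPU partition. The per-node parameters (56 cores per node, a 2.7 GHz clock, 32 double-precision FLOPs per core per cycle) are illustrative assumptions, not values stated in this award.

# Back-of-envelope estimate of aggregate double-precision peak for a
# CPU partition of the kind described above. All hardware parameters are
# illustrative assumptions (typical of dual-socket Intel Xeon "Cascade
# Lake" nodes), not figures taken from the award itself.

CORES_PER_NODE = 56               # assumed: 2 sockets x 28 cores
CLOCK_GHZ = 2.7                   # assumed base clock frequency
DP_FLOPS_PER_CORE_PER_CYCLE = 32  # AVX-512: 2 FMA units x 8 doubles x 2 ops


def node_peak_tflops() -> float:
    """Theoretical double-precision peak of one node, in TFLOPS."""
    return CORES_PER_NODE * CLOCK_GHZ * DP_FLOPS_PER_CORE_PER_CYCLE / 1e3


def nodes_for_target(target_pflops: float) -> int:
    """Approximate node count needed to reach a target aggregate peak."""
    return round(target_pflops * 1e3 / node_peak_tflops())


if __name__ == "__main__":
    print(f"per-node peak: {node_peak_tflops():.2f} TFLOPS")  # ~4.84 TFLOPS
    print(f"nodes for 39 PF: {nodes_for_target(39.0)}")       # ~8,060 nodes

Under these assumptions, roughly eight thousand such nodes are needed to reach 39 PF, which is broadly consistent with the node count reported in the Project Outcomes Report below.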

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 12)
Suresh, K. and Khorassani, K. and Chen, C. and Ramesh, B. and Abduljabbar, M. and Shafi, A. and Panda, DK. "Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries" Hot Interconnects, 2022
Tran, A. and Michalowicz, B. and Ramesh, B. and Subramoni, H. and Shafi, A. and Panda, DK. "Designing Hierarchical Multi-HCA Aware Allgather in MPI" International Workshop on Parallel Programming Models and Systems Software for High-End Computing, 2022 https://doi.org/10.1145/3547276.3548524
Xu, Shulei and Shafi, Aamir and Subramoni, Hari and Panda, Dhabaleswar K. "Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter" IEEE International Parallel and Distributed Processing Symposium Workshops, 2022 https://doi.org/10.1109/IPDPSW55747.2022.00083
Zhou, Q. and Kousha, P. and Anthony, Q. and Khorassani, K. and Shafi, A. and Subramoni, H. and Panda, DK. "Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters" ISC High Performance, 2022 https://doi.org/10.1007/978-3-031-07312-0_1
Al Attar, K. and Shafi, A. and Abduljabbar, M. and Subramoni, H. and Panda, DK. "Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI" 2022 IEEE International Conference on Cluster Computing, 2022
Al-Attar, Kinan and Shafi, Aamir and Subramoni, Hari and Panda, Dhabaleswar K. "Towards Java-based HPC using the MVAPICH2 Library: Early Experiences" 2022 IEEE International Parallel and Distributed Processing Symposium Workshops, 2022 https://doi.org/10.1109/IPDPSW55747.2022.00091
Alnaasan, Nawras and Jain, Arpan and Shafi, Aamir and Subramoni, Hari and Panda, Dhabaleswar K. "OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems" 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, 2022 https://doi.org/10.1109/IPDPSW55747.2022.00143
Chen, Chen-Chun and Khorassani, Kawthar Shafie and Anthony, Quentin G. and Shafi, Aamir and Subramoni, Hari and Panda, Dhabaleswar K. "Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems" Heterogeneity in Computing Workshop, 2022 https://doi.org/10.1109/IPDPSW55747.2022.00014
Jain, A. and Shafi, A. and Anthony, Q. and Kousha, P. and Subramoni, H. and Panda, DK. "Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters" Proceedings International Conference on High Performance Computing, 2022 https://doi.org/10.1007/978-3-031-07312-0_6
Kousha, P. and Jain, A. and Kolli, A. and Prasanna, S. and Miriyala, S. and Subramoni, H. and Shafi, A. and Panda, DK. "Hey CAI - Conversational AI Enabled User Interface for HPC Tools" Proceedings International Conference on High Performance Computing, 2022 https://doi.org/10.1007/978-3-031-07312-0_5
Ramesh, Bharath and Hashmi, Jahanzeb Maqbool and Xu, Shulei and Shafi, Aamir and Ghazimirsaeed, Mahdieh and Bayatpour, Mohammadreza and Subramoni, Hari and Panda, Dhabaleswar K. "Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems" International Conference on High Performance Computing, Data, and Analytics, 2021 https://doi.org/10.1109/HIPC53243.2021.00041

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This award led to the 2019 deployment of the Frontera supercomputer at the Texas Advanced Computing Center at the University of Texas at Austin. Built with components from Dell, Intel, and Mellanox, Frontera debuted as the #5 fastest supercomputer in the world and, for the past six years, has been the fastest supercomputer at any university in the United States.

With operations still ongoing, over six years of life Frontera has delivered more than 7.2M simulations to more than 1,600 unclassified research projects and supported nearly a billion dollars' worth of open-science research funded by the National Science Foundation. Frontera has delivered more than 370M node hours (21 billion CPU core hours) to researchers in fields such as Materials Science, Electronics, Astronomy, Weather and Climate, Chemistry, Physics, Engineering, and many more. Frontera was instrumental in research during the COVID pandemic, running drug discovery pipelines and pandemic models; was involved in imaging the black hole at the center of our galaxy; processed some of the first datasets from the James Webb Space Telescope; produced countless hurricane impact forecasts; and contributed to many other projects. Frontera was used by everyone from Nobel Prize winners to high school students learning AI and coding skills.

Frontera initially consisted of 8,008 56-core Intel Xeon "Cascade Lake" nodes from Dell, 360 NVIDIA RTX 5000 GPUs, and roughly 100 V100 GPUs in IBM POWER9 nodes, along with a 200 Gbps fabric from Mellanox and more than 50 PB of fast storage from DataDirect Networks. Frontera was upgraded several times, with the addition of more than 300 compute nodes during the pandemic, the replacement of the V100s with NVIDIA A100 GPUs in Dell servers, and, most recently, the use of the project to stand up the bridge system to the next leadership-class system, Horizon, coming in 2026.
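
As a quick consistency check on the utilization figures above, the short Python sketch below multiplies the delivered node hours by the per-node core count; it is an illustrative calculation only, not data from the project's accounting system.

# Cross-check of the utilization figures quoted above: delivered node
# hours times cores per node should roughly equal the quoted CPU core
# hours. Inputs are the figures from the text; this is an illustrative
# check, not project accounting data.

NODE_HOURS_DELIVERED = 370e6   # "more than 370M node hours"
CORES_PER_NODE = 56            # 56-core "Cascade Lake" nodes

core_hours = NODE_HOURS_DELIVERED * CORES_PER_NODE
print(f"{core_hours / 1e9:.1f} billion core hours")
# -> 20.7 billion, consistent with the ~21 billion CPU core hours quoted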

Throughout its production life, Frontera maintained uptime in excess of 99%, with more than 95% of compute nodes in use at all times. It has been a remarkably productive system that has helped make thousands of advances in engineering and science.


Last Modified: 07/01/2025
Modified by: Daniel Stanzione
