
NSF Org: OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:
Initial Amendment Date: September 27, 2013
Latest Amendment Date: February 9, 2021
Award Number: 1341698
Award Instrument: Cooperative Agreement
Program Manager: Edward Walker, edwalker@nsf.gov, (703) 292-4863, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2013
End Date: July 31, 2021 (Estimated)
Total Intended Award Amount: $12,000,000.00
Total Awarded Amount to Date: $27,313,476.00
Funds Obligated to Date: FY 2014 = $9,599,963.00; FY 2015 = $2,399,881.00; FY 2016 = $21,000.00; FY 2017 = $906,388.00; FY 2018 = $2,386,245.00
History of Investigator:
Recipient Sponsored Research Office: 9500 GILMAN DR, LA JOLLA, CA, US 92093-0021, (858) 534-4896
Sponsor Congressional District:
Primary Place of Performance: 9500 Gilman Drive, San Diego, CA, US 92093-0934
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): XD-Extreme Digital, Innovative HPC
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT; 01001516DB NSF RESEARCH & RELATED ACTIVIT; 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
The University of California at San Diego will provide a ground-breaking new computing facility, Wildfire, that will be made available to the research community, serving both well-established users of high-end computing (HEC) and, especially, new user communities that are less familiar with how HEC can advance their scientific and engineering goals.
The distinguishing features of Wildfire are:
(i) Deliver 1.8-2.0 Petaflop/s of long-sought capacity for the 98% of XSEDE jobs (50% of XSEDE core hours) that use fewer than 1,000 cores, while also supporting larger jobs. The exact figure will depend on the speed of the processors delivered by Intel but cannot be less than 1.8 Petaflop/s.
(ii) Provide 7 PB of Lustre-based Performance Storage at 200 GB/s bandwidth for both scratch and allocated storage, as well as 6 PB of Durable Storage.
(iii) Ensure high throughput and responsiveness through allocation and scheduling policies proven on earlier deployed systems such as Trestles and Gordon.
(iv) Establish a rapid-access queue that provides new accounts within one day of the request.
(v) Enable community-supported custom software stacks via virtualization for communities that are unfamiliar with HPC environments. These virtual clusters will be able to perform at or near native InfiniBand bandwidth/latency.
Wildfire will provide novel approaches to resource allocation, scheduling, and user support: queues with quicker response for high-throughput computing, medium-term storage allocations, virtualized environments with customized software stacks, dedicated allocations of physical and virtual machines, support for Science Gateways, and bandwidth reservations on high-speed networks. Wildfire has been designed to efficiently serve the 98% of XSEDE jobs that need fewer than 1,000 cores, while also supporting larger jobs. The award leverages, but also enhances, the services available through the XSEDE project.
The Wildfire acquisition will work to increase the diversity of researchers able to make effective use of advanced computational resources and to establish a pipeline of potential users through virtualization, science gateways, and educational activities at the undergraduate, graduate, and post-graduate levels.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The San Diego Supercomputer Center at the University of California, San Diego deployed the Comet supercomputer as a national resource in 2015. It was operated for allocated access by academic researchers, educators, and students through the NSF XSEDE project from May 2015 to July 2021. Following its decommissioning as an NSF-funded resource, Comet transitioned to a resource for the Center for Western Weather and Water Extremes (CW3E), a research and service project of the Scripps Institution of Oceanography. During its 75 months of operation as an XSEDE resource, Comet ran over 28 million jobs; provided over 2 billion core-hours and 13 million GPU-hours of compute time; served over 100,000 unique users, most of whom gained access via a science gateway rather than the command line; enabled publication of over 2,000 scientific papers; and supported research across virtually every domain of science and engineering.
Comet has a peak speed of 2.8 Pflop/s delivered by 48,784 cores in 1,944 compute nodes, each with two 12-core Intel Haswell processors and 128 GB of DRAM. It also has 72 GPU nodes, half with 4 NVIDIA K80s and half with 4 NVIDIA P100s, plus 4 large-memory (1.5 TB) nodes. Like its Gordon predecessor, Comet features a large amount of flash memory via solid-state disks on every compute and GPU node. SDSC designed Comet in collaboration with vendor partners Dell, Intel, NVIDIA, Mellanox, and Aeon Computing.
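As a rough illustration of how a system peak of this kind follows from the node counts above, the short sketch below estimates the contribution of the standard compute nodes alone. The 2.5 GHz clock and 16 double-precision FLOPs per cycle per core are assumed Haswell characteristics, not figures stated in the report; the GPU and large-memory nodes account for the remainder of the quoted 2.8 Pflop/s.

```python
# Back-of-the-envelope estimate of the CPU partition's peak (a sketch;
# the clock speed and FLOPs/cycle are assumptions, not from the report).

compute_nodes = 1944
cores_per_node = 2 * 12            # two 12-core Haswell processors per node
clock_hz = 2.5e9                   # assumed Haswell base clock
flops_per_cycle = 16               # assumed: AVX2 with two FMA units per core

cpu_peak_pflops = compute_nodes * cores_per_node * clock_hz * flops_per_cycle / 1e15
print(f"CPU-node peak: ~{cpu_peak_pflops:.2f} Pflop/s")   # ~1.87 Pflop/s

# The 72 GPU nodes and 4 large-memory nodes contribute the rest of the
# quoted 2.8 Pflop/s system peak.
```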
Comet was designed explicitly to serve the long tail of science, defined as the large number of researchers who require only modest numbers of cores or GPUs. Such users also benefited from Comet's optimized scheduling and allocation policies, which lowered the barrier to accessing a complex high-performance computer. The design incorporated several significant technology and policy innovations:
-A heterogeneous architecture of CPUs, GPUs, and large-memory nodes, along with a rich storage hierarchy, supported a broad range of science and engineering research.
-One compute rack of 2,016 cores, connected by a non-blocking FDR InfiniBand fat tree, supported a wide range of job sizes, from single-core jobs to modest-scale, fully coupled applications.
-Virtual Cluster (VC) software, developed by SDSC in partnership with collaborators at Indiana University, provided a low-overhead virtualization layer that allowed customized software to run side-by-side with the standard cluster software stack.
-Restricting the allocation limit of an individual PI to 10M core-hours allowed Comet to support more projects. A higher limit of 20M core-hours for science gateways provided access for many more users without the overhead of requesting their own allocation.
Comet's Virtual Cluster interface was used by researchers from the Laser Interferometer Gravitational-Wave Observatory (LIGO) in support of the confirmation of the landmark detection of gravitational waves as hypothesized by Albert Einstein over 100 years ago. LIGO researchers consumed nearly 630,000 hours of computational time on Comet via the Open Science Grid (OSG) using a VC that supported the direct integration of OSG's high-throughput scheduler into Comet's batch scheduler. Comet also became one of the first NSF national resources to use Singularity, which allowed users to run containerized application software that would otherwise not be feasible with a standard cluster software stack.
Comet set out to reach 10,000 unique users over its lifetime, a goal that was achieved within the first year of operations. Notable science gateways included CIPRES, I-TASSER, and the Neuroscience Gateway. Through these and other gateways, over 100,000 unique users accessed Comet to study a wide range of physical, chemical, and biological systems.
During its lifetime, Comet became a primary source of GPUs for the community. A research team led by UCSD's Rommie Amaro and Arvind Ramanathan, a computational biologist at Argonne National Laboratory, explored the movement of SARS-CoV-2's spike protein to understand how it behaves and gains access to the human cell. Using Comet's GPU resources as part of the scaling work, the team built a workflow based on artificial intelligence (AI) to simulate the spike protein more efficiently. The work earned a Gordon Bell Special Prize at the annual Supercomputing Conference. In 2020, Comet joined the COVID-19 HPC Consortium, adding resources to help understand the spread of COVID-19 and aid the search for treatments and vaccines.
Outreach activities exposed thousands of researchers, educators, and students to the benefits of Comet's unique features and ease of use. SDSC staff hosted tutorials at scientific meetings, workshops at SDSC and on other university campuses, and annual summer institutes. When COVID-19 restricted travel, programs were conducted virtually; SDSC used that opportunity to improve its remote training processes and tools, ultimately raising participation above pre-pandemic levels. In the final years of service, there was a marked increase in interest in machine learning and AI. Targeted outreach to meet this demand produced a body of training materials now being used with SDSC's Expanse system.
Last Modified: 11/26/2021
Modified by: Michael L Norman