
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: Colorado State University
Initial Amendment Date: May 8, 2013
Latest Amendment Date: April 30, 2015
Award Number: 1302693
Award Instrument: Continuing Grant
Program Manager: Almadena Chtchelkanova, achtchel@nsf.gov, (703) 292-7498, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: May 15, 2013
End Date: December 31, 2017 (Estimated)
Total Intended Award Amount: $850,000.00
Total Awarded Amount to Date: $850,000.00
Funds Obligated to Date: FY 2015 = $295,856.00
Recipient Sponsored Research Office: 601 S Howes St, Fort Collins, CO, US 80521-2807, (970) 491-6355
Primary Place of Performance: 200 W. Lake St., Fort Collins, CO, US 80521-4593
NSF Program(s): Software & Hardware Foundation; High-Performance Computing
Primary Program Source: 01001516DB, NSF Research & Related Activities
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Parallel and distributed computing systems are often a heterogeneous mix of machines. As these systems continue to expand rapidly in capability, their computational energy expenditure has skyrocketed, requiring elaborate cooling facilities that themselves consume significant energy. The need for energy-efficient resource management is thus paramount. Moreover, these systems frequently experience degraded performance and high power consumption due to circumstances that change unpredictably, such as thermal hotspots caused by load imbalances or sudden machine failures. As the complexity of these systems grows, so does the importance of making their operation robust against such uncertainties. The goal of this award is to study stochastic models, metrics, and algorithmic strategies for deriving resource allocations that are both energy-efficient and robust. The research focuses on deriving stochastic robustness and energy models from real-world data collected on heterogeneous computing machines; applying these stochastic models in resource management strategies that co-optimize performance, robustness, computation energy, and cooling energy; developing novel schemes for real-time thermal modeling; and driving and validating the research with feedback collected from real-world petascale systems (Yellowstone at the National Center for Atmospheric Research and Jaguar at Oak Ridge National Lab) and terascale systems (Colorado State University's ISTeC cluster and clusters at Oak Ridge National Lab).
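To make the notion of a stochastic robustness metric concrete, the sketch below estimates, via Monte Carlo sampling, the probability that a candidate resource allocation on heterogeneous machines completes within a makespan constraint. This is an illustrative sketch only: the task-to-machine mapping, the gamma-distributed execution times, and the deadline are hypothetical, not data or algorithms from this award.

    # Illustrative sketch (Python): Monte Carlo estimate of a stochastic
    # robustness metric -- the probability that an allocation's makespan stays
    # within a deadline when per-task execution times are random variables.
    import random

    def stochastic_robustness(allocation, samplers, deadline, trials=10_000):
        """allocation: machine name -> list of task ids assigned to it.
        samplers: (task, machine) -> zero-argument function returning one
        random execution time (e.g., drawn from a fitted distribution).
        Returns the estimated probability that makespan <= deadline."""
        hits = 0
        for _ in range(trials):
            # Makespan of one sampled scenario: the most-loaded machine.
            makespan = max(
                sum(samplers[(t, m)]() for t in tasks)
                for m, tasks in allocation.items()
            )
            if makespan <= deadline:
                hits += 1
        return hits / trials

    # Hypothetical workload: two machines, gamma-distributed execution times.
    alloc = {"m0": ["t0", "t1"], "m1": ["t2"]}
    samplers = {
        ("t0", "m0"): lambda: random.gammavariate(4.0, 2.5),
        ("t1", "m0"): lambda: random.gammavariate(3.0, 3.0),
        ("t2", "m1"): lambda: random.gammavariate(5.0, 4.0),
    }
    print(f"P(makespan <= 30) ~ {stochastic_robustness(alloc, samplers, 30.0):.3f}")

A resource manager can then compare candidate allocations by this probability rather than by expected makespan alone, which is what makes such a metric robust to run-time uncertainty.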
The research is expected to produce resource management strategies that are resilient to various sources of run-time uncertainty while also accounting for the dynamics of temperature variations and cooling capacity, meeting performance guarantees with unprecedented gains in system energy efficiency in high-performance computing environments. By lowering the energy costs and the impact of uncertainties associated with computing, this research will ultimately make high-performance computing accessible to a wider population of researchers and scientific problems. In the long term, the theoretical foundations and tools that emerge from this research will play a vital role in achieving the grand promise of sustainable computing at extreme scales within realistic power budgets. The broader impacts of the research include incorporating research results into all levels of teaching, including graduate, undergraduate, secondary, and even elementary education; increasing participation by underrepresented groups; and fostering close ties with industry and government labs to transfer the developed knowledge quickly into real-world deployments.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
High Performance Computing (HPC) systems are widely used in fields as diverse as weather modeling, financial prediction, fluid dynamics, and big-data search, and are opening the doors to new discoveries. For instance, the field of genomics relies heavily on very large supercomputers, and has given us new drugs, new ways of diagnosing disease, and new crime-investigation techniques. But the energy costs of operating HPC systems, whether supercomputers, data centers, or clusters of machines, are becoming prohibitive as these systems evolve. As an example, the K supercomputer in Japan consumes enough energy to power 10,000 homes, at a cost of $10 million per year. The petascale Yellowstone supercomputer at NCAR in Wyoming, USA, has an energy expenditure of $2 million per year. Even smaller terascale HPC clusters, such as those at Colorado State University (CSU) and Oak Ridge National Lab (ORNL), have significant computational and cooling energy costs, ranging from $20K to $200K per year. These energy costs impose a huge monetary burden on the scientific community, putting HPC systems out of the reach of those who have the greatest potential to make groundbreaking discoveries that can benefit society. Prior efforts to improve energy efficiency in computing are unfortunately either not applicable to large HPC platforms or ignore important facets of the problem, such as cooling/thermal costs and the uncertainties that often surface in HPC platforms at runtime.
The overarching goal of this project has been to devise a new software-based resource management framework that can intelligently manage the execution of applications on large-scale HPC platforms while minimizing the energy needed for computation and cooling. The fundamental innovation that has emerged from this project is the characterization of the complex relationship between cooling and computation energy in HPC platforms, using stochastic performance, robustness, and power models derived from real-world data from terascale and petascale HPC systems. The insights from these models have guided the design of new strategies to co-optimize computing performance, robustness, computation power, and cooling power in large-scale HPC platforms. These strategies have further benefited from new models developed for (1) quantifying the impact of interference in shared memory and network subsystems; (2) fast real-time thermal characterization; and (3) cooling energy costs and capacity.
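To illustrate the kind of computation/cooling co-optimization described above, the sketch below charges each task placement both its computation energy and the cooling energy needed to remove the resulting heat, using a coefficient-of-performance (CoP) curve that improves at higher cold-air supply temperatures. The quadratic CoP curve is a commonly cited empirical model for water-chilled CRAC units from the data-center scheduling literature, not necessarily the model derived in this project, and the machine power figures are hypothetical.

    # Illustrative sketch (Python): greedy task placement that co-optimizes
    # computation and cooling energy. Cooling cost is modeled via a CRAC
    # coefficient of performance (CoP) that rises with supply temperature.

    def cop(t_supply_c):
        """Empirical CRAC coefficient of performance vs. supply temp (deg C)."""
        return 0.0068 * t_supply_c ** 2 + 0.0008 * t_supply_c + 0.458

    def total_energy_j(power_w, runtime_s, t_supply_c):
        """Computation energy plus the cooling energy to remove that heat."""
        compute_j = power_w * runtime_s
        return compute_j + compute_j / cop(t_supply_c)

    def place_task(machines, runtime_s):
        """machines: list of (name, power_draw_w, local_supply_temp_c).
        Returns the machine minimizing combined compute + cooling energy."""
        return min(machines, key=lambda m: total_energy_j(m[1], runtime_s, m[2]))

    # Hypothetical cluster: a low-power node near a thermal hotspot (cold air
    # must be supplied at 15 C) vs. a higher-power node in a cooler aisle (25 C).
    cluster = [("node-a", 220.0, 15.0), ("node-b", 250.0, 25.0)]
    print("placement:", place_task(cluster, runtime_s=3600.0)[0])

Under this simple model the higher-power node can still win, because its milder thermal environment makes each joule of heat cheaper to remove; this is the cooling/computation trade-off that the project's framework exploits at much larger scale.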
Rigorous experimental analysis has shown that the developed models and framework significantly outperform the best-known prior efforts to quantify and optimize energy usage in HPC platforms. This project has also contributed to the integration of energy-efficient resource management in production HPC systems at ORNL and the Department of Defense (DoD). The research contributions ultimately represent unique and valuable solutions to the energy challenge facing the design of future HPC platforms. The innovations from this research have been widely disseminated through more than 50 peer-reviewed journal and conference publications, as well as several invited industry and conference seminar talks, keynotes, and tutorials. The technical outcomes of this project have thus made significant and lasting contributions toward meeting current national needs for energy-efficient and cost-effective HPC systems.
Beyond its technical objectives, this project has also had broad impact. The techniques developed here can potentially be applied to a variety of computing and communication system environments; as an example, the project's algorithms and models for energy-efficient HPC resource management were successfully applied to multi-objective resource management problems in manycore electronic chip design. Several students have been fully or partially supported by this project: a total of eight Ph.D. students, two post-doctoral fellows, three M.S. students, and seven senior undergraduate students have conducted research with the faculty. As part of K-12 outreach, four high school students were also given the opportunity to work with the senior students on this project and learn about the exciting opportunities in computer engineering. By exposing these students to diverse aspects of modeling and analysis, optimization algorithms, emerging hardware, and software applications, and by disseminating the developed ideas and outcomes via curriculum enhancements at CSU, the research has also contributed significantly to an agile high-tech workforce that will help maintain U.S. leadership in technological innovation.
Last Modified: 01/26/2018
Modified by: Sudeep Pasricha