Award Abstract # 1302693
SHF:Medium: Energy Efficient and Stochastically Robust Resource Allocation for Heterogeneous Computing

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: COLORADO STATE UNIVERSITY
Initial Amendment Date: May 8, 2013
Latest Amendment Date: April 30, 2015
Award Number: 1302693
Award Instrument: Continuing Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
 (703)292-7498
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: May 15, 2013
End Date: December 31, 2017 (Estimated)
Total Intended Award Amount: $850,000.00
Total Awarded Amount to Date: $850,000.00
Funds Obligated to Date: FY 2013 = $554,144.00
FY 2015 = $295,856.00
History of Investigator:
  • Sudeep Pasricha (Principal Investigator)
    sudeep@colostate.edu
  • Howard Siegel (Co-Principal Investigator)
  • Patrick Burns (Co-Principal Investigator)
  • Anthony Maciejewski (Co-Principal Investigator)
Recipient Sponsored Research Office: Colorado State University
601 S HOWES ST
FORT COLLINS
CO  US  80521-2807
(970)491-6355
Sponsor Congressional District: 02
Primary Place of Performance: Colorado State University
200 W. Lake St.
Fort Collins
CO  US  80521-4593
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): LT9CXX8L19G1
Parent UEI:
NSF Program(s): Software & Hardware Foundation,
HIGH-PERFORMANCE COMPUTING
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVIT
01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7924, 7942
Program Element Code(s): 779800, 794200
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Parallel and distributed computing systems are often a heterogeneous mix of machines. As these systems continue to expand rapidly in capability, their computational energy expenditure has skyrocketed, requiring elaborate cooling facilities that themselves consume significant energy. The need for energy-efficient resource management is thus paramount. Moreover, these systems frequently experience degraded performance and high power consumption due to circumstances that change unpredictably, such as thermal hotspots caused by load imbalances or sudden machine failures. As the complexity of systems grows, so does the importance of making system operation robust against these uncertainties. The goal of this award is to study stochastic models, metrics, and algorithmic strategies for deriving resource allocations that are energy-efficient and robust. The research focuses on deriving stochastic robustness and energy models from real-world data from heterogeneous computing machines; applying stochastic models to resource management strategies that co-optimize performance, robustness, computation energy, and cooling energy; developing novel schemes for real-time thermal modeling; and driving and validating the research with feedback collected from real-world petascale systems (Yellowstone at the National Center for Atmospheric Research and Jaguar at Oak Ridge National Laboratory) and terascale systems (Colorado State University's ISTeC cluster and clusters at Oak Ridge National Laboratory).

The research is expected to yield resource management strategies that are resilient to various sources of run-time uncertainty while also accounting for the dynamics of temperature variations and cooling capacity, meeting performance guarantees with unprecedented gains in system energy efficiency in high performance computing environments. By lowering the energy costs and the impact of uncertainties associated with computing, this research will ultimately make high performance computing accessible to a wider population of researchers and scientific problems. In the long term, the theoretical foundations and tools that emerge from this research will play a vital role in achieving the grand promise of sustainable computing at extreme scales within realistic power budgets. The broader impacts of the research include: incorporating research results into all levels of teaching, including graduate, undergraduate, secondary, and even elementary education; increasing participation by underrepresented groups; and fostering close ties with industry and government labs to transfer the developed knowledge quickly into real-world deployments.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 45)
Bhavesh Khemka, Dylan Machovec, Christopher Blandin, Howard Jay Siegel, Salim Hariri, Ahmed Louri, Cihan Tunc, Farah Fargo, and Anthony A. Maciejewski "Resource Management in Heterogeneous Parallel Computing Environments with Soft and Hard Deadlines" Metaheuristics International Conference (MIC 2015) , 2015
Bhavesh Khemka, Ryan Friese, Luis Diego Briceno, Howard Jay Siegel, Anthony A. Maciejewski, Gregory A. Koenig, Chris Groer, Gene Okonski, Marcia M. Hilton, Rajendra Rambharos, and Steve Poole "Utility Functions and Resource Management in an Oversubscribed Heterogeneous Computing Environment" IEEE Transactions on Computers , v.64 , 2015
Bhavesh Khemka, Ryan Friese, Sudeep Pasricha, Anthony A. Maciejewski, Howard Jay Siegel, Gregory A. Koenig, Sarah Powers, Marcia Hilton, Rajendra Rambharos, and Steve Poole "Utility Maximizing Dynamic Resource Management in an Oversubscribed Energy-Constrained Heterogeneous Computing System" Sustainable Computing: Informatics and Systems , v.5 , 2015
Bhavesh Khemka, Ryan Friese, Sudeep Pasricha, Anthony A. Maciejewski, Howard Jay Siegel, Gregory A. Koenig, Sarah Powers, Marcia Hilton, Rajendra Rambharos, Mike Wright, and Steve Poole "Comparison of Energy-Constrained Resource Allocation Heuristics under Different Task Management Environments" 2015 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2015) , 2015
C. Tunc, D. Machovec, N. Kumbhare, A. Akoglu, S. Hariri, B. Khemka, H. J. Siegel "Value of Service Based Resource Management for Large-Scale Computing Systems" Cluster Computing , v.20 , 2017
C. Tunc, N. Kumbhare, A. Akoglu, S. Hariri, D. Machovec, and H. J. Siegel "Value of Service Based Task Scheduling for Cloud Computing Systems" IEEE International Conference on Cloud and Autonomic Computing (ICCAC'16) , 2016
D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A. A. Maciejewski, D. Bader, and H.J. Siegel "HPC Node Performance and Energy Modeling Under the Uncertainty of Application Co-Location" Journal of Supercomputing , v.72 , 2016
D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, H. J. Siegel "Optimizing Checkpoint Intervals for Reduced Energy Use in Exascale Systems" IEEE Workshop on Energy-efficient Networks of Computers (E2NC): From the Chip to the Cloud, co-organized with IEEE 2017 International Green and Sustainable Computing Conference , 2017
D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel "An Analysis of Resilience Techniques for Exascale Computing Platforms" 19th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), co-organized with IEEE International Parallel and Distributed Processing Symposium (IPDPS) , 2017
D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel "A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems" 6th IEEE International Symposium on Cloud and Service Computing (SC-2) , 2016
D. Machovec, B. Khemka, S. Pasricha, A. A. Maciejewski, H. Jay Siegel, G. A. Koenig, M. Wright, M. Hilton, R. Rambharos, N. Imam "Dynamic Resource Management for Parallel Tasks in an Oversubscribed Energy-Constrained Heterogeneous Environment" International Heterogeneity in Computing Workshop (HCW) co-located with IEEE International Parallel & Distributed Processing Symposium IPDPS , 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

High Performance Computing (HPC) systems are widely used in fields as diverse as weather modeling, financial predictions, fluid dynamics, and big data searches, and are opening the doors to new discoveries. For instance, the field of genomics relies heavily on very large supercomputers. Because of genomics, we have new drugs, ways of diagnosing disease, and crime investigation techniques. But the energy costs of operating HPC systems, whether supercomputers, data centers, or clusters of machines, are becoming prohibitive as HPC systems evolve. As an example, the K supercomputer in Japan consumes enough energy to power 10,000 homes, at a cost of $10 million/year. The petascale Yellowstone supercomputer at NCAR in Wyoming, USA, has an energy expenditure of $2 million/year. Even smaller terascale HPC clusters, such as those at Colorado State University (CSU) and Oak Ridge National Lab (ORNL), have significant computational and cooling energy costs, ranging from $20K to $200K per year. These energy costs impose a huge monetary burden on the scientific community, taking HPC systems out of the reach of those who have the greatest potential to make groundbreaking discoveries that can benefit society. Prior efforts to improve energy efficiency in computing are unfortunately either not applicable to large HPC platforms or ignore important facets of the problem, such as cooling/thermal costs and uncertainties that often surface in HPC platforms at runtime.

The overarching theme of this proposal has been to devise a new software-based resource management framework that can intelligently manage the execution of applications on large-scale HPC platforms, while minimizing the energy needed for computation and cooling. The fundamental innovation that has emerged from this project is the discovery of the complex relationship between cooling and computation energy in HPC platforms, and its characterization using stochastic performance, robustness, and power models derived from real-world data (from terascale and petascale HPC systems). The insights from these models have guided the design of new strategies to co-optimize computing performance, robustness, computation power, and cooling power in large-scale HPC platforms. These strategies have further benefitted from new models that have been developed for 1) quantifying the impact of interference in shared memory and network subsystems; 2) fast real-time thermal characterization; and 3) cooling energy costs and capacity.
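The stochastic robustness idea described above can be illustrated with a small hypothetical sketch (not the project's actual code): treat each task's execution time on a machine as a random variable and estimate, by Monte Carlo sampling, the probability that the machine's workload completes by a deadline. An allocation is more robust when this probability is higher. The function name, task distributions, and deadline below are all illustrative assumptions.

```python
import random

# Hypothetical sketch of a stochastic robustness metric: the probability
# that a machine's assigned workload finishes before a deadline, estimated
# by Monte Carlo sampling of uncertain per-task execution times.

def stochastic_robustness(task_time_samplers, deadline, trials=10000, seed=0):
    """Estimate P(sum of task execution times <= deadline).

    task_time_samplers: callables that each draw one execution-time
    sample (given an RNG) for a task assigned to the machine.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(sampler(rng) for sampler in task_time_samplers)
        if total <= deadline:
            hits += 1
    return hits / trials

# Example: three tasks with normally distributed execution times
# (means 10, 12, 8; std. dev. 2 each, truncated at 0), deadline 36.
tasks = [lambda rng, m=m: max(0.0, rng.gauss(m, 2.0)) for m in (10, 12, 8)]
print(stochastic_robustness(tasks, deadline=36.0))  # prints an estimate near 0.96
```

A resource manager built on such a metric can compare candidate task-to-machine assignments by their completion-probability estimates rather than by mean execution time alone, which is what makes the allocation robust to run-time uncertainty.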

Rigorous experimental analysis has shown that the developed models and framework significantly outperform the best known prior efforts to quantify and optimize energy usage in HPC platforms. This project has also contributed to the integration of energy-efficient resource management in production HPC systems at ORNL and the Department of Defense (DoD). The research contributions ultimately represent unique and valuable solutions to overcome the energy challenge facing the design of future HPC platforms. The innovations from this research have been widely disseminated through over 50 peer-reviewed scientific journal/conference publications, as well as several invited industry and conference seminar talks, keynotes, and tutorials. The technical outcomes of this project have thus made significant and lasting contributions towards the goal of meeting current national needs for energy-efficient and cost-effective HPC systems.

Beyond the technical objectives accomplished, this project has also had an immense broader impact. The techniques developed as part of this project have the potential to be applied to a variety of computing and communication system environments. As an example, the algorithms and models for energy-efficient HPC resource management from the project were successfully applied to solve multi-objective resource management problems in manycore electronic chip design. Several students have been fully or partially supported by this project. A total of eight Ph.D. students, two post-doctoral fellows, three M.S. students, and seven senior undergraduate students have conducted research with the faculty as part of this project. As part of K-12 outreach, four high school students have also been given opportunities to work with the senior students on this project and learn about the exciting opportunities in computer engineering. By exposing these students to the diverse aspects of modeling and analysis, optimization algorithms, emerging hardware, and software applications, and by disseminating the developed ideas and outcomes via curriculum enhancements at CSU, the research has also significantly contributed to an agile high-tech workforce that will maintain continued USA leadership in technological innovation.


Last Modified: 01/26/2018
Modified by: Sudeep Pasricha
