
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: Massachusetts Institute of Technology
Initial Amendment Date: July 24, 2013
Latest Amendment Date: July 24, 2013
Award Number: 1318384
Award Instrument: Standard Grant
Program Manager: Tao Li, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2013
End Date: July 31, 2017 (Estimated)
Total Intended Award Amount: $500,000.00
Total Awarded Amount to Date: $500,000.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: 77 Massachusetts Ave, Cambridge, MA 02139-4301, US, (617) 253-1000
Sponsor Congressional District:
Primary Place of Performance: 77 Massachusetts Avenue, Cambridge, MA 02139-4307, US
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): COMPUTER ARCHITECTURE
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Multicore chips are now mainstream, and increasing the number of cores per chip has become the primary way to improve performance. Current multicores rely on sophisticated cache hierarchies to mitigate the high latency, limited bandwidth, and high energy of main memory accesses, which often limit system performance. These on-chip caches consume more than half of chip area, and most of this cache space is shared among all cores. Sharing this capacity has major advantages, such as improving space utilization and accelerating core-to-core communication, but poses two fundamental problems. First, with more cores, cache accesses take longer and consume more energy, severely limiting scalability. Second, concurrently executing applications contend for the shared cache capacity, which can cause unpredictable performance degradation. The goal of this project is to redesign the cache hierarchy to make it both highly scalable and able to provide strict isolation among competing applications, enabling end-to-end performance guarantees. If successful, this work will improve the performance and energy efficiency of future processors, enabling systems with larger numbers of cores than previously possible. Moreover, these systems will eliminate interference and enforce quality-of-service guarantees among competing applications, even when those applications are latency-critical. This will enable much higher utilization of shared computing infrastructure (such as cloud computing servers), potentially saving billions of dollars in IT infrastructure and energy consumption.
To achieve the dual goals of high scalability and quality-of-service (QoS) guarantees efficiently, this project proposes an integrated hardware-software approach, where hardware exposes a small and general set of mechanisms to control cache allocations, and software uses these mechanisms to implement both partitioning and non-uniform access policies efficiently. At the hardware level, a novel cache organization provides thousands of fine-grained, spatially configurable partitions, implements lightweight monitoring and reconfiguration mechanisms to guide software policies effectively, and supports full-system scalable cache coherence cheaply. At the software level, a system-level runtime leverages this hardware to implement dynamic data classification, placement, migration, and replication mechanisms, maximizing system performance and efficiency, while at the same time enforcing the strict QoS guarantees of latency-critical workloads, transparently to applications. Combined with existing bandwidth partitioning approaches, these techniques will enforce full-system QoS guarantees by controlling all on-chip shared resources (caches, on-chip network, and memory controllers). In addition, the infrastructure and benchmarks developed as part of this project will be publicly released, allowing other researchers to build on the results of this work, and enabling the development of course projects and other educational activities in large-scale parallel computer architecture, both at MIT and elsewhere.
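To make this division of labor concrete, the sketch below illustrates the mechanism/policy split in the spirit of the abstract. All names and interfaces here are hypothetical assumptions for illustration, not the project's actual design: "hardware" exposes only per-partition size registers and miss counters, and all policy intelligence lives in a software runtime that periodically reads the monitors and reconfigures the allocations.

```python
# Hypothetical sketch of the hardware/software split described above.
# CacheBank, set_partition, and reconfigure are illustrative names,
# not the project's actual interfaces.

class CacheBank:
    """Stand-in for one spatially configurable cache bank: hardware exposes
    only per-partition size registers and per-partition miss counters."""
    def __init__(self, num_partitions: int, size_lines: int):
        self.partition_sizes = [size_lines // num_partitions] * num_partitions
        self.miss_counters = [0] * num_partitions   # filled in by hardware

    def set_partition(self, pid: int, lines: int) -> None:
        self.partition_sizes[pid] = lines           # hardware enforces this bound

def reconfigure(banks: list) -> None:
    """One epoch of the software policy: read the monitors, then shift one
    line of capacity toward the partition suffering the most misses (a
    trivial placeholder for a real allocation algorithm)."""
    for bank in banks:
        worst = max(range(len(bank.miss_counters)), key=bank.miss_counters.__getitem__)
        best = min(range(len(bank.miss_counters)), key=bank.miss_counters.__getitem__)
        if worst != best and bank.partition_sizes[best] > 0:
            bank.set_partition(best, bank.partition_sizes[best] - 1)
            bank.set_partition(worst, bank.partition_sizes[worst] + 1)

banks = [CacheBank(num_partitions=4, size_lines=1024) for _ in range(8)]
banks[0].miss_counters = [10, 500, 30, 80]          # pretend hardware counted these
reconfigure(banks)
print(banks[0].partition_sizes)                     # capacity shifts toward partition 1
```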
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Multicore chips are now mainstream, and increasing the number of cores per chip has become the primary way to improve performance. Current multicores rely on sophisticated cache hierarchies to mitigate the high latency, limited bandwidth, and high energy of main memory accesses, which limit system performance. These caches consume more than half of chip area, and most of this cache space is shared among all cores. Sharing this capacity has major advantages, such as improving space utilization and accelerating core-to-core communication, but poses two fundamental problems. First, with more cores, cache accesses take longer and consume more energy, severely limiting scalability. Second, applications contend for this shared cache capacity, causing unpredictable performance degradation. This lack of predictability limits the utilization of shared servers, where applications must often meet strict performance targets.
Intellectual Merit: The key goal of this project has been to redesign the cache hierarchy to make it both highly scalable and capable of providing strict isolation among competing applications, enabling end-to-end performance guarantees. Our research has produced the following main outcomes:
First, we have investigated and designed software-defined cache hierarchies, a new memory system organization that scales the cache hierarchy by leveraging the strengths of hardware and software. Our approach combines simple, configurable hardware mechanisms controlled by sophisticated software runtimes. Hardware exposes spatially distributed cache banks to software, allowing an OS-level runtime to build virtual cache hierarchies tailored to the needs of each application, dynamically and transparently to applications. With a single application, our design approaches the performance of the best application-specific hierarchy; with multiple applications sharing the chip, software can control how to divide resources to satisfy system-level objectives (e.g., maximizing throughput or enforcing application priorities). To enable software-defined hierarchies, we have developed novel practical optimization algorithms that reconfigure the whole system in about 1 millisecond and perform within 1% of impractically expensive solvers. We have demonstrated that these techniques yield large speedups in systems with spatially distributed caches, as well as in systems with heterogeneous memory technologies (e.g., SRAM and die-stacked DRAM). Moreover, putting software in control of the cache hierarchy enables several novel optimizations throughout the system stack. We have demonstrated these capabilities through novel techniques that perform coordinated scheduling of data and computation and leverage application-level knowledge to improve data placement.
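As a concrete illustration of the underlying allocation problem, the sketch below shows a generic greedy, marginal-utility cache partitioner over per-application miss curves. This is a simplified textbook-style approach for illustration only, not the project's actual algorithms.

```python
# A minimal sketch of utility-based cache partitioning. Each application
# supplies a miss curve: expected misses as a function of the number of
# cache units it receives. This greedy loop is an illustrative assumption,
# not the project's reconfiguration algorithm.

def partition(miss_curves: list, total_units: int) -> list:
    """Greedily give each cache unit to the app with the largest marginal
    reduction in misses. Runs in O(total_units * num_apps)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_units):
        def gain(i):
            curve = miss_curves[i]
            if alloc[i] + 1 >= len(curve):
                return 0.0
            return curve[alloc[i]] - curve[alloc[i] + 1]   # misses saved
        best = max(range(len(miss_curves)), key=gain)
        alloc[best] += 1
    return alloc

# Two apps: one with a steep miss curve (benefits a lot from cache),
# one nearly flat (streaming). The greedy loop favors the first.
curves = [[100, 60, 35, 20, 12, 8, 6, 5, 5],
          [100, 98, 96, 95, 94, 94, 94, 94, 94]]
print(partition(curves, total_units=8))   # prints [6, 2]
```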
Second, we have designed new management techniques to share hardware resources dynamically among applications while providing strict performance guarantees. To share resources safely, these techniques leverage simple hardware mechanisms to let software control resource allocations at high speed, as well as novel modeling techniques to account for the inherent performance inertia of each resource. As a result, these techniques allow much more efficient utilization of caches and cores, and dramatically improve utilization of shared servers in clusters and datacenters.
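The sketch below illustrates the flavor of such a controller, with hypothetical names and a deliberately simple rule; the project's actual techniques additionally model each resource's performance inertia. A latency-critical application is granted just enough capacity to meet its tail-latency target, and batch applications absorb the remainder.

```python
# Illustrative feedback controller for sharing with guarantees (hypothetical
# names and thresholds, not the project's actual design).

def adjust(alloc_lc: int, total: int, measured_p99_ms: float,
           target_p99_ms: float, step: int = 1) -> tuple:
    """One control epoch: grow the latency-critical allocation if it misses
    its target, shrink it only when comfortably under (20% headroom)."""
    if measured_p99_ms > target_p99_ms:
        alloc_lc = min(total, alloc_lc + step)    # violating: grow quickly
    elif measured_p99_ms < 0.8 * target_p99_ms:
        alloc_lc = max(1, alloc_lc - step)        # slack: reclaim slowly
    return alloc_lc, total - alloc_lc             # remainder goes to batch apps

alloc_lc = 8
for p99 in [4.9, 5.3, 5.2, 4.1, 3.2]:             # measured tail latencies (ms)
    alloc_lc, alloc_batch = adjust(alloc_lc, total=16,
                                   measured_p99_ms=p99, target_p99_ms=5.0)
    print(alloc_lc, alloc_batch)
```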
Third, we have designed new analytical cache modeling techniques and cache replacement policies to better understand and improve cache performance. These techniques rely on a novel probabilistic framework based on absolute reuse distances. Our modeling techniques accurately predict performance for a wide range of cache configurations and policies, enabling many system optimizations. Beyond improving performance, our replacement policies yield important qualitative benefits, such as eliminating performance cliffs, which makes cache performance smooth and predictable.
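As a toy illustration of this style of modeling (a classical approximation used here for exposition, not the project's exact framework): in a fully associative cache of S lines with random replacement, a line survives each miss-triggered eviction with probability (1 - 1/S). If an access has absolute reuse distance d (accesses since its line was last touched), roughly m*d of those intervening accesses miss, where m is the miss ratio, so P(hit | d) ~ (1 - 1/S)^(m*d). The miss ratio then solves a fixed point over the reuse distance distribution:

```python
# Toy analytical cache model based on absolute reuse distances (illustrative
# approximation, not the project's exact framework). The miss ratio m solves
#   m = 1 - sum_d P(d) * (1 - 1/S)**(m * d)
# which we find by simple fixed-point iteration.

def miss_ratio(reuse_dist: dict, S: int, iters: int = 100) -> float:
    """reuse_dist maps absolute reuse distance d -> probability of an access
    having that distance; S is the cache size in lines."""
    survive = 1.0 - 1.0 / S                       # survives one random eviction
    m = 0.5                                       # initial guess
    for _ in range(iters):
        hit = sum(p * survive ** (m * d) for d, p in reuse_dist.items())
        m = 1.0 - hit
    return m

# 90% of accesses reuse quickly (d=8); 10% reuse far away (d=4096).
dist = {8: 0.9, 4096: 0.1}
for S in [16, 256, 4096]:
    print(S, round(miss_ratio(dist, S), 3))       # miss ratio falls as S grows
```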
To prototype and evaluate these techniques, we have developed a substantial amount of infrastructure, including a state-of-the-art parallel simulator and a diverse benchmark suite. We have released this infrastructure under open-source licenses, allowing others to build on the results of our work, both in research and in the classroom.
Broader Impacts: The techniques developed in this project significantly improve the performance and energy efficiency of multicore processors, enabling systems with a larger number of cores than previously possible. Moreover, by eliminating interference and enforcing quality-of-service guarantees among competing applications, these techniques enable much higher utilization of shared computing infrastructure (such as cloud computing servers), reducing both IT infrastructure costs and energy consumption.
Finally, this project has supported the training and professional development of six graduate students.
Last Modified: 11/01/2017
Modified by: Daniel Sanchez Martin