
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: Massachusetts Institute of Technology
Initial Amendment Date: July 24, 2013
Latest Amendment Date: July 24, 2013
Award Number: 1318384
Award Instrument: Standard Grant
Program Manager: Tao Li, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2013
End Date: July 31, 2017 (Estimated)
Total Intended Award Amount: $500,000.00
Total Awarded Amount to Date: $500,000.00
Funds Obligated to Date:
History of Investigator:
Recipient Sponsored Research Office: 77 Massachusetts Ave, Cambridge, MA 02139-4301, US, (617) 253-1000
Sponsor Congressional District:
Primary Place of Performance: 77 Massachusetts Avenue, Cambridge, MA 02139-4307, US
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): COMPUTER ARCHITECTURE
Primary Program Source:
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Multicore chips are now mainstream, and increasing the number of cores per chip has become the primary way to improve performance. Current multicores rely on sophisticated cache hierarchies to mitigate the high latency, limited bandwidth, and high energy of main memory accesses, which often limit system performance. These on-chip caches consume more than half of chip area, and most of this cache space is shared among all cores. Sharing this capacity has major advantages, such as improving space utilization and accelerating core-to-core communication, but poses two fundamental problems. First, with more cores, cache accesses take longer and consume more energy, severely limiting scalability. Second, concurrently executing applications contend for the shared cache capacity, which can cause unpredictable performance degradation. The goal of this project is to redesign the cache hierarchy to make it both highly scalable and able to provide strict isolation among competing applications, enabling end-to-end performance guarantees. If successful, this work will improve the performance and energy efficiency of future processors, enabling systems with larger numbers of cores than previously possible. Moreover, these systems will eliminate interference and enforce quality-of-service guarantees among competing applications, even when those applications are latency-critical. This will enable much higher utilization of shared computing infrastructure (such as cloud computing servers), potentially saving billions of dollars in IT infrastructure and energy consumption.
To achieve the dual goals of high scalability and quality-of-service (QoS) guarantees efficiently, this project proposes an integrated hardware-software approach, where hardware exposes a small and general set of mechanisms to control cache allocations, and software uses these mechanisms to implement both partitioning and non-uniform access policies efficiently. At the hardware level, a novel cache organization provides thousands of fine-grained, spatially configurable partitions, implements lightweight monitoring and reconfiguration mechanisms to guide software policies effectively, and supports full-system scalable cache coherence cheaply. At the software level, a system-level runtime leverages this hardware to implement dynamic data classification, placement, migration, and replication mechanisms, maximizing system performance and efficiency, while at the same time enforcing the strict QoS guarantees of latency-critical workloads, transparently to applications. Combined with existing bandwidth partitioning approaches, these techniques will enforce full-system QoS guarantees by controlling all on-chip shared resources (caches, on-chip network, and memory controllers). In addition, the infrastructure and benchmarks developed as part of this project will be publicly released, allowing other researchers to build on the results of this work, and enabling the development of course projects and other educational activities in large-scale parallel computer architecture, both at MIT and elsewhere.
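To make this division of labor concrete, the sketch below illustrates the mechanism/policy split in the spirit of the abstract. All names and interfaces here are hypothetical assumptions for illustration, not the project's actual design: "hardware" exposes only per-partition size registers and miss counters, and all policy intelligence lives in a software runtime that periodically reads the monitors and reconfigures the allocations.

```python
# Hypothetical sketch of the hardware/software split described above.
# CacheBank, set_partition, and reconfigure are illustrative names,
# not the project's actual interfaces.

class CacheBank:
    """Stand-in for one spatially configurable cache bank: hardware exposes
    only per-partition size registers and per-partition miss counters."""
    def __init__(self, num_partitions: int, size_lines: int):
        self.partition_sizes = [size_lines // num_partitions] * num_partitions
        self.miss_counters = [0] * num_partitions   # filled in by hardware

    def set_partition(self, pid: int, lines: int) -> None:
        self.partition_sizes[pid] = lines           # hardware enforces this bound

def reconfigure(banks: list) -> None:
    """One epoch of the software policy: read the monitors, then shift one
    line of capacity toward the partition suffering the most misses (a
    trivial placeholder for a real allocation algorithm)."""
    for bank in banks:
        worst = max(range(len(bank.miss_counters)), key=bank.miss_counters.__getitem__)
        best = min(range(len(bank.miss_counters)), key=bank.miss_counters.__getitem__)
        if worst != best and bank.partition_sizes[best] > 0:
            bank.set_partition(best, bank.partition_sizes[best] - 1)
            bank.set_partition(worst, bank.partition_sizes[worst] + 1)

banks = [CacheBank(num_partitions=4, size_lines=1024) for _ in range(8)]
banks[0].miss_counters = [10, 500, 30, 80]          # pretend hardware counted these
reconfigure(banks)
print(banks[0].partition_sizes)                     # capacity shifts toward partition 1
```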
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Multicore chips are now mainstream, and increasing the number of cores per chip has become the primary way to improve performance. Current multicores rely on sophisticated cache hierarchies to mitigate the high latency, limited bandwidth, and high energy of main memory accesses, which limit system performance. These caches consume more than half of chip area, and most of this cache space is shared among all cores. Sharing this capacity has major advantages, such as improving space utilization and accelerating core-to-core communication, but poses two fundamental problems. First, with more cores, cache accesses take longer and consume more energy, severely limiting scalability. Second, applications contend for this shared cache capacity, causing unpredictable performance degradation. This lack of predictability limits the utilization of shared servers, where applications must often meet strict performance targets.
Intellectual Merit: The key goal of this project has been to redesign the cache hierarchy to make it both highly scalable and capable of providing strict isolation among competing applications, enabling end-to-end performance guarantees. Our research has produced the following main outcomes:
First, we have investigated and designed software-defined cache hierarchies, a new memory system organization that scales the cache hierarchy by leveraging the strengths of hardware and software. Our approach combines simple, configurable hardware mechanisms controlled by sophisticated software runtimes. Hardware exposes spatially distributed cache banks to software, allowing an OS-level runtime to build virtual cache hierarchies tailored to the needs of each application, dynamically and transparently to applications. With a single application, our design approaches the performance of the best application-specific hierarchy; with multiple applications sharing the chip, software can control how to divide resources to satisfy system-level objectives (e.g., maximizing throughput or enforcing application priorities). To enable software-defined hierarchies, we have developed novel practical optimization algorithms that reconfigure the whole system in about 1 millisecond and perform within 1% of impractically expensive solvers. We have demonstrated that these techniques yield large speedups in systems with spatially distributed caches, as well as in systems with heterogeneous memory technologies (e.g., SRAM and die-stacked DRAM). Moreover, putting software in control of the cache hierarchy enables several novel optimizations throughout the system stack. We have demonstrated these capabilities through novel techniques that perform coordinated scheduling of data and computation and leverage application-level knowledge to improve data placement.
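As a concrete illustration of the underlying allocation problem, the sketch below shows a generic greedy, marginal-utility cache partitioner over per-application miss curves. This is a simplified textbook-style approach for illustration only, not the project's actual algorithms.

```python
# A minimal sketch of utility-based cache partitioning. Each application
# supplies a miss curve: expected misses as a function of the number of
# cache units it receives. This greedy loop is an illustrative assumption,
# not the project's reconfiguration algorithm.

def partition(miss_curves: list, total_units: int) -> list:
    """Greedily give each cache unit to the app with the largest marginal
    reduction in misses. Runs in O(total_units * num_apps)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_units):
        def gain(i):
            curve = miss_curves[i]
            if alloc[i] + 1 >= len(curve):
                return 0.0
            return curve[alloc[i]] - curve[alloc[i] + 1]   # misses saved
        best = max(range(len(miss_curves)), key=gain)
        alloc[best] += 1
    return alloc

# Two apps: one with a steep miss curve (benefits a lot from cache),
# one nearly flat (streaming). The greedy loop favors the first.
curves = [[100, 60, 35, 20, 12, 8, 6, 5, 5],
          [100, 98, 96, 95, 94, 94, 94, 94, 94]]
print(partition(curves, total_units=8))   # prints [6, 2]
```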
Second, we have designed new management techniques to share hardware resources dynamically among applications while providing strict performance guarantees. To share resources safely, these techniques leverage simple hardware mechanisms to let software control resource allocations at high speed, as well as novel modeling techniques to account for the inherent performance inertia of each resource. As a result, these techniques allow much more efficient utilization of caches and cores, and dramatically improve utilization of shared servers in clusters and datacenters.
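The sketch below illustrates the flavor of such a controller, with hypothetical names and a deliberately simple rule; the project's actual techniques additionally model each resource's performance inertia. A latency-critical application is granted just enough capacity to meet its tail-latency target, and batch applications absorb the remainder.

```python
# Illustrative feedback controller for sharing with guarantees (hypothetical
# names and thresholds, not the project's actual design).

def adjust(alloc_lc: int, total: int, measured_p99_ms: float,
           target_p99_ms: float, step: int = 1) -> tuple:
    """One control epoch: grow the latency-critical allocation if it misses
    its target, shrink it only when comfortably under (20% headroom)."""
    if measured_p99_ms > target_p99_ms:
        alloc_lc = min(total, alloc_lc + step)    # violating: grow quickly
    elif measured_p99_ms < 0.8 * target_p99_ms:
        alloc_lc = max(1, alloc_lc - step)        # slack: reclaim slowly
    return alloc_lc, total - alloc_lc             # remainder goes to batch apps

alloc_lc = 8
for p99 in [4.9, 5.3, 5.2, 4.1, 3.2]:             # measured tail latencies (ms)
    alloc_lc, alloc_batch = adjust(alloc_lc, total=16,
                                   measured_p99_ms=p99, target_p99_ms=5.0)
    print(alloc_lc, alloc_batch)
```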
Third, we have designed new analytical cache modeling techniques and cache replacement policies to better understand and improve cache performance. These techniques rely on a novel probabilistic framework based on absolute reuse distances. Our modeling techniques accurately predict performance for a wide range of cache configurations and policies, enabling many system optimizations. Beyond improving performance, our replacement policies yield important qualitative benefits, such as eliminating performance cliffs, which makes cache performance smooth and predictable.
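As a toy illustration of this style of modeling (a classical approximation used here for exposition, not the project's exact framework): in a fully associative cache of S lines with random replacement, a line survives each miss-triggered eviction with probability (1 - 1/S). If an access has absolute reuse distance d (accesses since its line was last touched), roughly m*d of those intervening accesses miss, where m is the miss ratio, so P(hit | d) ~ (1 - 1/S)^(m*d). The miss ratio then solves a fixed point over the reuse distance distribution:

```python
# Toy analytical cache model based on absolute reuse distances (illustrative
# approximation, not the project's exact framework). The miss ratio m solves
#   m = 1 - sum_d P(d) * (1 - 1/S)**(m * d)
# which we find by simple fixed-point iteration.

def miss_ratio(reuse_dist: dict, S: int, iters: int = 100) -> float:
    """reuse_dist maps absolute reuse distance d -> probability of an access
    having that distance; S is the cache size in lines."""
    survive = 1.0 - 1.0 / S                       # survives one random eviction
    m = 0.5                                       # initial guess
    for _ in range(iters):
        hit = sum(p * survive ** (m * d) for d, p in reuse_dist.items())
        m = 1.0 - hit
    return m

# 90% of accesses reuse quickly (d=8); 10% reuse far away (d=4096).
dist = {8: 0.9, 4096: 0.1}
for S in [16, 256, 4096]:
    print(S, round(miss_ratio(dist, S), 3))       # miss ratio falls as S grows
```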
To prototype and evaluate these techniques, we have developed a substantial amount of infrastructure, including a state-of-the-art parallel simulator and a diverse benchmark suite. We have released this infrastructure under open-source licenses, allowing others to build on the results of our work, both in research and in the classroom.
Broader Impacts: The techniques developed in this project significantly improve the performance and energy efficiency of multicore processors, enabling systems with a larger number of cores than previously possible. Moreover, by eliminating interference and enforcing quality-of-service guarantees among competing applications, these techniques enable much higher utilization of shared computing infrastructure (such as cloud computing servers), reducing both IT infrastructure costs and energy consumption.
Finally, this project has supported the training and professional development of six graduate students.
Last Modified: 11/01/2017
Modified by: Daniel Sanchez Martin