Award Abstract # 1117147
SHF: Small: Collaborative Research: ShapeShifting and PubSub for Tailoring Memory Accesses and Communication in Heterogeneous Multiprocessors

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: THE TRUSTEES OF PRINCETON UNIVERSITY
Initial Amendment Date: June 21, 2011
Latest Amendment Date: June 21, 2011
Award Number: 1117147
Award Instrument: Standard Grant
Program Manager: Tao Li
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2011
End Date: August 31, 2016 (Estimated)
Total Intended Award Amount: $225,000.00
Total Awarded Amount to Date: $225,000.00
Funds Obligated to Date: FY 2011 = $225,000.00
History of Investigator:
  • Margaret Martonosi (Principal Investigator)
Recipient Sponsored Research Office: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
(609)258-3090
Sponsor Congressional District: 12
Primary Place of Performance: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): NJ1YPQXQG7U5
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001112DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7923, 7941
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Over the past decade or more, microprocessors have faced increasing challenges in achieving high performance for current and emerging software applications while abiding by severe power and thermal limits. In response, industry has turned to approaches that use specialized graphics and computational hardware and complex memory organizations. The end result is that computer systems have become more heterogeneous and complex, in ways that make it difficult for programmers to write efficient, high-performance software. Software tuned to run on one implementation will often not run at all, or will perform poorly or unpredictably, when ported even to a different implementation in the same chip family.

The objective of this research effort is to design and evaluate system and hardware support that tailors memory and data access/movements to improve performance and power efficiency, while also easing the issues of programmability and of tuning software for individual chip characteristics.
The two key themes of this work are ShapeShifting and PubSub data abstractions. ShapeShifting refers to optimizations and hardware support structures that allow data to be transformed in layout, in order to support faster access, more efficient use of memory, and other attributes that improve power and performance. In preliminary experiments, even a software-only implementation of ShapeShifting improves performance by 15%. PubSub data abstractions offer methods for individual processors to indicate interest (or disinterest) in updates to particular program variables. These abstractions form the underpinning for memory optimizations tailored to the application's memory usage patterns. By mitigating false sharing, encouraging coarse-grained fetches, and reducing coherence broadcasts to uninterested cores, PubSub has the potential to improve the power and performance efficiency of multi-core implementations by a factor of 2X or more.
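As a concrete illustration of the PubSub idea (a hypothetical software sketch, not the project's implementation), the following Python fragment shows cores subscribing to the variables they care about, so that a store notifies only the interested cores rather than broadcasting to all of them; the class and method names are invented for this example.

# Minimal, hypothetical sketch of the PubSub abstraction described above.
from collections import defaultdict

class PubSubDirectory:
    def __init__(self):
        self.subscribers = defaultdict(set)    # variable name -> set of core ids
        self.messages_sent = 0                 # proxy for coherence traffic

    def subscribe(self, core_id, var):
        self.subscribers[var].add(core_id)     # core declares interest

    def unsubscribe(self, core_id, var):
        self.subscribers[var].discard(core_id) # core declares disinterest

    def publish(self, writer_id, var, value):
        # Notify only the interested cores (other than the writer) of the update.
        targets = self.subscribers[var] - {writer_id}
        self.messages_sent += len(targets)
        return {core: (var, value) for core in targets}

# Usage: with 16 cores, a broadcast would cost 15 messages per store; here
# only the two subscribed cores are notified.
directory = PubSubDirectory()
directory.subscribe(3, "histogram")
directory.subscribe(7, "histogram")
updates = directory.publish(writer_id=0, var="histogram", value=42)
print(updates)                  # {3: ('histogram', 42), 7: ('histogram', 42)}
print(directory.messages_sent)  # 2 instead of 15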

The research program is targeting several types of broad impact. First, the simulators and tools developed by this project will be released as free, open-source software. Second, the results can enhance the performance and energy efficiency of future parallel hardware. Energy efficiency is of particular concern from a national economic and strategic standpoint, given the growing electricity consumption of computer systems and the important role of the memory hierarchy in influencing computer power consumption.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Keitaro Oka, Wenhao Jia, Margaret Martonosi, and Koji Inoue. "Characterization and Cross-Platform Analysis of High-Throughput Accelerators." IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015.
Daniel Lustig, Michael Pellauer, and Margaret Martonosi. "PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models." International Symposium on Microarchitecture (MICRO), 2014.
Daniel Lustig and Margaret Martonosi. "Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization." International Symposium on High-Performance Computer Architecture (HPCA), 2013.
Yavuz Yetim, Sharad Malik, and Margaret Martonosi. "EPROF: An Energy/Performance/Reliability Optimization Framework for Streaming Applications." Asia and South Pacific Design Automation Conference (ASP-DAC), 2012.
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. "Characterizing and Improving the Use of Demand-Fetched Caches in GPUs." 26th International Conference on Supercomputing (ICS), 2012.
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. "Stargazer: Automated Regression-Based GPU Design Space Exploration." IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. "Starchart: Hardware and Software Optimization Using Recursive Partitioning Regression Trees." International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. "MRPB: Memory Request Prioritization for Massively Parallel Processors." IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2014.
Wenhao Jia, Elba Garza, Kelly A. Shaw, and Margaret Martonosi. "GPU Performance and Power Tuning Using Regression Trees." ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, 2015. DOI: 10.1145/2736287.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Over the past decade or more, microprocessors have faced increasing challenges in achieving high performance for current and emerging applications while abiding by severe power limits. In response, industry has turned to multi-core approaches, often with specialized functional units or heterogeneous cores. Memory hierarchies have likewise become more heterogeneous and complex, with multiple levels of shared and private caches, software-managed scratchpads, and configurable cache/scratchpad combinations.

While embracing heterogeneity has allowed CMPs and SoCs to reach application power and performance targets, it generally does so at the expense of programmability and performance portability. Software tuned to run on one implementation will often not run at all, or will perform poorly or unpredictably, when ported even to a different implementation in the same chip family. The goal of our research is to propose, design, and evaluate system and hardware support that allows the tailoring of memory and data accesses/movements for performance and power efficiency, while also easing the issues of programmability and of tuning software for individual chip characteristics.

Throughout the project, a key focus has been identifying and mitigating communication bottlenecks in CPU-GPU systems. A second goal has been research on performance estimation and design space exploration techniques that identify the hardware and software parameter settings offering the best performance, the best power dissipation, or the best portability across several platforms. Finally, we also explored verification methods to ascertain the correctness of communication between CPUs, GPUs, and accelerators.

Design Space Exploration Tools: We developed the Starchart and Stargazer tools for estimating performance and power across complicated GPU design spaces by using statistical techniques for regression and partitioning. Published at PACT 2013, Starchart automates factor selection and automatically identifies key "split points" in a design space. Split points mark regions of the design space where, for example, performance saturates or power increases steeply. Starchart helps hardware designers and programmers identify good operating points for GPU-based systems.
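To make the recursive-partitioning approach concrete, the following sketch fits a regression tree to a synthetic GPU design space using scikit-learn. The parameter names and the performance model are invented for illustration and are not the project's data or the Starchart implementation.

# Regression-tree design space exploration in the spirit of Starchart/Stargazer,
# on synthetic data. The GPU parameters and the performance model are invented.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 500
threads_per_block = rng.choice([64, 128, 256, 512, 1024], size=n)
l1_kb = rng.choice([16, 32, 48, 64], size=n)
mem_bw_gbps = rng.uniform(100, 400, size=n)

# Synthetic "measured" throughput: saturates beyond 256 threads per block and is
# bandwidth-limited for small caches -- the kind of split points a tree exposes.
perf = (np.minimum(threads_per_block, 256) / 256.0
        * np.where(l1_kb >= 32, 1.0, mem_bw_gbps / 400.0)
        + rng.normal(0, 0.02, size=n))

X = np.column_stack([threads_per_block, l1_kb, mem_bw_gbps])
tree = DecisionTreeRegressor(max_depth=3).fit(X, perf)

# The printed tree shows the automatically chosen split points, e.g.
# "threads_per_block <= 192.0" or "l1_kb <= 24.0", marking where behavior changes.
print(export_text(tree, feature_names=["threads_per_block", "l1_kb", "mem_bw_gbps"]))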

Cross-Platform Performance Estimation: Building on the Starchart and Stargazer work, we have also done research on using the performance results from one platform (Xeon Phi) to estimate the performance of codes if they were ported to another platform (NVIDIA GPU). The accuracy of our performance estimators is very good (within 5%) for compute-bound kernels. Kernels with more data communication are more challenging for this cross-platform estimation, but we have indicator variables (i.e. confidence predictors) that allow us to identify (before porting) the cases where the estimate is not likely to be accurate.
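A minimal sketch of this cross-platform estimation idea appears below; the numbers are invented, and a simple arithmetic-intensity threshold stands in for the project's confidence predictors.

# Map kernel measurements from one platform ("phi") to estimated runtimes on
# another ("gpu"), flagging low-confidence estimates for memory-heavy kernels.
# All data and the confidence rule are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Per-kernel features measured on the source platform:
# [runtime_ms_on_phi, arithmetic_intensity (flops per byte)]
X_train = np.array([[12.0, 8.0], [30.0, 6.5], [5.0, 9.0], [45.0, 7.2], [20.0, 5.8]])
y_train = np.array([4.1, 10.2, 1.8, 15.0, 7.5])   # measured GPU runtimes (ms)

model = LinearRegression().fit(X_train, y_train)

def estimate_gpu_runtime(phi_ms, arith_intensity, min_confident_intensity=4.0):
    pred = model.predict([[phi_ms, arith_intensity]])[0]
    # Communication-heavy (low arithmetic intensity) kernels are harder to
    # project across platforms, so mark those estimates as low confidence.
    confident = arith_intensity >= min_confident_intensity
    return pred, confident

print(estimate_gpu_runtime(25.0, 7.0))   # compute-bound: (estimate, True)
print(estimate_gpu_runtime(25.0, 2.0))   # memory-bound:  (estimate, False)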

GPU Performance Optimization: Building on our extensive performance estimation and analysis work described above, we also published new techniques for reducing the performance impact of memory boundedness in GPUs (MRPB, HPCA 2014) and for adjusting thread operating points to maximize performance while reducing resource usage.

Decoupled Supply-Compute (DeSC) Approaches: While heterogeneous and specialized parallelism offers great leverage for improving computation performance at manageable power, its effective use raises additional challenges. As specialized accelerators speed up computation, the communication and memory operations that feed them account for an increasing share of the remaining execution time. In addition, the software-managed communication tailoring used to reduce communication cost often increases software complexity and reduces performance portability. Our DeSC architectures improve the performance, programmability, and software portability of accelerator-based systems by employing automatic compiler techniques to separate data access and address calculations from value computations. Once separated, the code is targeted at either the accelerator itself (for the compute slice) or a data supply unit that feeds it. The data supply unit can either be a general-purpose core or another accelerator tailored to this role.
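The following sketch (illustrative only, not the DeSC compiler's actual output) conveys the flavor of the split: a supply slice performs the address calculations and loads and streams operands through a queue to a compute slice that performs only the value computation, just as a data supply unit would feed an accelerator.

# Hypothetical decoupled supply-compute split of a simple gather-and-square loop.
from collections import deque

data = list(range(100))
indices = [3, 7, 7, 50, 99, 0]

def original_loop():
    total = 0
    for i in indices:
        total += data[i] * data[i]    # address calculation, access, and compute interleaved
    return total

def supply_slice(q):
    # Address calculation and memory access only; runs ahead of the compute unit.
    for i in indices:
        q.append(data[i])

def compute_slice(q):
    # Value computation only; sees a stream of operands, never an address.
    total = 0
    while q:
        v = q.popleft()
        total += v * v
    return total

q = deque()
supply_slice(q)
assert compute_slice(q) == original_loop()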

Memory Consistency Model Translation and Verification: Another key thrust of our work was on formal techniques for specifying, verifying, and translating between memory consistency models. This work is central to the heterogeneous communication issues of this research grant, because it allows different compute elements to transfer data in ways that are high-performance, interoperable, and formally verifiable. The key to our approach is the development of microarchitecture-level happens-before graphs (uHBGs). Nodes in these graphs correspond to hardware units, and edges correspond to known event orderings. By considering all possible event orderings by which different values might flow through complex memory hierarchies, memory consistency models can be checked. Our initial work focused on the microarchitecture itself, while subsequent work expanded it to consider cache coherence and address translation interactions.
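As a toy illustration of the happens-before-graph approach (the events and edges below are invented; real uHBGs model instructions flowing through specific pipeline stages and caches), checking an observed execution reduces to testing its ordering graph for cycles.

# Events are nodes, known orderings are directed edges; an observed outcome is
# permitted only if the resulting graph is acyclic.
def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge -> cycle
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# Orderings observed for a small litmus test: program order on each core plus
# the order in which values reached memory. A cycle means the observed outcome
# could not have been produced under the consistency model being checked.
observed = [("core0: St x", "core0: St y"),   # program order, core 0
            ("core1: Ld y", "core1: Ld x"),   # program order, core 1
            ("core0: St y", "core1: Ld y"),   # the load read the new value of y
            ("core1: Ld x", "core0: St x")]   # the load read the old value of x
print("forbidden outcome observed" if has_cycle(observed) else "outcome allowed")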

Overall, this work has produced numerous publications and open-source software releases.  The research has generated innovative ideas that are being further explored by companies and are likely to influence computer systems hardware and software for years to come.

 


Last Modified: 09/05/2016
Modified by: Margaret Martonosi

