Award Abstract # 1618912
CSR: Small: Collaborative Research: Exploring Portable Data Placement on Massively Parallel Platforms with Heterogeneous Memory Architectures

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: TRUSTEES OF THE COLORADO SCHOOL OF MINES
Initial Amendment Date: August 2, 2016
Latest Amendment Date: August 2, 2016
Award Number: 1618912
Award Instrument: Standard Grant
Program Manager: Matt Mutka
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2016
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount: $199,671.00
Total Awarded Amount to Date: $199,671.00
Funds Obligated to Date: FY 2016 = $199,671.00
History of Investigator:
  • Bo Wu (Principal Investigator)
    bwu@mines.edu
Recipient Sponsored Research Office: Colorado School of Mines
1500 ILLINOIS ST
GOLDEN
CO  US  80401-1887
(303)273-3000
Sponsor Congressional District: 07
Primary Place of Performance: Colorado School of Mines
1610 Illinois Street
Golden
CO  US  80401-1833
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): JW2NGMP4NMA3
Parent UEI: JW2NGMP4NMA3
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 735400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Heterogeneous computing is becoming crucial for many computational fields, including galaxy simulation, social network analysis, and stock transaction modeling. Programming heterogeneous memory systems is a grand challenge and a major obstacle between heterogeneous hardware and applications because of the programming complexity and fast hardware evolution. This project aims to remove this obstacle and is expected to largely relieve programmers of the burden of handling the underlying memory system heterogeneity. The outcome of this research will also enable continuous improvement of the computing efficiency of a broad range of applications on future heterogeneous systems, a critical condition for sustained advancement of science, health, security, and other aspects of society.

To address the programming challenges on heterogeneous memory systems, the project investigates a software framework consisting of a hardware specification language, a set of novel compiler and runtime techniques, and advanced memory performance modeling. The goal is a systematic solution that automatically places data on a complex heterogeneous memory system, especially on massively parallel platforms. With the proposed framework, programmers are relieved of tailoring their programs to different memory systems, while the capabilities of sophisticated memory systems are fully translated into high computing efficiency. The framework transforms programs so that they are customized at runtime - in terms of where data are placed in memory, when and how data migrate, and so on - to the underlying heterogeneous memory system, attaining near-optimal memory usage.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Qi Zhu, Bo Wu, Xipeng Shen, Kai Shen, Li Shen, Zhiying Wang "Resolving the GPU responsiveness dilemma through program transformations" Frontiers of Computer Science , v.12 , 2018
Bo Wu, Xu Liu, Xiaobo Zhou, and Changjun Jiang "FLEP: Enabling Flexible and Efficient Preemption on GPUs" The 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems , 2017 https://doi.org/10.1145/3037697.3037742
Daniel Mawhirter, Bo Wu, Dinesh Mehta, Chao Ai "ApproxG: Fast Approximate Parallel Graphlet Counting Through Accuracy Control" The 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing , 2018
Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu "GRNN: Low-Latency and Scalable RNN Inference on GPUs" European Conference on Computer Systems , 2019 https://doi.org/10.1145/3302424.3303949
Feng Zhang, Bo Wu, Jidong Zhai, Bingsheng He, Wenguang Chen "FinePar: Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures" The International Symposium on Code Generation and Optimization , 2017
Qi Zhu, Bo Wu, Xipeng Shen, Li Shen and Zhiying Wang "Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems" The 31st IEEE International Parallel & Distributed Processing Symposium , 2017
Wei Han, Daniel Mawhirter, Matthew Buland, and Bo Wu "Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU" The 26th International Conference on Parallel Architectures and Compilation Techniques , 2017
Zhen Peng, Alexander Powell, Bo Wu, Tekin Bicer, Bin Ren "GraphPhi: Efficient Parallel Graph Processing on Emerging Throughput-oriented Architectures" The 27th International Conference on Parallel Architectures and Compilation Techniques , 2018 https://doi.org/10.1145/3243176.3243205

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Modern processors often leverage massive parallelism to provide extremely high computation throughput, exemplified by Graphics Processing Units (GPUs), Many Integrated Core (MIC) processors, and Accelerated Processing Units (APUs). For the massive parallelism to yield performance benefits, the memory system should provide high bandwidth, high capacity, and low latency. However, one type of memory can satisfy at most two of these requirements, motivating the employment of heterogeneous memory systems (HMS). An HMS consists of multiple memory components with different properties. For example, an NVIDIA GPU has more than eight types of memory (global, texture, shared, constant, and various caches), some on-chip, some off-chip, some directly manageable by software, and some not. It is thus challenging to place data in an optimal manner to maximize the achieved throughput.
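The placement challenge described above can be framed as a small optimization problem: given memory components with different bandwidths and capacities, assign data objects so that hot data sits in fast memory. The following sketch is purely illustrative - the greedy policy, the memory parameters, and the object names are hypothetical examples, not the project's actual model or algorithm:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    name: str
    bandwidth_gbs: float   # higher is faster
    capacity_mb: float
    used_mb: float = 0.0

@dataclass
class DataObject:
    name: str
    size_mb: float
    accesses: int          # estimated access count

def place(objects, memories):
    """Greedy placement: the hottest data (accesses per MB) goes to the
    fastest memory component that still has room for it."""
    memories = sorted(memories, key=lambda m: -m.bandwidth_gbs)
    placement = {}
    for obj in sorted(objects, key=lambda o: -o.accesses / o.size_mb):
        for mem in memories:
            if mem.capacity_mb - mem.used_mb >= obj.size_mb:
                mem.used_mb += obj.size_mb
                placement[obj.name] = mem.name
                break
    return placement

# Illustrative numbers only: a tiny fast scratchpad vs. large global memory
mems = [Memory("shared", 1500.0, 0.1), Memory("global", 900.0, 16000.0)]
objs = [DataObject("lookup_table", 0.05, 10_000_000),
        DataObject("edge_list", 512.0, 2_000_000)]
print(place(objs, mems))  # → {'lookup_table': 'shared', 'edge_list': 'global'}
```

A real system additionally has to account for latency, access patterns, and migration cost, which is exactly why the project pursues compiler and runtime support rather than a fixed heuristic.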

The main technical objective of this project was to design a systematic approach to optimizing data-to-memory mapping for HMS in massively parallel platforms. The project aimed at dramatically improving the performance of important applications in multiple emerging domains, such as graph analytics and machine learning.

During the NSF project, we have produced research results that deepen the understanding of 1) processing very large graphs that do not fit in the global memory of GPUs, 2) placing data on fast and slow memory to optimize aggregate bandwidth, 3) partitioning and mapping computation to the complex memory hierarchy of GPUs for recurrent neural networks, and 4) partitioning data between the CPU and the GPU. Based on these understandings, we showed that better data placement could lead to up to 10X and 7X performance improvements over other systems for graph processing and serving recurrent neural network models, respectively.
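For item 4 above, one simple heuristic for CPU-GPU data partitioning is to split a bandwidth-bound workload in proportion to each device's memory bandwidth, so both sides finish at roughly the same time. This sketch is illustrative only (the bandwidth numbers are made up, and the project's actual partitioning techniques, such as those in FinePar, are considerably more sophisticated):

```python
def split_work(n_items, cpu_bw_gbs, gpu_bw_gbs):
    """Split n_items between CPU and GPU proportionally to memory
    bandwidth, balancing the time of a bandwidth-bound pass."""
    gpu_share = gpu_bw_gbs / (cpu_bw_gbs + gpu_bw_gbs)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

# Illustrative numbers: 50 GB/s CPU memory vs. 450 GB/s GPU memory
cpu_items, gpu_items = split_work(1000, 50.0, 450.0)
print(cpu_items, gpu_items)  # → 100 900
```

Irregular workloads break this proportional model - per-item cost varies - which motivates the irregularity-aware partitioning studied in the project.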

We published seven papers and made two software repositories (https://github.com/zhangfengthu/FinePar and https://github.com/cmikeh2/grnn) public thanks to the support of this award. Some papers appeared in top conferences, including ASPLOS, EuroSys, CGO, IPDPS, and PACT. Because of the high performance of the library based on our EuroSys paper, our collaborators at Microsoft are trying to integrate the code into their production system for multiple applications, ranging from natural language processing to text classification.

The PI has integrated some of the research outcomes into three courses he has offered multiple times at Colorado School of Mines. The project has supported five graduate research assistants, who gained research experience in compilers, runtime systems, graph processing applications, and deep learning techniques. Some of them attended academic conferences and workshops, and are considering careers in academia.


Last Modified: 12/28/2019
Modified by: Bo Wu
