Award Abstract # 1618912
CSR: Small: Collaborative Research: Exploring Portable Data Placement on Massively Parallel Platforms with Heterogeneous Memory Architectures

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: TRUSTEES OF THE COLORADO SCHOOL OF MINES
Initial Amendment Date: August 2, 2016
Latest Amendment Date: August 2, 2016
Award Number: 1618912
Award Instrument: Standard Grant
Program Manager: Matt Mutka
CNS
 Division Of Computer and Network Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2016
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount: $199,671.00
Total Awarded Amount to Date: $199,671.00
Funds Obligated to Date: FY 2016 = $199,671.00
History of Investigator:
  • Bo Wu (Principal Investigator)
    bwu@mines.edu
Recipient Sponsored Research Office: Colorado School of Mines
1500 ILLINOIS ST
GOLDEN
CO  US  80401-1887
(303)273-3000
Sponsor Congressional District: 07
Primary Place of Performance: Colorado School of Mines
1610 Illinois Street
Golden
CO  US  80401-1833
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): JW2NGMP4NMA3
Parent UEI: JW2NGMP4NMA3
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923
Program Element Code(s): 735400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Heterogeneous computing is becoming crucial for many computational fields, including galaxy simulation, social network analysis, and stock transaction modeling. Programming heterogeneous memory systems is a grand challenge and a major obstacle between heterogeneous hardware and applications because of the programming complexity and fast hardware evolution. This project aims to remove this obstacle and is expected to largely relieve programmers of the burden of handling the underlying memory system heterogeneity. The outcome of this research will also enable continuous improvement of the computing efficiency of a broad range of applications on future heterogeneous systems, a critical condition for sustained advancement of science, health, security, and other aspects of society.

To address the programming challenges on heterogeneous memory systems, the project investigates a software framework consisting of a hardware specification language, a set of novel compiler and runtime techniques, and advanced memory performance modeling. The goal is a systematic solution that automatically places data on a complex heterogeneous memory system, especially on massively parallel platforms. With the proposed framework, programmers are relieved of tailoring their programs to different memory systems, while the capabilities of sophisticated memory systems are fully translated into high computing efficiency. The framework transforms programs so that they are customized at runtime - in terms of where data are placed in memory, when and how data migrate, and so on - to the underlying heterogeneous memory system, attaining near-optimal memory usage.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Qi Zhu, Bo Wu, Xipeng Shen, Kai Shen, Li Shen, Zhiying Wang "Resolving the GPU responsiveness dilemma through program transformations" Frontiers of Computer Science , v.12 , 2018
Bo Wu, Xu Liu, Xiaobo Zhou, and Changjun Jiang "FLEP: Enabling Flexible and Efficient Preemption on GPUs" The 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems , 2017 https://doi.org/10.1145/3037697.3037742
Daniel Mawhirter, Bo Wu, Dinesh Mehta, Chao Ai "ApproxG: Fast Approximate Parallel Graphlet Counting Through Accuracy Control" The 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing , 2018
Daniel Mawhirter, Yuxiong He, Feng Yan, and Bo Wu "GRNN: Low-Latency and Scalable RNN Inference on GPUs" European Conference on Computer Systems , 2019 https://doi.org/10.1145/3302424.3303949
Feng Zhang, Bo Wu, Jidong Zhai, Bingsheng He, Wenguang Chen "FinePar: Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures" The International Symposium on Code Generation and Optimization , 2017
Qi Zhu, Bo Wu, Xipeng Shen, Li Shen and Zhiying Wang "Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems" The 31st IEEE International Parallel & Distributed Processing Symposium , 2017
Wei Han, Daniel Mawhirter, Matthew Buland, and Bo Wu "Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU" The 26th International Conference on Parallel Architectures and Compilation Techniques , 2017
Zhen Peng, Alexander Powell, Bo Wu, Tekin Bicer, Bin Ren "GraphPhi: Efficient Parallel Graph Processing on Emerging Throughput-oriented Architectures" The 27th International Conference on Parallel Architectures and Compilation Techniques , 2018 https://doi.org/10.1145/3243176.3243205

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Modern processors often leverage massive parallelism to provide extremely high computation throughput, exemplified by Graphics Processing Units (GPUs), Many Integrated Core (MIC) processors, and Accelerated Processing Units (APUs). For the massive parallelism to yield performance benefits, the memory system should provide high bandwidth, high capacity, and low latency. However, one type of memory can satisfy at most two of these requirements, motivating the employment of heterogeneous memory systems (HMS). An HMS consists of multiple memory components with different properties. For example, an NVIDIA GPU has more than eight types of memory (global, texture, shared, constant, and various caches), some on-chip, some off-chip, some directly manageable by software, and some not. It is thus challenging to place data in an optimal manner to maximize the achieved throughput.
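The placement challenge described above can be framed as a small optimization problem: given memory components with different bandwidths and capacities, assign data objects so that hot data sits in fast memory. The following sketch is purely illustrative - the greedy policy, the memory parameters, and the object names are hypothetical examples, not the project's actual model or algorithm:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    name: str
    bandwidth_gbs: float   # higher is faster
    capacity_mb: float
    used_mb: float = 0.0

@dataclass
class DataObject:
    name: str
    size_mb: float
    accesses: int          # estimated access count

def place(objects, memories):
    """Greedy placement: the hottest data (accesses per MB) goes to the
    fastest memory component that still has room for it."""
    memories = sorted(memories, key=lambda m: -m.bandwidth_gbs)
    placement = {}
    for obj in sorted(objects, key=lambda o: -o.accesses / o.size_mb):
        for mem in memories:
            if mem.capacity_mb - mem.used_mb >= obj.size_mb:
                mem.used_mb += obj.size_mb
                placement[obj.name] = mem.name
                break
    return placement

# Illustrative numbers only: a tiny fast scratchpad vs. large global memory
mems = [Memory("shared", 1500.0, 0.1), Memory("global", 900.0, 16000.0)]
objs = [DataObject("lookup_table", 0.05, 10_000_000),
        DataObject("edge_list", 512.0, 2_000_000)]
print(place(objs, mems))  # → {'lookup_table': 'shared', 'edge_list': 'global'}
```

A real system additionally has to account for latency, access patterns, and migration cost, which is exactly why the project pursues compiler and runtime support rather than a fixed heuristic.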

The main technical objective of this project was to design a systematic approach to optimizing data-to-memory mapping for HMS in massively parallel platforms. The project aimed at dramatically improving the performance of important applications in multiple emerging domains, such as graph analytics and machine learning.

During the NSF project, we have produced research results that deepen the understanding of 1) processing very large graphs that do not fit in the global memory of GPUs, 2) placing data on fast and slow memory to optimize aggregate bandwidth, 3) partitioning and mapping computation to the complex memory hierarchy of GPUs for recurrent neural networks, and 4) partitioning data between the CPU and the GPU. Based on these understandings, we showed that better data placement could lead to up to 10X and 7X performance improvements over other systems for graph processing and serving recurrent neural network models, respectively.
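For item 4 above, one simple heuristic for CPU-GPU data partitioning is to split a bandwidth-bound workload in proportion to each device's memory bandwidth, so both sides finish at roughly the same time. This sketch is illustrative only (the bandwidth numbers are made up, and the project's actual partitioning techniques, such as those in FinePar, are considerably more sophisticated):

```python
def split_work(n_items, cpu_bw_gbs, gpu_bw_gbs):
    """Split n_items between CPU and GPU proportionally to memory
    bandwidth, balancing the time of a bandwidth-bound pass."""
    gpu_share = gpu_bw_gbs / (cpu_bw_gbs + gpu_bw_gbs)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

# Illustrative numbers: 50 GB/s CPU memory vs. 450 GB/s GPU memory
cpu_items, gpu_items = split_work(1000, 50.0, 450.0)
print(cpu_items, gpu_items)  # → 100 900
```

Irregular workloads break this proportional model - per-item cost varies - which motivates the irregularity-aware partitioning studied in the project.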

We published seven papers and made two software repositories (https://github.com/zhangfengthu/FinePar and https://github.com/cmikeh2/grnn) public thanks to the support of this award. Some papers appeared in top conferences, including ASPLOS, EuroSys, CGO, IPDPS, and PACT. Because of the high performance of the library based on our EuroSys paper, our collaborators at Microsoft are trying to integrate the code into their production system for multiple applications, ranging from natural language processing to text classification.

The PI has integrated some of the research outcomes into three courses he has offered multiple times at Colorado School of Mines. The project has supported five graduate research assistants, who gained research experience in compilers, runtime systems, graph processing applications, and deep learning techniques. Some of them attended academic conferences and workshops, and are considering careers in academia.


Last Modified: 12/28/2019
Modified by: Bo Wu
