Award Abstract # 1527706
SHF: Small: Empirical Autotuning of Parallel Computation for Scalable Hybrid Systems

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: UNIVERSITY OF TENNESSEE
Initial Amendment Date: July 6, 2015
Latest Amendment Date: July 6, 2015
Award Number: 1527706
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703) 292-7498
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: July 15, 2015
End Date: June 30, 2019 (Estimated)
Total Intended Award Amount: $450,000.00
Total Awarded Amount to Date: $450,000.00
Funds Obligated to Date: FY 2015 = $450,000.00
History of Investigator:
  • Jack Dongarra (Principal Investigator)
    dongarra@icl.utk.edu
  • Jakub Kurzak (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Tennessee Knoxville
201 ANDY HOLT TOWER
KNOXVILLE
TN  US  37996-0001
(865)974-3466
Sponsor Congressional District: 02
Primary Place of Performance: University of Tennessee Knoxville
TN  US  37996-0003
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): FN2YCS2YAUW3
Parent UEI: LXG4F9K8YZK5
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7942, 9150
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Today, scientific and engineering computing is synonymous with parallel computing, and applications such as climate modeling, drug design, and aircraft design run on very large supercomputer installations, with power consumption measured in megawatts and electricity costs measured in millions of dollars. At the same time, every parallel application requires some level of tuning to ensure that the software is mapped appropriately to the hardware; otherwise, suboptimal performance leads to lost cycles, kilowatt-hours, and, ultimately, dollars. Tuning an application by making repeated runs is also wasteful at very large scale. The DARE (Data-driven Autotuning for Runtime Execution) project addresses this problem by tuning the application through modeling and simulation of its behavior at very large scale, rather than by actually running it, so the resources required for tuning are marginal compared to those consumed in production runs. DARE is based on the observation that the same approach that replaces a wind tunnel with a computer simulation of the airfoil can be applied to the software itself.

Two aspects of today's high-end computing landscape make the DARE work unique: 1) the prevalence of hardware accelerators, such as Graphics Processing Units (GPUs) and Xeon Phi co-processors, and 2) the adoption of task-based, dynamic work-scheduling systems as an alternative to traditional, lock-step parallel programming models. In particular, DARE combines three components into a refinement loop: a hardware analysis component, a kernel modeling component, and a workload simulation component. The hardware analysis component extracts basic hardware information, such as processing power and data-link speed; the kernel modeling component provides performance models of the serial kernels that constitute the building blocks of the parallel program; and the simulation component simulates large-scale parallel workloads.

The hardware analysis component gathers basic knowledge about the system, such as the number of CPU sockets per shared-memory node, the number of CPU cores in each socket, the cache hierarchy, the presence of hyper-threading, the number of NUMA nodes and the proximity of CPUs to NUMA nodes, the number of GPU accelerators or Xeon Phi co-processors and the capacities of their device memories, and the topology and bandwidth of data links, both within each node (buses) and between nodes (network switches). Part of this knowledge can be gathered using appropriate query APIs, such as hwloc, netloc, PAPI, and those provided in the CUDA SDK, OpenCL SDK, and Xeon Phi SDK. Synthetic tests can be used for parameters that cannot be established in this manner, as in the sketch below.
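A production tool would query topology through hwloc or PAPI bindings; as a minimal, self-contained illustration of the idea, assuming only the Python standard library, the sketch below reads what the operating system exposes directly and falls back to a synthetic test for one parameter (rough memory-copy bandwidth) that has no portable query API. All names here are illustrative, not DARE's actual code.

```python
# Sketch of the hardware-analysis idea: query what the OS exposes,
# then run a synthetic test for a parameter with no query API.
import os
import time

def query_basic_topology():
    """Parameters available through a query API (here, just the OS)."""
    return {"logical_cpus": os.cpu_count()}

def measure_copy_bandwidth(n_bytes=64 * 1024 * 1024, trials=5):
    """Synthetic test: time a large buffer copy to estimate bandwidth (GB/s)."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        dst = bytes(src)                 # one full read + write of the buffer
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best / 1e9      # count both read and write traffic

if __name__ == "__main__":
    info = query_basic_topology()
    info["copy_bandwidth_GBps"] = measure_copy_bandwidth()
    print(info)
```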
Kernels are essentially the serial building blocks of parallel problems. Although kernels are usually characterized by serial control flow, most already rely on a high degree of data parallelism: today's CPUs get most of their performance from SIMD parallelism, and GPUs get theirs from massive SIMT parallelism. The role of the kernel modeling component is twofold: 1) to tune kernels for maximum performance at a given granularity, and 2) to provide a kernel performance model as a function of granularity, which changes to accommodate parallel execution.
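A minimal sketch of the second role, assuming a cubic cost model for a matrix-multiply-like kernel (the model form and the NumPy-based timing are illustrative choices, not DARE's): time the kernel at a few granularities, then fit a curve that a simulator can evaluate at any tile size.

```python
# Sketch of a kernel performance model: benchmark at a few sizes,
# fit t(n) ~ a*n^3 + b*n^2 + c*n + d, then predict at unseen sizes.
import time
import numpy as np

def time_kernel(n, trials=3):
    """Best-of-trials wall time for one kernel invocation at granularity n."""
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b                            # the "kernel": a dense matmul
        best = min(best, time.perf_counter() - t0)
    return best

sizes = np.array([128, 256, 384, 512, 768, 1024])
times = np.array([time_kernel(n) for n in sizes])

# Least-squares fit of the assumed cubic cost model.
model = np.poly1d(np.polyfit(sizes, times, deg=3))

print("predicted t(640) = %.4f s, measured = %.4f s"
      % (model(640), time_kernel(640)))
```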
DARE turns to a stochastic, time-stepping simulation to predict the performance of a dynamic runtime scheduler for two fundamental reasons: 1) building good performance models from benchmarks of actual parallel runs requires a significant number of runs at significant problem sizes, which is simply too time consuming; and 2) the impact of many tuning parameters is too complex to be modeled by sparsely sampling the tuning space and fitting simple curves or surfaces to the sample points. The answer is to replace the run with a time-stepping simulation, in which a given task-based scheduler is used for assigning tasks to cores, but instead of invoking actual kernel tasks, control is passed to a progress-tracking simulation system that relies on the kernel performance models to simulate the execution of the tasks and produce a virtual trace of the simulated execution. The performance advantage is twofold: 1) simulating a single run is much faster than actually making that run, and 2) many simulations can be run in parallel, allowing fast sweeps through a large parameter search space.
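A minimal sketch of that idea, under simplifying assumptions: a toy task DAG, a caller-supplied duration function standing in for the fitted kernel models, and a greedy earliest-free-core policy standing in for a real task-based runtime. Nothing here is DARE's actual code.

```python
# Time-stepping simulation of a task-based schedule: advance virtual time
# using modeled task durations instead of running the kernels.
import heapq

def simulate(dag, duration, n_cores):
    """dag maps each task to the set of its predecessors; duration(task)
    returns the modeled execution time. Returns (makespan, virtual trace)."""
    indeg = {t: len(p) for t, p in dag.items()}
    succ = {t: [] for t in dag}
    for t, preds in dag.items():
        for p in preds:
            succ[p].append(t)
    ready = [t for t, d in indeg.items() if d == 0]
    cores = [(0.0, c) for c in range(n_cores)]   # (time the core frees up, id)
    heapq.heapify(cores)
    finish, trace = {}, []
    while ready:
        t = ready.pop()
        free_at, c = heapq.heappop(cores)
        # A task starts once the core is free and all its inputs are produced.
        start = max([free_at] + [finish[p] for p in dag[t]])
        end = start + duration(t)
        finish[t] = end
        trace.append((c, t, start, end))          # virtual trace entry
        heapq.heappush(cores, (end, c))
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return max(finish.values()), trace

# Toy workload: a 3-stage pipeline over 4 data blocks, unit-time tasks.
dag = {(b, s): ({(b, s - 1)} if s > 0 else set())
       for b in range(4) for s in range(3)}
makespan, trace = simulate(dag, duration=lambda t: 1.0, n_cores=2)
print("simulated makespan:", makespan)
```

Because each simulated run reduces to arithmetic on the models, many such simulations can be launched side by side to sweep a large parameter space quickly.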
DARE replaces the standard waterfall autotuning process with one that is incremental and iterative in nature. The power of the DARE approach lies in its mutual refinement loop, in which each of the three phases can massively prune the search space for the other two. As a result, very high-quality models can be built for a particular workload, since time is spent refining the models for the conditions that actually apply, rather than sampling the search space in areas never touched at runtime.
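The shape of that loop can be illustrated with a deliberately toy, self-contained sketch: a one-dimensional tuning parameter, a noisy simulated cost standing in for the workload simulation, pruning around the current best configuration, and "refinement" modeled as shrinking model error each round. All of it is a stand-in for DARE's components, not their implementation.

```python
# Toy mutual-refinement loop: simulate -> pick best -> prune -> refine models.
import random

def simulated_cost(cfg, model_noise):
    # Stand-in for the workload simulation: a quadratic cost plus the
    # current kernel-model error (noise shrinks as models are refined).
    return (cfg - 7) ** 2 + random.gauss(0, model_noise)

def refine_loop(candidates, rounds=5):
    noise = 4.0                          # crude initial kernel models
    for _ in range(rounds):
        best = min(candidates, key=lambda c: simulated_cost(c, noise))
        # Prune the search space to a neighborhood of the current best,
        # and refine the models only where they are actually exercised.
        candidates = [c for c in candidates if abs(c - best) <= 2]
        noise /= 2                       # models sharpen each round
    return best

print("tuned parameter:", refine_loop(list(range(20))))
```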

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Asim YarKhan, Jakub Kurzak, Piotr Luszczek, and Jack Dongarra, "Porting the PLASMA Numerical Library to the OpenMP Standard," International Journal of Parallel Programming, v.45, 2017, p.612. DOI: 10.1007/s10766-016-0441-6

PROJECT OUTCOMES REPORT


The objective of the Data-driven Autotuning for Runtime Execution (DARE) project was to provide application-level performance-tuning capabilities to the end user. DARE's development was motivated by the recurring hurdles of performance tuning for the PLASMA and MAGMA linear algebra libraries, which led to a software architecture combining three components: hardware analysis, kernel modeling, and workload simulation.

With DARE, the hardware analysis block built a detailed model of the hardware, its computational resources, and its memory system. The kernel modeling block built accurate performance models for the computational kernels involved in the workload, and the workload simulation block rapidly simulated a large number of runs to find the best execution conditions while relying on the information provided by the other two blocks. 

For our modeling, DARE used distributed-memory, multi-threaded, accelerator-enabled routines from the linear algebra library being developed as part of the Software for Linear Algebra Targeting Exascale (SLATE) project (https://icl.utk.edu/slate/). The SLATE project is developing fundamental dense linear algebra capabilities for current and upcoming distributed-memory systems, including GPU-accelerated systems as well as more traditional multi-core-only systems. See the SLATE Working Notes (SWANs) for details (http://www.icl.utk.edu/publications/series/swans).

The ultimate objective of DARE was to arrange the analysis and simulation blocks in a continuous refinement loop that could serve as a framework for optimizing applications beyond the field of dense linear algebra.


Last Modified: 09/30/2019
Modified by: Jack J Dongarra
