NSF Award Search: Award # 1162196

Award Abstract # 1162196

SHF: AF: Medium: Collaborative Research: The Pochoir Stencil Compiler

NSF Org:	CCF Division of Computing and Communication Foundations
Recipient:	THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK
Initial Amendment Date:	April 5, 2012
Latest Amendment Date:	May 2, 2013
Award Number:	1162196
Award Instrument:	Continuing Grant
Program Manager:	Almadena Chtchelkanova achtchel@nsf.gov (703)292-7498 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering
Start Date:	April 1, 2012
End Date:	September 30, 2014 (Estimated)
Total Intended Award Amount:	$173,584.00
Total Awarded Amount to Date:	$173,584.00
Funds Obligated to Date:	FY 2012 = $83,366.00 FY 2013 = $90,218.00
History of Investigator:	Rezaul Chowdhury (Principal Investigator) rezaul@cs.stonybrook.edu
Recipient Sponsored Research Office:	SUNY at Stony Brook W5510 FRANKS MELVILLE MEMORIAL LIBRARY STONY BROOK NY US 11794-0001 (631)632-9949
Sponsor Congressional District:	01
Primary Place of Performance:	SUNY at Stony Brook NY US 11794-3366
Primary Place of Performance Congressional District:	01
Unique Entity Identifier (UEI):	M746VC6XMNH9
Parent UEI:	M746VC6XMNH9
NSF Program(s):	Software & Hardware Foundation
Primary Program Source:	01001213DB NSF RESEARCH & RELATED ACTIVIT 01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7329, 7924
Program Element Code(s):	779800
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Many high-end scientific applications perform stencil computations in their inner loops. A stencil defines the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t. Stencil computations are conceptually simple to implement using nested loops, but looping implementations suffer from poor cache performance on multicore processors. Cache-oblivious divide-and-conquer stencil codes can achieve an order of magnitude improvement in cache efficiency over looping implementations, but most programmers find it difficult to write cache-oblivious stencil codes. Moreover, open problems remain in adapting these algorithms to realistic applications that lack the perfect regularity of simple examples. This project's investigation of cache-oblivious stencil compilation enables ordinary programmers of stencil computations to enjoy the benefits of multicore technology without requiring them to write code any more complex than naive nested loops.

The research project is developing a language embedded in C++ that can express stencil computations concisely and can be compiled automatically into highly efficient algorithmic code for multicore processors and other platforms. The Pochoir stencil compiler compiles stencil computations that exhibit complex boundary conditions, such as periodic, constant, Dirichlet, Neumann, mirrored, and phase factors; irregularities, including macroscopic and microscopic inhomogeneities, as well as irregular shapes; general complex dependencies, such as push dependencies, horizontal dependencies, and dynamic dependencies.
To achieve these goals, the researchers are developing provably good algorithms for complex stencil computations; exploring how domain-specific compiler technology can achieve speedups from efficient cache management, processor-pipeline scheduling, and parallel computation; investigating how to run stencils efficiently on a wide variety of architectures such as multicore, distributed-memory clusters, graphical processing units, FPGA's, and future exascale machines; demonstrating the effectiveness of their research by developing a production-quality stencil compiler; developing a benchmark suite and benchmarking system for evaluating Pochoir.

This research enables scientific researchers and others to easily produce highly efficient codes for complex stencil computations. The codes make good use of the memory hierarchy and processor pipelines endemic to multicore processors and run fast on a diverse set of hardware platforms. This research eases the development and maintenance of a wide variety of stencil-based applications ? ranging across physics, biology, chemistry, energy, climate, mechanical and electrical engineering, finance, and other areas ? benefiting these application areas, as well as society at large.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Rezaul Chowdhury, Dmitri Beglov, Mohammad Moghadasi, Ioannis Paschalidis, Pirooz Vakili, Sandor Vajda, Chandrajit Bajaj, and Dima Kozakov "Efficient Maintenance and Update of Nonbonded Lists in Macromolecular Simulations" Journal of Chemical Theory and Computation , v.10 , 2014 , p.4449 10.1021/ct400474w

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This research investigated the problem of generating high-efficiency cache-oblivious code for real-world stencil computations that make good use of the memory hierarchy and processor pipelines, starting with simple-to-write linguistic specifications. The goal was to make a wide variety of stencil-based applications --- ranging across physics, biology, chemistry, energy, climate, mechanical and electrical engineering, finance, and other areas --- easier to develop and maintain, benefiting these application areas, as well as society at large. The outcome was a robust stencil compiler that handles homogeneous stencil rules on rectangular shapes, simple boundary conditions, and simple dependency rules. Approaches to handle complex boundary conditions, inhomogeneities, irregular shapes, and complicated dependencies were also investigated. Results were also obtained on performing stencil computations on unstructured graphs, achieving pipeline parallelism on the fly, and delivering optimal cache performance without trading off parallelism.

The researchers developed PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Data-graph computations are an extension of stencil computations from simple grids to general unstructured graphs. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. PRISM has provably good bounds on work and span, and these theoretical guarantees are matched by good empirical performance. The researchers also developed PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is more involved.

The researchers investigated how a concurrency platform for fork-join parallelism can be enhanced to support pipeline parallelism in a provably efficient manner. Indeed, one can imagine performing a stencil computation as a pipeline. The researchers proposed simple linguistics for specifying pipeline parallelism, designed a provably efficient scheduling algorithm that integrates pipeline parallelism into a work-stealing scheduler for fork-join parallelism, and analyzed the algorithm. The proposed linguistics allow the structure of the pipeline to be specified "on the fly," as opposed to the "build-and-run" approach taken by existing concurrency platforms, thereby allowing a more flexible structure of the pipeline. The proposed mechanism has been incorporated into Cilk-M, a Cilk-based work-stealing runtime system. Benchmark results indicate that the prototype implementation has low serial overhead and good scalability.

The researchers invented parallel “Cache-Oblivious Wavefront” (COW) algorithms. Unlike most state-of-the-art cache-oblivious parallel algorithms for stencils and dynamic programming (DP) problems that trade off parallelism for better cache performance, COW algorithms aim to achieve the same without sacrificing parallelism. COW algorithms perform divide-and-conquer (DAC) of the stencil/DP grid in exactly the same way as standard cache-optimal recursive DAC-based algorithms, but consider a recursive subtask ready for execution as soon as all its real dependency constraints are satisfied. COW algorithms inherit the cache-obliviousness and (serial) cache-optimality properties of the standard recursive DAC-based algorithms. But by scheduling a recursive subtask ready for execution based only on its real dependency constraints, COW algorithms lead to potentially improved parallelism. Experimental results on stencil and DP computations support these claims.

<...

Please report errors in award information by writing to: awardsearch@nsf.gov.

Top

Success

Error