NSF Award Search: Award # 1265449

Award Abstract # 1265449

EAGER: Collaborative Research: Using PDE Descriptions To Generate Code Precisely Tailored To Energy-Constrained Systems Including Large GPU Accelerated Clusters

NSF Org:	OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:	LOUISIANA STATE UNIVERSITY
Initial Amendment Date:	August 15, 2013
Latest Amendment Date:	August 15, 2013
Award Number:	1265449
Award Instrument:	Standard Grant
Program Manager:	Rajiv Ramnath OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2013
End Date:	August 31, 2017 (Estimated)
Total Intended Award Amount:	$169,999.00
Total Awarded Amount to Date:	$169,999.00
Funds Obligated to Date:	FY 2013 = $169,999.00
History of Investigator:	Steven Brandt (Principal Investigator) sbrandt@cct.lsu.edu David Koppelman (Co-Principal Investigator) Peter Diener (Co-Principal Investigator) Frank Loffler (Co-Principal Investigator)
Recipient Sponsored Research Office:	Louisiana State University 202 HIMES HALL BATON ROUGE LA US 70803-0001 (225)578-2760
Sponsor Congressional District:	06
Primary Place of Performance:	Louisiana State University and A&M College 202 Himes Hall Baton Rouge LA US 70803-2701
Primary Place of Performance Congressional District:	06
Unique Entity Identifier (UEI):	ECQEYCHRNKJ4
Parent UEI:
NSF Program(s):	Gravity Theory, Software Institutes
Primary Program Source:	01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7916, 9150
Program Element Code(s):	124400, 800400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Modern computer system architectures are forcing computational scientists to move scientific applications
from traditional homogeneous CPU-based systems to heterogeneous multi-core/accelerator architectures.
Obtaining performance in the presence of accelerators requires close attention to
the memory hierarchy and chip-level parallelism to reach even a modest fraction
of the potential performance. As a result, coding tasks which were once the province of
lone graduate students in a single discipline now require interdisciplinary teams of people.
Project Chemora will explore the design of a new application framework for automatically
creating highly optimized code for high-end computational machines. The system
will take as input a set of partial differential equations (PDEs) that describe a
problem, then construct a machine-specific abstract performance model, and, using both,
generate well-tuned code and execution configurations for accelerated
(e.g., hybrid CPU/GPU) computing clusters at various scales. Chemora will
improve programmability in this simplified domain by decoupling the science and
computer science at a high level, thereby reducing the complexity and number of issues scientists need to
collectively understand and allowing individual scientists in the team to focus on their area of
specialty. Chemora will improve performance (both wallclock time and energy) for
systems with both simple and complex sets of equations by making use of detailed
information describing the problem and machine, and will provide improved load
balancing through the AMPI framework.

The Chemora project has chosen the Einstein equations as the primary science driver because
these equations are one of the more complex PDE systems, one with many
hundreds of terms, and a problem scale that is challenging to optimize for most
compilers. Achieving this vision for a general scientific problem would indeed
be a "Grand Challenge" in computational science, but in order to give our
research a sharper focus we have chosen as a science driver the
simulation of Intermediate mass ratio Binary Black Hole (IBBH) systems. Such
systems, consisting of a black hole of mass 100 to 1,000 solar masses orbited by
a smaller black hole of mass 5 to 20 solar masses, are expected to be important
sources of gravitational waves for the Advanced Laser Interferometer Gravitational-Wave
Observatory (LIGO) and the Einstein Telescope (ET). Accurate modeling of
the waveforms from IBBH systems will be necessary in order to extract
gravitational wave signals using template-matching data analysis techniques.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hu, Yue and Koppelman, David and Brandt, Steven R. "Efficiency-Based Assignment of Buffering Strategies for GPU Stencil Computations." IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16), 2016, p. 361. ISBN 9781509035328.

Hu, Yue and Koppelman, David M. and Brandt, Steven R. "Thoroughly Exploring GPU Buffering Options for Stencil Code by using an Efficiency Measure and a Performance Model." IEEE Transactions on Multi-Scale Computing Systems, 2017. ISSN 2332-7766.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The modern era of scientific code development is both a golden age and a dark age. It is a golden age because we have machines that can operate at unprecedented levels of performance. Unfortunately, it is also a dark age because these machines are becoming increasingly hard to program and use. This is due in part to power becoming a limiting resource, in response to which energy-efficient GPU computational accelerator designs have been introduced that omit energy-consuming niceties such as caches and predictors, replacing these with a baroque memory model and a requirement that programs be coded to follow a large number of threads of execution. Such accelerator designs demand much more effort from programmers to properly use, and programs must be re-tuned for each succeeding accelerator generation.

This need to re-tune is much less of a problem for CPUs, because features such as large caches and branch predictors enable compilers to generate good code without having to know, for example, how many times a piece of data will be accessed. GPU accelerators replace large caches with several types of storage, such as a high-speed scratchpad memory. The decision on whether to use these specialized storage areas depends upon aspects of code execution that the compiler often cannot determine. As a result, the burden is placed on the programmer to decide how to stage data.

Though making use of specialized storage and other GPU features is beyond the ability of current compilers for *any* type of program, it is feasible for specialized domains, including what are called stencil calculations. Stencil calculations are used to solve many important scientific and engineering problems, such as simulating black holes and neutron stars, exploring quantum cosmology, simulating fluids, and performing coastal simulations.
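
For readers unfamiliar with the term, a stencil calculation updates each grid point from a fixed pattern of neighboring points. The following is a minimal illustrative sketch in Python, assuming a one-dimensional grid and a simple three-point diffusion stencil; the function name `diffuse_step` and the parameter `alpha` are hypothetical and are not taken from Chemora.

```python
def diffuse_step(u, alpha):
    """Apply one explicit three-point stencil update to the interior of u.

    Illustrative only: a 1D diffusion step, not Chemora's DSL or output.
    """
    new = list(u)  # boundary values are copied unchanged
    for i in range(1, len(u) - 1):
        # Each output point depends only on itself and its two neighbors.
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


grid = [0.0, 0.0, 1.0, 0.0, 0.0]
print(diffuse_step(grid, 0.25))  # prints [0.0, 0.25, 0.5, 0.25, 0.0]
```

Because neighboring output points read overlapping inputs, the same data is reused several times, which is why the choice of where to stage that data matters so much on a GPU.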

The goal of the project was to develop a stencil framework, Chemora, that would allow a physicist or some other domain expert to code a stencil simulation in what is called a domain-specific language (DSL), run it on a GPU-accelerated cluster, and have its performance rival that of hand-tuned code. The use of the DSL makes it much easier for Chemora to generate code, since Chemora knows much more about the movement of data than could be determined for an unconstrained language. Chemora can use all the kinds of specialized memory appearing on recent GPU accelerators (including shared, constant, and texture memory), and can transform calculations into pieces that each comfortably fit on the accelerator device. Chemora transforms a calculation based on a performance model of the device, in some cases optimizing certain rearrangements, in other cases using the model to choose among multiple candidates. Chemora operates in part when a program is run, which is when key information such as input data and system characteristics is first available. It takes advantage of this to generate highly efficient code.
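
The idea of using a performance model to choose among candidates can be sketched roughly as follows. This is not Chemora's actual model, which the report says also accounts for factors such as register pressure; it is a simplified roofline-style estimate, and the class `Candidate`, the functions `estimated_time` and `choose`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    flops: float        # arithmetic operations per grid point (assumed)
    bytes_moved: float  # off-chip bytes transferred per grid point (assumed)


def estimated_time(c, peak_flops, peak_bandwidth):
    # Roofline-style estimate: whichever of compute or memory traffic
    # takes longer is assumed to dominate the time per grid point.
    return max(c.flops / peak_flops, c.bytes_moved / peak_bandwidth)


def choose(candidates, peak_flops, peak_bandwidth):
    # Pick the candidate with the smallest modeled time on this device.
    return min(candidates,
               key=lambda c: estimated_time(c, peak_flops, peak_bandwidth))


candidates = [
    Candidate("global-memory-only", flops=10.0, bytes_moved=40.0),
    Candidate("shared-memory-buffered", flops=12.0, bytes_moved=16.0),
]
best = choose(candidates, peak_flops=1e12, peak_bandwidth=2e11)
print(best.name)  # prints shared-memory-buffered
```

With different device parameters the same candidates can rank differently, which illustrates why a model-driven choice can be repeated for each accelerator generation instead of re-tuning by hand.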

Chemora can generate efficient code for several GPU generations, and it has been updated as new accelerators became available. Programs coded in Chemora's DSL enjoy good performance on the new devices without any effort required on the part of the original programmers.

The driver application for Chemora is a black hole simulation based on Einstein's equations. The application operates on a large number of quantities, something which introduces problems not encountered with simpler code. To accommodate this, Chemora's model considers factors ignored by others, such as what is called register pressure. As a result, Chemora can efficiently run applications that may overwhelm other systems.

Future work will bring the Chemora research to production level, enabling scientists to make better use of computational resources and make new kinds of science possible in many areas that benefit humanity both directly and indirectly.

Last Modified: 11/29/2017
Modified by: Steven Brandt
