
NSF Org: CCF Division of Computing and Communication Foundations
Recipient: Purdue University
Initial Amendment Date: August 6, 2014
Latest Amendment Date: August 6, 2014
Award Number: 1439126
Award Instrument: Standard Grant
Program Manager: Anindya Banerjee, abanerje@nsf.gov, (703) 292-7885, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount: $329,571.00
Total Awarded Amount to Date: $329,571.00
Recipient Sponsored Research Office: 2550 Northwestern Ave # 1100, West Lafayette, IN 47906-1332, US; (765) 494-1055
Primary Place of Performance: 465 Northwestern Ave, West Lafayette, IN 47907-2035, US
NSF Program(s): Exploiting Parallelism and Scalability
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Title: XPS: FULL: FP: Collaborative Research: Taming parallelism: Optimally exploiting high-throughput parallel architectures
Over the past decade, computer manufacturers have focused on producing "multicore" chips that package multiple powerful computing cores on a single chip. Researchers have invested significant effort in developing methods for writing programs that run efficiently on these cores. The basic idea is to let programmers write programs in a high-level programming model and rely on an underlying compiler and runtime system to schedule those programs efficiently on multicore platforms. However, due to power and heat-dissipation concerns, emerging "throughput-oriented" computing systems increasingly rely on far simpler computing cores to deliver parallel computing performance. These cores are much more energy-efficient than traditional multicores and can deliver much higher performance. Practitioners across numerous fields -- bioinformatics, data analytics, machine learning, etc. -- are deploying these systems to harness their power. Unfortunately, existing high-level programming models target multicore chips and do not produce code that runs effectively on these new systems. As a result, practitioners are forced to rewrite their applications, with painstaking low-level optimization and scheduling.
This project will develop schemes to adapt applications written for multicore systems to run efficiently on throughput-oriented processors. The intellectual merits are novel program optimizations that transform multicore-oriented programs into forms that map efficiently to throughput-oriented processors, scheduling mechanisms that ensure these processors do not waste computational resources, and scheduling policies that ensure the mechanisms are used effectively.
The project's broader significance and importance are that programmers will be able to write portable, high-performance, and energy-efficient programs for both traditional multicore systems and throughput-oriented systems. Moreover, high-level programming models will be usable to program throughput-oriented machines, significantly reducing programming effort for practitioners in many science and engineering disciplines. Finally, outreach efforts enhance the project by providing training and mentoring to a diverse group of students.
Languages like Cilk provide support for "dynamic multithreading," which allows programmers to expose all of the parallelism in their program while relying on a sophisticated runtime system to map that parallelism onto the available parallel hardware at runtime. However, Cilk-style execution is ill-suited to the vector-based parallelism found in SIMD units, GPUs, and the Xeon Phi; vector parallelism requires finding identical computations performed on different data. This project investigates a series of transformations that morph Cilk-style programs into programs that expose vectorizable parallelism, allowing dynamic multithreading programs to be mapped to emerging throughput-oriented architectures. The enabling transformation converts task-parallel applications into data-parallel applications by identifying similar tasks performed at different points in the computation. The project also develops a series of scheduling mechanisms and provably efficient scheduling policies that ensure dynamic multithreading applications are parallelized effectively on throughput-oriented architectures. In this manner, the project enables portable applications that run efficiently both on multicores and on vector-based architectures.
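To make the idea concrete, here is a small illustrative sketch (written for this summary; it is not the project's actual transformation) of how a Cilk-style recursive computation can be flattened into level-by-level loops in which every iteration performs the same work on different data:

#include <vector>

struct Node { int value; Node* left; Node* right; };

// Cilk-style task parallelism: each recursive call could be spawned as a
// task (in Cilk, "int l = cilk_spawn tree_sum(n->left);" followed by a
// cilk_sync). The computation's shape follows the recursion tree.
int tree_sum(const Node* n) {
    if (!n) return 0;
    return n->value + tree_sum(n->left) + tree_sum(n->right);
}

// The same computation, restructured breadth-first: all "similar tasks"
// (the nodes at one level of the tree) are collected into a frontier, and
// a single loop processes the whole level. Each iteration does identical
// work on different data, which is the shape SIMD units and GPUs need.
// (A real vectorized version would also build the next frontier with a
// parallel compaction; a sequential loop stands in for that here.)
int tree_sum_flat(const Node* root) {
    std::vector<const Node*> frontier;
    if (root) frontier.push_back(root);
    int total = 0;
    while (!frontier.empty()) {
        std::vector<const Node*> next;
        for (const Node* n : frontier) {
            total += n->value;
            if (n->left)  next.push_back(n->left);
            if (n->right) next.push_back(n->right);
        }
        frontier = std::move(next);
    }
    return total;
}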
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
With the end of Dennard scaling, and the subsequent end of single-threaded performance scaling, architects have begun to rethink how computer hardware should be designed to achieve the best performance/efficiency tradeoff. One point in the design space that has seen substantial interest is throughput-oriented architectures, which rely on massive parallelism to obtain computational power; examples include Graphics Processing Units (GPUs) and wide vector machines. To make this massive parallelism energy-efficient, however, throughput-oriented architectures impose a host of constraints that programs must obey to extract performance. In particular, programs must be structured carefully so that the parallel computation is regular: the different threads of computation must execute the same kind of instruction at the same time, in a structured manner. These constraints are relatively easy (though not trivial) to satisfy for data-parallel programs that operate over dense arrays and matrices; indeed, throughput-oriented architectures were designed with data-parallel programs in mind. Unfortunately, the constraints are much harder to satisfy for task-parallel programs that operate over recursive structures. In fact, it was not clear that recursive task-parallel programs could be mapped effectively to throughput-oriented machines at all.
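For illustration (an example written for this summary, not taken from the report): a regular loop keeps all vector lanes or GPU threads executing the same instruction, while a data-dependent branch forces a SIMT machine to run both paths with some lanes masked off:

#include <cstddef>

// Regular computation: every iteration executes the same instructions on
// different elements, so SIMD lanes / GPU warp threads stay in lockstep.
void scale(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

// Irregular computation: the branch depends on the data, so different
// lanes want different instructions. A SIMT machine executes both paths,
// masking off the inactive lanes, and throughput is wasted.
void divergent(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > 0.0f) out[i] = in[i] * in[i];
        else              out[i] = -in[i];
    }
}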
This project took the first steps toward showing that task parallelism can be mapped effectively to throughput-oriented architectures, both in theory and in practice. The project developed the first program transformations that extract data parallelism from task-parallel programs: programs written in a task-parallel style were automatically rewritten to expose data parallelism. To ensure that the extracted data parallelism efficiently exploited hardware resources, the project then developed scheduling mechanisms that continually rebalanced the task-parallel computation to keep the massive parallelism of throughput-oriented architectures occupied.
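A minimal sketch of the rebalancing idea (the structure and names are illustrative assumptions, not the project's implementation): run every task in a worklist one lockstep "round" at a time, and compact the list between rounds so the hardware always sees a dense batch of identical work:

#include <algorithm>
#include <vector>

struct Task {
    int remaining;                       // work left in this task
    void step() { --remaining; }         // one lockstep unit of work
    bool done() const { return remaining <= 0; }
};

void run_rebalanced(std::vector<Task>& worklist) {
    while (!worklist.empty()) {
        // One round: identical work applied to every task, which is the
        // form that maps onto SIMD lanes or GPU threads.
        for (Task& t : worklist) t.step();

        // Rebalance: compact away finished tasks so that the next round
        // again occupies all execution lanes.
        worklist.erase(std::remove_if(worklist.begin(), worklist.end(),
                                      [](const Task& t) { return t.done(); }),
                       worklist.end());
    }
}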
To show that these schedulers did indeed suffice to map task parallelism onto data-parallel architectures, the project undertook a series of theoretical investigations, proving that the scheduling mechanisms are efficient: they do not lose parallelism compared to a "normal" execution model, and hence efficiently exploit the parallelism of throughput-oriented architectures.
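For orientation, guarantees of this kind are typically stated in the work-span model. The classic greedy-scheduling bound (a standard result quoted here for context; the project's specific theorems are not reproduced) is

    T_P \le \frac{T_1}{P} + T_\infty

where T_1 is the total work of the computation, T_\infty is its span (critical-path length), and P is the number of processors. A scheduler is "efficient" in this sense when its overheads preserve a bound of this shape, so that the computation's inherent parallelism, T_1 / T_\infty, is not lost.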
The project then extended both the practical scheduling results and the theoretical optimality results to more complex programs: those that combine data parallelism and task parallelism. These new schedulers deploy more sophisticated techniques to extract data parallelism, ensuring that parallel hardware can be kept busy even for these more complex task-parallel programs. The project also showed how to combine the novel schedulers with traditional task-parallel schedulers, creating systems that can efficiently trade off data- and task-parallelism.
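One way to picture that tradeoff (an illustrative sketch with an assumed lane width and assumed names, not the project's actual policy): execute a batch of similar tasks data-parallel when it is wide enough to fill the vector lanes, and fall back to ordinary task-parallel execution when it is not:

#include <cstddef>
#include <vector>

constexpr std::size_t kLaneWidth = 8;    // assumed SIMD/warp width

template <typename Task>
void execute_batch(std::vector<Task>& batch) {
    if (batch.size() >= kLaneWidth) {
        // Wide batch: a lockstep loop over identical work, the form a
        // vector unit or GPU can execute efficiently.
        #pragma omp simd
        for (std::size_t i = 0; i < batch.size(); ++i)
            batch[i].step();
    } else {
        // Narrow batch: lockstep execution would leave most lanes idle,
        // so hand the tasks to an ordinary task-parallel scheduler
        // instead (e.g., spawn each one Cilk-style; run serially here).
        for (auto& t : batch)
            t.step();
    }
}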
Finally, the project undertook two efforts to improve the programmability of task-parallel programs. It developed the first efficient record-and-replay system for task-parallel programs, which allows a program to be (correctly) debugged while running on more processors than the original execution used, speeding up the debugging process. It also developed novel race detectors for a more general class of parallel programs, allowing developers to catch bugs more effectively.
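As a toy illustration of the record-and-replay idea (entirely schematic; the project's actual system is far more involved): much of the scheduling nondeterminism in a task-parallel run comes down to which ready task gets picked next, so recording that sequence of choices is enough to steer a later run down the same schedule:

#include <cstddef>
#include <vector>

enum class Mode { Record, Replay };

struct ReplayLog {
    Mode mode;
    std::vector<int> choices;   // task ids, in the order they were picked
    std::size_t cursor = 0;

    // During recording, remember the scheduler's nondeterministic choice;
    // during replay, override the choice with the recorded one.
    int resolve(int nondeterministic_pick) {
        if (mode == Mode::Record) {
            choices.push_back(nondeterministic_pick);
            return nondeterministic_pick;
        }
        return choices.at(cursor++);
    }
};

The project's system goes further, allowing the recorded execution to be replayed correctly on more processors than the recording run used.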
The potential broader impacts of this project are many. First, task-parallel programming is a popular style of parallel programming, as exemplified by many production-level programming models: Cilk, OpenMP tasks, X10, and the like. This project showed that these programming models are inherently compatible with emerging throughput-oriented architectures, opening up a broader class of programs to the promise of efficient, massive parallelism. Second, the programming tools developed as part of this project make it easier for programmers to write correct task-parallel programs, speeding up the development cycle. Finally, the project supported the research of a number of graduate students and brought multiple undergraduates into research projects as well.
Last Modified: 12/18/2019
Modified by: Milind Kulkarni