Award Abstract # 1405939
II-New: A Cluster of Nodes with 32 Cores and 256-GB Memory to Enable Many-Core Systems Research and Education

NSF Org: CNS (Division of Computer and Network Systems)
Recipient: PURDUE UNIVERSITY
Initial Amendment Date: July 31, 2014
Latest Amendment Date: July 31, 2014
Award Number: 1405939
Award Instrument: Standard Grant
Program Manager: Tao Li
CNS (Division of Computer and Network Systems)
CSE (Directorate for Computer and Information Science and Engineering)
Start Date: August 1, 2014
End Date: July 31, 2017 (Estimated)
Total Intended Award Amount: $286,300.00
Total Awarded Amount to Date: $286,300.00
Funds Obligated to Date: FY 2014 = $286,300.00
History of Investigator:
  • Terani Vijaykumar (Principal Investigator)
  • Antony Hosking (Co-Principal Investigator)
  • Vijay Pai (Co-Principal Investigator)
  • Mithuna Thottethodi (Co-Principal Investigator)
  • Milind Kulkarni (Co-Principal Investigator)
Recipient Sponsored Research Office: Purdue University
2550 NORTHWESTERN AVE # 1100
WEST LAFAYETTE
IN  US  47906-1332
(765)494-1055
Sponsor Congressional District: 04
Primary Place of Performance: Purdue University
IN  US  47907-2017
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): YRXVL4JYCEF5
Parent UEI: YRXVL4JYCEF5
NSF Program(s): CCRI-CISE Cmnty Rsrch Infrstrc
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7359
Program Element Code(s): 735900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Research and education in many-core computing systems are important to the NSF CRI program as well as to the research community. This project targets the performance, energy consumption, and scalability of many-core systems, all of which matter to the computer industry. The team is committed to releasing the project's research artifacts as open-source software for use by the research community. The project will benefit graduate student research and support educational activities in the undergraduate and graduate curricula, and it will support outreach activities sponsored by various centers at Purdue University, for example through the team's involvement in the Purdue Computing Research Institute's High Performance Computing workshops.

This infrastructure will support research and education efforts in multiple areas: computer architecture, compilers, high-performance cloud computing, and run-times for managed languages. Computer architects will explore performance, programmability, and power optimizations for many-core architectures, on-chip networks, and disks. Compiler writers will explore shared-memory optimizations, and their scalability, targeting shared-memory applications on distributed-memory machines, as well as techniques to transform seemingly irregular memory access patterns into regular and parallel computations and memory accesses (illustrated in the sketch below). Run-time researchers will pursue parallel garbage collection of large garbage-collected heaps and the associated scalability issues. High-performance computing researchers will explore the performance overhead of virtualization and cloud computing for cluster workloads, along with mechanisms for reducing that overhead.
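As a concrete illustration of the irregular-to-regular transformation mentioned above, consider the classic inspector-executor idiom, in which a run-time "inspector" examines an irregular index stream and computes a locality-friendly schedule for an "executor" to follow. The Python sketch below is our own minimal illustration of the general idiom, not necessarily the project's technique; all names are invented for exposition.

    # Inspector-executor sketch: turn an irregular gather into a
    # memory-order sweep. Illustrative only; not the project's compiler.

    def inspector(indices):
        # Inspect the irregular index stream and build a schedule that
        # visits the gathered elements in increasing memory order.
        return sorted(range(len(indices)), key=lambda i: indices[i])

    def executor(data, indices, schedule):
        # Perform the computation following the locality-friendly schedule.
        out = [0.0] * len(indices)
        for i in schedule:
            out[i] = data[indices[i]] * 2.0  # stand-in computation
        return out

    data = [float(v) for v in range(100)]
    indices = [97, 3, 3, 42, 7, 64, 5]  # seemingly irregular accesses
    print(executor(data, indices, inspector(indices)))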

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Ahmed Abdel-Gawad and Mithuna Thottethodi, "Scalable, Global, Optimal-bandwidth, Application-Specific Routing," 24th Annual Symposium on High Performance Interconnects, 2016.
Keith Chapman, Antony L. Hosking, and J. Eliot B. Moss, "Hybrid STM/HTM for Nested Transactions on OpenJDK," ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2016, p. 569. doi:10.1145/2983990.2984029
Keith Chapman, Antony L. Hosking, and J. Eliot B. Moss, "Extending OpenJDK to support hybrid STM/HTM: Preliminary design," ACM SIGPLAN Workshop on Virtual Machines and Intermediate Languages, 2016. doi:10.1145/2998415.2998417
Kirshanthan Sundararajah, Laith Sakka, and Milind Kulkarni, "Locality Transformations for Nested Recursive Iteration Spaces," Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
Laith Sakka, Kirshanthan Sundararajah, and Milind Kulkarni, "TreeFuser: A Framework for Analyzing and Fusing General Recursive Tree Traversals," Object-Oriented Programming, Systems, Languages & Applications (OOPSLA), 2017.
Nitin, Mithuna Thottethodi, T. N. Vijaykumar, and Milind Kulkarni, "Efficient Collaborative Approximation in MapReduce Without Missing Rare Keys," IEEE International Conference on Cloud and Autonomic Computing (ICCAC), 2017, p. 80. doi:10.1109/ICCAC.2017.15
Peter Gammie, Antony L. Hosking, and Kai Engelhardt, "Relaxing safely: Verified on-the-fly garbage collection for x86-TSO," ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2015. doi:10.1145/2737924.2738006
Yi Lin, Kunshan Wang, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish, "Stop and go: Understanding yieldpoint behavior," ACM SIGPLAN International Symposium on Memory Management, 2015. doi:10.1145/2754169.2754187
Yi Lin, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish, "Rust as a language for high performance GC implementation," ACM SIGPLAN International Symposium on Memory Management, 2016. doi:10.1145/2926697.2926707

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project includes research and education efforts in computer architecture, compilers, run-times for managed languages, and distributed systems. The architecture research explores optimizations for the performance, programmability, and power of multicores, many-core architectures, and datacenters. The compiler writers explore techniques to transform seemingly irregular memory access patterns into regular and parallel computations and memory accesses. The run-time researchers pursue parallel garbage collection of large garbage-collected heaps and the associated scalability issues. Specific efforts and their significant results are:

  • unscaling of clock frequency to tackle the slowing of Dennard scaling; 15% throughput improvement in many-core systems where voltage scaling has stopped (exceeding the previously published "dark silicon" performance limit),
  • exploiting value locality for soft-error tolerance; 75% soft-error coverage at 10% performance and 25% power overheads, whereas redundancy-based schemes incur 80% power overhead,
  • a novel 3-D cache architecture that reduces on-chip tag overhead while converting the 3-D bandwidth advantage into performance; for under 1 MB of on-chip overhead, our 256-MB 3-D DRAM cache performs 15% better than the best previous design with a similar on-chip tag,
  • a novel cost-effective distributed system architecture for causal consistency; partial replication reduces the cost of a causally-consistent geo-replicated data store by 28-37% while achieving the same performance as full replication,
  • power and performance optimization of MapReduce via stratified sampling; 40% improvement in average MapReduce performance while keeping per-key error within 1% (a toy sampling sketch follows this list),
  • optimizing datacenter power by exploiting the latency tail in online data-intensive applications; 15% and 40% reductions in datacenter energy at 90% and 30% datacenter loading, respectively,
  • a novel processing-near-memory (PNM) architecture for Big Data machine learning; improves performance and energy over a GPGPU by 145% and 20%, and over a "sea of simple MIMD cores" by 37% and 34%, when all three architectures have the same number of cores, on-die memory, and die-stacked bandwidth,
  • addressing message buffer management and flow control for RDMA in datacenters; our RDMA architecture either reduces buffer memory by three orders of magnitude with little programmer effort or achieves the same buffer memory with much less programmer burden,
  • implementing nested transactions for Java; our XJ prototype achieves good performance,
  • multicore scaling for garbage collection; our implementation achieves scalable performance,
  • development of a machine-checked proof for a real-time concurrent collector, allowing parallelized execution of the proof script,
  • scalable global routing for HPC; achieves scalable, global, optimal-bandwidth, application-specific routing,
  • optimizing the off-chip traffic of convolutional neural networks (CNNs) via a novel tiling strategy; provably-optimal tiling for CNNs using a given on-chip cache capacity, yielding 2-10x fewer off-chip misses (a tiling sketch also follows this list), and
  • new compiler optimizations for irregular applications, yielding performance improvements of up to 10x on data mining applications and 70% on tree-traversal applications such as compiler passes.
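To make the stratified-sampling bullet concrete, here is a minimal Python sketch, our own toy rather than the project's system (the function name, sampling rate, and rare-key threshold are invented for exposition). Each key is sampled as its own stratum, so rare keys are kept in full and never missed, while common keys are subsampled and their sums scaled by the inverse sampling rate.

    import random
    from collections import defaultdict

    def approx_sum_by_key(records, rate=0.1, min_per_key=10, seed=0):
        # Group values by key ("map" phase of a toy MapReduce job).
        rng = random.Random(seed)
        by_key = defaultdict(list)
        for key, value in records:
            by_key[key].append(value)
        # Sample each key's stratum independently ("reduce" phase).
        estimates = {}
        for key, values in by_key.items():
            k = max(min_per_key, int(len(values) * rate))
            if k >= len(values):
                estimates[key] = float(sum(values))  # rare key: keep all
            else:
                sample = rng.sample(values, k)
                estimates[key] = sum(sample) * (len(values) / k)
        return estimates

    data = [("rare", 5)] * 8 + [("common", 1)] * 10000
    print(approx_sum_by_key(data))  # rare key exact, common key estimated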
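The CNN-tiling bullet can be illustrated the same way. The sketch below tiles a naive 2D convolution's output loops so that each tile's working set stays resident in a given on-chip cache, which is the general mechanism by which tiling cuts off-chip traffic; the tile size here is arbitrary, and this is again our own toy version, not the project's provably-optimal strategy.

    import itertools

    def conv2d_tiled(inp, kernel, tile=8):
        # Naive "valid" 2D convolution with the output iterated in
        # tile x tile blocks so the touched input rows stay cache-resident.
        H, W = len(inp), len(inp[0])
        K = len(kernel)
        OH, OW = H - K + 1, W - K + 1
        out = [[0.0] * OW for _ in range(OH)]
        for ti, tj in itertools.product(range(0, OH, tile), range(0, OW, tile)):
            for i in range(ti, min(ti + tile, OH)):
                for j in range(tj, min(tj + tile, OW)):
                    acc = 0.0
                    for ki in range(K):
                        for kj in range(K):
                            acc += inp[i + ki][j + kj] * kernel[ki][kj]
                    out[i][j] = acc
        return out

    inp = [[float(i + j) for j in range(16)] for i in range(16)]
    kernel = [[1.0 / 9] * 3 for _ in range(3)]
    out = conv2d_tiled(inp, kernel, tile=4)  # 14x14 output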

This infrastructure has supported the research of more than eight graduate students, who are being trained in one or more of computer architecture, compilers, distributed systems, and runtime systems via the above-mentioned efforts. As part of their senior design project, a team of four undergraduates developed a DNN (deep neural network) based software infrastructure to automatically track student attendance in classrooms; the CRI infrastructure facilitated the analysis of the large dataset used for training and evaluation. We expect continued participation of undergraduate students in this activity over the next few semesters.
Last Modified: 11/06/2017
Modified by: T. N. Vijaykumar
