Award Abstract # 1217748
SHF: Small: Scalable Trace-Based Tools for In-Situ Data Analysis of HPC Applications (ScalaJack)

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: NORTH CAROLINA STATE UNIVERSITY
Initial Amendment Date: May 16, 2012
Latest Amendment Date: July 30, 2012
Award Number: 1217748
Award Instrument: Standard Grant
Program Manager: Almadena Chtchelkanova
achtchel@nsf.gov
(703)292-7498
CCF Division of Computing and Communication Foundations
CSE Directorate for Computer and Information Science and Engineering
Start Date: June 1, 2012
End Date: May 31, 2017 (Estimated)
Total Intended Award Amount: $457,395.00
Total Awarded Amount to Date: $457,395.00
Funds Obligated to Date: FY 2012 = $457,395.00
History of Investigator:
  • Frank Mueller (Principal Investigator)
    fmuelle@ncsu.edu
Recipient Sponsored Research Office: North Carolina State University
2601 WOLF VILLAGE WAY
RALEIGH
NC  US  27695-0001
(919)515-2444
Sponsor Congressional District: 02
Primary Place of Performance: North Carolina State University
NC  US  27695-8206
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): U3NVH931QJJ3
Parent UEI: U3NVH931QJJ3
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7942
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Production codes on supercomputers struggle to remain scalable each
time the processor core count increases by a factor of 10, even
though they run efficiently at smaller scale.
Root cause diagnosis fails at petascale because (1) symptoms of
performance problems can be subtle, (2) only a few
metrics can be collected efficiently, and (3) tools can feasibly record
only a small subset of even these metrics.

This work addresses these problems by creating a framework that allows
application developers to focus on data analysis that drives customized
data extraction combined with on-the-fly analysis specifically geared
to their individual problems. This is accomplished by combining trace
analysis and in-situ data analysis techniques at runtime, thereby
lifting data reduction to a new level where it IS analysis. With this
approach, modular measurement and analysis components are combined to
selectively extract representative data from production codes in a
problem-specific manner, which enables root cause analysis.
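The idea that data reduction can itself be the analysis can be illustrated with streaming summary statistics: instead of recording every raw event for post-mortem processing, each node folds measurements into a constant-size summary on the fly. The sketch below is illustrative only (the class and event names are invented, not part of the project's framework); it uses Welford's online algorithm to keep per-event-type mean and variance without storing any raw samples.

```python
from collections import defaultdict

class StreamingStats:
    """Welford's online algorithm: mean/variance in O(1) memory,
    so raw samples never need to be stored or written out."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Aggregate per-event-type timings in situ instead of logging every event.
stats = defaultdict(StreamingStats)
for event, usec in [("MPI_Send", 12.0), ("MPI_Send", 14.0), ("MPI_Recv", 80.0)]:
    stats[event].add(usec)

print(stats["MPI_Send"].mean)  # 13.0
```

The reduced output (one summary record per event type) is what gets extracted from the production run, shrinking the data volume while directly answering a problem-specific question.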

The work demonstrates the feasibility of customized data
extraction and analysis at scale for root cause analysis on current
and forthcoming multi-petascale supercomputers. It thus contributes
to sustaining scalable scientific computing into the future, up to the
largest scales. Results of this work will be contributed as open-source
code to the research community and beyond as it is completed, allowing
other groups not only to build tools on top of our framework but also
to contribute their own components.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Anwesha Das, Frank Mueller, Xiaohui Gu, Arun Iyengar. "Performance Analysis of a Multi-Tenant In-memory Data Grid." IEEE Cloud, 2016.
Neha Gholkar, Frank Mueller, Barry Rountree. "Power Tuning HPC Jobs on Power-Constrained Systems." International Conference on Parallel Architecture and Compilation Techniques (PACT), 2016.
Xiaoqing Luo, Frank Mueller, Philip Carns, Jonathan Jenkins, Robert Latham, Robert Ross, and Shane Snyder. "ScalaIOExtrap: Elastic I/O Tracing and Extrapolation." International Parallel and Distributed Processing Symposium (IPDPS), 2017.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This decade is projected to usher in the period of exascale computing
with the advent of systems with more than 500 million concurrent
tasks.  Harnessing such hardware with coordinated computing in
software poses significant challenges.  Production codes tend to face
scalability problems, but current performance analysis tools seldom
operate effectively beyond 10,000 cores.

We have combined trace analysis and in-situ data analysis techniques
at runtime.  Application developers can thus create ultra-low-overhead
measurement and analysis facilities on the fly, customized for the
performance problems of a particular application.  We developed an
analysis generator called ScalaJack for this purpose. We further
extended the underlying ScalaTrace infrastructure to exploit
statistical clustering techniques so that only one trace per cluster
needs to be generated, yet such traces can be replayed by all nodes of
a cluster without loss of events and with correct communication and
I/O parameters for trace events. We showed that tracing overheads
remain extremely low even for large numbers of nodes, a significant
improvement over past trace consolidation, which imposed exponentially
increasing overheads as the number of nodes grew.
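The clustering idea can be sketched in a few lines. ScalaTrace itself operates on compressed MPI event traces; the toy version below (all function names and event fields are invented for illustration) keys each node by a behavioral signature of its trace, so nodes with identical communication patterns collapse into one cluster and only one representative trace per cluster needs to be stored.

```python
from collections import defaultdict

def signature(trace):
    # Order-preserving sequence of (operation, payload size); timing is
    # deliberately ignored so nodes with identical communication behavior
    # fall into the same cluster.
    return tuple((ev["op"], ev["bytes"]) for ev in trace)

def cluster_traces(per_node_traces):
    """Group per-node traces by behavioral signature; keep one
    representative trace per cluster plus the ranks that share it."""
    clusters = defaultdict(list)
    for rank in sorted(per_node_traces):
        clusters[signature(per_node_traces[rank])].append(rank)
    return [(ranks, per_node_traces[ranks[0]]) for ranks in clusters.values()]

# Three nodes, two distinct behaviors: ranks 0 and 1 share one trace.
traces = {
    0: [{"op": "MPI_Send", "bytes": 1024}],
    1: [{"op": "MPI_Send", "bytes": 1024}],
    2: [{"op": "MPI_Recv", "bytes": 1024}],
}
groups = cluster_traces(traces)
print(len(groups))  # 2 clusters instead of 3 stored traces
```

At replay time, every rank in a cluster re-executes the cluster's representative trace with its own rank substituted into the communication parameters, which is why no events are lost despite storing far fewer traces.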

Results of this work were contributed as open-source code to the
research community. Pluggable, customizable analysis not only allows
other groups to build tools on top of our approach but also lets them
contribute components to our framework, which are shared in a
repository hosted by us.


Last Modified: 06/08/2017
Modified by: Frank Mueller

