Award Abstract # 1445806
XD Metrics Service (XMS)

NSF Org: OAC - Office of Advanced Cyberinfrastructure (OAC)
Recipient: THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK
Initial Amendment Date: April 29, 2015
Latest Amendment Date: May 4, 2022
Award Number: 1445806
Award Instrument: Cooperative Agreement
Program Manager: Edward Walker
edwalker@nsf.gov
(703)292-4863
OAC - Office of Advanced Cyberinfrastructure (OAC)
CSE - Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2015
End Date: February 28, 2023 (Estimated)
Total Intended Award Amount: $9,063,861.00
Total Awarded Amount to Date: $13,016,050.00
Funds Obligated to Date: FY 2015 = $9,079,701.00
FY 2016 = $24,960.00
FY 2017 = $627,500.00
FY 2018 = $22,000.00
FY 2019 = $31,200.00
FY 2020 = $1,706,401.00
FY 2021 = $1,524,288.00
History of Investigator:
  • Thomas Furlani (Principal Investigator)
    furlani@buffalo.edu
  • Abani Patra (Co-Principal Investigator)
  • Gregor von Laszewski (Co-Principal Investigator)
  • Matthew Jones (Co-Principal Investigator)
  • Steven Gallo (Co-Principal Investigator)
Recipient Sponsored Research Office: SUNY at Buffalo
520 LEE ENTRANCE STE 211
AMHERST
NY  US  14228-2577
(716)645-2634
Sponsor Congressional District: 26
Primary Place of Performance: SUNY at Buffalo
701 Ellicott St
Buffalo
NY  US  14203-1101
Primary Place of Performance Congressional District: 26
Unique Entity Identifier (UEI): LMCJKRFW5R81
Parent UEI: GMZUKXFDJMA9
NSF Program(s): XD-Extreme Digital
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
01001718DB NSF RESEARCH & RELATED ACTIVIT
01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7476, 9251
Program Element Code(s): 747600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The XD Metrics Service (XMS) is a renewal of the Technology Audit Service (TAS) project, which aims to improve the operational efficiency and management of NSF's XD network of computational resources. XMS builds on and expands the successes of the TAS project, most notably the development of the XDMoD tool. This tool provides stakeholders of XD and its largest project, XSEDE, with ready access to data on the utilization, performance, and quality of service of XD resources and XSEDE-related services. While the initial focus was the XD community, it became clear over the course of the project that such a resource management tool would also be of great utility to high performance computing centers in general, as well as to other data centers managing complex IT infrastructure. To pursue this opportunity, Open XDMoD, an open-source version of the tool, was developed; it is already in use by numerous academic and industrial HPC centers. The XMS project expands XDMoD beyond its original goals, both to increase its utility to XD and to move it into the realm of a comprehensive resource management tool for cyberinfrastructure. One example is the incorporation of job-level performance data, collected through "TACC_Stats", into XDMoD. This functionality gives XDMoD the ability to identify poorly performing applications, improve throughput, characterize a system's workload, and provide metrics critical for specifying future resource acquisitions. Given the scale of today's HPC systems, even modest increases in throughput can have a substantial impact on science and engineering research: with respect to the XD network, for example, every 1% increase in system performance translates into an additional 15 million CPU hours of computer time that can be allocated for research.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 58)
1. Jeanette M. Sperhac, Robert L. DeLeon, Joseph P. White, Matthew D. Jones, Andrew Bruno, Renette Jones-Ivey, Thomas R. Furlani, Jonathan Bard, and Vipin Chaudhary "Towards Performant Workflows, Monitoring and Measuring" Proceedings of the 29th International Conference on Computer Communications and Networks, v.20, 2020. doi:10.1109/ICCCN49398.2020.9209647
2. Joseph P. White, Alexander D. Kofke, Robert L. DeLeon, Martins D. Innus, Matthew D. Jones, Thomas R. Furlani "Automatic Characterization of HPC Job Parallel Filesystem I/O Patterns" PEARC18: Proceedings of the Practice and Experience on Advanced Research Computing, 2018. doi:10.1145/3219104.3219121
3. Joseph White, Martins Innus, Matthew Jones, Robert DeLeon, Nikolay A. Simakov, Jeffery Palmer, Steven Gallo, Thomas Furlani, Michael Showerman, Robert Brunner, Andriy Kot, Gregory Bauer, Brett Bode, Jeremy Enos and William Kramer "Challenges of workload analysis on large HPC systems; a case study on NCSA Blue Waters" PEARC17: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, 2017. doi:10.1145/3093338.3093348
4. Ben Fulton, Steven Gallo, Matt Link, Robert Henschel, Tom Yearke, Katy Borner, Robert L. DeLeon, Thomas Furlani and Craig A. Stewart "Value Analytics: A Financial Module for the Open XDMoD Project" PEARC17: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, 2017. doi:10.1145/3093338.3093358
5. Nikolay A. Simakov, Robert L. DeLeon, Martins D. Innus, Matthew D. Jones, Joseph P. White, Steven M. Gallo, Abani K. Patra, Thomas R. Furlani "Slurm Simulator: Improving Slurm Scheduler Performance on Large HPC Systems by Utilization of Multiple Controllers and Node Sharing" PEARC18: Proceedings of the Practice and Experience on Advanced Research Computing, 2018. doi:10.1145/3219104.3219111
6. Fugang Wang, Gregor von Laszewski, Timothy Whitson, Geoffrey C. Fox, Thomas R. Furlani, Robert L. DeLeon, Steven M. Gallo "Evaluating the Scientific Impact of XSEDE" PEARC18: Proceedings of the Practice and Experience on Advanced Research Computing, 2018. doi:10.1145/3219104.3219124
7. Jeanette M. Sperhac, Benjamin D. Plessinger, Jeffrey T. Palmer, Rudra Chakraborty, Gregary Dean, Martins Innus, Ryan Rathsam, Nikolay Simakov, Joseph P. White, Thomas R. Furlani, Steven M. Gallo, Robert L. DeLeon, Matthew D. Jones, Cynthia Cornelius, et al. "Federating XDMoD to Monitor Affiliated Computing Resources" Proceedings of the 2018 IEEE International Conference on Cluster Computing, 2019. doi:10.1109/CLUSTER.2018.00074
8. Benjamin Machalowicz, Eric Raut, Yan Kang, Tony Curtis, Andrew Burford, Alan Calder, David Carlson, Barbara Chapman, Firat Coskun, Catherine Feldman, Robert Harrison, Eva Siegmann, Daniel Wood, Robert DeLeon, Mathew Jones, Nikolay Simakov, Joseph White, et al. "Ookami: Deployment and Initial Experiences" PEARC '21: Practice and Experience in Advanced Research Computing, 2021. doi:10.1145/3437359.3465578
9. Gregary Dean, Joshua Moraes, Joseph White, Robert Deleon, Matthew Jones, Thomas Furlani "Performance Optimization of the Open XDMoD Datawarehouse" Proceedings of the Practice and Experience in Advanced Research Computing, ser. PEARC '22, 2022. doi:10.1145/3491418.3530290

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

High performance computing (HPC) systems, more commonly known as supercomputers, play a pivotal role in society today, including in the U.S. economy. They are essential tools in a diverse range of areas including finance, artificial intelligence, oil and gas exploration, pharmaceutical drug design, medical and basic research, and aeronautics, to name a few. Today's supercomputers are a complex combination of computer hardware (servers, network switches, storage) and software, and it is important that system support personnel have at their disposal tools to ensure that this complex infrastructure is running with optimal efficiency, as well as the ability to proactively identify underperforming hardware and software. In addition, most HPC systems are overloaded, with many jobs queued waiting to run; accordingly, system support personnel need the capability to monitor and analyze all end-user jobs to determine how efficiently they are running and what resources they are consuming (computer memory, processing, storage, networking, etc.) in order to optimize the number of jobs run as well as plan for future needs. This is far from an academic exercise. For example, every 1% increase in system performance on the HPC resources supported by the National Science Foundation translates into the ability to allocate an additional 101 million CPU hours annually to research projects, which corresponds to a savings of $5M (assuming a rate of $0.05 per CPU hour).
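
As a rough back-of-the-envelope illustration of the arithmetic behind the figures above (a minimal sketch in Python; the 101 million CPU hours, the 1% improvement, and the $0.05 per CPU hour rate come from the paragraph above, while the implied total capacity is an inference rather than a figure from the report):

    # Back-of-the-envelope check of the savings estimate quoted above.
    # Assumed inputs (taken from the text): a 1% performance gain frees
    # roughly 101 million CPU hours per year, valued at $0.05 per CPU hour.
    additional_cpu_hours = 101_000_000   # CPU hours recovered per year by a 1% gain
    rate_per_cpu_hour = 0.05             # dollars per CPU hour (assumed rate)

    annual_savings = additional_cpu_hours * rate_per_cpu_hour
    implied_total_hours = additional_cpu_hours / 0.01  # inferred total annual capacity

    print(f"Annual savings: ${annual_savings:,.0f}")                        # ~$5,050,000
    print(f"Implied total capacity: {implied_total_hours:,.0f} CPU hours")  # ~10.1 billion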

Given the important role that high performance computers play in research and the economy, it is essential that the science and engineering community have tools to help ensure the efficient management and operation of these resources. The XD Metrics Service award supported the continued development of XDMoD, a powerful tool for the management of HPC resources that is widely viewed as the de facto standard in this space. XDMoD is designed to meet the following objectives:

(1)    provide the end-user community with a tool to optimize their use of HPC resources,

(2)    provide operational staff with the ability to monitor, diagnose, and tune system performance as well as measure the performance of all applications running on the HPC systems they manage,

(3)    provide software developers with the ability to easily obtain detailed analysis of application performance to aid in optimizing code performance,

(4)    provide stakeholders with a diagnostic tool to facilitate HPC planning and analysis, and

(5)    provide metrics to help measure return on investment.

XDMoD provides a rich set of features accessible through an intuitive graphical interface, which is tailored to the role of the user, from scientists and engineers running computations to HPC facility and funding agency managers. Metrics provided by XDMoD include comprehensive statistics on the number and type of computational jobs run, resources (computation, memory, disk, network, etc.) consumed, job wait and wall time, scientific impact, and quality of service. The web interface allows one to chart various metrics and interactively drill down to access additional related information.

The XDMoD framework is also designed to help ensure that the HPC infrastructure delivers a high quality of service to its end-users by continuously monitoring system performance and reliability through a series of programs deployed specifically for this purpose. System managers are therefore able to proactively monitor the HPC infrastructure rather than relying on users to report failures or underperforming hardware and software.

An important capability of XDMoD centers on monitoring the performance of all user jobs running on a given HPC resource, with the goal of automatically identifying poorly performing jobs. Using the XDMoD Job Viewer, a utility developed under this program that provides detailed performance information for each job, system support personnel can work with individual end-users to improve their code's performance, thereby increasing the job's efficiency and, importantly, freeing up what would otherwise have been wasted CPU cycles for other users.
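
To make the idea concrete, the following is a small, purely illustrative Python sketch of flagging low-efficiency jobs from accounting-style data; the job records, field names, and the 25% threshold are hypothetical and are not drawn from XDMoD's actual implementation:

    # Illustrative only: flag jobs whose average CPU usage falls below a threshold,
    # in the spirit of the Job Viewer workflow described above. The records, field
    # names, and threshold here are hypothetical, not XDMoD internals.
    from dataclasses import dataclass

    @dataclass
    class JobRecord:
        job_id: str
        cores: int
        wall_hours: float
        cpu_hours_used: float  # total CPU time the job actually consumed

    def cpu_efficiency(job: JobRecord) -> float:
        """Fraction of the allocated CPU time (cores x wall time) actually used."""
        allocated = job.cores * job.wall_hours
        return job.cpu_hours_used / allocated if allocated else 0.0

    def flag_underperforming(jobs, threshold=0.25):
        """Return jobs that used less than `threshold` of their allocated CPU time."""
        return [j for j in jobs if cpu_efficiency(j) < threshold]

    jobs = [
        JobRecord("1001", cores=64, wall_hours=10.0, cpu_hours_used=600.0),   # ~94% efficient
        JobRecord("1002", cores=128, wall_hours=8.0, cpu_hours_used=150.0),   # ~15% efficient
    ]
    for job in flag_underperforming(jobs):
        print(f"Job {job.job_id}: {cpu_efficiency(job):.0%} CPU efficiency -- follow up with the user")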

XDMoD provides computer center managers with unprecedented information on how well their supercomputers are running, how to improve their operational efficiency, what computer codes are running and how well they perform, and what hardware and software upgrades will be required in the future.

Open XDMoD, which is widely employed by academic, industrial, and government HPC centers worldwide, is freely available for download at https://open.xdmod.org/

 


Last Modified: 03/23/2023
Modified by: Thomas R Furlani

Please report errors in award information by writing to: awardsearch@nsf.gov.
