
NSF Org: CNS Division Of Computer and Network Systems
Recipient: Purdue University
Initial Amendment Date: August 15, 2020
Latest Amendment Date: November 16, 2021
Award Number: 2016704
Award Instrument: Standard Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2023 (Estimated)
Total Intended Award Amount: $1,183,897.00
Total Awarded Amount to Date: $1,183,897.00
Recipient Sponsored Research Office: 2550 NORTHWESTERN AVE # 1100, WEST LAFAYETTE, IN, US, 47906-1332, (765) 494-1055
Primary Place of Performance: 465 Northwestern Avenue, West Lafayette, IN, US, 47907-2035
NSF Program(s): CCRI-CISE Cmnty Rsrch Infrstrc
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
In science and engineering research, large-scale, centrally managed computing clusters or "supercomputers" have been instrumental in enabling the kinds of resource-intensive simulations, analyses, and visualizations used in computer-aided drug discovery, high-strength materials design for cars and jet engines, and disease vector analysis, to name a few. Such clusters are complex systems comprising several hundred to several thousand computer servers with fast network connections between them, various data storage resources, and highly optimized scientific software, all shared among hundreds of researchers from diverse domains. Consequently, the overall dependability of such systems rests on the dependability of these individual, highly interconnected elements as well as on the characteristics of cascading failures. While computer systems researchers and practitioners have been at the forefront of designing and deploying dependable computing cluster systems, this task has been hampered by the lack of publicly available, real-world failure data from supercomputers currently in operation. Prior practice has largely involved tedious, manual collection and curation of small sets of data for use in specific analyses. This project will establish seamless, automated pipelines for acquiring, processing, and curating continuous, detailed system usage, monitoring, and failure data from large computing clusters at two organizations, Purdue University and the University of Texas at Austin. This data will be disseminated through a publicly accessible portal and complemented by a suite of in-situ analytics capabilities that will support and spur research in dependable computing systems. The data acquisition pipeline and analytics software will be made open source and designed for ease of federation, extension, and adoption for cluster systems operated by other organizations.
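To make the pipeline idea concrete, the following is a minimal, hypothetical sketch (in Python) of one curation stage: it takes raw, Slurm-style job-accounting records and emits normalized, anonymized records of the kind a public repository could disseminate. The field names, salt value, and sample data are illustrative assumptions, not the project's actual schema or code.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical raw job-accounting records, in the spirit of a Slurm accounting export.
# Field names and values are illustrative only, not the FRESCO schema.
RAW_JOBS = [
    {"job_id": "412345", "user": "alice", "nodelist": "node[101-104]",
     "start": "2021-03-02T08:15:00", "end": "2021-03-02T11:42:10", "state": "COMPLETED"},
    {"job_id": "412346", "user": "bob", "nodelist": "node[205]",
     "start": "2021-03-02T09:00:00", "end": "2021-03-02T09:05:31", "state": "NODE_FAIL"},
]

def anonymize(user: str, salt: str = "demo-salt") -> str:
    """Replace a username with a stable, non-reversible identifier."""
    return hashlib.sha256((salt + user).encode()).hexdigest()[:12]

def curate(record: dict) -> dict:
    """Normalize timestamps to UTC epoch seconds, anonymize the user,
    and derive the job duration in seconds."""
    start = datetime.fromisoformat(record["start"]).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(record["end"]).replace(tzinfo=timezone.utc)
    return {
        "job_id": record["job_id"],
        "user_hash": anonymize(record["user"]),
        "nodelist": record["nodelist"],
        "start_epoch": int(start.timestamp()),
        "duration_s": int((end - start).total_seconds()),
        "state": record["state"],
    }

if __name__ == "__main__":
    for raw in RAW_JOBS:
        print(curate(raw))
```

In a real deployment, a stage like this would run continuously against the cluster's accounting and monitoring feeds rather than an in-memory list; the in-memory data here is only to keep the sketch self-contained.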
Cluster computing systems are a key resource in time-sensitive, computationally intensive research such as virus structure modeling and drug discovery, and have been at the forefront of efforts to tackle global pandemics. Both unanticipated system downtimes and a lack of actionable feedback to researchers on computational failures can adversely affect research timeliness and efficiency. This project will allow the practitioners and administrators of these systems to develop data-backed best practices for ensuring high availability and utilization of their clusters. The resulting large, public data repository, consisting of data from clusters with diverse workloads spanning traditional high-performance computing, modern accelerator-based computing (for example, on graphics processing units (GPUs)), and cloud-style applications, will allow the systems research community to pursue forward-looking research questions based on real system data. The project will train a cadre of students in data analysis on live production systems, providing them with a unique learning experience interfacing with a variety of stakeholders.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites. Their policies may differ from this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The FRESCO project made progress on the systematic collection, curation, and presentation of public failure data pertinent to large-scale computing systems, all of which is consolidated in a repository named FRESCO. Originating from Purdue University, the University of Illinois at Urbana-Champaign, and the University of Texas at Austin, the datasets encapsulate both static and dynamic information, encompassing system usage, workloads, and failure data, applicable to both planned and unplanned outages. The systems are operational central computing clusters at these universities and consequently see a wide variety of workloads from many different science and engineering domains. The systems have different performance characteristics, and the loads imposed by the workloads stress them to different extents. Our data illuminates the intricate relationship between workload request patterns and the health of the computing clusters. Further, it sheds light on the impact of planned maintenance operations, including upgrades, on the health of the computing clusters.
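As an illustration of how workload and failure data of this kind can be linked, here is a small, hypothetical sketch that checks which jobs overlapped an unplanned outage on one of their assigned nodes. The record shapes, function name, and example values are assumptions for exposition; the actual FRESCO datasets define their own documented schemas.

```python
from dataclasses import dataclass

# Hypothetical, simplified record shapes for a job record and a node outage event.
@dataclass
class Job:
    job_id: str
    nodes: set[str]
    start: int        # epoch seconds
    end: int          # epoch seconds

@dataclass
class Outage:
    node: str
    start: int
    end: int
    planned: bool

def jobs_hit_by_outages(jobs: list[Job], outages: list[Outage]) -> list[tuple[str, str]]:
    """Return (job_id, node) pairs where a job's run overlapped an
    unplanned outage on one of the nodes it was running on."""
    hits = []
    for job in jobs:
        for out in outages:
            if out.planned or out.node not in job.nodes:
                continue
            if job.start < out.end and out.start < job.end:   # interval overlap test
                hits.append((job.job_id, out.node))
    return hits

# Toy example data
jobs = [Job("412346", {"node205"}, 1614675600, 1614675931)]
outages = [Outage("node205", 1614675900, 1614679200, planned=False)]
print(jobs_hit_by_outages(jobs, outages))   # -> [('412346', 'node205')]
```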
In a broader context, this rich and well-curated dataset aims to benefit researchers, technologists, and data scientists in navigating the complexities and challenges inherent in managing and maintaining robust computing infrastructures. This endeavor not only facilitates a deeper understanding of system failures but also propels further research and development in the realm of dependable computing systems.
Further, the project created an analytics toolbench that demonstrates analytics queries commonly useful in such settings, e.g., what is the mean time to failure of a certain range of machines, or what is the correlation between the load on a particular resource (say, memory) and the performance degradation of a compute node. The analytics toolbench gives users the ability, through a simple user interface, to create a wide variety of analytics queries. These queries can span a spatial range (sets of compute nodes) as well as a temporal range (windows of time).
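To illustrate the kinds of queries described above, the sketch below computes a mean time to failure restricted to a spatial range (a set of nodes) and a temporal range (a time window), and a Pearson correlation between memory load and a slowdown metric. The function names and toy data are hypothetical and do not reflect the toolbench's actual interface.

```python
from statistics import mean

def mean_time_to_failure(failure_times: dict[str, list[int]],
                         nodes: set[str], t_start: int, t_end: int) -> float:
    """Mean gap (seconds) between successive failures, restricted to a
    spatial range (a set of nodes) and a temporal range [t_start, t_end)."""
    gaps = []
    for node in nodes:
        times = sorted(t for t in failure_times.get(node, []) if t_start <= t < t_end)
        gaps.extend(b - a for a, b in zip(times, times[1:]))
    return mean(gaps) if gaps else float("inf")

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation, e.g. between per-node memory load and a
    performance-degradation metric sampled at the same instants."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: failure timestamps per node, and paired (memory load, slowdown) samples.
failures = {"node101": [100, 4100, 9100], "node102": [500, 8500]}
print(mean_time_to_failure(failures, {"node101", "node102"}, 0, 10_000))
mem_load = [0.2, 0.4, 0.6, 0.8, 0.9]
slowdown = [1.0, 1.1, 1.4, 1.9, 2.5]
print(round(pearson(mem_load, slowdown), 3))
```

A toolbench query of this kind would draw its inputs from the curated repository rather than hard-coded lists; the spatial and temporal filters are the essential idea.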
The project had a key focus on making the asset widely usable by broad classes of stakeholders. With this aim, we organized Birds of a Feather sessions at the relevant conferences (Supercomputing, DSN). We also created explanatory videos showing the usage of the dataset and the analytics toolbench. We plan to continue this line of work, bringing in more partners as data providers and encouraging usage of the asset by broad classes of users.
In a related effort that falls under the purview of this project, we worked with commercial entities that have production workloads on the cloud to release cloud computing traces. One of these covers serverless workflows on the Microsoft Azure cloud, and the other covers resource utilization for Adobe's use of the Azure cloud.
Last Modified: 03/17/2024
Modified by: Saurabh Bagchi
Please report errors in award information by writing to: awardsearch@nsf.gov.