
NSF Org: CNS Division Of Computer and Network Systems
Recipient: Purdue University
Initial Amendment Date: August 15, 2020
Latest Amendment Date: November 16, 2021
Award Number: 2016704
Award Instrument: Standard Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2023 (Estimated)
Total Intended Award Amount: $1,183,897.00
Total Awarded Amount to Date: $1,183,897.00
Recipient Sponsored Research Office: 2550 NORTHWESTERN AVE # 1100, WEST LAFAYETTE, IN, US, 47906-1332, (765) 494-1055
Primary Place of Performance: 465 Northwestern Avenue, West Lafayette, IN, US, 47907-2035
NSF Program(s): CCRI-CISE Cmnty Rsrch Infrstrc
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
In science and engineering research, large-scale, centrally managed computing clusters or "supercomputers" have been instrumental in enabling the kinds of resource-intensive simulations, analyses, and visualizations used in computer-aided drug discovery, high-strength materials design for cars and jet engines, and disease vector analysis, to name a few. Such clusters are complex systems comprising several hundred to several thousand computer servers with fast network connections between them, various data storage resources, and highly optimized scientific software, all shared among hundreds of researchers from diverse domains. Consequently, the overall dependability of such systems rests on the dependability of these individual, highly interconnected elements as well as on the characteristics of cascading failures. While computer systems researchers and practitioners have been at the forefront of designing and deploying dependable computing cluster systems, this task has been hampered by the lack of publicly available, real-world failure data from supercomputers currently in operation. Prior practice has largely involved tedious, manual collection and curation of small sets of data for use in specific analyses. This project will establish seamless, automated pipelines for acquiring, processing, and curating continuous, detailed system usage, monitoring, and failure data from large computing clusters at two organizations, Purdue University and the University of Texas at Austin. This data will be disseminated through a publicly accessible portal and complemented by a suite of in-situ analytics capabilities that will support and spur research in dependable computing systems. The data acquisition pipeline and analytics software will be made open source and designed for ease of federation, extension, and adoption for cluster systems operated by other organizations.
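To make the pipeline idea concrete, the following is a minimal, hypothetical sketch (in Python) of one curation stage: it takes raw, Slurm-style job-accounting records and emits normalized, anonymized records of the kind a public repository could disseminate. The field names, salt value, and sample data are illustrative assumptions, not the project's actual schema or code.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical raw job-accounting records, in the spirit of a Slurm accounting export.
# Field names and values are illustrative only, not the FRESCO schema.
RAW_JOBS = [
    {"job_id": "412345", "user": "alice", "nodelist": "node[101-104]",
     "start": "2021-03-02T08:15:00", "end": "2021-03-02T11:42:10", "state": "COMPLETED"},
    {"job_id": "412346", "user": "bob", "nodelist": "node[205]",
     "start": "2021-03-02T09:00:00", "end": "2021-03-02T09:05:31", "state": "NODE_FAIL"},
]

def anonymize(user: str, salt: str = "demo-salt") -> str:
    """Replace a username with a stable, non-reversible identifier."""
    return hashlib.sha256((salt + user).encode()).hexdigest()[:12]

def curate(record: dict) -> dict:
    """Normalize timestamps to UTC epoch seconds, anonymize the user,
    and derive the job duration in seconds."""
    start = datetime.fromisoformat(record["start"]).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(record["end"]).replace(tzinfo=timezone.utc)
    return {
        "job_id": record["job_id"],
        "user_hash": anonymize(record["user"]),
        "nodelist": record["nodelist"],
        "start_epoch": int(start.timestamp()),
        "duration_s": int((end - start).total_seconds()),
        "state": record["state"],
    }

if __name__ == "__main__":
    for raw in RAW_JOBS:
        print(curate(raw))
```

In a real deployment, a stage like this would run continuously against the cluster's accounting and monitoring feeds rather than an in-memory list; the in-memory data here is only to keep the sketch self-contained.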
Cluster computing systems are a key resource in time-sensitive, computationally intensive research such as virus structure modeling and drug discovery, and have been at the forefront of efforts to tackle global pandemics. Both unanticipated system downtimes and a lack of actionable feedback to researchers on computational failures can adversely affect research timeliness and efficiency. This project will allow the practitioners and administrators of these systems to develop data-backed best practices for ensuring high availability and utilization of their clusters. The resulting large, public data repository, consisting of data from clusters with diverse workloads spanning traditional high-performance computing, modern accelerator-based computing (for example, on graphics processing units (GPUs)), and cloud-style applications, will allow the systems research community to pursue forward-looking research questions based on real system data. The project will train a cadre of students in data analysis on live production systems, providing them with a unique learning experience interfacing with a variety of stakeholders.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites. Their policies may differ from this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The FRESCO project made progress on the systematic collection, curation, and presentation of public failure data pertinent to large-scale computing systems, all of which is consolidated in a repository named FRESCO. Originating from Purdue University, the University of Illinois at Urbana-Champaign, and the University of Texas at Austin, the datasets encapsulate both static and dynamic information, encompassing system usage, workloads, and failure data, applicable to both planned and unplanned outages. The systems are operational central computing clusters at these universities and consequently see a wide variety of workloads from many different science and engineering domains. The systems have different performance characteristics, and the loads imposed by the workloads stress them to different extents. Our data illuminates the intricate relationship between workload request patterns and the health of the computing clusters. Further, it sheds light on the impact of planned maintenance operations, including upgrades, on the health of the computing clusters.
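As an illustration of how workload and failure data of this kind can be linked, here is a small, hypothetical sketch that checks which jobs overlapped an unplanned outage on one of their assigned nodes. The record shapes, function name, and example values are assumptions for exposition; the actual FRESCO datasets define their own documented schemas.

```python
from dataclasses import dataclass

# Hypothetical, simplified record shapes for a job record and a node outage event.
@dataclass
class Job:
    job_id: str
    nodes: set[str]
    start: int        # epoch seconds
    end: int          # epoch seconds

@dataclass
class Outage:
    node: str
    start: int
    end: int
    planned: bool

def jobs_hit_by_outages(jobs: list[Job], outages: list[Outage]) -> list[tuple[str, str]]:
    """Return (job_id, node) pairs where a job's run overlapped an
    unplanned outage on one of the nodes it was running on."""
    hits = []
    for job in jobs:
        for out in outages:
            if out.planned or out.node not in job.nodes:
                continue
            if job.start < out.end and out.start < job.end:   # interval overlap test
                hits.append((job.job_id, out.node))
    return hits

# Toy example data
jobs = [Job("412346", {"node205"}, 1614675600, 1614675931)]
outages = [Outage("node205", 1614675900, 1614679200, planned=False)]
print(jobs_hit_by_outages(jobs, outages))   # -> [('412346', 'node205')]
```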
In a broader context, this rich and well-curated dataset aims to benefit researchers, technologists, and data scientists in navigating the complexities and challenges inherent in managing and maintaining robust computing infrastructures. This endeavor not only facilitates a deeper understanding of system failures but also propels further research and development in the realm of dependable computing systems.
Further, the project created an analytics toolbench that demonstrates analytics queries commonly useful in such settings, e.g., what is the mean time to failure of a certain range of machines, or what is the correlation between the load on a particular resource (say, memory) and the performance degradation of a compute node. The analytics toolbench gives users the ability, through a simple user interface, to create a wide variety of analytics queries. These queries can span a spatial range (sets of compute nodes) as well as a temporal range (windows of time).
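To illustrate the kinds of queries described above, the sketch below computes a mean time to failure restricted to a spatial range (a set of nodes) and a temporal range (a time window), and a Pearson correlation between memory load and a slowdown metric. The function names and toy data are hypothetical and do not reflect the toolbench's actual interface.

```python
from statistics import mean

def mean_time_to_failure(failure_times: dict[str, list[int]],
                         nodes: set[str], t_start: int, t_end: int) -> float:
    """Mean gap (seconds) between successive failures, restricted to a
    spatial range (a set of nodes) and a temporal range [t_start, t_end)."""
    gaps = []
    for node in nodes:
        times = sorted(t for t in failure_times.get(node, []) if t_start <= t < t_end)
        gaps.extend(b - a for a, b in zip(times, times[1:]))
    return mean(gaps) if gaps else float("inf")

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation, e.g. between per-node memory load and a
    performance-degradation metric sampled at the same instants."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: failure timestamps per node, and paired (memory load, slowdown) samples.
failures = {"node101": [100, 4100, 9100], "node102": [500, 8500]}
print(mean_time_to_failure(failures, {"node101", "node102"}, 0, 10_000))
mem_load = [0.2, 0.4, 0.6, 0.8, 0.9]
slowdown = [1.0, 1.1, 1.4, 1.9, 2.5]
print(round(pearson(mem_load, slowdown), 3))
```

A toolbench query of this kind would draw its inputs from the curated repository rather than hard-coded lists; the spatial and temporal filters are the essential idea.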
The project had a key focus on making the asset widely usable by broad classes of stakeholders. With this aim, we organized Birds of a Feather sessions at the relevant conferences (Supercomputing, DSN). We also created explanatory videos showing the usage of the dataset and the analytics toolbench. We plan to continue this line of work, bringing in more partners as data providers and encouraging usage of the asset by broad classes of users.
In a related effort that falls under the purview of this project, we worked with commercial entities that have production workloads on the cloud to release cloud computing traces. One of these covers serverless workflows on the Microsoft Azure cloud, and the other covers resource utilization for Adobe's use of the Azure cloud.
Last Modified: 03/17/2024
Modified by: Saurabh Bagchi
Please report errors in award information by writing to: awardsearch@nsf.gov.