
NSF Org: CNS Division Of Computer and Network Systems
Recipient: Purdue University
Initial Amendment Date: June 29, 2015
Latest Amendment Date: May 2, 2016
Award Number: 1513197
Award Instrument: Standard Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703) 292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2015
End Date: June 30, 2019 (Estimated)
Total Intended Award Amount: $763,331.00
Total Awarded Amount to Date: $779,331.00
Funds Obligated to Date: FY 2016 = $16,000.00
History of Investigator: Saurabh Bagchi (Principal Investigator)
Recipient Sponsored Research Office: 2550 Northwestern Ave # 1100, West Lafayette, IN 47906-1332, US, (765) 494-1055
Primary Place of Performance: 465 Northwestern Avenue, West Lafayette, IN 47907-2035, US
NSF Program(s): Special Projects - CNS; CCRI-CISE Cmnty Rsrch Infrstrc
Primary Program Source: 01001617DB NSF RESEARCH & RELATED ACTIVIT
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Dependability has become a necessary property of many of the computer systems that surround us or work behind the scenes to support our personal and professional lives. Heroic progress has been made by computer systems researchers and practitioners working together to build and deploy dependable systems. However, an overwhelming majority of this work is not based on real, publicly available failure data. Unfortunately, no open failure data repository exists for any recent computing infrastructure that is large enough, diverse enough, and documented in enough detail about the infrastructure and the applications that run on it.
This project will address this pressing need. The research team appreciates that this effort is challenging on many levels. Failure data are considered sensitive and are usually unveiled only before the trusting eyes of a small subset of the people at an organization. As part of a current one-year planning grant, this team has collected specific requirements for the repository from a wide audience, collected failure and usage data from the largest centrally managed computing cluster at Purdue, and performed preliminary analysis to reveal workload usage patterns. The goal of this full-scale project is to collect data from a variety of computational infrastructures at the two participating universities, and from several of the NSF-funded large cyberinfrastructure projects.
The project will collect, curate, and present public failure data of large-scale computing systems in a repository called FRESCO. The data sets will include static information, dynamic information about the workloads, and failure information for both planned and unplanned outages. The data collection from production machines will have to obey several practical constraints -- no changes to the workload, little performance perturbation, and minimal changes to the operating system. Further, the data have to be sanitized for removing sensitive information and processed to make it interpretable by a broad group of researchers. This project will also provide analysis tools to answer certain commonly occurring questions, such as the correlation between workload and failure and the performance implications of using one library over another, as well as an intuitive graphical front-end which will allow people to explore the data sets and download the relevant ones.
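To make concrete the kind of question the planned analysis tools are intended to answer, the following Python sketch correlates daily workload on a cluster with the daily job failure rate. It is only an illustration under assumed inputs: the file jobs.csv and its columns (submit_time, nodes_requested, wall_hours, exit_status) are hypothetical placeholders, not the actual FRESCO schema.

# Minimal sketch: correlate daily workload with daily job failure rate.
# The input file and column names are hypothetical, not the FRESCO schema.
import pandas as pd

jobs = pd.read_csv("jobs.csv", parse_dates=["submit_time"])
# Assumed columns: submit_time, nodes_requested, wall_hours, exit_status (0 = success)
jobs["failed"] = (jobs["exit_status"] != 0).astype(int)
jobs["node_hours"] = jobs["nodes_requested"] * jobs["wall_hours"]

# Aggregate by day: total demand placed on the cluster vs. fraction of failed jobs.
daily = jobs.set_index("submit_time").resample("D").agg({"node_hours": "sum", "failed": "mean"})

# Pearson correlation between daily workload and daily failure rate.
print(daily["node_hours"].corr(daily["failed"]))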
Widespread use of the data and the associated analysis tools will give computer systems researchers an unprecedented ability to do data-driven research and offer computing infrastructure providers an analytics-driven capability to run more efficient and reliable infrastructures.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Large-scale high performance computing (HPC) systems have become common in academic, industrial, and government settings for compute-intensive applications, including large-scale parallel applications. These HPC systems solve problems that would take millennia on personal computers, but managing such large shared resources is challenging and requires administrators to balance the requirements of a diverse set of users. Large organizations can afford to buy such centralized resources; at academic institutions they are typically managed and operated by a central IT organization. These systems are funded by federal funding agencies (such as the National Science Foundation in the US), and individual researchers write grant proposals to get access to compute time on them. Examples of such systems include Comet at the University of California San Diego, Blue Waters at the University of Illinois at Urbana-Champaign, and Frontera at the University of Texas at Austin.
Progress in building dependable systems can be made faster if researchers working on dependability challenges are exposed to problems through quantitative data. Theories in the lab and small demonstrations in prototypes can be transitioned to the demanding realities of large computer systems if researchers can validate their inventions with real system failure and attack data. There is an astonishing lack of such publicly available data, and as a result many productive avenues of work in dependable system building lie hidden. A comparison may fruitfully be drawn to the widespread use of benchmarks and reference data sets in performance analysis of computer systems, such as those put out by SPEC or TPC. We therefore proposed to solve this problem by collecting, cleaning, annotating, and presenting data from the production compute infrastructures at several public universities.
We were not Pollyannaish in our effort and appreciated that it would be challenging, for both technological and psychological reasons. The technological reason is the need to keep the production infrastructure relatively undisturbed while the monitoring and the data collection happen. The psychological reason is that such data are considered sensitive by many, to be unveiled only before the trusting eyes of a subset of the people at the organization. We mitigated both factors in our project. The technological factor was mitigated by carefully deciding which monitoring tool to use, when it should be activated, and how to store the data, both online and offline. We sidestepped the psychological factor by focusing on computational infrastructure at public universities (including ours), rather than private commercial organizations, and by working closely with the production IT units to enable the data collection and data understanding.
Intellectual Merit:
We performed four different categories of analysis on production compute data collected from central computing clusters at three large public universities: Purdue University, the University of Texas at Austin, and the University of Illinois at Urbana-Champaign. In the first, we showed the breakdown of node failures into different categories and reasoned about the corresponding uptimes and recovery times. In the second, we considered the failures of individual jobs and reasoned about their root causes by examining their exit codes. The third analysis shed light on the relation between resource usage and job failure rates; here we considered five primary kinds of resources, local and remote: memory on a node, local I/O, remote I/O to the parallel file system, network, and runtime of a job. For the runtime, we considered both the spatial component (i.e., number of nodes) and the temporal component (i.e., the execution time on each node). Finally, we developed a job failure prediction model that can help minimize the resource wastage corresponding to job failures caused by system-related issues.
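As an illustration of the fourth category, the sketch below shows one simple way such a job failure prediction model could be set up. It is not the project's published model; the input file and feature names are hypothetical placeholders that merely mirror the resource dimensions discussed above.

# Illustrative sketch of a job failure predictor; not the project's actual model.
# The input file and feature names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

jobs = pd.read_csv("job_features.csv")
# Assumed columns: mem_gb_per_node, local_io_mb_s, remote_io_mb_s, net_mb_s,
# num_nodes, runtime_hours, and a binary label system_failure.
features = ["mem_gb_per_node", "local_io_mb_s", "remote_io_mb_s",
            "net_mb_s", "num_nodes", "runtime_hours"]

X_train, X_test, y_train, y_test = train_test_split(
    jobs[features], jobs["system_failure"], test_size=0.2, random_state=0)

# A simple baseline classifier; flagging jobs likely to fail for system-related
# reasons helps minimize wasted node-hours.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))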
Broader Impacts:
The project has created a large public workload and failure data repository from production computing clusters in a university setting: Fresco (for Purdue and UT Austin) and Monet (for the University of Illinois at Urbana-Champaign). Widespread use of the data and the associated analysis tools will give computer systems researchers an unprecedented ability to do data-driven research and give computing infrastructure providers an analytics capability for running their infrastructures more efficiently and more reliably. University researchers will benefit from greater availability of centralized computing clusters. Broad societal impact will result from the development of more efficient and more reliable large-scale computing clusters that can run societally critical applications reliably, efficiently, and at large scales.
Last Modified: 10/26/2019
Modified by: Saurabh Bagchi