NSF Award Search: Award # 1703936

Award Abstract # 1703936

NeTS: Medium: Collaborative Research: Diagnosing Datacenter Networks with Quantitative Provenance

NSF Org:	CNS Division Of Computer and Network Systems
Recipient:	TRUSTEES OF THE UNIVERSITY OF PENNSYLVANIA, THE
Initial Amendment Date:	July 22, 2017
Latest Amendment Date:	August 6, 2020
Award Number:	1703936
Award Instrument:	Continuing Grant
Program Manager:	Darleen Fisher, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2017
End Date:	September 30, 2023 (Estimated)
Total Intended Award Amount:	$856,480.00
Total Awarded Amount to Date:	$856,480.00
Funds Obligated to Date:	FY 2017 = $223,572.00; FY 2018 = $470,801.00; FY 2020 = $162,107.00
History of Investigator:	Linh Thi Xuan Phan (Principal Investigator), linhphan@cis.upenn.edu; Boon Thau Loo (Co-Principal Investigator); Andreas Haeberlen (Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Pennsylvania, 3451 WALNUT ST STE 440A, PHILADELPHIA, PA, US 19104-6205, (215)898-7293
Sponsor Congressional District:	03
Primary Place of Performance:	Trustees of the University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA, US 19104-6205
Primary Place of Performance Congressional District:	03
Unique Entity Identifier (UEI):	GM1XX56LEP58
Parent UEI:	GM1XX56LEP58
NSF Program(s):	Networking Technology and Syst
Primary Program Source:	01001718DB NSF RESEARCH & RELATED ACTIVIT; 01001819DB NSF RESEARCH & RELATED ACTIVIT; 01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	9102, 7924
Program Element Code(s):	736300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The increasing complexity of data center networks has made it considerably more difficult to identify the source of a networking problem when something goes wrong. However, a set of new diagnostic tools can help diagnose subtle bugs that would be difficult to find with existing tools. One promising approach is based on data provenance, a concept that was originally developed by the database community but is now increasingly being applied in the networking domain. In this approach, the network keeps track of causality as data flows through the system -- for instance, by noting a router's configuration state that contributed to a particular forwarding decision. This information can then be used later to determine a comprehensive explanation of an observed networking problem.

This project will develop a quantitative equivalent of provenance for data networking that can be used to reason about properties such as time or probability. The key idea is to use this provenance to improve root-cause analysis of network events. The proposed effort will develop the scientific foundations of quantitative provenance, as well as practical techniques for capturing, storing, and reasoning about it. The investigators will add several quantitative metrics to provenance: temporal, probabilistic and influence; three research thrusts will be considered, one corresponding to each of these metrics. The project will explore efficient and reusable implementations of new diagnostic tools, which will be applied to several concrete case studies.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 26)

Gandhi, Neeraj and Saldana, David and Kumar, Vijay and Phan, Linh Thi "Self-Reconfiguration in Response to Faults in Modular Aerial Systems" IEEE Robotics and Automation Letters, v.5, 2020 https://doi.org/10.1109/LRA.2020.2970685

Abedi, Saeed and Gandhi, Neeraj and Demoulin, Henri Maxime and Li, Yang and Wu, Yang and Phan, Linh Thi "RTNF: Predictable Latency for Network Function Virtualization" IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2019 https://doi.org/10.1109/RTAS.2019.00038

A. Loveless, R. Dreslinski "IGOR: Accelerating Byzantine Fault Tolerance for Real-Time Systems with Eager Execution" IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2021

Chen, Tianyang and Phan, Linh T.X. "SafeMC: A system for the design and evaluation of mode change protocols" IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2018

Demoulin, Henri Maxime and Fried, Joshua and Pedisich, Isaac and Kogias, Marios and Loo, Boon Thau and Phan, Linh Thi and Zhang, Irene "When Idling is Ideal: Optimizing Tail-Latency for Heavy-Tailed Datacenter Workloads with Perséphone" Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP), 2021 https://doi.org/10.1145/3477132.3483571

Demoulin, Henri Maxime and Pedisich, Isaac and Phan, Linh Thi and Loo, Boon Thau "Automated Detection and Mitigation of Application-level Asymmetric DoS Attacks" Proceedings of the Afternoon Workshop on Self-Driving Networks, 2018 https://doi.org/10.1145/3229584.3229589

Edo Roth, Hengchu Zhang "Orchard: Differentially Private Analytics at Scale" USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020

Edo Roth, Karan Newatia "Mycelium: Large-Scale Distributed Graph Queries with Differential Privacy" 28th ACM Symposium on Operating Systems Principles (SOSP '21), 2021

Gandhi, N. and Roth, E. and Gifford, R. and Phan, L. T. and Haeberlen, A. "Bounded-Time Recovery for Distributed Real-Time Systems" IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2020

Gandhi, Neeraj and Roth, Edo and Sandler, Brian and Haeberlen, Andreas and Phan, Linh Thi "REBOUND: Defending Distributed Systems Against Attacks with Bounded-Time Recovery" Proceedings of the 16th European Conference on Computer Systems (EuroSys'21), 2021 https://doi.org/10.1145/3447786.3456257

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Provenance is a way to reason about where a given piece of data came from, or why a particular event occurred. To use a real-world analogy, the provenance of a cup of coffee would include its ingredients (the water and the beans) as well as a description of the brewing process; it would then continue recursively with the provenance of the water and the beans. A data structure like this has several important uses in distributed systems, including diagnostics. Consider, for instance, what would happen if a data-center system, with perhaps thousands of servers, produces an unexpected output. Finding the root causes of this output among the millions of things that such a system is doing would be quite difficult for a human operator. But if the system has been keeping track of provenance, the task is much easier: the operator can simply inspect the provenance of the unexpected output. However, existing solutions could only produce qualitative answers: while it was possible to tell that a given output was computed from certain inputs, it was not possible to tell, say, why the computation took unusually long.
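As a rough illustration of the coffee analogy, provenance can be modeled as a recursive structure in which each item records the process that produced it and the items it was derived from. The following is a minimal Python sketch under that assumption. The names `ProvNode`, `explain`, and `root_causes`, as well as the "filtering" and "roasting" steps, are hypothetical and chosen only for illustration; they do not describe any system built in this project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvNode:
    """One item plus the process and inputs it was derived from (hypothetical)."""
    name: str
    process: str = "base input"
    inputs: List["ProvNode"] = field(default_factory=list)


def explain(node: ProvNode, depth: int = 0) -> List[str]:
    """Walk the provenance recursively and return one indented line per item."""
    lines = ["  " * depth + f"{node.name} <- {node.process}"]
    for child in node.inputs:
        lines.extend(explain(child, depth + 1))
    return lines


def root_causes(node: ProvNode) -> List[str]:
    """Return the names of the leaves, i.e. items with no recorded inputs."""
    if not node.inputs:
        return [node.name]
    causes: List[str] = []
    for child in node.inputs:
        causes.extend(root_causes(child))
    return causes


# Made-up example data following the coffee analogy.
water = ProvNode("water", "filtering", [ProvNode("tap water")])
beans = ProvNode("beans", "roasting", [ProvNode("green beans")])
coffee = ProvNode("cup of coffee", "brewing", [water, beans])

print("\n".join(explain(coffee)))
print(root_causes(coffee))
```

An operator inspecting an unexpected output would, in this simplified picture, call something like `explain` on that output and read off the chain of contributing inputs instead of searching the whole system.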

This project addressed this problem by generalizing provenance to quantitative properties. We developed theoretical foundations for quantitative provenance, we built systems for capturing it and reasoning about it, we developed several tools and applications, and we studied a number of different use cases. We have particularly focused on 1) temporal provenance, which can be used to reason about timing and delays; 2) probabilistic provenance, which can be used to reason about probability distributions; and 3) meta provenance, which can be used to reason about the influence of a particular piece of code on a certain event.
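To make the idea of temporal provenance slightly more concrete, the sketch below annotates each event in a small provenance tree with a duration and reports the slowest causal chain leading to an output. This is a simplified illustration, not the project's algorithm: it assumes that an event starts only after all of its causes have finished, so the slowest cause dominates. The `TimedEvent` class, the `critical_path` function, and all event names and timings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimedEvent:
    """An event with its own processing time and the events that caused it."""
    name: str
    duration_ms: float
    causes: List["TimedEvent"] = field(default_factory=list)


def critical_path(event: TimedEvent) -> Tuple[float, List[str]]:
    """Return (total delay, event names) along the slowest causal chain.

    Assumption: an event cannot start before all of its causes have finished.
    """
    best_delay: float = 0.0
    best_chain: List[str] = []
    for cause in event.causes:
        delay, chain = critical_path(cause)
        if delay > best_delay:
            best_delay, best_chain = delay, chain
    return best_delay + event.duration_ms, best_chain + [event.name]


# Made-up timings: a forwarding decision depends on a config lookup and on queueing.
config_lookup = TimedEvent("config lookup", 2.0)
queueing = TimedEvent("queueing at switch", 40.0)
forward = TimedEvent("forwarding decision", 1.0, [config_lookup, queueing])
response = TimedEvent("response delivered", 3.0, [forward])

total_ms, chain = critical_path(response)
print(total_ms, chain)
```

In this toy example the chain passes through the queueing event rather than the config lookup, which is the kind of answer ("why did this take unusually long?") that purely qualitative provenance cannot give.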

Today, the most important application scenario for our results is diagnostics in data-center networks. This is important because data centers are running the large-scale services we use every day - including the global payment network or airline reservation systems, but also Amazon, Google, Facebook, Instagram, Uber, and pretty much any other large web platform. The high complexity of these systems makes diagnostics particularly challenging. However, we have also found uses in a number of other domains. For instance, one result helped us quickly find malfunctioning rotors in multirotor aircraft, which could help to improve their safety; another has been useful in a collaboration with industry, to improve a next-generation metaverse platform; and a third was even able to find a security issue in NASA's space shuttle and has led to changes to industry standards.

The project has helped to train several PhD students, some of whom have already graduated and are now working in the tech industry. It has also provided research experience and training to many Master's and undergraduate students, and it has had an impact on three core computer-science courses at Penn, each of which has been taken by more than 100 undergraduate and graduate students per semester.

Last Modified: 03/31/2024
Modified by: Linh Thi Xuan Phan

Please report errors in award information by writing to: awardsearch@nsf.gov.