NSF Award Search: Award # 1318396

Award Abstract # 1318396

NeTS: Small: Automated Diagnosis and Root Cause Analysis of Internet Problems

NSF Org:	CNS Division Of Computer and Network Systems
Recipient:	UNIVERSITY OF WASHINGTON
Initial Amendment Date:	August 20, 2013
Latest Amendment Date:	December 18, 2013
Award Number:	1318396
Award Instrument:	Standard Grant
Program Manager:	Darleen Fisher CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	September 1, 2013
End Date:	August 31, 2017 (Estimated)
Total Intended Award Amount:	$499,351.00
Total Awarded Amount to Date:	$499,351.00
Funds Obligated to Date:	FY 2013 = $499,351.00
History of Investigator:	Arvind Krishnamurthy (Principal Investigator) arvind@cs.washington.edu David Choffnes (Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Washington 4333 BROOKLYN AVE NE SEATTLE WA US 98195-1016 (206)543-4043
Sponsor Congressional District:	07
Primary Place of Performance:	University of Washington 185 Stevens Way Seattle WA US 98195-2350
Primary Place of Performance Congressional District:	07
Unique Entity Identifier (UEI):	HD1WMN6945W6
Parent UEI:
NSF Program(s):	Networking Technology and Syst
Primary Program Source:	01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7923
Program Element Code(s):	736300
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Reliable Internet performance and availability are essential for many existing and future network applications. While the Internet works well enough most of the time for most people, nearly everyone has experienced outages and service degradation that make the network unusable, and we are far from the five nines (99.999%) of reliability that critical services require. Improving Internet connectivity requires action against all sources of unavailability and poor performance.

The research community has made substantial progress toward understanding and developing technologies to address short-term outages due to BGP (Border Gateway Protocol) routing convergence. However, much less progress has been made in reducing the impact of long-term outages and route misconfiguration. Despite being rare, these events have a large impact on overall network availability because repairs happen on a human timescale. Additionally, many users suffer from the use of sub-optimal (high-latency or lossy) paths to network services due to misconfigurations and ineffective route selection. Operators at an affected ISP or service often encounter stumbling blocks at each step: identifying that a problem exists, localizing the root cause of the problem, and effecting a repair.

The researchers on this project will develop a system to transform this largely manual troubleshooting process into a fully automated one. The goal of the research is that persistent outages and performance problems can be identified in real time, rather than in a matter of hours as they are today. While automated diagnosis and identification of root cause is fundamentally hard, the project will benefit from dramatic recent progress in Internet measurement technologies, specifically reverse path measurement, which provides a much more complete picture of the Internet topology than ever before.

Intellectual Merit: The goal of the research project is to change the paradigm of network diagnosis on the Internet from blind to informed. The state of the art in network troubleshooting is to use ad hoc techniques. For instance, it is a common occurrence on the NANOG (North American Network Operators' Group) mailing list for operators to post requests asking other operators to manually issue traceroutes and report them in order to identify network anomalies. The network could thus benefit from a continuously operated service that can not only detect network problems in real time but also identify misbehaving network elements at the granularity of routers. There are also a number of challenges to deploying a functional diagnosis system, and the researchers will address them using the following key components. First, the project will produce a scalable measurement system that will synthesize measurements from different techniques to provide snapshots of routing behavior in real time. Second, the research will focus on developing a general theory of Internet path changes that will help model the propagation of routing events and identify the candidate set of responsible ASes (autonomous systems). Third, the researchers will develop inference techniques that will operate on measured data and identify the origin of failures and path changes in the wide area even when the measurement data is incomplete or subject to transient dynamics.

Broader Impact: Our society is increasingly relying on the Internet for critical telecommunications services, such as home health monitoring, e-911, smart grids, and so forth. It is no longer simply an inconvenience when the Internet is unavailable or inefficient. If this project is successful, it will help operators address the major sources of unavailability and misconfigurations in the Internet, benefiting all of its users. In addition, because of a lack of automated tools, operators currently spend huge amounts of time chasing down individual outages and performance misconfigurations; this raises the barrier to entry for small ISPs, ultimately raising the costs of Internet service for everyone.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

U. Javed, I. Cunha, D. Choffnes, E. Katz-Bassett, T. Anderson, and A. Krishnamurthy "PoiRoot: Investigating the Root Cause of Interdomain Path Changes" SIGCOMM Computer Communication Review, 2013

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Our end goal is that network problems can be identified in real time, rather than in a matter of hours as they are today. While automated diagnosis and identification of root cause is fundamentally hard, our project benefits from dramatic recent progress in Internet measurement technologies, specifically reverse path measurement, which provides a much more complete picture of the Internet topology than ever before.

We obtained the following results:

(a) Probabilistic analysis of root cause and AS relationship inference: We developed a new algorithm for inferring business relationships between Autonomous Systems (ASes) from publicly available BGP paths. Unlike previous approaches, our algorithm does not depend on pre-defined characteristics of BGP policies, such as valley-free routing. Instead, we build a probabilistic model by learning patterns in historical BGP data. We identify a key set of route features that are highly predictive. Our algorithm, which makes relationship predictions based on a weighted sum of these route features, consistently achieves 99% prediction accuracy (less than 1% error rate) over 3-4 years of routing data, significantly reducing the error rate of state-of-the-art algorithms. The predictiveness of these features leads us to believe that we have identified underlying structures of AS relationships, particularly hybrid AS relationships, that were not previously understood.
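
As an illustration of predicting a relationship from a weighted sum of route features, the following is a minimal sketch, not the project's algorithm. The feature names, weights, bias, and the logistic squashing of the sum into a probability are all assumptions made for the example; the report does not list the actual features or say how the weights are learned.

```python
import math

# Hypothetical features and weights, chosen only for illustration.
# In the approach described above, weights would be learned from
# historical BGP data rather than set by hand.
WEIGHTS = {
    "degree_ratio": 1.5,      # assumed: relative connectivity of the two ASes
    "seen_as_transit": 2.0,   # assumed: one AS observed carrying the other's routes
    "path_position": -0.5,    # assumed: where the link tends to sit in observed paths
}
BIAS = -1.0  # assumed intercept


def relationship_probability(features):
    """Map a weighted sum of route features to a probability in (0, 1).

    The logistic function is an assumption of this sketch.
    """
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def classify(features, threshold=0.5):
    """Label an AS link as provider-customer or peer-peer."""
    if relationship_probability(features) >= threshold:
        return "provider-customer"
    return "peer-peer"
```

A call such as `classify({"degree_ratio": 1.0, "seen_as_transit": 1.0, "path_position": 0.0})` yields a weighted sum of 2.5 and therefore the label "provider-customer" under these made-up weights.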

(b) Automated diagnosis of performance problems due to problems in the optical network: To confirm and quantify physical layer over-engineering in today's datacenters, we conducted what to our knowledge is the first large-scale study of operational optical links. We analyzed over 300K links across more than 20 datacenters of a large cloud provider over a period of 10 months. We find a remarkably conservative state of affairs: 99.9% of the links have incoming optical signal quality that is higher than the minimum threshold for bit error rate (BER), while the median is 6 times higher! We then designed a practical system, Rail, that builds multiple virtual topologies on the same physical topology, where the class of the topology offers a bound on the maximum packet error rate (i.e., grayness) on any path in it. The first-class topology does not have any gray paths; hence, it offers the same path packet error rate guarantee as current datacenter network (DCN) designs offer. Other classes increasingly use more gray links. Each application uses the virtual topology that meets its needs. Thus, loss-tolerant applications use virtual topologies that may have more gray paths. To support applications such as large transfers, which are otherwise loss-tolerant but suffer when the transport protocol (e.g., TCP) is sensitive to losses, Rail uses a transparent coding-based error correction scheme. We developed an efficient algorithm to compute virtual topologies that leverages the topological structure of DCNs. Rail is easily deployable as it requires no changes to the switch or transceiver hardware. We evaluated Rail using simulation-based analysis and a testbed. Even at the maximum stretched reach level we consider, we find 95% of all paths are as reliable as they are today. Furthermore, Rail successfully protects loss-sensitive applications from gray paths.
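
The idea of a topology class bounding the packet error rate of any path can be sketched as follows. This is not the algorithm described above for computing virtual topologies; it only shows how a single path could be assigned to the strictest class it satisfies. The class bounds are hypothetical values, and the sketch assumes that packet errors on different links are independent.

```python
# Hypothetical upper bounds on path packet error rate, strictest class first.
CLASS_BOUNDS = [1e-8, 1e-6, 1e-4]


def path_error_rate(link_error_rates):
    """Packet error rate of a path, assuming independent per-link errors.

    A packet survives the path only if it survives every link.
    """
    survive = 1.0
    for p in link_error_rates:
        survive *= 1.0 - p
    return 1.0 - survive


def topology_class(link_error_rates):
    """Return the strictest class (1-based) whose bound the path meets.

    Returns None if the path is too lossy for every class.
    """
    rate = path_error_rate(link_error_rates)
    for cls, bound in enumerate(CLASS_BOUNDS, start=1):
        if rate <= bound:
            return cls
    return None
```

Under these assumed bounds, a path whose links each lose about one packet in 10^10 falls in class 1, while a path containing one link with a 10^-7 error rate falls in class 2 and would be used only by applications that tolerate that loss.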

(c) Analysis of packet losses of microbursts inside datacenters: Our primary result is a high-resolution characterization of a production data center network. To obtain it, we developed a custom high-resolution counter collection framework on top of the data center operator's in-house switch platform. This framework is able to poll switch statistics at a granularity of tens to hundreds of microseconds with minimal impact on regular switch operations. With the framework, we performed a data-driven analysis of various counters (including packet counters and buffer utilization statistics) from Top-of-Rack (ToR) switches in multiple clusters running multiple applications. While our measurements are limited to ToR switches, our measurements and prior work indicate that the majority of congestion occurs at that layer. Our main findings include: (1) Microbursts, periods of high utilization lasting less than 1 ms, exist in production data centers, and in fact, they encompass most congestion events. (2) Link utilization is multimodal; when bursts occur, they are generally intense. (3) At small timescales, many multi-statistic features become possible to measure: load can be very unbalanced, packets tend to be larger inside bursts than outside, and buffers are related to simultaneous bursts in a nonlinear fashion.
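
To make the definition of a microburst concrete, the sketch below scans a series of fixed-interval link utilization samples and reports runs of high utilization shorter than 1 ms. It is an illustration only, not the collection framework described above. The 0.5 utilization threshold is an assumption, since the report says only "high utilization", and the function names are hypothetical.

```python
def find_bursts(utilization, interval_us, threshold=0.5):
    """Return (start_index, duration_us) for each maximal run above threshold.

    utilization: link utilization samples in [0, 1], one per polling interval.
    interval_us: polling interval in microseconds.
    threshold:   assumed cutoff for "high utilization".
    """
    bursts = []
    start = None
    for i, u in enumerate(utilization):
        if u > threshold and start is None:
            start = i  # a burst begins
        elif u <= threshold and start is not None:
            bursts.append((start, (i - start) * interval_us))
            start = None  # the burst ended at the previous sample
    if start is not None:
        # The series ended while still inside a burst.
        bursts.append((start, (len(utilization) - start) * interval_us))
    return bursts


def microbursts(utilization, interval_us, threshold=0.5, max_us=1000):
    """Keep only bursts lasting less than max_us (1 ms by default)."""
    return [b for b in find_bursts(utilization, interval_us, threshold)
            if b[1] < max_us]
```

For example, with samples `[0.1, 0.9, 0.95, 0.2, 0.1, 0.8]` polled every 100 microseconds, `microbursts` reports two bursts: one starting at index 1 lasting 200 microseconds and one starting at index 5 lasting 100 microseconds. Coarser polling would merge or miss such events, which is why the fine polling granularity matters.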

Last Modified: 11/30/2017
Modified by: Arvind Krishnamurthy

