Understanding Internet Routing Anomalies and Building Robust Transport Layer Protocols (thesis) | Computer Science Department at Princeton University

Report ID:

TR-732-05

Authors:

Zhang, Ming

Date:

July 2005

Pages:

132

Download Formats:

[PDF]

Abstract:

As the Internet grows and routing complexity increases, network-level instabilities are becoming increasingly common. End-to-end communications are especially susceptible to service disruptions, and diagnosing and mitigating these disruptions is extremely challenging. In this dissertation, we design and build systems for diagnosing routing anomalies and improving the robustness of end-to-end communications.

The first piece of this work describes PlanetSeer, a novel distributed system for diagnosing routing anomalies. PlanetSeer passively monitors traffic in wide-area services, such as Content Distribution Networks (CDNs) or Peer-to-Peer (P2P) systems, to detect anomalous behavior. It then coordinates active probes from multiple vantage points to confirm the anomaly, characterize it, and determine its scope. This approach has several advantages. First, we obtain more complete and finer-grained views of routing anomalies, since the wide-area nodes provide geographically diverse vantage points. Second, we incur limited additional measurement cost, since most active probes are initiated only when passive monitoring detects oddities. Third, we detect anomalies at a much higher rate than other researchers have reported, since the wide-area services provide large volumes of traffic to sample. Through extensive experimental study in the wide-area network, we demonstrate that PlanetSeer is an effective system both for gaining a better understanding of routing anomalies and for providing optimization opportunities for the host service.
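
The passive-then-active pattern described above can be illustrated with a minimal sketch. It is not PlanetSeer's implementation: the class and function names, the use of consecutive timeouts as the trigger, and the threshold of 3 are all assumptions made for illustration, and the `probe` callable stands in for whatever active measurement a vantage point would run.

```python
from collections import defaultdict


class PassiveMonitor:
    """Hypothetical passive monitor: flags a destination after repeated timeouts."""

    def __init__(self, timeout_threshold=3):
        # Assumed trigger condition; the abstract does not specify one.
        self.timeout_threshold = timeout_threshold
        self.consecutive_timeouts = defaultdict(int)

    def observe(self, destination, timed_out):
        """Record one passive observation; return True when probing should start."""
        if not timed_out:
            # Any successful exchange resets the streak for this destination.
            self.consecutive_timeouts[destination] = 0
            return False
        self.consecutive_timeouts[destination] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.consecutive_timeouts[destination] == self.timeout_threshold


def confirm_anomaly(destination, vantage_points, probe):
    """Probe from every vantage point; return those that cannot reach destination.

    `probe(vantage_point, destination)` is a caller-supplied function that
    returns True when the destination is reachable from that vantage point.
    """
    return [vp for vp in vantage_points if not probe(vp, destination)]
```

Because probes are issued only after `observe` returns True, the measurement cost stays tied to the number of suspected anomalies rather than to the volume of monitored traffic.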

To improve the robustness of end-to-end communications during performance anomalies, we design mTCP, a novel transport layer protocol that can minimize the impact of anomalies using redundant paths. mTCP separates the congestion control for each path so that it can not only obtain higher throughput but also be more robust to path failures. mTCP can quickly react to failures, and the recovery process normally takes only several seconds.
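
To make the idea of per-path congestion control concrete, here is a minimal sketch, assuming a standard additive-increase, multiplicative-decrease window for each path. The `Path` class, the `pick_path` helper, and the rule of sending on the path with the most spare window are hypothetical and are not taken from mTCP itself.

```python
class Path:
    """Hypothetical per-path state: each path keeps its own congestion window."""

    def __init__(self, name, cwnd=1.0):
        self.name = name
        self.cwnd = cwnd      # congestion window, in packets
        self.alive = True

    def on_ack(self):
        # Additive increase: roughly one extra packet per round trip.
        self.cwnd += 1.0 / self.cwnd

    def on_loss(self):
        # Multiplicative decrease, applied to this path only.
        self.cwnd = max(1.0, self.cwnd / 2.0)

    def on_failure(self):
        # A failed path is excluded from scheduling until it is restored.
        self.alive = False


def pick_path(paths, in_flight):
    """Return the live path with the most spare window, or None if none has room.

    `in_flight` maps a path name to the number of unacknowledged packets on it.
    """
    candidates = [
        p for p in paths
        if p.alive and in_flight.get(p.name, 0) < p.cwnd
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.cwnd - in_flight.get(p.name, 0))
```

Since a loss on one path shrinks only that path's window, the remaining paths keep sending at their own rates, which is the property that lets a multipath sender both aggregate throughput and tolerate a single path failure.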

We integrate a shared congestion detection mechanism into mTCP that allows us to suppress paths with shared congestion. This helps alleviate the aggressiveness of mTCP. We also propose a heuristic to find disjoint paths between pairs of nodes. This can minimize the chance of concurrent failures and shared congestion. We implement mTCP on top of an overlay network and evaluate it using both emulations and experiments in the wide-area network.
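
The abstract does not describe the disjoint-path heuristic, so the following is only a generic greedy sketch of the underlying idea: prefer paths that share as few links as possible with those already chosen. The function names and the tie-breaking rule (shorter paths first) are assumptions, not the method proposed in the thesis.

```python
def shared_links(path_a, path_b):
    """Count directed links (consecutive node pairs) that two paths have in common."""
    links_a = set(zip(path_a, path_a[1:]))
    links_b = set(zip(path_b, path_b[1:]))
    return len(links_a & links_b)


def select_disjoint(candidates, k):
    """Greedily pick up to k paths, minimizing overlap with paths already chosen.

    Each candidate is a list of node names from source to destination.
    Ties are broken in favor of shorter paths, then by input order.
    """
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        best = min(
            remaining,
            key=lambda p: (sum(shared_links(p, c) for c in chosen), len(p)),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Paths that share no links are less likely to fail together or to experience congestion at the same bottleneck, which is the motivation the abstract gives for seeking disjoint paths.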