Networks That Manage Themselves (Jennifer Rexford)
==================================================

In the early days of the ARPANET, survivability in the presence of failures was a primary design goal for the network that would eventually become the Internet. Resilience was important for tolerating malfunctioning equipment and malicious attacks against the infrastructure. The two most visible manifestations of this design goal were (i) the absence of per-session state in the network and (ii) the use of distributed routing protocols that adapt automatically to changes in the underlying topology. Later, the Transmission Control Protocol (TCP), implemented in end hosts, would evolve to adapt to changes in network load, preventing congestion collapse and allowing a large, distributed set of users to share the available bandwidth. Together, these design decisions led to a network that could, in some sense, manage itself. Ultimately, these features were crucial in supporting the rapid growth in both the size of the network and the volume of traffic in the mid-to-late 1990s.

Although congestion control and dynamic routing protocols ensure that the Internet continues to function, these mechanisms do not ensure that the network operates efficiently. For example:

- Links in one part of the network may be heavily utilized while other links remain idle
- Flash crowds or equipment failures may cause sudden shifts in the load on the network
- Voice-over-IP calls and gaming applications may experience poor performance on network paths with high end-to-end delay
- Denial-of-service attacks may impose a significant load on the network and on certain end hosts
- Dynamic changes in routing may cause transient disruptions in end-to-end performance
- Configuration errors can make destinations unreachable or trigger sudden increases in routing-table size and processing load

The existing control mechanisms in the Internet do not address these problems; in practice, the Internet does not manage itself. Instead, network operators must continually work to detect, diagnose, and fix problems as they arise. If anything, the automatic adaptation mechanisms of TCP and the routing protocols make it more difficult for operators to perform these tasks. In addition, most traffic in the Internet traverses multiple Autonomous Systems (ASes) en route to its destination. As such, the operators of any individual AS do not have end-to-end control, making it difficult to detect and diagnose problems and to predict the impact of changes in network configuration.

Operating an IP network involves a variety of tasks, ranging from provisioning new routers and new connections to neighboring ASes to the day-to-day monitoring and control of the network. The current state of the art in IP network management methodologies and systems is still quite immature. The main Internet protocols were not designed with manageability or measurability in mind. Router vendors offer extremely primitive, "assembly language" interfaces for configuring their equipment. Commercial network management tools focus primarily on operating individual pieces of equipment rather than on controlling entire networks.

Over the past several years, advances in networking research have led to significant increases in the capacity of links and routers, as well as new mechanisms for path selection, buffer management, and link scheduling. However, relatively little attention has been paid to the models and systems needed to guide the operation of this improved infrastructure. As a result, operators often change their networks manually, without knowing the effects in advance, or must develop their own custom tools to solve individual problems as they arise. Network management remains largely a black art, performed by an increasingly overwhelmed community of human operators.

Improving the abstractions and tools that drive IP network operations will require new innovations from the networking research community, drawing on a variety of related technical disciplines, such as:

- Programming languages and compilers: Higher levels of abstraction for (re)configuring individual routers and entire networks
- Database technology: Systems for automated network provisioning and online analysis of diverse types of measurement data
- Algorithms: Efficient techniques for detecting and diagnosing anomalies in large streams of traffic, routing, and fault data (see the first sketch after this list)
- Distributed systems: Reliable, distributed systems for data collection and analysis and for automated configuration of network equipment
- Statistics: Mathematical models for making sound inferences about network load and behavior from limited measurement data
- Operations research: Robust optimization techniques for tuning router mechanisms for path selection, buffer management, and link scheduling (see the second sketch after this list)
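As a small illustration of the algorithms direction, the sketch below uses the well-known Misra-Gries summary to find heavy-hitter keys, such as source addresses responsible for a disproportionate share of packets, in a single pass over a traffic stream with bounded memory. The stream and parameter choices here are hypothetical; a deployed detector would operate on real flow records.

```python
def heavy_hitters(stream, k):
    """Misra-Gries summary: one pass over the stream, at most k - 1 counters.

    Any key whose true frequency exceeds len(stream) / k is guaranteed
    to survive in the returned dictionary; surviving keys are candidates
    that a second pass over the data can verify.
    """
    counters = {}
    for key in stream:
        if key in counters:
            counters[key] += 1
        elif len(counters) < k - 1:
            counters[key] = 1
        else:
            # Decrement every counter; evict any that reach zero.
            for tracked in list(counters):
                counters[tracked] -= 1
                if counters[tracked] == 0:
                    del counters[tracked]
    return counters

# Hypothetical stream of packets keyed by source address: two heavy
# sources hidden among a crowd of light ones.
stream = (["10.0.0.1"] * 600 + ["10.0.0.2"] * 300
          + [f"192.168.0.{i}" for i in range(100)])
print(heavy_hitters(stream, k=10))  # both heavy sources survive
```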
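In the same spirit, for the operations research direction, the sketch below tunes integer link weights by simple local search so that shortest-path routing spreads traffic more evenly, echoing well-known work on optimizing OSPF/IS-IS link weights. The topology, traffic demands, and search procedure are invented for illustration, and the sketch relies on the third-party networkx library for shortest paths; real traffic engineering involves far more careful modeling and robustness analysis.

```python
import random

import networkx as nx  # third-party; used only for shortest-path routing

def max_link_utilization(G, demands):
    """Route each demand on its current shortest path and return the
    worst-case link utilization (load divided by capacity)."""
    load = {e: 0.0 for e in G.edges}
    for (src, dst), volume in demands.items():
        path = nx.shortest_path(G, src, dst, weight="weight")
        for u, v in zip(path, path[1:]):
            load[(u, v)] += volume
    return max(load[e] / G.edges[e]["capacity"] for e in G.edges)

def tune_weights(G, demands, rounds=500, seed=1):
    """Local search over integer link weights: try a random change and
    keep it only if it lowers the maximum link utilization."""
    rng = random.Random(seed)
    best = max_link_utilization(G, demands)
    for _ in range(rounds):
        e = rng.choice(list(G.edges))
        old_weight = G.edges[e]["weight"]
        G.edges[e]["weight"] = rng.randint(1, 20)
        trial = max_link_utilization(G, demands)
        if trial < best:
            best = trial
        else:
            G.edges[e]["weight"] = old_weight  # revert unhelpful changes
    return best

# Hypothetical four-router topology with unit weights and two demands
# that collide on one link unless the weights steer them apart.
G = nx.DiGraph()
for u, v in [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")]:
    G.add_edge(u, v, weight=1, capacity=10.0)
    G.add_edge(v, u, weight=1, capacity=10.0)
demands = {("A", "D"): 8.0, ("B", "D"): 6.0}
print(tune_weights(G, demands))  # maximum utilization after tuning
```

Because shortest paths are recomputed globally after every weight change, a single local tweak can shift load anywhere in the network; this is precisely why operators need predictive, network-wide models before touching a live configuration.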
These kinds of advances can significantly improve the state of the art in managing large IP networks, and can help the Internet evolve into the reliable, mature communications infrastructure needed to support a wide variety of government, commercial, and educational applications.