Networks That Manage Themselves (Jennifer Rexford)
==================================================

In the early days of the ARPANET, survivability in the presence of failures was a primary design goal for the network that would eventually become the Internet. Resilience was important for tolerating malfunctioning equipment and malicious attacks against the infrastructure. The two most visible manifestations of this design goal were (i) the absence of per-session state in the network and (ii) the use of distributed routing protocols that adapt automatically to changes in the underlying topology. Later, the Transmission Control Protocol (TCP), implemented in end hosts, would evolve to adapt to changes in network load, preventing congestion collapse and allowing a large, distributed set of users to share the available bandwidth. Together, these design decisions led to a network that could, in some sense, manage itself. Ultimately, these features were crucial in supporting the rapid growth in both the size of the network and the volume of traffic in the mid-to-late 1990s.

Although congestion control and dynamic routing protocols ensure that the Internet continues to function, these mechanisms do not ensure that the network operates efficiently. For example:

- Links in one part of the network may be heavily utilized while other links remain idle
- Flash crowds or equipment failures may cause sudden shifts in the load on the network
- Voice-over-IP calls and gaming applications may experience poor performance on network paths with high end-to-end delay
- Denial-of-service attacks may impose a significant load on the network and on certain end hosts
- Dynamic changes in routing may cause transient disruptions in end-to-end performance
- Configuration errors can make destinations unreachable or trigger sudden increases in routing-table size and processing load

The existing control mechanisms in the Internet do not address these problems; in practice, the Internet does not manage itself. Instead, network operators must continually work to detect, diagnose, and fix problems as they arise. If anything, the automatic adaptation mechanisms of TCP and the routing protocols make it more difficult for operators to perform these tasks. In addition, most traffic in the Internet traverses multiple Autonomous Systems (ASes) en route to its destination. As such, the operators of any individual AS do not have end-to-end control, making it difficult to detect and diagnose problems and to predict the impact of changes in network configuration.

Operating an IP network involves a variety of tasks, ranging from provisioning new routers and new connections to neighboring ASes to the day-to-day monitoring and control of the network. The current state of the art in IP network management methodologies and systems is still quite immature. The main Internet protocols were not designed with manageability or measurability in mind. Router vendors offer extremely primitive, "assembly language" interfaces for configuring their equipment. Commercial network management tools focus primarily on operating individual pieces of equipment rather than on controlling entire networks.

Over the past several years, advances in networking research have led to significant increases in the capacity of links and routers, as well as new mechanisms for path selection, buffer management, and link scheduling. However, relatively little attention has been paid to the models and systems needed to guide the operation of this improved infrastructure. As a result, operators often change their networks manually, without knowing the effects in advance, or must develop their own custom tools to solve individual problems as they arise. Network management remains largely a black art, performed by an increasingly overwhelmed community of human operators.

Improving the abstractions and tools that drive IP network operations will require new innovations from the networking research community, drawing on a variety of related technical disciplines, such as:

- Programming languages and compilers: Higher levels of abstraction for (re)configuring individual routers and entire networks
- Database technology: Systems for automated network provisioning and online analysis of diverse types of measurement data
- Algorithms: Efficient techniques for detecting and diagnosing anomalies in large streams of traffic, routing, and fault data (see the first sketch after this list)
- Distributed systems: Reliable, distributed systems for data collection and analysis and for automated configuration of network equipment
- Statistics: Mathematical models for making sound inferences about network load and behavior from limited measurement data
- Operations research: Robust optimization techniques for tuning router mechanisms for path selection, buffer management, and link scheduling (see the second sketch after this list)
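As a small illustration of the algorithms direction, the sketch below uses the well-known Misra-Gries summary to find heavy-hitter keys, such as source addresses responsible for a disproportionate share of packets, in a single pass over a traffic stream with bounded memory. The stream and parameter choices here are hypothetical; a deployed detector would operate on real flow records.

```python
def heavy_hitters(stream, k):
    """Misra-Gries summary: one pass over the stream, at most k - 1 counters.

    Any key whose true frequency exceeds len(stream) / k is guaranteed
    to survive in the returned dictionary; surviving keys are candidates
    that a second pass over the data can verify.
    """
    counters = {}
    for key in stream:
        if key in counters:
            counters[key] += 1
        elif len(counters) < k - 1:
            counters[key] = 1
        else:
            # Decrement every counter; evict any that reach zero.
            for tracked in list(counters):
                counters[tracked] -= 1
                if counters[tracked] == 0:
                    del counters[tracked]
    return counters

# Hypothetical stream of packets keyed by source address: two heavy
# sources hidden among a crowd of light ones.
stream = (["10.0.0.1"] * 600 + ["10.0.0.2"] * 300
          + [f"192.168.0.{i}" for i in range(100)])
print(heavy_hitters(stream, k=10))  # both heavy sources survive
```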
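In the same spirit, for the operations research direction, the sketch below tunes integer link weights by simple local search so that shortest-path routing spreads traffic more evenly, echoing well-known work on optimizing OSPF/IS-IS link weights. The topology, traffic demands, and search procedure are invented for illustration, and the sketch relies on the third-party networkx library for shortest paths; real traffic engineering involves far more careful modeling and robustness analysis.

```python
import random

import networkx as nx  # third-party; used only for shortest-path routing

def max_link_utilization(G, demands):
    """Route each demand on its current shortest path and return the
    worst-case link utilization (load divided by capacity)."""
    load = {e: 0.0 for e in G.edges}
    for (src, dst), volume in demands.items():
        path = nx.shortest_path(G, src, dst, weight="weight")
        for u, v in zip(path, path[1:]):
            load[(u, v)] += volume
    return max(load[e] / G.edges[e]["capacity"] for e in G.edges)

def tune_weights(G, demands, rounds=500, seed=1):
    """Local search over integer link weights: try a random change and
    keep it only if it lowers the maximum link utilization."""
    rng = random.Random(seed)
    best = max_link_utilization(G, demands)
    for _ in range(rounds):
        e = rng.choice(list(G.edges))
        old_weight = G.edges[e]["weight"]
        G.edges[e]["weight"] = rng.randint(1, 20)
        trial = max_link_utilization(G, demands)
        if trial < best:
            best = trial
        else:
            G.edges[e]["weight"] = old_weight  # revert unhelpful changes
    return best

# Hypothetical four-router topology with unit weights and two demands
# that collide on one link unless the weights steer them apart.
G = nx.DiGraph()
for u, v in [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")]:
    G.add_edge(u, v, weight=1, capacity=10.0)
    G.add_edge(v, u, weight=1, capacity=10.0)
demands = {("A", "D"): 8.0, ("B", "D"): 6.0}
print(tune_weights(G, demands))  # maximum utilization after tuning
```

Because shortest paths are recomputed globally after every weight change, a single local tweak can shift load anywhere in the network; this is precisely why operators need predictive, network-wide models before touching a live configuration.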
These kinds of advances can significantly improve the state of the art in managing large IP networks, and can help the Internet evolve into the reliable, mature communications infrastructure needed to support a wide variety of government, commercial, and educational applications.