Routing Problems are Too Easy to Cause, and Too Hard to Diagnose
================================================================

IP routing protocols, such as OSPF or BGP, form a complex, highly-configurable distributed system underlying the end-to-end delivery of data packets. "Highly configurable" is a nice way of saying "hard to configure" or "easy to misconfigure," and "distributed system" is a nice way of saying "hard to understand" or "hard to debug." As such, we have a routing system today where a single typographical error by a human operator can easily disconnect parts of the Internet, and diagnosing and fixing routing problems remains an elusive black art. This is unacceptable for any technology that would be considered a core communication infrastructure. I believe that the networking research community should devote significant attention to improving the state of the art in router configuration and network troubleshooting.

Several factors conspire to make IP router configuration extremely challenging:

- Vendor configuration languages are primitive and low-level, like assembly language (e.g., a typical router may have ten thousand lines of configuration commands)
- Routers implement numerous complex protocols (e.g., static routes, RIP, EIGRP, IS-IS, OSPF, BGP, MPLS, and various multicast protocols) that have many tunable parameters (e.g., timers, link weights/areas, and BGP routing policies)
- The routing protocols interact with each other (e.g., "hot-potato" routing in BGP based on the underlying IGP, use of static routes to reach the remote BGP end-point, and route injection between protocols)
- Scalability often requires even more complex configuration to limit the scope of routing information (e.g., OSPF areas and summarization, BGP route reflectors and confederations, and route aggregation)
- Networks are configured at the element (or router) level, rather than as a single cohesive unit with well-defined policies and constraints
- Key network operations goals, such as traffic engineering and security, are not directly supported, requiring operators to tweak the router configuration in the hope of having the right (indirect) effect on the network and its traffic

Addressing these complicated problems will require research work in configuration languages, protocol modeling, and network modeling, and would hopefully lead to a higher level of abstraction for managing the configuration of the network as well as tools for configuration checking and, better yet, automation of configuration from a higher-level specification of the network goals. Extensions (or replacements!) of the routing protocols may also be necessary to rectify some of these problems.

Detecting, diagnosing, and fixing routing problems are also very complicated because:

- Routing protocols are hard to configure, making configuration mistakes very common (see above!)
- Routing protocols do not convey enough information to explain why a route has changed (or disappeared entirely)
- No authoritative record exists that can identify which routes are valid (e.g., whether the originating AS is entitled to advertise the prefix, or whether one AS should be providing transit service from one AS to another)
- Failures, configuration errors, or malicious acts in remote locations can affect the path between two hosts
- Reachability problems can arise for other reasons, unrelated to the routing protocols (e.g., packet filtering or firewalls, MTU mismatches, network congestion, and overloaded or faulty end hosts)
- The end-to-end forwarding path depends on the complex interaction between multiple routing protocols running in a large collection of networks
- Route filtering and route aggregation (often necessary for scalability) can lead to subtle reachability problems, including persistent forwarding loops
- The network does not have much support for active measurement tools for measuring the forwarding path (i.e., traceroute is very primitive, and limited in its accuracy and potential uses)
- The Internet topology is not fully known, at the router or the AS levels (or in terms of AS relationships and policies), and may be inherently unknowable

Like router configuration, network troubleshooting has received little attention from the research community, despite its importance to network practitioners. Research work in network support for measurement, extensions to routing protocols to facilitate diagnosis, and new diagnostic tools would be extremely valuable for improving the state of the art.
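
To make the "authoritative record" item in the troubleshooting list above more concrete, here is a minimal sketch of the kind of check such a record would enable. It is purely illustrative: the registry `REGISTRY`, the function `check_origin`, and the three-way result are hypothetical names invented for this example, and the prefixes and AS numbers are documentation-only values. Treating every more-specific prefix of a registered prefix as covered is also a simplifying assumption, not a statement about how any real registry works.

```python
import ipaddress

# Hypothetical registry (an assumption for illustration only):
# prefix -> set of AS numbers allowed to originate it.
# Prefixes and AS numbers are documentation-only values.
REGISTRY = {
    ipaddress.ip_network("192.0.2.0/24"): {64496},
    ipaddress.ip_network("198.51.100.0/24"): {64497, 64498},
}


def check_origin(prefix, origin_as, registry=REGISTRY):
    """Classify an announcement as 'valid', 'invalid', or 'unknown'.

    'valid'   - a registered covering prefix lists origin_as
    'invalid' - covering prefixes exist, but none lists origin_as
    'unknown' - no registered prefix covers the announcement
    """
    announced = ipaddress.ip_network(prefix)
    covered = False
    for registered, allowed in registry.items():
        # subnet_of() raises TypeError across IP versions, so skip those.
        if announced.version != registered.version:
            continue
        if announced.subnet_of(registered):
            covered = True
            if origin_as in allowed:
                return "valid"
    return "invalid" if covered else "unknown"


if __name__ == "__main__":
    for prefix, asn in [
        ("192.0.2.0/24", 64496),
        ("192.0.2.128/25", 64499),
        ("203.0.113.0/24", 64496),
    ]:
        print(prefix, asn, check_origin(prefix, asn))
```

The hard part is not the lookup itself but building and maintaining a trustworthy registry across independently operated networks, which is exactly the gap the list above points out.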