New Systems and Algorithms for Scalable Fault Tolerance
Users of Internet services are increasingly intolerant of delays and outages, while demanding a consistent online experience. A website that is down or misbehaving is reported within seconds, often with an embarrassing screenshot that spreads through the news like wildfire. Among these failures, the most notorious are the ones that manifest arbitrary behavior, such as returning the wrong content to users or accidentally deleting their data. Unfortunately, protecting against such failures---whether due to misconfigurations, bugs, or even malice---is prohibitively expensive, because most existing solutions do not scale beyond a single server's performance. As a result, these solutions are not used for customer-facing services, where scalability is required to cope with large user populations.
This thesis describes new systems and algorithms for tolerating arbitrary failures in Internet services, inspired by real-world debacles. Unlike prior work, our solutions are highly scalable. Our approach integrates theoretical innovations into the later stages of system design, giving robust guarantees that are also practical. We begin with a real failure that occurred in the indexing technique used by a certain database provider, and explain theoretically why the technique failed. We remedy the technique by introducing a new class of tree data structures, called relaxed trees, with provably good properties. Our analysis of relaxed trees makes use of exponential potential functions.
Then, we describe a general system for tolerating arbitrary failures, called Prophecy, that delivers scalable performance on read-mostly workloads. With a modest trust assumption, Prophecy is practical for modern Internet services, as our evaluation confirms. Finally, we devise two techniques to scale this fault tolerance to very large-scale systems and general workloads. The first is an algorithm for securely composing many small replica groups, subject to an adversary that can coordinate faulty nodes across the groups dynamically. The second is a technique for improving the fault tolerance within each replica group, by adding small, trusted broadcast channels that mitigate the impact of faulty nodes.