Reasoning about Software in the Presence of Transient Faults (thesis)
A transient fault occurs when an energetic particle strikes a chip and causes a change in state in the processor. Although there is no permanent damage, the current computation may become corrupt. Transient faults have been shown to be the cause of crashes at major companies, and current technology trends will make future processors more susceptible to them. Researchers have already developed a number of solutions using combinations of software and hardware where the general approach is to duplicate computations and check for consistency between the copies. Unfortunately, generating correct fault-tolerant code is difficult, and there has been little research on proving the correctness of these techniques. Reasoning formally about fault tolerance is challenging, because invariants that hold at compile time, such as standard type safety, may not actually hold at runtime. Previous work on formalizing fault tolerance has been at the high level, including proofs about fault-tolerant algorithms and the definition of a fault-tolerant lambda calculus.
This dissertation presents the first set of techniques for statically proving fault-tolerance properties of actual executable code. To address this challenge, we develop modifications to the general methodology for typed assembly languages. An assembly-level type system
incorporates invariants about fault tolerance that are strong enough to prove that all well-typed programs have the desired behaviors. All that is required to guarantee that a specific piece of code is fault-tolerant is to type-check the code at the conclusion of compilation.
More specifically, we introduce a family of three type systems. TALFT is a core language with simple instructions that guarantees that well-typed programs implementing a hybrid fault-tolerance scheme will always detect a single fault before the fault can affect the observable behavior of the program. ETALFT extends TALFT with features necessary to support realistic compilation, including stack activation frames and dynamic memory allocation. The third language, TALCF, precisely captures the behavior of software solutions for control flow faults and can provably detect any fault that causes incorrect control transfers between basic blocks before control exits that first incorrect block. Although each typed assembly language is designed for a specific hardware context, the type systems and proof methods use similar designs, allowing us to demonstrate general approaches needed for reasoning in the presence of transient faults. For example, to statically reason about values that may be corrupted at runtime, we generalize the “color systems” of previous work into a framework for classifying values with related reliability properties.
As well as being the first to prove that executable code is fault-tolerant, this dissertation makes three additional contributions. By including a language of static expressions within the type system, we can verify low-level fault-tolerant code independently of the compilation process. We show to apply fault-tolerant typed assembly languages to two different fault models: a hybrid system to detect data faults and a software-only system to detect control-flow faults. Finally, we investigate how to generate fault-tolerant typed assembly language for a realistic compiler.