Failure Recovery in Memory-Resident Transaction Processing Systems (thesis)
Abstract:
A main memory transaction processing system holds a complete copy of its database in semiconductor memory. We present and compare, in a common framework, a number of strategies for recovery management in main memory transaction processing systems. These include strategies for asynchronously checkpointing the primary (main memory) database copy, and for maintaining a transaction log. We also consider strategies for updating the primary database: though not directly concerned with recovery management, they affect the performance of the recovery manager. The recovery strategies are compared using an analytic performance model and a testbed implementation. The model computes two performance metrics: processor overhead and recovery time. Processor overhead measures the impact of a recovery strategy during normal operation, i.e., the cost of preparing to recover from a failure. Recovery time measures the cost of recovery once a failure has occurred. Generally, it is possible to reduce processor overhead by increasing recovery time, and vice versa. The model captures this tradeoff, the exact nature of which depends on which recovery strategies are used. Many of the recovery strategies have been implemented in a testbed, a working transaction processing system. The testbed allows recovery strategies to be combined and tested. It has been used to verify the performance model and to study aspects of performance not considered in the model, such as data contention and transaction response times. The testbed runs on a large-memory VAX 11/785 using services provided by the Mach operating system. Our results indicate that the selection of a checkpointing strategy is the most critical decision in designing a recovery manager. In most situations, fuzzy (unsynchronized) checkpointing strategies outperform highly synchronized alternatives. This is true even when synchronized checkpoints are combined with efficient logical logging strategies, which cannot be used with fuzzy checkpoints.
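
The tradeoff between processor overhead and recovery time can be illustrated with a toy calculation. The sketch below is not the analytic model of the thesis. It assumes, purely for illustration, that each checkpoint costs a fixed amount of processor time, that recovery consists of reloading a backup copy and replaying the log written during one checkpoint interval, and that the checkpoint interval is the only tunable. All names (`ToyParams`, `processor_overhead`, `recovery_time`, `tradeoff_table`) and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParams:
    # All values are hypothetical, chosen only for illustration.
    checkpoint_cpu_seconds: float = 30.0         # processor time used by one checkpoint
    reload_seconds: float = 120.0                # time to reload the backup database copy
    log_bytes_per_second: float = 50_000.0       # log volume produced in normal operation
    replay_bytes_per_second: float = 500_000.0   # log replay speed during recovery


def processor_overhead(p: ToyParams, interval_seconds: float) -> float:
    """Fraction of processor time spent on checkpointing."""
    return p.checkpoint_cpu_seconds / interval_seconds


def recovery_time(p: ToyParams, interval_seconds: float) -> float:
    """Reload time plus the time to replay one interval's worth of log."""
    log_bytes = p.log_bytes_per_second * interval_seconds
    return p.reload_seconds + log_bytes / p.replay_bytes_per_second


def tradeoff_table(p: ToyParams, intervals):
    """Return (interval, overhead, recovery time) for each checkpoint interval."""
    return [(t, processor_overhead(p, t), recovery_time(p, t)) for t in intervals]


if __name__ == "__main__":
    for t, ovh, rec in tradeoff_table(ToyParams(), [60, 300, 1200, 3600]):
        print(f"interval={t:>5}s  overhead={ovh:6.1%}  recovery={rec:7.1f}s")
```

Under these assumptions, lengthening the checkpoint interval lowers processor overhead and raises recovery time, which is the general shape of the tradeoff described above. The exact curve in a real system depends on the checkpointing and logging strategies chosen.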