LEVERAGING DISTRIBUTED STORAGE REDUNDANCY IN DATACENTERS
Abstract:
All distributed storage systems replicate data objects, providing built-in redundancy that is
designed to help the system withstand failures. Such redundancy is unavoidable because
withstanding failures is a critical goal of distributed systems, and redundancy is the only
way to tolerate failures that cause loss of data access.
However, as data volumes grow, it is becoming increasingly important to reduce
the costs of distributed storage systems. To balance the need to reduce storage costs and the
need to withstand failures, this thesis explores two ways we can leverage the unavoidable
redundancy in distributed storage systems to eliminate additional storage overheads in other
parts of the storage stack.
The first system we present is Replex. The key end-to-end observation in this work
is that distributed secondary indices duplicate the work done by replication. Secondary
indices often store full copies of data objects, in addition to the replicas of data objects that
are created by default to handle failures. In Replex, we eliminate the additional storage
overhead of secondary indices by treating them as data replicas at replication time.
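
The idea of letting an index double as a replica can be illustrated with a minimal, in-memory sketch. This is not the Replex implementation: the class names, the method names, and the `recover` step are hypothetical and chosen only for illustration. Each replica stores full rows but is keyed by a different column, so a lookup on any indexed column is served by the replica keyed on that column, and no separate index copy is stored.

```python
from collections import defaultdict


class Replica:
    """One full copy of the table, keyed by a single column (hypothetical)."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = defaultdict(list)  # key value -> list of full rows

    def insert(self, row):
        # Store a copy of the whole row under this replica's key column.
        self.rows[row[self.key_column]].append(dict(row))

    def lookup(self, value):
        return list(self.rows.get(value, []))


class ReplicatedTable:
    """Keeps one replica per indexed column instead of replicas plus indices."""

    def __init__(self, key_columns):
        self.replicas = {col: Replica(col) for col in key_columns}

    def insert(self, row):
        # Replication and indexing happen in the same step.
        for replica in self.replicas.values():
            replica.insert(row)

    def lookup(self, column, value):
        # Route the query to the replica keyed on the requested column.
        return self.replicas[column].lookup(value)

    def recover(self, lost_column):
        # Assumption: a lost replica is rebuilt by re-keying the rows of a
        # surviving replica, since every replica holds the full data.
        survivor = next(r for c, r in self.replicas.items() if c != lost_column)
        rebuilt = Replica(lost_column)
        for rows in survivor.rows.values():
            for row in rows:
                rebuilt.insert(row)
        self.replicas[lost_column] = rebuilt
```

In this sketch, a table created with `ReplicatedTable(["id", "city"])` stores two copies of each row, the same number a two-way replicated store would keep, yet it answers lookups on both `id` and `city`.
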
The second system we present is DIRECT. The key end-to-end observation here is that
the redundancy created by replication can and should be used to correct bit errors at the
hardware level. Traditionally, disks are expected to hide bit errors from software, and
in fact flash devices are shipped with aggressive internal error correction mechanisms to
prevent errors from percolating to the user for the calculated lifetime of the device. In
DIRECT, we argue that the underlying premise that disks should not expose bit errors is
incorrect. By exposing these errors and correcting them with the redundancy that replication
already provides, DIRECT enables the use of flash devices well beyond their advertised
lifetime, a substantial cost saving for datacenter operators.
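
A minimal sketch of correcting device-level corruption from a replica is given below. It is not the DIRECT protocol: `BlockStore`, `read_with_repair`, the per-block CRC32 checksum, and whole-block repair are assumptions made for illustration. The sketch only shows that a checksum mismatch on the local copy can be resolved by fetching an intact copy from another replica.

```python
import zlib


class BlockStore:
    """Hypothetical replica that stores each block with a CRC32 checksum."""

    def __init__(self):
        self.blocks = {}  # block_id -> (data, crc)

    def write(self, block_id, data):
        self.blocks[block_id] = (bytes(data), zlib.crc32(data))

    def read_raw(self, block_id):
        # Return the data as stored, without hiding any corruption.
        return self.blocks[block_id]

    def corrupt(self, block_id, byte_index):
        # Simulate a bit error on the device: flip one bit, keep the old CRC.
        data, crc = self.blocks[block_id]
        flipped = bytearray(data)
        flipped[byte_index] ^= 0x01
        self.blocks[block_id] = (bytes(flipped), crc)


def read_with_repair(local, peers, block_id):
    """Read a block locally; on checksum mismatch, repair it from a peer."""
    data, crc = local.read_raw(block_id)
    if zlib.crc32(data) == crc:
        return data
    for peer in peers:
        peer_data, peer_crc = peer.read_raw(block_id)
        if zlib.crc32(peer_data) == peer_crc:
            local.write(block_id, peer_data)  # overwrite the corrupted copy
            return peer_data
    raise IOError(f"block {block_id} is corrupted on every replica")
```

Here the storage software, rather than the device, detects and corrects the error, which is the division of responsibility the paragraph above argues for.
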
Therefore, by applying existing storage redundancy to enable two key properties in
datacenter storage systems, secondary indexing and flash reliability, this thesis shows that
distributed storage systems can be designed without burdensome storage overheads.