Published on *Computer Science Department at Princeton University* (https://www.cs.princeton.edu)

This thesis explores two fundamental issues in the design of

large-scale parallel computers: {it communication /} and {it fault

tolerance/}.

In Chapter 1, we introduce and provide motivation for the problems

that we study in this thesis.

Chapter 2 examines several simple algorithms for routing packets on

butterfly networks with bounded queues. Among other things, we show

that for any greedy queuing protocol, a routing problem in which each

of the $N$ inputs sends a packet to a randomly chosen output requires

$O(log N)$ steps, with high probability, provided that the queue size

is a sufficiently large, but fixed, constant.

In Chapter 3, we analyze the fault-tolerance properties of several

bounded-degree hypercubic networks that are commonly used for parallel

computation. Among other things, we show that an $N$-node butterfly

containing $N^{1-epsilon}$ worst-case faults (for any constant

$epsilon > 0$) can emulate a fault-free butterfly of the same size

with only constant slowdown. Similar results are proved for the

shuffle-exchange graph. Hence, these networks become the first

connected bounded-degree networks known to be able to sustain more

than a constant number of worst-case faults without suffering more

than a constant-factor slowdown in performance.

In Chapter 4, we study the ability of array-based networks to tolerate

faults. Among other things, we show that an $N imes N$ two-dimensional

array can

sustain $N^{1-epsilon}$ worst-case faults, for some fixed $epsilon <

1$, and still emulate a fully-functioning $N imes N$ array with only

constant slowdown.

In Chapter 5, we study a concurrent error detection scheme called

Algorithm Based Fault Tolerance (ABFT). Unlike the schemes

developed in Chapters 3 and 4 to

tolerate permanent faults, the scheme studied in this chapter is

primarily aimed at tolerating transient faults in a parallel

computer. The main contribution of this chapter is to propose a simple

and novel algorithm called RANDGEN to generate data-check relationships.

By simply varying its parameters, RANDGEN

can produce data-check relationships with a wide spectrum of

properties, many of which have been considered important in recent ABFT

designs.

**Links**

[1] https://www.cs.princeton.edu/research/techreps/author/464

[2] ftp://ftp.cs.princeton.edu/techreports/1993/414.ps.gz