
Scheduling Parallel Computations

The work of a computation is the total number of operations the computation must execute. The span of a computation is the length of the longest sequence of operations that cannot be executed in parallel. Intuitively, the span tells us how long a computation would take on a machine that has an infinite number of processors and that will assign any ready parallel task to a free processor instantaneously. Of course, very few of us are wealthy enough to own a machine with infinitely many processors. Still, span can often be a useful way of describing a computation. It provides a simple model that allows us to think about the properties of our algorithms independently of the machines on which they run.

Parallelism

It is often useful to compare the work of a computation to its span. Such a comparison results in an estimate of the degree of parallelism that an algorithm exhibits:

parallelism = work/span
The parallelism of a computation provides a rough estimate of the maximum number of processors that can be used efficiently by any algorithm. For instance, suppose the work and the span are roughly equal:
parallelism = work/span = 1
If the work is equal to the span then there is no parallelism in an algorithm. It is completely sequential -- no operations may be performed in parallel. On the other hand, if the span is low and the work is high, then there is a great deal of parallelism. In the limit, the span of a computation may be 1 and the work may be N:
parallelism = work/span = N/1 = N
In this situation, all operations may be executed in parallel so the program may use up to N processors effectively -- one processor per operation.
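To make the ratio concrete, here is a minimal OCaml sketch (the function name and labelled arguments are our own, purely for illustration):

let parallelism ~work ~span = float_of_int work /. float_of_int span

(* For instance, a computation with work 1_000_000 and span 1_000 has
   parallelism 1000., suggesting that roughly 1000 processors could be
   kept busy. *)
let example = parallelism ~work:1_000_000 ~span:1_000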

Series-Parallel Graphs

A series-parallel graph (also known as a parallel program dependence graph) is a directed graph that describes a set of computational tasks and the dependencies between them. The nodes in the graph represent atomic operations. The edges in the graph represent dependencies between operations. Each such graph has a source node where the computation begins and a sink node where the computation ends. When viewing such graphs, one should assume that time flows from the top of the page to the bottom of the page. Hence, the source node is always drawn at the top and the sink node is at the bottom. Here are three such graphs, with nodes labelled with letters:

              b                 d
              |               /   \
a             |              e     f
              |               \   /
              c                 g

(i)          (ii)              (iii)
Graph (i) represents a computation with a single operation (a). Graph (ii) represents a computation with two operations in sequence. Execution of b must precede execution of c. Graph (iii) represents a computation that begins with d and then allows e and f to execute in parallel. When both e and f complete, the final operation g may execute. All such graphs describe a particular kind of parallelism, fork-join parallelism, in which a computation initiates some number of parallel tasks that eventually synchronize at a particular join point in the computation.

It is easy to compose two such graphs. For instance, given two graphs G1 and G2, which describe subcomputations, we draw the graph describing the sequential composition of G1 and G2 as follows:

    G1
    |
    |
    G2
Above, the line between G1 and G2 depicts the connection between the sink node of G1 and the source node of G2. It represents the fact that G1 must be executed first in its entirety and G2 must be executed second. We can compose two computations in parallel as follows:
    .
   / \
 G1   G2
   \ /
    .
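Because every such graph is built from single operations by sequential and parallel composition, it is natural to represent series-parallel graphs as an OCaml datatype. The following is only a sketch under that assumption; the type and constructor names are our own, not part of any standard library. Graphs (i), (ii) and (iii) from above appear as example values.

type graph =
  | Op of char              (* a single atomic operation, labelled by a letter *)
  | Seq of graph * graph    (* sequential composition: run the first graph, then the second *)
  | Par of graph * graph    (* parallel composition: fork, run both, then join *)

(* The three example graphs drawn above. *)
let g_i   = Op 'a'
let g_ii  = Seq (Op 'b', Op 'c')
let g_iii = Seq (Op 'd', Seq (Par (Op 'e', Op 'f'), Op 'g'))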

It is relatively easy to extract a series-parallel graph from a pure functional program because all of the dependencies in the program are explicit. In a program with mutable data structures, it is much more difficult to do so --- writing to a mutable data structure through a reference will affect any computation that reads from an alias to that reference. Determining which references alias one another is undecidable in theory and usually very difficult (or impossible) in practice.

As a simple example, consider the computation:

let f () = 2 + 3 in
let g () = f() + 7 in
let h () = 4 + 5 in
h <*> g
If we assign one node for the work of each function call, the series-parallel graph for the expression h <*> g might be drawn as follows.
  .
 / \
h() f()
|   |
|   g()
 \ /
  .
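The <*> operator above runs its two arguments in parallel and pairs their results. The notes do not define it here; one plausible sketch, assuming OCaml 5's Domain module (the definition below is our own and may differ from the course library's), is:

let ( <*> ) (f : unit -> 'a) (g : unit -> 'b) : 'a * 'b =
  let d = Domain.spawn f in   (* evaluate f on a freshly spawned domain *)
  let gv = g () in            (* evaluate g on the current domain *)
  (Domain.join d, gv)         (* wait for f and pair up the results *)

let f () = 2 + 3
let g () = f () + 7
let h () = 4 + 5
let result = h <*> g          (* evaluates to (9, 12) *)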

Once one has a series-parallel graph, one can easily determine the work and the span of a computation. The work is simply the number of nodes in the graph. The span is the length of the longest path through the graph. Hence, the work of the above computation (as depicted by the graph) is 5 and the span is 3. In this context, the span is often called the critical path.
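Using the graph datatype sketched earlier, work and span can be computed by a pair of short recursive functions. Note that this representation leaves the fork and join points implicit, so it counts only the labelled operations as nodes; it is a sketch, not an official definition from the course.

let rec work = function
  | Op _ -> 1
  | Seq (g1, g2) | Par (g1, g2) -> work g1 + work g2

let rec span = function
  | Op _ -> 1
  | Seq (g1, g2) -> span g1 + span g2         (* in sequence, spans add up *)
  | Par (g1, g2) -> max (span g1) (span g2)   (* in parallel, the longer branch dominates *)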

Scheduling

Series-parallel graphs can serve both as a visualization technique to help us understand the complexity of our algorithms and as a data structure that can be used by compilers to schedule execution of jobs on a real machine with a finite number of processors. For instance, consider the following series-parallel graph.

       a
      / \
     /   \
    b     g
   / \   / \
  c   d h   i
   \ /   \ /
    e     j
     \   /
      \ /
       f
Suppose also that each job takes the same amount of time and that we wish to execute the computation on a machine with two processors. We will do so by allocating jobs to processors beginning at the top. Note that we cannot allocate a job to a processor until all of its predecessors (those nodes above it, connected via an edge) have completed. With these constraints in mind, we might begin our schedule as follows:
Time Step    Jobs Scheduled
1               a
2               b g
3               c d
Above, in the first time step, we can only schedule job a (all other jobs require a to complete before they can be started). Unfortunately, that means one of our two available processors will remain idle during that time step. When a completes, we can schedule both b and g, and we do so. In step 3, we schedule both c and d -- they both depend upon b, which has completed, so scheduling them now is legal. If we had more processors, we could have scheduled h and i at the same time as well, because they depend only on g, which has also completed. We might finish the schedule as follows:
Time Step    Jobs Scheduled
1               a
2               b g
3               c d
4               e h
5               i
6               j
7               f
It turns out that we made a bit of a bad choice above by scheduling e and h at the same time. Doing so left us in a situation in which the only job we could schedule in step 5 was i. After that we could only schedule j and then f. This schedule leaves the second processor idle for quite a while. We could instead have scheduled h and i at step 4, resulting in the following schedule:
Time Step    Jobs Scheduled
1               a
2               b g
3               c d
4               h i
5               e j
6               f            
The moral of the story is that how you choose to schedule your jobs can make quite a difference in terms of completion time of the overall computation, even in this simplistic setting where all jobs take the same amount of time.
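The effect of these choices can be simulated directly. The sketch below (assuming unit-time jobs; the representation, function names and tie-breaking rule are our own) performs greedy list scheduling on p processors, where a priority function decides which ready jobs to run first. With the job graph above, an alphabetical priority reproduces the 7-step schedule, while a priority that prefers h and i to e reproduces the 6-step schedule.

let schedule ~p ~priority deps =
  let module S = Set.Make (String) in
  let all = List.map fst deps in
  let total = List.length all in
  let rec go finished steps =
    if S.cardinal finished = total then List.rev steps
    else
      (* a job is ready if it is unfinished and all its predecessors have finished *)
      let ready =
        List.filter
          (fun j ->
            (not (S.mem j finished))
            && List.for_all (fun q -> S.mem q finished) (List.assoc j deps))
          all
      in
      (* run the p highest-priority ready jobs in this time step *)
      let ordered =
        List.sort (fun a b -> compare (priority a) (priority b)) ready
      in
      let chosen = List.filteri (fun i _ -> i < p) ordered in
      let finished' = List.fold_left (fun s j -> S.add j s) finished chosen in
      go finished' (chosen :: steps)
  in
  go S.empty []

(* The graph above, as a list of (job, predecessors). *)
let deps =
  [ ("a", []); ("b", ["a"]); ("g", ["a"]);
    ("c", ["b"]); ("d", ["b"]); ("h", ["g"]); ("i", ["g"]);
    ("e", ["c"; "d"]); ("j", ["h"; "i"]); ("f", ["e"; "j"]) ]

(* Alphabetical priority gives the 7-step schedule:
   [["a"]; ["b"; "g"]; ["c"; "d"]; ["e"; "h"]; ["i"]; ["j"]; ["f"]] *)
let seven_steps = schedule ~p:2 ~priority:(fun j -> j) deps

(* Ranking h and i ahead of e gives the 6-step schedule:
   [["a"]; ["b"; "g"]; ["c"; "d"]; ["h"; "i"]; ["e"; "j"]; ["f"]] *)
let rank = function
  | "a" -> 0 | "b" -> 1 | "g" -> 2 | "c" -> 3 | "d" -> 4
  | "h" -> 5 | "i" -> 6 | "e" -> 7 | "j" -> 8 | _ -> 9
let six_steps = schedule ~p:2 ~priority:rank deps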

Greedy Schedulers

A greedy scheduler is a scheduler that will schedule a task instantaneously whenever it becomes ready and there is an available processor. No real scheduler is perfectly greedy because there is always some overhead to allocate a task to a processor. Nevertheless, the greedy model is a reasonable approximation in many circumstances.

Greedy schedulers have some wonderful properties. In particular, given p processors, the time T(p) to execute a computation using a greedy scheduler is constrained as follows.

(1)   T(p) < work/p + span
Moreover, one can show that the total time T(p) is always at least the maximum of work/p and the span:
(2)  T(p) >= max(work/p, span)
T(p) is at least the span because, by definition, the span is the longest series of operations that cannot be executed in parallel. T(p) is at least work/p since the best one could imagine doing is to break up the work perfectly evenly and execute exactly work/p operations on each individual processor. If we do that, each processor is in constant use throughout the entire computation, and all processors simultaneously complete their jobs after work/p time steps.

Interestingly, according to equations (1) and (2) above, we have now bounded the time taken by a greedy scheduler:

max(work/p, span) <= T(p) < work/p + span
Moreover, you can see that if the span is very small relative to work/p, then max(work/p, span) is work/p and work/p + span is very close to work/p; the two bounds pinch together, so a greedy scheduler converges on optimum performance. If the span is large compared with work/p, then little parallelism is available and again any greedy scheduler is near optimal.
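As a concrete check, consider the ten-job graph from the scheduling section: its work is 10 and, counting the jobs on its longest chain of dependencies (for example a, b, c, e, f), its span is 5. With p = 2 the bounds give

max(10/2, 5) = 5  <=  T(2)  <  10/2 + 5 = 10

and indeed both the 7-step and the 6-step schedules we constructed fall within this range.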

Summary

This note covers the following key concepts:
  • Parallelism can be defined as work/span. It approximates the number of processors that can be used productively in a computation.
  • Series-parallel graphs describe the dependencies between tasks in a parallel computation. They can be extracted from functional programs relatively easily and are a good way of visualizing the amount of parallelism in a computation. Series-parallel graphs can also be used within a compiler to assign tasks to processors.
  • A greedy scheduler assigns tasks to processors whenever a task is ready and a processor is available.

Acknowledgement: Lecture notes adapted from materials developed by Guy Blelloch, Bob Harper and Dan Licata.