Complexity Analysis for Parallel Programs
In the last note, we introduced parallel futures, a useful abstraction that allows functional programmers to add parallelism to their programs but avoid incurring any nondeterminism. In this lecture, we will explore another, closely related operator we call par. The expression par f x g y evaluates (f x) concurrently with (g y). We also capture a special case using the operator (<*>). The expression (f <*> g) evaluates (f ()) concurrently with (g ()). Both operators simplify the process of analyzing the complexity of parallel programs. Their types follow.
    par   : ('a -> 'b) -> 'a -> ('c -> 'd) -> 'c -> 'b * 'd
    (<*>) : (unit -> 'a) -> (unit -> 'b) -> 'a * 'b
If a sequential algorithm executes the code:

    (e1, e2)                                (* 1 *)

a parallel algorithm could execute:

    (fun () -> e1) <*> (fun () -> e2)       (* 2 *)

instead, provided e1 and e2 are pure. We can easily capture this idea using an equivalence rule that is valid for all pure expressions e1 and e2:

    (e1, e2) == (fun () -> e1) <*> (fun () -> e2)

Likewise, when f and g are pure (effect-free) functions, we have:

    (f x, g y) == par f x g y
Both operators are easily implemented using futures:

    open Future;;

    let par f x g y =
      let ff = future f x in
      let v = g y in
      (force ff, v)
    ;;

    let (<*>) f g = par f () g () ;;
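The Future module itself comes from the previous note and is not reproduced here. As a rough sketch only, assuming OCaml 5 or later, a stand-in with the same three names (future, force, and the type 'a future) can be built on the standard library's Domain module. The module body below is a hypothetical illustration, not the course's implementation, and fib is just an example workload.

```ocaml
(* Hypothetical stand-in for the Future module (assumes OCaml >= 5.0).
   Each future runs in its own domain; force waits for its result. *)
module Future : sig
  type 'a future
  val future : ('a -> 'b) -> 'a -> 'b future
  val force : 'a future -> 'a
end = struct
  type 'a future = 'a Domain.t
  let future f x = Domain.spawn (fun () -> f x)
  let force = Domain.join
end

open Future

(* Same definitions as in the notes *)
let par f x g y =
  let ff = future f x in   (* start (f x) in another domain *)
  let v = g y in           (* meanwhile, run (g y) here *)
  (force ff, v)

let ( <*> ) f g = par f () g ()

(* Example workload: a pure function, so par cannot change the result *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let (a, b) = par fib 20 fib 21
let (c, d) = (fun () -> 1 + 2) <*> (fun () -> "x" ^ "y")
```

Domains are heavyweight, so this sketch is only suitable for a handful of futures at a time; a realistic implementation would schedule futures on a fixed pool of workers.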
Divide and Conquer Parallel Algorithms
Many parallel algorithms use a divide-and-conquer strategy. Given a problem, a divide-and-conquer algorithm operates using the following steps:
1. Split your input into 2 or more smaller subproblems.
2. Recursively solve your smaller subproblems in parallel.
3. Combine the results of solving the subproblems to solve the overall problem.
Clearly, in order for your overall algorithm to be efficient, it must be possible to split your input efficiently and it must be possible to recombine the results of solving subproblems efficiently.
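The three steps can be sketched as a small higher-order function. This is only an illustration: divide_and_conquer and sum_range are hypothetical names, and par is replaced by a sequential stand-in with the same type, which computes the same result whenever the functions involved are pure.

```ocaml
(* Sequential stand-in for par (an assumption made so the sketch is
   self-contained): same type, same result for pure functions. *)
let par f x g y = (f x, g y)

(* Hypothetical generic two-way divide-and-conquer skeleton *)
let divide_and_conquer is_base solve_base split combine input =
  let rec go x =
    if is_base x then solve_base x
    else
      let (left, right) = split x in          (* step 1: split *)
      let (a, b) = par go left go right in    (* step 2: solve in parallel *)
      combine a b                             (* step 3: combine *)
  in
  go input

(* Example: sum an array by recursively halving the index range [lo, hi) *)
let sum_range (a : int array) : int =
  divide_and_conquer
    (fun (lo, hi) -> hi - lo <= 1)
    (fun (lo, hi) -> if hi > lo then a.(lo) else 0)
    (fun (lo, hi) ->
       let mid = lo + (hi - lo) / 2 in
       ((lo, mid), (mid, hi)))
    ( + )
    (0, Array.length a)
```

Here both the split (computing a midpoint) and the combine (one addition) take constant time, which is the situation in which divide-and-conquer parallelizes well.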
Mergesort is an excellent example of a divide-and-conquer algorithm. In the following sections, we will represent the sequences to be sorted by mergesort in two different ways. First we consider mergesorting lists, and then we consider mergesorting trees. Our choice of data structure will have a significant impact on the complexity of our algorithms.
Parallel Mergesort Over Lists
First, we examine a parallel functional mergesort on lists. Recall that mergesort operates by splitting its input list in half, sorting the two halves and then merging the two sorted lists together. Because the sorting of the two sublists can be done in parallel, it seems like a good candidate for parallelization.
    (* split one list into two lists of equal size *)
    let rec split (l : int list) : int list * int list =
      match l with
      | [] -> ([], [])
      | [x] -> ([x], [])
      | x :: y :: xs ->
          let (pile1, pile2) = split xs in
          (x :: pile1, y :: pile2)
    ;;

    (* merge two sorted lists into one sorted list *)
    let rec merge (l1 : int list) (l2 : int list) : int list =
      match (l1, l2) with
      | ([], l2) -> l2
      | (l1, []) -> l1
      | (x :: xs, y :: ys) ->
          if x < y then x :: merge xs l2
          else y :: merge l1 ys
    ;;

    (* sort list *)
    let rec mergesort (l : int list) : int list =
      match l with
      | [] -> []
      | [x] -> [x]
      | _ ->
          let (pile1, pile2) = split l in
          let (sorted1, sorted2) = par mergesort pile1      (* 1 *)
                                       mergesort pile2 in
          merge sorted1 sorted2
    ;;
The first thing to notice about this algorithm is that the only difference between a sequential mergesort and a parallel mergesort is the line marked (* 1 *). For a sequential mergesort, we would write:

    (mergesort pile1, mergesort pile2)      (* 2 *)
Complexity Models: Work and Span
How do we analyze the cost of executing a parallel program? There are two components to consider: the work and the span. The work of a computation is the total number of operations executed. Hence, the work is the same as the standard sequential complexity of a program. The span (sometimes called the depth) of a computation is the length of the longest chain of operations that must be executed one after another. Said another way, the span is the cost of executing a program assuming an unlimited number of processors is available, so no parallel task ever has to wait for a free processor.
For a sequential pair:

    work (exp1, exp2) = work(exp1) + work(exp2) + 1
    span (exp1, exp2) = span(exp1) + span(exp2) + 1

There is no parallelism in the execution of a sequential pair of expressions, so the work and the span are the same: the sum of the costs of executing each subexpression plus a cost of 1 to represent the cost of creating the pair itself.
For a parallel pair:

    work ((fun () -> exp1) <*> (fun () -> exp2)) = work(exp1) + work(exp2) + 1
    span ((fun () -> exp1) <*> (fun () -> exp2)) = max(span(exp1), span(exp2)) + 1

The work is the same, but the span is the max of the spans of the two subexpressions (plus 1). The span assumes both subexpressions are executed in parallel, so only the longer one adds to the span.
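The pair rules can be turned into a tiny cost calculator. The sketch below is an illustration under stated assumptions: cost_expr is a hypothetical type that models a program as primitive computations of known cost combined by sequential and parallel pairs, and nothing else.

```ocaml
(* Hypothetical cost model: a program is a primitive computation with a
   given cost, a sequential pair, or a parallel pair. *)
type cost_expr =
  | Leaf of int
  | SeqPair of cost_expr * cost_expr
  | ParPair of cost_expr * cost_expr

(* Work: both kinds of pair add up the work of their components, plus 1 *)
let rec work (e : cost_expr) : int =
  match e with
  | Leaf c -> c
  | SeqPair (e1, e2) | ParPair (e1, e2) -> work e1 + work e2 + 1

(* Span: a sequential pair adds spans, a parallel pair takes the max *)
let rec span (e : cost_expr) : int =
  match e with
  | Leaf c -> c
  | SeqPair (e1, e2) -> span e1 + span e2 + 1
  | ParPair (e1, e2) -> max (span e1) (span e2) + 1

let example = ParPair (Leaf 10, SeqPair (Leaf 3, Leaf 4))
(* work example = 10 + (3 + 4 + 1) + 1 = 19 *)
(* span example = max 10 (3 + 4 + 1) + 1 = 11 *)
```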
The work of mergesort l is the same as the cost of the sequential algorithm and proportional to n log n, where n is the length of the list. What about the span? The merge and split functions are sequential; their span is equal to their work:
    span of split applied to a list of length n:

    span(split, n) = k1 + span(split, n-2)      (for some constant k1)
                   = k1*n/2
                   = O(n)

    span of merge applied to a list of length n:

    span(merge, n) = k2 + span(merge, n-1)      (for some constant k2)
                   = k2*n
                   = O(n)

Now, what about the span of mergesort itself?
    span of mergesort applied to a list of length n:

    span(mergesort, n)  = k3 + span(split, n) + span(merge, n)
                             + max(span(mergesort, n/2), span(mergesort, n/2))
                       <= k4*n + span(mergesort, n/2)
                       <= k4*(n + n/2 + n/4 + n/8 + ...)
                       <= k4*2*n
                        = O(n)
So the span of mergesort is linear in the length of the list. Can we develop a better sorting algorithm? Yes, but we have to change the data structure we use to store the elements of our lists: lists are a bad data structure for parallel algorithms. Trees are much better.
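The linear span of list mergesort can also be checked numerically. The sketch below evaluates the work and span recurrences directly; setting every constant to 1 and using a base case of cost 1 are simplifying assumptions, and the function names are hypothetical.

```ocaml
(* Span recurrence for list mergesort, constants set to 1 (an assumption):
   S(n) = n + S(n/2), S(1) = 1 *)
let rec span_mergesort (n : int) : int =
  if n <= 1 then 1 else n + span_mergesort (n / 2)

(* Work recurrence for comparison:
   W(n) = n + 2 * W(n/2), W(1) = 1 *)
let rec work_mergesort (n : int) : int =
  if n <= 1 then 1 else n + 2 * work_mergesort (n / 2)

(* For n = 1024:
   span_mergesort 1024 = 2047, just under 2 * n
   work_mergesort 1024 = 11264, which is n * (log2 n + 1) *)
```

Under these assumptions the span is only about a factor of log n smaller than the work, so adding processors can speed up list mergesort by at most a logarithmic factor.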
Parallel TreeSort
The fact that mergesort operates by subdividing lists in half and recursively sorting the halves in parallel suggests that we might be able to reduce the span if we can avoid the linear-span split and merge functions. If we use a balanced tree instead of a list, doing so is not too hard. For the purpose of exposition, we will work with integer trees.
    type tree = Empty | Node of tree * int * tree ;;

    let node (left:tree) (i:int) (right:tree) : tree =
      Node (left, i, right)
    ;;

    let one (i:int) : tree = node Empty i Empty ;;
Definition: A tree is sorted (aka, "in order") under the following conditions:
- Empty is sorted.
- Node(left, i, right) is sorted iff
  - i is valuable,
  - left is sorted,
  - right is sorted,
  - all integers in left are less than or equal to i, and
  - all integers in right are greater than i.
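The definition translates almost word for word into a checker. The sketch below is a direct transcription for illustration; for_all and is_sorted are hypothetical helper names, and no attempt is made to be efficient (each level re-scans its subtrees).

```ocaml
type tree = Empty | Node of tree * int * tree

(* for_all p t holds when every integer stored in t satisfies p *)
let rec for_all (p : int -> bool) (t : tree) : bool =
  match t with
  | Empty -> true
  | Node (l, i, r) -> p i && for_all p l && for_all p r

(* Direct transcription of the definition of a sorted tree *)
let rec is_sorted (t : tree) : bool =
  match t with
  | Empty -> true
  | Node (l, i, r) ->
      is_sorted l
      && is_sorted r
      && for_all (fun x -> x <= i) l   (* left: less than or equal to i *)
      && for_all (fun x -> x > i) r    (* right: strictly greater than i *)

let leaf i = Node (Empty, i, Empty)
let good = Node (leaf 1, 2, leaf 3)
let bad = Node (leaf 3, 2, Empty)
```

Note the asymmetry in the definition: a duplicate of the root may appear in the left subtree but not in the right one.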
Given an unsorted but balanced tree, how do we mergesort it? Let's start by attacking the problem top-down. Mergesorting a tree involves mergesorting the left and right subtrees recursively, just like we did with lists. Then we merge the sorted left and right subtrees back together, along with the root, to create a sorted result. The code is below.
    let rec mergesort (t:tree) : tree =
      match t with
      | Empty -> Empty
      | Node (l, i, r) ->
          let (l', r') = par mergesort l mergesort r in
          merge (merge l' r') (one i)
    ;;
We implement the merge as follows. The key idea is that when both t1 and t2 are nonempty, we split t2 into two parts: one (l2) for the elements less than or equal to the root i of t1; the other (r2) for the elements greater than i. Then l2 and r2 are recursively merged with the subtrees of t1. It is easier to code than to say:
    let rec merge (t1:tree) (t2:tree) : tree =
      match t1 with
      | Empty -> t2
      | Node (l1, i, r1) ->
          let (l2, r2) = split_at t2 i in
          let (t1', t2') = par (merge l1) l2 (merge r1) r2 in
          Node (t1', i, t2')
    ;;

Splitting is a simple recursive procedure:
    let rec split_at (t:tree) (bound:int) : tree * tree =
      match t with
      | Empty -> (Empty, Empty)
      | Node (l, i, r) ->
          if bound < i then
            let (ll, lr) = split_at l bound in
            (ll, Node (lr, i, r))
          else
            let (rl, rr) = split_at r bound in
            (Node (l, i, rl), rr)
    ;;
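To see split_at and merge at work on concrete trees, here is a small self-contained sketch. It repeats the two functions so that it runs on its own, replaces par with a sequential stand-in of the same type (an assumption; the result is the same because merge is pure), and adds two hypothetical helpers, leaf and to_list, for building and inspecting trees.

```ocaml
type tree = Empty | Node of tree * int * tree

(* Sequential stand-in for par: same type, same result for pure functions *)
let par f x g y = (f x, g y)

let rec split_at (t : tree) (bound : int) : tree * tree =
  match t with
  | Empty -> (Empty, Empty)
  | Node (l, i, r) ->
      if bound < i then
        let (ll, lr) = split_at l bound in
        (ll, Node (lr, i, r))
      else
        let (rl, rr) = split_at r bound in
        (Node (l, i, rl), rr)

let rec merge (t1 : tree) (t2 : tree) : tree =
  match t1 with
  | Empty -> t2
  | Node (l1, i, r1) ->
      let (l2, r2) = split_at t2 i in
      let (t1', t2') = par (merge l1) l2 (merge r1) r2 in
      Node (t1', i, t2')

(* Hypothetical helpers for the example *)
let leaf i = Node (Empty, i, Empty)

let rec to_list (t : tree) : int list =     (* in-order traversal *)
  match t with
  | Empty -> []
  | Node (l, i, r) -> to_list l @ (i :: to_list r)

let t1 = Node (leaf 1, 4, leaf 7)
let t2 = Node (leaf 2, 5, leaf 9)

(* Everything <= 5 goes to the first tree, everything > 5 to the second *)
let (small, large) = split_at t2 5

(* The merged tree keeps 4 (the root of t1) as its root *)
let merged = merge t1 t2
```

An in-order traversal of merged lists the elements of both inputs in ascending order, while the shape of the result follows the shape of t1.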
Mergesort Complexity
The work of parallel mergesort is O(n log n) (where n is the number of nodes in the tree), like a conventional sequential mergesort over lists.
Let's analyze the span of mergesort assuming the depth of the tree is d. We'll start with the span of split_at. Each recursive call is made on a subtree of the input, a tree with depth at most one less than the input:

    span(split_at, d) <= k + span(split_at, d-1)
                       = O(d)

For merge, let d1 and d2 be the depths of the input trees and let dl2 and dr2 be the depths of the trees that result from splitting the second input tree.
    span(merge, d1, d2) = k + span(split_at, d2)
                            + max(span(merge, d1-1, dl2), span(merge, d1-1, dr2))

Split creates trees that are no deeper than its input tree. Hence dl2 and dr2 are no greater than d2. Consequently, we can approximate:
    span(merge, d1, d2) <= k1 + span(split_at, d2)
                              + max(span(merge, d1-1, d2), span(merge, d1-1, d2))
                        <= k2*d2 + span(merge, d1-1, d2)
                         = k2*d2*d1
                         = O(d1*d2)

If n is the size of the tree, and it is balanced so its depth d is log n, we analyze the span of parallel tree mergesort as follows.
    span(mergesort, d) <= k + max(span(mergesort, d-1), span(mergesort, d-1))
                            + span(merge, d, d)        (* 1 *)
                            + span(merge, 2*d, 1)      (* 2 *)
                       <= k + span(mergesort, d-1) + k1*d^2 + k2*d
                       <= span(mergesort, d-1) + k3*d^2
                        = O(d^3)
                        = O((log n)^3)    (where n is the number of nodes in the balanced tree)

Note that the second call to merge operates over the output of the first call to merge. That output is guaranteed to have a depth no greater than the sum of the depths of its input trees:
    depth(merge(l, r)) <= depth l + depth r

Consequently, the line marked (* 2 *) uses the fact that the input to the second call to merge is a tree with depth at most 2*d.
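To get a feel for how much smaller a cubic-in-depth span is than the linear span of list mergesort, the recurrence can be evaluated numerically. This is a rough sketch under stated assumptions: all constants are set to 1, lower-order terms are dropped, and the names tree_span and list_span are hypothetical.

```ocaml
(* Tree mergesort span recurrence with constants set to 1 (an assumption):
   M(d) = M(d-1) + d^2, M(0) = 0.
   This is the sum of the first d squares, d(d+1)(2d+1)/6, which is O(d^3). *)
let rec tree_span (d : int) : int =
  if d <= 0 then 0 else tree_span (d - 1) + d * d

(* List mergesort span bound from the list analysis, about 2*n,
   again with the constant set to 1 *)
let list_span (n : int) : int = 2 * n

(* A balanced tree with about 2^20 nodes has depth d = 20:
   tree_span 20 = 2870
   list_span (1 lsl 20) = 2097152 *)
```

The exact numbers are not meaningful, since the real constants differ, but the gap between a polylogarithmic and a linear span is already several hundredfold at a million elements.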
Unfortunately, there is one glitch in our analysis. We assumed the trees l' and r' that arise from the recursive calls to mergesort are balanced and hence have depth at most d when we call merge on them. This led to the line marked (* 1 *) in our analysis above. To ensure the trees produced by mergesort are indeed balanced, we must rebalance before returning from mergesort. Hence, the correct code for mergesort is as follows.
    let rec mergesort (t:tree) : tree =
      match t with
      | Empty -> Empty
      | Node (l, i, r) ->
          let (l', r') = par mergesort l mergesort r in
          rebalance (merge (merge l' r') (one i))     (* change *)
    ;;
Coding a parallel rebalance that does not increase the overall work or span of the algorithm is another challenge we leave to the reader.
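For experimenting with the corrected mergesort, a simple placeholder rebalance is enough. The sketch below is not a solution to the challenge: it flattens the tree to an array sequentially, so its span is linear in the number of nodes and it does not preserve the polylogarithmic span derived above. It assumes a sequential stand-in for par, and to_list, depth, and this particular rebalance are hypothetical helpers.

```ocaml
type tree = Empty | Node of tree * int * tree

let par f x g y = (f x, g y)   (* sequential stand-in, same type as par *)
let one i = Node (Empty, i, Empty)

let rec split_at t bound =
  match t with
  | Empty -> (Empty, Empty)
  | Node (l, i, r) ->
      if bound < i then
        let (ll, lr) = split_at l bound in (ll, Node (lr, i, r))
      else
        let (rl, rr) = split_at r bound in (Node (l, i, rl), rr)

let rec merge t1 t2 =
  match t1 with
  | Empty -> t2
  | Node (l1, i, r1) ->
      let (l2, r2) = split_at t2 i in
      let (t1', t2') = par (merge l1) l2 (merge r1) r2 in
      Node (t1', i, t2')

(* In-order traversal with an accumulator: linear work AND linear span *)
let rec to_list t acc =
  match t with
  | Empty -> acc
  | Node (l, i, r) -> to_list l (i :: to_list r acc)

(* Placeholder rebalance: rebuild a balanced tree from the sorted elements *)
let rebalance t =
  let a = Array.of_list (to_list t []) in
  let rec build lo hi =            (* builds a tree from a.(lo) .. a.(hi-1) *)
    if lo >= hi then Empty
    else
      let mid = lo + (hi - lo) / 2 in
      let (l, r) = par (build lo) mid (build (mid + 1)) hi in
      Node (l, a.(mid), r)
  in
  build 0 (Array.length a)

let rec mergesort t =
  match t with
  | Empty -> Empty
  | Node (l, i, r) ->
      let (l', r') = par mergesort l mergesort r in
      rebalance (merge (merge l' r') (one i))

let rec depth t =
  match t with
  | Empty -> 0
  | Node (l, _, r) -> 1 + max (depth l) (depth r)

(* Example with distinct keys *)
let input = Node (Node (one 5, 3, one 8), 1, Node (one 7, 4, one 2))
let sorted = mergesort input
```

With seven distinct keys, the result is a tree of depth 3 whose in-order traversal is ascending. The example uses distinct keys because this placeholder may place a duplicate of a root in its right subtree, which the strict definition of a sorted tree does not allow.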
Summary
There are several takeaway messages from this lecture:
- The parallel execution operators f <*> g and par f x g y are useful relatives of parallel futures.
- We can approximate the cost of parallel functional programs using work and span. Work is the sequential cost of executing a program (the sum of the costs of all instructions). Span is the parallel cost. Start by optimizing the work of your program, then optimize the span.
- Like any complexity model, work and span are approximations. They are useful guides when constructing parallel programs, but the costs of moving data to the computation (either in and out of your cache or across machines in a distributed system) can be very important in developing the most efficient programs possible. Use these complexity models in conjunction with empirical analysis.
- Divide-and-conquer is a common parallel programming strategy. However, the effectiveness of a divide-and-conquer algorithm is often determined by the cost of splitting and merging data. These costs are in turn dictated by the choice of data structure.
- In general, lists are bad data structures for parallel programming: you typically have to traverse a list (incurring ~n span) in order to gather the data you need to commence a parallel computation. Balanced trees (and several other kinds of data structures) are better because they support constant-work and constant-span operations to split your data in half or recombine it.
Acknowledgement: Lecture notes adapted from materials developed by Bob Harper and Dan Licata.