
Complexity Analysis for Parallel Programs

In the last note, we introduced parallel futures, a useful abstraction that allows functional programmers to add parallelism to their programs without incurring any non-determinism. In this lecture, we will explore another, closely related operator we call par. The expression par f x g y evaluates (f x) concurrently with (g y). We also capture a special case using the operator (<*>). The expression (f <*> g) evaluates (f ()) concurrently with (g ()). Both operators simplify the process of analyzing the complexity of parallel programs. Their types follow.

par : ('a -> 'b) -> 'a -> ('c -> 'd) -> 'c -> 'b * 'd

<*> : (unit -> 'a)  -> (unit -> 'b) -> 'a * 'b
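
For example:

par (fun x -> x + 1) 1 (fun y -> y * 2) 3 ;;   (* evaluates to (2, 6) *)
(fun () -> 1 + 1) <*> (fun () -> 2 * 2) ;;     (* evaluates to (2, 4) *)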

If a sequential algorithm executes the code:

(e1, e2)  (* 1 *)
a parallel algorithm could execute:
(fun () -> e1) <*> (fun () -> e2)  (* 2 *)
instead, provided e1 and e2 are pure. We can easily capture this idea using an equivalence rule that is valid for all pure expressions e1 and e2:

(e1, e2)  ==  (fun () -> e1) <*> (fun () -> e2)
Likewise, when f and g are pure (effect-free) functions, we have:
(f x, g y)  ==  par f x g y
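
The purity restriction matters. If the expressions have effects, the two sides can behave differently, as this (hypothetical) snippet illustrates:

(* the sequential pair performs its prints in a fixed order determined by
   OCaml's evaluation strategy; the parallel pair may interleave the two
   prints nondeterministically *)
let _ = (print_string "a", print_string "b") ;;
let _ = (fun () -> print_string "a") <*> (fun () -> print_string "b") ;;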

Both operators are easily implemented using futures:

open Future;;

let par f x g y =
  let ff = future f x in   (* start evaluating f x in parallel *)
  let v = g y in           (* meanwhile, evaluate g y on the current processor *)
  (force ff, v)            (* wait for f x's result and pair the two *)
;;

let (<*>) f g = par f () g ()
;;
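
These definitions assume the Future module from the last note. A minimal signature consistent with the code above might be:

module type FUTURE =
sig
  type 'a future
  (* future f x begins evaluating f x in parallel with the rest of the program *)
  val future : ('a -> 'b) -> 'a -> 'b future
  (* force blocks until the future's result is available, then returns it *)
  val force : 'a future -> 'a
end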

Divide-and-Conquer Parallel Algorithms

Many parallel algorithms use a divide-and-conquer strategy. Given a problem, a divide-and-conquer algorithm operates using the following steps:

  1. Split your input into two or more smaller subproblems.
  2. Recursively solve the smaller subproblems in parallel.
  3. Combine the results of solving the subproblems to solve the overall problem.

Clearly, for the overall algorithm to be efficient, it must be possible both to split the input efficiently and to recombine the results of solving the subproblems efficiently.

Mergesort is an excellent example of a divide-and-conquer algorithm. In the following sections, we will represent the sequences to be sorted by mergesort in two different ways: first we consider mergesorting lists, and then we consider mergesorting trees. Our choice of data structure will have a significant impact on the complexity of our algorithms.

Parallel Mergesort Over Lists

First, we examine a parallel functional mergesort on lists. Recall that mergesort operates by splitting its input list in half, sorting the two halves and then merging the two sorted lists together. Because the sorting of the two sublists can be done in parallel, it seems like a good candidate for parallelization.

(* split one list into two lists of nearly equal size (within one element) *)
let rec split (l : int list) : int list * int list =
  match l with
    [] -> ([] , [])
  | [x] -> ([x] , [])
  | x :: y :: xs -> 
      let (pile1, pile2) = split xs in 
      (x :: pile1, y :: pile2)
;;
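
For example:

split [1; 2; 3; 4; 5] ;;   (* evaluates to ([1; 3; 5], [2; 4]) *)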

(* merge two sorted lists into one sorted list *)
let rec merge (l1 : int list) (l2 : int list) : int list =
  match (l1, l2) with
    ([] , l2) -> l2
  | (l1 , []) -> l1
  | (x :: xs, y :: ys) ->
     if x < y then 
        x :: merge xs l2
     else 
        y :: merge l1 ys
;;
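
For instance, merging the two piles produced by split above:

merge [1; 3; 5] [2; 4] ;;   (* evaluates to [1; 2; 3; 4; 5] *)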

(* sort list *)
let rec mergesort (l : int list) : int list =
  match l with
    [] -> []
  | [x] -> [x]
  | _ -> 
    let (pile1,pile2) = split l in
    let (sorted1,sorted2) = 
      par mergesort pile1      (* 1 *)
          mergesort pile2 
    in
    merge sorted1 sorted2
;;
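
A quick test of the full sort:

mergesort [3; 1; 4; 1; 5] ;;   (* evaluates to [1; 1; 3; 4; 5] *)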

The first thing to notice about this algorithm is that the only difference between a sequential mergesort and a parallel mergesort is the line marked (* 1 *). For a sequential mergesort, we would write:

(mergesort pile1, mergesort pile2)  (* 2 *)

Complexity Models: Work and Span

How do we analyze the cost of executing a parallel program? There are two components to consider: the work and the span. The work of a computation is the total number of operations executed; hence, the work is the same as the standard sequential complexity of the program. The span (sometimes called the depth) of a computation is the length of the longest sequence of operations that cannot be executed in parallel. Said another way, the span is the cost of executing the program assuming an infinite number of processors, so no parallel task ever has to wait for a free processor.

For a sequential pair:

 
work (exp1, exp2) = work(exp1) + work(exp2) + 1
span (exp1, exp2) = span(exp1) + span(exp2) + 1
There is no parallelism in the execution of a sequential pair of expressions, so the work and the span are the same -- the sum of the cost of executing each subexpression plus a cost of 1 to represent the cost of creating the pair itself.

For a parallel pair:

work ((fun () -> exp1) <*> (fun () -> exp2)) = work(exp1) + work(exp2) + 1
span ((fun () -> exp1) <*> (fun () -> exp2)) = max(span(exp1), span(exp2)) + 1
The work is the same but the span is the max of the spans of the two subexpressions (plus 1). The span assumes both subexpressions are executed in parallel so only the longer one adds to the span.
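
To see what these rules buy us, consider a (hypothetical) parallel Fibonacci function written with par:

let rec pfib (n:int) : int =
  if n < 2 then n
  else
    let (a, b) = par pfib (n-1) pfib (n-2) in
    a + b
;;

Its work obeys work(n) = work(n-1) + work(n-2) + k, which is exponential in n, but its span obeys span(n) = max(span(n-1), span(n-2)) + k = span(n-1) + k = O(n): the two recursive calls overlap, so only the longer one contributes to the span.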

The work of mergesort l is the same as the cost of the sequential algorithm: proportional to n log n, where n is the length of the list. What about the span? The merge and split functions are sequential; their span is equal to their work:

span of split applied to a list of length n:
span(split, n) = k1 + span(split, n-2)   (for some constant k1)
               = k1*n/2
               = O(n)     

span of merge applied to a list of length n:
span(merge, n) = k2 + span(merge, n-1)   (for some constant k2)
               = k2*n
               = O(n)                   
Now, what about the span of mergesort itself?
span of mergesort applied to a list of length n:
span(mergesort, n) 
   = k3 + span(split,n) + span(merge,n) 
        + max(span(mergesort, n/2),span(mergesort, n/2)) 
  <= k4*n + span(mergesort, n/2)
   = k4*(n + n/2 + n/4 + n/8 + ...)
   = k4*2*n
   = O(n)                  

So the span of mergesort is linear in the length of the list. Can we develop a better sorting algorithm? Yes, but we have to change the data structure we use to store the elements to be sorted -- lists are a bad data structure for parallel algorithms. Trees are much better.

Parallel TreeSort

The fact that mergesort operates by subdividing lists in half and recursively sorting the halves in parallel suggests that we might be able to reduce the span if we can avoid the linear-span split and merge functions. If we use a balanced tree instead of a list, doing so is not too hard. For the purposes of exposition, we will work with integer trees.

type tree = Empty | Node of tree * int * tree ;;

let node (left:tree) (i:int) (right:tree) : tree = Node (left, i, right) ;;
let one (i:int) : tree = node Empty i Empty ;;

Definition: A tree is sorted (aka, "in order") under the following conditions:

  • Empty is sorted.
  • Node(left, i, right) is sorted iff
    • i is valuable,
    • left is sorted,
    • right is sorted,
    • all integers in left are less than or equal to i, and
    • all integers in right are greater than i.
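
This definition translates directly into a checker. Here is a minimal sketch that threads lower and upper bounds through the tree (it assumes no key equals max_int, so the bound i+1 cannot overflow):

let rec sorted_between (t:tree) (lo:int) (hi:int) : bool =
  match t with
      Empty -> true
    | Node (l, i, r) ->
        lo <= i && i <= hi &&
        sorted_between l lo i &&      (* left elements are at most i *)
        sorted_between r (i+1) hi     (* right elements exceed i *)
;;

let is_sorted (t:tree) : bool = sorted_between t min_int max_int ;;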

Given an unsorted but balanced tree, how do we mergesort it? Let's start by attacking the problem top-down. Mergesorting a tree involves mergesorting the left and right subtrees recursively, just like we did with lists. Then we merge the sorted left and right subtrees back together, along with the root, to create a sorted result. The code is below.

let rec mergesort (t:tree) : tree =
  match t with
      Empty -> Empty
    | Node (l, i, r) -> 
        let (l', r') = 
          par mergesort l 
              mergesort r 
        in
        merge (merge l' r') (one i)
;;

We implement the merge as follows. The key idea is that when both t1 and t2 are non-empty, we split t2 into two parts -- one (l2) for the elements less than or equal to the root i of t1; the other (r2) for the elements greater than i. Then l2 and r2 are recursively merged with the subtrees of t1. The merge is easier to code than to describe:

let rec merge (t1:tree) (t2:tree) : tree =
  match t1 with
      Empty -> t2
    | Node (l1, i, r1) -> 
        let (l2, r2) = split_at t2 i in
        let (t1', t2') = 
          par (merge l1) l2 
              (merge r1) r2 
        in
        Node (t1', i, t2')
;;
Splitting is a simple recursive procedure:
let rec split_at (t:tree) (bound:int) : tree * tree =
  match t with
      Empty -> (Empty, Empty)
    | Node (l, i, r) -> 
        if bound < i then
          let (ll, lr) = split_at l bound in
          (ll, Node (lr, i, r))
        else
          let (rl, rr) = split_at r bound in
          (Node (l, i, rl), rr)
;;
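
Putting the pieces together (and assuming split_at, merge, and mergesort have been compiled in dependency order), a small test might look like:

let t = node (one 3) 1 (one 2) ;;   (* a balanced but unsorted tree: 3, 1, 2 *)
let s = mergesort t ;;              (* a sorted tree containing 1, 2, 3 *)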

Mergesort Complexity

The work of parallel mergesort is O(n log n) (where n is the number of nodes in the tree), like a conventional sequential mergesort over lists.

Let's analyze the span of mergesort assuming the depth of the tree is d. We'll start with the span of split_at. Each recursive call is made on a subtree of the input -- a tree with depth one less than the input:

span(split_at, d) = k + span(split_at, d-1)
                  = O(d)
For merge, let d1 and d2 be the depths of the input trees, and let dl2 and dr2 be the depths of the trees that result from splitting t2.
span(merge, d1, d2) = k + span(split_at, d2) 
                        + max(span(merge, d1-1, dl2), span(merge, d1-1, dr2))
                                                 
Split creates trees that are no deeper than its input tree. Hence dl2 and dr2 are at most d2. Consequently, we can approximate:
span(merge, d1, d2) <= k1 + span(split_at, d2) 
                        + max(span(merge, d1-1, d2), span(merge, d1-1, d2))
                    <= k2*d2 + span(merge, d1-1, d2)
                    = k2*d2*d1
                    = O(d1*d2)
If n is the size of the tree, and it is balanced so its depth d is log n, we analyze the span of parallel tree mergesort as follows.
span(mergesort, d) 
  <= k + max(span(mergesort, d-1), span(mergesort, d-1)) 
       + span(merge, d, d)                         (* 1 *)
       + span(merge, 2*d, 1)                       (* 2 *)
  <= k + span(mergesort, d-1) + k1*d^2 + k2*d
  <= span(mergesort, d-1) + k3*d^2
  = O(d^3)
  = O((log n)^3)    (where n is the number of nodes in the balanced tree)
Note that the second call to merge operates over the output of the first call to merge. That output is guaranteed to have a depth no greater than the sum of the depths of its input trees:
depth(merge(l, r)) <= depth l + depth r
(Intuitively, each recursive step of merge descends one level into its first argument, and split_at never deepens its input.) Consequently, the line marked (* 2 *) shows that the input to the second call to merge is a tree with depth at most 2*d.

Unfortunately, there is one glitch in our analysis. We assumed the trees l' and r' that arise from the recursive calls to mergesort are balanced and hence have depth d when we call merge on them. This led to the line marked (* 1 *) in our analysis above. To ensure the trees produced by mergesort are indeed balanced, we must rebalance before returning from mergesort. Hence, the correct code for mergesort is as follows.

let rec mergesort (t:tree) : tree =
  match t with
      Empty -> Empty
    | Node (l, i, r) -> 
        let (l', r') = 
          par mergesort l 
              mergesort r 
        in
        rebalance (merge (merge l' r') (one i))  (* change *)
;;

Coding a parallel rebalance that does not increase the overall work or span of the algorithm is another challenge we leave to the reader.

Summary

There are several takeaway messages from this lecture:
  • The parallel execution operators f <*> g and par f x g y are useful relatives of parallel futures.
  • We can approximate the cost of parallel functional programs using work and span. Work is the sequential cost of executing a program (the sum of the costs of all instructions). Span is the parallel cost. Start by optimizing the work of your program, then optimize the span.
  • Like any complexity model, work and span are approximations. They are useful guides when constructing parallel programs, but the costs of moving data to the computation (either in and out of your cache or across machines in a distributed system) can be very important in developing the most efficient programs possible. Use these complexity models in conjunction with empirical analysis.
  • Divide-and-conquer is a common parallel programming strategy. However, the effectiveness of a divide-and-conquer algorithm can often be determined by the cost of splitting and merging data. These costs are in turn dictated by the choice of data structure used.
  • In general, lists are bad data structures for parallel programming -- you typically have to traverse a list (incurring O(n) span) in order to gather the data you need to commence a parallel computation. Balanced trees (and several other kinds of data structures) are better because they support constant-work and constant-span operations to split your data in half or recombine it.

Acknowledgement: Lecture notes adapted from materials developed by Bob Harper and Dan Licata.