Parallel Collections

One of the simplest, easiest-to-use and most successful forms of parallelism revolves around programming with parallel collections. There are many different kinds of parallel collections, including collections that behave like sets or like dictionaries or tables or sequences. In general, these collection data types provide an interface that includes operations that process all objects in the collection in parallel. For instance, map f c is one of the most common operations. It generates a new set (or dictionary or table or sequence) by applying the function f to all objects in the collection c in parallel.

Bulk parallel operations like map f are simple to use because the semantics of a parallel map is deterministic and indistinguishable (except for performance) from the semantics of a sequential map when the function f is effect-free. So programming with parallel collections is no harder than programming with sequential collections, and you learned how to do the latter in the second week of class. It's easy!
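
To make the effect-free condition concrete, here is a small sketch that uses only OCaml's standard List module. The two map functions are hypothetical helpers, not part of any parallel library: they return results in the same positions but apply f in opposite orders, standing in for two different schedules a parallel implementation might choose.

```ocaml
(* Apply f to the elements from first to last. *)
let map_left_to_right (f : 'a -> 'b) (xs : 'a list) : 'b list =
  List.rev (List.fold_left (fun acc x -> f x :: acc) [] xs)

(* Apply f to the elements from last to first; results keep their positions. *)
let map_right_to_left (f : 'a -> 'b) (xs : 'a list) : 'b list =
  List.fold_right (fun x acc -> let y = f x in y :: acc) xs []

(* Effect-free: the result depends only on the argument. *)
let square (x : int) : int = x * x

(* Effectful: each call bumps a hidden counter, so the result depends on
 * how many calls happened earlier. *)
let make_stamper () : int -> int =
  let calls = ref 0 in
  fun x -> incr calls; x + !calls

(* The order of application is unobservable for the pure function ... *)
let pure_agrees =
  map_left_to_right square [1; 2; 3] = map_right_to_left square [1; 2; 3]

(* ... but observable for the effectful one. *)
let effectful_agrees =
  map_left_to_right (make_stamper ()) [10; 20; 30]
  = map_right_to_left (make_stamper ()) [10; 20; 30]
```

Here pure_agrees is true and effectful_agrees is false, which is why the guarantee above is stated only for effect-free functions.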

Moreover, it is well-known how to develop highly efficient implementations of many operations over parallel collections. Some implementations are designed to operate over a single multi-core machine and others over many machines, such as in a data center. In the latter case, the implementation will typically also handle machine failures transparently by re-executing jobs that occurred on failed machines. When functions are effect-free, re-executing them does no harm (another huge benefit of functional programming).

Because there are many different implementation techniques for parallel collections and the implementation details are often important for achieving high performance, you should strive to implement your algorithms at a high level of abstraction on top of abstract parallel data types. This will help separate the low-level parallel implementation details involving scheduling tasks across multiple cores, processors or machines from the high-level algorithmic details specific to your problem. If you use abstraction carefully, it should be possible, for example, to swap out one parallel implementation for another one. This can help to make your algorithms more portable and reusable. For instance, you might design and test your code on a single machine but then deploy it in a data center over hundreds of machines. Having said that, not all parallel collection APIs can be implemented equally efficiently on different types of parallel computing platforms. For instance, experience has shown that the powerful parallel sequences API we will discuss next can be effectively implemented on a single shared memory multicore machine, but may be less efficient on a cluster of many machines. On the other hand, the map-reduce collection library implemented by Google is ideal for computing over large clusters in a data center.
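
One way to get this kind of separation in OCaml is to write client code as a functor over a module signature. The sketch below is only an illustration: the signature SEQ is a deliberately tiny assumption, and both backends (ArraySeq and ListSeq) are sequential stand-ins rather than real parallel implementations. The point is that Client never mentions either backend.

```ocaml
(* A small abstract interface (an assumption made for this sketch). *)
module type SEQ = sig
  type 'a t
  val tabulate : (int -> 'a) -> int -> 'a t
  val length : 'a t -> int
  val nth : 'a t -> int -> 'a
end

(* Backend 1: arrays, with constant-time nth. *)
module ArraySeq : SEQ = struct
  type 'a t = 'a array
  let tabulate f n = Array.init n f
  let length = Array.length
  let nth = Array.get
end

(* Backend 2: lists, simple but with linear-time nth. *)
module ListSeq : SEQ = struct
  type 'a t = 'a list
  let tabulate f n = List.init n f
  let length = List.length
  let nth = List.nth
end

(* Client code written only against SEQ. *)
module Client (S : SEQ) = struct
  (* Move every element one position to the left, wrapping around. *)
  let rotate_left (s : 'a S.t) : 'a S.t =
    let n = S.length s in
    S.tabulate (fun i -> S.nth s ((i + 1) mod n)) n

  let to_list (s : 'a S.t) : 'a list =
    List.init (S.length s) (S.nth s)
end

module A = Client (ArraySeq)
module B = Client (ListSeq)

(* The same algorithm, run on two different backends. *)
let via_arrays = A.to_list (A.rotate_left (ArraySeq.tabulate (fun i -> i * i) 4))
let via_lists = B.to_list (B.rotate_left (ListSeq.tabulate (fun i -> i * i) 4))
```

Both via_arrays and via_lists evaluate to [1; 4; 9; 0]. Swapping the backend changes the cost of the operations but not the result.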

Parallel Sequences

The abstract sequence data type is a canonical example of a parallel collection. Java has such a library and several parallel programming languages such as NESL and data-parallel Haskell have been organized around the idea of programming parallel sequences.

In this section, we will explore the use of a parallel sequence library inspired by NESL and defined by the following interface. Notice that we specify the work and span of each operation in the interface. This is need-to-know information for clients of the sequence library who wish to estimate the work and span of their algorithms.

type 'a seq

(* length s returns the number of elements in the sequence s
 * work = O(1); span = O(1) *)
val length : 'a seq -> int

(* cons x xs
 * if the length of xs is l then cons evaluates to a sequence of length l+1
 * where the first element is x and the remaining l items are exactly the same
 * as the sequence xs 
 * work = O(n); span = O(1) *)
val cons : 'a -> 'a seq -> 'a seq

(* singleton x
 * evaluates to a sequence of length 1 whose only item is x 
 * work = O(1); span = O(1) *)
val singleton : 'a -> 'a seq

(* append s1 s2
 * if s1 is a sequence of length l1 and s2 is a sequence of length l2 then
 * append s1 s2 evaluates to a sequence of length l1+l2 whose first l1 items are
 * the sequence s1 and whose last l2 items are the sequence s2 
 * work = O(n); span = O(1) *)
val append : 'a seq -> 'a seq -> 'a seq

(* tabulate f n
 * evaluates to a sequence s of length n where the ith item is the result of
 * evaluating (f i) 
 * work = O(n); span = O(1) *)
val tabulate : (int -> 'a) -> int -> 'a seq

(* nth s i
 * evaluates to the ith element in s. Sequences are 0-indexed (nth s 0 returns
 * the first element of the sequence) 
 * work = O(1); span = O(1) *)
val nth : 'a seq -> int -> 'a

(* map f s
 * evaluates to a sequence s' such that the length of s' is the same as the
 * length of s and the ith element of s' is the result of applying f to the
 * ith element of s 
 * work = O(n); span = O(1) (if f is a O(1) work and span function) *)
val map : ('a -> 'b) -> 'a seq -> 'b seq

(* reduce c b s
 * combines all of the items in s pairwise with c using b as the base case. c
 * must be associative, with b as its identity 
 * work = O(n); span = O(log n) (if c is a O(1) work and span function) *)
val reduce : ('a -> 'a -> 'a) -> 'a -> 'a seq -> 'a

(* filter p s
 * returns the longest subsequence ss of s such that p evaluates to
 * true for every item in ss 
 * work = O(n); span = O(log n) (if p is a O(1) work and span function) *)
val filter : ('a -> bool) -> 'a seq -> 'a seq

(* flatten ss
 * is equivalent to reduce append (empty ()) ss 
 * work = O(m); span = O(log n) 
 * (if m is the total number of elements in all sequences and 
 *     n is the number of elements in the outer sequence) *)
val flatten : 'a seq seq -> 'a seq

(* zip (s1,s2)
 * evaluates to a sequence whose nth item is the pair of the nth
 * item of s1 and the nth item of s2. 
 * work = O(n); span = O(1) (if n is the minimum of the lengths of s1, s2) *)
val zip : ('a seq * 'b seq) -> ('a * 'b) seq

(* split s i
 * evaluates to a pair of sequences (s1,s2) where s1 has length i and
 * append s1 s2 is the same as s. 
 * work = O(n); span = O(1) *)
val split : 'a seq -> int -> 'a seq * 'a seq
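
The requirement that c be associative with b as its identity is what frees reduce to combine items in a tree shape instead of strictly left to right. The sketch below is a sequential, array-backed stand-in for reduce, not the library's implementation: it follows the tree shape, but the two recursive calls run one after the other. The names sum_of_squares, tree_sub and fold_sub are introduced here for illustration.

```ocaml
(* Tree-shaped reduce over an array. go lo hi reduces the index range
 * [lo, hi). The two recursive calls touch disjoint ranges, so a parallel
 * implementation could run them at the same time. *)
let reduce (c : 'a -> 'a -> 'a) (b : 'a) (s : 'a array) : 'a =
  let rec go lo hi =
    if hi - lo = 0 then b
    else if hi - lo = 1 then s.(lo)
    else
      let mid = lo + (hi - lo) / 2 in
      let left = go lo mid in
      let right = go mid hi in
      c left right
  in
  go 0 (Array.length s)

(* map followed by reduce: + is associative and 0 is its identity. *)
let sum_of_squares (s : int array) : int =
  reduce ( + ) 0 (Array.map (fun x -> x * x) s)

(* Subtraction is not associative, so the tree-shaped answer differs
 * from the left-to-right fold. *)
let tree_sub = reduce ( - ) 0 [| 1; 2; 3; 4 |]          (* (1-2) - (3-4) *)
let fold_sub = Array.fold_left ( - ) 0 [| 1; 2; 3; 4 |] (* (((0-1)-2)-3)-4 *)
```

With an associative operator the grouping is invisible: sum_of_squares [| 1; 2; 3; 4 |] is 30 however the work is divided. With subtraction, tree_sub is 0 while fold_sub is -10.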

To begin, let us examine the tabulate function. It is one of the primary constructors for sequences. For instance, to construct an integer sequence with elements 0..(n-1) with O(n) work and O(1) span, we simply write:

let make (n:int) =
  tabulate (fun i -> i) n

Tabulate is especially powerful in conjunction with nth. For instance, to reverse a sequence with O(n) work and O(1) span:

let reverse (s:'a seq) : 'a seq =
  let n = length s in
  tabulate (fun i -> nth s (n - (i+1))) n

Moreover, map can actually be implemented as a derived operator:

let map (f:'a -> 'b) (s:'a seq) : 'b seq =
  tabulate (fun i -> f (nth s i)) (length s)
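
To experiment with these definitions without a parallel library, one option is to back the sequence type with arrays. This is an assumption made only for this sketch: the stand-ins below are sequential, so they reproduce the results of the interface but not its span bounds. The derived zip at the end is an extra illustration of the same tabulate-plus-nth pattern.

```ocaml
(* Sequential, array-backed stand-ins for three interface operations. *)
type 'a seq = 'a array
let length (s : 'a seq) : int = Array.length s
let nth (s : 'a seq) (i : int) : 'a = s.(i)
let tabulate (f : int -> 'a) (n : int) : 'a seq = Array.init n f

(* The sequence 0, 1, ..., n-1. *)
let make (n : int) : int seq =
  tabulate (fun i -> i) n

(* Item i of the result is item n-1-i of the input. *)
let reverse (s : 'a seq) : 'a seq =
  let n = length s in
  tabulate (fun i -> nth s (n - (i + 1))) n

(* map as a derived operator. *)
let map (f : 'a -> 'b) (s : 'a seq) : 'b seq =
  tabulate (fun i -> f (nth s i)) (length s)

(* zip derived in the same style; the result is as long as the shorter input. *)
let zip ((s1, s2) : 'a seq * 'b seq) : ('a * 'b) seq =
  let n = min (length s1) (length s2) in
  tabulate (fun i -> (nth s1 i, nth s2 i)) n
```

For example, reverse (make 4) evaluates to the sequence 3, 2, 1, 0 and map (fun x -> x * 2) (make 3) to 0, 2, 4.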

Divide and Conquer with Parallel Sequences

Recall that divide-and-conquer algorithms have three key steps:

  • Split your input into two or more independent, smaller subproblems
  • In parallel, recursively solve the smaller subproblems
  • Combine the results of solving the smaller subproblems to produce a solution to the original problem

The sequence data type is a fine data type for implementing many efficient divide-and-conquer parallel algorithms because it supports an efficient (O(1) span) operation to split a sequence into subsequences. To explore divide-and-conquer over parallel sequences, let's take a look at the parenthesis matching problem. The goal is to take a sequence of parentheses, such as (()()), and determine whether the parentheses match. Recall that a sequence of parens does not match if there is ever a point in the sequence when there are more closed (i.e., right) parens than there are open (i.e., left) parens or if the entire sequence doesn't contain exactly the same number of open and closed parens. For example, )( is unmatched for the first reason and (() is unmatched for the second.

The first step in attacking this problem is to define a data type to represent parentheses:

type paren =
    L         (* left paren *)
  | R         (* right paren *)

type parens = paren seq;;

Next, we can observe that it is easy to develop a sequential algorithm that uses on the order of n work and n span (where n is the length of the sequence) using a fold.

(* fold_seq f base s
 * folds f over the items of s from left to right, starting from base *)
let fold_seq (f: 'a -> 'b -> 'b) (base:'b) (s:'a seq) : 'b =

  let len = length s in

  let rec aux n so_far =
    if n >= len then 
      so_far
    else
      aux (n+1) (f (nth s n) so_far)
  in

  aux 0 base

(* check if this next paren causes the paren sequence to be unmatched
 * return None if the cumulative number of right parens outnumbers left parens
 * return Some c where c is the number of unmatched left parens *)
let check (p:paren) (so_far:int option) : int option =
  match (p, so_far) with
    (_, None) -> None
  | (L, Some c) -> Some (c+1)
  | (R, Some 0) -> None
  | (R, Some c) -> Some (c-1)

(* true if s is a sequence of properly matched parentheses
 * false if it isn't 
 * the work and span are poor (linear with respect to the length of
 * the sequence) because the fold is sequential *)
let matcher (s: parens) : bool =
  match fold_seq check (Some 0) s with
    Some 0 -> true
  | _ -> false   (* None, or Some c with c > 0 unmatched left parens *)

To craft an efficient parallel solution to the parenthesis matching problem, the following observations will come in handy:

  • Suppose the sequence s has () as a contiguous subsequence and suppose s' is the sequence we obtain when we remove that subsequence () from s. Then s is a matching sequence of parentheses if and only if s' is a matching sequence of parentheses.
  • Suppose the sequence s has no contiguous subsequences of the form (). Then s must have the form ...)))(((.... In other words s must be a sequence of right parens followed by a sequence of left parens.

Using those ideas, we can construct an efficient divide-and-conquer algorithm for parenthesis matching. The core of the algorithm is a routine that eliminates all of the matching pairs () of the input s and returns information about the sequence ...)))(((... of unmatched parens that remain. Specifically, if ...)))(((... consists of i right parens and j left parens, our algorithm will return (i,j).

This core algorithm will operate by splitting its input sequence in half, recursively computing the number of unmatched parens (i, j) for the left half of the sequence and the number of unmatched parens (k, l) from the right half of the sequence. The observation is that after returning from the 2 recursive calls, we know that our sequence has the following form:

))...i...)) ((...j...(( ))...k...)) ((...l...((

And we can see that some of the j left parens and k right parens will cancel each other out. In fact:

  • if j > k then we wind up with i right parens followed by l + j - k left parens.
  • if j <= k then we wind up with i + k - j right parens followed by l left parens.

Such a computation allows us to implement the "combine" step of our divide-and-conquer algorithm efficiently. The code, using the sequence API, follows. Notice that we define a convenient and reusable helper function for sequences, one that splits a sequence into a "tree view."

type 'a treeview = Empty | One of 'a | Pair of 'a seq * 'a seq;;

let show_tree (s:'a seq) : 'a treeview =
  match length s with
    0 -> Empty
  | 1 -> One (nth s 0)
  | n -> let (s1, s2) = split s (n/2) in Pair (s1, s2)

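let matcher (s:parens) : bool =
  (* aux s returns (i, j): once all matching pairs are cancelled, s consists
   * of i unmatched right parens followed by j unmatched left parens *)
  let rec aux s =
    match show_tree s with
      Empty -> (0, 0)
    | One L -> (0, 1)
    | One R -> (1, 0)
    | Pair (left, right) -> 
       (* par: evaluate the two recursive calls in parallel and bind the
        * pair of results; it is parallel-composition notation, not a
        * function from the sequence API above *)
       let (i, j), (k, l) = par 
         aux left
         aux right
       in
       if j > k then
         (i, l + j - k)
       else
         (i + k - j, l)
  in
  match aux s with
    (0,0) -> true
  | _ -> false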
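
For readers who want to run the algorithm, here is a sequential sketch under stated assumptions: the sequence type is backed by arrays, the two recursive calls run one after the other instead of in parallel, and the helper names combine, unmatched and of_string are introduced here for illustration. It computes the same (i, j) pairs as the divide-and-conquer scheme described above, but it does not achieve the parallel span.

```ocaml
type paren = L | R

(* Sequential, array-backed stand-ins for the sequence operations used. *)
type 'a seq = 'a array
let length (s : 'a seq) : int = Array.length s
let nth (s : 'a seq) (i : int) : 'a = s.(i)
let split (s : 'a seq) (i : int) : 'a seq * 'a seq =
  (Array.sub s 0 i, Array.sub s i (Array.length s - i))

type 'a treeview = Empty | One of 'a | Pair of 'a seq * 'a seq

let show_tree (s : 'a seq) : 'a treeview =
  match length s with
  | 0 -> Empty
  | 1 -> One (nth s 0)
  | n -> let (s1, s2) = split s (n / 2) in Pair (s1, s2)

(* Merge (i, j) from the left half with (k, l) from the right half:
 * the j left parens cancel against the k right parens. *)
let combine ((i, j) : int * int) ((k, l) : int * int) : int * int =
  if j > k then (i, l + j - k) else (i + k - j, l)

(* Number of unmatched right parens and unmatched left parens in s. *)
let rec unmatched (s : paren seq) : int * int =
  match show_tree s with
  | Empty -> (0, 0)
  | One L -> (0, 1)
  | One R -> (1, 0)
  | Pair (left, right) ->
    (* independent calls: this is where a parallel version would fork *)
    let from_left = unmatched left in
    let from_right = unmatched right in
    combine from_left from_right

let matcher (s : paren seq) : bool = unmatched s = (0, 0)

(* Hypothetical helper: '(' becomes L and any other character becomes R. *)
let of_string (str : string) : paren seq =
  Array.init (String.length str) (fun i -> if str.[i] = '(' then L else R)
```

For instance, matcher (of_string "(()())") is true, while unmatched (of_string "))((") is (2, 2), so that sequence does not match.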

It is easy to see that this algorithm has worst-case logarithmic span:

span(aux, n) ~= k + max(span(aux, n/2), span(aux, n/2))
             = k + span(aux, n/2)
             = k*log n
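
The same style of reasoning gives a bound on the work. The derivation below takes the costs stated in the interface at face value: split has O(n) work and the combine step has O(1) work, so a call on n items does at most c·n work locally before making two recursive calls. The symbols W and c are introduced here and do not appear elsewhere in these notes.

```latex
\begin{aligned}
W(n) &\le c\,n + 2\,W(n/2) \\
     &\le c\,n + 2\,\bigl(c\,\tfrac{n}{2} + 2\,W(n/4)\bigr) = 2\,c\,n + 4\,W(n/4) \\
     &\;\;\vdots \\
     &\le c\,n \log_2 n + n\,W(1) \\
     &= O(n \log n)
\end{aligned}
```

Under these assumptions the work is a logarithmic factor more than the sequential fold. If an implementation provided split with O(1) work, the recurrence would become W(n) = 2 W(n/2) + O(1), which solves to O(n).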

Google Map-Reduce

The idea of using functional programming to parallelize analysis of massive data sets really came into vogue after Jeffrey Dean and Sanjay Ghemawat published their influential paper on Google's map-reduce programming platform in 2004. In order to get a sense of the context they were working in and where their ideas came from, it is worthwhile quoting the first couple of paragraphs of their article:

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages... Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

Towards the end of their article, the authors cite some statistics concerning the adoption of map-reduce at Google at the time (late 2004). Between early 2003 and late September 2004, the number of separate map-reduce programs checked in to their main source code repository grew from zero to almost 900. In August 2004, over 29,000 different map-reduce jobs were run on their data centers, using almost 80,000 machine days and reading 3,288 TB (roughly 3.3 petabytes) of input data. Since then, the popularity of the basic map-reduce paradigm has grown greater still and there are many implementations of the basic concept. For instance, Hadoop is an open source implementation developed by Apache and used by many companies (you can download it and use it yourself too). Facebook reported in 2008 that one of their largest Hadoop clusters used 2500 cores and had 1 Petabyte of storage attached. That's a lot of computation and a lot of data!

The key to the success of map-reduce is the simplicity of its programming model. Indeed, the map-reduce programming model is little more than a minor variation on the sequential map-reduce programming paradigm that you learned in the first couple weeks of class. However, Google has managed to hide a complex, distributed, parallel and fault-tolerant implementation behind this simple interface. Because the interface was so simple, many of their analysts -- researchers or programmers not necessarily skilled in parallel programming or distributed systems -- could use it easily. Modularity and clear, high-level abstractions are just as important (perhaps more so) in large-scale distributed computing as in sequential computing.
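
To give a feel for how small the programming model is, here is a single-process toy model written with OCaml's standard List module. It is a sketch only, not Google's or Hadoop's API: the function map_reduce and its parameter names are assumptions made for illustration, and nothing here is distributed, parallel or fault-tolerant. The user supplies a mapper that turns one input into (key, value) pairs and a reducer that combines all of the values sharing a key. Word counting is the example used in Dean and Ghemawat's paper.

```ocaml
(* A toy, sequential model of the map-reduce interface. *)
let map_reduce
    (mapper : 'i -> ('k * 'v) list)
    (reducer : 'k -> 'v list -> 'r)
    (inputs : 'i list) : ('k * 'r) list =
  (* map phase: every input is processed independently of the others *)
  let pairs = List.concat_map mapper inputs in
  (* grouping phase: collect the values for each distinct key *)
  let keys = List.sort_uniq compare (List.map fst pairs) in
  let values_for k =
    List.filter_map (fun (k', v) -> if k' = k then Some v else None) pairs
  in
  (* reduce phase: every key is reduced independently of the others *)
  List.map (fun k -> (k, reducer k (values_for k))) keys

(* Count how often each word occurs across a list of documents. *)
let word_count (docs : string list) : (string * int) list =
  map_reduce
    (fun doc ->
       String.split_on_char ' ' doc
       |> List.filter (fun w -> w <> "")
       |> List.map (fun w -> (w, 1)))
    (fun _word counts -> List.fold_left ( + ) 0 counts)
    docs
```

For example, word_count ["a b a"; "b c"] evaluates to [("a", 2); ("b", 2); ("c", 1)]. Because the mapper and reducer are effect-free, a real implementation is free to run them on different machines and to re-run them after a failure.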

For more information on the map-reduce parallel programming paradigm, see Dean and Ghemawat's paper on Google's map-reduce implementation.

Acknowledgement: Sequence API from materials developed by Bob Harper and Dan Licata. Examples concerning parenthesis matching from Guy Blelloch.