Programming Model

The Coarse Grain Dataflow programming model (CGD) aims to improve the productivity of large-scale application development and optimization while maintaining performance comparable to message passing models. CGD is a hybrid SPMD/dataflow model in which parallel algorithms are expressed as dependencies between data distributions and SPMD computations.

We will use a simple matrix multiplication algorithm as a running example for introducing the key CGD programming concepts. The implementation considered here uses a block row and block column decomposition: each processor holds a block of rows and a block of columns of the two input matrices, and it computes the section of the output matrix assigned to it by a 2D decomposition. The following chart presents the dataflow needed to compute D = (A+B)*C:
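For reference, the computation itself can be written as plain sequential C++. The helper below is hypothetical and independent of the CGD types introduced later; it only makes explicit what the dataflow graph has to produce.

#include <vector>

// Sequential reference for D = (A + B) * C on n x n row-major matrices.
// Hypothetical helper, not part of CGD; shown only to fix the semantics.
void matmul_add_ref(const double* A, const double* B, const double* C,
                    double* D, int n)
{
    std::vector<double> X(n * n);                  // X = A + B
    for (int i = 0; i < n * n; ++i)
        X[i] = A[i] + B[i];

    for (int i = 0; i < n; ++i)                    // D = X * C
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += X[i * n + k] * C[k * n + j];
            D[i * n + j] = acc;
        }
}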

The following abstractions are needed to define the CGD model:

  • domain or range
    • A subset of a datastructure domain; e.g. the (0..7, 0..7) range of a 2D array.
  • distribution, decomposition or partition
    • A list of domains assigned to processors; e.g. the sqr distribution assigns (0..3, 0..3) to proc 0, (0..3, 4..7) to proc 1, and so on.
  • global datastructure
    • A distributed datastructure allowing multiple distribution views; e.g. A[sqr] gives proc. 0 access to A(0..3, 0..3), while A[row] gives proc. 0 access to A(0..3, 0..7) and A[col] gives it access to A(0..7, 0..3).
  • redistribution or swap
    • Defines how a datastructure is converted from one distribution to another; a simple implementation is a matrix of the domains to be exchanged between each pair of processors.
  • computation or kernel
    • A sequential C++ function taking datastructure domains as input and output arguments; e.g. SUM adds A(0..3, 0..3) to B(0..3, 0..3) and writes the result into X(0..3, 0..3).

Datastructures, domains, and distributions are not limited to array types: custom datastructures and domains can be defined by the user by providing a set of operators for the new types (CGD library).
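To make these abstractions concrete, the sketch below shows one way domains, distributions, and a kernel could look in C++. The names (Domain2D, Distribution, sum_kernel) and layouts are illustrative assumptions, not the actual CGD library API.

#include <vector>

// A rectangular range of a 2D array, e.g. (0..3, 0..3); bounds are inclusive.
struct Domain2D {
    int r0, r1;   // row range
    int c0, c1;   // column range
};

// A distribution is a list of domains, one per processor:
// dist[pe] is the local domain owned by processor pe.
using Distribution = std::vector<Domain2D>;

// A sequential kernel: add the A and B blocks over domain d and write the
// result into X (all three stored here as full n x n row-major arrays).
void sum_kernel(const double* A, const double* B, double* X,
                const Domain2D& d, int n)
{
    for (int i = d.r0; i <= d.r1; ++i)
        for (int j = d.c0; j <= d.c1; ++j)
            X[i * n + j] = A[i * n + j] + B[i * n + j];
}

A global datastructure then ties one storage object to several such distributions, and a redistribution records which domains each pair of processors must exchange to move from one distribution to another.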

Coarse Grain Dataflow Language

CGD parallel programs specify the dataflow graph using a C-like language and implement sequential computation kernels as C++ functions (CGD language). A typical CGD program includes the following code sections:

  • Type Declaration
  • Kernel Declaration
  • Distribution Rules
  • Dataflow Code

Distribution Rules

sqr < row; 
sqr < col;

sqr2row : sqr -> row;
sqr2col : sqr -> col;

These rules are used by the compiler to automatically transform datastructure distributions.

  • sqr < row means the sqr domains are a subset of the row domains, so nothing needs to be done to convert row to sqr
  • sqr2row : sqr -> row means the redistribution sqr2row is needed to convert sqr to row. When using this rule the compiler inserts the redistribution primitives supplied by the runtime; these are implemented via messaging or direct memory copy depending on the target architecture. A sketch of how the rules could be applied is given below.
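As an illustration only (invented names, not the CGD compiler internals), the rule table can be consulted as follows whenever a computation needs a datastructure in a distribution other than the one it currently has:

#include <map>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

using DistName = std::string;

// subset_rules holds pairs (a, b) meaning "a < b": the a domains are a subset
// of the b domains, so no data movement is needed to view b data as a.
// redist_rules maps (from, to) to the name of the redistribution to insert.
std::optional<std::string>
required_redistribution(const DistName& have, const DistName& need,
                        const std::set<std::pair<DistName, DistName>>& subset_rules,
                        const std::map<std::pair<DistName, DistName>, std::string>& redist_rules)
{
    if (have == need) return std::nullopt;                        // nothing to do
    if (subset_rules.count({need, have})) return std::nullopt;    // e.g. need sqr, have row
    auto it = redist_rules.find({have, need});                    // e.g. sqr -> row
    if (it != redist_rules.end()) return it->second;              // insert sqr2row
    throw std::runtime_error("no rule converts " + have + " to " + need);
}

For the matrix multiplication example, asking for X in row when it was produced in sqr yields sqr2row, and the compiler emits the corresponding swap calls shown later.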

Dataflow Code

sum (A[sqr], B[sqr] -> X[sqr]); 

prod (X[row], C[col] -> D[sqr]);

The dataflow code contains a list of data parallel computations taking (datastructure, distribution) pairs as input and output arguments. Like other dataflow languages, CGD imposes the single assignment rule, provides a special loop syntax, and schedules computations in topological order (CGD language).
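Because of single assignment, the dependence graph can be read directly from the argument lists. The small scheduler below is a hypothetical illustration of the topological ordering, not the CGD compiler's actual algorithm.

#include <algorithm>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// A dataflow statement: a kernel name plus the datastructures it reads and writes.
struct Statement {
    std::string kernel;
    std::vector<std::string> inputs, outputs;
};

// Order the statements so that each one runs only after all of its inputs
// have been produced; 'available' initially holds the program inputs.
std::vector<Statement> schedule(std::vector<Statement> stmts,
                                std::set<std::string> available)
{
    std::vector<Statement> ordered;
    while (!stmts.empty()) {
        auto ready = std::find_if(stmts.begin(), stmts.end(),
            [&](const Statement& s) {
                for (const auto& in : s.inputs)
                    if (!available.count(in)) return false;
                return true;
            });
        if (ready == stmts.end())
            throw std::runtime_error("cyclic or undefined dependence");
        for (const auto& out : ready->outputs) available.insert(out);
        ordered.push_back(*ready);
        stmts.erase(ready);
    }
    return ordered;
}

In the example, prod reads X, which sum produces, so sum is always scheduled first; the distribution changes between producer and consumer are handled separately by the inserted redistributions.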

Distributed Datastructure Library

The C++ code generated by the CGD compiler relies on a template distributed datastructure library that implements allocation, redistribution, and synchronization primitives (CGD Library). The template library supports any datastructure and domain type for which a set of operators can be defined; these are predefined for a few commonly used data types: arrays, lists, standard distributions, etc. It is possible to write C++ code directly using this library.
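The exact operator set is defined by the library; the trait below is only a rough sketch of the kind of operations a custom datastructure/domain pair would plausibly have to provide so that allocation and redistribution can be expressed generically. None of these names come from the CGD headers.

#include <cstddef>

// Illustrative only: operators a user-defined datastructure/domain type could
// supply so that a generic library can allocate it and exchange subdomains.
template <typename Data, typename Dom>
struct DatastructureTraits {
    // allocate storage large enough to hold the given local domain
    static Data allocate(const Dom& local_domain);

    // number of elements in a domain (used to size communication buffers)
    static std::size_t size(const Dom& d);

    // intersection of two domains (which elements two processors must exchange)
    static Dom intersect(const Dom& a, const Dom& b);

    // copy the elements of domain d out of / into a flat communication buffer
    static void pack(const Data& src, const Dom& d, double* buffer);
    static void unpack(Data& dst, const Dom& d, const double* buffer);
};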

For the matrix multiplication example, the CGD compiler generates the following code sections:

Datastructure Allocation

DataRa (Vector2Dreal, A_sqr, sqr[pe]);
DataRa (Vector2Dreal, B_sqr, sqr[pe]);
DataRa (Vector2Dreal, C_col, col[pe]);
DataRa (Vector2Dreal, X_row, row[pe]);
DataRa (Vector2Dreal, D_sqr, sqr[pe]);

Here datastructure A_sqr is declared and allocated for distribution sqr. The VAR_part naming convention is used for readability: VAR_part is allocated using distribution part. A distribution can be indexed by processor id, so sqr[pe] is the local domain owned by processor pe under the sqr distribution.
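Using the illustrative Domain2D/Distribution types sketched earlier, the three distributions of the example could look as follows for an 8 x 8 matrix on four processors. This is an assumed layout consistent with the text, shown only to make the sqr[pe] indexing and the sqr < row, sqr < col rules tangible.

// Hypothetical construction of the example distributions (inclusive bounds,
// matching the (0..3, 0..3) notation used above).
Distribution sqr = { {0,3,0,3}, {0,3,4,7}, {4,7,0,3}, {4,7,4,7} };  // quadrants
Distribution row = { {0,3,0,7}, {0,3,0,7}, {4,7,0,7}, {4,7,0,7} };  // row bands
Distribution col = { {0,7,0,3}, {0,7,4,7}, {0,7,0,3}, {0,7,4,7} };  // column bands

int pe = 0;                  // this processor's id
Domain2D local = sqr[pe];    // (0..3, 0..3): the domain A_sqr is allocated for

Note that each sqr[pe] is contained in both row[pe] and col[pe], which is exactly what the sqr < row and sqr < col rules state.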

Datastructure Redistribution and Computations

// start redistribution C[sqr] -> C[col] 
swapBegin (sqr2col, C_col, C_col, 0, pe);

// computation : A[sqr] + B[sqr] -> X [sqr]
sum (A_sqr, B_sqr, sqr[pe], X_row);

// start redistribution X[sqr] -> X[row]
swapBegin (sqr2row, X_row, X_row, 1, pe);

// end above redistributions
swapEnd (sqr2col, C_col, C_col, 0, pe);
swapEnd (sqr2row, X_row, X_row, 1, pe);

// computation : X[row] * C[col] -> D[sqr]
prod (X_row, C_col, row[pe], col[pe], sqr[pe], D_sqr);

This code shows how the matrix addition and multiplication are translated into C++. The compiler adds two redistribution operations to produce the X_row and C_col datastructures needed by the prod computation. Note that the C redistribution overlaps the sum computation, and the two redistributions also overlap each other. Swaps can be nested as long as they are assigned different IDs (the 4th argument).
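One plausible way a runtime could back this split swapBegin/swapEnd interface is with non-blocking messaging. The MPI-based sketch below is an assumption for illustration, not the actual CGD runtime, and handles a single communication partner (a real redistribution would loop over its matrix of domains); the swap id serves as the message tag, which is why concurrently active swaps must use different ids.

#include <mpi.h>
#include <vector>

// Pending non-blocking operations belonging to one swap.
struct PendingSwap { std::vector<MPI_Request> requests; };

// Post the sends/receives for one partner; returning immediately lets the
// caller run computations (e.g. sum) while the data is in flight.
void swap_begin_sketch(PendingSwap& swap, int id,
                       const double* send_buf, int send_count, int dest,
                       double* recv_buf, int recv_count, int src)
{
    MPI_Request r;
    MPI_Irecv(recv_buf, recv_count, MPI_DOUBLE, src, id, MPI_COMM_WORLD, &r);
    swap.requests.push_back(r);
    MPI_Isend(send_buf, send_count, MPI_DOUBLE, dest, id, MPI_COMM_WORLD, &r);
    swap.requests.push_back(r);
}

// Wait for every operation posted by the matching swap_begin_sketch call.
void swap_end_sketch(PendingSwap& swap)
{
    MPI_Waitall(static_cast<int>(swap.requests.size()),
                swap.requests.data(), MPI_STATUSES_IGNORE);
    swap.requests.clear();
}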

The CGD model is presented in more depth in the language, library, and applications sections.