CGD Programming Language

Types

Generic types are declared as domain/range types or datastructure types:

Syntax
type range range-type, ... ; Declaration of user-defined domain type
type partition <range-type> part-type [predef-part-type], ... ; Declaration of distribution type with <= 1 domain / proc
type mpartition <range-type> part-type [predef-part-type], ... ; Declaration of distribution type with any number of domains / proc
type swap <range-type> swap-type [predef-part-type], ... ; Declaration of redistribution matrix type for distributions with <= 1 domain / proc
type mswap <range-type> swap-type [predef-part-type], ... ; Declaration of redistribution matrix type for distributions with any number of domains / proc
type data data-type [range-type-list], ... ; Declaration of user datastructure type
type data <base-type> data-type [range-type-list], ... ; Declaration of user datastructure array type (see the sketch after the Examples)
Examples
type range Ra, Rb; User-defined domain types
type partition <Ra> PartRa [PartNP]; Distribution type based on Ra domains, assigning <= 1 domain to each proc
type mpartition <Ra> MPartRa [PartNP]; Distribution type based on Ra domains, assigning any number of domains to each proc
type swap <Ra> SwapRa [PartNP]; Redistribution type describing how to convert PartRa distributions
type data Da, Db; Non-decomposable datastructure types
type data Dc [PartRa, PartRb]; Datastructure type decomposed by distributions PartRa and PartRb
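
The examples above omit the array form of the data declaration; a minimal sketch (the base type double and the name VecRa are illustrative, not taken from the original examples):

type data <double> VecRa [PartRa]; Datastructure array type with double elements, decomposed by PartRa
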
  • Domain and datastructure types are defined by the user in C++ header files
  • A global datastructure can be decomposed based on several domain representations
  • Decomposition/partition and redistribution/swap types are based on arbitrary user domain types
    • CGD uses C++ templates to define these types
  • Redistribution types are defined as domain matrices
    • if present, s(i,j) specifies a domain to be copied from proc i to proc j
  • For domain and datastructure types it may be necessary to define helper functions (see Library Helpers)
    • predefined for N-dimensional arrays and other basic types
  • Predefined distributions
    • n elements
      • ALLn - each PE holds all n elements
      • PPEn - each PE pi holds element pi
      • ONEn - a single PE holds all n elements
    • 1 element
      • ALL1 - each PE holds a copy
      • ONE1 - a single PE holds a copy

Declarations

Parallel and sequential computations are declared to take input and output arguments:

Syntax
function fname (IN-arg-list -> OUT-arg-list)
  type arg, ... ,
  type arg [partition-arg], ... ,
  ... ;
Computation declaration
Examples
function foo (A -> B)
  Data A, B;
A, B are non-decomposable. Shortcut for A[ALL1], B[ALL1]
function foo (A, p -> B)
  Data A[p], B[p],
  PartRange p[ALLn];
A has distribution p; p has predefined distribution ALLn
function foo (A, p -> B)
  Data A[p], B[p],
  Range p;
A has distribution taken from domain argument p. Shortcut for PartRange p[PPEn] (expanded below)
function foo (A, pin, pout -> A*)
  Data A[pin], A*[pout],
  Range pin, pout;
A is an in-out argument: it is consumed with distribution pin and produced as A* with distribution pout
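
Expanded, the Range p shortcut from the third example reads (a sketch assembled from the documented forms):

function foo (A, p -> B)
  Data A[p], B[p],
  PartRange p[PPEn];
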
  • For each argument
    • type : data type, range type, partition type, swap type
    • decomposition/partition (optional) : another argument or a predefined distribution
  • If argument A has distribution argument pa
    • A's data type must allow the partition type of pa
  • Distribution arguments are decomposed by predefined distributions
    • ALLn - each PE knows the domains of all PEs
    • PPEn - each PE knows only its own domain
  • In-out arguments A, A*
    • both have same type
    • may have identical or different distributions (see the sketch after this list)
    • C++ function declaration has single argument
    • CGD uses different labels for data before and after execution
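
An in-out argument may also keep the same distribution on input and output; a sketch composed from the documented forms (the name update is illustrative):

function update (A, p -> A*)
  Data A[p], A*[p],
  Range p;

As noted above, the corresponding C++ function declaration still has a single argument for A/A*.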

Decomposition Rules

Relations between partitions and swaps describe how datastructures can be converted between partitions:

Syntax
part partition < partition ... ; Order : decomposition inclusion
part partition = partition + partition ... ; Merge : decomposition union
part swap : partition -> partition ; Swap : redistribution
Examples
part pa < pb < pc; Decomposition pa is included in pb, included in pc
part pa = pb + pc; Decomposition pa is obtained by merging domains of pb and pc
part sw : pa -> pb; Decomposition pa is converted to pb according to redistribution sw
  • Order
    • larger partition domains include smaller partition domains
      • large -> small conversion requires nothing to be done
  • Merge
    • the union of domains of multiple partitions produces new partition
      • logical copy, avoided if same C++ datastructure used for all partitions
  • Swap
    • a redistribution operation is needed to convert the datastructure
      • may do messaging, copy, and synchronization depending on implementation
  • Each global datastructure argument is produced by one or more computations
    • producers provide datastructures in given decomposition
    • subsequent computations may need datastructures in other decompositions
    • decomposition rules describe how to do the conversion (see the merge sketch after this list)
  • Datastructures with no decompositions are assigned the ALL1 decomposition
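
For example, a merge rule lets two producers each supply part of a datastructure that a later computation consumes as a whole; a sketch composed from the documented forms (the names foo, bar, baz, X, Y, A, B are illustrative):

part pa = pb + pc;
foo (X[pb] -> A[pb]); Produces the pb portion of A
bar (Y[pc] -> A[pc]); Produces the pc portion of A
baz (A[pa] -> B[pa]); Consumes A in the merged decomposition pa

Following the merge rule, the domains of pb and pc are combined to provide A[pa]; as noted above, the copy is avoided if the same C++ datastructure is used for all partitions.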

Parallel Computations

When the body of a computation is defined in CGD, it is considered a parallel computation. For sequential computations, only the declaration is provided, and the function is written in C++. More comprehensive examples are presented in Applications.

Syntax
function fname (IN-arg-list -> OUT-arg-list)
  argument-declarations
{
  decomposition-rules |
  loop |
  computation ;
  ...
}
Parallel computation definition
   for each argument : type, distribution
   function body : brace block

loop (idx-arg, cond-arg ; IN-arg-list -> OUT-arg-list)
{
  loop | if | 
  computation ;
  ...
}
Loop computation
   idx-arg is the iteration count
   cond-arg is the end condition
   loop body : brace block

if (cond-arg, IN-arg-list -> OUT-arg-list)
{
  loop | if |
  computation ;
  ...
} else {
  ...
}
If computation
   cond-arg is the condition argument
   TRUE body : first brace block
   FALSE body : else block

Input args are optional
fname <task-part> (IN-arg-list -> OUT-arg-list); Computation invocation assigning a number of sequential computations to each proc according to task-part
Examples
loop (idx, cond ; A[p] -> B[p])
{
  foo (A[p], C -> B[p]);
  bar (idx -> cond);
}
Loop computes B[p] each iteration
   value of B is given to A the next iteration
   end condition cond is set using iter count idx
if (cond, A[all] -> B[pb])
{
  foo (A[p1], C[pc] -> B[pb]);
} else {
  bar (A[p2], C[pc] -> B[pb]);
}

If computes B[pb] calling either foo or bar computations
A[all] is provided to both bodies
A[p1] or A[p2] produced within each body

function foo (A, pa, pb -> B)
  Data A[pa], B[pb],
  Range pa, pb;
...

Computation declaration
foo <ALL1> (A, pa, pb -> B); Standard computation invocation
foo <ALL1> (A[pa], pa, pb -> B[pb]); Explicit computation invocation
foo (A[pa] -> B[pb]); Short computation invocation
  • A datastructure label represents a single value
    • no logical overwriting
    • C++ datastructure overwriting can happen
      • updated value gets a new label
  • Each datastructure is produced by a computation and may be used by many
    • decomposition conversion based on decomposition rules
  • Body of parallel computation is a graph
    • nodes :
      • sequential, parallel computations
      • (datastructure, distribution) pairs
    • links between providers and users of a given datastructure
  • Loops 
    • body of loop is subgraph
    • seen as single computation from above
      • dependencies: in args, other data nodes used by body
      • produces: out args
    • each iteration logically assigns output arguments to input arguments
      • copy avoided using renaming
  • If
    • true and false bodies are subgraphs
    • seen as single computation from above
      • dependencies: in args, other data nodes used by true or false bodies
      • produces: out args
    • cond arg evaluated at runtime, single body executed
  • Computation invocation
    • long version : provide same arguments as declaration
    • short version : trailing input arguments can be omitted if they are given explicitly as partitions of other arguments
    • An arbitrary number of sequential computations s is assigned to each proc p (see the sketch after this list)
      • s = # domains assigned by <task-part> partition argument to p
      • default ALL1 gives s = 1 for all p
      • if s > 0, each argument should have a distribution with
        • 1 domain when a single domain / proc is allowed
        • >= s domains when multiple domains / proc are allowed
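
A sketch of a <task-part> invocation that assigns several sequential computations to each proc, composed from the documented forms (the label mp and its MPartRa type are illustrative assumptions):

foo <mp> (A, pa, pb -> B); With mp of type MPartRa, each proc p executes s sequential foo computations, where s = # domains mp assigns to p

Per the rules above, each argument's distribution must then supply 1 domain per proc (single-domain partition types) or >= s domains per proc (multi-domain types).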

CGD Compiler

  • Traverses computation graph to satisfy data dependencies
    • when multiple orders are possible, preserves the original order
    • generates computation order
  • Adds 
    • redistribution, merge, and copy primitives to convert datastructures as needed (see the sketch at the end of this section)
    • allocation primitives for datastructures
    • loops to execute computations called with MPart distributions (multiple domains / PE)
    • other auxiliary code
  • Optimizations
    • Data label to C++ datastructure assignment
      • avoids redistribution and loop related copy
      • memory reuse : multiple compatible non-overlapping data labels assigned to single datastructure
    • Computations
      • removes unnecessary computations
      • moves computations out of the loops when possible
    • Inlines parallel computations to increase scope of above optimizations
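
As an illustration of the dependency traversal, consider this fragment (the names are illustrative; the inserted primitive is a sketch of compiler behavior, not literal generated code):

part sw : pa -> pb;
foo (X[pa] -> A[pa]); Produces A decomposed by pa
bar (A[pb] -> B[pb]); Needs A decomposed by pb

Traversing the graph, the compiler sees that A[pb] is not produced directly, matches the swap rule, and inserts a redistribution primitive (plus any needed allocation primitives) converting A from pa to pb between the two invocations.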