Stencil

This partial differential equation solver is a forward time centered space differencing scheme solving the equation of heat dissipation on a 2D grid. Every iteration each value of the grid is recomputed, and the ``halo exchange'' retrieves the neighbor points. Communication-computation ratio is good but scalability is limited for small data sizes due to constant costs. 3D solvers exhibit better scalability, however 2D solvers are an interesting case study being the bottleneck for mixed 2D, 3D solvers. The following code is the CGD implementation of the algorithm; sequential computations are implemented as C++ functions (see files).

Types

type range Range2D;
type partition <Range2D> PartRange2D [PartNP];
type mpartition <Range2D> MPartRange2D [PartNP];
type swap <Range2D> SwapRange2D [PartNP];

type data Int, Config, File;
type data Vector2Dreal [PartRange2D, MPartRange2D];

Init Data

const PartRange2D cell[ALLn], halom[ALLn], halo[ALLn], fullone[ALLn];
const MPartRange2D boundary[ALLn];
const SwapRange2D sw_chm[PPEn], sw_ca[PPEn];
const Config cfg[ALL1], File fout[ALL1];

Decomposition Rules

part cell < halom < halo;
part boundary < halo;

// merge rules
part halo = halom + boundary;

// redistribution rules
part sw_ca : cell -> fullone;
part sw_chm : cell -> halom;

Declarations

function ftcs_step (cfg, A, halo, cell -> B)
Vector2Dreal A[halo], B[cell],
Range2D cell, halo, Config cfg;

function matrix_read (cfg, cell, halo -> A, B)
Vector2Dreal A[cell], B[halo],
Config cfg, Range2D cell, halo;

function check_sym (A, ra)
Vector2Dreal A[ra],
Range2D ra;

function econd (cfg, idx -> cond)
Int idx, cond, Config cfg;

Parallel Computations

function foo () [gen]
{
// initialization

matrix_read (cfg, cell, halo -> A[cell], BD[halo]);

// main iteration, 2 steps

loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary] -> A[boundary] );
copy ( BD[boundary] -> B[boundary] );

ftcs_step ( cfg, A[halo] -> B[cell] );
ftcs_step ( cfg, B[halo] -> C[cell] );

econd (cfg, idx -> cond);
}

// print, check result correctness

print <ONE1> (fout, C[fullone]);
check_sym <ONE1> (C[fullone]);
}

CGD Library

The compiler generates the following C++ code corresponding to the main loop section:

  for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);

swapBegin (sw_chm, A_halo, A_halo, 0, pe);

econd (cfg, idx, cond);

swapEnd (sw_chm, A_halo, A_halo, 0, pe);

ftcs_step (cfg, A_halo, halo[pe], cell[pe], B_halo);

swapBegin (sw_chm, B_halo, B_halo, 0, pe);

swapEnd (sw_chm, B_halo, B_halo, 0, pe);

ftcs_step (cfg, B_halo, halo[pe], cell[pe], A_halo);

CLK_MAIN_END(idx);
}

Optimization : Merged Communication Step

A possible optimization that reduces communication overhead is merging two communication steps into one. For small messages synchronization and latency account for a large fraction of communication; by sending a single larger message latency cost is halved. For the PDE solver this is achieved by computing the grid values for the larger "halom" domain, then for the smaller "cell" domain.

Decomposition Rules

part cell < halom < halo < halo2; 
part halom2 < halo2;
part boundary < boundary2 < halo2;
part boundary < halo;

// merge rules
part halo = boundary + halom;
part halo2 = boundary2 + halom2;

// redistribution rules
part sw_chm : cell -> halom2;
part sw_ca : cell -> fullone;

Parallel Computations

  // main iteration, 2 steps

loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary2] -> A[boundary2] );
copy ( BD[boundary] -> B[boundary] );

ftcs_step ( cfg, A[halo2] -> B[halom] );
ftcs_step ( cfg, B[halo] -> C[cell] );

econd (cfg, idx -> cond);
}

CGD Library

The generated C++ code shows two communication steps are replaced by one.

  for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);

swapBegin (sw_chm, A_halo2, A_halo2, 0, pe);

econd (cfg, idx, cond);

swapEnd (sw_chm, A_halo2, A_halo2, 0, pe);

ftcs_step (cfg, A_halo2, halo2[pe], halom[pe], B_halo);

ftcs_step (cfg, B_halo, halo[pe], cell[pe], A_halo2);

CLK_MAIN_END(idx);
}

Optimization : Communication Computation Overlap

Another possible optimization that reduces communication cost is overlapping computation with communication. The values in the middle of the "cell" domain can be updated without using the "halo" points data; thus "ftcs_step" computation for "cell2" doesn't depend on "halo" data and can be overlapped with the "cell -> halo" redistribution.

Decomposition rules

part cell2 < cell < halom < halo;
part boundary < halo;
part border < cell;

// merge rules
part halo = boundary + halom;
part cell = cell2 + border;

// redistribution rules
part sw_ca : cell -> fullone;
part sw_chm : cell -> halom;

Computations

  // main iteration, 2 steps

loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary] -> A[boundary] );
copy ( BD[boundary] -> B[boundary] );

ftcs_step ( cfg, A[cell] -> B[cell2] );
ftcs_step ( cfg, A[halo] -> B[border] );

ftcs_step ( cfg, B[cell] -> C[cell2] );
ftcs_step ( cfg, B[halo] -> C[border] );

econd (cfg, idx -> cond);
}

CGD Library

The generated C++ code shows the overlap. Note the computation on the inner "border" is defined on a multi-partition i.e. a decomposition that assigns more subdomains to the same processor. This is translated to a loop iterating over these domains.

  for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);

swapBegin (sw_chm, A_halo, A_halo, 0, pe);

ftcs_step (cfg, A_halo, cell[pe], cell2[pe], B_halo);

econd (cfg, idx, cond);

swapEnd (sw_chm, A_halo, A_halo, 0, pe);

for (int _mi=0; _mi<border.getNo(pe); _mi++)
{
ftcs_step (cfg, A_halo, halo[pe], border.idx(pe,_mi), B_halo);
}

swapBegin (sw_chm, B_halo, B_halo, 0, pe);

ftcs_step (cfg, B_halo, cell[pe], cell2[pe], A_halo);

swapEnd (sw_chm, B_halo, B_halo, 0, pe);

for (int _mi=0; _mi<border.getNo(pe); _mi++)
{
ftcs_step (cfg, B_halo, halo[pe], border.idx(pe,_mi), A_halo);
}

CLK_MAIN_END(idx);
}

Files


ProgrammerCompiler generated
CGD sourceType DeclarationsSequential ComputationsComputation DeclarationsParallel Functions
PDE solverpde.pdpde.hpde.ccpde_decl.hpde_auto.cc
Merged Steppdemr.pdpdemr.hpdemr.ccpdemr_decl.hpdemr_auto.cc
Comm Overlappdeov.pdpdeov.hpdeov.ccpdeov_decl.hpdeov_auto.cc



pde-common.cc