Stencil
This partial differential equation solver is a forward time centered space differencing scheme solving the equation of heat dissipation on a 2D grid. Every iteration each value of the grid is recomputed, and the ``halo exchange'' retrieves the neighbor points. Communication-computation ratio is good but scalability is limited for small data sizes due to constant costs. 3D solvers exhibit better scalability, however 2D solvers are an interesting case study being the bottleneck for mixed 2D, 3D solvers. The following code is the CGD implementation of the algorithm; sequential computations are implemented as C++ functions (see files).
Types
type range Range2D;
type partition <Range2D> PartRange2D [PartNP];
type mpartition <Range2D> MPartRange2D [PartNP];
type swap <Range2D> SwapRange2D [PartNP];
type data Int, Config, File;
type data Vector2Dreal [PartRange2D, MPartRange2D];
Init Data
const PartRange2D cell[ALLn], halom[ALLn], halo[ALLn], fullone[ALLn];
const MPartRange2D boundary[ALLn];
const SwapRange2D sw_chm[PPEn], sw_ca[PPEn];
const Config cfg[ALL1], File fout[ALL1];
Decomposition Rules
part cell < halom < halo;
part boundary < halo;
// merge rules
part halo = halom + boundary;
// redistribution rules
part sw_ca : cell -> fullone;
part sw_chm : cell -> halom;
Declarations
function ftcs_step (cfg, A, halo, cell -> B)
Vector2Dreal A[halo], B[cell],
Range2D cell, halo, Config cfg;
function matrix_read (cfg, cell, halo -> A, B)
Vector2Dreal A[cell], B[halo],
Config cfg, Range2D cell, halo;
function check_sym (A, ra)
Vector2Dreal A[ra],
Range2D ra;
function econd (cfg, idx -> cond)
Int idx, cond, Config cfg;
Parallel Computations
function foo () [gen]
{
// initialization
matrix_read (cfg, cell, halo -> A[cell], BD[halo]);
// main iteration, 2 steps
loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary] -> A[boundary] );
copy ( BD[boundary] -> B[boundary] );
ftcs_step ( cfg, A[halo] -> B[cell] );
ftcs_step ( cfg, B[halo] -> C[cell] );
econd (cfg, idx -> cond);
}
// print, check result correctness
print <ONE1> (fout, C[fullone]);
check_sym <ONE1> (C[fullone]);
}
CGD Library
The compiler generates the following C++ code corresponding to the main loop section:
for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);
swapBegin (sw_chm, A_halo, A_halo, 0, pe);
econd (cfg, idx, cond);
swapEnd (sw_chm, A_halo, A_halo, 0, pe);
ftcs_step (cfg, A_halo, halo[pe], cell[pe], B_halo);
swapBegin (sw_chm, B_halo, B_halo, 0, pe);
swapEnd (sw_chm, B_halo, B_halo, 0, pe);
ftcs_step (cfg, B_halo, halo[pe], cell[pe], A_halo);
CLK_MAIN_END(idx);
}
Optimization : Merged Communication Step
A possible optimization that reduces communication overhead is merging two communication steps into one. For small messages synchronization and latency account for a large fraction of communication; by sending a single larger message latency cost is halved. For the PDE solver this is achieved by computing the grid values for the larger "halom" domain, then for the smaller "cell" domain.
Decomposition Rules
part cell < halom < halo < halo2;
part halom2 < halo2;
part boundary < boundary2 < halo2;
part boundary < halo;
// merge rules
part halo = boundary + halom;
part halo2 = boundary2 + halom2;
// redistribution rules
part sw_chm : cell -> halom2;
part sw_ca : cell -> fullone;
Parallel Computations
// main iteration, 2 steps
loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary2] -> A[boundary2] );
copy ( BD[boundary] -> B[boundary] );
ftcs_step ( cfg, A[halo2] -> B[halom] );
ftcs_step ( cfg, B[halo] -> C[cell] );
econd (cfg, idx -> cond);
}
CGD Library
The generated C++ code shows two communication steps are replaced by one.
for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);
swapBegin (sw_chm, A_halo2, A_halo2, 0, pe);
econd (cfg, idx, cond);
swapEnd (sw_chm, A_halo2, A_halo2, 0, pe);
ftcs_step (cfg, A_halo2, halo2[pe], halom[pe], B_halo);
ftcs_step (cfg, B_halo, halo[pe], cell[pe], A_halo2);
CLK_MAIN_END(idx);
}
Optimization : Communication Computation Overlap
Another possible optimization that reduces communication cost is overlapping computation with communication. The values in the middle of the "cell" domain can be updated without using the "halo" points data; thus "ftcs_step" computation for "cell2" doesn't depend on "halo" data and can be overlapped with the "cell -> halo" redistribution.
Decomposition rules
part cell2 < cell < halom < halo;
part boundary < halo;
part border < cell;
// merge rules
part halo = boundary + halom;
part cell = cell2 + border;
// redistribution rules
part sw_ca : cell -> fullone;
part sw_chm : cell -> halom;
Computations
// main iteration, 2 steps
loop ( idx, cond ; A [cell] -> C [cell] ) [clock:MAIN]
{
copy ( BD[boundary] -> A[boundary] );
copy ( BD[boundary] -> B[boundary] );
ftcs_step ( cfg, A[cell] -> B[cell2] );
ftcs_step ( cfg, A[halo] -> B[border] );
ftcs_step ( cfg, B[cell] -> C[cell2] );
ftcs_step ( cfg, B[halo] -> C[border] );
econd (cfg, idx -> cond);
}
CGD Library
The generated C++ code shows the overlap. Note the computation on the inner "border" is defined on a multi-partition i.e. a decomposition that assigns more subdomains to the same processor. This is translated to a loop iterating over these domains.
for (idx=0, cond=1; cond; idx++)
{
CLK_MAIN_BEGIN(idx);
swapBegin (sw_chm, A_halo, A_halo, 0, pe);
ftcs_step (cfg, A_halo, cell[pe], cell2[pe], B_halo);
econd (cfg, idx, cond);
swapEnd (sw_chm, A_halo, A_halo, 0, pe);
for (int _mi=0; _mi<border.getNo(pe); _mi++)
{
ftcs_step (cfg, A_halo, halo[pe], border.idx(pe,_mi), B_halo);
}
swapBegin (sw_chm, B_halo, B_halo, 0, pe);
ftcs_step (cfg, B_halo, cell[pe], cell2[pe], A_halo);
swapEnd (sw_chm, B_halo, B_halo, 0, pe);
for (int _mi=0; _mi<border.getNo(pe); _mi++)
{
ftcs_step (cfg, B_halo, halo[pe], border.idx(pe,_mi), A_halo);
}
CLK_MAIN_END(idx);
}
Files
| Programmer | Compiler generated | ||||
| CGD source | Type Declarations | Sequential Computations | Computation Declarations | Parallel Functions | |
| PDE solver | pde.pd | pde.h | pde.cc | pde_decl.h | pde_auto.cc |
| Merged Step | pdemr.pd | pdemr.h | pdemr.cc | pdemr_decl.h | pdemr_auto.cc |
| Comm Overlap | pdeov.pd | pdeov.h | pdeov.cc | pdeov_decl.h | pdeov_auto.cc |
| pde-common.cc | |||||
