Results
The goal of our experiments is to compare application performance across implementations based on different programming models and libraries. To keep the comparison fair, all implementations rely on the same sequential code. This sequential code is optimized, but it is not necessarily the fastest available implementation (FT relies on NPB 2.3 code rather than FFTW or similar).
Implementations
NPB FT
- OpenMP
  - Omni C version derived from Fortran NPB 2.3
  - runtime: OpenMP
- NPB MPI
  - original MPI Fortran NPB 2.3
  - algo: 1D decomposition for small processor counts, 2D decomposition otherwise
  - runtime: MPI
- CGD
  - derived from the Omni OpenMP C version (itself derived from Fortran NPB 2.3)
  - algo: same as NPB MPI
  - CGD runtime: SHMEM (small msgs) + MPI (large msgs); a dispatch sketch follows this list
- CGD slabs
  - same code base as CGD
  - algo: finer grain decomposition of the 2D domains
  - allows communication overlap
  - better cache behaviour
  - CGD runtime: SHMEM (small msgs) + MPI (large msgs)
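The SHMEM + MPI split in the CGD runtime amounts to choosing a transport by message size. The sketch below is only an illustration of that idea: the threshold, the function name cgd_send, and the requirement that the destination be a symmetric address are assumptions, not the actual CGD runtime code.

#include <mpi.h>
#include <shmem.h>
#include <cstddef>

static const size_t kSmallMsgBytes = 16 * 1024;   // assumed crossover point, not the CGD value

// dst_sym must be a symmetric (remotely accessible) address on dest_pe;
// the MPI branch assumes a matching receive is posted on the other side.
void cgd_send(void* dst_sym, const void* src, size_t bytes, int dest_pe) {
  if (bytes <= kSmallMsgBytes) {
    // small messages: one-sided put, low latency, no receive-side call
    shmem_putmem(dst_sym, src, bytes, dest_pe);
  } else {
    // large messages: two-sided MPI, which can use the bulk-transfer path
    MPI_Send(const_cast<void*>(src), (int)bytes, MPI_BYTE,
             dest_pe, /*tag=*/0, MPI_COMM_WORLD);   // const_cast for older MPI bindings
  }
}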
Barnes-Hut
- Splash2
  - original implementation
  - includes the University of Delaware patches
  - pthreads with barriers
- CGD pthreads
  - sequential code derived from Splash2
  - algo: same as Splash2
  - CGD runtime: pthreads
- CGD SHMEM
  - implementation and algo identical to CGD pthreads
  - CGD runtime: SHMEM
Stencil
- MPI manual
  - handwritten C++ code
  - basic algo: one comm phase per step
  - MPI calls
- CGD
  - CGD-generated C++ code
  - basic algo: one comm phase per step
  - CGD runtime: SHMEM calls (a halo-exchange sketch follows this list)
- CGD merge opt
  - CGD-generated C++ code
  - optimized algo: one comm phase per two steps
  - CGD runtime: SHMEM calls
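As a rough illustration of the "SHMEM calls, one comm phase per step" structure, here is a minimal put-based halo exchange for a 1D row decomposition; the layout, sizes, and names are assumptions for illustration, not the CGD-generated code. The merge-optimized variant would presumably exchange a two-row-deep halo and then advance two local steps before communicating again.

#include <shmem.h>

#define NROWS 1024    // assumed local rows per PE
#define NCOLS 1024    // assumed row length

// Statically allocated arrays are symmetric (remotely accessible) in SHMEM;
// rows 0 and NROWS+1 are ghost rows filled by the neighbours' puts.
static double grid[NROWS + 2][NCOLS];

void exchange_halo(int me, int npes) {
  if (me > 0)                 // push my first interior row into the ghost row
    shmem_double_put(grid[NROWS + 1], grid[1], NCOLS, me - 1);     // of PE me-1
  if (me < npes - 1)          // push my last interior row into the ghost row
    shmem_double_put(grid[0], grid[NROWS], NCOLS, me + 1);         // of PE me+1
  shmem_barrier_all();        // all puts delivered; safe to compute this step
}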
The MPI manual implementation employs the following steps:
forall (to receive) Irecv
forall (to send) marshall data, Isend
forall (posted receives) Waitany, unmarshall data
execute computations
Waitall (sends)
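A hedged C++ rendering of those steps, assuming one contiguous buffer per neighbour; pack_boundary, unpack_boundary, and compute_step are hypothetical names standing in for code not shown here, not the benchmark's own helpers.

#include <mpi.h>
#include <vector>

void pack_boundary(int nbr, std::vector<double>& buf);          // hypothetical: marshall data for neighbour nbr
void unpack_boundary(int nbr, const std::vector<double>& buf);  // hypothetical: unmarshall received data
void compute_step();                                            // hypothetical: one stencil iteration

void exchange_and_step(int nnbr, const int* nbr_rank,
                       std::vector<std::vector<double>>& recv_buf,
                       std::vector<std::vector<double>>& send_buf) {
  std::vector<MPI_Request> rreq(nnbr), sreq(nnbr);

  for (int i = 0; i < nnbr; ++i)                   // forall (to receive) Irecv
    MPI_Irecv(recv_buf[i].data(), (int)recv_buf[i].size(), MPI_DOUBLE,
              nbr_rank[i], 0, MPI_COMM_WORLD, &rreq[i]);

  for (int i = 0; i < nnbr; ++i) {                 // forall (to send) marshall data, Isend
    pack_boundary(i, send_buf[i]);
    MPI_Isend(send_buf[i].data(), (int)send_buf[i].size(), MPI_DOUBLE,
              nbr_rank[i], 0, MPI_COMM_WORLD, &sreq[i]);
  }

  for (int done = 0; done < nnbr; ++done) {        // Waitany, unmarshall as each message arrives
    int idx;
    MPI_Waitany(nnbr, rreq.data(), &idx, MPI_STATUS_IGNORE);
    unpack_boundary(idx, recv_buf[idx]);
  }

  compute_step();                                  // execute computations

  MPI_Waitall(nnbr, sreq.data(), MPI_STATUSES_IGNORE);  // Waitall (sends)
}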
Altix 4700
The SGI Altix 4700 is a 1024-core machine with dual-CPU, dual-core Itanium nodes running at 1.5 GHz; the NUMAlink4 interconnect has 6.4 GB/s links. The machine is in production use, so many applications run concurrently, loading the interconnect and introducing some variability into measured performance. To offset this effect, each benchmark was run multiple times and the shortest runtime was kept.
NPB FT
NPB FT: execution time (total - setup) and speedup
- CGD slabs faster than NPB MPI and CGD
  - many fine grain messages instead of bulk messages
  - allows some overlap on the Altix (see the sketch after this list)
  - better performance expected on machines with asynchronous RDMA
  - better cache performance
- speedup almost linear
- CGD slightly faster than MPI
  - fine-tuned / optimized for the machine:
    - messaging algorithm, message size and order
    - data copies
    - buffer management
- at 128 processors the execution times look similar in the plot
  - Y range is 50 sec, while the 128-processor times are 2-2.4 sec
  - the differences are still meaningful percentage-wise (see the speedup curves)
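A rough sketch of why the slab decomposition can overlap communication with computation, as noted above: each slab is pushed to its destination with a one-sided put as soon as its 1D FFTs finish, while the FFTs of the next slab proceed. The slab size, layout, and helper names are assumptions for illustration, not the FT code.

#include <shmem.h>

#define NSLAB  16            // assumed slabs per process
#define SLABSZ (64 * 1024)   // assumed doubles per slab

static double out_slab[NSLAB][SLABSZ];   // local slabs after the 1D FFTs
static double in_slab[NSLAB][SLABSZ];    // symmetric landing area for the transpose

void fft_slab(double* slab);             // hypothetical: 1D FFTs over one slab
int  slab_dest(int s);                   // hypothetical: destination PE of slab s
int  slab_slot(int s, int me);           // hypothetical: slot index at the destination

void transpose_with_overlap(int me) {
  for (int s = 0; s < NSLAB; ++s) {
    fft_slab(out_slab[s]);                               // compute slab s
    shmem_double_put(in_slab[slab_slot(s, me)],          // push it out; delivery can
                     out_slab[s], SLABSZ, slab_dest(s)); // overlap with slab s+1
  }
  shmem_barrier_all();   // all fine-grain puts delivered before the next FFT phase
}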
Barnes-Hut
Barnes-Hut: execution time (main loop) and speedup
- CGD SHMEM slightly faster than CGD pthreads
  - for the 32K and 256K problem sizes; virtually identical for larger problems
  - the Altix SHMEM library has faster synchronization primitives
- CGD SHMEM faster than Splash2
  - the CGD one-processor runtime is faster on the Altix 4700 / Itanium
    - smaller node and pointer sizes; pointer lookup vs. array indexing
  - for 32K the CGD runtime is faster but its speedup is lower (0.32 vs. 0.37 sec, 95.87 vs. 100.53 speedup)
    - speedups are relative to each implementation's own one-processor runtime (30.89 vs. 37.07 sec)
    - constant costs (latency, top-of-tree overhead) hurt the CGD speedup more since its total runtime is shorter
    - absolute speedups are significantly better for CGD (115.10 vs. 100.53)
  - for 256K and 1M both CGD speedup and execution time are better
  - CGD receives remote nodes as whole chunks and stores them locally until invalidated (a sketch follows this list)
    - cache-line granularity is smaller and incurs higher overhead
- at 128 processors the execution times look similar in the plot
  - Y range is 250 sec, while the 128-processor times are 8-10 sec
  - the differences are still meaningful percentage-wise (see the speedup curves)
- superlinear speedup
  - most obvious for 256K and 1M
  - speedup of about 20 at 16 CPUs, then sublinear
  - due to the larger aggregate cache size
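A minimal sketch of the whole-chunk handling of remote tree nodes mentioned above: a node is fetched once with a single get and reused from a local cache until it is invalidated (for example, when the tree is rebuilt). The node layout, keying, and invalidation policy are assumptions for illustration, not the CGD runtime's actual data structures.

#include <shmem.h>
#include <cstdint>
#include <unordered_map>

struct BodyNode {          // assumed fixed-size tree node chunk
  double pos[3], mass;
  std::int32_t child[8];   // indices into the owner's node array
};

struct RemoteRef { int owner_pe; std::int32_t index; };

class NodeCache {
 public:
  // nodes_sym is the symmetric base address of every PE's node array
  explicit NodeCache(BodyNode* nodes_sym) : nodes_sym_(nodes_sym) {}

  const BodyNode& lookup(RemoteRef ref) {
    std::uint64_t key = (std::uint64_t(ref.owner_pe) << 32) | std::uint32_t(ref.index);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      BodyNode local;
      // fetch the whole node in one get, rather than cache-line sized pieces
      shmem_getmem(&local, nodes_sym_ + ref.index, sizeof(BodyNode), ref.owner_pe);
      it = cache_.emplace(key, local).first;
    }
    return it->second;       // reused locally until invalidated
  }

  void invalidate_all() { cache_.clear(); }   // e.g. after the tree is rebuilt

 private:
  BodyNode* nodes_sym_;
  std::unordered_map<std::uint64_t, BodyNode> cache_;
};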
Stencil
Stencil: communication, execution time (microseconds/iteration) and speedup
Stencil: speedup as a function of problem size
- CGD faster than MPI
  - CGD communication is faster
    - SHMEM latency < MPI latency for small msgs
    - the merge optimization reduces the latency per step
- superlinear speedup
  - for 1024x1024, but not for 512x512
  - speedup of about 26 at 16 CPUs, then sublinear
  - due to the larger aggregate cache size