Results

The goal of our experiments is to compare application performance across implementations based on different programming models and libraries. To keep the comparison fair, we rely on the same sequential code for all implementations. This sequential code is optimized, but it is not necessarily the fastest available implementation (FT relies on NPB 2.3 code rather than FFTW or a similar library).

Implementations

NPB FT

  • OpenMP
    • Omni C version derived from Fortran NPB 2.3
    • OpenMP
  • NPB MPI
    • original MPI Fortran NPB 2.3
    • algo: 1D decomposition for small processor counts, 2D decomposition otherwise
    • MPI
  • CGD
    • derived from Omni OpenMP C version
      • itself derived from the Fortran NPB 2.3 version
    • algo: same as NPB MPI
    • CGD runtime: SHMEM (small msgs) + MPI (large msgs)
  • CGD slabs
    • same code base as CGD
    • algo: finer grain decomposition of 2D domains
      • allows communication overlap
      • better cache behaviour
    • CGD runtime: SHMEM (small msgs) + MPI (large msgs); sketched below
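
The split inside the CGD runtime between SHMEM for small messages and MPI for large ones can be pictured as a size-based transport switch. The C sketch below is only illustrative: cgd_send() and CGD_SMALL_MSG_LIMIT are hypothetical names rather than CGD API, and the real runtime also deals with synchronization and buffer reuse.

/* Hypothetical sketch: one-sided SHMEM puts for small messages (low latency),
 * two-sided MPI sends for large ones (bulk bandwidth).
 * remote_buf must be a symmetric (SHMEM-allocated) address. */
#include <stddef.h>
#include <mpi.h>
#include <shmem.h>

#define CGD_SMALL_MSG_LIMIT 4096            /* illustrative cutoff, in bytes */

void cgd_send(void *remote_buf, const void *local_buf, size_t nbytes,
              int dest_pe, int tag)
{
    if (nbytes <= CGD_SMALL_MSG_LIMIT) {
        shmem_putmem(remote_buf, local_buf, nbytes, dest_pe);
        shmem_quiet();                      /* wait for remote delivery */
    } else {
        MPI_Send((void *)local_buf, (int)nbytes, MPI_BYTE,
                 dest_pe, tag, MPI_COMM_WORLD);
    }
}

In such a scheme the cutoff would be tuned to the point where MPI's higher per-message overhead is amortized by bulk bandwidth.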

Barnes-Hut

  • Splash2
    • Original implementation
      • includes the University of Delaware patches
    • pthreads with barriers (see the barrier sketch after this list)
  • CGD pthreads
    • sequential code derived from Splash2
      • CGD distributed trees
    • algo: same as Splash2
    • CGD runtime: pthreads
  • CGD SHMEM
    • implementation, algo identical to CGD pthreads
    • CGD runtime: SHMEM
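
The pthreads variants synchronize the phases of each time step (tree build, force computation, position update) with barriers: no thread starts a phase before every thread has finished the previous one. A minimal sketch of that pattern follows; the phase functions and thread setup are placeholders, not Splash2 code.

#include <pthread.h>

static pthread_barrier_t phase_barrier;     /* initialized in main() for nthreads */

void build_tree_phase(long tid);            /* placeholder phase bodies */
void compute_forces_phase(long tid);

void *worker(void *arg)
{
    long tid = (long)arg;
    build_tree_phase(tid);
    pthread_barrier_wait(&phase_barrier);   /* every thread sees the complete tree */
    compute_forces_phase(tid);
    pthread_barrier_wait(&phase_barrier);   /* end of the time step */
    return NULL;
}

main() would call pthread_barrier_init(&phase_barrier, NULL, nthreads) before creating one worker per thread with pthread_create.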

Stencil

  1. MPI manual
    1. handwritten C++ code
    2. basic algo: one comm phase per step
    3. MPI calls
  2. CGD
    1. CGD-generated C++ code
    2. basic algo: one comm phase per step
    3. CGD runtime: SHMEM calls
  3. CGD merge opt
    1. CGD-generated C++ code
    2. optimized algo: one comm phase per two steps
    3. CGD runtime: SHMEM calls

The MPI manual implementation employs the following steps:

forall (to receive): Irecv
forall (to send): marshal data, Isend
Waitany (to receive): unmarshal data
execute computations
Waitall (sends)
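
In C with MPI this pattern looks roughly as follows. The neighbor arrays and the marshal() / unmarshal() / compute() helpers are placeholders for the benchmark's actual packing and stencil code, not its real interfaces.

#include <stdlib.h>
#include <mpi.h>

void marshal(int nbr, double *buf);      /* pack boundary data for a neighbor   */
void unmarshal(int nbr, double *buf);    /* copy received data into ghost cells */
void compute(void);                      /* one stencil step on local data      */

void step(int nnbr, const int *nbr, const int *count,
          double **recv_buf, double **send_buf, MPI_Comm comm)
{
    MPI_Request *rreq = malloc(nnbr * sizeof *rreq);
    MPI_Request *sreq = malloc(nnbr * sizeof *sreq);

    /* forall (to receive): post the receives first so incoming data can land */
    for (int i = 0; i < nnbr; i++)
        MPI_Irecv(recv_buf[i], count[i], MPI_DOUBLE, nbr[i], 0, comm, &rreq[i]);

    /* forall (to send): marshal data, then Isend */
    for (int i = 0; i < nnbr; i++) {
        marshal(nbr[i], send_buf[i]);
        MPI_Isend(send_buf[i], count[i], MPI_DOUBLE, nbr[i], 0, comm, &sreq[i]);
    }

    /* Waitany (to receive): unmarshal each message as it arrives */
    for (int done = 0; done < nnbr; done++) {
        int idx;
        MPI_Waitany(nnbr, rreq, &idx, MPI_STATUS_IGNORE);
        unmarshal(nbr[idx], recv_buf[idx]);
    }

    compute();                                      /* execute computations */

    MPI_Waitall(nnbr, sreq, MPI_STATUSES_IGNORE);   /* Waitall (sends) */
    free(rreq);
    free(sreq);
}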

Altix 4700

The SGI Altix 4700 is a 1024-core machine with two dual-core Itanium processors per node running at 1.5 GHz; the NUMAlink 4 interconnect has 6.4 GB/s links. The machine is used for production, so many applications run at the same time, loading the interconnect and introducing some variability in measured performance. To offset this effect, each benchmark was run multiple times and the shortest runtime was used.

NPB FT

NPB FT: execution time (total - setup) and speedup
  • CGD slabs faster than NPB MPI and CGD
    • many fine-grain messages vs. bulk messages
      • allow some overlap on Altix (see the overlap sketch after this list)
        • better performance expected on asynchronous RDMA machines
      • better cache performance
    • speedup almost linear
  • CGD slightly faster than MPI
    • machine fine-tuned / optimized
      • messaging algorithm, size, and order
      • data copy
      • buffer management
  • 128 proc
    • execution times look similar
      • Y-axis range: 50 sec; 128-proc times: 2-2.4 sec
    • differences are meaningful percentage-wise (see speedup)
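
One way to picture the fine-grain slab exchange is sketched below in plain MPI: each slab is sent as soon as its local 1D FFTs finish, so the transfer of one slab can overlap with the FFTs of the next. The fft_slab() helper, the one-slab-per-peer layout, and the src/dest arrays are assumptions for illustration; the CGD runtime itself would use SHMEM for small slabs.

#include <stdlib.h>
#include <mpi.h>

void fft_slab(double *slab, int slab_elems);        /* local 1D FFTs on one slab */

void transpose_overlapped(double *out, double *in, int nslabs, int slab_elems,
                          const int *src, const int *dest, MPI_Comm comm)
{
    MPI_Request *rreq = malloc(nslabs * sizeof *rreq);
    MPI_Request *sreq = malloc(nslabs * sizeof *sreq);

    for (int i = 0; i < nslabs; i++)                /* pre-post all receives */
        MPI_Irecv(&out[(size_t)i * slab_elems], slab_elems, MPI_DOUBLE,
                  src[i], 0, comm, &rreq[i]);

    for (int i = 0; i < nslabs; i++) {
        fft_slab(&in[(size_t)i * slab_elems], slab_elems);
        MPI_Isend(&in[(size_t)i * slab_elems], slab_elems, MPI_DOUBLE,
                  dest[i], 0, comm, &sreq[i]);      /* transfer overlaps the next slab's FFTs */
    }

    MPI_Waitall(nslabs, rreq, MPI_STATUSES_IGNORE);
    MPI_Waitall(nslabs, sreq, MPI_STATUSES_IGNORE);
    free(rreq);
    free(sreq);
}

How much of each transfer actually proceeds in the background depends on the network's ability to progress messages asynchronously, which is why the bullet above expects bigger gains on asynchronous RDMA machines.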

Barnes-Hut

Barnes-Hut: execution time (main loop) and speedup

  • CGD shmem slightly faster than CGD pthreads
    • for 32K, 256K; virtually identical for larger problems
    • Altix SHMEM has faster sync primitives
  • CGD shmem faster than Splash2
    • CGD one-processor runtime is faster on the Altix 4700 / Itanium
      • smaller nodes and pointers; pointer lookup vs. array indexing
    • for 32K the CGD runtime is faster but the speedup is lower (0.32 vs. 0.37 sec, 95.87 vs. 100.53 speedup); see the worked check after this list
      • speedups are relative to each implementation's one-processor runtime (30.89 vs. 37.07 sec)
      • constant costs (latency, top-of-tree overhead) hurt CGD speedup more since its total runtime is shorter
      • absolute speedups are significantly better for CGD (115.10 vs. 100.53)
    • for 256K and 1M both CGD speedup and execution time are better
    • CGD remote nodes are received as whole chunks and cached locally until invalidated
      • the hardware cache-line granularity is smaller, which gives higher overhead
  • 128 proc
    • execution times look similar
      • Y-axis range: 250 sec; 128-proc times: 8-10 sec
    • differences are meaningful percentage-wise (see speedups)
  • superlinear speedup
    • most obvious for 256K, 1M
    • speedup is about 20 at 16 CPUs, then sublinear
    • due to the larger aggregate cache size
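
As a quick worked check of the 32K figures above, using the rounded times quoted in this list:

\[
S_{\mathrm{rel}}^{\mathrm{CGD}} = \frac{30.89}{0.32} \approx 96, \qquad
S_{\mathrm{rel}}^{\mathrm{Splash2}} = \frac{37.07}{0.37} \approx 100, \qquad
S_{\mathrm{abs}}^{\mathrm{CGD}} = \frac{37.07}{0.32} \approx 116
\]

These agree with the reported 95.87, 100.53, and 115.10 once the unrounded times are used; the absolute figure for CGD is consistent with dividing the Splash2 one-processor time by the CGD 128-processor time.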

Stencil

Stencil: communication and execution time (µs/iteration), and speedup

Stencil: speedup as a function of problem size
  • CGD faster than MPI
    • CGD communication is faster
      • SHMEM latency < MPI latency for small msgs.
      • the merge optimization reduces latency per step (see the sketch after this list)
  • superlinear speedup
    • for 1024x1024, but not for 512x512
    • speedup is about 26 at 16 CPUs, then sublinear
    • due to the larger aggregate cache size
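
The merge optimization ("one comm phase per two steps") can be realized by widening the ghost region so that two stencil updates are computed per exchange. The 1D Jacobi sketch below shows the idea in plain MPI (the CGD version is described as using SHMEM calls); the ghost width G, the array layout, and the relax() routine are illustrative assumptions, with periodic neighbors.

#include <mpi.h>

#define G 2   /* ghost width = number of merged steps (illustrative) */

/* u, v hold n + 2*G doubles; the owned cells are indices [G, n+G).
 * left/right are the neighbor ranks on a periodic ring. */
static void exchange(double *u, int n, int left, int right, MPI_Comm comm)
{
    MPI_Sendrecv(&u[G],     G, MPI_DOUBLE, left,  0,   /* send leftmost owned cells  */
                 &u[n + G], G, MPI_DOUBLE, right, 0,   /* fill right ghost cells     */
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[n],     G, MPI_DOUBLE, right, 1,   /* send rightmost owned cells */
                 &u[0],     G, MPI_DOUBLE, left,  1,   /* fill left ghost cells      */
                 comm, MPI_STATUS_IGNORE);
}

/* steps is assumed to be a multiple of G, and n >= G */
void relax(double *u, double *v, int n, int steps,
           int left, int right, MPI_Comm comm)
{
    for (int t = 0; t < steps; t += G) {
        exchange(u, n, left, right, comm);           /* one comm phase ...         */
        for (int s = 1; s <= G; s++) {               /* ... covers G compute steps */
            for (int i = s; i < n + 2 * G - s; i++)  /* valid window shrinks by 1  */
                v[i] = 0.25 * (u[i - 1] + 2.0 * u[i] + u[i + 1]);
            double *tmp = u; u = v; v = tmp;
        }
        /* G is even, so after the swaps the newest data is back in the caller's u */
    }
}

In this formulation each process moves the same total amount of halo data but pays the exchange latency only once per two steps, at the cost of a few redundant updates in the ghost layers.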