Benchmarks

GPI-2 aims at high performance. That includes maximum performance on micro-benchmarks as well as on different kinds of applications. Some of the results and applications based on GPI-2 are depicted below (use the tabs to navigate).

Micro-benchmarks

The presented results were obtained on a system equipped with Infiniband (FDR).

bw
lat_small
pingpong_small

hybrid_barrier_cray

barrier
barrier

Xeon Phi

The results below show the performance of GPI2 on heterogeneous architectures namely, on a system with the latest Connect-IB from Mellanox where each (Super compute)node  is equipped with 2 IB cards (dual port with port and device bonding enabled) and 2 Intel Xeon Phi, as depicted in the figure below:

system2

connectib_bw
connectib_mic_bw
connectib_lat_small
connectib_mic_lat_small

GPU

bw

 

The following figure depicts the obtained speedup when running the Himeno benchmark on a multi-GPU setup and compares it with the speedup obtained when using MVAPICH2 with GPUDirect RDMA enabled.

himeno

2D-FFT

To illustrate the benefit of computation-communication overlap with GPI-2, the figure below depicts the communication pattern of a 2D-FFT and the required run time (in seconds) on a system with up to 256 cores equipped with Infiniband DDR.

fft2d_14

UTS

The Unbalanced Tree Search (UTS) benchmark is a parallel benchmark that examines the performance achieved when performing an exhaustive search on an unbalanced tree. It presents significant load imbalance problems to any parallelization scheme. Below we present the performance results obtained with GPI(*) for one type of tree (binomial) of around 300 billion nodes. View related publication.

uts

(*) The depicted results were obtained with the previous version of GPI. With GPI-2 the same results are expected.

BQCD

Quantum Chromo Dynamics (QCD) is the fundamental theory of the strong interaction. BQCD is a typical representative of its ab-initio simulation, a boundary exchange problem. The figure below presents the results obtained  in terms of overlap of computation and communication efficiency, for a 24×24×24×48 lattice. The system used consists of dual socket Intel Xeon E5440  CPUs running at 2.83 GHz, each having 16 GByte of RAM connected with DDR Infiniband. View related publication.

bqcd

(*) The depicted results were obtained with the previous version of GPI. With GPI-2 the same results are expected.

3D Stencil

The following figure presents the obtained results on the parallelization of a 3D acoustic wave equation kernel from a bulk-synchronous MPI legacy code to asynchronous MPI and GPI. It is also a great representative of stencil computations. It uses OpenMP within the NUMA sockets and the results were obtained on a system with Intel Xeon X5660 (Westmere) and QDR Infiniband. The MPI implementation was Intel MPI 4.0.3. This work was in collaboration with Mauricio Araya-Polo and Repsol.

stencil

CFD

TAU is an unstructured CFD solver developer by the German Aerospace Center (DLR). The figure below presents the obtained speedup for 4W Multigrid with 2 million mesh points on up to 1440 cores. The system used was equipped with the Intel Xen X5670 and QDR Infiniband. View related publication.

tau

(*) The depicted results were obtained with the previous version of GPI. With GPI-2 the same results are expected.

Numerics


Here we present the results of our GPI-2 implementation of the Jacobi preconditioned Richardson method and we compare it with the one available on PETSc (version 3.2, compiled with OpenMPI). The calculations performed by both implementations are identical. The figure depicts the relative speed-up obtained after measuring the execution time to perform 4000 Jacobi preconditioned Richardson iterations. The initial approximation of the solution is, in both cases, zero. The system used is equipped with the Intel Xeon E5-2670 with 32 GB of RAM and QDR Infiniband.

richardson_401

The following are results on a more recent system: Intel Xeon E5-2680v2 (Ivybridge) with FDR Infiniband.

richardson_257_ivy

Seismic

GPI is one of the main building blocks of many applications at the Fraunhofer ITWM particularly in the seismic area. One example of such applications is the Parallel Edge- and Coherence-Enhancing Anisotropic Diffusion Filter (Parallel ECED). This filter is
used to reduce the noise of seismic data while preserving its layer structure at the same time. These are some results on up to 32768 cores.

scaling_eced_web_640x480

 

Another is Pre-Stack Pro, the first product of Sharp Reflections, a spin-off of Fraunhofer ITWM and EnVision AS.

pspro_01

Visualisation

When it comes to visualization (Real-Time Ray-Tracing, Photo-realistic Rendering), GPI is used to deliver parallel performance as in the example shown below.

 

bmw_gray