LibGeoDecomp (Library for Geometric Decomposition codes) is an auto-parallelizing library for stencil codes. It specifically targets computational science applications in the context of heterogeneous systems and supercomputers. The library eases the development of such tightly coupled codes by essentially taking over the parallel programming: it handles both the parallelization itself and parameter tuning. This enables scientists to focus on their simulation code rather than the technical details of the parallel computer.

LibGeoDecomp's object-oriented design makes it well suited for multiphysics. Its API has also proven to ease the porting of existing, serial codes to LibGeoDecomp.

Supported Models

  • stencil codes
  • short-ranged n-body
  • meshfree methods
  • particle-in-cell codes

Supported Architectures

  • multi-cores
  • GPUs (via CUDA)
  • Intel MIC (via HPX backend)
  • MPI clusters

Tested Supercomputers

  • JUQUEEN at JSC
  • Stampede at TACC
  • Tsubame 2.0 at TiTech
  • SuperMUC at LRZ

Supporters

  • EPCC
  • Forschungszentrum Jülich
  • KONWIHR
  • OLCF
  • NVIDIA CUDA Research Center
  • Tokyo Institute of Technology

What Is a Stencil Code?

A stencil code is a computer simulation that is discrete in time and space and operates on a regular grid, where each cell is updated using only its neighboring cells. The shape of the relevant neighborhood is called the stencil. Examples include cellular automata (e.g. Conway's Game of Life) and Lattice Boltzmann kernels. In LibGeoDecomp we actually use a slightly relaxed notion of stencil codes, which is also the reason why it's not named LibStencilCode. Since cells are represented by objects, we can also implement n-body codes, which are based on particles rather than cells. The key is to superimpose a coarse grid onto the simulation space and sort the freely moving particles into their respective containers. Two of our codes use this method. See the gallery for some real-world applications built with LibGeoDecomp.
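
To make this concrete, here is a minimal, self-contained sketch of such a cell: a Game of Life cell whose update function reads nothing but its 3x3 (Moore) neighborhood from the previous timestep. The names and the grid layout are illustrative assumptions for this page, not LibGeoDecomp's actual API.

    // Illustrative sketch only -- these names are not LibGeoDecomp's API.
    // A cell carries its own state plus an update() that reads nothing but
    // its 3x3 (Moore) neighborhood from the previous timestep.
    #include <vector>

    struct LifeCell {
        bool alive = false;

        template<typename GRID>
        void update(const GRID& oldGrid, int x, int y)
        {
            int neighbors = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx || dy) {
                        neighbors += oldGrid[y + dy][x + dx].alive ? 1 : 0;
                    }
                }
            }
            // Conway's rules: birth on 3 neighbors, survival on 2 or 3.
            alive = (neighbors == 3) || (oldGrid[y][x].alive && (neighbors == 2));
        }
    };

    int main()
    {
        using Grid = std::vector<std::vector<LifeCell>>;
        Grid oldGrid(32, std::vector<LifeCell>(32)), newGrid = oldGrid;

        // One timestep: update all interior cells from the previous grid.
        for (int y = 1; y < 31; ++y) {
            for (int x = 1; x < 31; ++x) {
                newGrid[y][x].update(oldGrid, x, y);
            }
        }
    }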

How Does It Work?

Generic parallelization has been the holy grail of parallel computing research for many decades. So far no one has come up with a language, compiler, or library that could automatically parallelize any sequential code on any hardware. LibGeoDecomp therefore focuses on a class of applications which is, in our eyes, equally important and challenging: stencil codes. The internal workings of these algorithms are highly regular; the parallelization of a computational fluid dynamics code doesn't differ much from one for Conway's Game of Life. We therefore set out to create a library based on C++ class templates. Users supply their actual simulation model (i.e. the data and update function for the individual cells) via template parameters. Since the code is generated at compile time, the runtime overhead is next to zero. In fact, we found that custom parallelizations may turn out to be slower than the generic code of LibGeoDecomp, simply because it takes a while to implement advanced features like dynamic load balancing or cache blocking, and few have the time to do that for their codes.
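
The following toy sketch illustrates that idea. The simulator shown here is a deliberately simplified stand-in with invented names, not LibGeoDecomp's real interface: the user's cell type enters as a template parameter, so its update function is inlined into the generic sweep loop at compile time.

    // Simplified stand-in for illustration -- not LibGeoDecomp's classes.
    // The cell model enters the library as a template parameter, so the
    // compiler can inline and optimize the user's update() inside the sweep.
    #include <cstddef>
    #include <vector>

    template<typename CELL>
    class ToySimulator {
    public:
        explicit ToySimulator(std::size_t size) : oldGrid(size), newGrid(size) {}

        // Advance the whole (1D) grid by one timestep.
        void step()
        {
            for (std::size_t x = 1; x + 1 < oldGrid.size(); ++x) {
                newGrid[x].update(oldGrid, x);
            }
            oldGrid.swap(newGrid);
        }

    private:
        std::vector<CELL> oldGrid;
        std::vector<CELL> newGrid;
    };

    // Example user model: 1D heat diffusion (Jacobi-style averaging).
    struct HeatCell {
        double temperature = 0.0;

        void update(const std::vector<HeatCell>& oldGrid, std::size_t x)
        {
            temperature = 0.5 * (oldGrid[x - 1].temperature +
                                 oldGrid[x + 1].temperature);
        }
    };

    int main()
    {
        ToySimulator<HeatCell> sim(1024);
        for (int t = 0; t < 100; ++t) {
            sim.step();
        }
    }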

Different hardware architectures are represented by dedicated plugins. This allows us to use the most suitable algorithms, depending on which hardware is present. Also, this flexibility enables the library to grow as hardware evolves.
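
As a rough illustration of the plugin idea (with invented class names that do not reflect LibGeoDecomp's actual class hierarchy), each architecture can contribute its own implementation behind a common interface, and the most suitable one is selected for the hardware at hand:

    // Invented names for illustration only -- not LibGeoDecomp's classes.
    // Each hardware architecture contributes a plugin behind a common interface.
    #include <memory>

    class Backend {
    public:
        virtual ~Backend() = default;
        virtual void runTimestep() = 0;
    };

    class CpuBackend : public Backend {
    public:
        void runTimestep() override { /* multi-core CPU sweep */ }
    };

    class CudaBackend : public Backend {
    public:
        void runTimestep() override { /* launch GPU kernels */ }
    };

    // Choose the most suitable plugin for the machine at hand.
    std::unique_ptr<Backend> makeBackend(bool cudaDeviceAvailable)
    {
        if (cudaDeviceAvailable) {
            return std::make_unique<CudaBackend>();
        }
        return std::make_unique<CpuBackend>();
    }

    int main()
    {
        auto backend = makeBackend(false);
        backend->runTimestep();
    }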

How Does LibGeoDecomp Relate to the Competition/Other Libraries?

Of course, LibGeoDecomp is not the first library to target stencil codes. A brief discussion of other approaches can be found on Wikipedia. The library which probably bears the strongest resemblance to ours is Physis. Another related library is Patus. Patus is limited to shared memory machines, so multiple nodes (MPI clusters) are not supported. Both Physis and Patus use a DSL for defining the stencil code, while LibGeoDecomp has a two-way API and relies on plain C++. Both approaches have their merits; we prefer the latter as it gives users a smooth upgrade path for migrating their sequential codes to LibGeoDecomp.

Main Features

  • Boost Software License: free, open source, business compatible
  • parallelization via MPI -- scales to millions of cores (tested on IBM BG/Q with 1.85M MPI ranks)
  • accelerator offloading (currently NVIDIA CUDA only)
  • competitive performance
  • multiphysics (wrapping multiple models in a single class is easy)
  • dynamic load balancing
  • (remote) live steering
  • (remote) in situ visualization (via VisIt's libsim)
  • automatic alignment at cache-line boundaries
  • customizable domain decomposition techniques
  • latency hiding via
    • wide ghostzones (ghostzones of width k require synchronization only every kth timestep; see the sketch after this list)
    • overlapping communication and computation
  • experimental support for gridless/meshfree codes (e.g. finite element method)
  • parallel I/O
  • application level checkpoint/restart
  • visualization via VisIt
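
To illustrate the wide-ghostzone item from the list above: with a halo of width k, a process can run k local update steps before its copies of the neighbors' border cells become stale, so synchronization is only needed every kth timestep. The sketch below is a simplified, hypothetical 1D Jacobi example with the halo exchange merely stubbed out; it is not LibGeoDecomp code.

    #include <cstddef>
    #include <vector>

    // One Jacobi-style sweep over cells [first, last) of a 1D grid.
    void jacobiStep(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t first, std::size_t last)
    {
        for (std::size_t x = first; x < last; ++x) {
            out[x] = 0.5 * (in[x - 1] + in[x + 1]);
        }
    }

    // Local block of n owned cells framed by k ghost cells on each side
    // (total n + 2k). After each local step the region with up-to-date values
    // shrinks by one cell per side; after k steps exactly the owned cells
    // [k, n + k) remain valid, so ghost cells only need refreshing then.
    void simulate(std::vector<double>& grid, std::size_t n, std::size_t k,
                  std::size_t timesteps)
    {
        std::vector<double> buffer(grid.size());
        for (std::size_t t = 0; t < timesteps; t += k) {
            // exchangeGhostZones(grid, k);  // hypothetical halo exchange (e.g. via MPI)
            for (std::size_t s = 0; s < k; ++s) {
                jacobiStep(grid, buffer, s + 1, n + 2 * k - s - 1);
                grid.swap(buffer);
            }
        }
    }

    int main()
    {
        const std::size_t n = 1000, k = 4;
        std::vector<double> grid(n + 2 * k, 1.0);
        simulate(grid, n, k, 100);
    }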

Development Team

Current Members

  • Andreas Schäfer (project lead)
  • Thomas Heller (HPX backend, Intel MIC support)
  • Kurt Kanzenbach (Cactus interface)
  • Johannes Hofmann (BG/Q kernel optimization)
  • Julian Hammer (auto-vectorization)
  • Dominik Thönnes (graph partitioning)

Past Members

  • Björn Meier (libsim interface, remote live steering)
  • Konstantin Kronfeldner (MMORPG framework)
  • Jochen Keil (OpenCL plugin prototype)
  • Siegfried Schöfer (OpenCL plugin prototype)
  • Stephan Helou (LBM toolkit)
  • Arne Hendriks (cache blocking prototype)

Performance

Here are some early benchmark results. In this benchmark we scaled DendSim3, the first real-life simulation built with LibGeoDecomp, on up to 768 cores. The jobs were run on RRZE's LiMa cluster. While scaling up we increased the grid size accordingly (i.e. weak scaling in the sense of Gustafson and Barsis). Ideally the execution time would remain constant. Fluctuations in the domain decomposition scheme and network overhead cause a minor slowdown, but parallel efficiency remains above 94% in all cases.
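
For reference, the efficiency quoted here is presumably weak-scaling efficiency in the usual sense, i.e. the runtime of the smallest configuration divided by the runtime on p cores with a proportionally enlarged grid:

    E(p) = T(1) / T(p)

With ideal scaling the execution time stays constant, so E(p) = 1; values above 0.94 mean the measured runtimes are no more than about 6% above that ideal.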

The plot below shows the performance of our prototype code running a Jacobi iteration with varying grid sizes (n) on an Nvidia Tesla C2050 GPU. The plot is taken from our ICCS paper. To the best of our knowledge, this is currently the fastest published code for this application.

What's Missing?

The library itself is a work in progress, so there are plenty of features which we would love to see implemented soon:
  • multi-GPU support (distributing a simulation across multiple GPUs is currently, in part, left to the user)
  • OpenCL support (Jochen Keil is currently working on a plug-in)
  • autotuning for all runtime parameters
    • during runtime we currently optimize the load distribution only, but there are more parameters (e.g. the ghostzone width or blocking sizes) which could be set automatically
