LibGeoDecomp (Library for Geometric Decomposition codes) is an auto-parallelizing library for computer simulations. It specifically targets computational science applications in the context of supercomputers and heterogeneous systems. The library eases the development of such tightly coupled codes by essentially taking over the parallel programming: it handles both the parallelization itself and parameter tuning. This enables scientists to focus on their simulation code rather than the technical details of the parallel computer.

LibGeoDecomp's object-oriented design makes it well suited for multiphysics applications. Its API has also proven to ease porting existing serial codes to LibGeoDecomp.

LibGeoDecomp is a project of the Ste||ar Group. Development is led by FAU. The library is released as free, open-source software under a liberal license (the Boost Software License), which is compatible with closed-source and commercial software.

Supported Models

  • stencil codes
  • short-ranged n-body
  • meshfree methods
  • particle-in-cell codes

Supported Architectures

  • multi-cores
  • GPUs (via CUDA)
  • Intel MIC (via HPX backend)
  • MPI clusters

Tested Supercomputers

  • Stampede at TACC
  • Tsubame 2.0 at TiTech
  • SuperMUC at LRZ


EPCC · Forschungszentrum Jülich · KONWIHR · OLCF · NVIDIA CUDA Research Center · Tokyo Institute of Technology

What Is a Stencil Code?

A stencil code is a time- and space-discrete computer simulation on a regular grid in which each cell is updated using only its neighboring cells. The shape of the relevant neighborhood is called the stencil. Examples include cellular automata (e.g. Conway's Game of Life) and Lattice Boltzmann kernels. LibGeoDecomp actually uses a slightly relaxed notion of stencil codes, which is also why it is not named LibStencilCode: since cells are represented by objects, we can easily implement n-body codes, too, even though they are based on particles rather than cells. The key is to superimpose a coarse grid onto the simulation space and sort the freely moving particles into their respective containers. Two of our codes use this method. See the gallery for some real-world applications built with LibGeoDecomp.

How Does It Work?

Generic parallelization has been the holy grail of parallel computing research for decades, yet no one has come up with a language/compiler/library that can automatically parallelize arbitrary sequential code on arbitrary hardware. LibGeoDecomp therefore focuses on a class of applications that is, in our eyes, equally important and challenging: stencil codes. The internal workings of these algorithms are highly regular; the parallelization of a computational fluid dynamics code doesn't differ much from one for Conway's Game of Life. We therefore set out to create a library based on C++ class templates. Users supply their simulation model (i.e. the data and update function for the individual cells) via template parameters. Since the code is generated at compile time, the runtime overhead is next to zero. In fact, we found that custom parallelizations may turn out to be slower than LibGeoDecomp's generic code, simply because advanced features like dynamic load balancing or cache blocking take a while to implement, and few have the time to do that for their own codes.

Different hardware architectures are represented by dedicated plugins. This allows us to use the most suitable algorithms, depending on which hardware is present. Also, this flexibility enables the library to grow as hardware evolves.

How Does LibGeoDecomp Relate to the Competition/Other Libraries?

Of course LibGeoDecomp is not the first library to target stencil codes. A brief discussion of other approaches can be found on Wikipedia. The library that probably bears the strongest resemblance to ours is Physis; another is Patus. Patus is limited to shared-memory machines, so multiple nodes (MPI clusters) are not supported. Both Physis and Patus use a DSL for defining the stencil code, while LibGeoDecomp has a two-way API and relies on plain C++. Both approaches have their merits; we prefer the latter as it gives users a smooth upgrade path for migrating their sequential codes to LibGeoDecomp.

Main Features

  • Boost Software License: free, open source, business compatible
  • parallelization via MPI -- scales to millions of cores (tested on IBM BG/Q with 1.85M MPI ranks)
  • accelerator offloading (currently NVIDIA CUDA only)
  • competitive performance
  • multiphysics (wrapping multiple models in a single class is easy)
  • dynamic load balancing
  • (remote) live steering
  • (remote) in situ visualization (via VisIt's libsim)
  • automatic alignment at cache-line boundaries
  • customizable domain decomposition techniques
  • latency hiding via
    • wide ghostzones (ghostzones of width k require synchronization only every kth timestep)
    • overlapping communication and computation
  • experimental support for gridless/meshfree codes (e.g. finite element method)
  • parallel I/O
  • application level checkpoint/restart
  • visualization via VisIt

Development Team

Current Members

  • Andreas Schäfer (project lead)
  • Mathias Schöll (auto-tuning)
  • Johannes Knödtel (auto-tuning)
  • Thomas Heller (HPX backend, Intel MIC support)
  • Kurt Kanzenbach (unstructured, vectorization, previously: CACTUS interfacing)
  • Konstantin Kronfeldner (unstructured, domain decomposition, previously: MMORPG MeisterYuke)
  • Benno Schüpferling (MiniGhost port)
  • Sophie Wenzel-Teuber (dendrite simulation)

Past Members

  • Wolfgang Schäfer (SUMO parallelization)
  • Björn Meier (libsim interface, remote live steering)
  • Christopher Bross (unstructured grid container)
  • Dominik Thönnes (graph partitioning)
  • Jochen Keil (OpenCL plugin prototype)
  • Julian Hammer (gridless)
  • Siegfried Schöfer (OpenCL plugin prototype)
  • Stephan Helou (LBM toolkit)
  • Arne Hendriks (cache blocking prototype)


LibGeoDecomp has been proven to scale to 1.85M MPI processes on 28k nodes (both on JUQUEEN, a BG/Q at Forschungszentrum Jülich) and to 9.4 PFLOPS (on Titan at the Oak Ridge Leadership Computing Facility). It performs well for both weak and strong scaling setups. Big systems aside, it can also run efficiently on small clusters and workstations.

The plot below shows the performance of our prototype code running a Jacobi iteration with varying grid sizes (n) on an Nvidia Tesla C2050 GPU. The plot is taken from our ICCS paper. To the best of our knowledge, this is currently the fastest published code for this application.

What's Missing?

The library itself is a work in progress, so there are plenty of features which we would love to see implemented soon:
  • multi-GPU code is currently a (partial) user burden
  • autotuning for all runtime parameters
    • during runtime we currently optimize only the load distribution, but there are more parameters (e.g. the ghostzone width or blocking sizes) which could be set automatically

last modified: Thu Oct 20 17:55:52 2016 +0200