LibGeoDecomp (Library for Geometric Decomposition codes) is an auto-parallelizing library for stencil codes. It specifically targets computational science applications in the context of heterogeneous systems and supercomputers. The library eases the development of such tightly coupled codes by essentially taking over the parallel programming: it handles both the parallelization itself and parameter tuning. This enables scientists to focus on their simulation code rather than the technical details of the parallel computer.
LibGeoDecomp's object-oriented design makes it well suited for multiphysics simulations. Its API has also proven to ease porting existing serial codes to LibGeoDecomp.
A stencil code is a time- and space-discrete computer simulation on a regular grid, where each cell is updated using only its neighboring cells. The shape of the relevant neighborhood is called the stencil. Examples include cellular automata (e.g. Conway's Game of Life) and Lattice Boltzmann kernels. In LibGeoDecomp we actually use a slightly relaxed notion of stencil codes, which is also the reason why it's not named LibStencilCode. Since cells are represented by objects, we can also implement n-body codes, which are based on particles rather than cells. The key is to superimpose a coarse grid onto the simulation space and sort the freely moving particles into their respective container cells. Two of our codes use this method. See the gallery for some real-world applications built with LibGeoDecomp.
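To make the definition concrete, here is a minimal serial sketch of a stencil update using Conway's Game of Life as the cell rule. This is plain C++ for illustration only (the `Grid` and `step` names are ours), not LibGeoDecomp's actual API:

```cpp
#include <cassert>
#include <vector>

// A tiny 2D grid of cells; 1 = alive, 0 = dead.
struct Grid {
    int width, height;
    std::vector<int> cells;

    Grid(int w, int h) : width(w), height(h), cells(w * h, 0) {}

    int get(int x, int y) const {
        // Treat out-of-bounds neighbors as dead cells.
        if (x < 0 || y < 0 || x >= width || y >= height) return 0;
        return cells[y * width + x];
    }
    void set(int x, int y, int v) { cells[y * width + x] = v; }
};

// One time step: every cell is updated from its 3x3 Moore
// neighborhood -- the defining property of a stencil code.
Grid step(const Grid& old) {
    Grid next(old.width, old.height);
    for (int y = 0; y < old.height; ++y) {
        for (int x = 0; x < old.width; ++x) {
            int sum = 0; // count live neighbors
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx || dy) sum += old.get(x + dx, y + dy);
            // Classic birth/survival rule.
            next.set(x, y, (sum == 3) || (sum == 2 && old.get(x, y)));
        }
    }
    return next;
}
```

Because the update of each cell depends only on a fixed, local neighborhood, the outer loops can be partitioned across processors; this regularity is what LibGeoDecomp exploits.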
Generic parallelization has been the holy grail of parallel computing research for many decades. So far no one has come up with a language/compiler/library that could automatically parallelize any sequential code on any hardware. LibGeoDecomp therefore focuses on a class of applications which is, in our eyes, equally important and challenging: stencil codes. The internal workings of these algorithms are highly regular; the parallelization of a computational fluid dynamics code doesn't differ much from one for Conway's Game of Life. We therefore set out to create a library based on C++ class templates. Users supply their actual simulation model (i.e. the data and update function for the individual cells) via template parameters. Since the code is generated at compile time, the runtime overhead is next to zero. In fact, we found that custom parallelizations may turn out to be slower than the generic code of LibGeoDecomp, simply because it takes a while to implement advanced features like dynamic load balancing or cache blocking, and few have the time to do that for their own codes.
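The template-parameter approach can be sketched as follows. The names here (`Simulator`, `HeatCell`) are purely illustrative, not LibGeoDecomp's real classes; the point is that because the cell type is a compile-time parameter, calls to its update function can be inlined, avoiding the runtime cost of virtual dispatch:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a simulator parameterized on the user's cell
// type. The compiler sees CELL::update() at compile time and can
// inline it -- this is why the generic code has next to zero overhead.
template<typename CELL>
class Simulator {
public:
    explicit Simulator(std::size_t size) : grid(size), buffer(size) {}

    // One time step over the interior of a 1D grid, double-buffered.
    void step() {
        for (std::size_t i = 1; i + 1 < grid.size(); ++i)
            buffer[i].update(grid[i - 1], grid[i], grid[i + 1]);
        grid.swap(buffer);
    }

    std::vector<CELL> grid, buffer;
};

// User-supplied model: a 1D three-point heat diffusion cell.
struct HeatCell {
    double temp = 0.0;
    void update(const HeatCell& l, const HeatCell& c, const HeatCell& r) {
        temp = (l.temp + c.temp + r.temp) / 3.0;
    }
};
```

Swapping in a different cell type (e.g. a Lattice Boltzmann cell) changes the physics without touching the parallelization machinery.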
Different hardware architectures are represented by dedicated plugins. This allows us to use the most suitable algorithms, depending on which hardware is present. Also, this flexibility enables the library to grow as hardware evolves.
Of course LibGeoDecomp is not the first library to target stencil codes. A brief discussion of other approaches can be found on Wikipedia. The library that probably bears the strongest resemblance to ours is Physis. Another is Patus, which is limited to shared-memory machines, so multiple nodes (MPI clusters) are not supported. Both Physis and Patus use a DSL for defining the stencil code, while LibGeoDecomp has a two-way API and relies on plain C++. Both approaches have their merits; we prefer the latter as it gives users a smooth upgrade path for migrating their sequential codes to LibGeoDecomp.
Here are some early benchmark results. In this benchmark we scaled DendSim3, the first real-life simulation built with LibGeoDecomp, on up to 768 cores. The jobs were run on RRZE's LiMa. While scaling up we increased the grid size accordingly (i.e. weak scaling, or Gustafson-Barsis scaling). Ideally the execution time would remain constant. Fluctuations in the domain decomposition scheme and network overhead cause a minor slowdown, but the efficiency remains above 94% in all cases.
The plot below shows the performance of our prototype code running a Jacobi iteration with varying grid sizes (n) on an Nvidia Tesla C2050 GPU. The plot is taken from our ICCS paper. To the best of our knowledge, this is currently the fastest published code for this application.
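For reference, the serial core of a Jacobi iteration looks like this. The exact kernel used in the paper may differ; this five-point average in 2D is a common variant, and the names here are ours, not from the paper. The GPU version parallelizes exactly this kind of loop nest across thousands of threads:

```cpp
#include <vector>

// One Jacobi sweep over the interior of an n x n grid: each cell
// becomes the average of itself and its four von Neumann neighbors.
// Double-buffered: reads src, writes dst.
void jacobiStep(const std::vector<double>& src,
                std::vector<double>& dst, int n) {
    for (int y = 1; y < n - 1; ++y) {
        for (int x = 1; x < n - 1; ++x) {
            int i = y * n + x;
            dst[i] = 0.2 * (src[i] + src[i - 1] + src[i + 1] +
                            src[i - n] + src[i + n]);
        }
    }
}
```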