acts as a highly efficient multi-dimensional array of arbitrary objects, but really uses a struct of arrays memory layout. It's great for writing vectorized code and its lightning-fast iterators give you access to neighboring elements with zero address generation overhead.

Who should use it?

Basically anyone who wants to store objects or structs in an array with 2 or more dimensions and needs to iterate through them. Applications include computer simulations (e.g. stencil codes), image processing or linear algebra with complex numbers.


  • object-oriented interface, but SoA memory layout
  • good for vectorization (SSE, AVX, CUDA)
  • zero address generation overhead
  • fast iterators, yield access to neighboring elements
  • efficient copy in/out

Getting it

  • git clone
  • Travis CI status: , CircleCI status:

If you need to store objects in a regular grid, then you've got pretty much two options: boost::multi_array has a convenient interface and its code is efficient. But the current trend in hardware design towards longer and longer vector ALUs means that your computational kernels will either need to scatter/gather memory accesses (slow!) or your code will be scalar only.

Therefore the general opinion in HPC is to leave objects out of performance critical code and use a Struct of Arrays memory layout. This means that instead of storing a single, huge array of objects (structs), you'll allocate multiple arrays of scalar types -- one per member. The advantage is that all corresponding variables will be stored consecutively in memory, which makes your hardware happy. But you probably won't be as happy, as this is a clumsy and counterintuitive way of programming (not object-oriented). But even worse, your code may now run into a new problem: address generation. Did you ever wonder why constructs like A[z * DIM_X * DIM_Y + y * DIM_X + x] don't slow down your code? After all, one could assume that this code would result in three multiplications and two additions. In reality your compiler will try to keep a pointer to each row of consecutive elements that your innermost loop will access. This pointer will ideally be stored in a register. Trouble strikes if the number of pointers required exceeds the number of available registers. In that case these pointers may be spilled to the stack. That's awful as that means that an array access may now result in two memory accesses: one to retrieve the pointer, one to move the actual data. Doesn't apply to you? Well, that happens for models as simple as a Lattice Boltzmann method with a D3Q19 model.

Eric? What are you talking about? Are you mad? This is
           about arrays! Flat arrays!
Strided memory access when loading data into vector registers from an Array of Structs

An array of complex numbers in Array-of-Structs form: loading components into vector registers requires a strided memory access.

Sequential memory access when loading data into vector registers from a Structs of Array

Complex numbers stored in a Struct-of-Arrays memory layout. Here vector registers can be filled by a contiguous read from main memory.

The key to LibFlatArray's functionality is that we fix its size internally at compile time. By this the compiler can take over much of the address computation that would otherwise have to be done at runtime. flat_array allocates only one field internally. For each member a sub-array is reserved within. We know each member's size and the array's dimensions are fixed, thus we can deduce the offsets for each of these sub-arrays. Some care needs to be taken though to ensure optimal alignment for each field. To take advantage of these optimizations in user code we have to call it back via a functor. That is the only way our code can conveniently bind template parameters in external code.

Our promise was that flat_array would provide an object-oriented interface. This is implemented in the iterator, which tries to emulate the storage class' interface by providing the same members, only as functions. These are generated automatically by a set of macros, but the user needs to supply the list of member variables and their respective types. In a certain sense this is what the Ruby folks refer to as duck typing. The iterators don't really grant access to instances of the storage type, but rather yield objects which look and act like those.

The cool thing about the iterator is that it doesn't hold any data itself, only two pointers: one to the external data store and one counter for indexing. Compilers are generally clever enough to optimize the actual iterator instances away. They never get allocated. If a user wants to access neighbors via the same iterator and can provide the relative coordinates as constants (e.g. via template parameters), then even those offsets can be pre-computed. To protect the user from having to decide upon the array size at compile-time, we can simply maintain a list of the most relevant dimensions within the library and selecting the next larger predefined size to match the users selection. We can show that the memory losses incurred by allocating slightly too large arrays can be kept arbitrarily low, but this comes at the cost of generating more code. That however -- according to our experiences -- is a non-issue for most codes. And that's already it: code generation plus lots of compile-time computation plus a clever callback.

Please take a look at the examples and tests in our repository. The tests showcase rather simple kernels that are meant to illustrate how SoA models can be registered and used with the containers soa_array and soa_grid. The test for soa_grid also includes a variation of Conway's Game of Life. The test for soa_array contains a simple charged particle model.

The examples are meant to be closer to kernels in actual scientific applications and demonstrate the different performance features of LibFlatArray. Please feel free to consult the LibGeoDecomp mailing list should you have any questions. We'll be glad to help.

  • The Jacobi example is a stencil code and showcases the vectorization facilities of the library, including the loop peeler and the transparent overloads for non-temporal stores. It is a 3D Jacobean smoother. Such kernels are typically memory bound and operate on a 3D matrix of scalars. Hence the SoA structure is reduced to a single value. Compared with the performance reference, written in C99, the LibFlatArray-based code yields better performance as it can dynamically switch from standard stores to streaming stores.
  • The Gauss example sweeps through a 3D volume and applies a 5x5 Gauss filter on the YZ-plane. It is very similar to the Jacobi example, but uses a wider stencil radius (Jacobi: 3D von Neumann stencil of radius 1, Gauss: 2D Moore stencil of radius 2). The trick here is that the C99 reference code is slowed down by address calculation which the LibFlatArray-based implementation does not suffer from this: by using the callbacks provided by soa_grid all offsets can be calculated at compile time -- at the expense of compilation time.
  • The Smoothed Particle Hydrodynamics example is a simple SPH model based on an implementation by David Bindel. It shows how particle models with expensive, short-range interactions can be efficiently implemented. For this, I employ the soa_array together with a loop-peeler, vectorization and branch harvesting.

The code is available at, releases are here.

last modified: Sat Oct 15 16:09:09 2016 +0200