LibFlatArray

acts as a highly efficient multi-dimensional array of arbitrary objects, but really uses a struct of arrays memory layout. It's great for writing vectorized code and its lightning-fast iterators give you access to neighboring elements with zero address generation overhead.

Who should use it?

Basically anyone who wants to store objects or structs in an array with 2 or more dimensions and needs to iterate through them. Applications include computer simulations (e.g. stencil codes), image processing or linear algebra with complex numbers.

Features

  • object-oriented interface, but SoA memory layout
  • good for vectorization (SSE, AVX, CUDA)
  • zero address generation overhead
  • fast iterators, yield access to neighboring elements
  • efficient copy in/out

If you need to store objects in a regular grid, then you've got pretty much two options: boost::multi_array has a convenient interface and its code is efficient. But the current trend in hardware design towards longer and longer vector ALUs means that your computational kernels will either need to scatter/gather memory accesses (slow!) or your code will be scalar only.

Therefore the general opinion in HPC is to leave objects out of performance critical code and use a Struct of Arrays memory layout. This means that instead of storing a single, huge array of objects (structs), you'll allocate multiple arrays of scalar types -- one per member. The advantage is that all corresponding variables will be stored consecutively in memory, which makes your hardware happy. But you probably won't be as happy, as this is a clumsy and counterintuitive way of programming (not object-oriented). But even worse, your code may now run into a new problem: address generation. Did you ever wonder why constructs like A[z * DIM_X * DIM_Y + y * DIM_X + x] don't slow down your code? After all, one could assume that this code would result in three multiplications and two additions. In reality your compiler will try to keep a pointer to each row of consecutive elements that your innermost loop will access. This pointer will ideally be stored in a register. Trouble strikes if the number of pointers required exceeds the number of available registers. In that case these pointers may be spilled to the stack. That's awful as that means that an array access may now result in two memory accesses: one to retrieve the pointer, one to move the actual data. Doesn't apply to you? Well, that happens for models as simple as a Lattice Boltzmann method with a D3Q19 model.

Eric? What are you talking about? Are you mad? This is
           about arrays! Flat arrays!
Strided memory access when loading data into vector registers from an Array of Structs
Sequential memory access when loading data into vector registers from a Structs of Array

The key to LibFlatArray's functionality is that we fix its size internally at compile time. By this the compiler can take over much of the address computation that would otherwise have to be done at runtime. flat_array allocates only one field internally. For each member a sub-array is reserved within. We know each member's size and the array's dimensions are fixed, thus we can deduce the offsets for each of these sub-arrays. Some care needs to be taken though to ensure optimal alignment for each field. To take advantage of these optimizations in user code we have to call it back via a functor. That is the only way our code can conveniently bind template parameters in external code.

Our promise was that flat_array would provide an object-oriented interface. This is implemented in the iterator, which tries to emulate the storage class' interface by providing the same members, only as functions. These are generated automatically by a set of macros, but the user needs to supply the list of member variables and their respective types. In a certain sense this is what the Ruby folks refer to as duck typing. The iterators don't really grant access to instances of the storage type, but rather yield objects which look and act like those.

The cool thing about the iterator is that it doesn't hold any data itself, only two pointers: one to the external data store and one counter for indexing. Compilers are generally clever enough to optimize the actual iterator instances away. They never get allocated. If a user wants to access neighbors via the same iterator and can provide the relative coordinates as constants (e.g. via template parameters), then even those offsets can be pre-computed. To protect the user from having to decide upon the array size at compile-time, we can simply maintain a list of the most relevant dimensions within the library and selecting the next larger predefined size to match the users selection. We can show that the memory losses incurred by allocating slightly too large arrays can be kept arbitrarily low, but this comes at the cost of generating more code. That however -- according to our experiences -- is a non-issue for most codes. And that's already it: code generation plus lots of compile-time computation plus a clever callback.

The code is available at bitbucket.org/gentryx/libflatarray. Get it via hg clone https://bitbucket.org/gentryx/libflatarray.

last modified: Sun Dec 02 00:17:55 2012 +0100