Basically anyone who wants to store objects or structs in an array with 2 or more dimensions and needs to iterate through them. Applications include computer simulations (e.g. stencil codes), image processing or linear algebra with complex numbers.
git clone https://github.com/STEllAR-GROUP/libflatarray
If you need to store objects in a regular grid, then
you've got pretty much two options.
boost::multi_array has a convenient
interface and its code is efficient. But the current trend
in hardware design towards ever longer vector ALUs
means that your computational kernels will either need
scatter/gather memory accesses (slow!) or
your code will remain scalar only.
Therefore the prevailing opinion in HPC is to keep
objects out of performance-critical code and to use
a Struct of Arrays (SoA) memory layout. This means that
instead of storing a single, huge array of objects
(structs), you'll allocate multiple arrays of scalar types
-- one per member. The advantage is that all corresponding
variables will be stored consecutively in memory, which
makes your hardware happy. But you probably won't be as
happy, as this is a clumsy and counterintuitive way of
programming (not object-oriented). But even worse, your
code may now run into a new problem: address generation.
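The difference between the two layouts can be sketched in plain C++ (the Cell type and its members are made up for illustration, they're not part of LibFlatArray):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simulation cell with two state variables.
struct Cell {
    double temperature;
    double pressure;
};

// Array of Structs (AoS): the members of one cell are adjacent in memory,
// but the temperatures of neighboring cells are not.
std::vector<Cell> grid_aos(1000);

// Struct of Arrays (SoA): one scalar array per member. All temperatures are
// stored consecutively, which suits wide vector loads and stores.
struct GridSoA {
    std::vector<double> temperature;
    std::vector<double> pressure;

    explicit GridSoA(std::size_t n) : temperature(n), pressure(n) {}
};

// A kernel on the SoA layout streams through contiguous memory, so the
// compiler can vectorize it without scatter/gather instructions.
void heat_up(GridSoA& grid) {
    for (std::size_t i = 0; i < grid.temperature.size(); ++i) {
        grid.temperature[i] += 0.5 * grid.pressure[i];
    }
}
```

The price is visible above: the cell is no longer a self-contained object, and every kernel has to know about the split arrays.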
Did you ever wonder why constructs like
A[z * DIM_X
* DIM_Y + y * DIM_X + x] don't slow down your code?
After all, one could assume that this code would result in
three multiplications and two additions. In reality your
compiler will try to keep a pointer to each row of
consecutive elements that your innermost loop will access.
This pointer will ideally be stored in a register. Trouble
strikes if the number of pointers required exceeds the
number of available registers. In that case these pointers
may be spilled to the stack. That's awful as that means
that an array access may now result in two memory
accesses: one to retrieve the pointer, one to move the
actual data. Doesn't apply to you? Well, this happens for
models as simple as a Lattice Boltzmann method.
The key to LibFlatArray's functionality is that we fix the
array's size internally at compile time. This lets the compiler
take over much of the address computation that would otherwise
have to be done at runtime.
flat_array allocates only
one block of memory internally. Within it, a sub-array is
reserved for each member. Since we know each member's size and
the array's dimensions are fixed, we can deduce the offsets of
these sub-arrays. Some care needs to be taken, though, to ensure
optimal alignment for each field. To take advantage of these
optimizations, user code has to be called back via a
functor: that is the only way our code can conveniently bind
template parameters in external code.
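The principle can be sketched outside the library (this is a simplified stand-in, not LibFlatArray's actual classes or macros): the dimensions are template parameters, so the member offsets are compile-time constants, and a functor callback carries the concrete storage type into user code.

```cpp
#include <cstddef>

// Sketch only. Because DIM_X and DIM_Y are template parameters, every
// sub-array offset below is a compile-time constant.
template<int DIM_X, int DIM_Y>
struct soa_storage {
    static constexpr std::size_t CELLS = DIM_X * DIM_Y;
    static constexpr std::size_t OFFSET_A = 0;      // sub-array for member "a"
    static constexpr std::size_t OFFSET_B = CELLS;  // sub-array for member "b"

    // One flat allocation holding both sub-arrays back to back.
    double data[2 * CELLS];

    double& a(int x, int y) { return data[OFFSET_A + y * DIM_X + x]; }
    double& b(int x, int y) { return data[OFFSET_B + y * DIM_X + x]; }

    // The callback is how the compile-time dimensions reach user code: the
    // functor's operator() gets instantiated with this concrete storage type.
    template<typename FUNCTOR>
    void callback(FUNCTOR functor) {
        functor(*this);
    }
};

// User-side functor: member access looks object-like, yet all address
// arithmetic can be folded at compile time.
struct init_kernel {
    template<typename STORAGE>
    void operator()(STORAGE& storage) {
        storage.a(1, 1) = 4.0;
        storage.b(1, 1) = 0.5 * storage.a(1, 1);
    }
};
```

Since DIM_X is a constant inside operator(), the expression y * DIM_X + x can be folded or strength-reduced by the compiler, which is exactly the runtime address arithmetic this design aims to eliminate.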
Our promise was that
flat_array would provide an
object-oriented interface. This is implemented in the
iterator, which emulates the storage class's interface
by providing the same members, only as functions. These are
generated automatically by a set of macros; the user merely needs
to supply the list of member variables and their respective
types. In a certain sense this is what the Ruby folks refer to
as duck typing: the iterators don't really grant
access to instances of the storage type, but rather yield
objects which look and act like them.
The cool thing about the iterator is that it doesn't hold any data itself, only a pointer to the external data store and a counter for indexing. Compilers are generally clever enough to optimize the actual iterator instances away; they never get allocated. If a user wants to access neighbors via the same iterator and can provide the relative coordinates as constants (e.g. via template parameters), then even those offsets can be pre-computed.

To spare the user from having to decide upon the array size at compile time, we simply maintain a list of the most relevant dimensions within the library and select the next larger predefined size to match the user's selection. We can show that the memory lost by allocating slightly too large arrays can be kept arbitrarily low, but this comes at the cost of generating more code. That, however, is in our experience a non-issue for most codes.

And that's already it: code generation plus lots of compile-time computation plus a clever callback.
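The size-selection idea can be sketched as follows (the size list here is invented for illustration, it is not the library's actual set):

```cpp
// A few grid widths instantiated at compile time; at runtime we pick the
// smallest one that fits the user's request.
const int PREDEFINED_SIZES[] = {32, 64, 128, 192, 256, 512, 1024};

int select_size(int requested) {
    for (int size : PREDEFINED_SIZES) {
        if (size >= requested) {
            return size;
        }
    }
    return -1;  // request exceeds the largest instantiated size
}
```

Requesting e.g. a width of 100 yields 128, so the worst-case memory overhead is bounded by the ratio of adjacent entries in the list. A denser list lowers that bound, but every entry is another template instantiation, hence more generated code.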
Please take a look at the tests
in our repository. They showcase rather simple kernels
that are meant to illustrate how SoA models can be registered
and used with the containers.
The soa_grid test also includes a variation of Conway's
Game of Life.
The soa_array test contains a simple charged particle example.
The examples are meant to be closer to kernels in actual scientific applications and demonstrate the different performance features of LibFlatArray. Please feel free to consult the LibGeoDecomp mailing list should you have any questions. We'll be glad to help.
soa_grid: all offsets can be calculated at compile time -- at the expense of compilation time.
soa_array: together with a loop-peeler, vectorization and branch harvesting.