We're very happy to announce the availability of LibFlatArray 0.3.0, our C++ library for Struct-of-Arrays containers and expression templates for vectorization. This latest release represents a huge leap forward. It comprises more code, more commits, and more supported instruction set architectures (ISAs) than all previous releases. Our main focus for this release was to support all architectures that are relevant for HPC and to extend the vectorization intrinsics for kernels with irregular memory access patterns and control flow.
The authors would like to acknowledge the funding of the Deutsche Forschungsgemeinschaft (DFG) through the Cluster of Excellence Engineering of Advanced Materials.
short_vec now supports Intel's AVX512, which is used by the current Intel Xeon Phi coprocessor (Knights Landing) and upcoming Intel Xeon products. ARM NEON is mostly interesting for handheld devices. Further supported ISAs: SSE, AVX, QPX, MIC (Intel KNC).
short_vec can now perform gather loads and scatter stores to main memory. This is useful for vectorizing kernels with irregular memory access patterns, e.g. sparse matrix operations like SpMVM (requires a C++11 compliant compiler).
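To illustrate why gather loads matter for SpMVM, here is a minimal, self-contained sketch of a CSR sparse matrix-vector product. It does not use LibFlatArray: `vec4`, `gather()` and `spmv_csr()` are hypothetical names introduced for this example, and the small loops merely emulate what a hardware gather instruction would do in one step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-wide vector standing in for a SIMD register.
// This is NOT LibFlatArray's short_vec.
struct vec4 {
    double lane[4];
};

// Gather: load 4 values from the arbitrary positions base[idx[0..3]].
// A real SIMD implementation would map this to a gather instruction.
inline vec4 gather(const double* base, const int* idx) {
    vec4 v;
    for (int i = 0; i < 4; ++i) {
        v.lane[i] = base[idx[i]];
    }
    return v;
}

// Sparse matrix-vector product y = A * x with A stored in CSR format.
std::vector<double> spmv_csr(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_idx,
                             const std::vector<double>& values,
                             const std::vector<double>& x) {
    assert(!row_ptr.empty());
    std::size_t rows = row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);

    for (std::size_t r = 0; r < rows; ++r) {
        int k = row_ptr[r];
        int end = row_ptr[r + 1];
        double sum = 0.0;

        // Vectorizable part: 4 nonzeros per step. The matrix values are
        // contiguous, but the entries of x are scattered, hence the gather.
        for (; k + 4 <= end; k += 4) {
            vec4 xs = gather(x.data(), &col_idx[k]);
            for (int i = 0; i < 4; ++i) {
                sum += values[k + i] * xs.lane[i];
            }
        }
        // Scalar remainder for rows whose length is not a multiple of 4.
        for (; k < end; ++k) {
            sum += values[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

The column indices decide which entries of `x` are read, so a plain contiguous vector load cannot be used here. A scatter store is the mirror image: it writes the lanes of a vector to indexed positions.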
soa_array can both be used with CUDA memory and support moving data to host memory.
short_vec now implements comparison operators (<, <=, ==, >, >=). any() can be used to quickly check if any vector element matches the comparison. Rare or expensive conditionals can then be handled in a scalar fashion, using get() for element retrieval.
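The pattern described above (vector comparison, a quick any() check, scalar handling of the rare case) can be sketched as follows. This is a standalone illustration under stated assumptions, not LibFlatArray code: `vec4`, `mask4`, the free functions `load()`, `store()`, `any()`, `get()` and the kernel `scale_and_fix()` are all hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins (not LibFlatArray types): a 4-wide vector and
// the per-lane mask that a vector comparison produces.
struct vec4  { float lane[4]; };
struct mask4 { bool  lane[4]; };

inline vec4 load(const float* p) {
    vec4 v;
    for (int i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
}

inline void store(float* p, const vec4& v) {
    for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}

// Lane-wise comparison against a scalar.
inline mask4 operator<(const vec4& a, float b) {
    mask4 m;
    for (int i = 0; i < 4; ++i) m.lane[i] = a.lane[i] < b;
    return m;
}

// True if at least one lane of the mask is set.
inline bool any(const mask4& m) {
    return m.lane[0] || m.lane[1] || m.lane[2] || m.lane[3];
}

// Retrieve a single element from a vector.
inline float get(const vec4& v, int i) { return v.lane[i]; }

// Doubles every element; negative inputs (assumed to be rare) are clamped
// to zero first. Returns the number of clamped elements.
std::size_t scale_and_fix(std::vector<float>& data) {
    assert(data.size() % 4 == 0);
    std::size_t fixed = 0;
    for (std::size_t i = 0; i < data.size(); i += 4) {
        vec4 v = load(&data[i]);
        if (any(v < 0.0f)) {
            // Rare path: at least one lane needs special treatment,
            // so this chunk is processed element by element.
            for (int lane = 0; lane < 4; ++lane) {
                float x = get(v, lane);
                if (x < 0.0f) {
                    x = 0.0f;
                    ++fixed;
                }
                data[i + lane] = 2.0f * x;
            }
        } else {
            // Common path: branch-free, fully vectorizable.
            for (int lane = 0; lane < 4; ++lane) v.lane[lane] *= 2.0f;
            store(&data[i], v);
        }
    }
    return fixed;
}
```

The benefit of this layout is that the common case pays for only one extra comparison and branch per vector, while the rare case keeps its straightforward scalar logic.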
int as element types.
cuda_array is a convenient helper class for exchanging AoS data between host and device.
streaming_short_vec behaves just like a short_vec, but will do all stores with the non-temporal (no read) hint to avoid cache pollution. The new type trait estimate_optimum_short_vec_type can be used to select both the optimum store strategy and the arity of short_vec. Choosing an arity larger than the machine word's width results in automatic loop unrolling.
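Why does a larger arity lead to unrolling? The following standalone sketch shows the idea with a hypothetical `wide_vec<ARITY>` type (not LibFlatArray's short_vec) and an assumed hardware width `HW_WIDTH` of 4 floats. Each operation on the wide vector loops over its hardware-sized chunks with a compile-time trip count, which a compiler can unroll completely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed hardware vector width in floats (e.g. 4 for SSE).
// Purely illustrative.
constexpr std::size_t HW_WIDTH = 4;

// Hypothetical vector of ARITY floats, built from ARITY / HW_WIDTH
// hardware-sized chunks. Not LibFlatArray's short_vec.
template<std::size_t ARITY>
struct wide_vec {
    static_assert(ARITY % HW_WIDTH == 0,
                  "ARITY must be a multiple of HW_WIDTH");
    float lane[ARITY];

    void load(const float* p) {
        // Outer loop: one iteration per hardware register.
        for (std::size_t c = 0; c < ARITY / HW_WIDTH; ++c)
            for (std::size_t i = 0; i < HW_WIDTH; ++i)
                lane[c * HW_WIDTH + i] = p[c * HW_WIDTH + i];
    }

    void store(float* p) const {
        for (std::size_t c = 0; c < ARITY / HW_WIDTH; ++c)
            for (std::size_t i = 0; i < HW_WIDTH; ++i)
                p[c * HW_WIDTH + i] = lane[c * HW_WIDTH + i];
    }

    wide_vec& operator+=(const wide_vec& other) {
        for (std::size_t c = 0; c < ARITY / HW_WIDTH; ++c)
            for (std::size_t i = 0; i < HW_WIDTH; ++i)
                lane[c * HW_WIDTH + i] += other.lane[c * HW_WIDTH + i];
        return *this;
    }
};

// out = a + b, processing ARITY elements per loop iteration.
template<std::size_t ARITY>
void add_arrays(const std::vector<float>& a,
                const std::vector<float>& b,
                std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    assert(a.size() % ARITY == 0);
    for (std::size_t i = 0; i < a.size(); i += ARITY) {
        wide_vec<ARITY> va, vb;
        va.load(&a[i]);
        vb.load(&b[i]);
        va += vb;
        va.store(&out[i]);
    }
}
```

With `add_arrays<16>` the kernel source is identical to `add_arrays<4>`, yet each iteration now covers four hardware vectors. The user code stays the same while the degree of unrolling becomes a template parameter.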
loop_peeler() can handle the scalar iterations at the beginning and end of vectorizable loops. It is now also usable with C++14 lambdas (requires generic, i.e. templated, lambdas), which results in a much more natural code layout. Previously the loop body had to be moved to a separate class.
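The following sketch shows the general idea of loop peeling with a C++14 generic lambda. It is a simplified stand-in written for this example: `peel_loop()` and `saxpy()` are hypothetical names, and the signature is an assumption that does not mirror LibFlatArray's loop_peeler(). The width is passed to the lambda as a `std::integral_constant`, so that one lambda body is instantiated both for the scalar and for the vector case.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper: splits [begin, end) into
//   1. a scalar head, until the index is a multiple of WIDTH,
//   2. a vectorizable main part in steps of WIDTH,
//   3. a scalar tail for the remaining elements.
template<std::size_t WIDTH, typename BODY>
void peel_loop(std::size_t begin, std::size_t end, BODY body) {
    std::size_t i = begin;
    for (; i < end && i % WIDTH != 0; ++i) {
        body(i, std::integral_constant<std::size_t, 1>{});
    }
    for (; i + WIDTH <= end; i += WIDTH) {
        body(i, std::integral_constant<std::size_t, WIDTH>{});
    }
    for (; i < end; ++i) {
        body(i, std::integral_constant<std::size_t, 1>{});
    }
}

// y[i] += a * x[i] for all i in [begin, end).
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y,
           std::size_t begin, std::size_t end) {
    assert(x.size() == y.size());
    assert(begin <= end && end <= x.size());

    // Generic lambda (C++14): the type of "width" differs between the
    // scalar and the vector instantiation.
    peel_loop<4>(begin, end, [&](std::size_t i, auto width) {
        constexpr std::size_t w = decltype(width)::value;
        // Fixed trip count (1 or 4); a real implementation would use
        // a vector type of arity w here.
        for (std::size_t lane = 0; lane < w; ++lane) {
            y[i + lane] += a * x[i + lane];
        }
    });
}
```

Calling `saxpy(2.0f, x, y, 1, 10)` runs indices 1 to 3 and 8 to 9 as scalar iterations and indices 4 to 7 as one vector step. The loop body is written only once and stays next to the loop that uses it.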
soa_grid can now load a subset of the grid from a contiguous region of memory and store it back. This is most useful for marshaling parts of the grid for exchanging halo regions (ghost zones) when using MPI or HPX for multi-node parallelization. See this unit test for how to use this feature.
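To make the marshaling step concrete, here is a small standalone sketch of packing a rectangular region of an SoA grid into one contiguous buffer and unpacking it again. It does not use soa_grid's interface: `grid_soa`, `region`, `save_region()` and `load_region()` are hypothetical names, and the buffer layout (member by member, row by row) is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D SoA grid with two members per cell: each member
// lives in its own array. Not soa_grid's real interface.
struct grid_soa {
    std::size_t dim_x;
    std::size_t dim_y;
    std::vector<double> temp;
    std::vector<double> density;

    grid_soa(std::size_t x, std::size_t y)
        : dim_x(x), dim_y(y), temp(x * y, 0.0), density(x * y, 0.0) {}

    std::size_t index(std::size_t x, std::size_t y) const {
        return y * dim_x + x;
    }
};

// Rectangular subset of the grid, e.g. a halo region.
struct region {
    std::size_t x0, y0, width, height;
};

// Pack the region into a contiguous buffer (all temp values first,
// then all density values), ready to be sent as a single message.
std::vector<double> save_region(const grid_soa& g, const region& r) {
    assert(r.x0 + r.width <= g.dim_x);
    assert(r.y0 + r.height <= g.dim_y);
    std::vector<double> buf;
    buf.reserve(2 * r.width * r.height);
    const std::vector<double>* members[2] = { &g.temp, &g.density };
    for (const std::vector<double>* m : members)
        for (std::size_t y = 0; y < r.height; ++y)
            for (std::size_t x = 0; x < r.width; ++x)
                buf.push_back((*m)[g.index(r.x0 + x, r.y0 + y)]);
    return buf;
}

// Unpack a buffer produced by save_region() into the given region.
void load_region(grid_soa& g, const region& r,
                 const std::vector<double>& buf) {
    assert(r.x0 + r.width <= g.dim_x);
    assert(r.y0 + r.height <= g.dim_y);
    assert(buf.size() == 2 * r.width * r.height);
    std::vector<double>* members[2] = { &g.temp, &g.density };
    std::size_t k = 0;
    for (std::vector<double>* m : members)
        for (std::size_t y = 0; y < r.height; ++y)
            for (std::size_t x = 0; x < r.width; ++x)
                (*m)[g.index(r.x0 + x, r.y0 + y)] = buf[k++];
}
```

In a multi-node setting, the sender would pack its boundary cells this way and the receiver would unpack the buffer into its ghost zone, so each halo exchange needs only one contiguous message per neighbor.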
soa_array can now work with member arrays and types other than built-in types (previously no constructors and destructors were run for members of the SoA structure, which is obviously terrible).