Stencil codes are often seen as a prime example of real-world problems where vectorization can be applied easily. After all, the same operations have to be carried out for each grid cell, and many prominent kernels, e.g. the LBM (Lattice Boltzmann Method) or RTM (Reverse Time Migration), don't even contain conditionals. It is all the more surprising that compilers still struggle to generate vectorized code automatically. Or are times changing?
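For readers who haven't written one: a Jacobi iteration replaces each interior cell by the average of its neighbors, with identical, branch-free arithmetic per cell. A minimal scalar sketch (illustrative code with my own identifiers, not LibGeoDecomp's implementation) shows the shape that auto-vectorizers are supposed to like:

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep on a dimX x dimY grid with fixed boundary cells:
// every interior cell becomes the average of its four neighbors.
// The inner loop body is the same for each cell and contains no
// conditionals.
void jacobiSweep(const std::vector<double>& oldGrid,
                 std::vector<double>& newGrid,
                 std::size_t dimX, std::size_t dimY)
{
    for (std::size_t y = 1; y < dimY - 1; ++y) {
        for (std::size_t x = 1; x < dimX - 1; ++x) {
            std::size_t i = y * dimX + x;
            newGrid[i] = 0.25 * (oldGrid[i - dimX] + oldGrid[i - 1] +
                                 oldGrid[i + 1]    + oldGrid[i + dimX]);
        }
    }
}
```

In principle, nothing here should stop a compiler from emitting packed SSE/AVX instructions for the inner loop.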
JacobiCellSimple is unsurprisingly the shortest Jacobi implementation one can write using LibGeoDecomp. To evaluate the new cell API, I've added JacobiCellStraightForward. Both implement updateLine(). While the former is a fairly sophisticated 8x unrolled SSE kernel with register shuffling to avoid unaligned loads, the latter is just a rather dumb, unrolled version of JacobiCellSimple. As expected, my measurements show that the more intelligent a code is, the faster it will run. They also show that for large matrices memory bandwidth becomes the bottleneck. In this case the non-temporal stores (or streaming stores, as Intel calls them) used in JacobiCellStraightforwardNT can squeeze out a couple of extra GLUPS (giga lattice updates per second). They avoid the write allocate, i.e. reading a cache line into the cache just to overwrite it, which typically doubles the bandwidth required for writing data.