A while ago Julian, a student of mine, started working on his
master's thesis. His task is to devise means to facilitate
auto-vectorization for user-supplied kernels within LibGeoDecomp.
For a start, I wrote some variants of the 3D
Jacobi smoother
(link to the code). They're all very similar. Yet, the subtle
differences may lead to significant performance penalties, as you
can see on the right. I've plotted matrix size vs. kernel
performance measured in
GLUPS (giga lattice updates per second). My excuse for the
awkward names of the kernels: they're named that way for
historical reasons.
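
To make the terms concrete, here is a minimal sketch of a 3D Jacobi sweep and of how a GLUPS figure can be computed from a timing. It is plain C++ and does not use the LibGeoDecomp API. The names `idx`, `jacobiSweep` and `glups` are hypothetical, and the 7-point stencil with equal weights is an assumption about what the kernels compute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LibGeoDecomp): flat index into an n*n*n grid.
inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z, std::size_t n)
{
    return (z * n + y) * n + x;
}

// One Jacobi sweep over the interior cells; boundary cells are left untouched.
// Assumption: each cell becomes the average of itself and its six neighbors.
void jacobiSweep(const std::vector<double>& oldGrid, std::vector<double>& newGrid, std::size_t n)
{
    assert(oldGrid.size() == n * n * n);
    assert(newGrid.size() == n * n * n);

    for (std::size_t z = 1; z + 1 < n; ++z) {
        for (std::size_t y = 1; y + 1 < n; ++y) {
            for (std::size_t x = 1; x + 1 < n; ++x) {
                newGrid[idx(x, y, z, n)] =
                    (oldGrid[idx(x, y, z - 1, n)] +
                     oldGrid[idx(x, y - 1, z, n)] +
                     oldGrid[idx(x - 1, y, z, n)] +
                     oldGrid[idx(x, y, z, n)] +
                     oldGrid[idx(x + 1, y, z, n)] +
                     oldGrid[idx(x, y + 1, z, n)] +
                     oldGrid[idx(x, y, z + 1, n)]) * (1.0 / 7.0);
            }
        }
    }
}

// GLUPS: number of cell updates per second, in units of 10^9.
double glups(std::size_t updatedCells, std::size_t steps, double seconds)
{
    return static_cast<double>(updatedCells) * static_cast<double>(steps) / seconds * 1e-9;
}
```

A benchmark would time a number of calls to `jacobiSweep` (swapping the two grids in between) and pass the count of updated cells, the number of sweeps and the elapsed seconds to `glups`.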
JacobiCellSimple is unsurprisingly
the shortest Jacobi implementation one can write using
LibGeoDecomp. To evaluate the new cell API, I've added
JacobiCellStreakUpdate and
JacobiCellStraightForward. Both implement
updateLine(). While the former is a fairly
sophisticated 8x unrolled SSE kernel with register shuffling to
avoid unaligned loads, the latter is just a rather dumb, unrolled
version of
JacobiCellSimple. As expected, my
measurements show that the more carefully optimized a kernel is,
the faster it runs. They also show that for large matrices memory
bandwidth becomes the bottleneck. In this case non-temporal stores
(or
streaming stores, as Intel calls them) used in
JacobiCellStraightforwardNT can squeeze out a couple
of extra GLUPS. They avoid the
write allocate (loading a cache line from memory before it is
overwritten), which typically
doubles the required bandwidth for writing data.
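
For illustration, here is a minimal sketch of what a streaming store looks like with SSE2 intrinsics. This is not the code of JacobiCellStraightforwardNT. The function `scaleStreaming` is a hypothetical stand-in that just scales an array, and it falls back to ordinary stores if SSE2 is unavailable or the destination is not 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical example kernel: dst[i] = src[i] * factor.
// Where possible, results are written with non-temporal (streaming) stores,
// which go to memory without first loading the target cache line.
void scaleStreaming(const double* src, double* dst, std::size_t n, double factor)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;

#if defined(__SSE2__)
    // _mm_stream_pd requires a 16-byte aligned destination address.
    if (reinterpret_cast<std::uintptr_t>(dst) % 16 == 0) {
        const __m128d f = _mm_set1_pd(factor);
        for (; i + 2 <= n; i += 2) {
            // Unaligned load, so src needs no particular alignment.
            const __m128d v = _mm_mul_pd(_mm_loadu_pd(src + i), f);
            _mm_stream_pd(dst + i, v);
        }
        // Make the streaming stores globally visible before returning.
        _mm_sfence();
    }
#endif

    // Remainder loop; also the fallback path with ordinary stores.
    for (; i < n; ++i) {
        dst[i] = src[i] * factor;
    }
}
```

Streaming stores only pay off if the written data is not read again soon, since they bypass the cache. That fits the large matrices above, where the output grid no longer fits into the cache anyway.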