I first noticed that something was odd when I ran the first scaling tests on LiMa and could only get past a certain number of cores if the working set size per node was reduced -- otherwise the job would indefinitely hang during initialization. The bug only occurred when running on many cores and it took almost forever to get the jobs through LiMa's queue, so waiting to attach a debugger was out of the question. As my code passed all unit tests, I decided to blame this on MPI or a quirk of the machine and decided to move on. Matters became a bit more dramatic as I discovered the evening before our reservation on Tsubame that my job would consume excessive amounts of memory for jobs with more than 30 GPUs -- while I was prepping to run on >1000 GPUs. (BTW: I'll post the results from Tsubame soon, I just need to wait a bit until they're published.) The simulation model on Tsubame used much smaller simulation cells, which is why I could run larger grids and why the bug was triggered for smaller instances. Also, the nodes did not enter thrashing, but immediately crashed because of Tsubame's tighter restrictions on memory size. This allowed my to dig deeper into the code and isolate the Region class as the delinquent. While I could write a quit workaround for Tsubame, the problem itself did remain unsolved until today, 12:53 GMT. This is how it went: I first wrote a test case to measure memory consumption depending on the domain size n. It turned out that the original Region slurped a generous amount of 160 bytes per row (Streak) of consecutive coordinates -- much more than the approximately 32 bytes I was expecting. We need to store multiple Region instances per process. Together these may easily block more memory than the grid itself. Big O doesn't help us here, as the amount of memory per core did remain relatively stable at 2 GB per core in the past years and the grids can't outgrow geometry handling. The constants are obviously too large.
| # | Test Family | Dimensions | Performance |
|---|---|---|---|
| 1 | RegionInsert | (128, 128, 128) | 0.002968 s |
| 2 | RegionInsert | (512, 512, 512) | 0.052137 s |
| 3 | RegionInsert | (2048, 2048, 2048) | 0.960965 s |
| 4 | RegionIntersect | (128, 128, 128) | 0.023085 s |
| 5 | RegionIntersect | (512, 512, 512) | 0.395194 s |
| 6 | RegionIntersect | (2048, 2048, 2048) | 6.81856 s |
So what's the solution? The original Region stores parts of the information redundantly. Also, it uses std::map, ergo a tree with lots and lots of pointers, which take up more precious RAM. A reimplementation was imperative. Replacing a core class of the library made me feel uneasy and it took a some days to get it right/debugged.
The new design strips all redundant storage and uses only three flat arrays (one per spatial dimension), interlinked via indices. The drawback of this is that we have to update the indices in large parts of the arrays on every insert and deletion of streaks. The wost case complexity went up from O(log n) to O(n²). Really really bad. But wait? What about these constants? Surely it's not that bad, right? At this point I have to say that I'm extraordinarily happy to have both, unit tests and performance tests in place. In this case the performance tests measure inserts and intersection, the two most common operations for the Region, iteration aside. The results however were rather mixed. The first table contains a list of the results prior to the reimplementation (revision 350:d14ec73a10cd, tested with GCC 4.6.3 on a Intel Core i7-2600K)...
The next table lists the timing and relative speedup of the new Region implementation, revision 351:cc0339630cdf. Inserts (test 1-3) look pretty good. This is because the tests insert only at the end of the Region, which yields (even with the new implementation) a much better insert complexity of O(log n) compared to the worst case. It is 4-5x faster than the original code as it usually doesn't need to costly allocate new memory. Contrary, the intersection (tests 4-6) are prohibitively slow. The intersection operator &was implemented as a & b == b - (b - a)) and subtract needs to insert/delete streaks at the beginning of the arrays as it operates in place, thus O(n²). Naturally I had to reimplement this operator, too. The new algorithm traverses both regions and inserts matching streaks to a new array. And inserts are fast. The last table was measured with revision 352:112f4d2ad22d. The best thing about the new intersection operator is that it's not only not slower, but almost 10x faster. Incidentally the code now allocates just 16 bytes per Streak, also a 10x improvement. Not bad.
| # | Test Family | Dimensions | Performance | Speedup |
|---|---|---|---|---|
| 1 | RegionInsert | (128, 128, 128) | 0.000591 s | 5.022 |
| 2 | RegionInsert | (512, 512, 512) | 0.011337 s | 4.599 |
| 3 | RegionInsert | (2048, 2048, 2048) | 0.213258 s | 4.506 |
| 4 | RegionIntersect | (128, 128, 128) | 0.484808 s | 0.048 |
| 5 | RegionIntersect | (512, 512, 512) | 149.2 s | 0.003 |
| 6 | RegionIntersect | (2048, 2048, 2048) | eternity? | zero? |
| # | Test Family | Dimensions | Performance | Speedup |
|---|---|---|---|---|
| 1 | RegionInsert | (128, 128, 128) | 0.000646 s | 4.4257 |
| 2 | RegionInsert | (512, 512, 512) | 0.012235 s | 4.23098 |
| 3 | RegionInsert | (2048, 2048, 2048) | 0.227063 s | 4.19547 |
| 4 | RegionIntersect | (128, 128, 128) | 0.00203 s | 11.3015 |
| 5 | RegionIntersect | (512, 512, 512) | 0.039035 s | 10.0636 |
| 6 | RegionIntersect | (2048, 2048, 2048) | 0.718114 s | 9.39575 |