Recently the question popped up on Slashdot whether you can use LibGeoDecomp from Fortran. I took a look at it and, to cut a long story short, the answer is: yes, it's surprisingly easy, but you lose some features. Here is the long story:

I never had much exposure to Fortran as all of our users roll C++ codes. But a lot of scientific codes are written in Fortran, and developers presumably agonize over how these could be ported to state-of-the-art HPC resources. One of the design goals of LibGeoDecomp was to make the library well suited for interfacing with legacy codes -- rewriting the complete application is something you just cannot ask of a user.

My idea for this post was to take a simple 3rd party program and try to marry it with LibGeoDecomp while preserving as much as possible of its original code and structure. The Hello World equivalent for computer simulations is Conway's Game of Life. My choice fell on this code, kindly provided by KTH, Sweden. The code has several advantages:

  • Well written and simple; no cryptic code or complexity diverts attention from the real problem at hand: interfacing with LibGeoDecomp.
  • 3rd party code: so I can't cheat by conjuring up a Fortran code which makes life easy for me.
  • It's a sequential code; it has not been prepared for parallelization. Just like most legacy codes.

Of course one could argue that Game of Life is too simple and that real codes would not be as easy to port. And that's true. But this post is really just a primer to get people started. Feel free to post on our mailing list if you need help.

You can grab the complete source used for this post here: hg clone https://bitbucket.org/gentryx/game_of_life

Example: Conway's Game of Life


1. Understanding the Original Code

The original code can be split into three phases:

  1. initialization with random values (lines 20-25)
  2. evolution (lines 29-75, temporal loop with nested spatial loops)
  3. output (lines 79-83, count number of living cells)

What's interesting is that the original code (reproduced at the end of this post) handles the boundary cases separately: cells on the edges are copied over to their counterparts on the opposite edges. This creates a torus topology: the left and right boundaries can be imagined as being glued together, which transforms the 2D simulation plane into a pipe. Gluing together the top and bottom edges transforms the pipe (topologically) into a doughnut.

Fig. 1: Torus Topology
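
For readers who prefer code over doughnuts, the following minimal C++ snippet (my own illustration, independent of both the original program and LibGeoDecomp) shows what torus-style neighbor addressing boils down to: every coordinate is wrapped around the grid edges, which is exactly what the ghost-cell copies in the original code emulate.

#include <cstdio>

// Wrap a coordinate into [0, extent), so that -1 maps to extent - 1 and
// extent maps to 0. A torus topology means every neighbor access is
// implicitly wrapped like this; the original Fortran code gets the same
// effect by copying edge cells into ghost cells before each update.
int wrap(int coord, int extent)
{
    return ((coord % extent) + extent) % extent;
}

int main()
{
    const int ni = 2000;
    std::printf("%d %d\n", wrap(-1, ni), wrap(ni, ni)); // prints "1999 0"
    return 0;
}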

2. Porting It

The basic idea behind LibGeoDecomp is that it takes over grid storage and controls which cells get updated when. That's not much, yet it allows the library to add features such as multi-node parallelization and parallel IO. We can map the three phases of the sequential code mentioned above to components of LibGeoDecomp:

  1. An Initializer will set up the grid before the simulation starts.
  2. The Simulator controls the temporal and spatial loops.
  3. A Writer handles output. It will be periodically notified by the Simulator.

If you've checked out the repository then you'll have noticed that the parallel version comes with a couple of additional files. These files mainly act as the glue between the Fortran kernel and the C++ library. Fig. 2 illustrates the workflow of the ported code. Here is how it works:

  • We can't instantiate the components directly in the Fortran program -- but we can call a C function which will do this for us. This function is called simulate() and lives in wrapper.cpp.
  • The Initializer and Simulator will need a way to call back the relevant portions of the Fortran code, which is why I've packaged those in two subroutines, initialize(), which sets the initial state of a single cell, and update(), which is responsible for the evolution of a row of cells.
  • update() in turn will need to access data stored within LibGeoDecomp. I've added macros named LGD_GET() and LGD_SET() for reading/writing cell data.

The code in wrapper.cpp and libgeodecomp.F90 is fairly generic and could be moved into the library, should there be more interest in using it with Fortran apps. Decomposing the Fortran code into appropriate subroutines would then be the only task left to the user.

Fig. 2: Simulation Callgraph
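
To make this workflow more tangible, here is a rough sketch of how such a wiring could look on the C++ side. To be clear: this is not the actual content of wrapper.cpp. The class name CellInitializer, the trailing underscores on the symbol names (gfortran's default name mangling) and the choice of SerialSimulator (instead of one of LibGeoDecomp's MPI-enabled Simulators, which the parallel runs below require) are all assumptions I'm making for illustration.

#include <libgeodecomp.h>

using namespace LibGeoDecomp;

// Fortran-side callback which sets the initial state of a single cell.
// The trailing underscore is an assumption (gfortran's default mangling);
// the real wrapper may rely on iso_c_binding instead.
extern "C" void initialize_(int *target_cell, int *x, int *y);

// Sets up the grid before the simulation starts by asking the Fortran
// subroutine initialize() for the state of every cell. Cell is the class
// discussed in section 3 below.
class CellInitializer : public SimpleInitializer<Cell>
{
public:
    CellInitializer(int ni, int nj, int steps) :
        SimpleInitializer<Cell>(Coord<2>(ni, nj), steps)
    {}

    virtual void grid(GridBase<Cell, 2> *target)
    {
        CoordBox<2> box = target->boundingBox();
        for (CoordBox<2>::Iterator i = box.begin(); i != box.end(); ++i) {
            int alive;
            int x = i->x();
            int y = i->y();
            initialize_(&alive, &x, &y);
            target->set(*i, Cell(alive));
        }
    }
};

// Entry point called by the Fortran main program. A serial Simulator is
// used here for brevity; the actual port needs an MPI-capable one to
// scale across nodes.
extern "C" void simulate_(int *ni, int *nj, int *nsteps, int *io_period)
{
    SerialSimulator<Cell> sim(new CellInitializer(*ni, *nj, *nsteps));
    // io_period would control how often Writers get notified -- see the
    // writer sketch near the end of this post.
    sim.run();
}

The real wrapper additionally has to bridge update() and the LGD_GET()/LGD_SET() macros, which I won't try to reproduce here.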

3. How Does the Wrapper Work?

The wrapper was meant to be generic, but since this is the first time I've written such a piece of code, it's not. A quick examination might be interesting for those wishing to reuse it. To facilitate the two-way callback from C++ to Fortran (for the update) and back to C++ (for reading/writing data), I've defined the class Cell shown below.

class Cell
{
public:
    class API :
        public APITraits::HasTorusTopology<2>,
        public APITraits::HasUpdateLineX,
        public APITraits::HasFixedCoordsOnlyUpdate
    {};

    inline Cell(int alive = 0) :
        alive(alive)
    {}

    template<typename COORD_MAP>
    void update(const COORD_MAP& hood, const unsigned& nanoStep)
    { ... }

    template<typename NEIGHBORHOOD>
    static void updateLineX(Cell *targetLine, long *index, long indexEnd, const NEIGHBORHOOD& hood, const unsigned& nanoStep)
    { ... }

    int alive;
};

The class Cell::API inherits from a couple of classes defined in LibGeoDecomp's APITraits. These type traits are used as flags which let the library discover which features the user code supports. The code above tells the library that the cell requests a 2D torus topology: accesses to neighboring cells will automatically be wrapped around the edges of the grid, which makes user code much shorter and easier to read and maintain.

It also specifies that Cell has a function updateLineX() which will update a whole row of cells -- as opposed to update(), which operates on a single cell. It's more efficient to update multiple cells in one go, so this is a good thing. And it doesn't add much complexity, since the original code was capable of this already. Finally, the API specifies that the update routines will only use a certain type (FixedCoord) to address neighbors. At this point it's not really important how this works; what is important is that it gives us access to some compile-time optimizations.

Many of the performance benefits in LibGeoDecomp come from using templates as a code generator. And that's the downside of using Fortran: templates don't cross the language barrier. Without templates within the kernel, LibGeoDecomp can't apply all of its optimizations.
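
To make the last two traits more concrete, here is a sketch of what the elided bodies could look like if the cell logic were written directly in C++ instead of being forwarded to the Fortran kernel. This is a plain C++ Game of Life cell, not what the wrapper actually does; it also assumes that the neighborhood object tracks *index internally, which is why the index is passed as a pointer.

// These would replace the elided bodies inside class Cell:

template<typename COORD_MAP>
void update(const COORD_MAP& hood, const unsigned& /* nanoStep */)
{
    // FixedCoord encodes the neighbor offset in the type, so the library
    // can resolve these accesses at compile time:
    int sum =
        hood[FixedCoord<-1, -1>()].alive + hood[FixedCoord<0, -1>()].alive + hood[FixedCoord<1, -1>()].alive +
        hood[FixedCoord<-1,  0>()].alive +                                   hood[FixedCoord<1,  0>()].alive +
        hood[FixedCoord<-1,  1>()].alive + hood[FixedCoord<0,  1>()].alive + hood[FixedCoord<1,  1>()].alive;

    alive = (sum == 3) || ((sum == 2) && hood[FixedCoord<0, 0>()].alive);
}

template<typename NEIGHBORHOOD>
static void updateLineX(Cell *targetLine, long *index, long indexEnd,
                        const NEIGHBORHOOD& hood, const unsigned& nanoStep)
{
    // advance a whole row of cells in one call
    for (; *index < indexEnd; ++(*index)) {
        targetLine[*index].update(hood, nanoStep);
    }
}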

4. Performance Evaluation

Game of Life is probably not the kernel that comes first to mind when thinking about HPC benchmarks. But LibGeoDecomp is an HPC library after all, so let's take a look. I ran both programs on my notebook, which has an Intel Core i7-3610QM (Ivy Bridge quad-core at 2.3 GHz) at its heart. The table below gives the measured run times for a grid of 2000x2000 cells and 500 time steps:

Cores  Program                Time  Speedup
1      game_of_life-serial    8.0s  1.0
1      game_of_life-parallel  6.9s  1.2
2      game_of_life-parallel  4.9s  1.6
3      game_of_life-parallel  3.8s  2.1
4      game_of_life-parallel  2.9s  2.8

What can we observe? First of all: yay, we get a speedup -- even with just one core. Why is that? The original code copies the new state back to the old grid after every step (line 73). That may be convenient in Fortran, but it's also needlessly slow. LibGeoDecomp simply swaps pointers internally.

But what's more interesting is that adding more cores (e.g. by running mpirun -np 4 ./game_of_life-parallel) further reduces the run time. The code by no means scales perfectly; 4 cores don't give us a speedup of 4. This is because the matrix size is really small, and we would probably have gotten better results using OpenMP instead of MPI. So, why did I choose MPI in the first place? Because it's usually much harder to embrace and requires more modifications to the code. With LibGeoDecomp it doesn't matter which type of parallelism the Simulator exploits: it's (almost) all hidden in the library. If we wanted to run this example with really huge matrices (e.g. 1M x 1M cells) on a cluster, we could directly go ahead, no further code modifications required.
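
The difference between the two update strategies is easy to illustrate. The snippet below is not LibGeoDecomp code, just a minimal sketch using two plain std::vector grids:

#include <utility>
#include <vector>

// Double buffering: oldGrid holds the current time step, newGrid receives
// the next one (just like the two arrays in the Fortran code).
std::vector<int> oldGrid(2000 * 2000);
std::vector<int> newGrid(2000 * 2000);

void finishStepByCopy()
{
    // what "old = new" does: touches every single cell again, O(ni * nj)
    oldGrid = newGrid;
}

void finishStepBySwap()
{
    // what LibGeoDecomp does internally: exchanges the buffers, O(1)
    std::swap(oldGrid, newGrid);
}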

5. Summary

In this longer-than-expected post I've shown how you can use LibGeoDecomp to parallelize legacy Fortran codes. What's the gist of it?

  • shorter, faster code
  • scales on clusters (supercomputers even)
  • convenient to add IO
  • right now: no CUDA, sadly

Chances are that the port will benefit performance even on notebooks and workstations, but what's better is that it will also run on much larger systems. Apart from that, it's interesting to note that the resulting Fortran program is actually shorter than the original code (I'm discounting the wrapper code, as this should be moved into the library). You also get access to LibGeoDecomp's IO infrastructure: with just a couple of lines of code I've added the PPMWriter to the simulation (in wrapper.cpp), which can render the matrix to PPM image files.
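
For reference, registering output could look roughly like the helper below, meant to be called from the hypothetical simulate_() sketched in section 2, before sim.run(). addWriter(), TracingWriter and PPMWriter do exist in LibGeoDecomp, but the exact PPMWriter constructor arguments shown here (member pointer, value range, file prefix, output period, pixels per cell) are assumptions -- check the library headers or the repository for the real ones.

#include <libgeodecomp.h>

using namespace LibGeoDecomp;

// Hypothetical helper: register output plug-ins with the Simulator.
void addOutput(SerialSimulator<Cell>& sim, int ioPeriod, int nsteps)
{
    if (ioPeriod == 0) {
        return; // io_period = 0 switches IO off, as in the Fortran main program
    }

    // prints progress information to stdout
    sim.addWriter(new TracingWriter<Cell>(ioPeriod, nsteps));

    // renders Cell::alive to a series of PPM image files
    sim.addWriter(new PPMWriter<Cell>(
        &Cell::alive, 0.0, 1.0, "game_of_life", ioPeriod, Coord<2>(4, 4)));
}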

P.S.: Why no CUDA?

What's a bit sad though is that we can't yet interface with CUDA. In C++ a user would simply choose a CUDA-capable Simulator and flag his update function with __host__ __device__ to make nvcc compile the code for both the CPU and the GPU. For Fortran there is no counterpart to nvcc. PGI has CUDA Fortran; we could probably build something on top of that to bring Fortran kernels together with LibGeoDecomp on NVIDIA GPUs. But right now only a few people have access to the PGI compiler suite. Perhaps this will change in the aftermath of NVIDIA's acquisition of PGI. Another avenue would be to convert Fortran to C, either via f2c or with LLVM's C backend, and to paste the result into a C++ class.

I've also toyed with this idea. In my trials f2c could not parse my Fortran code and LLVM lacks a Fortran frontend. Any ideas welcome.

game_of_life-serial.f90 -- The Original Code

!----------------------
!  Conway Game of Life
!    serial version
!----------------------

program life

  implicit none
  integer, parameter :: ni=2000, nj=2000, nsteps = 500
  integer :: i, j, n, im, ip, jm, jp, nsum, isum
  integer, allocatable, dimension(:,:) :: old, new
  real :: arand

  ! allocate arrays, including room for ghost cells

  allocate(old(0:ni+1,0:nj+1), new(0:ni+1,0:nj+1))

  ! initialize elements of old to 0 or 1

  do j = 1, nj
     do i = 1, ni
        call random_number(arand)
        old(i,j) = nint(arand)
     enddo
  enddo

  !  iterate

  time_iteration: do n = 1, nsteps

     ! corner boundary conditions

     old(0,0) = old(ni,nj)
     old(0,nj+1) = old(ni,1)
     old(ni+1,nj+1) = old(1,1)
     old(ni+1,0) = old(1,nj)

     ! left-right boundary conditions

     old(1:ni,0) = old(1:ni,nj)
     old(1:ni,nj+1) = old(1:ni,1)

     ! top-bottom boundary conditions

     old(0,1:nj) = old(ni,1:nj)
     old(ni+1,1:nj) = old(1,1:nj)

     do j = 1, nj
        do i = 1, ni

           im = i - 1
           ip = i + 1
           jm = j - 1
           jp = j + 1
           nsum = old(im,jp) + old(i,jp) + old(ip,jp) &
                + old(im,j )             + old(ip,j ) &
                + old(im,jm) + old(i,jm) + old(ip,jm)

           select case (nsum)
           case (3)
              new(i,j) = 1
           case (2)
              new(i,j) = old(i,j)
           case default
              new(i,j) = 0
           end select

        enddo
     enddo

     ! copy new state into old state

     old = new

  enddo time_iteration

  ! Iterations are done; sum the number of live cells

  isum = sum(new(1:ni,1:nj))

  ! Print final number of live cells.

  write(*,"(/'Number of live cells = ', i6/)") isum

  deallocate(old, new)

end program life

game_of_life-parallel.F90 -- The LibGeoDecomp-enabled Code

!----------------------
!  Conway Game of Life
!    parallel version
!----------------------

#include "libgeodecomp.F90"

subroutine initialize(target_cell, x, y)
  use, intrinsic :: iso_c_binding
  implicit none

  integer (c_int), intent (out) :: target_cell
  integer (c_int), intent (in) :: x
  integer (c_int), intent (in) :: y
  real :: arand

  call random_number(arand)
  target_cell = nint(arand)
end

subroutine update(LIBGEODECOMP_UPDATE_ARGS)
#include "libgeodecomp_declare_args.F90"

  integer :: x
  integer :: nsum

  do x = start_x, end_x
     nsum = LGD_GET(x, NW) + LGD_GET(x, N) + LGD_GET(x, NE) &
          + LGD_GET(x,  W)                 + LGD_GET(x,  E) &
          + LGD_GET(x, SW) + LGD_GET(x, S) + LGD_GET(x, SE)

     select case (nsum)
     case (3)
        LGD_SET(x) = 1
     case (2)
        LGD_SET(x) = LGD_GET(x, C)
     case default
        LGD_SET(x) = 0
     end select
  enddo
end subroutine update

program life

  implicit none
  integer, parameter :: ni=2000, nj=2000, nsteps = 500
  integer :: i, j, n, im, ip, jm, jp, nsum, isum
  integer, allocatable, dimension(:,:) :: old, new
  real :: arand

  ! "io_period" is used to control IO (for tracing execution time and
  ! printing to PPM image files).
  !
  ! "io_period = 0" to switch off
  ! "io_period = 5" to output every fifth step
  integer, parameter :: io_period = 1

  call simulate(ni, nj, nsteps, io_period)

end program life