Barnes-Hut: An Implementation and Study


Discussion of Implementation Difficulties

Our initial implementation of Barnes-Hut was relatively naive, but just getting it to work under Split-C was a important first accomplishment. Our initial efforts to port the SPLASH shared-memory code from [5] and [9] met with serious problems. First of all, the code was obviously converted from FORTRAN source, and then extensively modified, making the code rather less legible than we hoped. Additionally, the shared-memory paradigm hid where explicit communication was taking place, and the code did not indicate where it was expected. Our choice to rewrite the code in its entirety, seems, in hindsight, like the right choice.

Why are small problems and short runs bad?

Small problems and short runs are uncharacteristic of the actual runs that astrophysicists want. This is very clear from the literature. Warren and Salmon run over 17 million particles [10]; Hu and Johnson run 10 million particles [4]. These are gigantic runs on large machines. Furthermore, they are performing hundreds, thousands, or more timesteps in each of their runs. Even when astrophysicists do small runs (64-128k particles), they still perform hundred or more timesteps. Therefore, the question becomes: Are small problems and short runs representative of large problems and long runs?

Small problems != Large problems

The Flash and Dash multiprocessors have local cache on each node. This cache is about 1 Mb. Unfortunately, this means that during a run of 16k particles, it is likely that all of the particles and all of the trees (along with the cached subsections) will fit entirely into the cache. Our measurements show that 16k particles will cause about 128k of tree-cache to be used per node on a 32 node machine. In fact, on a 32 node machine, 256k particles only use about 700-1,400 k of memory to cache the trees. (This is with load balancing) Therefore, as problem size increases, there will come a point where the particles and tree don't entirely fit in cache. At this point, the program will start to run slower. As the total required memory continues to increase, the cache will become more and more useless. For this reason it is important to run the largest possible problem on a machine to see how it behaves under stress. It is also interesting to report this number with regard to the amount of memory per node and the number of nodes. For our implementation, we can run about 2 million particles on a 32-node CM5 with 32 Mb of memory per node before we run into problems with running out of memory. At this size, the machine is using about 8 megs of memory per node to cache the tree.

Short runs != Large runs

It should be very clear from our pictures that over a large number of time-steps, the distribution of the particles changes substantially. Warren and Salmon [12,13] reported for their runs starting from a smooth distribution that they changed the error bounds because after many timesteps the particles had become very clumped, changing the performance and accuracy of their implementation. These two examples show that performing one or two timesteps is not likely to show the extended behavior of the implementation. We measured differences of 50-100% in the total time for each timestep over the thousands of timesteps we performed to generate our pictures. Furthermore, in the long run, particles will migrate, causing implementations which cannot account for this to perform more poorly.

Return to