Barnes-Hut: An Implementation and Study
Discussion of Implementation Difficulties
Our initial implementation of Barnes-Hut was relatively naive,
but just getting it to work under Split-C was a important
first accomplishment. Our initial efforts to port the SPLASH
shared-memory code from  and  met with serious problems. First of
all, the code was obviously converted from FORTRAN source, and then
extensively modified, making the code rather less legible than we hoped.
Additionally, the shared-memory paradigm hid where explicit communication
was taking place, and the code did not indicate where it was expected.
Our choice to rewrite the code in its entirety, seems, in hindsight,
like the right choice.
Why are small problems and short runs bad?
Small problems and short runs are uncharacteristic of the actual runs
that astrophysicists want. This is very clear from the literature.
Warren and Salmon run over 17 million particles ; Hu and Johnson run
10 million particles . These are gigantic runs on large machines.
Furthermore, they are performing hundreds, thousands, or more timesteps
in each of their runs. Even when astrophysicists do small runs (64-128k
particles), they still perform hundred or more timesteps. Therefore, the
question becomes: Are small problems and short runs representative of
large problems and long runs?
Small problems != Large problems
The Flash and Dash multiprocessors have local cache on each node. This
cache is about 1 Mb. Unfortunately, this means that during a run of 16k
particles, it is likely that all of the particles and all of the trees
(along with the cached subsections) will fit entirely into the cache.
Our measurements show that 16k particles will cause about 128k of
tree-cache to be used per node on a 32 node machine. In fact, on a 32
node machine, 256k particles only use about 700-1,400 k of memory to
cache the trees. (This is with load balancing) Therefore, as problem
size increases, there will come a point where the particles and tree
don't entirely fit in cache. At this point, the program will start
to run slower. As the total required memory continues to increase,
the cache will become more and more useless. For this reason it is
important to run the largest possible problem on a machine to see how
it behaves under stress. It is also interesting to report this number
with regard to the amount of memory per node and the number of nodes.
For our implementation, we can run about 2 million particles on a
32-node CM5 with 32 Mb of memory per node before we run into problems
with running out of memory. At this size, the machine is using about
8 megs of memory per node to cache the tree.
Short runs != Large runs
It should be very clear from our pictures that
over a large number of time-steps, the distribution of the particles
changes substantially. Warren and Salmon [12,13] reported for their
runs starting from a smooth distribution that they changed the error
bounds because after many timesteps the particles had become very clumped,
changing the performance and accuracy of their implementation. These two
examples show that performing one or two timesteps is not likely to show
the extended behavior of the implementation. We measured differences
of 50-100% in the total time for each timestep over the thousands of
timesteps we performed to generate our pictures. Furthermore, in the
long run, particles will migrate, causing implementations which cannot
account for this to perform more poorly.