## Barnes-Hut: An Implementation and Study

### Our Load-Balancing Implementation

In an effort to allow us to have more particles in our runs, and to improve
the performance of our implementation, we implemented a very simple load
balancing scheme. We build the tree, and we then have each processor walk
the tree, copying its "subset" of the bodies to a duplicate array. Each
processor then exchanges the real bodies for this copy, and the algorithm
proceeds as before.

Unfortunately, because of the caching algorithm,
we can run out of memory trying to build the initial tree for the bodies.
For this reason, our implementation will build subsets of the particles
into a tree, and exchange the subsets. It will slowly increase the
size of the subsets it is building into trees relying on the fact
that balancing the smaller subsets will improve the balance of the
larger subsets. We find in practice this works very well, and allows
us to run with much larger data sets. For the random sphere surface
distribution, we can only get 512k particles without load balancing.
With load balancing, we can get 2M particles. The performance increases
are also substantial. Load balancing roughly doubles the performance
of tree build, and quadruples the performance of tree-walk for the
random-sphere-surface distribution. Unfortunately, our load-balancing
doesn't balance the work in tree-walk, and hence the percentage of
imbalance in our algorithm when the total time has been decreased.

Continue to
Results

-----

maintained by <hodes@cs.berkeley.edu>