Barnes-Hut: An Implementation and Study


Space-Separated Tree Building

The tree-build phase of Barnes-Hut is particularly interesting because, although it accounts for only a trivial amount of time compared to the force calculations, it drives the force calculations by exposing locality and pruning away unnecessary interactions. Tree-building algorithms are also interesting because they generate space partitions of sets of points -- a more general result that could be used to solve other types of problems.

This page details four different tree-building schemes:

  • Single-processor
  • Parallel ``AtomicProc''
  • Parallel Merge
  • Parallel Hashed

A comparison of the time/space properties of the algorithms we implemented (Single-node, AtomicProc, and Hashed) is in the section on results. The space complexity is particularly important to us, since it can limit the number of particles that can be simulated.

Single-processor:

Our first implementation was a version where a single node builds the tree in local memory:

fun InsertBody(global root, body) =
   leaf := FindMyLeaf(root, body)
   if (leaf_is_full(leaf))
      new_leaf := SplitLeaf(leaf, body)
      InsertBody(new_leaf, body)
   else
      InsertToLeaf(leaf, body)
end
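
For concreteness, here is a minimal sequential sketch of this insertion in C; all names are illustrative, and the fixed leaf capacity LEAF_CAP is a parameter (the classic algorithm stores one body per leaf, while our AtomicProc variant below allows eight).

#include <stdlib.h>

#define LEAF_CAP 8                      /* illustrative; 1 in the classic build */

typedef struct { double x, y, z; } Body;

typedef struct Node {
    int is_leaf;
    double cx, cy, cz, half;            /* cell center and half-width */
    struct Node *child[8];              /* used when internal         */
    Body *bodies[LEAF_CAP];             /* used when leaf             */
    int nbodies;
} Node;

/* Octant of b relative to the cell center: one bit per axis. */
static int octant(const Node *n, const Body *b) {
    return (b->x > n->cx) | ((b->y > n->cy) << 1) | ((b->z > n->cz) << 2);
}

static Node *make_cell(double cx, double cy, double cz, double half) {
    Node *n = calloc(1, sizeof *n);
    n->is_leaf = 1;
    n->cx = cx; n->cy = cy; n->cz = cz; n->half = half;
    return n;
}

/* Lazily create the child cell covering the given octant. */
static Node *child_cell(Node *n, int oct) {
    if (!n->child[oct]) {
        double h = n->half / 2;
        n->child[oct] = make_cell(n->cx + ((oct & 1) ? h : -h),
                                  n->cy + ((oct & 2) ? h : -h),
                                  n->cz + ((oct & 4) ? h : -h), h);
    }
    return n->child[oct];
}

/* Insert a body, splitting full leaves into internal nodes as needed. */
void insert_body(Node *n, Body *b) {
    while (!n->is_leaf)                  /* FindMyLeaf: descend to a leaf  */
        n = child_cell(n, octant(n, b));
    if (n->nbodies < LEAF_CAP) {         /* room left: InsertToLeaf        */
        n->bodies[n->nbodies++] = b;
        return;
    }
    n->is_leaf = 0;                      /* SplitLeaf: push residents down */
    for (int i = 0; i < n->nbodies; i++)
        insert_body(child_cell(n, octant(n, n->bodies[i])), n->bodies[i]);
    n->nbodies = 0;
    insert_body(n, b);                   /* retry one level down           */
}

When two bodies lie at nearly identical points, the split in insert_body recurses until they separate, which is exactly the behavior bounded by the perturbation constant in the note below.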
Because the data structure in this case must reside in a single processor's memory, the problem size was limited by the tree size. Moreover, not only was the data structure limited by a single processor's memory, but our implementation was also originally limited by that processor's cache; without a software caching method to reduce communication bandwidth and mask latency, the full cost of a global read was paid:

  • a minimum of once for each insertion, plus...
  • some multiple of eight times when a leaf splits and becomes an internal node.

Note that when the partition is a space-partition and two bodies lie at nearly identical points, this multiple of eight is bounded only by the minimum perturbation constant or by the floating-point precision: the split must recurse until the bodies fall into separate cells.

Once software caching was added, the extreme communication costs were greatly reduced, but at the expense of additional memory, and therefore of maximum problem size.
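
The caching layer can be sketched as a small read-through table in C; everything here (the fetch_remote primitive, the direct-mapped layout, the line size) is an illustrative assumption rather than our actual caching code.

#include <string.h>

#define CACHE_SLOTS 1024
#define LINE_BYTES  64

typedef unsigned long gptr;    /* an opaque global address (assumed) */

/* Hypothetical primitive: one full-cost remote read. */
extern void fetch_remote(gptr p, void *dst, int len);

typedef struct { gptr tag; char data[LINE_BYTES]; } Slot;
static Slot cache[CACHE_SLOTS];

/* Read-through cache: each remote tree node is fetched at most once
 * between flushes, at the price of CACHE_SLOTS * LINE_BYTES of local
 * memory taken away from the problem itself. */
void *cached_read(gptr p, int len) {
    Slot *s = &cache[(p / LINE_BYTES) % CACHE_SLOTS];
    if (s->tag != p) {                 /* miss: pay the remote round trip */
        fetch_remote(p, s->data, len);
        s->tag = p;
    }
    return s->data;                    /* hit: served from local memory   */
}

/* Invalidate between phases (assumes 0 is never a valid address). */
void cache_flush(void) { memset(cache, 0, sizeof cache); }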

Parallel ``AtomicProc:''

Our fastest implementation was a parallel tree-building algorithm loosely based on the original Barnes-Hut tree-build, but optimized to reduce lock contention. Additionally, we allow up to eight bodies per leaf node, which greatly benefits cache performance and reduces some of the memory overhead needed for links between nodes.

The original Barnes-Hut algorithm builds the tree from the top down, locking nodes on every insertion. Because the overhead of a locking operation is a full round trip for the lock and another full round trip for the unlock, we wanted to minimize the number of locks. Therefore, we implemented our ``AtomicProc'' algorithm so that nodes are locked only when a node has to be split. However, as we will show, this still results in lock contention, because the amount of work to split a node is substantial: all of the bodies have to be retrieved and re-inserted into a new subtree which has been split into 8 parts. A sketch of the algorithm follows:

fun InsertBody(global root, body) =
   do
      result := try_atomic_insert(root, body)
      case result in
         insert_good:  (* done: the body was stored in the leaf *)
         leaf_full:    SplitLeaf(root, body)
         locked:       (* another processor holds the node; retry *)
         internal:     insert_internal(root, body)
      esac
   while result = locked
end
    
In this case, the bulk of the work is in the try_atomic_insert function, which executes on the processor that owns the root (it is a global pointer). In the usual case, the root will be partially empty, and the body will simply be stored in it. Occasionally, the root will become full; in this case, the processor whose insertion caused the root to fill is responsible for splitting the root into the 8 children and re-inserting all of the bodies into the appropriate leaves. While it is splitting the root, the root remains locked so that other processors can't try to update the value. The other exceptional case is that the node has been split at some point, in which case it is now an internal node; here the algorithm simply recurses down the tree below the new internal node.

As an additional optimization, our implementation caches these internal nodes: once a node has been split into an internal node, it will not be modified again until after the end of the tree-build stage. For well-behaved insertions, where the body data on each processor is relatively well space-separated, this optimization makes the insert operation run in O(1) time instead of O(log n).

One way to force the O(1) performance on random data (as mentioned and used in the Hashed tree-build algorithm of [10]) is to sort the body data beforehand -- a relatively expensive, but reasonable, option. Note that sorting the body data before beginning gives far better performance for all of the parallel algorithms presented here: each processor eventually works on a portion or portions of the tree in which the other processors have no interest, thereby greatly reducing lock contention.
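
To make the locking protocol concrete, here is a minimal sketch in C using POSIX threads. The status codes and the try_atomic_insert entry point mirror the pseudocode above; the mutex, the node layout, and the leaf capacity are illustrative assumptions, not the actual implementation.

#include <pthread.h>

#define LEAF_CAP 8

typedef enum { INSERT_GOOD, LEAF_FULL, LOCKED, INTERNAL } Status;

typedef struct Node {
    pthread_mutex_t lock;
    int is_internal;
    int nbodies;
    struct Body *bodies[LEAF_CAP];
    struct Node *child[8];
} Node;

/* One short critical section per insertion; the lock is held for a long
 * time only by the processor that must split a full leaf. */
Status try_atomic_insert(Node *n, struct Body *b) {
    if (pthread_mutex_trylock(&n->lock) != 0)
        return LOCKED;                  /* someone is splitting: retry     */
    if (n->is_internal) {
        pthread_mutex_unlock(&n->lock);
        return INTERNAL;                /* recurse below this node         */
    }
    if (n->nbodies == LEAF_CAP)
        return LEAF_FULL;               /* caller splits, then unlocks;
                                           others spin on LOCKED meanwhile */
    n->bodies[n->nbodies++] = b;        /* common case: store the body     */
    pthread_mutex_unlock(&n->lock);
    return INSERT_GOOD;
}

In the algorithm above, internal nodes are additionally cached locally once created, so the common-case descent through them needs no remote locking at all; only the leaf_full path, which holds the lock for the duration of the split, serializes other processors.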

Parallel Merge:

The Parallel Merge algorithm is described in [6] and the SPLASH II papers [5], but was not implemented by the SPLASH group as they describe it. The SPLASH group apparently decided that, since the tree-build phase was such a small portion of the overall cost of Barnes-Hut, an optimized tree-build was uncalled for; they instead opted to use the naive original algorithm, with its significant locking overhead.

Parallel Merge can be described as follows:

  • Each processor builds a private tree, in its own local memory, from the bodies it owns.
  • The private trees are then merged into the shared global tree: where the global tree has no subtree at a given position, the entire local subtree is spliced in under a single lock; where both trees have an internal node at the same position, the merge recurses on the eight children.

The expectation is that whole subtrees of data will be added to the global tree with each lock, thereby greatly reducing contention. A natural optimization of the algorithm is to cache or duplicate the upper tree nodes, leading to a much decreased insertion cost.

Note that, as before, the more well-behaved the data is, the less expensive the merge operations become; simply sorting the data before starting the algorithm will give this heightened performance.
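
A minimal C sketch of the merge step may clarify the idea; the node layout, the per-node mutex, and the handling shown here are illustrative assumptions (in particular, the leaf/internal mismatch case, which requires splitting, is elided).

#include <pthread.h>

typedef struct Node {
    pthread_mutex_t lock;
    int is_internal;            /* internal node vs. leaf of bodies */
    struct Node *child[8];
} Node;

/* Merge a processor's private subtree into the shared global tree,
 * paying one lock per spliced subtree rather than one per body. */
void merge(Node *global_node, Node *local) {
    if (!local->is_internal)
        return;                 /* leaves merge body-by-body (not shown) */
    for (int i = 0; i < 8; i++) {
        if (!local->child[i])
            continue;
        pthread_mutex_lock(&global_node->lock);
        if (!global_node->child[i]) {
            /* Nothing here yet: splice the whole local subtree in
             * under a single lock. */
            global_node->child[i] = local->child[i];
            pthread_mutex_unlock(&global_node->lock);
        } else {
            /* Both trees are populated here: descend one level and
             * merge the children instead. */
            Node *g = global_node->child[i];
            pthread_mutex_unlock(&global_node->lock);
            merge(g, local->child[i]);
        }
    }
}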

Parallel Hashed:

The Parallel Hashed tree-build algorithm of [10] attempts to replace the explicit tree structure with a hash table. The idea is to number the cells with keys that encode the path from the root: the root node is keyed by a placeholder bit, 1, and each child's key is its parent's key with three bits appended that select one of the eight octants. Thus, the root's children are keyed 1000-1111, the children of node 1000 are keyed 1000000-1000111, etc. There are some immediate benefits to this:

  • No parent or child pointers need to be stored; the key of a node's parent or of any of its children can be computed arithmetically from the node's own key.
  • Any cell can be located directly by hashing its key, without a root-to-leaf traversal.
  • The keys give a natural basis for distributing cells across processors.

In order to get the expected O(1) hash-insertion time, Warren and Salmon assume that the bodies are sorted by their keys before the tree-build begins.
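
For concreteness, here is a small C sketch of the key arithmetic and of a chained hash table built on it; the table layout and hash function are illustrative assumptions, not the exact scheme of [10].

#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 4096                /* power of two (illustrative) */

typedef struct Cell {
    uint64_t key;                      /* 1-prefixed path from the root */
    struct Cell *next;                 /* collision chain               */
    /* ... body list, mass, center of mass, etc. ... */
} Cell;

static Cell *table[TABLE_SIZE];

/* Tree navigation is pure key arithmetic; no parent/child links are stored. */
static uint64_t root_key(void)                 { return 1; }
static uint64_t child_key(uint64_t k, int oct) { return (k << 3) | oct; }
static uint64_t parent_key(uint64_t k)         { return k >> 3; }

/* Any cell is located directly by its key -- no root-to-leaf walk. */
static Cell *lookup(uint64_t key) {
    for (Cell *c = table[key & (TABLE_SIZE - 1)]; c; c = c->next)
        if (c->key == key)
            return c;
    return NULL;
}

static Cell *insert(uint64_t key) {
    Cell *c = calloc(1, sizeof *c);
    c->key = key;
    c->next = table[key & (TABLE_SIZE - 1)];
    table[key & (TABLE_SIZE - 1)] = c;
    return c;
}

Note that parent_key(child_key(k, oct)) == k for any octant, which is what lets the tree be walked in either direction without storing links.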


Continue to Calculation of Cell and Body Interactions

-----

maintained by <hodes@cs.berkeley.edu>