Parallel Aspects

Impulse is not a simple system to parallelize and does not contain a simple obvious method of parallelization. There are some aspects that are east to parallelize while others are not. In this section we discuss the various parallel aspects that were implemented. However, future developments includes a discussion of some of the more difficult portions of the code to parallelize and how they should be tackled.

Fundamental Parallel Aspect

The fundamental parallel aspect is that objects are spread across processors. Essentially, objects are read in and distributed across the processors. This is done using a spread array structure provided by Split-C. The advantage is that each processor is in charge of maintaining the state of a smaller subset of the objects thereby exploiting the parallelism during various stages of the simulation. Fixed objects are local to all processors as well as static data about all of the objects such as their polygonal models, tiling resolution, etc. The result is that objects in collision with fixed objects or with objects that are local to the same processor achieve near linear parallel speedup results.


Here we exploit some of the parallelism such that all of the processors compute the initial bounding volumes of their local objects and place them into the global hash tiling. The bounding volume calculations all occur in parallel for the objects. Access to the hash tiling structure is made via atomic access routines to the hash bins. Objects that don't hash to the same spatial hash tiling will not be in contention for the same block of data and hence will proceed to operate in parallel. Future improvements involving distributing the hash across processors could be made to this routine. However, given the size of the project many of these such optimizations were not made in the sake of getting a fully and correctly operating simulation running within the time frame given by this project.

Since the initial computation of time to impact for object pairs is only performed once at the beginning of the simulation, no parallelization was made for computing the initial time to impact between object pairs. Instead this is portion of the code is performed entirely on one processor. This is not a problem since we only perform this initialization once. The collision heap resides on only one processor. This is not a problem for the parallel implementation as discussion in the next step of the operation. An earlier approach to this problem on a 2D system also explored the results of randomly spreading the heap across processors.

Heap Removal

The heap is local to only one processor. Therefore, that processor simply removes the top heap element and broadcasts to all of the other processors both the pair involved in the collision and the time step for this simulation loop. An earlier approach to this problem on a 2D system explored the results of randomly spreading the heap across processors.

Computing Bounding Volumes

The computation of the swept volumes occur in parallel for the objects on all processors. Since each processor maintains a distributed subset of all of the objects, all computations are local for determining the swept volumes.

Update Hash Tiling

After parallel computation of the swept volumes for all of the objects in the simulation, we determine the new hash entries that need to be inserted into the spatial tiling. If an object's bounding volume has not changed since it was last updated, no work is required and no modifications to the hash tiling are made for that object. When new hash entries need to be inserted and old ones removed, we use atomic operations for accessing each bin. Access to the tiling data structure can be performed in parallel when processors are making updates to different spatial hash tiling bins. Although, when objects are clustered together, they will hash into the same spatial hash tiling bin. Remember that we only require atomic access to the bin for updates and that we only make updates when an object enters or exits a spatial hash tiling bin. Therefore, objects that remain close to one another, hence sharing a spatial hash tiling bin, will not receive any performance hit after their initial insertion into the bin since there will be no contention after the initial atomic access to the hash tiling bin.

Evolve Ballistic Trajectory

Evolution of the ballistic trajectory of each object proceeds in parallel on all processors. No communication is necessary since all information is local to each of the processors.

Handle Collision

Since this operation does not immediately involve any of the other objects, we handle the collision on one processor. We do choose this processor with some intelligence. If the collision is between a moving object and a fixed object, we choose to perform the collision calculations on the processor local to the moving object. This will insure that the entire collision operation can occur locally since fixed objects are local to all processors. We may also have a collision between two objects that are on the same processor, in which case collision calculations can proceed without communication. Since these first two cases are highly desirable it would be an important improvement for future parallel implementations to bias the locality of the objects to significantly increase the probability of colliding objects residing on the same processor. In the worst case the two objects will be on separate processors. If so, we simply choose one of the processors to handle the collision and spend time in communicating a few of the necessary pieces of data about the remote object to the controlling processor. Once there, collision and impulse calculations can proceed locally.

After the impulses have been applied to the two objects, we need to perform the necessary updates to the heap as describe here. Since we have reduced the heap size to include only object pairs that are close to one another, we will only need to perform heap updates and new time to impact estimations for objects that are close, as defined by the spatial hash tiling, to one of the recently collided objects. Rather than performing all of the necessary communication to this single processor, we simply broadcast flags and the necessary object data necessary to processors with at least one of the objects in the pair to be updated as local. This improves the performance of this obviously expensive operation by distributing the workload of updating across processors.

Record State of System for Rendering

Since, again, each processor maintains a subset of all of the objects, all of the processors can, in parallel, record the object poses to either an internal array or movie file which can be played back on a system with a graphics workstation. Since the objects are local to each the processor, this proceeds in parallel without any need for communication.

[ Home ]

Eric Paulos / / 10 May 1995