Parallel Aspects
Impulse is not a simple system to parallelize and does not contain a
simple obvious method of parallelization. There are some aspects that
are east to parallelize while others are not. In this section we
discuss the various parallel aspects that were implemented. However,
future developments includes a discussion of
some of the more difficult portions of the code to parallelize and how
they should be tackled.
Fundamental Parallel Aspect
The fundamental parallel aspect is that objects are spread across
processors. Essentially, objects are read in and distributed across
the processors. This is done using a spread array structure provided
by Split-C. The advantage is that each processor is in charge of
maintaining the state of a smaller subset of the objects thereby
exploiting the parallelism during various stages of the
simulation. Fixed objects are local to all processors as well as
static data about all of the objects such as their polygonal models,
tiling resolution, etc. The result is that objects in collision with
fixed objects or with objects that are local to the same processor
achieve near linear parallel speedup results.
Initialization
Here we exploit some of the parallelism such that all of the
processors compute the initial bounding volumes of their local objects
and place them into the global hash tiling. The bounding volume
calculations all occur in parallel for the objects. Access to the hash
tiling structure is made via atomic access routines to the hash
bins. Objects that don't hash to the same spatial hash tiling will not
be in contention for the same block of data and hence will proceed to
operate in parallel. Future improvements involving distributing the
hash across processors could be made to this routine. However, given
the size of the project many of these such optimizations were not made
in the sake of getting a fully and correctly operating simulation
running within the time frame given by this project.
Since the initial computation of time to impact for object pairs is
only performed once at the beginning of the simulation, no
parallelization was made for computing the initial time to impact
between object pairs. Instead this is portion of the code is performed
entirely on one processor. This is not a problem since we only perform
this initialization once. The collision heap resides on only one
processor. This is not a problem for the parallel implementation as
discussion in the next step of the operation. An earlier approach to
this problem on a 2D system also explored the results of randomly
spreading the heap across processors.
Heap Removal
The heap is local to only one processor. Therefore, that processor
simply removes the top heap element and broadcasts to all of the other
processors both the pair involved in the collision and the time step
for this simulation loop. An earlier approach to this problem on a 2D
system explored the results of randomly spreading the heap across
processors.
Computing Bounding Volumes
The computation of the swept volumes occur in parallel for the objects
on all processors. Since each processor maintains a distributed subset
of all of the objects, all computations are local for determining the
swept volumes.
Update Hash Tiling
After parallel computation of the swept volumes for all of the objects
in the simulation, we determine the new hash entries that need to be
inserted into the spatial tiling. If an object's bounding volume has
not changed since it was last updated, no work is required and no
modifications to the hash tiling are made for that object. When new
hash entries need to be inserted and old ones removed, we use atomic
operations for accessing each bin. Access to the tiling data structure
can be performed in parallel when processors are making updates to
different spatial hash tiling bins. Although, when objects are
clustered together, they will hash into the same spatial hash tiling
bin. Remember that we only require atomic access to the bin for
updates and that we only make updates when an object enters or exits a
spatial hash tiling bin. Therefore, objects that remain close to one
another, hence sharing a spatial hash tiling bin, will not receive any
performance hit after their initial insertion into the bin since there
will be no contention after the initial atomic access to the hash
tiling bin.
Evolve Ballistic Trajectory
Evolution of the ballistic trajectory of each object proceeds in
parallel on all processors. No communication is necessary since all
information is local to each of the processors.
Handle Collision
Since this operation does not immediately involve any of the other
objects, we handle the collision on one processor. We do choose this
processor with some intelligence. If the collision is between a moving
object and a fixed object, we choose to perform the collision
calculations on the processor local to the moving object. This will
insure that the entire collision operation can occur locally since
fixed objects are local to all processors. We may also have a
collision between two objects that are on the same processor, in which
case collision calculations can proceed without communication. Since
these first two cases are highly desirable it would be an important
improvement for future parallel
implementations to bias the locality of the objects to
significantly increase the probability of colliding objects residing
on the same processor. In the worst case the two objects will be on
separate processors. If so, we simply choose one of the processors to
handle the collision and spend time in communicating a few of the
necessary pieces of data about the remote object to the controlling
processor. Once there, collision and impulse calculations can proceed
locally.
After the impulses have been applied to the two objects, we need to
perform the necessary updates to the heap as describe here. Since we have reduced the heap size to
include only object pairs that are close to one another, we will only
need to perform heap updates and new time to impact estimations for
objects that are close, as defined by the spatial hash tiling, to one
of the recently collided objects. Rather than performing all of the
necessary communication to this single processor, we simply broadcast
flags and the necessary object data necessary to processors with at
least one of the objects in the pair to be updated as local. This
improves the performance of this obviously expensive operation by
distributing the workload of updating across processors.
Record State of System for Rendering
Since, again, each processor maintains a subset of all of the objects,
all of the processors can, in parallel, record the object poses to
either an internal array or movie file which can be played back on a
system with a graphics workstation. Since the objects are local to
each the processor, this proceeds in parallel without any need for
communication.
Eric Paulos /
paulos@robotics.eecs.berkeley.edu /
10 May 1995