Several aspects of the parallelization of this system are non-optimal.
One of the key problems is that we fix the mapping of objects to
processors. This mapping is completely arbitrary, set as the object
data is read in from the simulation file; no intelligence has yet been
put into the mapping strategy. By giving up control over this mapping,
we often find the system in states which are horrendously expensive on
a parallel machine but could be dramatically improved simply by a
quick object-to-processor re-mapping.
A Bad Case
For example, consider a system with many objects where only two such
objects are actually colliding, and assume that these two colliding
objects are resting atop one another. If we are unfortunate, the two
colliding objects will reside on different processors. Since the
system can only step forward until the next collision, and no other
objects are colliding, the system will step forward to the next
collision between these two objects. Additionally, since these
objects are resting upon one another, the time-to-impact estimates
will be small. Each time through the loop we will handle the collision
between these two objects, and each time we will be required to make a
remote request for data from another processor. This request is by
far the most expensive operation that the simulation can make and will
degrade the performance of the entire system.
A quick solution that was implemented is to cache objects as we
request them from remote processors. The problem is that the data we
fetch becomes invalid on every cycle through the loop: after the
ballistic phase, which is performed on the processor that owns the
object, the processor that handles the collision holds an invalid
state for that object and must request a new copy. Again, this forces
an expensive remote request on every iteration.
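Such a cache can be sketched as follows (the structure and names here are illustrative assumptions, not the actual Impulse code): each processor keeps local copies of remotely owned objects tagged with the cycle on which they were fetched. Because every ballistic phase invalidates all remote copies, a cached copy is only usable within the cycle in which it was fetched.

```c
#define MAX_OBJECTS 64

/* Hypothetical rigid-body state; the real Impulse structure is larger. */
typedef struct {
    double pos[3];
    double vel[3];
} BodyState;

typedef struct {
    BodyState state;
    int fetched_cycle;   /* cycle on which this copy was fetched */
    int valid;
} CacheEntry;

static CacheEntry cache[MAX_OBJECTS];

/* Returns 1 if a copy fetched during the current cycle exists. */
int cache_lookup(int obj, int current_cycle, BodyState *out)
{
    if (cache[obj].valid && cache[obj].fetched_cycle == current_cycle) {
        *out = cache[obj].state;
        return 1;                  /* hit: no remote request needed   */
    }
    return 0;                      /* miss: must fetch from the owner */
}

/* Store a freshly fetched copy. */
void cache_store(int obj, int current_cycle, const BodyState *s)
{
    cache[obj].state = *s;
    cache[obj].fetched_cycle = current_cycle;
    cache[obj].valid = 1;
}
```

A hit within the same cycle avoids the remote fetch, but as soon as the cycle counter advances the lookup misses again, which is exactly the invalidation problem described above.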
The Real Solution
The real solution would observe, either through some form of history
recording or directly from spatial locality information, that these
two frequently colliding objects should reside on the same processor.
Once this is decided, it would copy the object over to the new
processor and inform the other processors of the new mapping. The
idea is essentially an adaptive, dynamic object-to-processor mapping
based on spatial locality. The real problem is in being smart about
when and how to perform the re-mapping. Clearly, if we re-map too
often we will be in the same state that we were trying to avoid,
namely overly zealous communication. In other words, if we re-map an
object only to re-map it again after only a few iterations through
the simulator, we will receive the same performance hit as if we had
never done the re-mapping. Likewise, if we are too conservative in
our re-mapping, we will not be able to leverage a performance gain
when we do re-map. Even worse, by the time a conservative re-mapping
strategy decides to re-map it may be too late to achieve any
significant performance gain. The pot of gold is to perform the
re-mapping in such a way as to guarantee a performance gain by looking
at the past, present, and future state of the system.
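One way to sketch such a history-based policy (the threshold and structure names are assumptions, not part of Impulse): count consecutive cross-processor collisions per object pair, and migrate an object only once the count crosses a threshold. The threshold is what keeps the policy from re-mapping on a single stray collision, which would reproduce the overly zealous communication described above.

```c
#define THRESHOLD 4   /* assumed: cross-processor hits before a re-map pays off */

typedef struct {
    int owner;        /* processor that currently owns this object */
    int partner;      /* object it keeps colliding with, -1 if none */
    int cross_hits;   /* consecutive cross-processor collisions     */
} ObjInfo;

/* Record a collision between objects a and b; return the object that
 * should migrate (to b's processor), or -1 if no re-map is warranted. */
int record_collision(ObjInfo *obj, int a, int b)
{
    if (obj[a].owner == obj[b].owner) {
        obj[a].cross_hits = 0;        /* local collision: reset history */
        return -1;
    }
    if (obj[a].partner != b) {        /* new partner: restart the count */
        obj[a].partner = b;
        obj[a].cross_hits = 0;
    }
    if (++obj[a].cross_hits >= THRESHOLD) {
        obj[a].owner = obj[b].owner;  /* migrate a to b's processor     */
        obj[a].cross_hits = 0;
        return a;
    }
    return -1;
}
```

In the resting-contact scenario above, the pair would cross the threshold within a few iterations, after which every subsequent collision between them is handled locally.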
We already have a spatial hash tiling which could be used to predict
more optimal object-to-processor mappings. Future work on such
parallel implementations of simulation systems should take advantage
of this in order to achieve significant performance gains from the
parallelization. Taking advantage of locality in parallel programs is
discussed further here.
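The spatial hash tiling could drive the mapping directly; a minimal sketch, assuming a 2-D tiling with a made-up tile size and hash constants (not Impulse's actual hash): objects are assigned to the processor that owns their tile, so spatially adjacent, and hence collision-prone, objects tend to be co-resident.

```c
#define TILE   2.0     /* assumed tile edge length     */
#define NPROCS 4       /* assumed number of processors */

/* Hash a tile coordinate pair onto a processor. */
int tile_to_proc(int tx, int ty)
{
    unsigned h = ((unsigned)tx * 73856093u) ^ ((unsigned)ty * 19349663u);
    return (int)(h % NPROCS);
}

/* Predicted owner for an object at (x, y): objects in the same
 * spatial tile always land on the same processor. */
int predict_proc(double x, double y)
{
    int tx = (int)(x / TILE);
    int ty = (int)(y / TILE);
    return tile_to_proc(tx, ty);
}
```

This is only a prediction, not a guarantee of balance: a dense clump of objects in one tile would still overload a single processor, which is where the re-mapping heuristics above would have to step in.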
When communication of object data is needed, we currently transmit the
entire rigid body structure. However, we often need only several
bytes of that data, so we could reduce the communication time by
sending smaller packets. Perhaps a simple compression scheme could
reduce the transmitted data size even further. Any reduction in
remote processor communication will significantly increase the
performance of the system.
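The saving can be seen from the structure sizes alone. Here is a sketch with a hypothetical rigid-body layout (the real Impulse structure differs): packing only the fields the collision handler reads yields a much smaller message than shipping the whole record.

```c
#include <stddef.h>

/* Hypothetical full rigid-body record, as transmitted today. */
typedef struct {
    double pos[3], vel[3], omega[3];
    double rot[9];          /* orientation matrix     */
    double inertia[9];      /* inertia tensor         */
    double mass;
    char   name[32];
    /* ... many more fields in the real structure ... */
} RigidBody;

/* Minimal packet: only what the collision handler needs. */
typedef struct {
    double pos[3], vel[3], omega[3];
} CollisionPacket;

/* Fill a small packet from the full record. */
void pack(const RigidBody *b, CollisionPacket *p)
{
    for (int i = 0; i < 3; i++) {
        p->pos[i]   = b->pos[i];
        p->vel[i]   = b->vel[i];
        p->omega[i] = b->omega[i];
    }
}
```

With this made-up layout the packet is a small fraction of the full structure, and the ratio would only improve with the many additional fields in the real rigid body record.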
An entirely different approach to this form of dynamic simulation is
also possible. Essentially, clusters of predicted future collisions
could be shipped off to separate processors, even using a simple
system such as PVM. In the best case this would allow the "time
frontier" of the simulation system to move forward asynchronously. A
true clock would be kept to lock in the state of the system at
particular points in time. Essentially, that clock would ensure that
everything had been properly handled by the simulation system on all
processors up to that point. Portions of the system moving their
simulation time very far into the future would be given a lower
priority in terms of processor allocation. This would load balance
the system so that processors are biased toward solving portions of
the system that will occur soon.
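The "true clock" here is essentially the minimum over all processors' local simulation times, and the priority rule falls out of the same quantity. A sketch, with the array representation and names assumed for illustration: everything before the minimum is locked in, and the processor lagging furthest behind gets scheduled first.

```c
/* The true clock: the simulation is committed up to the minimum
 * local time reached by any processor. */
double true_clock(const double local_time[], int n)
{
    double t = local_time[0];
    for (int i = 1; i < n; i++)
        if (local_time[i] < t)
            t = local_time[i];
    return t;
}

/* Pick the processor to favor: the one lagging furthest behind,
 * so effort is biased toward work that will commit soonest. */
int next_to_schedule(const double local_time[], int n)
{
    int p = 0;
    for (int i = 1; i < n; i++)
        if (local_time[i] < local_time[p])
            p = i;
    return p;
}
```

Advancing the laggard is what pulls the true clock forward; processors far ahead of it are doing speculative work that may yet be invalidated.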
If a collision occurs that threatens to invalidate a portion of the
simulator running on a remote processor, we check whether this
collision actually does affect the state of that sub-system. If it
does, we must dump the data that has been computed by that processor
about that portion of the system, since it is inaccurate, and recover.
We recover by re-distributing the objects onto the processors and
re-simulating the invalidated portion.
The basic idea is to load balance the system over several processors
among which the object data is intelligently distributed. If a
collision on one processor invalidates the state that was computed by
one or more processors, we dump those results and recover. However,
it is hoped that the chosen distribution strategy will minimize such
events. We also hope that even though recovery will most likely take
some time, we will still win because of the speedup we achieve during
the parallel evolution of the system across all of the processors when
nothing "bad" happens.
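The invalidation check that guards the recovery can be made cheap. A sketch, with the sub-system representation assumed for illustration: remotely computed work is discarded only when the collision involves one of that sub-system's objects and occurred before the time up to which the remote processor has already simulated.

```c
/* State computed optimistically by a remote processor. */
typedef struct {
    double up_to_time;      /* simulated up to this time      */
    int    objects[8];      /* object ids in this sub-system  */
    int    nobjects;
} SubSystem;

/* Does a collision at time t involving object id invalidate s? */
int must_rollback(const SubSystem *s, int id, double t)
{
    if (t >= s->up_to_time)
        return 0;                   /* collision lies in s's future   */
    for (int i = 0; i < s->nobjects; i++)
        if (s->objects[i] == id)
            return 1;               /* affects this sub-system: dump  */
    return 0;                       /* unrelated object: keep results */
}
```

Keeping this test conservative in the right direction matters: a false "no rollback" corrupts the simulation, while a false "rollback" merely wastes the speculative work.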
Some portions of the code that are likely to affect the performance of
the system also involve data redistribution. For example, the closest
feature table at present resides on only one processor. Ideally, this
could be distributed with a strategy similar to the object
distribution; that is, closest features that will be examined by
objects on a particular processor should be local to it. Similarly,
the spatial hash tiling table could be distributed rather than kept on
one processor, with the aim of reducing the contention for
communication to a single processor.
With over 27,000 lines of code, Impulse was not trivial to get running
on the CM-5 in Split-C. Clearly, many obvious optimizations were not
made in the interest of first achieving a completely functional
parallel version of Impulse. It is hoped that such optimizations will
occur soon under a longer-range plan for parallelizing Impulse.
However, for now we are pleased to have obtained a completely
functional parallel version of Impulse. Stay tuned.
Eric Paulos /
10 May 1995