Future Work

Several aspects of the parallelization of this system are non-optimal. One of the key problems centers around the fact that we fix the mapping of objects to processors. This mapping is completely arbitrary and set as the object data is read in from the simulation file. No intelligence has yet been put into the mapping strategy. The problem with this approach is that by giving up the control over this mapping we often find the system in states which are horrendously expensive on a parallel machine but could be dramaticly improved simply by a quick object to processor re-mapping.

A Bad Case

For example, consider a system with many objects where only two such objects are actually colliding. Also, assume that these two colliding objects are actually resting atop one another. If we are unfortunate the two colliding objects will reside of different processors. Since the system can only step forward until the next collision and no other objects are colliding the system will step forward to the next collision between the two colliding objects. Additionally since these objects are resting upon one another the time to impact estimates will be small. Each time through the loop we will handle the collision between these two objects and each time we will be required to make a remote request for data from another processor. This request is by far the most expensive operation that the simulation can make and will degrade the performance of the entire system.

Quick solution

A quick solution that was implemented is to cache objects as we request them from remote processors. The problem is that the data that we fetch becomes invalid on every cycle through the loop since after the ballistic phase, which is performed on the processor that owns the object, the processor that handles the collision will have an invalid state for that object and must request a new copy. Again this is expensive.

The Real Solution

The real solution would either observe through some form of history recording or directly from spatial locality information that these two frequently colliding objects should reside on the same processor. When this is decided it would copy the object over to the new processor and inform the other processors of the new mapping. The idea is essentially an adaptive dynamic object to processor mapping based on spatial locality. The real problem is in being smart about when and how to perform the re-mapping. Clearly if we re-map too often we will be in the same state that we were trying to, namely overly zealous communication. In other words if we re-map an object only to re-map it again after only a few iterations though the simulator we will receive the same performance hit as we would have if we had never done the re-mapping. Likewise if are too conservative in our re-mapping we will not be able to leverage from a performance gain when we re-map. Even worse, by the time a conservative re-mapping strategy decides to re-map it may be too late to achieve and significant performance gain from the re-mapping. The pot of gold is to perform the re-mapping is such a was as to guarantee a performance gain by looking at the past, present, and future state of the system. We already have a spatial hash tiling which could be used to predict more optimal object to processor mappings. Future work on such parallel implementations of simulation systems should take advantage of this in order to achieve significant performance gains from the parallelization. Further discussion on taking advantage of locality for parallel programs is discussed here and here.

When communication of object data is needed we currently transmit the entire rigid body structure. However, we often only need several bytes of that data. Therefore, we could reduce the communication time by sending these smaller sized packets. Perhaps, a simple compression scheme could reduce the data transmission size even more. Any reduction in remote processor communication will significantly increase performance of the system.

Another Approach

An entirely different approach to this form of dynamic simulation is also possible. Essentially, clusters of predicted future occurring collisions could be shipped off to separate processors, even using a simple system such as PVM. In the best case this would allow for the "time frontier" of the simulation system to move forward asynchronously. A true clock would be kept to lock in the state of the system at particular points in time. Essentially, that clock would insure that everything had been properly handled by the simulation system on all processors up to that point. Portions of the system moving their simulation time very far into the future would be given a lower priority in terms of processors allocation. This would load balance the system so that processors will be bias towards working on solving portions of the system that will occur soon.

If a collision occurs that threatens invalidates a portion of the simulator running off on a remote processor, we check to make sure that this collision actually does affect the state of that sub-system. If it does, we must dump the data that has been computed by that processor about that portion of the system since it is inaccurate and recover. We recover by re-distributing the objects onto processors and simulating.

The basic idea is to load balance the system over several processors in which the object data is intelligently distributed. If a collision on one processor invalidates the state that was computed by one or more processors, we dump those results and recover. However, it is hoped that the distribution strategy that is chosen will minimize such events. We also, hope that even though it will most likely take some time to recover we will still win because of the speedup gain we achieve during the parallel evolution of the system across all of the processors when nothing "bad" happens.

Small Optimizations

Some portions of the code that are likely to affect the performance of the system also involve data redistribution. For example the closest feature table at present resides only on one processor. Ideally, this could be distributed with a similar strategy as the object distribution. That is closest features that will be examined by objects on a particular processor should be local.

Similarly, the spatial hash tiling table could be distributed rather than on one processor with the aim being and reducing the contention for communication to a single processor.

Final Note

With over 27,000 lines of code, Impulse, was not trivial to get running on the CM5 in Split-C. Clearly, many obvious optimizations were not made in the interest of achieving a completely functional parallel version of Impulse first. It is hoped that such optimizations will occur soon under a longer range plan for parallelizing Impulse. However, for now we are pleased to have obtained a completely functional parallel version of Impulse. Stay tuned.

[ Home ]

Eric Paulos / paulos@robotics.eecs.berkeley.edu / 10 May 1995