Results

Results
Since the motivation for parallelizing Impulse was to achieve some form of performance gain, we look at the results of our parallel simulation as our performance metric. It should be noted that the work involved in simply getting the parallel version of Impulse to function properly did not allow much time for various optimizations that would speedup the parallel execution. It should be noted that various future plans will have dramatic effects on the timings shown below.

It's important to note that the big goal, where we hope to achieve large performance gains, is on systems with more than a hundred objects. We expect this because of results obtained from an earlier version of a 2D Impulse-type dynamic simulation system. In fact, our system mimics the results we obtained on the 2D system for small particle systems. However, due to memory limitations, we have not been able to achieve timing runs on such significantly large systems. We expect that soon we will solve the system's greedy memory usage problem, and will then be able to test our simulator on these large systems. Stay tuned. For now we report our results on some smaller systems where we don't expect large performance gains.

Also note that although we desired to run simulations with large numbers of objects, the memory constraints of the CM5 prevented us from running with more than a few dozen objects. More impressive results of the speedup would be apparent on larger simulations when the more parallelized portions of the code play a larger roll. We also desire to run longer simulations. However, due to memory restrictions as the simulation proceeds we have truncated many of the simulations to only a few seconds of simulation time.
Finally, we are also slowed down by the rendering requirement. If we could simply advance from collision to collision we would be able to simulate the various systems much more quickly. However, we must artificially halt the system and take a picture at evenly spaced intervals even if nothing "interesting" has happened. We do this so that we can produce realistic movies with evenly spaced frames. If only some final result was desired from the simulation, then we could relax this frame rate and achieve improved performance.
Overall, the results are quite poor. Mostly this is due the fundamental degree of difficulty in parallelizing such a complicated system. Clearly, there exists a bottleneck in the collision step and heap updating. However, we are looking for big parallel speedup wins for large systems with many objects. In a previous parallel version of a 2D simulator with many of the same data structures we failed to observe real performance gains until a much larger number of objects were simulated. We expect similar results in the full 3D dynamic simulator. In fact we mimic the 2D speedup graph almost exactly. As shown in the graph below, there is no significant performance gain until systems with over 100 objects. This is promising for future work on the parallel simulator.

Note that timings for all simulations measure only the actual simulation time and do no include the time required during the initialization step with is usually less than a quarter of a second. These were all take on a 32 processor CM5.

Ballistic Coins

Below we observe the simulation of a system of 32 coins in ballistic trajectory on parallel processors. Remember that the objects to processor mapping is performed arbitrarily at startup. This maps out to advantageous object distribution on some numbers of processors and poor on others. Speedup 1 1.66 6.67 4.00 1.00 2 1.66 6.33 3.80 1.05 4 1.66 3.74 2.24 1.78 8 1.66 2.60 1.56 4.28 16 1.66 2.30 1.38 4.83

Coins

A system of 8 coins thrown into an arena that collide (Movie Here)

Processors	Simulation Time (sec)	Processing Time (sec)	Slowdown from Realtime	Speedup
1	1.66	27.7	16.7	1.00
2	1.66	38.5	23.1	0.72
4	1.66	20.3	12.1	1.36
8	1.66	17.9	10.7	1.55

More Coins

A system of 32 coins that are thrown into the arena and collide with one another. Notice that this system contains 4 times the number of coins as the simulation above and we do actually observe a more significat speedup. This is in agreement with our argument that this parallel implementation will perform substantually better or larger systems.

Processors	Simulation Time (sec)	Processing Time (sec)	Slowdown from Realtime	Speedup
1	1.66	28.9	17.3	1.00
2	1.66	16.5	9.9	1.75
4	1.66	19.9	11.9	1.45
8	1.66	16.1	9.6	1.80

Complex Feeder System

Two feeder trays with 32 objects which slide down the trays and all collide at the bottom. This is a collision intensive system where many of the collision will be non-local resulting in poor performance. Ideally we would be able to adapt the object to processor mapping so that we can achieve bursts of collisions between objects entirely on a local processor.

Processors	Simulation Time (sec)	Processing Time (sec)	Slowdown from Realtime	Speedup
1	0.55	12.53	22.6	1.00
2	0.55	10.5	19.0	1.2
4	0.55	9.1	16.5	1.4
8	0.55	9.93	17.8	1.3
16	0.55	10.6	19.1	1.2

Maliciously Stacked Coins

In this simulation we have constructed a worst case situation for the simulator in which inter-processor communication is almost always necessary for every cycle through the simulator. This is true because we have constructed a malicious example where collisions occur often between moving objects that do not reside on the same processor. We expect, and receive, a huge performance hit for these types of simulations on two processors. When run on two processors, pairs of stacked objects are assigned to different processors. When the object mapping is spread across more processors the performance improves slightly.

Processors	Simulation Time (sec)	Processing Time (sec)	Slowdown from Realtime	Speedup
1	0.55	17.7	31.2	1.00
2	0.55	50.3	90.6	0.35
4	0.55	12.93	23.2	1.37
8	0.55	10.56	19.0	1.68
16	0.55	10.73	19.3	1.64

[ Home ]

Eric Paulos / paulos@robotics.eecs.berkeley.edu / 10 May 1995