ME 964 Fall 2008

Midterm Project Report

Collision Detection

Your Name

University of Wisconsin - Madison

Abstract

This report summarizes the work done for the ME 964 midterm project, which involved the implementation of a collision detection algorithm on the GPU via CUDA. The parallelized collision detection algorithm was then compared to Bullet, a popular physics engine. The work presented here uses a more intelligent collision detection algorithm than the simple brute-force method: the spatial subdivision method partitions 3D space into cells and checks for collisions only between bodies that occupy the same cell. The parallel collision detection implementation completed for this project showed a speedup of up to 17 times over Bullet for large problems.

1  Description of Algorithm

The spatial subdivision method works by dividing 3D space into cells. The cells are uniform in size and cubic, with an edge length equal to the diameter of the largest sphere in the environment. With this convention, a sphere may intersect at most eight cells, so an array of size 8*N is allocated on the device, where N is the number of bodies to process. The array is of type int2, where the first element of each pair is a cell ID and the second is a body ID. One kernel call fills this array. The kernel is executed with one thread per body, and each thread places its cell ID/body ID pairs into the appropriate locations in the array. A supporting data structure keeps track of how many elements were added to the array.

Next, a parallel radix sort is used to sort the array by cell ID; the radix sort implementation from an SDK example was used. Another kernel is then called to locate the positions in the sorted array where each cell ID begins (a cell ID may appear multiple times in the sorted array, once for each body that intersects that cell). This kernel is executed with one thread per element of the sorted array.

Finally, the cells are processed for collision data, using one thread per collision cell. First, the cells are processed and the collisions are counted using a supporting array and a parallel scan operation. This count is used to allocate an array to hold the collision data. The kernel is then executed a second time, this time with a flag set so that the collision data is written to the appropriate locations in the collision data array. In post-processing, the collision data is sorted so that it can easily be compared to the output from Bullet.
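As an illustration of the first step, a minimal CUDA sketch of the pair-generation kernel follows. It assumes spheres are stored as float4 values (center plus radius) and that the grid origin, cell size, and grid dimensions are supplied by the caller; the kernel and parameter names are illustrative rather than taken from the project code, bodies are assumed to lie inside the grid, and unused slots are padded with a sentinel cell ID so they sort to the end of the array.

#include <cuda_runtime.h>

// Sketch of the first kernel: one thread per body. Each thread computes which
// grid cells its sphere's bounding box overlaps (at most 8, because the cell
// edge length equals the largest sphere diameter) and writes (cellId, bodyId)
// pairs into its 8 reserved slots. Unused slots get a sentinel cell ID so that
// they sort to the end of the array after the radix sort.
__global__ void generateCellIdPairs(const float4* spheres,    // (x, y, z, radius) per body
                                    int2*         pairs,      // 8 * numBodies slots
                                    int*          pairCount,  // running count of valid pairs
                                    float3        gridOrigin,
                                    float         cellSize,   // = largest sphere diameter
                                    int3          gridDims,
                                    int           numBodies)
{
    int body = blockIdx.x * blockDim.x + threadIdx.x;
    if (body >= numBodies) return;

    float4 s = spheres[body];

    // Range of cells overlapped along each axis (at most two per axis).
    int xMin = (int)floorf((s.x - s.w - gridOrigin.x) / cellSize);
    int xMax = (int)floorf((s.x + s.w - gridOrigin.x) / cellSize);
    int yMin = (int)floorf((s.y - s.w - gridOrigin.y) / cellSize);
    int yMax = (int)floorf((s.y + s.w - gridOrigin.y) / cellSize);
    int zMin = (int)floorf((s.z - s.w - gridOrigin.z) / cellSize);
    int zMax = (int)floorf((s.z + s.w - gridOrigin.z) / cellSize);

    int slot = 0;
    for (int z = zMin; z <= zMax; ++z)
        for (int y = yMin; y <= yMax; ++y)
            for (int x = xMin; x <= xMax; ++x)
            {
                int cellId = (z * gridDims.y + y) * gridDims.x + x;
                pairs[8 * body + slot++] = make_int2(cellId, body);
                atomicAdd(pairCount, 1);   // supporting counter of valid pairs
            }

    // Sentinel entries for the unused slots (sort to the end of the array).
    while (slot < 8)
        pairs[8 * body + slot++] = make_int2(0x7FFFFFFF, body);
}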

2  Timing Results

To compare the methods, both were run on systems of varying size. The number of bodies used was always a power of two, and varied from 1024 to 1048576. Note that the tolerances used when comparing the CUDA and Bullet results were modified: the angular tolerance on the collision normal was kept at 10 degrees, but the tolerance on the absolute position of the collision on each body was increased from 0.1 to 0.11. The comparison was found to fail for the larger test cases with the smaller tolerance but passed with the slightly larger tolerance. The discrepancy was likely due to round-off error or slight differences in the methods used to calculate the collision data. Table 1 below summarizes the timing results.

Table 1: Timing Results

N / Bodies / Collisions Found / Bullet Time (s) / CUDA Time (s) / Speedup (Bullet/CUDA)
2^10 / 1024 / 335 / 0.015 / 0.035306 / 0.424857
2^11 / 2048 / 712 / 0.016 / 0.036223 / 0.441708
2^12 / 4096 / 1464 / 0.015 / 0.035835 / 0.418585
2^13 / 8192 / 2969 / 0.031 / 0.037825 / 0.819564
2^14 / 16384 / 5811 / 0.062 / 0.041484 / 1.494552
2^15 / 32768 / 11672 / 0.172 / 0.048937 / 3.514723
2^16 / 65536 / 23479 / 0.39 / 0.064288 / 6.066451
2^17 / 131072 / 49004 / 0.922 / 0.09991 / 9.228305
2^18 / 262144 / 105920 / 2.36 / 0.171265 / 13.77981
2^19 / 524288 / 190747 / 5.109 / 0.296856 / 17.21036
2^20 / 1048576 / 487246 / 12.735 / 0.880544 / 14.46265

As can be seen in the timing results above, the CUDA code was faster than the Bullet code for systems with 16384 or more bodies. The CUDA code was slower for small numbers of bodies due to the overhead associated with using the device. The results above were also plotted: Figure 1 below shows the timing data along with second-order polynomial regressions for each data set.

Figure 1: Plot of Timing Data

Figure 1 depicts the growing divide between the CUDA and Bullet methods; for larger systems, the time advantage of the CUDA code would be expected to be even larger. Figure 1 above also shows that both data sets are well represented by second-order polynomials. However, looking at the data in Table 1, the complexity of the methods may be less than second order. If the methods were O(N^2), doubling the size of the problem would quadruple the simulation time; in this project, neither method quadruples its simulation time when the problem size doubles. This indicates that both methods in fact scale better than O(N^2), perhaps closer to O(N log N).
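As a concrete check using the last two rows of Table 1, doubling the problem size from 2^19 to 2^20 bodies increases each running time by well under a factor of four:

\[
\frac{T_{\mathrm{Bullet}}(2^{20})}{T_{\mathrm{Bullet}}(2^{19})} = \frac{12.735}{5.109} \approx 2.49,
\qquad
\frac{T_{\mathrm{CUDA}}(2^{20})}{T_{\mathrm{CUDA}}(2^{19})} = \frac{0.880544}{0.296856} \approx 2.97 .
\]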

Note, however, that the speedup decreased slightly from the second-to-last data point to the last. More testing would be required to see whether this downward trend continues. It is possible that very large problems incur more overhead because of the very large number of blocks required.

Finally, the timing data was plotted on log-log scales and can be seen in Figure 2 below.

Figure 2: Log-log Timing Results

This plot gives an idea of the order of both algorithms. The Bullet timing data (Btime) shows a slope of roughly 1.2, indicating that the method is better than O(N^2), which would have a slope of 2. The CUDA timing data is not linear; rather, the slope is near zero where overhead dominates at smaller problem sizes. This means there is little real cost to increasing the problem size, as can be seen in the data, where the time is nearly constant from 1024 to 8192 bodies. From that point on, however, the algorithm shows a positive slope less than 2. Qualitatively, the slope (order) of the CUDA code appears to be equal to or less than that of Bullet at larger problem sizes.
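On log-log axes, a method whose running time grows approximately as N^p appears as a line of slope p, so the slope can be estimated from any two data points. Using the last two Bullet rows of Table 1, for example,

\[
p \approx \frac{\log T_2 - \log T_1}{\log N_2 - \log N_1}
  = \frac{\log(12.735/5.109)}{\log(2^{20}/2^{19})}
  = \frac{\log 2.49}{\log 2} \approx 1.3,
\]

which is consistent with the slope of roughly 1.2 read from the plot.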

3  Parallel Patterns Used In Solution

Several parallel programming patterns were used while implementing this solution. Because the algorithm was implemented with CUDA, the Single Program Multiple Data (SPMD) pattern was used to take advantage of the GPU hardware. In this pattern, many lightweight threads all execute the same program; each thread has an identifying number, which can control the path the thread takes through the program and the data the thread works with.

My solution for this project has three main kernel calls, each of which is an instance of the SPMD pattern. The first fills the main data structure and uses one thread per body; the thread ID is used to access the data belonging to the corresponding body. Each thread executes the same program but uses different data and may take different paths through the program. The second kernel uses one thread per element of the sorted array, again accessing corresponding data, and the final kernel uses one thread per collision cell. A sketch of the second kernel is given below.
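As an illustration of this one-thread-per-element structure, the following is a minimal sketch of the cell-start search, assuming the sorted pair array from the first kernel and a cellStart array indexed by cell ID that has been initialized beforehand; all names here are illustrative rather than taken from the project code.

#include <cuda_runtime.h>

// Sketch of the second kernel: one thread per element of the sorted pair
// array. A thread marks the start of a cell's run when its cell ID differs
// from that of the previous element. Padding entries written by the fill
// kernel (sentinel cell ID) are skipped.
__global__ void findCellStarts(const int2* sortedPairs,  // sorted by cell ID
                               int*        cellStart,    // first pair index per cell
                               int         numPairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPairs) return;

    int cellId = sortedPairs[i].x;
    if (cellId == 0x7FFFFFFF) return;   // padding entry, not a real cell

    // The first element always begins a run; otherwise compare with the
    // element to the left in the sorted array.
    if (i == 0 || cellId != sortedPairs[i - 1].x)
        cellStart[cellId] = i;
}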

The Task Parallelism pattern was also used. For example, the first and second kernel calls, previously described, consist of a collection of independent tasks. In the first kernel, a task can be identified as finding out which cells a given body overlaps. These tasks, one for each body, are independent and can be done in parallel.

The Geometric Decomposition pattern was used as well. The name of the algorithm, spatial subdivision, implies a geometric decomposition: the 3D space is divided into cells, and in the third kernel each cell is processed by one thread. The data associated with each cell is small enough to be processed by a single thread, and no data needs to be shared between collision cells. A sketch of this per-cell kernel is given below.
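A minimal sketch of the per-cell kernel follows, assuming the sorted pair array, per-cell start indices and counts, and (for the write pass) write offsets produced by the parallel scan are supplied by the caller. The kernel, type, and parameter names are illustrative, and the bookkeeping needed to avoid reporting the same pair from more than one shared cell is omitted for brevity.

#include <cuda_runtime.h>

// Sketch of the third kernel: one thread per collision cell. It is launched
// twice: first with writePass == false to count contacts per cell (the counts
// are then prefix-scanned to produce writeOffsets), then with
// writePass == true to write the contact pairs to their final locations.
struct Contact { int bodyA, bodyB; };

__global__ void processCells(const int2*   pairs,         // sorted (cellId, bodyId)
                             const int*    cellStart,     // first pair index per cell
                             const int*    cellCount,     // number of pairs per cell
                             const float4* spheres,       // (x, y, z, radius) per body
                             int*          contactCount,  // per-cell counts (count pass)
                             const int*    writeOffsets,  // exclusive scan of counts (write pass)
                             Contact*      contacts,      // output contact list (write pass)
                             bool          writePass,
                             int           numCells)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= numCells) return;

    int start = cellStart[cell];
    int count = cellCount[cell];
    int found = 0;

    // Test every pair of bodies that occupy this cell.
    for (int i = 0; i < count; ++i)
        for (int j = i + 1; j < count; ++j)
        {
            int a = pairs[start + i].y;
            int b = pairs[start + j].y;
            float4 sa = spheres[a];
            float4 sb = spheres[b];
            float dx = sb.x - sa.x, dy = sb.y - sa.y, dz = sb.z - sa.z;
            float rsum = sa.w + sb.w;
            if (dx * dx + dy * dy + dz * dz < rsum * rsum)   // spheres overlap
            {
                if (writePass)
                    contacts[writeOffsets[cell] + found] = Contact{ a, b };
                ++found;
            }
        }

    if (!writePass)
        contactCount[cell] = found;   // consumed by the parallel scan
}

Launching the same kernel twice, once to count and once to write, avoids the need for dynamic allocation or atomics on the output array.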

The Distributed Array pattern seems to be tied to the geometric decomposition pattern. My data structure was not physically distributed, but each block of threads worked on a different part of the array.
