ME964

High Performance Computing for Engineering Applications

Assignment 2

Date Assigned: September 11, 2008

Date Due: September 18, 2008

Problem 1.

a) Follow the instructions from cudaInstallationInstructionsWinXP.doc to install CUDA on your XP machine.

b) Follow the instructions from cudaSetupInctructionsVC8.doc.

Please use Visual Studio 2005 Professional, which uses VC8. If you do not have Visual Studio 2005 Professional, please download it for free by following the instructions given at

http://sbel.wisc.edu/Forum/index.php?topic=4.0.

c) Unzip HW_HelloWorld.zip into <Proj_Dir>\NVIDIA CUDA SDK\projects

d) Build using the EmuRelease or EmuDebug configuration. You should see 16 emulated threads print "Hello World!" to the screen.

e) Build using the Debug configuration and observe that the code does not compile because of the printf statement. Remove the printf statement and build again using the Debug configuration. Confirm that the build succeeds.

For this problem, answer the following question:

How can you modify the code so that you can switch from EmuDebug to Debug and back to EmuDebug without modifying any lines of code? In other words, printf should work in emulation mode, while in non-emulation mode the code should still compile. With such a modification, step e) above would be unnecessary.

Hint: Consider using the __DEVICE_EMULATION__ preprocessor macro; see the CUDA Programming Guide, available on the class website.
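As a generic illustration of the mechanism the hint points to, the sketch below shows how a preprocessor conditional can compile a printf call in or out depending on whether a macro is defined at build time. It is plain host-side C++ and is not meant as the answer to the question: the macro name HW_TRACE_ENABLED and the function report_thread are hypothetical and exist only for this example.

```cpp
#include <cstdio>

// HW_TRACE_ENABLED is a hypothetical macro used only for this illustration.
// Define it at build time to compile the printf call in; leave it
// undefined to compile the call out entirely.

// Returns 1 if the trace line was compiled in, 0 otherwise.
int report_thread(int thread_id)
{
#ifdef HW_TRACE_ENABLED
    std::printf("Hello World! from thread %d\n", thread_id);
    return 1;
#else
    (void)thread_id;  // avoid an unused-parameter warning
    return 0;
#endif
}
```

Because the choice is made by the preprocessor, the excluded branch never reaches the compiler, so a call that is illegal in one configuration cannot break the build in that configuration.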

Problem 2.

The purpose of this exercise is to get you started with debugging CUDA code.

a) Copy everything under the directory <Proj_Dir>\NVIDIA CUDA SDK\projects\matrixMul into a newly created directory called <Proj_Dir>\NVIDIA CUDA SDK\projects\HW2matrixMul so that you keep the original directory unchanged in case you want to go back and get a fresh copy of it.

b) Edit the file matrixMul.h so that the dimensions of the matrices you work with will be:

#define WA (BLOCK_SIZE) // Matrix A width

#define HA (BLOCK_SIZE) // Matrix A height

#define WB (BLOCK_SIZE) // Matrix B width
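To see why these dimensions are convenient for debugging, the following host-side sketch works out the resulting launch shape. It assumes that BLOCK_SIZE is 16, that matrixMul.h derives HB, WC, and HC from the three macros above, and that the sample launches one BLOCK_SIZE by BLOCK_SIZE thread block per tile of the result matrix; check these assumptions against your copy of the SDK. The LaunchShape struct and launch_shape function are hypothetical helpers for this example.

```cpp
// Assumed values; verify them against matrixMul.h in your SDK copy.
#define BLOCK_SIZE 16
#define WA (BLOCK_SIZE) // Matrix A width
#define HA (BLOCK_SIZE) // Matrix A height
#define WB (BLOCK_SIZE) // Matrix B width
#define HB WA           // Matrix B height (assumed derived definition)
#define WC WB           // Matrix C width  (assumed derived definition)
#define HC HA           // Matrix C height (assumed derived definition)

// Hypothetical helper type describing the kernel launch.
struct LaunchShape {
    int grid_x;
    int grid_y;
    int threads_per_block;
    int total_threads;
};

// Computes the grid and block sizes implied by the macros above.
LaunchShape launch_shape()
{
    LaunchShape s;
    s.grid_x = WC / BLOCK_SIZE;                    // blocks along x
    s.grid_y = HC / BLOCK_SIZE;                    // blocks along y
    s.threads_per_block = BLOCK_SIZE * BLOCK_SIZE; // threads in one block
    s.total_threads = s.grid_x * s.grid_y * s.threads_per_block;
    return s;
}
```

Under these assumptions the kernel runs as a single block of 256 threads, so each (threadIdx.x, threadIdx.y) pair identifies exactly one thread.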

c) Build an EmuDebug version of the code to generate the executable <Proj_Dir>\NVIDIA CUDA SDK\bin\Win32\EmuDebug\matrixMul.exe (Note: the DevStudio project places the executable in that directory for you by default).

d) Set a conditional breakpoint at line 107 of the file matrixMul_kernel.cu (see Figure 1, where the breakpoint is set on line 106 of the same file). Make the debugger stop at this line only when the thread executing the code has threadIdx.x equal to 3 and threadIdx.y equal to 7 (in Figure 1, the condition is threadIdx.x equal to 10 and threadIdx.y equal to 3; see the “Watch 1” panel in Figure 1).

Note that this source code runs on the GPU when it is compiled in Release or Debug mode. You can set a breakpoint here only because the code runs in emulation mode, in which Windows threads are launched to emulate the threads that would normally run on the GPU. You can see the army of threads that are launched in the “Threads” panel in Figure 1. Launch a debug session on this executable.
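The breakpoint condition is an ordinary Boolean expression over threadIdx. The host-side sketch below writes that expression as a function and counts how many threads of one square block satisfy it. The Index2 struct is a hypothetical stand-in for the built-in threadIdx, and the block edge of 16 used in the usage note assumes BLOCK_SIZE is 16.

```cpp
// Hypothetical stand-in for CUDA's built-in threadIdx; host-side only.
struct Index2 {
    int x;
    int y;
};

// The same Boolean expression you would give the debugger as the
// breakpoint condition, wrapped in a function so it can be checked here.
bool breakpoint_condition(const Index2& threadIdx)
{
    return threadIdx.x == 3 && threadIdx.y == 7;
}

// Counts the threads in a block_dim x block_dim block that satisfy the condition.
int count_hits(int block_dim)
{
    int hits = 0;
    for (int y = 0; y < block_dim; ++y) {
        for (int x = 0; x < block_dim; ++x) {
            Index2 t = { x, y };
            if (breakpoint_condition(t)) {
                ++hits;
            }
        }
    }
    return hits;
}
```

Calling count_hits(16) returns 1: the condition selects one thread per block. Since the condition does not mention blockIdx, a launch with several blocks would stop once in each block.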

Figure 1. Debug session using the CUDA emulation mode.

For this problem, include in your report a snapshot of your debug session that shows:

i)  The calling stack (see panel “Call Stack” in Figure 1)

ii)  The modules loaded for the execution of the debug session (see panel “Modules” in Figure 1)

iii)  The list of breakpoints you have defined in your debug session (see panel “Breakpoints” in Figure 1)

iv)  The collection of local variables in use (see panel “Locals” in Figure 1)

v)  A collection of “watched” variables that displays exactly the same variables currently displayed in panel “Watch 1” of Figure 1.

vi)  The source file where the execution is stopped due to the breakpoint.

To summarize, you should generate the analog of Figure 1 for the case when threadIdx.x is equal to 3 and threadIdx.y is equal to 7.

Problem 3.

The purpose here is to get familiar with the CUDA profiler.

a) Download the CUDA Profiler from http://developer.download.nvidia.com/compute/cuda/2.0-Beta2/profiler/CudaVisualProfiler_windows_1.0_13June08.zip

Extract the archive into the directory <Proj_Dir>\NVIDIA CUDA SDK\.

b) Edit the file matrixMul.h so that the dimensions of the matrices you work with will be:

#define WA (80 * BLOCK_SIZE) // Matrix A width

#define HA (80 * BLOCK_SIZE) // Matrix A height

#define WB (80 * BLOCK_SIZE) // Matrix B width
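For a sense of the problem size being profiled, the host-side sketch below computes the number of thread blocks, the storage for one matrix, and the floating-point operation count implied by these macros. It assumes that BLOCK_SIZE is 16, that HB, WC, and HC are derived in matrixMul.h as shown, that the matrices hold 4-byte single-precision floats, and that one thread block computes one BLOCK_SIZE by BLOCK_SIZE tile of the result. The three helper functions are hypothetical.

```cpp
// Assumed values; verify them against matrixMul.h in your SDK copy.
#define BLOCK_SIZE 16
#define WA (80 * BLOCK_SIZE) // Matrix A width
#define HA (80 * BLOCK_SIZE) // Matrix A height
#define WB (80 * BLOCK_SIZE) // Matrix B width
#define HB WA                // Matrix B height (assumed derived definition)
#define WC WB                // Matrix C width  (assumed derived definition)
#define HC HA                // Matrix C height (assumed derived definition)

// Number of thread blocks, assuming one block per tile of C.
int num_blocks()
{
    return (WC / BLOCK_SIZE) * (HC / BLOCK_SIZE);
}

// Bytes needed to store one matrix of single-precision floats.
unsigned long bytes_per_matrix()
{
    return (unsigned long)WC * HC * sizeof(float);
}

// Floating-point operations for C = A * B, counting each multiply
// and each add separately (2 * HA * WA * WB).
double flop_count()
{
    return 2.0 * HA * WA * WB;
}
```

Under these assumptions each matrix is 1280 by 1280, the grid holds 6400 blocks, and one matrix occupies 6,553,600 bytes. Dividing flop_count() by the kernel's GPU time from the profiler gives a rough throughput estimate.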

c) Build a Release version of the code to generate the executable <Proj_Dir>\NVIDIA CUDA SDK\bin\Win32\Release\matrixMul.exe (Note: the DevStudio project places the executable in that directory for you by default).

d) Start the CUDA profiler by double-clicking the “cudaprof” executable in the “bin” directory of the archive you just extracted. Start a new project and give it a name and a location. Profile the executable you just built, matrixMul.exe. My “Session Settings” in the profiler look like this:

Figure 2. Profiler’s “Session settings”

e) At the end of the profiling session, you should get a window that looks something like this:

For this problem, answer the following questions:

i)  Explain each entry in this 4 by 4 table.

ii)  Click on the “GPU Time Summary Plot”. Include in your homework a picture of this plot and explain its meaning.

iii)  Click on the “GPU Time Height Plot”. Include in your homework a picture of this plot and explain its meaning.

iv)  Click on the “GPU Time Width Plot”. Include in your homework a picture of this plot and explain its meaning. (Ignore the warning that this operation produces.)

v)  Click on the “Configuration” tab in the “Session Settings” dialog box (see Figure 2). You’ll see a set of 11 “Profiler Counters” that you can select: gld uncoalesced, gld coalesced, etc. Explain the meaning of each of them. You might want to use the profiler manual at http://developer.download.nvidia.com/compute/cuda/2.0-Beta2/docs/CudaVisualProfiler_README_1.0_13June08.txt

NOTE: Please report any installation or compilation issues on the Forum. This will help us quickly identify and resolve any problems.
