General-Purpose Computing on Graphics Processing Units
Scott Goble
Department of Computer Science
University of Wisconsin-Platteville
Abstract
Graphics Processing Units (GPUs) are designed to offload vertex calculations from the CPU. Using stream processors, GPUs can efficiently handle many vertex calculations at once. Since the introduction of GPUs, programmers have been researching the possibility of using them for non-graphical work. While there was some success using OpenGL libraries, these libraries only allowed very specific calculations to be done. In 2006, GPUs began to be designed for GPGPU by giving programmers access to the lower-level instruction sets. In the following years, several APIs were developed to take full advantage of GPU architecture. OpenCL quickly became the industry standard for GPGPU development due to its platform independence. Using OpenCL, programmers are able to enhance their programs by offloading computations onto GPUs and other OpenCL-supported devices. Today OpenCL can be found in programs such as Folding@Home, SETI@Home, and Bitcoin mining, where GPGPU code shows drastic improvements over traditional CPU code.
What is GPGPU?
Introduction to GPGPU
General-Purpose Computing on Graphics Processing Units (GPGPU) is the act of using Graphics Processing Units (GPUs) to process work other than 3D graphics. GPUs consist of many stream processors, which are designed to compute the vertices and shaders of many objects at once. Because these stream processors excel at parallel computations, programmers have been looking for ways to use them for non-graphical work. Before the latest APIs and GPUs, methods in OpenGL were used to pass instructions to GPUs. Using OpenGL showed promising returns in performance; however, the applications were severely limited by the input and output of OpenGL. Today, however, we have four main competing APIs: NVIDIA’s CUDA, ATI’s Stream, Microsoft’s DirectCompute, and OpenCL. These APIs were developed to overcome OpenGL’s limitations for non-graphical computation. Programmers now have much more control over what they can process on a GPU, as well as increased efficiency from not having to translate OpenGL variables. Currently, OpenCL is the dominant API. It is the only one that is platform independent, and all major GPU companies actively support it.
History of GPGPU
The very first GPGPU application was the Ikonas Graphics System, released in 1980. These systems were used in NASA and USAF cockpit simulations, and they processed some of the first ray tracing examples. While the Ikonas Graphics System may not share the architecture of current GPUs, it helped form the basis of offloading computations onto a pixel rendering system, which is what current GPGPU is based on.
Research and applications using GPGPU stagnated after Ikonas, as GPUs were not common and those available were slower than CPUs of the time. Starting in 2003, however, GPUs started to exceed CPUs in single-precision floating-point math. Modern GPGPU programs began to be developed to take advantage of the faster computations. Unfortunately, OpenGL was the only way to access GPUs at this time, which required additional layers of logic and often meant reworking algorithms to fit 3D graphics computations. In 2004, third-party APIs such as Brook and Sh were developed, allowing programmers to write GPGPU programs without having to use OpenGL operations directly. However, these APIs were an intermediary between the programmer and OpenGL: the programmer’s code was translated into OpenGL before being compiled. This resulted in programs running 10-20% slower than those using OpenGL directly.
In 2006, ATI introduced an API called Close To Metal (CTM). CTM was the first API that allowed direct access to GPU functions. Programs switching from OpenGL to CTM ran up to 8 times faster. In 2007, NVIDIA revealed CUDA, a C-like SDK allowing computations on both CPUs and GPUs. ATI also released its Stream SDK, which used CTM as its base. Unlike NVIDIA, whose CUDA worked on consumer GPUs, ATI did not originally support Stream on its GPUs. This, combined with its language of choice, Brook+, led to the increased popularity of CUDA as the GPGPU API of choice. In 2008, Apple proposed OpenCL to the Khronos Group. For five months the group collaborated on technical details, and it finally released the specification in November 2008. Due to its platform independence, OpenCL quickly became popular among developers. In 2009, Microsoft released its DirectCompute API together with DirectX 11. While DirectCompute works on any GPU that supports DirectX 10 or 11, there has not been much community or developer support outside of Microsoft so far.
Exploring GPGPU concepts using OpenCL
About OpenCL
OpenCL was originally developed by Apple and proposed to the Khronos Group in 2008. It is currently the industry-standard cross-platform GPGPU API. The OpenCL programming language is based on C99, an older version of C, with a few modifications. OpenCL does not support recursion, pointers to functions, or variable-length arrays. It does provide support and functions for easy parallelism and work-groups. It also adds memory qualifiers, such as global, constant, and private. As OpenCL is based on C, it integrates well with C/C++ applications. Wrappers have been made for Java, JavaScript, Python, and C# as well. JavaScript also has native OpenCL support under the Khronos Group’s WebCL. [6]
OpenCL Architecture
OpenCL’s architecture can be split up into four different models: the Platform Model, the Memory Model, the Execution Model, and the Programming Model. These models are placed in a hierarchical order, where the Programming Model is based on the Execution Model, which is based on the Memory Model, and so on. OpenCL is designed to run on an OpenCL device, which is defined as a device that is able to run OpenCL kernels. While OpenCL devices are primarily GPUs, GPGPU code and applications will run on any device that has been given support for OpenCL. [6]
Platform Model
The platform model consists of a host device connected to at least one OpenCL device. OpenCL devices are then divided into Compute Units, which are then further divided into Processing Elements as demonstrated in Figure 1.
Figure 1: The Host sends commands to many OpenCL devices. Each of these devices is further divided into Compute Units, which themselves are divided into Processing Elements.
OpenCL applications are divided into two main parts: the host code and the device kernel code. The host code runs on a primary processor and submits the kernel code as commands to the OpenCL devices. The OpenCL device then takes the kernel code and runs it on the Processing Elements. [5]
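The hierarchy described above can be pictured as a small data structure. The following is only an illustrative sketch in plain Python, not OpenCL API code: the class names, the device names `gpu0` and `cpu0`, and the unit counts are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeUnit:
    # Number of processing elements in this compute unit (made-up value).
    processing_elements: int


@dataclass
class Device:
    name: str
    compute_units: List[ComputeUnit] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        # A device's capacity is the sum over all of its compute units.
        return sum(cu.processing_elements for cu in self.compute_units)


@dataclass
class Host:
    # A host is connected to at least one OpenCL device.
    devices: List[Device] = field(default_factory=list)


def make_device(name: str, units: int, elements_per_unit: int) -> Device:
    """Build a hypothetical device with identical compute units."""
    return Device(name, [ComputeUnit(elements_per_unit) for _ in range(units)])


# Hypothetical system: one GPU-like device and one CPU-like device.
host = Host([make_device("gpu0", 4, 16), make_device("cpu0", 2, 4)])
totals = {d.name: d.total_processing_elements() for d in host.devices}
print(totals)  # {'gpu0': 64, 'cpu0': 8}
```

In this picture, the host code would own the `Host` object and submit kernels to a chosen `Device`, which spreads the work across the processing elements of its compute units.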
With OpenCL, programmers have two different choices for how computations are mapped onto an OpenCL device’s processing elements. The first choice is called converged control flow. This is where all of the processing elements of a given compute unit are processing the same statements. The second, diverged control flow, is when each processing element is processing different statements. Kernel code may switch between the two, leading to great flexibility in what is being processed on each processing element. [6]
OpenCL code can be compiled in two different ways, called online and offline. An online compiler is available during the run time of a host program. With an offline compiler, the kernel code is compiled outside of the host program, and the compiled code must then be called by the host program. This allows OpenCL code to be run as a module within another program. [5]
Based on the compiler type, OpenCL offers two different profiles. If the code uses an online compiler, either the full or the embedded profile may be used. Code using an offline compiler must use the embedded profile. The full profile supports the complete functionality of OpenCL, while the embedded profile reduces the functionality to allow for uncertainty about the conditions under which the code is run. [3]
Execution Model
As mentioned, OpenCL is executed on both host machines and OpenCL devices via kernels. A kernel’s execution is divided into work-items, which execute in work-groups. The kernel is executed within a context that is managed by the host device. This context includes devices, kernel objects, program objects, and memory objects. Devices are the OpenCL devices available to be used by the host. Kernel objects are the code that is run on the devices. Program objects are the program source and executables that implement the kernels. Memory objects are the memory that is visible to and usable by both the host and the OpenCL devices. This is the memory that kernels use during operation. [4]
OpenCL uses command-queues to manage OpenCL devices. Each OpenCL device has its own command-queue. There are three categories of commands that are placed into a queue:
· Kernel enqueue: These commands enqueue a kernel for execution on the OpenCL device.
· Memory: These commands manage data on the host and OpenCL device. Common uses include transferring data between the host and OpenCL device, mapping new memory allocations, and clearing old memory allocations for use.
· Synchronization: These commands are used to keep multiple OpenCL devices in sync.
Each device also has a local command-queue that can be accessed by kernels running on the device. Commands in both kinds of command-queue pass through the following states and transitions:
Figure 2: Both the host controlled command queue and the device’s local command queue follow the same state transitions.
A basic description of these states follows:
· Queued: The command is in the command-queue.
· Submitted: The command is flushed from the command-queue and is now submitted for execution on a device. Once in this state, the command will execute as long as any other prerequisites are met.
· Ready: All prerequisites have been met and the kernel is now placed into a work-group to be run.
· Running: The kernel is currently executing on a work-group.
· Ended: The kernel is now done running. All work-groups assigned to this kernel have finished.
· Complete: Finally, the state is set to complete, signaling that there was no problem with any command or child command of the kernel.
Command status is seen through Event Objects. When a command reaches Complete, the Event Object is set to Complete. Otherwise, if any errors occurred or the command was terminated in any way, the Event Object is set to a negative value. If this happens, the command-queue may be blocked from processing any more commands for the duration of the program. [6]
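The life cycle of a command and its Event Object can be sketched as a small state machine. This is a plain-Python model for illustration only, not the OpenCL API: the `Command` class, its method names, the command names, and the error code `-5` are all hypothetical.

```python
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    SUBMITTED = "submitted"
    READY = "ready"
    RUNNING = "running"
    ENDED = "ended"
    COMPLETE = "complete"


# Enum members iterate in definition order, which is the transition order.
ORDER = list(State)


class Command:
    """Hypothetical model of one command and its event status."""

    def __init__(self, name):
        self.name = name
        self.state = State.QUEUED
        self.error = None  # becomes a negative int on abnormal termination

    def advance(self):
        """Move to the next state; refuse if failed or already complete."""
        if self.error is not None:
            raise RuntimeError("command terminated abnormally")
        if self.state is State.COMPLETE:
            raise RuntimeError("command already complete")
        self.state = ORDER[ORDER.index(self.state) + 1]
        return self.state

    def fail(self, code):
        """Record an abnormal termination with a negative error code."""
        if code >= 0:
            raise ValueError("error codes must be negative")
        self.error = code

    def event_status(self):
        # The event reports the error if there is one, else the state.
        return self.error if self.error is not None else self.state.value


ok = Command("copy_buffer")
history = [ok.event_status()]
while ok.state is not State.COMPLETE:
    history.append(ok.advance().value)

bad = Command("broken_kernel")
bad.advance()   # queued -> submitted
bad.fail(-5)    # hypothetical error code
```

Here `history` records the six states in order for the successful command, while `bad.event_status()` returns the negative code and any further `advance()` call raises, mirroring a command that can no longer make progress.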
There are two different prerequisites that may delay a command from running. The first is a barrier from a previous command, which prevents all new commands from running until the previous set of commands has been completed, even if there are work-groups available for processing. The second is that a command may have a list of events that need to be completed before it runs. The command will not run until those event objects are set to the required state. [2]
A single command-queue can be run in one of two modes: in-order and out-of-order execution. In-order command-queues launch commands in the order in which they were entered into the queue. If the topmost command is waiting for prerequisites, all further commands must also wait for the top command to complete. Out-of-order command-queues allow commands to “skip” ahead of commands that are waiting for prerequisites. It is important to note that different command-queues within a host can be set to run in different modes, depending on the needs of the given command-queue. [2]
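The difference between the two modes can be seen in a toy scheduler. The sketch below is a simplified simulation under stated assumptions, not OpenCL code: at most one command launches per tick and completes immediately, each command carries a wait list of event names, and the function `launch_order` and the event name `ext` are invented for the example.

```python
def launch_order(commands, external_ready_at, in_order, max_ticks=100):
    """Return the order in which commands launch.

    commands: list of (name, wait_list) pairs in queue order.
    external_ready_at: maps an external event name to the tick at
        which it becomes complete.
    in_order: True for in-order mode, False for out-of-order mode.
    """
    pending = list(commands)
    done = set()
    order = []
    for tick in range(max_ticks):
        if not pending:
            break
        # External events that have completed by this tick.
        done.update(e for e, t in external_ready_at.items() if t <= tick)
        # In-order mode may only consider the head of the queue.
        candidates = pending[:1] if in_order else list(pending)
        for cmd in candidates:
            name, wait_list = cmd
            if all(w in done for w in wait_list):
                pending.remove(cmd)
                done.add(name)
                order.append(name)
                break  # at most one launch per tick
    return order


# A waits on an external event that completes at tick 2; C waits on B.
queue = [("A", ["ext"]), ("B", []), ("C", ["B"])]
ready = {"ext": 2}

in_order_result = launch_order(queue, ready, in_order=True)
out_of_order_result = launch_order(queue, ready, in_order=False)
print(in_order_result)      # ['A', 'B', 'C']
print(out_of_order_result)  # ['B', 'C', 'A']
```

In the in-order run, B and C sit idle until A's prerequisite is met. In the out-of-order run, B and C skip ahead of A, yet C still waits for B because of its own wait list.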
When an OpenCL command is executed, an index space on the device is defined. This index space determines which parts of the device the command will have access to. The device takes the kernel and runs its function at each point of the index space. Each of these points is called a work-item. The device also groups work-items within the index space into work-groups. Each work-item has several IDs used to access it. The first is a global ID, which is its general location on the device. Each work-group is given an ID on the device as well, and it gives sub-IDs (local IDs) to its work-items. Thus a given work-item can also be found from the work-group it is in and its local ID within that group. [6]
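The relationship between these IDs can be made concrete for a one-dimensional index space. The sketch below is a plain-Python illustration that assumes a zero global offset and a global size that is a multiple of the work-group size; the helper names `decompose`, `compose`, and `run_kernel` are made up here. In OpenCL C, a kernel would read the corresponding values with the built-in functions `get_global_id`, `get_group_id`, and `get_local_id`.

```python
def decompose(global_id, local_size):
    """Split a global ID into (work-group ID, local ID)."""
    return divmod(global_id, local_size)


def compose(group_id, local_id, local_size):
    """Recover the global ID from the work-group ID and local ID."""
    return group_id * local_size + local_id


def run_kernel(kernel, global_size, local_size):
    """Call kernel once per work-item, passing all three IDs."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    results = []
    for gid in range(global_size):
        group_id, local_id = decompose(gid, local_size)
        results.append(kernel(gid, group_id, local_id))
    return results


# A trivial "kernel" that just reports the IDs it was given.
ids = run_kernel(lambda g, grp, loc: (g, grp, loc),
                 global_size=8, local_size=4)
for row in ids:
    print(row)
```

With 8 work-items in groups of 4, global ID 5 is local ID 1 in work-group 1, and `compose(1, 1, 4)` gives 5 back. A real device runs the work-items in parallel rather than in a loop; the loop here only enumerates the index space.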
Work-groups operate independently of each other. Outside of host and global memory, each work-group has its own set of resources to work with. To ensure that multiple work-groups stay synchronized, barriers and synchronization commands must be used, as it is not safe to assume that all work-groups will be processed at the same time and speed. Even within a work-group, a programmer cannot expect perfectly parallel execution between work-items. Particular care needs to be taken to ensure that all running and enqueued commands handle synchronization safely, as there are many areas in OpenCL that must be synchronized with each other. [2]
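Why a barrier is needed even inside a single work-group can be shown with threads standing in for work-items. This is a minimal sketch that uses Python's `threading.Barrier` as an analogy, not OpenCL itself; the function `run_work_group` and the doubling computation are assumptions for the example. The rough OpenCL C counterpart of the barrier call would be `barrier(CLK_LOCAL_MEM_FENCE)`.

```python
import threading


def run_work_group(values):
    """Simulate one work-group; each thread plays one work-item."""
    n = len(values)
    local_mem = [None] * n   # stands in for the group's local memory
    out = [None] * n
    barrier = threading.Barrier(n)

    def work_item(local_id):
        # Phase 1: every work-item writes its own slot.
        local_mem[local_id] = values[local_id] * 2
        # No work-item continues until all of them have written.
        barrier.wait()
        # Phase 2: it is now safe to read a neighbor's slot.
        neighbor = (local_id + 1) % n
        out[local_id] = local_mem[local_id] + local_mem[neighbor]

    threads = [threading.Thread(target=work_item, args=(i,))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


result = run_work_group([1, 2, 3, 4])
print(result)  # [6, 10, 14, 10]
```

Without the `barrier.wait()` call, a fast work-item could read its neighbor's slot before the neighbor had written it, since the threads are not guaranteed to run in lockstep.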
Synchronization is handled by synchronization commands. These commands use synchronization points to assist in managing different command-queues, work-groups and work-items. OpenCL supports the following synchronization points:
· Launching a command: The kernel is launched onto a device and all prerequisites have been set to COMPLETE