Lately, the focus of my research has been on a new programming language called Harlan (that link is for IU students only, sorry), which is a high-level language for GPU programming. One important task in this project has been forming a reasonable mental model of how GPUs actually work. I’ve recently come to the conclusion that the programming models exposed by CUDA and OpenCL make it almost impossible to form a clear picture of what is actually happening in the hardware.
The CUDA documentation gives the impression that an NVIDIA GPU is a mystical processor that is capable of running thousands of threads at once. This leads to a unique programming model. Suppose you want to add two vectors of 10,000 elements together. All you have to do on a GPU is spawn 10,000 threads, and each thread adds one element of the vector. If we wanted each thread to run the function add_vector, we could simply do this:
int block_size = ???;
int num_blocks = ???;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);
This code adds vectors x and y, each of length n, and stores the result in z. Of course, we have already run into some complications. What should block_size and num_blocks be? CUDA partitions all of your threads into a grid of blocks, and each block has a certain number of threads. You can have basically as many blocks as you want, but for some reason the block size (or number of threads per block) cannot be more than 1024.
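To make this concrete, here is a minimal sketch of what the whole thing might look like. The kernel body, the float element type, and the choice of 256 threads per block are my own (256 is just a common convention); CUDA only requires that the block size stay at or below the 1024-thread limit:

// Each thread computes one element; threads that land past the end do nothing.
__global__ void add_vector(int n, const float *x, const float *y, float *z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = x[i] + y[i];
}

void launch_add_vector(int n, const float *x, const float *y, float *z) {
    int block_size = 256;                               // threads per block, must be <= 1024
    int num_blocks = (n + block_size - 1) / block_size; // round up so all n elements are covered
    add_vector<<<num_blocks, block_size>>>(n, x, y, z);
}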
What quickly becomes clear is that these so-called thousands of cores that your GPU has are not the same as cores on a CPU. For example, we hear that at least some of these cores execute in lock step, meaning they must execute the exact same instructions at the same time. Not all threads run in lock step with each other, though; if they did, there would be no need for the __syncthreads() function, which lets you synchronize the threads within a block. Besides grouping threads into blocks, some kernels also make use of the fact that threads are further subdivided into warps of up to 32 threads. The question is, how do these concepts map onto hardware?
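Before getting to the hardware, here is a rough sketch of __syncthreads() in use. The kernel is my own example, assuming the block size is a power of two: each block sums its slice of the input in shared memory, and the barrier keeps threads from reading values their neighbors have not written yet.

// One partial sum per block; launch with block_size * sizeof(float) bytes of shared memory.
__global__ void block_sum(int n, const float *in, float *block_sums) {
    extern __shared__ float scratch[];       // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // every thread has stored its value

    // Tree reduction: half the threads add, then half of those, and so on.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();                     // no thread may race ahead to the next round
    }

    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = scratch[0];
}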
A look at the Fermi Architecture Whitepaper shows that NVIDIA’s Fermi processors are made up of some number of Streaming Multiprocessors (SMs), which each have 32 CUDA cores. The Wikipedia page shows that different boards within the Fermi series have a different number of SMs. One of the new features of the Fermi architecture is the GigaThread™ Thread Scheduler, which apparently provides 10x faster context switching. At SC11, I heard one NVIDIA employee claim that context switching was free.
GPUs are Vector Processors
To me, rather than thinking of GPUs in terms of grids and blocks and warps, it’s best to think of them as vector processors. Vector processors are CPUs that are designed around Single Instruction, Multiple Data (SIMD) instructions. Typical desktop CPUs contain SIMD extensions, such as Intel’s AVX instructions, which allow them to perform some vector operations efficiently, but their focus is still on low latency execution of scalar code. By contrast, vector processors expect most of the computation to be expressed in terms of vector operations, and are optimized to perform these operations as quickly as possible, perhaps even at the expense of scalar performance.
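For comparison, here is roughly what the same vector addition looks like when written against a CPU’s SIMD extensions. This is my own host-side sketch using Intel’s AVX intrinsics, assuming float data; each instruction operates on eight elements at a time, with a scalar loop to mop up the leftovers:

#include <immintrin.h>

void add_vector_avx(int n, const float *x, const float *y, float *z) {
    int i = 0;
    // Vector part: 8 floats per iteration using 256-bit AVX registers.
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(vx, vy));
    }
    // Scalar cleanup for whatever does not fill a full vector.
    for (; i < n; i++)
        z[i] = x[i] + y[i];
}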
Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on these vector registers as warps. A warp appears to be 32 threads because each instruction operates on 32 lanes at once, and the threads must proceed in lock step because they are really a single stream of instructions.
Now, how do CUDA blocks fit with this view? These blocks seem to correspond to a set of warps executing on a single SM. Although an SM is a single core, it can run multiple threads through simultaneous multithreading, or HyperThreading as Intel calls it. Under HyperThreading, two threads can be assigned to a single processor core. The processor then multiplexes resources between the two threads. For example, if one thread is blocked on a memory operation, the CPU can execute instructions from the other thread while the first one waits on memory. Switching between these threads is basically free; it’s just a matter of assigning ready work to available hardware resources. In terms of CUDA blocks, if we divide the maximum number of threads per block (1024) by the number of threads per warp (32), we end up with 32. This suggests that each SM is able to keep 32 thread (or warp) contexts around at once, and swap between them easily as execution units and data become available.
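These numbers are easy to check from the CUDA runtime. The query below uses the real device-properties API; the labels in the comments are just the mental model from this post (warp width as vector width, threads per block divided by warp size as contexts), not anything NVIDIA’s documentation promises:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("SMs (cores):              %d\n", prop.multiProcessorCount);
    printf("Warp size (vector width): %d\n", prop.warpSize);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Warps per maximal block:  %d\n", prop.maxThreadsPerBlock / prop.warpSize);
    return 0;
}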
In summary, we can think of a Fermi GPU as a multicore processor, where each core does 32-way HyperThreading and supports 32-wide vector instructions.
In order to really verify that this is the case, it would be helpful to see the actual Fermi instruction set. NVIDIA is very secretive about this, publishing only a virtual instruction set, PTX, instead. This is understandable, as it means NVIDIA does not have to maintain backwards compatibility between GPU revisions. However, AMD does provide documentation for the actual instruction set of their GPUs. After briefly perusing their latest documentation, it seems that their instruction set is consistent with the idea of GPUs as vector processors.