A very basic implementation of the simplified (single-loop approximation) SIMT core described in General-Purpose Graphics Processor Architectures by Aamodt, Fung, and Rogers (2018).
Features:
- RISC-style ISA with only the instructions needed to demonstrate basic divergence/reconvergence handling
- divergence/reconvergence is handled via explicitly encoded (usually compiler-inserted) reconvergence points, similar to NVIDIA's barrier-branch mechanism, but simpler
- a small warp/wavefront size of 8 to keep the output readable
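As an illustration of how explicit reconvergence points can be tracked, here is a minimal Python sketch of a textbook-style per-warp SIMT stack. It is not taken from this repository: the names SimtStack, StackEntry, branch, and advance are hypothetical, and the actual core may use a simpler single-level scheme. The only values borrowed from this README are the warp size of 8 and the PCs in the usage lines.

```python
from dataclasses import dataclass
from typing import Optional

WARP_SIZE = 8                      # warp size used in this README
FULL_MASK = (1 << WARP_SIZE) - 1   # all 8 threads active


@dataclass
class StackEntry:
    rpc: Optional[int]  # reconvergence PC (None for the base entry)
    pc: int             # next PC these threads will execute
    mask: int           # active threads, bit i = thread i (assumption)


class SimtStack:
    """Hypothetical per-warp reconvergence stack (illustration only)."""

    def __init__(self, start_pc=0):
        self.entries = [StackEntry(rpc=None, pc=start_pc, mask=FULL_MASK)]

    @property
    def top(self):
        return self.entries[-1]

    def advance(self, next_pc):
        """Move the top entry to next_pc and pop entries that reached their RPC."""
        self.top.pc = next_pc
        while len(self.entries) > 1 and self.top.pc == self.top.rpc:
            self.entries.pop()

    def branch(self, taken_mask, target_pc, fallthrough_pc, rpc):
        """Apply a branch whose reconvergence point rpc is encoded in the instruction."""
        cur = self.top
        taken = taken_mask & cur.mask
        not_taken = cur.mask & ~taken
        if not_taken == 0:           # every active thread takes the branch
            self.advance(target_pc)
            return
        if taken == 0:               # no active thread takes the branch
            self.advance(fallthrough_pc)
            return
        # Divergence: the current entry resumes at the reconvergence point,
        # and each path that is not already there gets its own entry.
        cur.pc = rpc
        for pc, mask in ((target_pc, taken), (fallthrough_pc, not_taken)):
            if pc != rpc:
                self.entries.append(StackEntry(rpc=rpc, pc=pc, mask=mask))


# Usage: threads 4..7 take a branch at 0x10 that skips one instruction.
stack = SimtStack(start_pc=0x10)
stack.branch(taken_mask=0b11110000, target_pc=0x20, fallthrough_pc=0x18, rpc=0x20)
print(f"{stack.top.mask:08b} {stack.top.pc:04x}")  # 00001111 0018
stack.advance(0x20)                                # fall-through path reaches the RPC
print(f"{stack.top.mask:08b} {stack.top.pc:04x}")  # 11111111 0020
```

Because the taken path in the usage lines already targets the reconvergence point, only the fall-through path needs its own entry, and the full mask is restored as soon as that entry is popped.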
TEST_PROG executes something akin to this pseudo-CUDA program:
    __device__ void foo() {
        if (threadIdx.x <= 3) {
            printf("%d\n", threadIdx.x);
        }
    }

This demonstrates the basics of masking, divergence, and reconvergence.
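To make the masking step concrete, the following Python sketch evaluates the condition threadIdx.x <= 3 for each of the 8 threads and packs the results into an active mask. The helper name active_mask is hypothetical, and the bit layout (bit i = thread i, thread 0 printed rightmost) is an assumption chosen to match the ACTIVE column in the simulation output.

```python
WARP_SIZE = 8  # warp size used in this README


def active_mask(predicate, warp_size=WARP_SIZE):
    """Build a bitmask with bit `tid` set for every thread where predicate(tid) holds."""
    mask = 0
    for tid in range(warp_size):
        if predicate(tid):
            mask |= 1 << tid
    return mask


mask = active_mask(lambda tid: tid <= 3)
print(f"{mask:08b}")  # 00001111

# Only threads whose bit is set execute the body of the `if`.
active_threads = [tid for tid in range(WARP_SIZE) if (mask >> tid) & 1]
print(active_threads)  # [0, 1, 2, 3]
```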
Actual instructions:
LID 0 // load thread id into register 0
LI 1, 0x3 // load 0x3 into register 1
BGT 0, 1, 0x10, 0x10 // jumps over the PRINT if register 0 > register 1
PRINT 0
NOP
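The listing does not spell out how the two 0x10 operands of BGT are interpreted. One reading that is consistent with the simulation output (the branch sits at PC 0010 and the RPC column shows 0020 afterwards) is that both are offsets relative to the branch's own PC: the first is the jump target, the second is the reconvergence point. The Python sketch below works through that arithmetic. The function resolve_bgt and the offset semantics are assumptions, not something confirmed by the source.

```python
INSTR_BYTES = 8  # the PC advances by 8 per instruction in the simulation output


def resolve_bgt(branch_pc, target_off, rpc_off):
    """Assumed decoding: both operands are offsets from the branch's own PC."""
    return branch_pc + target_off, branch_pc + rpc_off


branch_pc = 2 * INSTR_BYTES  # BGT is the third instruction, so it sits at 0x10
target_pc, rpc = resolve_bgt(branch_pc, 0x10, 0x10)
print(f"target={target_pc:04x} rpc={rpc:04x}")  # target=0020 rpc=0020
```

Under this reading, threads that take the branch land directly on the reconvergence point, so only the fall-through path (the PRINT at 0x18) runs with a reduced mask.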
Running the simulation outputs:
CORE RUNNING:
ACTIVE 11111111 PC 0000 RPC 0000 LID 0
ACTIVE 11111111 PC 0008 RPC 0000 LI 1, 0x3
ACTIVE 11111111 PC 0010 RPC 0000 BGT 0, 1, 0x10, 0x10
ACTIVE 00001111 PC 0018 RPC 0020 PRINT 0
Thread 0: r0 = 0
Thread 1: r0 = 1
Thread 2: r0 = 2
Thread 3: r0 = 3
ACTIVE 11111111 PC 0020 RPC 0020 NOP
ACTIVE 11111111 PC 0028 RPC 0020 NOP
ACTIVE 11111111 PC 0030 RPC 0020 NOP
...