Parallel Programming Models for Real-Time Graphics
Aaron Lefohn
Intel Corporation
Beyond Programmable Shading Course, ACM SIGGRAPH 2011
Hardware Resources
Core
Execution context
SIMD functional units
On-chip memory
CPU-GPU System-on-a-Chip
Abstraction
Abstraction enables portability and system optimization
E.g., dynamic load balancing, SIMD utilization, producer-consumer
Lack of abstraction enables arch-specific programmer optimization
E.g., multiple execution contexts jointly building on-chip data structure
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Execution Definitions
Execution context
The state required to execute an instruction stream: instruction pointer, registers, etc.
(aka thread)
Work
A logically related set of instructions executed in a single execution context
(aka shader, instance of a kernel, task)
Concurrent execution
Multiple units of work that may execute simultaneously
(because they are logically independent)
Parallel execution
Multiple units of work whose execution contexts are guaranteed to be live simultaneously
(because you want them to be for locality, synchronization, etc)
Synchronization
Synchronization between execution contexts
Enables inter-context communication
Restricts when work is permitted to execute
The granularity of permitted synchronization determines the granularity at which the system allows the programmer to control scheduling
Vertex Shaders: Pure Data Parallelism
Execution
Concurrent execution of identical per-vertex work
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy
What synchronization is allowed?
Between draw calls
Pure Data-parallel Pseudocode
concurrent_for( i = 1 to numVertices ) {
    // Execute vertex shader
}
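As a rough CPU-side sketch of what a concurrent_for means, the loop below applies a toy per-vertex function across worker threads; the names (shadeVertex, concurrentForVertices) are illustrative, not part of any real API, and the strided partition is just one valid schedule since the iterations are independent.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy "vertex shader": scale a position by 2 (stand-in for real shading work).
static float shadeVertex(float position) { return position * 2.0f; }

// A minimal concurrent_for: split the vertex range across worker threads.
// Because iterations are logically independent, any interleaving is valid.
void concurrentForVertices(std::vector<float>& verts, unsigned numWorkers) {
    std::vector<std::thread> workers;
    const std::size_t n = verts.size();
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&verts, n, w, numWorkers] {
            for (std::size_t i = w; i < n; i += numWorkers)  // strided slice
                verts[i] = shadeVertex(verts[i]);
        });
    }
    for (auto& t : workers) t.join();  // implicit sync, as between draw calls
}
```

Note that the only synchronization is the join at the end, mirroring the "between draw calls" restriction above.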
Conventional Thread Parallelism
Execution
Parallel execution of N different units of work on N execution contexts
Parallel execution of M identical units of work on M-wide SIMD functional unit
What is abstracted?
Nothing (ignoring preemption)
Where is synchronization allowed?
Between any execution contexts, at various granularities
Conventional Thread Parallelism
CPU
Launch a pthread per hardware execution context
GPU
Persistent threads
Launch a work-group per hardware execution context, sized to the HW SIMD width
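The CPU pattern above (one thread per hardware execution context) can be sketched with std::thread in place of pthreads; sumWithOneThreadPerContext is a hypothetical name, and the strided partition and single end-of-job synchronization are one simple way to fill every context with work.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Launch one long-lived worker per hardware execution context, each
// processing its own slice of the data, synchronizing only at the end.
unsigned long sumWithOneThreadPerContext(const std::vector<int>& data) {
    unsigned numContexts = std::thread::hardware_concurrency();
    if (numContexts == 0) numContexts = 1;      // the hint may be unavailable
    std::atomic<unsigned long> total{0};
    std::vector<std::thread> contexts;
    for (unsigned c = 0; c < numContexts; ++c) {
        contexts.emplace_back([&, c] {
            unsigned long local = 0;            // accumulate privately...
            for (std::size_t i = c; i < data.size(); i += numContexts)
                local += data[i];
            total += local;                     // ...synchronize once
        });
    }
    for (auto& t : contexts) t.join();
    return total;
}
```

Nothing is abstracted here: the programmer chooses the thread count, the data partition, and every synchronization point.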
D3D/OpenGL Rendering Pipeline
Execution
Concurrent execution of identical work within each shading stage
Concurrent execution of different shading stages
Each stage spawns work to the next stage
No parallelism exposed to user
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs, etc.)
Where is synchronization allowed?
Between draw calls
Abstracting SIMD ALUs
Explicit SIMD Programming
float16 a = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 b = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 c = a + b;
Mechanisms
Intrinsics
Assembly
Wide vector types
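The "wide vector types" mechanism can be sketched with the GCC/Clang vector_size extension, which mirrors the float16 syntax above at 4-wide; this assumes a GCC-compatible compiler, and float4/addLanes are illustrative names.

```cpp
#include <cassert>

// A 4-wide vector of floats; arithmetic on it maps to SIMD instructions
// where the target supports them (GCC/Clang extension, not standard C++).
typedef float float4 __attribute__((vector_size(16)));

float4 addLanes(float4 a, float4 b) {
    return a + b;   // element-wise add across all four lanes at once
}
```

With intrinsics or assembly the programmer names the exact instructions instead; the wide-type style leaves instruction selection to the compiler while keeping the SIMD width explicit.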
SPMD/Implicit SIMD Programming
parallel_for( i = 1 to SIMD_width ) {
    // Per-lane code goes here
}

concurrent_for( i = 1 to someBigNumber ) {
    // Per-lane code goes here
}
SPMD/Implicit SIMD Programming
GPU
Current GPU programming models are always SPMD
CPU
Intel SPMD Program Compiler (ISPC)
SPMD combined with other abstractions:
OpenCL (some implementations), Intel Array Building Blocks
Abstracting Cores and Execution Contexts
Task Systems (Cilk, TBB, ConcRT, GCD, …)
Execution
Concurrent execution of many (likely different) units of work
Work runs in a single execution context
What is abstracted?
Cores and execution contexts
Not abstracted: SIMD functional units or memory hierarchy
Where is synchronization allowed?
Between tasks
Task Pseudo Code
void myTask( some arguments ) {
    // Task work goes here
}

void main() {
    for( i = 0 to NumTasks - 1 ) {
        spawn myTask();
    }
    sync;
    // More work
}
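The spawn/sync pattern above can be sketched on a CPU with std::async standing in for spawn and future::get for sync; runTasks and tasksRun are illustrative names, not part of any task-system API.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

std::atomic<int> tasksRun{0};   // observable side effect of each task

// Spawn NumTasks independent tasks, then "sync" by waiting on all their
// futures before the code after the sync point is allowed to run.
void runTasks(int numTasks) {
    std::vector<std::future<void>> pending;
    for (int i = 0; i < numTasks; ++i)             // spawn myTask()
        pending.push_back(std::async(std::launch::async,
                                     [] { ++tasksRun; }));
    for (auto& f : pending) f.get();               // sync
    // "More work" runs here, after every task has completed.
}
```

A real task system (Cilk, TBB, etc.) would schedule tasks onto a fixed worker pool rather than a thread per task, but the spawn/sync structure is the same.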
Nested Task Pseudo Code
void barTask( some parameters ) {
}

void fooTask( some parameters ) {
    if( someCondition ) {
        spawn barTask();
    } else {
        spawn fooTask();
    }
}

void main() {
    concurrent_for( i = 0 to NumTasks - 1 ) {
        fooTask();
    }
    sync;
    // More code
}
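Nested spawning in the spirit of fooTask/barTask can be sketched as a recursive divide-and-conquer sum, where each task conditionally spawns a child task and syncs on it via future::get; parallelSum is an illustrative name and the leaf threshold is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// A task that spawns child tasks: split the range, hand one half to a
// spawned task, recurse into the other half, then sync on the child.
long parallelSum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 2)  // small problem: run as a leaf (the "barTask" case)
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;
    auto child = std::async(std::launch::async,        // spawn
                            parallelSum, std::cref(v), lo, mid);
    long right = parallelSum(v, mid, hi);              // recurse locally
    return child.get() + right;                        // sync on the child
}
```

Production task systems implement this with work stealing so the recursion does not create a thread per spawn; std::launch::async is used here only to keep the sketch self-contained.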
GPU Compute Pseudo Code
void myWorkGroup() {
    parallel_for( i = 0 to NumWorkItems - 1 ) {
        // GPU kernel code goes here
        // (This is where you write GPU compute code)
    }
}

void main() {
    concurrent_for( i = 0 to NumWorkGroups - 1 ) {
        myWorkGroup();
    }
    sync;
}
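A CPU emulation of one work-group makes the parallel_for level concrete: the work-items are live simultaneously and may synchronize at a barrier, which is what distinguishes them from merely concurrent work. This is a sketch only (Barrier and runWorkGroup are illustrative names); real kernels use the compute language's built-in barrier.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// A reusable phase-counting barrier, standing in for a work-group barrier.
class Barrier {
    std::mutex m;
    std::condition_variable cv;
    int count, waiting = 0, phase = 0;
public:
    explicit Barrier(int n) : count(n) {}
    void arriveAndWait() {
        std::unique_lock<std::mutex> lk(m);
        int myPhase = phase;
        if (++waiting == count) { waiting = 0; ++phase; cv.notify_all(); }
        else cv.wait(lk, [&] { return phase != myPhase; });
    }
};

// Each work-item writes its own slot, all hit the barrier, then each reads
// a neighbor's slot -- safe only because the barrier orders writes first.
std::vector<int> runWorkGroup(int numWorkItems) {
    std::vector<int> shared(numWorkItems), out(numWorkItems);
    Barrier barrier(numWorkItems);
    std::vector<std::thread> items;
    for (int i = 0; i < numWorkItems; ++i)
        items.emplace_back([&, i] {
            shared[i] = i * 10;                       // produce
            barrier.arriveAndWait();                  // intra-group sync
            out[i] = shared[(i + 1) % numWorkItems];  // consume neighbor
        });
    for (auto& t : items) t.join();
    return out;
}
```

No such barrier exists between work-groups: groups only synchronize at pass boundaries, which is what lets the system schedule them in any order.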
GPU Compute Languages
Execution
Lower level is parallel execution of identical work (work-items) within a work-group
Upper level is concurrent execution of identical work-groups
What is abstracted?
Work-group abstracts a core's execution contexts, SIMD functional units, and memory
Where is synchronization allowed?
Between work-items in a work-group
Between passes (set of work-groups)
Summary of Concepts
Abstraction
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Concurrency versus parallelism
Concurrency provides scalability and portability
Parallel execution permits explicit communication and capturing locality
Synchronization
Where is the user allowed to control scheduling?
Conclusions
Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
Future SOC (CPU + GPU) programming model directions
Tasks are an effective way to abstract execution contexts and cores
SPMD is an effective way to abstract over SIMD ALUs
Many open questions
Look for uses of these different models throughout the rest of the course
Acknowledgements
Tim Foley and Matt Pharr at Intel
Mike Houston at AMD
Kayvon Fatahalian at CMU
The Advanced Rendering Technology research team, Pete Baker, Aaron Coday, and Elliot Garbus at Intel
References
GPU-inspired compute languages
  DX11 DirectCompute, OpenCL (CPU+GPU+), CUDA
  The Fusion APU Architecture: A Programmer's Perspective (Ben Gaster) [Link]
Task systems (CPU and CPU+GPU+)
  Cilk, Threading Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library
Conventional CPU thread programming
  Pthreads
GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
  Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009
  Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010
  Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010
Additional input (concepts, terminology, patterns, etc.)
  Foley, Parallel Programming for Graphics, Beyond Programmable Shading SIGGRAPH 2009
  Beyond Programmable Shading CS448s Stanford course
  Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading SIGGRAPH 2009-2010
  Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPLoP 2010