OpenMP programming and
execution model
SoHPC, 2021
Outline
▪ Execution Model
▪ Memory Model
▪ Race Condition
▪ Parallel Construct
▪ Hello World
▪ If Clause
▪ Dynamic and Nested Regions
▪ Data Clauses
▪ Number of Threads
▪ Practical
Execution Model
• Thread-based Parallelism
o Initially there is a single master thread; at a designated point multiple threads are
created, forming a parallel region
• Compiler Directive Based
o Directives tell the compiler where these parallel regions are
o This means only minimal, incremental changes to the sequential code are needed
• Explicit Parallelism
• Fork-Join Model
Execution Model
• Dynamic Threads
o More than one parallel region
o Different number of threads
• Nested Parallelism
o Parallel region inside another parallel region.
Memory model
• All threads have access to the shared memory.
• Rule of thumb: One thread per core (or processor)
• Cache is private to each core/thread.
• Maintaining a consistent view of main memory across all the caches is called cache
coherency.
[Figure: four CPUs, each with its own private cache, connected to shared main memory and the I/O system]
Memory model
• Threads can share data with other threads, but also have
private data.
[Figure: three threads (Thread 1–3), each with a CPU and its own private data, all accessing common shared data]
Four different parts of the
memory:
• Code area
• Globals area
• Heap
• Stack
Heap: a large pool of memory; allocation and
deallocation must be managed explicitly; shared
by all threads.
Stack: each thread has its own; stores private
data; LIFO principle; fast; no explicit
deallocation needed, unlike the heap.
Race Condition
• Threads communicate through shared variables.
Uncoordinated access to these variables can lead to
undesired effects.
• If two threads update (write) a shared variable in the
same step of execution, the result depends on the
order in which the variable is accessed. This is called
a race condition.
• Suppose one processor has an updated result in its
private cache, and a second processor wants to access
that memory location: a read from main memory will
return the old value, because the updated data has not
yet been written back.
• Enforcing coherence can be time consuming; it is often
better to first change how the data is accessed.
Parallel Constructs
• The fundamental construct in
OpenMP.
• Creates a team of threads.
• Every thread executes the same
statements inside the parallel
region; at the end of the parallel
region there is an implicit barrier.
C/C++:
#pragma omp parallel [clauses]
{
    …
}

Fortran:
!$omp parallel [clauses]
…
!$omp end parallel
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    foo(tid, A);
}
printf("All Done\n");

• omp_set_num_threads(4) creates a 4-thread parallel region
• tid runs from 0 to 3
• Each thread calls foo(tid, A): foo(0,A); foo(1,A); foo(2,A); foo(3,A);
• At the implicit barrier, threads wait for all threads to finish
before proceeding to printf("All Done\n");
Parallel Construct
• Clauses:
num_threads (integer-expression)
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
Hello World
C - Serial:
#include<stdio.h>

int main(int argc, char**argv){
    printf("Hello world!\n");
}

C:
#include<stdio.h>
#include<omp.h>

int main(int argc, char**argv){
    #pragma omp parallel
    printf("Hello from thread %d out of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
}
Hello World
Fortran - Serial:
program hello
implicit none
print *, 'Hello world!'
end program hello

Fortran:
program hello
use omp_lib
implicit none
!$omp parallel
print *, 'Hello from thread', &
    omp_get_thread_num(), &
    'out of', omp_get_num_threads()
!$omp end parallel
end program hello
If Clause
If Clause:
• Used to make the parallel region directive itself conditional.
• The region executes in parallel only if the expression is true
(e.g. a check that the data set is large enough to be worth the parallel overhead).
C/C++:
#pragma omp parallel if(n>100)
{
    …
}

Fortran:
!$omp parallel if(n>100)
…
!$omp end parallel
Dynamic Threads
Dynamic threads:
• Used to create parallel regions with a variable number of threads
• OpenMP runtime will decide the number of threads
• omp_set_dynamic(), OMP_DYNAMIC, omp_get_dynamic()
omp_set_dynamic(0);
omp_set_num_threads(10);
#pragma omp parallel
printf("Num threads in non-dynamic region is = %d\n", omp_get_num_threads());

omp_set_dynamic(1);
omp_set_num_threads(10);
#pragma omp parallel
printf("Num threads in dynamic region is = %d\n", omp_get_num_threads());
Nested Regions
Nested parallel regions:
• If a parallel directive is encountered
within another parallel directive, a
new team of threads will be created.
• omp_set_nested(), OMP_NESTED,
omp_get_nested()
• The requested number of threads applies to
each new region
• A nested parallel directive creates a team of
one thread unless nested parallelism is enabled
• Use the num_threads(n) clause or dynamic
threading to give regions different numbers of threads
Data Clauses
• Used in conjunction with several directives to control the
scoping of enclosed variables.
– default(shared|none): The default scope for all of the variables;
Fortran has more options.
– shared(list): Variable is shared by all threads in the team. All threads
can read or write to that variable.
C/C++: #pragma omp parallel default(none) shared(n)
Fortran: !$omp parallel default(none) shared(n)
– private(list): Each thread has a private copy of variable. It can only
be read or written by its own thread.
C/C++: #pragma omp parallel default(shared) private(tid)
Fortran: !$omp parallel default(shared) private(tid)
Example
C:
#include<stdio.h>
#include<omp.h>

int main(){
    int tid, nthreads;
    #pragma omp parallel private(tid), shared(nthreads)
    {
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        printf("Hello from thread %d out of %d\n", tid, nthreads);
    }
}

Fortran:
program hello
use omp_lib
implicit none
integer tid, nthreads
!$omp parallel private(tid), shared(nthreads)
tid = omp_get_thread_num()
nthreads = omp_get_num_threads()
print*, 'Hello from thread', tid, 'out of', nthreads
!$omp end parallel
end program hello
• How do we decide which variables should be shared
and which private?
– Loop indices - private
– Loop temporaries - private
– Read-only variables - shared
– Main arrays - shared
• Most variables are shared by default
– C/C++: File scope, static variables
– Fortran: COMMON blocks, SAVE, MODULE
variables
– Both: dynamically allocated variables
• Variables declared inside the parallel region are
always private
Additional Data Clauses
– firstprivate(list): pre-initializes the private
variables with the value of the variable with the
same name before the parallel construct.

j = jstart;
#pragma omp parallel shared(arr), firstprivate(j)
{
    int tid = omp_get_thread_num();
    arr[tid] = tid + j;
}
for (int i=0; i<nthreads; i++) printf("%d, %d\n", i, arr[i]);

– lastprivate(list): on exiting the parallel region,
gives the private data the value of the last
iteration (as if executed sequentially).

– threadprivate(list): used to make global
file-scope variables (C/C++) or common
blocks (Fortran) private to each thread.

– copyin(list): copies the threadprivate
variables from the master thread to the team
threads.

#pragma omp parallel copyin(jstart)
{
    int tid = omp_get_thread_num();
    jstart = jstart + tid + 1;
    printf("%d, %d\n", tid, jstart);
}
printf("%d\n", jstart);
Runtime Functions
• Runtime Functions: for managing the parallel program
dynamically.
– omp_set_num_threads(n) - set the desired number of
threads
– omp_get_num_threads() - returns the current number
of threads
– omp_get_thread_num() - returns the id of this thread
– omp_in_parallel() – returns true (.true. in Fortran)
if called from inside a parallel region
C/C++: Add #include<omp.h>
Fortran: Add use omp_lib
Shell Variables
• Environment Variables: for controlling the
execution of parallel program at run-time.
– csh/tcsh: setenv OMP_NUM_THREADS n
– ksh/sh/bash: export OMP_NUM_THREADS=n
echo $OMP_NUM_THREADS
How many threads?
The number of threads in a parallel region is determined by:
▪ Setting of the OMP_NUM_THREADS environment
variable.
▪ Use of the omp_set_num_threads(n) library function.
▪ Use of num_threads(n) clause.
▪ The implementation default - usually the number of
CPUs/cores on a node
Threads are numbered from 0 (master thread) to n-1 where
n=the total number of threads.
Summary
• Parallel construct forks threads.
• There are several ways to determine the number of threads
per region.
• We can have dynamic and nested parallel regions.
• Variables must be defined as private or shared.
• One of the common problems is not declaring them properly.
This will lead to different results for different numbers of
threads.
• A program is said to be thread-safe if it produces the same
results for any number of threads.