Parallel Computing and Programming.
Lecture 7: Thread Level Parallelism: Shared Memory and OpenMP.
Dr. Rony Kassam
IEF Tishreen Uni
S1 2021
Index
• Von Neumann vs Dataflow Models.
• ISA vs Microarchitecture.
• Single-cycle vs Multi-cycle Microarchitectures.
• Instruction Level Parallelism: Pipelining Intro.
• Instruction Level Parallelism: Issues in Pipeline Design.
• Thread Level Parallelism: Data Dependence Solutions.
• Thread Level Parallelism: Shared Memory and OpenMP.
Review: OpenMP Building Block: for loop

for (i=0; i<max; i++) zero[i] = 0;

• Breaks the for loop into chunks and allocates each chunk to a separate thread
  - e.g. if max = 100 with 2 threads: assign 0-49 to thread 0, and 50-99 to thread 1
• Must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it
  - Necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
• No premature exits from the loop allowed
  - i.e. no break, return, exit, goto statements
• In general, don't jump outside of any pragma block
Review: OpenMP Parallel for pragma

#pragma omp parallel for
for (i=0; i<max; i++) zero[i] = 0;

• Master thread creates additional threads, each with a separate execution context
• All variables declared outside the for loop are shared by default, except for the loop index, which is implicitly private per thread
• Implicit "barrier" synchronization at the end of the for loop
• Divide index regions sequentially per thread
  - Thread 0 gets 0, 1, …, (max/n)-1
  - Thread 1 gets max/n, max/n+1, …, 2*(max/n)-1
  - Why?
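As a quick check of the pragma above, here is a minimal, self-contained sketch; the MAX constant, main wrapper, and printf are added for illustration and are not part of the slide:

    #include <stdio.h>
    #include <omp.h>

    #define MAX 100

    int main(void) {
        int zero[MAX];

        // Each thread zeroes a contiguous chunk of the index range.
        #pragma omp parallel for
        for (int i = 0; i < MAX; i++)
            zero[i] = 0;
        // Implicit barrier here: all elements are 0 before we continue.

        printf("zero[0] = %d, zero[%d] = %d\n", zero[0], MAX - 1, zero[MAX - 1]);
        return 0;
    }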
Example 2: Computing π

Sequential π = 3.1415926535897932384626433832795028841971693993751...
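The sequential code appeared as a figure in the original slides; a minimal sketch of the midpoint-rule integration of 4/(1+x^2) that it computes, using the same variable names as the parallel versions later in this lecture:

    #include <stdio.h>

    static long num_steps = 100000;
    double step;

    int main(void) {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;

        // Midpoint rule: the integral of 4/(1+x^2) over [0,1] equals pi.
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x*x);
        }
        pi = sum * step;
        printf("pi = %6.12f\n", pi);
        return 0;
    }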
Review: Working Parallel π
(The code is shown below on the "Calculating π Original Version" slide.)
Trial Run (num_steps = 10, NUM_THREADS = 4)

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001
Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …
Can We Parallelize Computing sum?

Always looking for ways to beat Amdahl's Law …

• Summation inside the parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And the value changes between runs?!
• What's going on?
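The modified code was a figure in the slides; roughly, based on the "Calculating π Original Version" shown later, the change amounts to something like the sketch below (pi is assumed initialized to 0.0):

    #pragma omp parallel private(i, x)
    {
        int id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
        pi += sum[id] * step;   // unsynchronized update of the shared variable pi
    }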
What's Going On?
• The operation is really pi = pi + sum[id]
• What if more than one thread reads the current (same) value of pi, computes the sum, and stores the result back to pi?
• Each processor reads the same intermediate value of pi!
• The result depends on who gets there when
• A "race" → the result is not deterministic
OpenMP Reduction

double avg, sum=0.0, A[MAX]; int i;
#pragma omp parallel for private(sum)
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX;   // bug: sum is private, so each thread adds into its own uninitialized copy and the shared sum is never updated

• Problem is that we really want the sum over all threads!
• Reduction: specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region: reduction(operation:var), where
  - Operation: operator to perform on the variables (var) at the end of the parallel region
  - Var: one or more variables on which to perform the scalar reduction

double avg, sum=0.0, A[MAX]; int i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX;
Calculating π Original Version

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000; double step;

int main () {
    int i; double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);   // request NUM_THREADS threads so ids index sum[] safely
    #pragma omp parallel private(i, x)
    {
        int id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5)*step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 1; i < NUM_THREADS; i++)
        sum[0] += sum[i];
    pi = sum[0] * step;   // scale the accumulated sum by the step width
    printf("pi = %6.12f\n", pi);
    return 0;
}
Version 2: parallel for, reduction

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000; double step;

int main () {
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %6.8f\n", pi);
    return 0;
}
Data Races and Synchronization
• Two memory accesses form a data race if they are from different threads, access the same location, at least one is a write, and they occur one after another
• If there is a data race, the result of the program varies depending on chance (which thread runs first?)
• Avoid data races by synchronizing writing and reading to get deterministic behavior
• Synchronization is done by user-level routines that rely on hardware synchronization instructions
Locks
• Computers use locks to control access to shared resources
  - Serves the purpose of the microphone in the lecture example
  - Also referred to as a "semaphore"
• Usually implemented with a variable
  - int lock;
    0 for unlocked
    1 for locked
Synchronization with Locks

// wait for lock to be released
while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ... e.g. pi
// (sequential execution inside the critical section! Amdahl ...)

// release lock
lock = 0;
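For reference (not from the lecture), a minimal C11 sketch of a spin-lock whose test-and-set is atomic, anticipating the hardware support introduced below; the function names are illustrative:

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;   // clear == unlocked

    void acquire(void) {
        // Atomically set the flag and get its old value;
        // keep spinning while someone else already holds the lock.
        while (atomic_flag_test_and_set(&lock))
            ;
    }

    void release(void) {
        atomic_flag_clear(&lock);          // back to unlocked
    }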
Lock Synchronization

Thread 1                      Thread 2
while (lock != 0) ;
                              while (lock != 0) ;
lock = 1;
                              lock = 1;
// critical section           // critical section
lock = 0;                     lock = 0;

• Thread 2 finds the lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!
Hardware Synchronization
• Solution:
  - Atomic read/write
  - Read & write in a single instruction
    No other access permitted between the read and the write
  - Note: must use shared memory (multiprocessing)
• Common implementations:
  - Atomic swap of register ↔ memory
  - Pair of instructions for "linked" read and write
    Write fails if the memory location has been "tampered" with after the linked read
• RISC-V has variations of both, but for simplicity we will focus on the former
RISC-V Atomic Memory Operations (AMOs)
• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-Type instruction format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned

amoadd.w rd, rs2, (rs1):
    t = M[x[rs1]]            # load from the address in rs1 into a temporary "t"
    x[rd] = t                # rd gets the original value in memory
    M[x[rs1]] = t + x[rs2]   # store back "t" <operation> rs2 (here: add)

• The aq and rl bits ensure in-order execution (acquire/release ordering)
RISC-V Critical Section
• Assume the lock is at the memory address stored in register a0
• The lock is "set" if it is 1; it is "free" if it is 0 (its initial value)

        li           t0, 1         # get 1 to set the lock
Try:    amoswap.w.aq t1, t0, (a0)  # t1 gets the old lock value
                                   # while we set it to 1
        bnez         t1, Try       # if it was already 1, another
                                   # thread has the lock,
                                   # so we need to try again
        … critical section goes here …
        amoswap.w.rl x0, x0, (a0)  # store 0 in the lock to release it
Lock Synchronization

Broken synchronization:
    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

Fix (lock is at location (a0)):
            li           t0, 1
    Try:    amoswap.w.aq t1, t0, (a0)
            bnez         t1, Try
    Locked:
            # critical section
    Unlock:
            amoswap.w.rl x0, x0, (a0)
Deadlock
• Deadlock: a system state in which no progress is possible
• Dining Philosophers Problem:
  - Think until the left fork is available; when it is, pick it up
  - Think until the right fork is available; when it is, pick it up
  - When both forks are held, eat for a fixed amount of time
  - Then, put the right fork down
  - Then, put the left fork down
  - Repeat from the beginning
• Solution?
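A hedged sketch (not from the lecture) of how the same circular wait can arise with two OpenMP locks acquired in opposite orders; the lock names are illustrative, and running it may hang:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_lock_t fork_left, fork_right;
        omp_init_lock(&fork_left);
        omp_init_lock(&fork_right);

        #pragma omp parallel sections
        {
            #pragma omp section
            {                                   // philosopher A
                omp_set_lock(&fork_left);       // picks up the left fork first
                omp_set_lock(&fork_right);      // then waits for the right fork
                printf("A eats\n");
                omp_unset_lock(&fork_right);
                omp_unset_lock(&fork_left);
            }
            #pragma omp section
            {                                   // philosopher B
                omp_set_lock(&fork_right);      // picks up the right fork first
                omp_set_lock(&fork_left);       // then waits for the left fork -> possible deadlock
                printf("B eats\n");
                omp_unset_lock(&fork_left);
                omp_unset_lock(&fork_right);
            }
        }
        omp_destroy_lock(&fork_left);
        omp_destroy_lock(&fork_right);
        return 0;
    }

One common fix is to impose a global order on lock acquisition, e.g. every philosopher picks up fork_left before fork_right.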
OpenMP Timing
• Elapsed wall clock time:
  double omp_get_wtime(void);
  - Returns elapsed wall clock time in seconds
  - Time is measured per thread; no guarantee can be made that two distinct threads measure the same time
  - Time is measured from "some time in the past," so subtract the results of two calls to omp_get_wtime to get elapsed time
Matrix Multiply in OpenMP

// C[M][N] = A[M][P] × B[P][N]
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, j, k)
for (i=0; i<M; i++){
    for (j=0; j<N; j++){
        tmp = 0.0;
        for (k=0; k<P; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
    }
}
run_time = omp_get_wtime() - start_time;

• Outer loop spread across N threads; inner loops run inside a single thread
Matrix Multiply in OpenMP
• More performance optimizations are available:
  - Higher compiler optimization (-O2, -O3) to reduce the number of instructions executed
    The differences when using -O3 as compared to -O2 are:
    • The threshold at which the compiler believes that it is profitable to inline a call site increases.
    • The amount of loop unrolling that is performed is increased.
    • More aggressive instruction optimizations are enabled late in the compiler pipeline.
  - Cache blocking to improve memory performance
  - Using SIMD AVX instructions to raise the floating-point computation rate (Data Level Parallelism, DLP)
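As an illustration of the cache-blocking idea (not code from the lecture), a minimal sketch for the loop nest above; the block size BS is an assumed tuning parameter, and C is assumed to be zero-initialized:

    #define BS 32   // block size, chosen so blocks of A, B, C fit in cache

    #pragma omp parallel for private(i, j, k, jj, kk, tmp)
    for (ii = 0; ii < M; ii += BS)                     // ii is implicitly private (loop index)
        for (jj = 0; jj < N; jj += BS)
            for (kk = 0; kk < P; kk += BS)
                // multiply one BS x BS block of A by one block of B,
                // accumulating into the corresponding block of C
                for (i = ii; i < ii + BS && i < M; i++)
                    for (j = jj; j < jj + BS && j < N; j++) {
                        tmp = C[i][j];
                        for (k = kk; k < kk + BS && k < P; k++)
                            tmp += A[i][k] * B[k][j];
                        C[i][j] = tmp;
                    }

Each thread works on distinct ii blocks, so distinct rows of C, and there is no race on the output.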
OpenMP Review
What is OpenMP?
• API used for multi-threaded, shared memory parallelism
  - Compiler directives
  - Runtime library routines
  - Environment variables
• Portable
• Standardized
• Resources: http://www.openmp.org/ and http://computing.llnl.gov/tutorials/openMP/
OpenMP Specification
(figure omitted)
Shared Memory Model with Explicit Thread-based Parallelism
• Multiple threads in a shared memory environment; an explicit programming model with full programmer control over parallelization
• Pros:
  - Takes advantage of shared memory; the programmer need not worry (that much) about data placement
  - Compiler directives are simple and easy to use
  - Legacy serial code does not need to be rewritten
• Cons:
  - Code can only be run in shared memory environments
  - Compiler must support OpenMP (e.g. gcc 4.2)
OpenMP Extends C with Pragmas
• Pragmas are a preprocessor mechanism C provides for language extensions
• Commonly implemented pragmas: structure packing, symbol aliasing, floating point exception modes
• Good mechanism for OpenMP because compilers that don't recognize a pragma are supposed to ignore it
  - The code still runs on a sequential computer even with embedded pragmas
parallel Pragma and Scope
• Basic OpenMP construct for parallelization:

#pragma omp parallel
{
    /* code goes here */
}

  (This is annoying, but the curly brace MUST go on a separate line from the #pragma.)

  - Each thread runs a copy of the code within the block
  - Thread scheduling is non-deterministic
• OpenMP default is shared variables
  - To make a variable private, declare it with the pragma:
    #pragma omp parallel private (x)
Thread Creation
• How many threads will OpenMP create?
• Defined by the OMP_NUM_THREADS environment variable (or a procedure call in the code)
  - Set this variable to the maximum number of threads you want OpenMP to use
  - Usually equals the number of cores in the underlying hardware on which the program is run
OMP_NUM_THREADS
• OpenMP intrinsic to set the number of threads:
  omp_set_num_threads(x);
• OpenMP intrinsic to get the number of threads:
  num_th = omp_get_num_threads();
• OpenMP intrinsic to get the thread ID number:
  th_ID = omp_get_thread_num();
Parallel Hello World

#include <stdio.h>
#include <omp.h>

int main () {
    int nthreads, tid;

    /* Fork a team of threads, each with its own private tid */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();   /* get thread id */
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master and terminate */
    return 0;
}
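To try it (assuming gcc with OpenMP support, as noted on the "Cons" slide above; file and executable names are illustrative): compile with gcc -fopenmp hello.c -o hello, set OMP_NUM_THREADS=4, and run ./hello. Each thread prints its own line, and the order is non-deterministic from run to run.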
OpenMP Directives (Work-Sharing)
• These are defined within a parallel section:
  - for: shares the iterations of a loop across the threads
  - sections: each section is executed by a separate thread
  - single: serializes the execution of a thread
Parallel Statement Shorthand

#pragma omp parallel
{
    #pragma omp for   // the only directive in the parallel section
    for(i=0;i<len;i++) { … }
}

can be shortened to:

#pragma omp parallel for
for(i=0;i<len;i++) { … }

• Also works for sections
OpenMP Directives (Synchronization)
• These are defined within a parallel section
• master
  - Code block executed only by the master thread (all other threads skip it)
• critical
  - Code block executed by only one thread at a time
• atomic
  - A specific memory location must be updated atomically (like a mini critical section for writing to memory)
  - Applies to a single statement, not a code block
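A minimal sketch (not from the lecture) contrasting atomic and critical; the variables hits and maxval are illustrative:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long hits = 0;
        double maxval = 0.0;

        #pragma omp parallel
        {
            double local = omp_get_thread_num() * 1.5;  // per-thread value

            #pragma omp atomic          // single memory update, done atomically
            hits++;

            #pragma omp critical        // whole block, one thread at a time
            {
                if (local > maxval)
                    maxval = local;
            }
        }
        printf("hits = %ld, maxval = %f\n", hits, maxval);
        return 0;
    }

atomic is cheaper when it applies, since it can map to a hardware atomic operation such as the AMOs above; critical protects an arbitrary block.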
OpenMP Reduction
• Reduction specifies that one or more private variables are the subject of a reduction operation at the end of the parallel region
  - Clause: reduction(operation:var)
  - Operation: operator to perform on the variables at the end of the parallel region
  - Var: one or more variables on which to perform the scalar reduction

#pragma omp parallel for reduction(+:nSum)
for (i = START ; i <= END ; i++)
    nSum += i;
OpenMP Pitfall #1: Data Dependencies
• Consider the following code:

a[0] = 1;
for(i=1; i<5000; i++)
    a[i] = i + a[i-1];

• There are dependencies between loop iterations!
  - Splitting this loop between threads does not guarantee in-order execution
  - Out-of-order loop execution will result in undefined behavior (i.e. likely a wrong result)
OpenMP Pitfall #2: Sharing Issues
• Consider the following loop:

#pragma omp parallel for
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}

• temp is a shared variable! Declare it private instead:

#pragma omp parallel for private(temp)
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}
OpenMP Pitfall #3: Updating Shared Variables Simultaneously
• Now consider a global sum:

for(i=0; i<n; i++)
    sum = sum + a[i];

• This can be done by surrounding the summation with a critical/atomic section or a reduction clause:

#pragma omp parallel for reduction(+:sum)
for(i=0; i<n; i++)
    sum = sum + a[i];

  - The compiler can generate highly efficient code for reduction
OpenMP Pitfall #4: Parallel Overhead
• Spawning and releasing threads results in significant overhead
• Better to have fewer but larger parallel regions
  - Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)
OpenMP Pitfall #4: Parallel Overhead

start_time = omp_get_wtime();
for (i=0; i<Ndim; i++){
    for (j=0; j<Mdim; j++){
        tmp = 0.0;
        #pragma omp parallel for reduction(+:tmp)
        for( k=0; k<Pdim; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
        }
        *(C+(i*Ndim+j)) = tmp;
    }
}
run_time = omp_get_wtime() - start_time;

• Too much overhead in thread generation to have this statement run this frequently: a poor choice of loop to parallelize
• Compare with the earlier "Matrix Multiply in OpenMP" slide, which parallelizes the outermost loop instead