
Parallel Computing and Programming.

Lecture 7: Thread Level Parallelism: Shared Memory and OpenMP.

Dr. Rony Kassam

IEF Tishreen Uni


S1 2021
Index
n Von Neumann vs Dataflow Models.
n ISA vs Microarchitecture.
n Single-cycle vs Multi-cycle Microarchitectures.
n Instruction Level Parallelism: Pipelining Intro.
n Instruction Level Parallelism: Issues in Pipeline Design.
n Thread Level Parallelism: Data Dependence Solutions.
n Thread Level Parallelism: Shared Memory and OpenMP.

2
Review: OpenMP Building Block: for loop

for (i=0; i<max; i++) zero[i] = 0;

n Breaks the for loop into chunks, and allocates each chunk to a separate
thread
q e.g. if max = 100 with 2 threads:
assign 0-49 to thread 0, and 50-99 to thread 1
n Must have relatively simple “shape” for an OpenMP-aware
compiler to be able to parallelize it
q Necessary for the run-time system to be able to determine how
many of the loop iterations to assign to each thread
n No premature exits from the loop allowed
q i.e. no break, return, exit, or goto statements
q In general, don’t jump outside of any pragma block
3
Review: OpenMP Parallel for pragma

#pragma omp parallel for


for (i=0; i<max; i++) zero[i] = 0;

n Master thread creates additional threads, each with a separate execution context
n All variables declared outside for loop are shared by default,
except for loop index which is implicitly private per thread
n Implicit “barrier” synchronization at end of for loop
n Divide index regions sequentially per thread
q Thread 0 gets 0, 1, …, (max/n)-1;
q Thread 1 gets max/n, max/n+1, …, 2*(max/n)-1
q Why?

4
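For reference, the fragment above can be dropped into a complete program along the lines of the sketch below (the array name zero and the bound follow the slide; the size 100 and the printf check are illustrative). Compile with gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

#define MAX 100              /* illustrative value for “max” */

int main(void) {
    int zero[MAX];
    int i;                   /* loop index: implicitly private per thread */

    #pragma omp parallel for
    for (i = 0; i < MAX; i++)
        zero[i] = 0;         /* iterations split into chunks, one chunk per thread */

    printf("zero[0]=%d zero[%d]=%d\n", zero[0], MAX - 1, zero[MAX - 1]);
    return 0;
}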
Example 2: Computing π

Sequential π = 3.1415926535897932384626433832795028841971693993751...
5
Review: Working Parallel π

6
Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001

7
Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

8
Can We Parallelize Computing sum?

Always looking for ways to beat Amdahl’s Law …

Summation inside parallel section


• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?

9
What’s Going On?

• Operation is really
pi = pi + sum[id]
• What if more than one thread reads the current (same)
value of pi, computes the sum, and stores
the result back to pi?
• Each processor reads the same
intermediate value of pi!
• Result depends on who gets there when
• A “race” → the result is not deterministic

10
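Before reaching for the reduction clause introduced on the next slide, the race above can also be removed by guarding the shared update with a critical section. A minimal sketch (a fragment: pi and the per-thread partial array sum[] from the π code are assumed to be declared already):

#pragma omp parallel
{
    int id = omp_get_thread_num();
    /* ... each thread first fills in its own partial result sum[id] ... */

    #pragma omp critical      /* only one thread at a time executes this update */
    pi = pi + sum[id];
}

Because only the final accumulation is serialized, the bulk of the work still runs in parallel and the result becomes deterministic.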
OpenMP Reduction

double avg, sum=0.0, A[MAX]; int i;


#pragma omp parallel for private ( sum )
for (i = 0; i < MAX; i++)
sum += A[i];
avg = sum/MAX; // bug

n Problem is that we really want sum over all threads!


n Reduction: specifies that one or more variables that are private to each thread
are the subject of a reduction operation at the end of the parallel region:
reduction(operation:var) where
q Operation: operator to perform on the variables (var) at the end of the parallel region
q Var: One or more variables on which to perform scalar reduction.

double avg, sum=0.0, A[MAX]; int i;


#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < MAX; i++)
sum += A[i];
avg = sum/MAX;

11
Calculating π Original Version
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000; double step;

void main () {
int i; double x, pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS); /* sum[] holds exactly NUM_THREADS partial sums */
#pragma omp parallel private ( i, x )
{
int id = omp_get_thread_num();
for (i=id, sum[id]=0.0; i<num_steps; i=i+NUM_THREADS)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=1; i<NUM_THREADS; i++)
sum[0] += sum[i];
pi = sum[0] * step;
printf ("pi = %6.12f\n", pi);
}

12
Version 2: parallel for, reduction

#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=1; i<= num_steps; i++){
x = (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = sum * step;
printf ("pi = %6.8f\n", pi);
}

13
Data Races and Synchronization
n Two memory accesses form a data race if they come from
different threads, access the same location, at least one is a
write, and they are not ordered by synchronization
n If there is a data race, the result of the program varies depending
on chance (which thread got there first?)
n Avoid data races by synchronizing writing and reading to
get deterministic behavior
n Synchronization done by user-level routines that rely on
hardware synchronization instructions

14
Locks

n Computers use locks to control access to shared resources


q Serves the same purpose as a shared microphone passed around a room: only the holder may use the resource
q Also referred to as “semaphore”

n Usually implemented with a variable


q int lock;
n 0 for unlocked
n 1 for locked

15
Synchronization with Locks

// wait for lock released


while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ...


// e.g. pi
// sequential execution! (Amdahl ...)

// release lock
lock = 0;

16
Lock Synchronization

Thread 1                              Thread 2
while (lock != 0) ;
                                      while (lock != 0) ;
lock = 1;
                                      lock = 1;
// critical section
                                      // critical section
lock = 0;
                                      lock = 0;

• Thread 2 finds the lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the
assembly level.
Unless we introduce new instructions, that is!

17
Hardware Synchronization

n Solution:
q Atomic read/write
q Read & write in single instruction
n No other access permitted between read and write
q Note:
n Must use shared memory (multiprocessing)
n Common implementations:
q Atomic swap of register ↔ memory
q Pair of instructions for “linked” read and write
n write fails if memory location has been “tampered” with after linked read
n RISC-V has variations of both, but for simplicity we will focus on
the former

18
RISC-V Atomic Memory Operations
(AMOs)
n AMOs atomically perform an operation on an operand in
memory and set the destination register to the original
memory value
n R-Type Instruction Format: Add, And, Or, Swap, Xor,
Max, Max Unsigned, Min, Min Unsigned

amoadd.w rd, rs2, (rs1):
  t = M[x[rs1]]            # load from the address in rs1 into a temporary “t”
  x[rd] = t                # rd gets “t”, i.e., the original value in memory
  M[x[rs1]] = t + x[rs2]   # store at the address in rs1 the calculation “t” <operation> rs2
The aq and rl bits ensure in-order (acquire/release) execution

19
RISC-V Critical Section

n Assume that the lock is at the memory address stored in register a0


n The lock is “set” if it is 1; it is “free” if it is 0 (its initial value)

li t0, 1 # Get 1 to set lock


Try: amoswap.w.aq t1, t0, (a0) # t1 gets old lock value
# while we set it to 1
bnez t1, Try # if it was already 1, another
# thread has the lock,
# so we need to try again
… critical section goes here …
amoswap.w.rl x0, x0, (a0) # store 0 in lock to release

20
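For comparison, the same acquire/release pattern can be sketched in C11: atomic_exchange plays the role of amoswap.w.aq and the releasing store plays the role of amoswap.w.rl (the function names acquire/release are illustrative, not from the slide):

#include <stdatomic.h>

atomic_int lock = 0;                       /* 0 = free, 1 = set */

void acquire(void) {
    /* atomically swap in 1; if the old value was already 1, another
       thread holds the lock, so keep trying */
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) != 0)
        ;                                   /* spin */
}

void release(void) {
    atomic_store_explicit(&lock, 0, memory_order_release);   /* store 0 to release */
}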
Lock Synchronization

Broken Synchronization                Fix (lock is at location (a0))

while (lock != 0) ;                   li t0, 1
                                      Try: amoswap.w.aq t1, t0, (a0)
                                      bnez t1, Try
lock = 1;                             Locked:

// critical section                   # critical section

                                      Unlock:
lock = 0;                             amoswap.w.rl x0, x0, (a0)

21
Deadlock

n Deadlock: a system state in which no progress is possible
n Dining Philosophers Problem:
q Think until the left fork is available; when it is, pick it
up
q Think until the right fork is available; when it is, pick it
up
q When both forks are held, eat for a fixed amount of
time
q Then, put the right fork down
q Then, put the left fork down
q Repeat from the beginning
n Solution?

22
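One common solution (not given on the slide) is to impose a global order on resource acquisition so that a circular wait can never form. A hedged sketch using OpenMP locks; the array fork_lock and the five-philosopher indexing are illustrative:

#include <omp.h>

omp_lock_t fork_lock[5];                 /* one lock per fork */
/* at startup: for (i = 0; i < 5; i++) omp_init_lock(&fork_lock[i]); */

void philosopher_eats(int id) {
    int left   = id;
    int right  = (id + 1) % 5;
    /* always take the lower-numbered fork first: no circular wait can form */
    int first  = (left < right) ? left  : right;
    int second = (left < right) ? right : left;

    omp_set_lock(&fork_lock[first]);
    omp_set_lock(&fork_lock[second]);
    /* ... eat for a fixed amount of time ... */
    omp_unset_lock(&fork_lock[second]);
    omp_unset_lock(&fork_lock[first]);
}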
OpenMP Timing

n Elapsed wall clock time:


double omp_get_wtime(void);
q Returns elapsed wall clock time in seconds
q Time is measured per thread; no guarantee can be made
that two distinct threads measure the same time
q Time is measured from “some time in the past,” so
subtract results of two calls to omp_get_wtime to get
elapsed time

23
Matrix Multiply in OpenMP

// C[M][N] = A[M][P] × B[P][N]


start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, j, k)
for (i=0; i<M; i++){                  // outer loop spread across the threads
  for (j=0; j<N; j++){                // inner loops run inside a single thread
    tmp = 0.0;
    for (k=0; k<P; k++){
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += A[i][k] * B[k][j];
    }
    C[i][j] = tmp;
  }
}
run_time = omp_get_wtime() - start_time;

24
Matrix Multiply in OpenMP
n More performance optimizations available:
q Higher compiler optimization (-O2, -O3) to reduce number

of instructions executed
The differences when using -O3 as compared to -O2 are:
•The threshold at which the compiler believes that it is profitable to inline a call site increases.
•The amount of loop unrolling that is performed is increased.
•More aggressive instruction optimizations are enabled late in the compiler pipeline.

q Cache blocking to improve memory performance


q Using SIMD AVX instructions to raise floating point
computation rate (Data Level Parallelism DLP)

25
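As an illustration of the cache-blocking idea (a sketch, not code from the slides): the three loops are tiled so that a BS×BS block of each matrix is reused while it is still cache-resident. Square N×N row-major matrices, N divisible by BS, and a zero-initialized C are assumptions of this sketch.

#define BS 32                              /* tile size, tuned to the cache */

/* C += A*B for N x N row-major matrices; assumes N % BS == 0 and C zeroed */
void matmul_blocked(int N, double *A, double *B, double *C) {
    int ii, jj, kk, i, j, k;
    #pragma omp parallel for private(jj, kk, i, j, k)
    for (ii = 0; ii < N; ii += BS)         /* threads split the row blocks of C */
        for (kk = 0; kk < N; kk += BS)
            for (jj = 0; jj < N; jj += BS)
                /* one BS x BS tile of A and B stays in cache for this block */
                for (i = ii; i < ii + BS; i++)
                    for (k = kk; k < kk + BS; k++) {
                        double a_ik = A[i*N + k];
                        for (j = jj; j < jj + BS; j++)
                            C[i*N + j] += a_ik * B[k*N + j];
                    }
}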
OpenMP Review

26
What is OpenMP?

n API used for multi-threaded, shared memory parallelism


q Compiler Directives
q Runtime Library Routines
q Environment Variables
n Portable
n Standardized

n Resources: http://www.openmp.org/
and http://computing.llnl.gov/tutorials/openMP/
27
OpenMP Specification

28
Shared Memory Model with Explicit Thread-based Parallelism
n Multiple threads in a shared memory environment,
explicit programming model with full programmer control
over parallelization
n Pros:
q Takes advantage of shared memory, programmer need not
worry (that much) about data placement
q Compiler directives are simple and easy to use
q Legacy serial code does not need to be rewritten
n Cons:
q Code can only be run in shared memory
environments
q Compiler must support OpenMP (e.g. gcc 4.2)
29
OpenMP Extends C with Pragmas

n Pragmas are a preprocessor mechanism C provides for


language extensions
n Commonly implemented pragmas:
structure packing, symbol aliasing, floating point
exception modes
n Good mechanism for OpenMP because compilers that
don't recognize a pragma are supposed to ignore it
q Runs on a sequential computer even with embedded
pragmas

30
parallel Pragma and Scope

n Basic OpenMP construct for parallelization:


#pragma omp parallel
{
/* code goes here */
}
(This is annoying, but the curly brace MUST go on a separate line from the #pragma)
q Each thread runs a copy of code within the block
q Thread scheduling is non-deterministic

n OpenMP default is shared variables


q To make private, need to declare with pragma:
#pragma omp parallel private (x)
31
Thread Creation

n How many threads will OpenMP create?


n Defined by OMP_NUM_THREADS environment variable
(or code procedure call)
q Set this variable to the maximum number of threads you
want OpenMP to use
q Usually equals the number of cores in the underlying
hardware on which the program is run

32
OMP_NUM_THREADS

n OpenMP intrinsic to set number of threads:


omp_set_num_threads(x);
n OpenMP intrinsic to get number of threads:
num_th = omp_get_num_threads();
n OpenMP intrinsic to get Thread ID number:
th_ID = omp_get_thread_num();

33
Parallel Hello World
#include <stdio.h>
#include <omp.h>
int main () {
int nthreads, tid;
/* Fork team of threads with private var tid */
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num(); /* get thread id */
printf("Hello World from thread = %d\n", tid);
/* Only master thread does this */
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
} /* All threads join master and terminate */
}
34
OpenMP Directives (Work-Sharing)

• These are defined within a parallel section
• for/DO: shares the iterations of a loop across the threads
• sections: each section is executed by a separate thread
• single: serializes the execution of a block of code (it is run by only one thread)
35
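A compact sketch showing all three work-sharing constructs inside one parallel region (i, n, a and the functions work_a/work_b are illustrative placeholders, not from the slides):

#pragma omp parallel
{
    #pragma omp for                  /* loop iterations divided among the threads */
    for (i = 0; i < n; i++)
        a[i] = 2 * i;

    #pragma omp sections             /* each section is run by a separate thread */
    {
        #pragma omp section
        work_a();
        #pragma omp section
        work_b();
    }

    #pragma omp single               /* this block is executed by only one thread */
    printf("work-sharing done\n");
}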
Parallel Statement Shorthand

#pragma omp parallel
{
#pragma omp for        // this is the only directive in the parallel section
for(i=0;i<len;i++) { … }
}
can be shortened to:
#pragma omp parallel for
for(i=0;i<len;i++) { … }

n Also works for sections

38
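The same shorthand applies to sections, e.g. (work_a/work_b are illustrative placeholders):

#pragma omp parallel sections
{
    #pragma omp section
    work_a();
    #pragma omp section
    work_b();
}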
OpenMP Directives (Synchronization)

n These are defined within a parallel section


n master
q Code block executed only by the master thread
(all other threads skip)
n critical
q Code block executed by only one thread at a time
n atomic
q Specific memory location must be updated atomically (like
a mini-critical section for writing to memory)
q Applies to single statement, not code block

39
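A minimal fragment combining the three directives (counter and x are illustrative shared variables):

int counter = 0;
double x = 0.0;
#pragma omp parallel
{
    #pragma omp master                 /* only the master thread (ID 0) runs this */
    printf("team size = %d\n", omp_get_num_threads());

    #pragma omp critical               /* one thread at a time inside the block */
    {
        counter++;
        printf("thread %d saw counter = %d\n", omp_get_thread_num(), counter);
    }

    #pragma omp atomic                 /* a single memory update, done atomically */
    x += 1.0;
}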
OpenMP Reduction

n Reduction specifies that one or more private variables


are the subject of a reduction operation at end of
parallel region
q Clause reduction(operation:var)
q Operation: Operator to perform on the variables at the
end of the parallel region
q Var: One or more variables on which to perform scalar
reduction
#pragma omp for reduction(+:nSum)
for (i = START ; i <= END ; i++)
nSum += i;

40
OpenMP Pitfall #1: Data Dependencies

n Consider the following code:


a[0] = 1;
for(i=1; i<5000; i++)
a[i] = i + a[i-1];
n There are dependencies between loop iterations!
q Splitting this loop between threads does not guarantee in-
order execution
q Out of order loop execution will result in undefined
behavior (i.e. likely wrong result)

41
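In this particular case the loop-carried dependence can be removed by hand, because the recurrence a[i] = i + a[i-1] with a[0] = 1 has the closed form a[i] = 1 + i*(i+1)/2. A sketch (this rewrite is an illustration, not part of the slide):

#pragma omp parallel for
for(i=0; i<5000; i++)
    a[i] = 1 + i*(i+1)/2;    /* same values as the sequential recurrence, no dependence */

In general, however, no such rewrite is available and the loop must stay sequential.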
Open MP Pitfall #2: Sharing Issues

n Consider the following loop:


#pragma omp parallel for
for(i=0; i<n; i++){
temp = 2.0*a[i];
a[i] = temp;
b[i] = c[i]/temp;
}
n temp is a shared variable!
#pragma omp parallel for private(temp)
for(i=0; i<n; i++){
temp = 2.0*a[i];
a[i] = temp;
b[i] = c[i]/temp;
}
42
OpenMP Pitfall #3: Updating Shared Variables
Simultaneously
n Now consider a global sum:
for(i=0; i<n; i++)
sum = sum + a[i];
n This can be done by surrounding the summation by a
critical/atomic section or reduction clause:
#pragma omp parallel for reduction(+:sum)
for(i=0; i<n; i++)
sum = sum + a[i];
q Compiler can generate highly efficient code for reduction

43
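The critical/atomic alternative mentioned above would look like the sketch below; it is correct but serializes every single update, so the reduction clause is usually much faster:

#pragma omp parallel for
for(i=0; i<n; i++) {
    #pragma omp atomic          /* each update of sum is made atomic */
    sum = sum + a[i];
}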
OpenMP Pitfall #4: Parallel Overhead

n Spawning and releasing threads results in significant


overhead
n Better to have fewer but larger parallel regions
q Parallelize over the largest loop that you can (even though
it will involve more work to declare all of the private
variables and eliminate dependencies)

44
OpenMP Pitfall #4: Parallel Overhead

start_time = omp_get_wtime();
for (i=0; i<Ndim; i++){
  for (j=0; j<Mdim; j++){
    tmp = 0.0;
    #pragma omp parallel for reduction(+:tmp)
    for( k=0; k<Pdim; k++){
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
    }
    *(C+(i*Ndim+j)) = tmp;
  }
}
run_time = omp_get_wtime() - start_time;

Too much overhead in thread generation to have this statement run this
frequently. Poor choice of loop to parallelize.
45
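The fix (a sketch that keeps the slide's pointer indexing) is to move the pragma to the outermost loop, as on slide 24, so the thread team is created once instead of Ndim*Mdim times:

start_time = omp_get_wtime();
#pragma omp parallel for private(j, k, tmp)   /* threads created once, outer loop split */
for (i=0; i<Ndim; i++){
  for (j=0; j<Mdim; j++){
    tmp = 0.0;
    for( k=0; k<Pdim; k++){
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
    }
    *(C+(i*Ndim+j)) = tmp;
  }
}
run_time = omp_get_wtime() - start_time;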
