Parallel Computing and Programming.
Lecture 7: Thread Level Parallelism: Shared Memory and OpenMP.
Dr. Rony Kassam
IEF Tishreen Uni
S1 2021
Index
• Von Neumann vs Dataflow Models.
• ISA vs Microarchitecture.
• Single-cycle vs Multi-cycle Microarchitectures.
• Instruction Level Parallelism: Pipelining Intro.
• Instruction Level Parallelism: Issues in Pipeline Design.
• Thread Level Parallelism: Data Dependence Solutions.
• Thread Level Parallelism: Shared Memory and OpenMP.
Review: OpenMP Building Block: for loop

for (i=0; i<max; i++) zero[i] = 0;

• Breaks the for loop into chunks and allocates each chunk to a separate thread
  - e.g. if max = 100 with 2 threads: assign 0-49 to thread 0, and 50-99 to thread 1
• Must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it
  - Necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
• No premature exits from the loop allowed
  - i.e. no break, return, exit, goto statements
• In general, don't jump outside of any pragma block
Review: OpenMP Parallel for pragma

#pragma omp parallel for
for (i=0; i<max; i++) zero[i] = 0;

• Master thread creates additional threads, each with a separate execution context
• All variables declared outside the for loop are shared by default, except for the loop index, which is implicitly private per thread
• Implicit "barrier" synchronization at the end of the for loop
• Divide index regions sequentially per thread
  - Thread 0 gets 0, 1, …, (max/n)-1
  - Thread 1 gets max/n, max/n+1, …, 2*(max/n)-1
  - Why?
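As a quick check of the pragma above, here is a minimal, self-contained sketch; the MAX constant, main wrapper, and printf are added for illustration and are not part of the slide:

    #include <stdio.h>
    #include <omp.h>

    #define MAX 100

    int main(void) {
        int zero[MAX];

        // Each thread zeroes a contiguous chunk of the index range.
        #pragma omp parallel for
        for (int i = 0; i < MAX; i++)
            zero[i] = 0;
        // Implicit barrier here: all elements are 0 before we continue.

        printf("zero[0] = %d, zero[%d] = %d\n", zero[0], MAX - 1, zero[MAX - 1]);
        return 0;
    }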
Example 2: Computing π

Sequential π = 3.1415926535897932384626433832795028841971693993751...
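The sequential code appeared as a figure in the original slides; a minimal sketch of the midpoint-rule integration of 4/(1+x^2) that it computes, using the same variable names as the parallel versions later in this lecture:

    #include <stdio.h>

    static long num_steps = 100000;
    double step;

    int main(void) {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;

        // Midpoint rule: the integral of 4/(1+x^2) over [0,1] equals pi.
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x*x);
        }
        pi = sum * step;
        printf("pi = %6.12f\n", pi);
        return 0;
    }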
Review: Working Parallel π
(The code is shown below on the "Calculating π Original Version" slide.)
Trial Run (num_steps = 10, NUM_THREADS = 4)

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001
Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …
Can We Parallelize Computing sum?

Always looking for ways to beat Amdahl's Law …

• Summation inside the parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And the value changes between runs?!
• What's going on?
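The modified code was a figure in the slides; roughly, based on the "Calculating π Original Version" shown later, the change amounts to something like the sketch below (pi is assumed initialized to 0.0):

    #pragma omp parallel private(i, x)
    {
        int id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
        pi += sum[id] * step;   // unsynchronized update of the shared variable pi
    }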
What's Going On?
• The operation is really pi = pi + sum[id]
• What if more than one thread reads the current (same) value of pi, computes the sum, and stores the result back to pi?
• Each processor reads the same intermediate value of pi!
• The result depends on who gets there when
• A "race" → the result is not deterministic
OpenMP Reduction

double avg, sum=0.0, A[MAX]; int i;
#pragma omp parallel for private(sum)
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX;   // bug: sum is private, so each thread adds into its own uninitialized copy and the shared sum is never updated

• Problem is that we really want the sum over all threads!
• Reduction: specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region: reduction(operation:var), where
  - Operation: operator to perform on the variables (var) at the end of the parallel region
  - Var: one or more variables on which to perform the scalar reduction

double avg, sum=0.0, A[MAX]; int i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX;
Calculating π Original Version

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000; double step;

int main () {
    int i; double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);   // request NUM_THREADS threads so ids index sum[] safely
    #pragma omp parallel private(i, x)
    {
        int id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5)*step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 1; i < NUM_THREADS; i++)
        sum[0] += sum[i];
    pi = sum[0] * step;   // scale the accumulated sum by the step width
    printf("pi = %6.12f\n", pi);
    return 0;
}
Version 2: parallel for, reduction

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000; double step;

int main () {
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %6.8f\n", pi);
    return 0;
}
Data Races and Synchronization
• Two memory accesses form a data race if they are from different threads, access the same location, at least one is a write, and they occur one after another
• If there is a data race, the result of the program varies depending on chance (which thread runs first?)
• Avoid data races by synchronizing writing and reading to get deterministic behavior
• Synchronization is done by user-level routines that rely on hardware synchronization instructions
Locks
• Computers use locks to control access to shared resources
  - Serves the purpose of the microphone in the lecture example
  - Also referred to as a "semaphore"
• Usually implemented with a variable
  - int lock;
    0 for unlocked
    1 for locked
Synchronization with Locks

// wait for lock to be released
while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ... e.g. pi
// (sequential execution inside the critical section! Amdahl ...)

// release lock
lock = 0;
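For reference (not from the lecture), a minimal C11 sketch of a spin-lock whose test-and-set is atomic, anticipating the hardware support introduced below; the function names are illustrative:

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;   // clear == unlocked

    void acquire(void) {
        // Atomically set the flag and get its old value;
        // keep spinning while someone else already holds the lock.
        while (atomic_flag_test_and_set(&lock))
            ;
    }

    void release(void) {
        atomic_flag_clear(&lock);          // back to unlocked
    }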
Lock Synchronization

Thread 1                      Thread 2
while (lock != 0) ;
                              while (lock != 0) ;
lock = 1;
                              lock = 1;
// critical section           // critical section
lock = 0;                     lock = 0;

• Thread 2 finds the lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!
Hardware Synchronization
• Solution:
  - Atomic read/write
  - Read & write in a single instruction
    No other access permitted between the read and the write
  - Note: must use shared memory (multiprocessing)
• Common implementations:
  - Atomic swap of register ↔ memory
  - Pair of instructions for "linked" read and write
    Write fails if the memory location has been "tampered" with after the linked read
• RISC-V has variations of both, but for simplicity we will focus on the former
RISC-V Atomic Memory Operations (AMOs)
• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-Type instruction format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned

amoadd.w rd, rs2, (rs1):
    t = M[x[rs1]]            # load from the address in rs1 into a temporary "t"
    x[rd] = t                # rd gets the original value in memory
    M[x[rs1]] = t + x[rs2]   # store back "t" <operation> rs2 (here: add)

• The aq and rl bits ensure in-order execution (acquire/release ordering)
RISC-V Critical Section
• Assume the lock is at the memory address stored in register a0
• The lock is "set" if it is 1; it is "free" if it is 0 (its initial value)

        li           t0, 1         # get 1 to set the lock
Try:    amoswap.w.aq t1, t0, (a0)  # t1 gets the old lock value
                                   # while we set it to 1
        bnez         t1, Try       # if it was already 1, another
                                   # thread has the lock,
                                   # so we need to try again
        … critical section goes here …
        amoswap.w.rl x0, x0, (a0)  # store 0 in the lock to release it
Lock Synchronization

Broken synchronization:
    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

Fix (lock is at location (a0)):
            li           t0, 1
    Try:    amoswap.w.aq t1, t0, (a0)
            bnez         t1, Try
    Locked:
            # critical section
    Unlock:
            amoswap.w.rl x0, x0, (a0)
Deadlock
• Deadlock: a system state in which no progress is possible
• Dining Philosophers Problem:
  - Think until the left fork is available; when it is, pick it up
  - Think until the right fork is available; when it is, pick it up
  - When both forks are held, eat for a fixed amount of time
  - Then, put the right fork down
  - Then, put the left fork down
  - Repeat from the beginning
• Solution?
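A hedged sketch (not from the lecture) of how the same circular wait can arise with two OpenMP locks acquired in opposite orders; the lock names are illustrative, and running it may hang:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_lock_t fork_left, fork_right;
        omp_init_lock(&fork_left);
        omp_init_lock(&fork_right);

        #pragma omp parallel sections
        {
            #pragma omp section
            {                                   // philosopher A
                omp_set_lock(&fork_left);       // picks up the left fork first
                omp_set_lock(&fork_right);      // then waits for the right fork
                printf("A eats\n");
                omp_unset_lock(&fork_right);
                omp_unset_lock(&fork_left);
            }
            #pragma omp section
            {                                   // philosopher B
                omp_set_lock(&fork_right);      // picks up the right fork first
                omp_set_lock(&fork_left);       // then waits for the left fork -> possible deadlock
                printf("B eats\n");
                omp_unset_lock(&fork_left);
                omp_unset_lock(&fork_right);
            }
        }
        omp_destroy_lock(&fork_left);
        omp_destroy_lock(&fork_right);
        return 0;
    }

One common fix is to impose a global order on lock acquisition, e.g. every philosopher picks up fork_left before fork_right.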
OpenMP Timing
• Elapsed wall clock time:
  double omp_get_wtime(void);
  - Returns elapsed wall clock time in seconds
  - Time is measured per thread; no guarantee can be made that two distinct threads measure the same time
  - Time is measured from "some time in the past," so subtract the results of two calls to omp_get_wtime to get elapsed time
Matrix Multiply in OpenMP

// C[M][N] = A[M][P] × B[P][N]
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, j, k)
for (i=0; i<M; i++){
    for (j=0; j<N; j++){
        tmp = 0.0;
        for (k=0; k<P; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
    }
}
run_time = omp_get_wtime() - start_time;

• Outer loop spread across N threads; inner loops run inside a single thread
Matrix Multiply in OpenMP
• More performance optimizations are available:
  - Higher compiler optimization (-O2, -O3) to reduce the number of instructions executed
    The differences when using -O3 as compared to -O2 are:
    • The threshold at which the compiler believes that it is profitable to inline a call site increases.
    • The amount of loop unrolling that is performed is increased.
    • More aggressive instruction optimizations are enabled late in the compiler pipeline.
  - Cache blocking to improve memory performance
  - Using SIMD AVX instructions to raise the floating-point computation rate (Data Level Parallelism, DLP)
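As an illustration of the cache-blocking idea (not code from the lecture), a minimal sketch for the loop nest above; the block size BS is an assumed tuning parameter, and C is assumed to be zero-initialized:

    #define BS 32   // block size, chosen so blocks of A, B, C fit in cache

    #pragma omp parallel for private(i, j, k, jj, kk, tmp)
    for (ii = 0; ii < M; ii += BS)                     // ii is implicitly private (loop index)
        for (jj = 0; jj < N; jj += BS)
            for (kk = 0; kk < P; kk += BS)
                // multiply one BS x BS block of A by one block of B,
                // accumulating into the corresponding block of C
                for (i = ii; i < ii + BS && i < M; i++)
                    for (j = jj; j < jj + BS && j < N; j++) {
                        tmp = C[i][j];
                        for (k = kk; k < kk + BS && k < P; k++)
                            tmp += A[i][k] * B[k][j];
                        C[i][j] = tmp;
                    }

Each thread works on distinct ii blocks, so distinct rows of C, and there is no race on the output.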
OpenMP Review
What is OpenMP?
• API used for multi-threaded, shared memory parallelism
  - Compiler directives
  - Runtime library routines
  - Environment variables
• Portable
• Standardized
• Resources: http://www.openmp.org/ and http://computing.llnl.gov/tutorials/openMP/
OpenMP Specification
(figure omitted)
Shared Memory Model with Explicit Thread-based Parallelism
• Multiple threads in a shared memory environment; an explicit programming model with full programmer control over parallelization
• Pros:
  - Takes advantage of shared memory; the programmer need not worry (that much) about data placement
  - Compiler directives are simple and easy to use
  - Legacy serial code does not need to be rewritten
• Cons:
  - Code can only be run in shared memory environments
  - Compiler must support OpenMP (e.g. gcc 4.2)
OpenMP Extends C with Pragmas
• Pragmas are a preprocessor mechanism C provides for language extensions
• Commonly implemented pragmas: structure packing, symbol aliasing, floating point exception modes
• Good mechanism for OpenMP because compilers that don't recognize a pragma are supposed to ignore it
  - The code still runs on a sequential computer even with embedded pragmas
parallel Pragma and Scope
• Basic OpenMP construct for parallelization:

#pragma omp parallel
{
    /* code goes here */
}

  (This is annoying, but the curly brace MUST go on a separate line from the #pragma.)

  - Each thread runs a copy of the code within the block
  - Thread scheduling is non-deterministic
• OpenMP default is shared variables
  - To make a variable private, declare it with the pragma:
    #pragma omp parallel private (x)
Thread Creation
• How many threads will OpenMP create?
• Defined by the OMP_NUM_THREADS environment variable (or a procedure call in the code)
  - Set this variable to the maximum number of threads you want OpenMP to use
  - Usually equals the number of cores in the underlying hardware on which the program is run
OMP_NUM_THREADS
• OpenMP intrinsic to set the number of threads:
  omp_set_num_threads(x);
• OpenMP intrinsic to get the number of threads:
  num_th = omp_get_num_threads();
• OpenMP intrinsic to get the thread ID number:
  th_ID = omp_get_thread_num();
Parallel Hello World

#include <stdio.h>
#include <omp.h>

int main () {
    int nthreads, tid;

    /* Fork a team of threads, each with its own private tid */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();   /* get thread id */
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master and terminate */
    return 0;
}
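To try it (assuming gcc with OpenMP support, as noted on the "Cons" slide above; file and executable names are illustrative): compile with gcc -fopenmp hello.c -o hello, set OMP_NUM_THREADS=4, and run ./hello. Each thread prints its own line, and the order is non-deterministic from run to run.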
OpenMP Directives (Work-Sharing)
• These are defined within a parallel section:
  - for: shares the iterations of a loop across the threads
  - sections: each section is executed by a separate thread
  - single: serializes the execution of a thread
Parallel Statement Shorthand

#pragma omp parallel
{
    #pragma omp for   // the only directive in the parallel section
    for(i=0;i<len;i++) { … }
}

can be shortened to:

#pragma omp parallel for
for(i=0;i<len;i++) { … }

• Also works for sections
OpenMP Directives (Synchronization)
• These are defined within a parallel section
• master
  - Code block executed only by the master thread (all other threads skip it)
• critical
  - Code block executed by only one thread at a time
• atomic
  - A specific memory location must be updated atomically (like a mini critical section for writing to memory)
  - Applies to a single statement, not a code block
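A minimal sketch (not from the lecture) contrasting atomic and critical; the variables hits and maxval are illustrative:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long hits = 0;
        double maxval = 0.0;

        #pragma omp parallel
        {
            double local = omp_get_thread_num() * 1.5;  // per-thread value

            #pragma omp atomic          // single memory update, done atomically
            hits++;

            #pragma omp critical        // whole block, one thread at a time
            {
                if (local > maxval)
                    maxval = local;
            }
        }
        printf("hits = %ld, maxval = %f\n", hits, maxval);
        return 0;
    }

atomic is cheaper when it applies, since it can map to a hardware atomic operation such as the AMOs above; critical protects an arbitrary block.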
OpenMP Reduction
• Reduction specifies that one or more private variables are the subject of a reduction operation at the end of the parallel region
  - Clause: reduction(operation:var)
  - Operation: operator to perform on the variables at the end of the parallel region
  - Var: one or more variables on which to perform the scalar reduction

#pragma omp parallel for reduction(+:nSum)
for (i = START ; i <= END ; i++)
    nSum += i;
OpenMP Pitfall #1: Data Dependencies
• Consider the following code:

a[0] = 1;
for(i=1; i<5000; i++)
    a[i] = i + a[i-1];

• There are dependencies between loop iterations!
  - Splitting this loop between threads does not guarantee in-order execution
  - Out-of-order loop execution will result in undefined behavior (i.e. likely a wrong result)
OpenMP Pitfall #2: Sharing Issues
• Consider the following loop:

#pragma omp parallel for
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}

• temp is a shared variable! Declare it private instead:

#pragma omp parallel for private(temp)
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}
OpenMP Pitfall #3: Updating Shared Variables Simultaneously
• Now consider a global sum:

for(i=0; i<n; i++)
    sum = sum + a[i];

• This can be done by surrounding the summation with a critical/atomic section or a reduction clause:

#pragma omp parallel for reduction(+:sum)
for(i=0; i<n; i++)
    sum = sum + a[i];

  - The compiler can generate highly efficient code for reduction
OpenMP Pitfall #4: Parallel Overhead
• Spawning and releasing threads results in significant overhead
• Better to have fewer but larger parallel regions
  - Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)
OpenMP Pitfall #4: Parallel Overhead

start_time = omp_get_wtime();
for (i=0; i<Ndim; i++){
    for (j=0; j<Mdim; j++){
        tmp = 0.0;
        #pragma omp parallel for reduction(+:tmp)
        for( k=0; k<Pdim; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
        }
        *(C+(i*Ndim+j)) = tmp;
    }
}
run_time = omp_get_wtime() - start_time;

• Too much overhead in thread generation to have this statement run this frequently: a poor choice of loop to parallelize
• Compare with the earlier "Matrix Multiply in OpenMP" slide, which parallelizes the outermost loop instead