PARALLEL PROGRAMMING

EEEE3084 – Scalable, Cross-Platform Software Design
HYPOTHETICAL PROBLEM
• Each panel needs a ‘filter’ that links it to each other polygon (and possibly itself)
• Here there are 8 panels, giving 64 filters
• 1,000 panels give 1,000,000 filters (1,000 is still not a lot for this type of problem; there could be millions)
• 1,000,000 panels give 1,000,000,000,000 filters (1 million panels in an engineering problem is not uncommon)

• Each filter:
• Needs to know which panels it connects, i.e. its input and output
• Contains the transfer function / equation that links these
• Is probably represented by some sort of data structure or class; almost certainly you then have a large array of filters

• Perspective: 1,000,000,000,000 filters


• If each filter takes 1 µs to evaluate, the total time required is 1,000,000,000,000 × 10^-6 s = 1 million seconds, or about 278 hours
• If each filter takes 100 bytes to store, the total storage is 1,000,000,000,000 × 100 B = 10^14 B, roughly 90 TiB

• The point:
• Engineering problems get very large, very quickly.
• Clever ways of storing and processing very large sets of objects / arrays / matrices are needed
MOTIVATION
• Moore’s law
• Number of transistors on an integrated circuit doubles every two years
• Originally driven by the fact that transistors could be made smaller
• Smaller transistors switch faster – clock speeds increase
• Therefore processing speed doubled roughly every two years

• This only works up to a point!


• There is a physical lower limit on transistor size; as you approach it, it becomes more difficult to reduce size
• Therefore it becomes more difficult to increase clock speed

• What does clock speed determine?


• Number of operations per second
• If a program/function has X operations that the processor must execute, clock speed is directly linked to how fast the program will execute
https://queue.acm.org/detail.cfm?id=2181798

Admission: I did copy this chart from Wikipedia, which isn’t good practice! There is a more trustworthy source here: https://queue.acm.org/detail.cfm?id=2181798, but it didn’t have a nice clear chart.

I wouldn’t trust the actual numbers, but this does illustrate a general trend – in the early 2000s, processor clock speeds started to stabilise at around 3 GHz and haven’t really increased that much in the last 20 years.

At the same time, the number of cores on processors increased. If you can’t process instructions faster, process more of them at the same speed – parallel processing.
PARALLEL PROGRAMMING
• Idea is easy – just get multiple mini processors (cores) to work on the same program at the same time
• Double the number of cores, half the execution time? 100 cores, 1% of the execution time?

• Practice doesn’t work like that


• Computer programs are lists of instructions
• They are inherently designed to be executed in a sequence – i.e. one instruction usually depends on the results of instructions that came before
• This causes problems: you can’t run these instructions in parallel.

• Consequences:
• You cannot just compile an existing program to work on multiple cores – the program needs to be written in a way that identifies instructions that can be executed in parallel.
• This isn’t always possible – some things have to happen in a sequence

• Thought: What if about 50% of a program consists of non-sequential instructions?
• We can divide this 50% up amongst many cores (assume an infinite number for now), but the other 50% is going to have to run on a single core.
• What is the speed up? For the parallel part: 2 cores, ½ the time; 10 cores, 1/10 of the time; etc. ∞ cores, zero time
• For the sequential part – same as before; it was 50% of the original program so probably took about 50% of the total time
• An infinite number of cores has only achieved a 50% reduction in execution time!
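This limit is usually called Amdahl's law. As a compact statement of the reasoning above (a standard result, not taken from the slide), with p the parallelisable fraction and n the number of cores:

\[ \text{speedup}(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n\to\infty} \text{speedup}(n) = \frac{1}{1-p} \]

With p = 0.5, even infinite cores give a speedup of only 2x, i.e. the 50% reduction in execution time described above.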
PARALLEL PROGRAMMING
• Key point here: programs must be specifically designed to take advantage of parallel computing:
• As much of the program as possible must be parallelisable:
• if you must run 50% of it sequentially, you can only get a 50% reduction in execution time
• 30% sequentially – at best a 70% reduction, etc.
• This is assuming an infinite number of cores

• Other issues:
• The code that is parallelisable might limit the number of cores that can be used
• There is also the obvious potential hardware limit – how many cores do you actually have access to?
• Multiple cores ≠ multiple computers – they share memory, busses, etc. So two cores doesn’t actually mean half the time.
EXAMPLE APPLICATIONS
Where is parallel computing used?
REALLY OBVIOUS EXAMPLE
• This computer, running Windows
• Multiple applications
• Email – managing live connection to server
• Windows itself, tracking mouse movements, etc
• I might have LTSpice running a simulation in the background

• Here there are multiple independent processes, running their own code/instructions,
with their own data

• Multiple cores / processors can help with this – quite an easy problem to solve – just
assign different cores to different jobs/applications

• This type of problem – different operations with different data – can also occur within a single program, e.g. a program that is doing some engineering calculations

• However, often you have a single problem/process that you want to speed up through parallelisation

A potential complication here is that each process might need access to the same hardware/peripherals, e.g. RAM or hard drives, which will need managing.
COMPUTATIONAL ELECTROMAGNETICS
Computer model of a large 45 kV power converter in the Korean electricity network.

The aim is to predict currents and electromagnetic fields to determine electromagnetic emissions.
MODEL
• Image shows selected cross sections of the “mesh”

• The problem is divided into lots (millions) of small segments; the propagation of fields through these segments is then calculated

• This was done in commercial simulation software
• CST (at the request of the company)
• Ansys (our preferred tool)
RESULTS
• Test in commercial simulation software
• Electromagnetic simulation of inductor

• Two stages:
• Meshing – generating the triangle cells (218,000)
• Simulation – using the mesh to generate equations (218,000 simultaneous equations in matrix form) and then solving for the fields

• Meshing does not seem to be parallelised – time independent of number of cores

• Simulation time reduced by about 25% by going from 2 cores to 6.


METHODS
• Different methods can be used by the software
• Finite-element
• TLM
• …

• They all use the “divide the model into millions of pieces” approach; the maths is just a bit different.
• You can usually select between different methods (solvers) in the software

• Taking TLM as an example
• The TLM implementation in CST was originally developed at Nottingham University

Side note: Divide the model into cuboids. In each cube we sample the signals against time. The way each cube’s sample value changes with time depends upon the sample values in the 6 other cubes surrounding it. It’s a bit like a Mexican wave – what each cube does next depends upon what its neighbours are doing now.
EXAMPLE OF SCALING PROBLEM
So why parallel? The smaller the cubes, the more accurate the results.

• If a particular 3D problem uses 100 by 100 by 100 cubes we have
• 1,000,000 cubes and a run time of, say, 20 seconds

• To improve the accuracy by 2, using 200 by 200 by 200 cubes
• 8,000,000 cubes and a run time of 320 seconds (halving the cube size also typically means more, smaller time steps, so run time grows faster than the cube count)

• To improve the accuracy by 10, using 1000 by 1000 by 1000 cubes
• 1,000,000,000 cubes and a run time of 200,000 seconds (56 hours!)

Clearly run times explode!

Also we need a lot of memory; we may need “distributed computing” just because of this!
PARALLELISATION
• Each cube immediately affects only its 6 neighbours
• Might be possible to process some cubes in parallel, as results from one may not immediately affect the operation on the other core?

• The process applied to each cube is the same
• Same operation, different data
• Can just run the same code on each core with different input data

• Obviously all cubes are eventually linked
• Need to manage these links in the calculation
GENETIC ALGORITHMS
• Genetic algorithms – modern trial and error!

• I want to find the best design for a component.
• Identify “genes” or characteristics – lengths, widths, thicknesses, etc.
• Generate a generation
• Randomly determine lengths, widths, etc. for a number of samples in the generation
• Evaluate each member of the generation
• Pick the best
• Use the characteristics of the best to make a new generation
• E.g. make the new generation have combinations of lengths, widths, etc. that are similar to / combinations of the best in the previous generation
• Repeat

Flowchart: Generate initial population → Evaluate → Good enough? → Crossover / mutation → (repeat)

Evaluation might involve a computer simulation (see the previous example): a lot of work!
GENETIC ALGORITHMS
Flowchart: Generate initial population → Evaluate, Evaluate, … Evaluate (in parallel) → Good enough? → Crossover / mutation → (repeat)

• Evaluation of each new option is an independent process – take the genes for a particular case and return a “fitness” figure.

• Nice, clear points where parallel operation starts and ends.

• The evaluate part will represent a large proportion of the overall effort (a sketch of parallelising it follows below).
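A minimal C++/OpenMP sketch of parallelising the evaluate stage (the Candidate struct and evaluateFitness() are hypothetical stand-ins for illustration; OpenMP itself is introduced later in these slides):

#include <omp.h>
#include <vector>

struct Candidate {
    std::vector<double> genes;   // lengths, widths, thicknesses, ...
    double fitness = 0.0;
};

// Stand-in for the expensive evaluation (in practice this might be a full simulation)
double evaluateFitness( const Candidate &c ) {
    double sum = 0.0;
    for( double g : c.genes ) sum += g * g;
    return -sum;   // arbitrary fitness measure, purely illustrative
}

void evaluateGeneration( std::vector<Candidate> &population ) {
    // Each candidate is independent, so the loop parallelises cleanly;
    // dynamic scheduling helps if some evaluations take longer than others.
    #pragma omp parallel for schedule(dynamic)
    for( int i = 0; i < (int)population.size(); i++ ) {
        population[i].fitness = evaluateFitness( population[i] );
    }
    // Implicit barrier here: all fitness values are ready before selection,
    // crossover and mutation proceed on the master thread.
}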
MATRIX SOLVERS
• In lots of applications, including computer simulation, you get a problem to solve like:

  A x = b

• I.e. you need to find x through:

  x = A^{-1} b

• The problem is that A is a very large matrix (maybe 1 million square); x and b are the corresponding vectors
• Algorithms to do this are called matrix solvers – see 3rd year maths for more technical details
• The “obvious” types of algorithm factor the matrix
• A set of equations in matrix form like this is basically a set of simultaneous equations
• Solving by factorisation is similar to applying Gaussian elimination
• Typical algorithms include LU and QR factorisations
• The problem: these algorithms are inherently sequential – they cannot be parallelised (e.g. back-substitution must find x3 first, then x2, then x1 – each step needs the previous result)

Example (3 × 3) – equivalent forms:

  [A11 A12 A13] [x1]   [b1]        A11 x1 + A12 x2 + A13 x3 = b1
  [A21 A22 A23] [x2] = [b2]        A21 x1 + A22 x2 + A23 x3 = b2
  [A31 A32 A33] [x3]   [b3]        A31 x1 + A32 x2 + A33 x3 = b3

Factorisation converts to row echelon form:

  [A11' A12' A13'] [x1]   [b1']    A11' x1 + A12' x2 + A13' x3 = b1'
  [ 0   A22' A23'] [x2] = [b2']              A22' x2 + A23' x3 = b2'
  [ 0    0   A33'] [x3]   [b3']                        A33' x3 = b3'

https://www.geeksforgeeks.org/dsa/gaussian-elimination/
MATRIX-VECTOR PRODUCTS
• Alternative matrix operation:

  y = A x

• This time we want the product of a large matrix, A, and a vector, x

• Each value in y can be computed independently of all other values in y!

• This operation can be parallelised easily

Example (3 × 3) – each row can be evaluated on a different core:

  y1 = A11 x1 + A12 x2 + A13 x3
  y2 = A21 x1 + A22 x2 + A23 x3
  y3 = A31 x1 + A32 x2 + A33 x3

A potential complication here is that each parallel operation might need to access the same/different entries in x at the same time.
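A minimal OpenMP sketch of this parallel matrix–vector product (dense row-major storage; the function and variable names are illustrative):

#include <omp.h>
#include <vector>

// y = A x for a dense N x N matrix stored row-major in a flat array.
// Each row of the result is independent, so the outer loop parallelises directly.
void matVec( const std::vector<double> &A, const std::vector<double> &x,
             std::vector<double> &y, int N ) {
    #pragma omp parallel for schedule(static)
    for( int i = 0; i < N; i++ ) {
        double sum = 0.0;                   // private accumulator for this row
        for( int j = 0; j < N; j++ ) {
            sum += A[i * N + j] * x[j];     // every thread reads the shared x
        }
        y[i] = sum;                         // each thread writes a different y[i]
    }
}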


ITERATIVE MATRIX SOLVERS
• You can solve a matrix equation using matrix–vector products; the problem to solve is:

  A x = b

• We know A and b, and want to find x. Start by guessing x, then evaluate the product:

  y = A x

• If y = b, the guess was correct; if not, then guess again
• This type of method is called an iterative solver – you guess and iterate until you get the right answer
• A bit more complicated than this in reality!
• The point is that you are able to do a solve in a different way, one that can be parallelised
• Typical algorithms include Conjugate Gradient and GMRES
• Sometimes different algorithms are used because they can be parallelised (and for other reasons – again, see 3rd year maths!)
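As an illustration of the guess-and-iterate idea (not the Conjugate Gradient or GMRES algorithms named above, and assuming a diagonally dominant A so the iteration converges), a minimal Jacobi-style sketch:

#include <cmath>
#include <vector>

// Minimal Jacobi iteration for A x = b, with A dense, row-major and N x N.
std::vector<double> jacobiSolve( const std::vector<double> &A, const std::vector<double> &b,
                                 int N, int maxIter = 1000, double tol = 1e-10 ) {
    std::vector<double> x(N, 0.0), xNew(N, 0.0);   // initial guess: all zeros
    for( int iter = 0; iter < maxIter; iter++ ) {
        // Each xNew[i] depends only on the previous x, so this loop is one of the
        // parallelisable matrix-vector style operations discussed above.
        for( int i = 0; i < N; i++ ) {
            double sum = b[i];
            for( int j = 0; j < N; j++ )
                if( j != i ) sum -= A[i * N + j] * x[j];
            xNew[i] = sum / A[i * N + i];
        }
        // Stop when the guess has stopped moving.
        double diff = 0.0;
        for( int i = 0; i < N; i++ ) diff += std::fabs( xNew[i] - x[i] );
        x = xNew;
        if( diff < tol ) break;
    }
    return x;
}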
SOME OBSERVATIONS
• Different types of problem
• Same operation, different data (come back to this later)
• Different operations, different data

• Some algorithms that are an obvious choice for serial programming don’t work for parallel programming
• A “second-rate” serial algorithm might be preferable for parallel programming

• Processing / instructions are only one part of the problem
• If cores / processors share the same memory and busses, they can hold each other up while they wait to read/write data

• Synchronisation
• E.g. in the case of the Genetic Algorithm, you need to evaluate each option and wait for all results before you proceed with the next generation
• How do you synchronise this?
• Mutually exclusive variables – only one process can access a variable at one time; another that tries will be forced to wait until the variable is released
• “Wait for” points – wait until all child processes reach this point

Links:
https://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-9/details-about-mmx-technology-intrinsics.html
https://www.geeksforgeeks.org/cpp/thread-synchronization-in-cpp/
PRACTICAL PARALLEL PROGRAMMING
• Requirement: need to create multiple “threads”
• Lots of ways of doing this: Windows API, Qt GUI, Gtk GUI, C++ std library
• These are generally designed for application-level multithreading
• E.g. I want a thread to handle my GUI while another thread loads a file (to prevent the GUI from freezing)

• Requirement: I want lots of threads working on the same integrated algorithm.
• Now need a more specialised library that also helps with synchronisation
• OpenMP – Open MultiProcessing: relatively simple, allows parts of a program to divide into multiple threads through extra compiler directives
• MPI – Message Passing Interface: more complex, allows more complete division into processes. Covered in 4th year computing
• CUDA – for Nvidia graphics processors
• OpenCL – for graphics processors
• Graphics processors contain huge numbers of cores – see 4th year computing for more on this!
PRACTICAL PARALLEL PROGRAMMING
Thread: a set of instructions that can execute in parallel with other threads.
• Threads are generally lightweight: a program quickly divides into multiple threads, each executing instructions in parallel, and then the threads recombine.
• Threads share memory – they can see the same variables; a pointer in one thread can point to a variable in another thread. This can be good or bad.

Process: a mini program.
• Processes are more independent of each other than threads; there is more overhead involved in setting them up.
• A process can contain multiple threads.
• Processes have their own “address space” – they cannot see variables in other processes.
PARALLEL PROGRAMMING
Hardware and software implementations
PROCESSORS
• Processors – do basic operations on numbers, add, subtract, compare, etc
• They do this by reading instructions (numbers) from RAM, these instructions then tell them to read
other numbers from RAM, do an operation, write results back to RAM
• Contain:
• ALU – arithmetic logic unit
• Registers – temporary storage for working variables
• Cache – a local copy of a section of RAM, much faster to access than going back to RAM all the time

• Each instruction actually consists of multiple stages:
• IF: Instruction fetch – read the instruction from RAM (or cache)
• ID: Instruction decode – different instructions require different numbers of arguments; absolute addresses may need computing
• EX: Execute the instruction
• ME: Look up values in memory if the operation requires it
• WB: Write results back to registers

(Diagram: one instruction passing through the five stages IF, ID, EX, ME, WB over cycles 1–5)

• Each of the above stages in theory takes one clock cycle – i.e. if your clock frequency is 1 GHz, the period is 1 ns, so potentially 5 ns to execute one instruction
• One instruction requires up to 5 separate actions in the processor; this means up to 5 clock cycles
PIPELINING
• Each of the 5 actions is different and using different hardware features,
e.g. the logic unit to execute an instruction or the memory bus to fetch
instructions
• This means the hardware is not fully utilised

• Pipelining allows the processor to work on multiple instructions at the same time
• E.g. at cycle 4, it is doing the ME stage for one instruction, the EX stage for the next instruction and the ID stage for the one after that
• This is instruction-level parallelism
• The software is not multithreaded, it’s still sequential – it’s just that multiple instructions (in the sequence) are being worked on in parallel

(Diagram: three instructions staggered by one cycle, each passing through IF, ID, EX, ME, WB over cycles 1–7; one instruction still requires up to 5 separate actions, i.e. up to 5 clock cycles)
PIPELINING – BRANCH PREDICTION
• Branches in code (e.g. if statements) cause problems
• Depending which way the if goes, the following instructions will be different
• Pipelining involves working on the following instructions while processing the current one
• The outcome of the current instruction (the if) determines the following instructions

• Branch prediction can be used to guess which way an if statement goes, so that the correct pipeline is set up
• If the guess is wrong, the pipeline needs rewriting

if( x > 4. ) {
    x *= 10.;
    y = x + 3.;
} else {
    x *= 5.;
    y = x - 3.;
}
BRANCH PREDICTION – METHODS
• If you guess randomly on an if statement, you’ll still get it right 50% of the time

• Better methods will remember previous outcomes and guess that way

• Different levels are possible
• If you just remember the last outcome, this works in a loop with one if statement
• If there are two if statements then they confuse a single-level predictor – a two-level predictor would have the capability of spotting patterns in the last two if statements

• The way a program is written can help or hinder branch prediction, which can make a difference to speed
• Example….

for( i = 0; i < 10; i++ ) {
    if( x > 40 ) {
        x += 5;
        y--;
    } else {
        x += 10;
        y++;
    }

    if( y > 4. ) {
        y++;
    } else {
        y += 2;
    }
}
PARALLEL PROCESSING – VECTOR PROCESSORS
• Another type of parallel processing, at CPU instruction level

• Each operation in a classic desktop CPU (e.g. an i486 from the 90’s) does something; this something is add/subtract/compare/etc. on 32-bit numbers.
• One core, following one list of instructions (the program), where each instruction is a relatively simple operation on 32-bit numbers – this is not parallel processing. Known as Single Instruction Single Datastream (SISD)
• These instructions may be pipelined, but the processor is still following one list of instructions, each dealing with 32-bit numbers (for a 486, but the same applies to the 64-bit standard operations on one core of a modern Intel Core processor, for example)

• Some classic supercomputers had processors designed to operate on large vectors, rather than small 32-bit (or 64-bit) numbers.
• E.g. Cray supercomputers
• Still have single instructions, but these act on larger vectors – multiple numbers at the same time
• One add instruction adds a set of numbers to another set, pairwise
• One instruction, multiple sub-operations undertaken by hardware in parallel

(Diagram: x = a + b performed element-wise: x1 = a1 + b1, x2 = a2 + b2, x3 = a3 + b3, x4 = a4 + b4)
VECTOR PROCESSING IN DESKTOP CPUS
• This was adopted in desktop CPUs, e.g. Intel processors in the 90’s:
i486 -> Pentium -> Pentium MMX

• MMX was additional instructions and registers that allowed operations on vectors
• 64-bit registers that can be filled with 2 x 32-bit numbers, 4 x 16-bit numbers, or 8 x 8-bit numbers
• All numbers in the registers can be processed by the same operation at the same time – in parallel, e.g. adding 8 8-bit numbers in one operation
• Various further developments including SSE and AVX

• SSE: Streaming SIMD Extensions, AVX: Advanced Vector Extensions

• SIMD: Single Instruction, Multiple Data


VECTOR PROCESSING LOOP EXAMPLE
• Try a basic addition of some arrays:
• Compile with -O0 and -O3.

vp.cpp:

#include <cstdlib>   // rand(), srand()
#include <ctime>
#include <iostream>

using namespace std;

int main() {
    int v1[64], v2[64], v3[64];

    srand( time(NULL) );

    // Fill the input arrays with random values
    for( int i = 0; i < 64; i++ ) {
        v1[i] = rand();
        v2[i] = rand();
    }

    // Element-wise addition - the loop the compiler can vectorise
    for( int i = 0; i < 64; i++ ) {
        v3[i] = v1[i] + v2[i];
    }

    for( int i = 0; i < 64; i++ ) {
        cout << v3[i] << " ";
    }
}

(Assembly output shown for -O0)

VECTOR PROCESSING LOOP EXAMPLE
(Assembly output shown for -O3)

• MOVUPD – move unaligned packed double-precision FP
• MOVDQU – move unaligned double quadword
• PADDD – add packed integers
• These are vectorised operations – moving / adding multiple ints at a time
MULTI-CORE PARALLELISATION
• Cores – multiple independent processing units – each has its own list of instructions to work through
• Parallelisation at software level – write the software so that it divides the work up into different parts that can be done in parallel
• E.g. the loop example here – get one core to do one half of the loop, another does the other half – at the same time!

• This won’t always be possible – what if each loop iteration depends on the result of the previous iteration?

Sequential code – each loop iteration performed by a single core in turn:

for( i = 0; i < 100; i++ ) {
    a[i] = b[i] + c[i];
}

Parallel code – split the loop into two parts, give each part to a different core to execute in parallel:

for( i = 0; i < 50; i++ ) {        for( i = 50; i < 100; i++ ) {
    a[i] = b[i] + c[i];                a[i] = b[i] + c[i];
}                                  }
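One way to express that manual split by hand, before any OpenMP, is with C++ std::thread; a minimal sketch using the array names from the slide:

#include <thread>

// Each thread writes a disjoint half of a[], so no synchronisation is needed
// beyond joining the threads at the end.
void addHalf( const int *b, const int *c, int *a, int begin, int end ) {
    for( int i = begin; i < end; i++ ) {
        a[i] = b[i] + c[i];
    }
}

void parallelAdd( const int *b, const int *c, int *a ) {
    std::thread t1( addHalf, b, c, a, 0, 50 );    // first half of the loop
    std::thread t2( addHalf, b, c, a, 50, 100 );  // second half of the loop
    t1.join();   // wait for both halves to finish
    t2.join();
}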
MULTI-CORE – HOW?
• Need a way of telling the compiler which parts of the program can be parallelised, e.g. certain loops
• The ideal case is that you just tell the compiler what to parallelise, and it works out the rest
• It will need to know certain information though:
• E.g. a, b, c (the arrays / pointers) are the same in all parallel instances
• Each instance will need its own copy of i – each loop instance will be iterating through different values of i at the same time
• Just want some additional lines of code that can be added to the program to specify these things
• OpenMP works like this (a real OpenMP version of this pseudo-directive is sketched below)

// Compiler_please_split_next_loop_between_the_available_cores
for( i = 0; i < 100; i++ ) {
    a[i] = b[i] + c[i];
}
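For reference, the real OpenMP form of that pseudo-directive (covered properly later in these slides) is simply:

#include <omp.h>

void addArrays( int *a, const int *b, const int *c ) {
    // Ask the compiler to split the iterations of the following loop across
    // the available cores; the loop variable i is private to each thread.
    #pragma omp parallel for
    for( int i = 0; i < 100; i++ ) {
        a[i] = b[i] + c[i];
    }
}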
MULTI-CORE – HOW?
• Another example – not just loops; you might want to split a task into sections that can be done in parallel
• This type of method – adding extra commands to existing code – can allow serial code to be easily retrofitted for partial parallel execution (a real OpenMP sections version is sketched below)

main(){
    //Compiler_please_use_a_separate_CPU_for_each_of_the_following_sections

    //Section#1
    int a = get_a();

    //Section#2
    int b = get_b();

    //Compiler_wait_here_till_both_sections_completed

    int c = a + b;
}
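Again for reference, a sketch of the same structure using real OpenMP sections (get_a() and get_b() are the slide's placeholder functions, given trivial bodies here so the example compiles):

#include <omp.h>

int get_a() { return 1; }   // trivial stand-ins for the slide's placeholders
int get_b() { return 2; }

int main() {
    int a = 0, b = 0;

    // Each section runs on its own thread; the implicit barrier at the end of
    // the sections block is the "wait here until both sections complete" point.
    #pragma omp parallel sections
    {
        #pragma omp section
        { a = get_a(); }

        #pragma omp section
        { b = get_b(); }
    }

    int c = a + b;   // safe: both sections have completed by this point
    return c;
}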
MULTI-CORE – HOW MUCH BETTER?
• Two cores, half the time?

// Compiler_please_split_next_loop_between_the_available_cores
for( i = 0; i < 100; i++ ) {
    a[i] = b[i] + c[i];
}

(Charts: execution time vs array size for 1, 2 and 4 cores)
• Wishful thinking: time ∝ 1 / nCores
• In practice there is an overhead to manage two cores, and the overhead to manage four cores outweighs any advantage for most scenarios
MULTI-CORE – HOW MUCH BETTER?
• For very large arrays, and/or complex algorithms in the loop, will more cores always be better?

(Chart: execution time vs array size for 1, 2 and 4 cores)

• Threads can get in each other’s way – they are sharing memory, so the bottleneck can become memory access. There is no benefit to extra cores if the limiting factor is getting data to/from memory over a shared bus. The extra complexity of extra cores means it might actually be slightly slower.

• For small arrays, time is dominated by the time taken to manage the threads – the overhead involved in dividing work between cores dominates.
MEMORY ACCESS BOTTLENECKS
• Accessing RAM is quite slow – a core can process data more quickly than it can read/write numbers in RAM
• Multiple cores will not help this situation as multiple cores on same processor use same RAM!

• Not quite this simple though – processors have cache

• High level example:


• Before the internet, you might have relied on textbooks, and these would usually have come from the library.
• If you have 5 modules and 5 textbooks, and you go to the library to get the relevant one each time you need it – slow.
• Go to the library once, get all 5, put them on a shelf above your desk – much faster.
• The shelf is your cache.

• Problem: Your shelf isn’t infinitely big, you can’t put a whole library there. The solution is to try to optimise trips
to the library – make sure the right books are always on the shelf, with the minimum number of library trips.

• Processors work like this – they have cache: fast, but small, memory available on the chip.
CACHE
(Diagram: CPU → L1 data cache / L1 instruction cache → L2 cache → L3 cache → main memory (RAM))

• Multiple levels – L1 is close to the CPU, fast and small; L3 is larger but further away and slower
• From “Memory part 2: CPU caches”, http://lwn.net/Articles/252125/, these are the numbers Intel lists for a Pentium M – access times in cycles:
• Register: 1 (on CPU)
• L1 data: ~3 (fast cache)
• L2: ~14 (slow cache)
• Main memory: ~240

• OK, so let’s use big caches – very expensive, limit to what can fit on the chip
• So use small caches – can’t fit much in, rely on RAM access more often, very slow

• The L1 cache is also split: one part holds the instructions to execute and the other the data they will work on

• This isn’t a parallel-specific issue – the same problems arise in serial code too
• Parallel code does complicate the problem though!
CACHE
(Diagram: two cores, each with private L1 data / L1 instruction caches and a private L2 cache, sharing a single L3 cache connected to main memory)

• Some of the cache (typically L3) is shared
• The contents of this cache must satisfy all cores
MY LAPTOP
• CTRL-ALT-DEL on Windows → Performance tab

• AMD Ryzen 7 PRO 4750U

• I have:
• 512 KB L1
• 4.0 MB L2
• 8.0 MB L3
• 8 cores

• Note: I also have 16 “logical processors”

• So is the cache per core? Shared?
MY LAPTOP
• I looked the processor up on one of the benchmarking / technical spec websites

• The cache numbers I have from Windows are the totals across all cores

• L1 – 8 separate 32 KB L1 data caches and 8 separate 32 KB L1 instruction caches (one per core)

• L2 – 8 separate 512 KB L2 caches (one per core)

• L3 – one 8 MB L3 cache (shared between cores)

https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+7+PRO+4750U&id=3740
CACHE COHERENCE
(Diagram: two cores, each with private L1/L2 caches holding their own copy of the shared variable ‘x’/‘a’, both backed by a shared L3 cache and main memory)

• The code for section 1 and section 2 will each retrieve the variable ‘a’ and modify it

• Both sections might have ‘a’ loaded into their core-specific cache

• One section will probably get to the line where it modifies ‘a’ first; the value of ‘a’ in the other cache is now invalid

• If the second section were to use its cached value for ‘a’, write its result, and save this back to main memory, there would be an error

• There need to be mechanisms to ensure the independent caches are kept up to date based on what is happening in other cores

• We’ll come back to this later

//Section#1
result1 = function1();
a += result1;

//Section#2
result2 = function2();
a += result2;
SHARED MEMORY MODEL
• What we’ve just talked about involves memory that is shared between cores
• Shared memory approach
• OpenMP implements this (we’ll look at OpenMP next)

(Diagram: several CPUs connected to one shared memory)

• Advantages of the shared memory model
• Easy to retrofit serial code
• Avoids copies of data

• Disadvantages
• Need a compiler that allows extra directives to be added to code and supports them
• All of the problems just discussed with shared memory
• Large scale parallel computing isn’t “8 cores + some RAM”…
DISTRIBUTED MEMORY MODEL
(Diagram: a shared memory machine – several CPUs and one memory – versus a cluster of nodes, each CPU with its own local memory, connected by a network)

• Shared memory: limited to single machines, typically with 1 or 2 processors
• This might limit you to two 8-core processors.
• Might not be enough for large scale computing

• Distributed memory model: a collection of nodes with their own processors and memory – a “cluster”
• Each CPU only accesses its own memory
• There must be a data exchange mechanism
• Each node runs its own copy of the program and exchanges messages with other nodes over a switched network
DISTRIBUTED MEMORY MODEL
• Different CPUs take different routes through the same program, sending and receiving messages along the way
• No shared memory – data sharing must be achieved using explicit message passing
• Synchronisation is still an issue
• The size of messages is an issue – sending data across a network is much slower than even main memory access

• Advantages:
• A lot of flexibility for different platforms
• A lot of processor scalability – 10s to 1000s of cores
• A lot of memory scalability – but it’s not in the same place

• Disadvantages:
• Software must be specifically designed to get good results – to maximise usage of available cores and minimise network traffic
• As a result, much harder to retrofit existing serial code

• One library for this is MPI – Message Passing Interface; see Advanced Computing in 4th year.

main() { // Launched on all CPUs simultaneously
    int myID = getMyID();
    int a, b, c;

    if( myID == CPU_1 ) {
        a = get_a();
    } else if( myID == CPU_2 ) {
        int b = get_b();
        sendMessage( CPU_1, b );
    }

    if( myID == CPU_1 ) {
        b = waitForMessage( CPU_2 );
        c = a + b;
    }
}
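For comparison, a minimal sketch of the same idea written against the real MPI API (the rank numbers and the get_a()/get_b() stand-ins are illustrative; MPI itself is covered in 4th year):

#include <mpi.h>

int get_a() { return 1; }   // trivial stand-ins for the slide's placeholders
int get_b() { return 2; }

int main( int argc, char **argv ) {
    MPI_Init( &argc, &argv );

    int myID;
    MPI_Comm_rank( MPI_COMM_WORLD, &myID );   // which copy of the program am I?

    if( myID == 1 ) {
        int b = get_b();
        MPI_Send( &b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );   // send b to rank 0
    } else if( myID == 0 ) {
        int a = get_a();
        int b = 0;
        // Block until rank 1's message arrives
        MPI_Recv( &b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        int c = a + b;
        (void)c;   // the result would be used here
    }

    MPI_Finalize();
    return 0;
}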
FINAL THOUGHTS ON HARDWARE/SOFTWARE
(Diagram: shared memory machine versus a distributed memory cluster, as before)

• Optimising for speed/memory use cannot be done without consideration of the hardware!
• “Just using more cores” – will this actually help, or will other memory issues get in the way?
• How is the cache structured, etc.

• There will always be a balance with portability (cross-platform)
• Highly optimised code for one machine might perform poorly on a different machine
• High performance applications on specific clusters might be specifically optimised for those machines
• It might be possible to have the same code compiled in slightly different ways for different systems – e.g. the compiler detects cache size, or network speed, etc., and optimises the compiled code to suit

• However: all software can consider generic issues like:
• Avoiding code that will typically result in lots of main memory access
• Helping pipelining by not confusing the branch prediction

(more on this later)
BACK TO A PREVIOUS POINT…
• CTRL-ALT-DEL on Windows → Performance tab
• AMD Ryzen 7 PRO 4750U
• 512 KB L1, 4.0 MB L2, 8.0 MB L3, 8 cores
• Note: I also have 16 “logical processors”
• So is the cache per core? Shared?
BACK TO A PREVIOUS POINT…
• This is called HyperThreading – it allows one core to run two threads simultaneously
• 16 logical processors means the OS sees two cores for each physical core

• So how is this different to two cores?
• Two cores have two logic units, two instruction decoding units, two L1/L2 caches, etc. – effectively two independent processors
• HyperThreading is only one core, with a few extra registers
• It allows the core to remember “where it is” in two different threads
• This means it can very quickly switch between the two threads

• Where is it useful?
• If you have lots of threads that do not require a lot of processing power. Cores have to switch between different threads a lot to service them all. A desktop PC running many applications – each one doesn’t need a lot of processing power, but each will have many threads required to keep its GUI alive, maintain a network connection, etc.
• If a processor has two threads it can work on, and one needs something fetching from main memory (due to a cache miss), it can just jump to the other thread rather than doing nothing

• Where is it less useful?
• If you have computationally demanding threads, you are better off having the core keep working on one thread.
• Splitting the work into two threads, then having the same core jump back and forwards between them, isn’t likely to give you any benefit. You don’t increase the amount of processing power available, but you do increase the overheads.
OPEN MP
Practical examples
OPENMP – WHAT IS IT?
• “An Application Program Interface (API) for writing multithreaded
applications in a shared memory environment”
• Comprised of three primary API components:
• Compiler Directives
• Runtime Library Routines
• Environment Variables

• Relatively easy and straightforward to implement in C/C++/Fortran

• Thread based – shared memory model (diagram: several CPUs/cores sharing one memory)
• See
• http://openmp.org
• https://computing.llnl.gov/?set=training&page=index#training_materials
MOTIVATION
• Threads – lightweight branches of code that appear, run in parallel, then disappear
• Master thread and worker threads that exist for finite time
• A fork-join model

(Diagram: a master thread running serial code forks a parallel team of worker threads, the team executes the parallel region, then joins back to the master thread; this fork–join repeats for each parallel region)

EXAMPLE
• #pragma is a compiler directive
• It tells the compiler that it needs to do something (it is not part of the C++ code)
• In this case: create separate threads for the block of code that follows

#include <omp.h> // Contains the OpenMP library functions

main () {
    int var1, var2, var3;

    // Some serial code

    // Beginning of parallel section. Generate
    // a team of threads.
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        // Parallel section: this code is executed by
        // all threads in the parallel team.
        // After which all non-master threads rejoin the
        // master thread and the team is destroyed
    }

    // Resume serial code
}
GENERAL FORM OF OPENMP CONSTRUCTS
#pragma omp directive-name [clauses...]

Element        | Definition                                                                                        | Meaning
#pragma omp    | Required for all OpenMP C/C++ directives                                                          | This is an OpenMP directive
directive-name | A valid OpenMP directive. Must appear after the pragma and before any clauses                     | What to do
[clauses...]   | Optional. Clauses can be in any order, and are repeated as necessary unless otherwise restricted  | With these options
OMP PARALLEL
• The fundamental construct that initiates parallel execution

• Note the construct is spread over multiple lines
• Use of ‘\’ to allow this (i.e. ‘\’ means that the construct continues on the next line)

• There are lots of options!

#pragma omp parallel \          // directive-name=parallel
    private (var1, var2, …) \   // clauses
    shared (var1, var2, …) \
    firstprivate(var1, var2, …) \
    lastprivate(var1, var2, …) \
    if(scalar expression) \
    default(shared|none) \
{
    // a structured block of code which each
    // thread of a team executes
}

https://www.openmp.org/spec-html/5.0/openmpse14.html
PRIVATE VS SHARED
• private means each worker thread gets its own copy of the variables listed
• Useful for working variables that will have a different value in each thread
• Any variable defined inside the omp block is private by default
• Remember to initialise any private variables in the thread block – each thread has a new variable with this name

• shared means each worker thread shares the variables listed
• Useful for variables that hold a fixed value that each thread must use in calculations
• Any variable defined outside the thread block is shared by default (you need to override this to make it private)
• Need to be careful – if one thread changes a shared variable, it will affect other threads, and there is no way of knowing the relative timing of threads

https://www.openmp.org/spec-html/5.2/openmpsu37.html
https://www.openmp.org/spec-html/5.2/openmpsu36.html
OMP PARALLEL
• firstprivate
• private variables are usually not initialised, but firstprivate variables take their initial value from the value they had in the preceding code

• lastprivate
• private variables usually lose their value at the end of the parallel code. lastprivate variables copy their final value back to the same variable in the following code
• The value that will be seen is that of the last thread to finish – the one that copied its value in last

https://www.openmp.org/spec-html/5.2/openmpsu38.html
https://www.openmp.org/spec-html/5.2/openmpsu39.html
OMP PARALLEL
• default allows overriding the default private/shared assumptions for variables used in the parallel code

• default(shared) means that all variables in the parallel code will be shared across threads by default

• default(none) means there is no shared/private default setting. You must specify this explicitly with private/shared

• default(firstprivate) is also possible

https://www.openmp.org/spec-html/5.2/openmpsu35.html
HOW MANY THREADS?
• The number of threads can be controlled by function calls from within the program
• omp_set_num_threads( int num ), e.g. omp_set_num_threads( 3 );
• omp_get_num_threads()

• Or by setting the environment variable OMP_NUM_THREADS at the command prompt

• Each thread has a number from 0 (master) to N-1
• Calling omp_get_thread_num() from within parallel code retrieves the ID of the current thread

• In general there is no synchronisation between threads in the parallel region! Different threads reach particular statements at unpredictable times

• If not specified, the default number of threads is the number of cores on the platform
EXAMPLE – OMPTest1.cpp

#include <iostream>
#include <omp.h>

using namespace std;

main () {
    int nthreads, myid;

    // Don’t be stupid here, e.g. on a quad core platform, don’t set to 12!
    omp_set_num_threads(4);

    #pragma omp parallel private(myid) default(shared) // Each thread has its own myid variable
    {
        myid = omp_get_thread_num(); // Obtain and print thread id

        cout << "I am thread number: " << myid << endl;

        if( myid == 0 ) { // Only the master thread does this
            nthreads = omp_get_num_threads();
            cout << "Number of threads = " << nthreads;
        }
    }
    // All threads join the master thread and terminate
}

Some issues with this – will investigate later!
WORK SHARING CONSTRUCTS
• Work-Sharing constructs are placed inside parallel regions
• They distribute the execution of associated statements among existing threads.
However, no new threads are created
• No implied synchronisation between threads at the start of the work sharing
construct
• Types of Work-Sharing Constructs:
• for directive Divide the iterations of a loop amongst the threads
• sections directive Each thread does a different “section” of code
• single directive Only to be done by one thread - e.g. I/O
FOR DIRECTIVE
• Probably the most commonly used work-sharing construct

• Distributes the iterations of the immediately following loop amongst the threads in a team – allows for data parallelism

• By default there is a barrier at the end of the loop: wait until all threads are finished, then proceed. However, we can override this with the nowait clause to allow continuation into the following serial region without waiting

int j, a[N], b[N];
#pragma omp parallel shared(a,b) private(j)
{
    #pragma omp for
    for( j = 0; j < N; j++ ) {
        a[j] = a[j] + b[j];
    }
}

• Each thread works on the same a and b arrays
• Each thread takes different values of j, and therefore operates on different array elements, so there is no memory conflict
CONTROLLING THE WORK SHARING
• The schedule(type,[chunk]) clause controls how the iterations are distributed among the threads
• type may be one of the following: static , dynamic , guided, runtime
• The optional chunk, a loop-invariant positive integer constant or integer expression, is used to specify the size of each work parcel
(number of iterations). If omitted, implementation-dependent default values are used
• schedule (static) : Iterations are divided evenly among threads, i.e. 4 threads, 100 iterations each takes 25 iterations
• schedule (static, chunk): Divides the work load into chunk sized parcels. If there are N threads, each thread does every Nth chunk of work
• schedule (dynamic, chunk): Divides the workload into chunk sized parcels. As a thread finishes one chunk, it is assigned the next available
chunk. Default value for chunk is 1.
• schedule (guided, chunk): Like dynamic scheduling, but the chunk size varies dynamically, depending on the number of unassigned iterations. The chunk size decreases toward the specified value of chunk
• schedule (runtime): The scheduling decision is deferred until runtime by the environment variable OMP_SCHEDULE. It is illegal to specify a
chunk size for this clause
int j, a[N], b[N];
#pragma omp parallel shared(a,b) private(j)
{
    #pragma omp for schedule(static)
    for( j = 0; j < N; j++ ) {
        a[j] = a[j] + b[j];
    }
}
BASIC EXAMPLE
• A loop in serial code sets up the arrays
• Then there are two loops which both run across parallel threads
• Different scheduling options are specified for each loop

• Static scheduling takes less effort to manage; if we know each iteration takes the same amount of effort, just divide up the work equally
• Dynamic scheduling adjusts the work load for each thread to maximise performance
• We don’t know what func(i) does – it might take a different amount of time depending on the value of i
• In this case, it might be better to let the program work out how to distribute the work across threads

#include <omp.h>

#define N 10000

float func( int i ) {
    // do something (placeholder)
    return 2.0f * i;
}

main () {
    int i;
    float a[N], b[N], c[N];

    for( i = 0; i < N; i++ ) {
        // Just initialising the data here
        a[i] = b[i] = float(i);
    }

    #pragma omp parallel shared(a,b,c) private(i)
    {
        #pragma omp for schedule(static)
        for( i = 0; i < N; i++ ) {
            c[i] = a[i] + b[i];
        }
    }

    #pragma omp parallel shared(a,b,c) private(i)
    {
        #pragma omp for schedule(dynamic,100)
        for( i = 0; i < N; i++ ) {
            c[i] = func(i);
        }
    }
}
SECTIONS DIRECTIVE
• Each parallel section is run on a separate thread
• Allows for functional decomposition
• There is an implicit barrier at the end of the sections construct. Use the nowait clause to suppress this
• Valid clauses include:
• private(list)
• firstprivate(list)
• lastprivate(list)
• nowait

#pragma omp sections [clauses…]
{
    #pragma omp section
    {
        // Code block
    }

    #pragma omp section
    {
        // Code block
    }
}
SINGLE/MASTER DIRECTIVE
• The single directive specifies that the enclosed code is to be executed by only one thread in the team
• Valid clauses include:
• private(list)
• firstprivate(list)
• nowait

#pragma omp single [clauses…]
{
    // Code block
}

• Code in a master block is executed only by the master thread. Other threads skip over the entire master region

#pragma omp master
{
    // Code block
}

• Can be useful when dealing with sections of code that are not thread-safe (such as I/O)
• You don’t normally want multiple threads trying to write to I/O functions – it could happen at the same time
• Many I/O and GUI functions are not thread-safe – their behaviour is undefined if threads other than the main thread access them
SYNCHRONISATION CONSTRUCTS

critical:
#pragma omp critical [ name ]
{
    // Code block
}
Ensures that a code block is executed by only one thread at a time in a parallel region. When one thread is in the critical region, the others wait until the thread inside exits the critical section. name is optional – it identifies the critical region. Multiple critical sections are independent of one another unless they use the same name. All unnamed critical regions are considered to have the same identity.

barrier:
#pragma omp barrier
Causes threads to stop until all threads have reached the barrier.

flush:
#pragma omp flush (var1, …)
Causes the present value of the named shared variable to be immediately written back (“flushed”) to memory. Enables signalling between threads by using a shared variable as a semaphore (protected variable). When other threads see that the shared variable has been changed, they know that an event has occurred and proceed accordingly.

ordered:
#pragma omp for ordered [clauses]
for( i = 0; i < N; i++ ) {
    #pragma omp ordered
    {
    }
}
Within an ordered region, loop iterations are forced to be executed in sequential order. An ordered region can only appear in a parallel loop, and the parallel loop directive must contain the ordered clause. The code in the ordered region will be executed in thread order – thread 0 will do it first, thread 1 will wait until thread 0 has finished this part of the loop, etc.
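As a small illustration of why critical matters, a sketch (not from the slides) of accumulating into a shared variable – without the critical directive the += on the shared total would be a race between threads:

#include <iostream>
#include <omp.h>

int main() {
    double total = 0.0;                      // shared between all threads

    #pragma omp parallel for
    for( int i = 0; i < 1000; i++ ) {
        double contribution = i * 0.5;       // private work for this iteration

        // Only one thread at a time may update the shared total;
        // without this, concurrent updates could be lost.
        #pragma omp critical
        {
            total += contribution;
        }
    }

    std::cout << total << std::endl;
    return 0;
}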
DEMONSTRATION
OMPTest1.cpp
OMPTest2.cpp
TESTING PERFORMANCE
What difference does it make?

SpeedTest1.cpp:

void doRun( int N, ofstream &f ) {
    double result = 0.;
    double *a = new double[N];

    auto start = chrono::high_resolution_clock::now();

    #pragma omp parallel for schedule(static)
    for( int i = 0; i < N; i++ ) {
        a[i] = rand();
    }
    for( int i = 0; i < N; i++ ) {
        result += a[i];
    }
    auto end = chrono::high_resolution_clock::now();
    auto duration = chrono::duration_cast<chrono::nanoseconds>(end - start).count();

    f << N
      << "\t"
      << duration
      << endl;

    delete [] a;
}

main (){
    omp_set_num_threads(8);
    ofstream timing( "timing.txt" );

    for( int i = 10; i < 100000000; i*=1.2 ) {
        doRun( i, timing );
    }
    timing.close();
    cout << "done" << endl;
}

Note that I have only added OpenMP to the rand() loop – this is the one that dominates the runtime. Code is available if you want to try the other loop.

The large array is dynamically allocated using new. If you statically allocate it (i.e. double a[N]), it goes on the stack. There is a maximum stack size (typically a few MB); if you exceed this the program will crash.

new (or malloc() in C) allocates on the heap. You can request as much as you like on the heap; the limit is the address space – if you request 1 GB, there needs to be a contiguous 1 GB block of empty memory addresses. This is not usually a problem with 64-bit programs, but you can easily hit it if compiled 32-bit.
RESULTS
(Chart: execution time (ns) vs N on linear scales, for no OpenMP and OpenMP with 2, 4, 8 and 16 threads)

• 2 threads approximately 2x faster than 1 thread
• 4 threads approximately 1.5x faster than 2 threads
• 8 threads approximately 1.2x faster than 4 threads
• 16 threads about 1.2x faster than 8 threads – hyperthreading is giving some advantage

• You don’t get the theoretical speed up with larger numbers of cores

• Need to be a little bit careful though: the number of loop iterations goes up to 400,000,000 – the timings here are much larger than if we only had 10 or 100 iterations
• The large scale range obscures this low region
• Can plot on a log scale to get a better overall picture…
RESULTS
(Chart: execution time (ns) vs N on log–log scales, for no OpenMP and OpenMP with 2, 4, 8 and 16 threads; the gap at large N shows the speedup, the gap at small N shows the overhead)

• OpenMP adds additional overhead – creating and recombining the threads
• Slower when N is small

• The absolute value of this overhead is relatively small (around 0.1 ms)

• But each loop iteration is taking around 1/10 of a microsecond
• 100,000,000 loops – the loop time dominates
• With 10 or 100 loops, the overhead is significantly larger than the time taken to complete the loops

• Worth remembering that this is a very simple example – it’s just a loop that calls a function.
• Also – the rand() function is quite demanding; each loop iteration takes some effort. This means parallelising is likely to help
BIQUAD FILTER – SpeedTest2.cpp
• Back to the example from previous weeks

• Using a modified class – some extra functions added to support these tests
• A default constructor BiQuad() – because we need to create uninitialised BiQuads
• A setCoeffs() function – this allows the internal filter coefficients of an uninitialised BiQuad to be updated at a later time
• The computational effort in evaluating each filter is relatively low, compared to rand() anyway

template<typename T>
class BiQuad {
private:
    T u2, u1, u0;    // three coefficients for the numerator
    T v2, v1;        // two coefficients for the denominator
    T vin1, vin2;    // past values for vin
    T vout1, vout2;  // past values for vout
    T result;

public:
    // constructor that sets up coefficient values
    BiQuad( T _u0, T _u1, T _u2, T _v1, T _v2 )
        : u0(_u0), u1(_u1), u2(_u2), v1(_v1), v2(_v2), vin1(0), vin2(0), vout1(0), vout2(0)
    {
    }

    BiQuad() : u0(0), u1(0), u2(0), v1(0), v2(0), vin1(0), vin2(0), vout1(0), vout2(0) {}

    void setCoeffs( T _u0, T _u1, T _u2, T _v1, T _v2 ) {
        u0 = _u0;
        u1 = _u1;
        u2 = _u2;
        v1 = _v1;
        v2 = _v2;
    }

    void compute( T vin ) {
        result = u0*vin + u1*vin1 + u2*vin2 - v1*vout1 - v2*vout2;
        vin2 = vin1; vin1 = vin;
        vout2 = vout1; vout1 = result;
    }

    T operator() ( T vin ) {
        // Compute output
        T vout = u0*vin + u1*vin1 + u2*vin2 - v1*vout1 - v2*vout2;
        // Save previous values
        vin2 = vin1; vin1 = vin;
        vout2 = vout1; vout1 = vout;
        // return output
        return vout;
    }
};
BIQUAD EXAMPLE – SpeedTest2.cpp
• Back to the example from previous weeks

• Also added a doRun() function. This:
• Creates N filters
• Initialises them
• Loops through them and evaluates each filter
• Times the computation loop
• Saves N and the time to a file provided as an argument
• Also prints some information to stdout – just so we can monitor progress

• The aim is to allow the time taken to evaluate N filters, for many different values of N, to be computed and saved

• We can then try to use OpenMP to see how much faster things can be with parallel processing.

void doRun( int N, ofstream &f ) {
    BiQuad<double> *filters = new BiQuad<double>[N];
    double *results = new double[N];

    for( int i = 0; i < N; i++ )
        filters[i].setCoeffs( 0.01, 0.01, 0.01, 0.01, 0.01 );

    auto start = chrono::high_resolution_clock::now();

    #pragma omp parallel for schedule(static)
    for( int i = 0; i < N; i++ ) {
        results[i] = filters[i]( 1. );
    }
    auto end = chrono::high_resolution_clock::now();
    auto duration = chrono::duration_cast<chrono::nanoseconds>(end - start).count();

    f << N
      << "\t"
      << duration
      << endl;

    cout << N
         << "\t"
         << duration
         << endl;

    delete [] filters;
    delete [] results;
}
BIQUAD FILTER – SpeedTest2.cpp
• Back to the example from previous weeks

• The main() function just opens a file for writing, loops through a range of values for N and calls doRun() for each.

main (){
    omp_set_num_threads(8);
    ofstream timing( "timing.txt" );

    for( int i = 10; i < 10000000; i*=1.2 ) {
        doRun( i, timing );
    }

    timing.close();

    cout << "done" << endl;
}
BIQUAD ARRAY EXAMPLE
(Chart: execution time (ns) vs N on linear scales, for no OpenMP and OpenMP with 2, 4 and 8 threads)

• Two cores help; further cores do not
BIQUAD ARRAY EXAMPLE
(Chart: execution time (ns) vs N on log–log scales, for no OpenMP and OpenMP with 2, 4 and 8 threads)

• Two cores help; further cores do not

• Extra threads “getting in each other’s way”?

• The “ripples” are also present again
SO WHAT'S GOING ON?

• Why does the BiQuad example with the array behave differently to the basic loop example?

• Why does a larger number of threads actually seem to cause a slight slowdown?

• Possibly something to do with memory access?
  • In the initial example there was just an array of doubles, whereas in the second example there is an array of doubles and an array of objects – logically this must be more demanding in terms of memory

• Set up a third example where the size of the object in the array can be controlled
  • Idea is to control the rate at which the cache fills up, and therefore how often main memory must be accessed
  • A few small objects – all in cache
  • Lots of large objects – need to update the cache from main memory
  • Not using threads to begin with!

SpeedTest3.cpp:

  // Could have used templates, but want to keep it simple and avoid
  // introducing anything where we lose control over the code generated
  // by the compiler
  #define SIZE 10

  class Data {
  private:
      double data[SIZE];   // array of padding doubles (8 bytes each)
      double result;       // 64 bits / 8 bytes
      double input;        // 64 bits / 8 bytes

  public:
      void init() {
          for( int i = 0; i < SIZE; i++ ) {
              data[i] = rand();
          }
      }

      void process( double d ) {
          result = 2*d*d + d;
      }
  };
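The object size is therefore set by SIZE: each Data object holds SIZE padding doubles plus two more doubles, so with SIZE = 10 it is (10 + 2) x 8 = 96 bytes (possibly plus alignment padding). A quick way to confirm this on a given compiler – a standalone sketch, not part of SpeedTest3.cpp:

  #include <iostream>

  #define SIZE 10

  // Same layout as the Data class above - used here only to confirm
  // how SIZE maps onto the object size reported by the compiler.
  class Data {
      double data[SIZE];
      double result;
      double input;
  };

  int main() {
      // Expect (SIZE + 2) * sizeof(double) = 96 bytes for SIZE = 10.
      // The 24 / 96 / 816 byte cases on the later graphs are consistent
      // with SIZE = 1, 10 and 100 respectively.
      std::cout << "sizeof(Data) = " << sizeof(Data) << " bytes" << std::endl;
      return 0;
  }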
SO WHAT'S GOING ON?

• Array [N] of Data objects – with a variable-size internal array
  • Controls the object size

• Code calculates the average time to process each object in the array (time/N)
  • Should in theory be constant?

• Looking at the time for different array sizes (N)
  • And different object sizes

• Larger object size should fill the cache sooner, i.e. for smaller N

SpeedTest3.cpp:

  void doRun( int N, ofstream &f ) {
      Data *data = new Data[N];

      srand(1);
      double arg = rand();
      for( int i = 0; i < N; i++ )
          data[i].init();

      auto start = chrono::high_resolution_clock::now();
      #pragma omp parallel for schedule(static)
      for( int i = 0; i < N; i++ ) {
          data[i].process( arg );
      }
      auto end = chrono::high_resolution_clock::now();

      double duration = (double)chrono::duration_cast<chrono::nanoseconds>(end - start).count() / N;

      f << N << "\t" << duration << endl;
      cout << N << "\t" << duration << endl;

      delete [] data;
  }
SO WHAT'S GOING ON?

[Figure: average time per object (ns, over 100 loops) vs N on a log x-axis, for object sizes of 24, 96 and 816 bytes; markers show where the L1 and L2 cache limits fall]

• Bullets as on the previous slide – same test, now looking at where the timing changes

• Marked points are calculated as (L3 cache size) / (object size) – i.e. the predicted value of N at which the cache fills up
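To make that prediction concrete, here is a small, hedged sketch of the calculation (the 8 MB L3 size is an assumption for illustration only – use the actual cache sizes of the machine running the tests):

  #include <iostream>

  int main() {
      // Assumed cache size for illustration - not taken from the slides.
      const double l3_bytes = 8.0 * 1024 * 1024;   // e.g. an 8 MB L3 cache

      // Object sizes used in the SpeedTest3 experiments (from the graphs).
      const int object_sizes[] = { 24, 96, 816 };

      for( int size : object_sizes ) {
          // Predicted N at which the array no longer fits in L3,
          // i.e. where the average time per object starts to rise.
          double n_fill = l3_bytes / size;
          std::cout << size << " byte objects: cache predicted to fill at N ~ "
                    << n_fill << std::endl;
      }
      return 0;
  }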
SO WHAT'S GOING ON?

[Figure: same plot as the previous slide – average time per object (ns, over 100 loops) vs N, for 24, 96 and 816 byte objects, with the predicted cache-fill points marked]

• Significant rise in the average time for one loop iteration at larger values of N

• The marked points correlate with the predicted cache filling, although the rise happens at smaller values of N (there are other things in the cache too!)

• A step shape appears because of the log scale: after the cache-fill point the cache is constantly refilling (e.g. if the fill point is 1x10^5, the next plotted N is ~2x10^5, so it fills again, and so on)

• Larger object sizes mean more refilling for a given N, so the total time to complete the loops increases, and therefore the average time per iteration is higher

• Although the L3 fill steps are quite obvious, L1/L2 effects are more difficult to see on these scales

• Points calculated as (L3 cache size) / (object size) – the predicted value of N where the cache fills up
NOW WITH THREADS

• Repeat the same example, but ask OpenMP to use multiple threads

• Does this affect cache filling / refresh rate?

SpeedTest3.cpp (same doRun() as before):

  void doRun( int N, ofstream &f ) {
      Data *data = new Data[N];

      srand(1);
      double arg = rand();
      for( int i = 0; i < N; i++ )
          data[i].init();

      auto start = chrono::high_resolution_clock::now();
      #pragma omp parallel for schedule(static)
      for( int i = 0; i < N; i++ ) {
          data[i].process( arg );
      }
      auto end = chrono::high_resolution_clock::now();

      double duration = (double)chrono::duration_cast<chrono::nanoseconds>(end - start).count() / N;

      f << N << "\t" << duration << endl;
      cout << N << "\t" << duration << endl;

      delete [] data;
  }
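The thread count for these runs can be controlled in a few standard OpenMP ways; a brief sketch (independent of SpeedTest3.cpp, with an assumed helper runWithThreads()) of the options used when repeating the measurement for 1, 2 and 4 cores:

  #include <omp.h>

  void runWithThreads( int nThreads ) {
      // Option 1: set the default team size for subsequent parallel regions.
      omp_set_num_threads( nThreads );

      // Option 2: request a team size for this loop only, via the clause.
      #pragma omp parallel for schedule(static) num_threads(nThreads)
      for( int i = 0; i < 1000; i++ ) {
          // ... work for iteration i ...
      }
  }
  // Option 3 (no code change): set the OMP_NUM_THREADS environment
  // variable before launching the program.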
SO WHAT'S GOING ON?

[Figure: average time per object (ns) vs N, zoomed into N = 1e3 to 1e6, for the 96-byte structure; curves for 1, 2 and 4 cores, with the L2 and L3 cache limits marked]

• Bullets as on the previous slides – same test, now repeated with 1, 2 and 4 cores
SO WHAT'S GOING ON?

[Figure: same zoomed plot for the 96-byte structure – 1, 2 and 4 cores, with the L2 and L3 cache limits marked]

• Comparison when using multiple cores, zoomed in on the relevant region

• Overheads are again obvious for the 2 and 4 core options

• The L3 rise is less clear, but the step is in about the same position for all curves

• Some distinctive spikes near the L2 limit, although it is hard to conclusively link them to this

• The cache is filling at about the same point – N is split between the cores
SCHEDULING

[Figure: loop index i (0–120) plotted against its position in the processing sequence, for the loop run without OpenMP – a straight diagonal line]

• We are currently using static work scheduling, without a chunk specification

• What does this mean?
  • To find out, I have plotted the value of i in the order that it was used
  • i.e. in which order are the filters processed?
  • For one thread this will always be sequential – 0, 1, 2, 3, 4, …
  • For parallel threads it will depend on how the work is allocated to the threads
  • A sketch of how this processing order can be recorded is shown below

• The figure above is how the loop is processed without OpenMP – each value of i is taken in order

• With static scheduling, the OpenMP implementation does something like giving values 0–24 to thread 0, 25–49 to thread 1, etc.
  • This means i = 0 and i = 25 will be processed at about the same time
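The slides do not show the instrumentation used to produce these plots; a minimal, hedged sketch of one way to record the order in which iterations are picked up (this is an illustration, not the code used for the figures):

  #include <omp.h>
  #include <cstdio>
  #include <vector>

  int main() {
      const int N = 100;
      std::vector<int> order;       // i values, in the order they were processed
      order.reserve( N );

      #pragma omp parallel for schedule(static) num_threads(4)
      for( int i = 0; i < N; i++ ) {
          // ... the real work for iteration i would go here ...

          // Serialise access to the shared vector so the logging itself
          // does not introduce a data race.
          #pragma omp critical
          order.push_back( i );
      }

      // Print sequence position vs i, as plotted on the slides.
      for( int s = 0; s < (int)order.size(); s++ )
          std::printf( "%d\t%d\n", s, order[s] );
      return 0;
  }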
SCHEDULING

[Figure: loop index i vs processing sequence for schedule(static), with 1, 2, 3 and 4 threads]

• Results with: for schedule(static)

• Reduced number of loop iterations (100) for clarity

• With only one thread, the loop is processed in sequence

• With more than one thread, each thread takes a contiguous section of the loop indices
  • E.g. for two threads:
    • Thread 1 takes 0–49
    • Thread 2 takes 50–99
SCHEDULE

[Figure: loop index i vs processing sequence for ordered schedule(dynamic), with 1, 2, 3 and 4 threads]

• Results with: for ordered schedule(dynamic)

• Now each thread takes the next available index
  • Ordered – try to make sure each loop index (i) is processed in order
  • Dynamic – try to dynamically balance the work between threads

• The loop gets processed more or less in order

• There is a bit of noise at the start, when the threads take their first index

A sketch of the corresponding pragma variants is shown below.
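For reference, these schedule clauses (and the chunk sizes compared on the next slide) are standard OpenMP; a brief sketch of the variants, with a hypothetical work() function standing in for the loop body:

  #include <cmath>

  // Hypothetical work() - used only to give the loops a body.
  inline void work( int i ) { volatile double x = std::sqrt( (double)i ); (void)x; }

  void scheduleVariants( int N ) {
      // Static, no chunk: iterations split into one contiguous block per thread.
      #pragma omp parallel for schedule(static)
      for( int i = 0; i < N; i++ ) work( i );

      // Static with a chunk size of 5: blocks of 5 iterations dealt out round-robin.
      #pragma omp parallel for schedule(static, 5)
      for( int i = 0; i < N; i++ ) work( i );

      // Dynamic: each thread grabs the next available chunk (default chunk size 1).
      #pragma omp parallel for schedule(dynamic)
      for( int i = 0; i < N; i++ ) work( i );

      // Ordered + dynamic: dynamic assignment, but any "omp ordered" block
      // inside the loop executes in the original iteration order.
      #pragma omp parallel for ordered schedule(dynamic)
      for( int i = 0; i < N; i++ ) {
          work( i );
          #pragma omp ordered
          { /* code here runs in loop order, e.g. writing results out */ }
      }
  }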
SCHEDULE COMPARISON – 4 THREADS

[Figure: two plots of loop index i vs processing sequence with 4 threads – left: Static, Ordered Static, Static,5 and Static,10; right: Dynamic, Ordered Dynamic, Dynamic,5 and Dynamic,10]
TIMING

[Figure: total time (s) vs N on log-log axes, comparing Static, Ordered Static, Static,5, Static,10, Dynamic, Ordered Dynamic, Dynamic,5 and Dynamic,10]

• The main observation here is the difference between the static and dynamic options

• Static means that the division of labour between threads is fixed by a simple rule before the loop runs
  • This plan is just followed at runtime

• Dynamic means that the program makes its own decision at runtime about how to assign work to threads
  • This decision takes effort

• In this example, each loop iteration takes the same time – it is the same code
  • So there is no benefit to dynamic scheduling – it costs more time, but cannot save any, because changing the thread work allocation cannot change the outcome
  • If some threads got "easier" work and finished early, dynamic scheduling should allow them to dynamically pick up the work of others – see the sketch below
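To illustrate that last point, here is a small, hedged sketch (not from the module code; expensive() is a hypothetical workload) of a loop where iterations take very different times. Here schedule(dynamic) would typically help, because threads that finish their cheap iterations early can pick up the expensive ones still waiting:

  #include <cmath>
  #include <omp.h>

  // Hypothetical unbalanced workload: the cost grows with i, so a static
  // split gives the thread holding the largest i values far more work.
  double expensive( int i ) {
      double sum = 0.0;
      for( int k = 0; k < i * 1000; k++ )
          sum += std::sin( (double)k );
      return sum;
  }

  void unbalancedLoop( int N, double *results ) {
      // With schedule(static) the thread with the largest i values becomes the
      // bottleneck; schedule(dynamic) lets idle threads take the next available
      // iteration instead of sitting and waiting for the others to finish.
      #pragma omp parallel for schedule(dynamic)
      for( int i = 0; i < N; i++ ) {
          results[i] = expensive( i );
      }
  }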