2 - Parallel Programming
• Each filter:
• Needs to know which panels it connects, i.e. its input and output
• Contains the transfer function / equation that links these
• Is probably represented by some sort of data structure or class; you will almost certainly then have a large array of filters
• The point:
• Engineering problems get very large, very quickly.
• Clever ways of storing and processing very large sets of objects / arrays / matrices are needed
MOTIVATION
• Moore’s law
• Number of transistors on an integrated circuit doubles every two years
• Originally driven by the fact that transistors could be made smaller
• Smaller transistors switch faster – clock speeds increase
• Therefore processing speed doubled roughly every two years
I wouldn't trust the actual numbers, but this does illustrate a general trend – in the early 2000s, processor clock speeds started to stabilise at around 3 GHz and haven't really increased that much in the last 20 years.
https://queue.acm.org/detail.cfm?id=2181798
PARALLEL PROGRAMMING
• Idea is easy – just get multiple mini processors (cores) to work on the same program at the same time
• Double the number of cores, half the execution time? 100 cores, 1% of the execution time?
• In practice, parallel computing doesn't work like that:
• Computer programs are lists of instructions
• They are inherently designed to be executed in a sequence – i.e. one instruction usually depends on the results of instructions that came before
• This causes problems – you can't run these instructions in parallel
• Consequences:
• You cannot just compile an existing program to work on multiple cores – the program needs to be written in a way that identifies instructions that can be executed in parallel
• This isn't always possible – some things have to happen in a sequence
• As much of the program as possible must be parallelisable:
• If you must run 50% of it sequentially, you can only get a 50% speed up
• 30% sequentially – 70% speedup, etc.
• This is assuming an infinite number of cores
• Other issues:
• The code that is parallelisable might limit the number of cores that can be used
• There is also the obvious potential hardware limit – how many cores do you actually have access to?
• Multiple cores ≠ multiple computers – they share memory, busses, etc.
• Thought: what if about 50% of a program consists of non-sequential instructions?
• We can divide this 50% up amongst many cores (assume an infinite number for now), but the other 50% is going to have to run on a single core
• What is the speed up? For the parallel part: 2 cores, ½ the time; 10 cores, 1/10 of the time; ∞ cores, 0 time
• For the sequential part – the same time as before; it was 50% of the original program, so it probably took about half of the total time
• An infinite number of cores has only achieved a 50% reduction in execution time!
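This 50/50 reasoning is the general speed-up limit (usually called Amdahl's law – the name is not on the slide, but the argument is the same). If a fraction $p$ of the program can be parallelised across $n$ cores, the overall speedup is

$$\text{speedup} = \frac{1}{(1 - p) + \dfrac{p}{n}} \;\longrightarrow\; \frac{1}{1 - p} \quad \text{as } n \to \infty$$

With $p = 0.5$ the limit is a speedup of 2 (a 50% reduction in execution time); with $p = 0.7$ it is only about 3.3, which is why as much of the program as possible must be parallelisable.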
EXAMPLE APPLICATIONS
Where is parallel computing used?
REALLY OBVIOUS EXAMPLE
• This computer, running Windows
• Multiple applications
• Email – managing live connection to server
• Windows itself, tracking mouse movements, etc
• I might have LTSpice running a simulation in the background
• Here there are multiple independent processes, running their own code/instructions,
with their own data
• Multiple cores / processors can help with this – quite an easy problem to solve – just
assign different cores to different jobs/applications
• This type of problem – different operations with different data – can also occur within a single program, e.g. a program that is doing some engineering calculations
• However, often you have a single problem/process that you want to speed up through parallelisation
• A potential complication here is that each process might need access to the same hardware/peripherals, e.g. RAM, hard drives, which will need managing
COMPUTATIONAL ELECTROMAGNETICS
Computer model of a large 45kV power converter in the Korean electricity network. The aim is to predict currents and electromagnetic fields to determine electromagnetic emissions.
MODEL
• Image shows selected cross
sections of the “mesh”
• Two stages:
• Meshing – generating the triangle cells (218,000)
• Simulation – Using the mesh to generate equations (218,000 simultaneous equations in matrix form) and
then solving for the fields
• They all use the "divide the model into millions of pieces" approach; the maths is just a bit different
• You can usually select between different methods (solvers) in the software
• The way each cube's sample value changes with time depends upon the sample values in the 6 other cubes surrounding it
https://www.geeksforgeeks.org/dsa/gaussian-elimination/
MATRIX-VECTOR PRODUCTS
• Alternative matrix operation: $y = Ax$

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$

(Note: separate processes have their own "address space" – they cannot see variables in other processes.)
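Each output element $y_i = \sum_j A_{ij} x_j$ depends only on row $i$ of $A$ and on $x$, so the rows can be computed independently – this is why matrix-vector products parallelise well. A minimal OpenMP sketch (my illustration, not code from the slides; assumes the matrix is stored row-major in a flat array):

#include <omp.h>

// y = A*x for an n-by-n matrix A stored row-major in a flat array.
// Each row's dot product is independent, so the outer loop can be
// split across the available cores.
void matVec( const double *A, const double *x, double *y, int n ) {
    #pragma omp parallel for
    for( int i = 0; i < n; i++ ) {
        double sum = 0.0;                 // private working variable per iteration
        for( int j = 0; j < n; j++ )
            sum += A[i*n + j] * x[j];
        y[i] = sum;
    }
}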
PARALLEL PROGRAMMING
Hardware and software implementations
PROCESSORS
• Processors – do basic operations on numbers, add, subtract, compare, etc
• They do this by reading instructions (numbers) from RAM, these instructions then tell them to read
other numbers from RAM, do an operation, write results back to RAM
• Contain:
• ALU – arithmetic logic unit
• Registers – temporary storage for working variables
• Cache – a local copy of section of RAM, much faster to access than going back to RAM all the time
• Each instruction in a classic desktop CPU (e.g. an i486 from the 90s) does something simple: add/subtract/compare/etc. some 32-bit numbers.
• One core, following one list of instructions (the program), where each instruction is a relatively simple operation on 32-bit numbers – this is not parallel processing. Known as Single Instruction, Single Data stream (SISD)
• These instructions may be pipelined, but the processor is still following one list of instructions
each dealing with 32-bit numbers (for a 486, but the same applies for the 64-bit standard
operations on one core of a modern Intel Core processor for example)
• Some classic supercomputers had processors designed to operate on large vectors, rather than small 32-bit (or 64-bit) numbers
• E.g. Cray supercomputers
• Still have single instructions, but these act on larger vectors – multiple numbers at the same time
• One add instruction adds a set of numbers to another set, pairwise
• One instruction, multiple sub-operations undertaken by hardware in parallel

$$x = a + b: \quad \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}$$
VECTOR PROCESSING IN DESKTOP CPUS
• This was adopted in desktop CPUs, e.g. Intel processors in the 90’s:
i486 -> Pentium -> Pentium MMX
• Compile vp.cpp with -O0 and -O3:

int main() {
    int v1[64], v2[64], v3[64];
    srand( time(NULL) );
    // (rest of vp.cpp not shown)
}

• Parallel code – split the loop into two parts, give each part to a different core to execute in parallel:

for( i = 0; i < 50; i++ ) {        for( i = 50; i < 100; i++ ) {
    a[i] = b[i] + c[i];                a[i] = b[i] + c[i];
}                                  }

• This won't always be possible – what if each loop iteration depends on the result of the previous iteration? (See the sketch below.)
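As an illustration of that last point (my example, not from the slides), a loop like this has a dependence between iterations – iteration i needs the value written by iteration i-1 – so it cannot simply be split between vector lanes or cores:

// Running sum: each iteration reads the result of the previous one,
// so the iterations cannot safely be executed in parallel as written.
for( i = 1; i < 100; i++ ) {
    a[i] = a[i-1] + b[i];
}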
MULTI-CORE – HOW?
• Need a way of telling the compiler which parts of the program can be parallelised, e.g. certain loops
• Ideal case is that you just tell the compiler what to parallelise, and it will work out the rest
• It will need to know certain information though:
• E.g. a, b, c (the arrays / pointers) are the same in all parallel instances
• Each instance will need its own copy of i – each loop instance will be iterating through
different values of i at the same time
• Just want some additional lines of code that can be added to program to specify these
things
• OpenMP works like this
// Compiler_please_split_next_loop_between_the available_cores
for( i = 0;i < 100; i++ ) {
a[i] = b[i] + c[i];
}
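In OpenMP (introduced below) that pseudo-comment becomes a real directive; a minimal version of the same loop would look something like this (my sketch, matching the later examples):

#pragma omp parallel for shared(a, b, c)   // split the iterations across the available cores
for( int i = 0; i < 100; i++ ) {           // each thread gets its own copy of i
    a[i] = b[i] + c[i];
}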
MULTI-CORE – HOW?
• Another example – not just loops, might want to split a task into sections that can be
done in parallel
• This type of method – adding extra commands to existing code – can allow serial code
to be easily retrofitted for partial parallel execution
main(){
//Compiler_please_use_a_separate_CPU_for_each_of_following_sections
//Section#1
int a = get_a();
//Section#2
int b = get_b();
//“Compiler_wait_here_till_both_sections_completed”
int c = a + b;
}
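Using the OpenMP sections directive shown later, this sketch might be written as follows (my version – it assumes the same hypothetical get_a()/get_b() functions):

main(){
    int a, b;
    #pragma omp parallel sections
    {
        #pragma omp section
        a = get_a();            // runs on one thread

        #pragma omp section
        b = get_b();            // runs on another thread
    }
    // implicit barrier here – both sections have completed
    int c = a + b;
}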
MULTI-CORE – HOW MUCH BETTER?
• Two cores, half the time?

// Compiler_please_split_next_loop_between_the_available_cores
for( i = 0; i < 100; i++ ) {
    a[i] = b[i] + c[i];
}

• Wishful thinking: $\text{time} \propto \dfrac{1}{n_{\text{cores}}}$
[Figure: execution timelines for 1, 2 and 4 cores – the overhead to manage two cores eats into the gain, and the overhead to manage four cores outweighs any advantage for most scenarios]
MULTI-CORE – HOW MUCH BETTER?
• For very large arrays, and/or complex algorithms in the loop, will more cores always be better?
• (Analogy: the library is main memory, the shelf next to your desk is the cache.)
• Problem: your shelf isn't infinitely big, you can't put a whole library there. The solution is to try to optimise trips to the library – make sure the right books are always on the shelf, with the minimum number of library trips.
• Processors work like this – they have cache: fast, but small, memory available on the chip.
CACHE
• Multiple levels – L1 close to the CPU, fast and small; L3 larger but further away and slower
• The L1 cache is also split: one part holds the instructions to execute, the other the data they will work on
• From "Memory part 2: CPU caches" http://lwn.net/Articles/252125/ – these are the numbers Intel lists for a Pentium M, access times in cycles:
• Register: ~1 (on CPU)
• L1 data: ~3 (fast cache)
• L2: ~14 (slow cache)
• Main memory: ~240
• OK, so let's use big caches? Very expensive, and there's a limit to what can fit on the chip
• So use small caches? Can't fit much in, rely on RAM access more often – very slow
• This isn't a parallel-specific issue – the same problems arise in serial code too
• Parallel code does complicate the problem though!
CACHE
[Diagram: each core has its own L1 data cache, L1 instruction cache and L2 cache]
• Performance tab – I have:
• 8 cores
• 512KB L1 – 8 separate 32KB L1 data caches and 8 separate 32KB L1 instruction caches (one pair per core)
• 4.0MB L2 – 8 separate 512KB L2 caches (one per core)
• 8.0MB L3 – one 8MB L3 cache (shared between the cores)
https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+7+PRO+4750U&id=3740
[Diagram: two cores, each with its own L1 data, L1 instruction and L2 caches, sharing the L3 cache and main memory – both cores hold a copy of 'a']

//Section#1
result1 = function1();
a += result1;

• The code for section 1 and section 2 will retrieve the 'a' variable and modify it
• Both sections might have 'a' loaded into their core-specific cache
• One section will probably get to the line where it modifies 'a' first – the value for 'a' in the other cache is now invalid
• If the second section were to use its cached value for 'a', write its result, and save this back to main memory, there will be an error
[Diagram: shared memory model – multiple CPUs connected to a single memory]
• Advantages of the shared memory model:
• Easy to retrofit serial code
• Avoids copies of data
• Disadvantages:
• Need a compiler that allows extra directives to be added to the code and supports them
• All of the problems just discussed with shared memory
• Large scale parallel computing isn't "8 cores + some RAM"…
DISTRIBUTED MEMORY MODEL
[Diagram: shared memory – several CPUs attached to one memory; distributed memory – a collection of nodes, each CPU with its own local memory, connected by a network]
• Shared memory: limited to single machines, typically with 1 or 2 processors
• This might limit you to two 8-core processors
• Might not be enough for large scale computing
• Distributed memory model: a collection of nodes, each CPU with its own local memory
• Different CPUs take different routes through the same program, sending & receiving messages along the way
• No shared memory – data sharing must be achieved using explicit message passing
• Synchronisation is still an issue
• The size of messages is an issue – sending data across a network is much slower than even main memory access

main() { // Launched on all CPUs simultaneously
    int myID = getMyID();
    int a, b, c;

    if( myID == CPU_1 ) {
        a = get_a();
    } else if( myID == CPU_2 ) {
        b = get_b();
        sendMessage( CPU_1, b );
    }

    if( myID == CPU_1 ) {
        b = waitForMessage( CPU_2 );
        c = a + b;
    }
}

• Advantages:
• A lot of flexibility for different platforms
• A lot of processor scalability – 10's to 1000's of cores
• A lot of memory scalability – but it's not all in the same place
• Disadvantages:
• Software must be specifically designed to get good results – to maximise usage of available cores and minimise network traffic
• As a result, much harder to retrofit existing serial code
• One library for this is MPI – the Message Passing Interface; see Advanced Computing in 4th year.
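For reference, the pseudocode above maps fairly directly onto real MPI calls; a minimal sketch (my illustration – get_a()/get_b() are the same hypothetical functions as in the pseudocode):

#include <mpi.h>

int get_a();   // hypothetical, as in the pseudocode above
int get_b();   // hypothetical, as in the pseudocode above

int main( int argc, char **argv ) {
    MPI_Init( &argc, &argv );                   // launched on all processes simultaneously

    int myID;
    MPI_Comm_rank( MPI_COMM_WORLD, &myID );     // which process am I?

    int a = 0, b = 0, c = 0;
    if( myID == 1 ) {
        b = get_b();
        MPI_Send( &b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );      // send b to process 0
    } else if( myID == 0 ) {
        a = get_a();
        MPI_Recv( &b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );  // wait for b
        c = a + b;
    }

    MPI_Finalize();
    return 0;
}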
FINAL THOUGHTS ON HARDWARE/SOFTWARE
• Optimising for speed/memory use cannot be done without consideration of the hardware!
• "Just using more cores" – will this actually help, or will other memory issues get in the way?
• How is the cache structured, etc.
OPENMP
• Thread based – shared memory
[Diagram: multiple CPUs sharing a single memory]
• See
• http://openmp.org
• https://computing.llnl.gov/?set=training&page=index#training_materials
MOTIVATION
• Threads – lightweight branches of code that appear, run in parallel, then disappear
• Master thread and worker threads that exist for finite time
• A fork-join model
EXAMPLE
• #pragma is a compiler directive

main () {
    int var1, var2, var3;

    // Some serial code
    // (rest of the example not shown)

https://www.openmp.org/spec-html/5.0/openmpse14.html
PRIVATE VS SHARED

#pragma omp parallel \              // directive-name = parallel
    private (var1, var2, …) \       // clauses
    shared (var1, var2, …) \
    firstprivate(var1, var2, …) \
    lastprivate(var1, var2, …) \
    if(scalar expression) \
    default(shared|none) \
{
    // a structured block of code which each
    // thread of a team executes
}

• private means each worker thread gets its own copy of the variables listed
• Useful for working variables that will have a different value in each thread
• Any variable defined inside the omp block is private by default
• Remember to initialise any private variables in the thread block – each thread has a new variable with this name
• shared means each worker thread shares the variables listed
• Useful for variables that hold a fixed value that each thread must use in calculations
• Any variable defined outside the thread block is shared by default (you need to override this to make it private)
• Need to be careful – if one thread changes a shared variable, it will affect the other threads, and there is no way of knowing the relative timing of the threads (see the example below)

https://www.openmp.org/spec-html/5.2/openmpsu37.html
https://www.openmp.org/spec-html/5.2/openmpsu36.html
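A small illustration of that last warning (my example, not from the slides): if every thread updates the same shared variable without any protection, the result depends on the relative timing of the threads and is usually wrong.

int total = 0;                  // declared outside the block, so shared by default
#pragma omp parallel
{
    // Every thread executes this line. The read-modify-write on 'total'
    // can interleave between threads, so the final value is unpredictable:
    // with 4 threads it may end up anywhere between 1 and 4.
    total = total + 1;
}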
OMP PARALLEL
• firstprivate
• private variables are usually not initialised, but firstprivate variables take their initial value from the value they had in the preceding code
• lastprivate
• private variables usually lose their value at the end of the parallel code; lastprivate variables copy their final value back to the same variable in the following code
• The value that is seen afterwards is that of the sequentially last loop iteration (or lexically last section) – the one whose value is copied back last

#pragma omp parallel \              // directive-name = parallel
    private (var1, var2, …) \       // clauses
    shared (var1, var2, …) \
    firstprivate(var1, var2, …) \
    lastprivate(var1, var2, …) \
    if(scalar expression) \
    default(shared|none) \
{
    // a structured block of code which each
    // thread of a team executes
}

https://www.openmp.org/spec-html/5.2/openmpsu38.html
https://www.openmp.org/spec-html/5.2/openmpsu39.html
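A short sketch of the difference (my example, not from the slides):

int x = 10, y = 0;

#pragma omp parallel for firstprivate(x) lastprivate(y)
for( int i = 0; i < 100; i++ ) {
    // firstprivate: each thread starts with its own x, initialised to 10
    // lastprivate:  each thread has its own y inside the loop...
    y = x + i;
}
// ...but the value from the sequentially last iteration (i = 99) is copied
// back, so here y == 109 after the loop.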
OMP PARALLEL
• The default clause overrides the default data-sharing rules for variables referenced in the parallel code
• default(shared) means that all such variables will be shared across threads by default
• default(none) means there is no shared/private default setting – you must specify private/shared explicitly for every variable used
• default(firstprivate) is also possible

#pragma omp parallel \              // directive-name = parallel
    private (var1, var2, …) \       // clauses
    shared (var1, var2, …) \
    firstprivate(var1, var2, …) \
    lastprivate(var1, var2, …) \
    if(scalar expression) \
    default(shared|none) \
{
    // a structured block of code which each
    // thread of a team executes
}

https://www.openmp.org/spec-html/5.2/openmpsu35.html
HOW MANY THREADS?
• The number of threads can be controlled by function calls from within the program:
• omp_set_num_threads( int num )    e.g. omp_set_num_threads( 3 );
• omp_get_num_threads()
• Or by setting the default environment variable on the command prompt:
• OMP_NUM_THREADS
• Each thread has a number from 0 (master) to N-1
• Calling omp_get_thread_num() from within parallel code retrieves the ID of the current thread
• If not specified, the default number of threads is the number of cores on the platform
EXAMPLE
OMPTest1.cpp

#include <iostream>
#include <omp.h>

using namespace std;

main () {
    int nthreads, myid;

    // Don't be stupid here, e.g. on a quad core platform, don't set to 12!
    omp_set_num_threads(4);

    #pragma omp parallel private(myid) default(shared) // Each thread has own myid variable
    {
        myid = omp_get_thread_num(); // Obtain and print thread id
        cout << "I am thread number: " << myid << endl;
    }
}

BASIC EXAMPLE

float func(int i ) {
    // do something
}
main () {
    int i;
    float a[10000], b[10000], c[10000];

    for( i = 0; i < 10000; i++ ) {
        // Just initialising the data here
        a[i] = b[i] = float(i);
    }

    #pragma omp parallel shared(a,b,c) private(i)
    {
        #pragma omp for schedule(static)
        for( i = 0; i < N; i++ ) {      // N presumably defined elsewhere as the array size
            c[i] = a[i] + b[i];
        }
    }

    #pragma omp parallel shared(a,b,c) private(i)
    {
        #pragma omp for schedule(dynamic,100)
        for( i = 0; i < N; i++ ) {
            c[i] = func(i);
        }
    }
}

• A loop in serial code sets up the arrays
• Then there are two loops which both run across parallel threads
• Different scheduling options are specified for each loop
• Static scheduling takes less effort to manage – if we know each loop iteration takes the same amount of effort, just divide up the work equally
• Dynamic scheduling adjusts the work load for each thread to maximise performance
• We don't know what func(i) does – it might take a different amount of time depending on the value of i
• In this case, it might be better to let the program work out how to distribute the work across threads
SECTIONS DIRECTIVE
• Each parallel section is run on a separate thread
• Allows for functional decomposition
• Implicit barrier at the end of the sections construct. Use the nowait clause to suppress this

#pragma omp sections [clauses…]
{
    #pragma omp section
    {
        // Code block
    }
    #pragma omp section
    {
        // Another code block
    }
}
DEMONSTRATION
TESTING PERFORMANCE
What difference does it make?
SpeedTest1.cpp

void doRun( int N, ofstream &f ) {
    double result = 0.;
    double *a = new double[N];

    // (timed section not shown – see the sketch below)

    f << N
      << "\t"
      << duration
      << endl;

    delete [] a;
}

main (){
    omp_set_num_threads(8);
    ofstream timing( "timing.txt" );

    for( int i = 10; i < 100000000; i*=1.2 ) {
        doRun( i, timing );
    }

    timing.close();
    cout << "done" << endl;
}

• The large array is dynamically allocated using new
• If you statically allocate it (i.e. double a[N]), it goes on the stack. There is a maximum stack size (typically a few MB); if you exceed this, the program will crash.
• new (or malloc() in C) allocates on the heap. You can request as much as you like on the heap – the limit is the address space: if you request 1GB, there needs to be a contiguous 1GB block of empty memory addresses. This is not usually a problem with 64-bit programs, but you can easily hit it if compiled 32-bit.
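The timed section of doRun() isn't visible in the extract; based on the later discussion of rand() and the per-iteration cost, it presumably looks something like this (my reconstruction – the loop body and the timing code are assumptions):

#include <chrono>
using namespace std::chrono;

// Inside doRun(), between the allocation and the file output:
auto start = high_resolution_clock::now();

#pragma omp parallel for
for( int i = 0; i < N; i++ ) {
    a[i] = rand();                        // rand() is relatively demanding per iteration
}

auto duration = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();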
RESULTS
[Chart: time (ns) vs N, log-log scale, for no OpenMP and OpenMP with 2, 4, 8 and 16 threads – annotations mark the overhead at small N and the speedup at large N]
• You don't get the theoretical speed up with larger numbers of cores
• OpenMP adds additional overhead – creating and recombining the threads
• Slower when N is small
• The absolute value of this overhead is relatively small (around 0.1ms)
• But each loop iteration is taking around 1/10 of a microsecond
• 100,000,000 loops – the loop time dominates
• 10 or 100 loops – the overhead is significantly larger than the time taken to complete the loops
• Worth remembering that this is a very simple example – it's just a loop that calls a function
• Also – the rand() function is quite demanding; each loop iteration takes some effort. This means parallelising is likely to help
BIQUAD FILTER
SpeedTest2.cpp
• Back to the example from previous weeks
• Using a modified class – some extra functions added to support these tests
• A default constructor BiQuad() – because we need to create uninitialised BiQuads
• A setCoeffs() function – this allows the internal filter coefficients of an uninitialised BiQuad to be updated at a later time
• The computational effort in evaluating each filter is relatively low – compared to rand() anyway

template<typename T>
class BiQuad {
private:
    T u2, u1, u0;     // three coefficients for the numerator
    T v2, v1;         // two coefficients for the denominator
    T vin1, vin2;     // past values for vin
    T vout1, vout2;   // past values for vout
    T result;
public:
    // constructor that sets up the coefficient values
    BiQuad( T _u0, T _u1, T _u2, T _v1, T _v2 )
        : u0(_u0), u1(_u1), u2(_u2), v1(_v1), v2(_v2), vin1(0), vin2(0), vout1(0), vout2(0) {}

    // default constructor
    BiQuad() : u0(0), u1(0), u2(0), v1(0), v2(0), vin1(0), vin2(0), vout1(0), vout2(0) {}

    void setCoeffs( T _u0, T _u1, T _u2, T _v1, T _v2 ) {
        u0 = _u0;
        u1 = _u1;
        u2 = _u2;
        v1 = _v1;
        v2 = _v2;
    }

    void compute( T vin ) {
        result = u0*vin + u1*vin1 + u2*vin2 - v1*vout1 - v2*vout2;
        vin2 = vin1; vin1 = vin;
        vout2 = vout1; vout1 = result;
    }

    T operator() ( T vin ) {
        // Compute output
        T vout = u0*vin + u1*vin1 + u2*vin2 - v1*vout1 - v2*vout2;
        // Save previous values
        vin2 = vin1; vin1 = vin;
        vout2 = vout1; vout1 = vout;
        // return output
        return vout;
    }
};
BIQUAD EXAMPLE
SpeedTest2.cpp

void doRun( int N, ofstream &f ) {
    BiQuad<double> *filters = new BiQuad<double>[N];
    double *results = new double[N];
    // (rest of doRun not shown – see the sketch below)
}

• The main() function just opens a file for writing, loops through a range of values for N, and calls doRun() for each:

for( int i = 10; i < 10000000; i*=1.2 ) {
    doRun( i, timing );
}

timing.close();
cout << "done" << endl;
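The middle of doRun() isn't visible in the extract; given the class above and the earlier SpeedTest1 code, it presumably sets up the filter coefficients and then times a loop over the filter array, roughly like this (my reconstruction – the coefficient values, input value and timing details are assumptions):

// Sketch of the missing middle of doRun():
for( int i = 0; i < N; i++ )
    filters[i].setCoeffs( 1.0, 0.5, 0.25, 0.1, 0.05 );   // arbitrary example coefficients

auto start = chrono::high_resolution_clock::now();

#pragma omp parallel for
for( int i = 0; i < N; i++ ) {
    results[i] = filters[i]( 1.0 );                      // one filter update per array element
}

auto duration = chrono::duration_cast<chrono::nanoseconds>( chrono::high_resolution_clock::now() - start ).count();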
BIQUAD ARRAY EXAMPLE
SpeedTest2.cpp
[Chart: time (ns) vs N, linear scale, for no OpenMP and OpenMP with 2, 4 and 8 threads]
• Two cores help; further cores do not

[Chart: time (ns) vs N, log-log scale, for the same runs]
• Extra threads "getting in each other's way"?
• The "ripples" are also present again
SO WHAT'S GOING ON?
• Why does the BiQuad example with the array behave differently to the basic loop example?
• Why does a larger number of threads actually seem to cause a slight slow down?
• Possibly something to do with memory access?
• In the initial example there was just an array of doubles, whereas in the second example there is an array of doubles and an array of objects – logically this must be more demanding in terms of memory
• Set up a third example where the size of the object in the arrays can be controlled
• The idea is to control the rate at which the cache fills up, and therefore how often main memory must be accessed
• A few small objects – all in cache
• Lots of large objects – need to update the cache from main memory
• Not using threads to begin with!

SpeedTest3.cpp

// could have used templates, but want to keep it simple and avoid
// introducing anything where we lose control over code generated
// by compiler
#define SIZE 10

class Data {
private:
    double data[SIZE];   // array of padding doubles (8 bytes each)
    double result;       // 64 bits / 8 bytes
    double input;        // 64 bits / 8 bytes
public:
    void init() {
        for( int i = 0; i < SIZE; i++ ) {
            data[i] = rand();
        }
    }
    void process( double d ) {
        result = 2*d*d + d;
    }
};
SO WHAT'S GOING ON?

void doRun( int N, ofstream &f ) {
    Data *data = new Data[N];
    srand(1);
    double arg = rand();

    for( int i = 0; i < N; i++ )
        data[i].init();
    // ...
}

• Array [N] of Data objects – with a variable size array inside each object (SIZE controls the object size)
• The code calculates the average time to process each point in the array (time/N)
• Should in theory be constant?
• Looking at the time for different array sizes (N)
• And different object sizes
• A larger object size should fill the cache sooner, i.e. for smaller N

[Chart: average time per object vs N, log scale, for a range of object sizes]
• A step shape appears because of the log scale – after the cache-fill point the cache is constantly refilling (e.g. if the fill point is 1x10^5, the next point on the graph is 2x10^5, so it will fill again, etc.)
• Larger object sizes mean more refilling for a given N, so the total time to complete the loops increases, and therefore a higher average computation time
• Although the L3 fill steps are quite obvious, it is more difficult to see L1/L2 on these scales
• The marked points are calculated as (L3 cache size) / (object size) – this predicts the value of N where the cache fills (there are other things in the cache too!)
NOW WITH THREADS

void doRun( int N, ofstream &f ) {
    Data *data = new Data[N];
    srand(1);
    double arg = rand();
    // ...
    f << N
      << "\t"
      << duration
      << endl;
    cout << N
         << "\t"
         << duration
         << endl;
    delete [] data;
}

[Chart: average time per object vs N (log scale) for 1, 2 and 4 cores, 96 byte structure – L2 and L3 cache limits marked]
• The code calculates the average time to process each object in the array (time/N)
• Should in theory be constant?
• Looking at the time for different array sizes (N) and different object sizes
• A larger object size should fill the cache sooner, i.e. for smaller N
• The L3 rise is less clear, but the step is in about the same position for all curves
• There are some distinctive spikes near the L2 limit, although it is hard to conclusively link them to this
• The cache is filling at about the same point – N is split between the cores
SCHEDULING
[Chart: i vs sequence, no OpenMP – each value of i is taken in order]
• For parallel threads, it will depend how the work is allocated to the threads
• The image on the right shows how the loop is processed without OpenMP – each value of i is taken in order
• With static scheduling, the compiler decides to do something like give values 0-24 to thread 0, 25-49 to thread 1, etc.
• This means i=0 and i=25 will be processed at about the same time
SCHEDULING
[Chart: i vs sequence for no OpenMP and OpenMP with 2, 3 and 4 threads, using "for schedule(static)"]
• Results with: for schedule(static)
• Reduced number of loop iterations (100) for clarity
• With only one thread, the loop is processed in sequence
• With more than one thread, each thread takes a section of the loop indices
• E.g. for two threads: thread 1 takes 0-49, thread 2 takes 50-99
SCHEDULE
[Chart: i vs sequence for no OpenMP and OpenMP with 2, 3 and 4 threads, using "for ordered schedule(dynamic)"]
• Results with: for ordered schedule(dynamic)
• Loop iterations are handed out to the threads dynamically, so individual values of i are spread between threads.
SCHEDULE COMPARISON – 4 THREADS
[Charts: i vs sequence for 4 threads. Left: Static, Ordered Static, Static,5 and Static,10. Right: Dynamic, Ordered Dynamic, Dynamic,5 and Dynamic,10]
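For reference, the scheduling variants being compared correspond to schedule clauses like the ones below (my sketch of the kind of loop presumably used to generate these plots – record() is a hypothetical helper that logs which thread ran each iteration):

// Each run uses a different schedule clause on the same loop, e.g.:
//   schedule(static)      – contiguous blocks of iterations, decided up front
//   schedule(static, 5)   – round-robin chunks of 5 iterations
//   schedule(dynamic)     – iterations handed out one at a time as threads become free
//   schedule(dynamic, 10) – chunks of 10 iterations handed out on demand
#pragma omp parallel for schedule(static, 5)
for( int i = 0; i < 100; i++ ) {
    record( i, omp_get_thread_num() );   // hypothetical logging helper
}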
TIMING
[Chart: time (s) vs N, log-log scale, for Static, Ordered Static, Static,5, Static,10, Dynamic, Ordered Dynamic, Dynamic,5 and Dynamic,10]
• The main observation here is the difference between the static and dynamic options
• With dynamic scheduling, the runtime has to make a decision about how to assign work to threads
• This decision takes effort
• In this example, each loop iteration will take the same time – it's the same code
• There is no benefit to dynamic scheduling – it costs more time, but doesn't have the potential to save any, because changing the thread work allocation cannot change the outcome
• If we had some threads that got "easier" work and then finished early, dynamic scheduling should allow them to dynamically pick up the work of the others