Lecture 05: Performance in Parallel Programming
Vivek Kumar
Computer Science and Engineering
IIIT Delhi
[email protected]
CSE513: Parallel Runtimes for Modern Processors © Vivek Kumar
Last Lecture
Today’s Lecture
● Parallel runtime system for task-scheduling
● Work-sharing
● Work-stealing
The task-based parallel programming model and its underlying runtime system will be referred to throughout this course
Mapping the Linguistic Interface to the Parallel Runtime
● Compiler-based runtimes
o User code is translated to runtime code and then compiled using a native compiler (e.g., gcc)
o Compiler maintenance is a costly affair, and it is not easy to adopt new features from mainstream languages
o Using a standard debugger (e.g., gdb) is not possible, as the line-number information in the symbol table refers to the compiler-generated code and not to the user-written code
o However, the compiler-based approach provides several opportunities for code optimizations
● Library-based runtimes (our focus)
o Removes all the drawbacks of a compiler-based approach
Task-Based Parallel Programming Model
● High productivity due to serial elision
o Removing all async and finish constructs results in a valid sequential program
o Several existing frameworks support this programming model, although their tasking APIs are named differently
● Uses an underlying high-performance parallel runtime system for load balancing of dynamically created asynchronous tasks
Popular options for a simple task-based parallel programming model:

                 Java Fork/Join   Cilk         OpenMP                        HClib[1]       TBB    C++11
Serial elision   No               Yes          Yes                           Yes            No     Yes
Tasking API                       spawn-sync   #pragma omp task / taskwait   async-finish          async-future
Performance      Limited          High         Limited                       High           High   No
[1] http://habanero-rice.github.io/hclib/
Mapping the Linguistic Interface to Library Based Parallel Runtime
Runtime APIs:

#include <runtime-API.h>
main() {
  init_runtime();      // initialize the runtime and associated data structures
  finish {
    async (S1);
    S2;
  }
  finalize_runtime();  // release runtime resources
}
The finish scope itself maps to a pair of runtime calls: start_finish() at the opening of the finish scope and end_finish() at its close:

#include <runtime-API.h>
main() {
  init_runtime();      // initialize the runtime and associated data structures
  start_finish();      // runtime equivalent of starting a finish scope
  async (S1);
  S2;
  end_finish();        // runtime equivalent of closing a finish scope
  finalize_runtime();  // release runtime resources
}
init_runtime() creates the pool of worker threads; each worker spins looking for tasks until shutdown:

volatile bool shutdown = false;

void init_runtime() {
  int size = runtime_pool_size();
  // the master is thread 0, so only size-1 helpers are created
  for(int i=1; i<size; i++) {
    pthread_create(worker_routine);
  }
}

void worker_routine() {
  while( !shutdown ) {
    find_and_execute_task();
  }
}
start_finish() resets the counter that tracks the number of pending tasks in the finish scope:

volatile int finish_counter = 0;

void start_finish() {
  finish_counter = 0; // reset
}

Note: in the case of nested finishes (e.g., Fibonacci), we need a better way to manage finish scopes. Recall that in Fibonacci every fib(n) call created a new finish, which ultimately creates a tree of finishes.
async() registers the new task with the current finish scope and hands it to the runtime. The runtime stores a pointer to the task passed to async(); to ensure the pointer remains valid during task execution, the task is copied to the heap:

void async(task) {
  lock_finish();
  finish_counter++; // concurrent access
  unlock_finish();
  // copy the task to the heap so its pointer stays valid
  void* p = malloc(task_size);
  memcpy(p, task, task_size);
  push_task_to_runtime(&p); // thread-safe
  return;
}

Note: there are better ways to increment the finish counter than doing it inside locks.
end_finish() cannot return until every task in the scope has completed, so instead of idling, the calling worker helps execute tasks:

void end_finish() {
  while(finish_counter != 0) {
    find_and_execute_task();
  }
}

void find_and_execute_task() {
  task = pop_task_from_runtime(); // pop_task_from_runtime is thread-safe
  if(task != NULL) {
    execute_task(task);
    free(task);
    lock_finish();
    finish_counter--;
    unlock_finish();
  }
}

Note: there are better ways to decrement the finish counter than doing it inside locks.
finalize_runtime() signals shutdown and waits for the helper threads to exit:

void finalize_runtime() {
  // all spinning workers will exit worker_routine
  shutdown = true;
  int size = runtime_pool_size();
  // master waits for the helpers to join
  for(int i=1; i<size; i++) {
    pthread_join(thread[i]);
  }
}
How to Store Tasks in the Runtime?
● push_task_to_runtime()
● pop_task_from_runtime()
The data structures used to store tasks in a thread-pool based runtime play a very important role in determining the runtime's scalability and performance
Parallel Runtime for Task Scheduling
● There are several different implementations of parallel runtimes, but at their core almost all of them use either a work-sharing or a work-stealing runtime underneath
● Task-based parallel runtime systems primarily use work-stealing runtimes
Work-Sharing Runtime System
[Figure: a single centralized task pool shared by all workers; W1 and W2 push tasks into the pool while W3 pops from it. Locality? Why?]
Work-Stealing Runtime System
[Figure: each worker (W1, W2, W3) owns a deque; the owner pushes and pops tasks at the head end, while idle workers steal from the tail end. Locality? Why?]
Work-Sharing vs. Work-Stealing
● Work-sharing
o A busy worker redistributes tasks eagerly
o Easy to implement through a global task pool
o Access to the global pool must be synchronized: a scalability bottleneck
● Work-stealing
o A busy worker pays only a small overhead to enable stealing
§ A lock is required for pop and steal only when a single task remains on the deque (feasible only by using atomic operations)
§ Idle workers steal tasks from busy workers
o Distributed task pools
o Better scalability
Supported on a Wide Range of Architectures
● Multiprocessor System-on-Chip
● Supercomputers
Supported/Used by Several Companies/Projects
● Twitter
Reading Materials
● https://doi.org/10.1007/s11227-018-2238-4
● https://gee.cs.oswego.edu/dl/papers/fj.pdf
Next Lecture (#06)
● Introduction to User Level Threads
● Project deliverable-1 will be announced tonight, with a one-week deadline