Lecture 05: Performance in Parallel Programming
Vivek Kumar
Computer Science and Engineering
IIIT Delhi
[email protected]
CSE513: Parallel Runtimes for Modern Processors © Vivek Kumar

Last Lecture


Today’s Lecture
● Parallel runtime system for task-scheduling
● Work-sharing
● Work-stealing
The task-based parallel programming model and its underlying
runtime system will be referred to throughout this course


Mapping the Linguistic Interface to the Parallel Runtime
● Compiler-based runtimes
o User code is translated to runtime code and then compiled using a
native compiler (e.g., gcc)
o Compiler maintenance is costly, and it is not easy to adopt new
features from mainstream languages
o Using a standard debugger (e.g., gdb) is not possible, as the line-number
information inside the symbol table is w.r.t. the compiler-generated
code and not the user-written code
o However, a compiler-based approach provides several opportunities for
code optimizations and doing smart things
● Library-based runtimes (our focus)
o Avoid all the drawbacks of a compiler-based approach


Task-Based Parallel Programming Model
● High productivity due to serial elision
o Removing all async and finish constructs
results in a valid sequential program
o Several existing frameworks support this
programming model, although the names of
their tasking APIs differ
● Uses an underlying high-performance
parallel runtime system for load balancing
of dynamically created asynchronous tasks

Popular options for simple task-based parallel programming models:

                Java Fork/Join | Cilk       | OpenMP                                  | HClib[1]     | TBB  | C++11
Serial elision  NO             | Yes        | Yes                                     | Yes          | NO   | Yes
Tasking API     —              | spawn-sync | #pragma omp task / #pragma omp taskwait | async-finish | —    | async-future
Performance     Limited        | High       | Limited                                 | High         | High | NO

[1] http://habanero-rice.github.io/hclib/


Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();      // initialize runtime and associated data-structures
  finish {
    async (S1);
    S2;
  }
  finalize_runtime();  // release runtime resources
}


Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();      // initialize runtime and associated data-structures
  start_finish();      // runtime equivalent of starting a finish scope
  async (S1);
  S2;
  end_finish();        // runtime equivalent of closing a finish scope
  finalize_runtime();  // release runtime resources
}


Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();
  start_finish();
  async (S1);
  S2;
  end_finish();
  finalize_runtime();
}

Runtime internals:

volatile bool shutdown = false;

void init_runtime() {
  // the calling (master) thread acts as worker 0; spawn the helpers
  int size = runtime_pool_size();
  for(int i=1; i<size; i++) {
    pthread_create(worker_routine);
  }
}

void worker_routine() {
  while( !shutdown ) {
    find_and_execute_task();
  }
}


Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();
  start_finish();
  async (S1);
  S2;
  end_finish();
  finalize_runtime();
}

Runtime internals:

volatile int finish_counter = 0;

void start_finish() {
  finish_counter = 0; //reset
}

Note: in case of nested finish (e.g., Fibonacci), we need a better way to
manage finish scopes. Recall that in Fibonacci every fib(n) call created a
new finish, which ultimately creates a tree of finishes
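The note above hints that finish scopes nest into a tree. One simple fix, sketched here with illustrative names that are not part of the lecture's runtime, is to keep a stack of per-scope counters instead of one global finish_counter:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: a stack of counters, one per active finish scope.
 * start_finish pushes a fresh counter, each async bumps the innermost one,
 * and end_finish pops once its scope has drained. Single-threaded
 * illustration only -- a real runtime needs synchronization here. */
#define MAX_NESTING 64
static int finish_stack[MAX_NESTING];
static int finish_top = -1;

void start_finish(void) { finish_stack[++finish_top] = 0; }
void on_async(void)     { finish_stack[finish_top]++; }   /* task spawned  */
void on_task_done(void) { finish_stack[finish_top]--; }   /* task finished */

/* returns true (and pops the scope) once all of its tasks have completed */
bool end_finish(void) {
    if (finish_stack[finish_top] != 0)
        return false;  /* a real end_finish would run tasks while waiting */
    finish_top--;
    return true;
}
```

This keeps each scope's bookkeeping independent, so a nested fib-style finish does not disturb its parent's counter.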

Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();
  start_finish();
  async (S1);
  S2;
  end_finish();
  finalize_runtime();
}

Runtime internals:

void async(task) {
  lock_finish();
  finish_counter++; //concurrent access
  unlock_finish();
  // copy task on heap
  void* p = malloc(task_size);
  memcpy(p, task, task_size);
  // push_task_to_runtime is thread-safe
  push_task_to_runtime(p);
  return;
}

Note: the runtime stores a pointer to the task passed in the async. To
ensure the pointer remains valid during task execution, we heap-allocate a
copy of the task and store the pointer to that heap copy.
Note: there are better ways to increment the finish counter than doing it
inside locks.
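As the note says, the lock around the counter update can be avoided. A hedged sketch using C11 atomics (the function names are illustrative, not the lecture's API):

```c
#include <assert.h>
#include <stdatomic.h>

/* The finish counter as a C11 atomic: async and task completion update it
 * with lock-free fetch-and-add/sub, and end_finish simply polls it. */
static atomic_int finish_counter = 0;

void counter_on_async(void) {      /* replaces lock; counter++; unlock */
    atomic_fetch_add_explicit(&finish_counter, 1, memory_order_relaxed);
}

void counter_on_task_done(void) {  /* replaces lock; counter--; unlock */
    atomic_fetch_sub_explicit(&finish_counter, 1, memory_order_release);
}

int finish_pending(void) {         /* polled by end_finish's while loop */
    return atomic_load_explicit(&finish_counter, memory_order_acquire) != 0;
}
```

The release/acquire pairing ensures that a task's side effects are visible to the thread that observes the counter reaching zero.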

Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();
  start_finish();
  async (S1);
  S2;
  end_finish();
  finalize_runtime();
}

Runtime internals:

void end_finish() {
  while(finish_counter != 0) {
    find_and_execute_task();
  }
}

void find_and_execute_task() {
  // pop_task_from_runtime is thread-safe
  void* task = pop_task_from_runtime();
  if(task != NULL) {
    execute_task(task);
    free(task);
    lock_finish();
    finish_counter--;
    unlock_finish();
  }
}

Note: there are better ways to decrement the finish counter than doing it
inside locks.


Mapping the Linguistic Interface to Library Based Parallel Runtime

#include <runtime-API.h>
main() {
  init_runtime();
  start_finish();
  async (S1);
  S2;
  end_finish();
  finalize_runtime();
}

Runtime internals:

void finalize_runtime() {
  // all spinning workers will exit worker_routine
  shutdown = true;
  int size = runtime_pool_size();
  // master waits for helpers to join
  for(int i=1; i<size; i++) {
    pthread_join(thread[i]);
  }
}
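The init_runtime / worker_routine / finalize_runtime trio can be exercised as a tiny standalone sketch. Everything below is illustrative: an atomic increment stands in for find_and_execute_task(), and the worker count is fixed rather than queried from the runtime.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NWORKERS 4

static atomic_bool shutdown_flag = false;
static atomic_int  work_done     = 0;
static pthread_t   thread[NWORKERS];

/* helpers spin until finalize_runtime() flips the shutdown flag */
static void *worker_routine(void *arg) {
    (void)arg;
    while (!atomic_load(&shutdown_flag))
        atomic_fetch_add(&work_done, 1);   /* stand-in for a popped task */
    return NULL;
}

void init_runtime(void) {
    /* the calling (master) thread is worker 0; spawn the helpers */
    for (int i = 1; i < NWORKERS; i++)
        pthread_create(&thread[i], NULL, worker_routine, NULL);
}

void finalize_runtime(void) {
    atomic_store(&shutdown_flag, true);    /* workers exit worker_routine */
    for (int i = 1; i < NWORKERS; i++)     /* master joins the helpers */
        pthread_join(thread[i], NULL);
}
```

Note the shutdown flag is an atomic here rather than the slide's volatile flag, which gives a well-defined cross-thread ordering in standard C.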


How to Store Tasks in the Runtime?
● push_task_to_runtime()
● pop_task_from_runtime()

The data structures used for storing tasks in a thread-pool based
runtime play a very important role in determining the scalability
and performance of the runtime.


Parallel Runtime for Task Scheduling
● There are several different implementations of parallel
runtimes, but at their core almost all of them use either a
work-sharing or a work-stealing runtime underneath
● Task-based parallel runtime systems primarily use a
work-stealing runtime


Work-Sharing Runtime System

[Figure: workers W1, W2, and W3 push and pop tasks through a single shared
global pool. Locality? Why?]
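The figure's single shared pool can be sketched as a mutex-protected linked list. Every push and pop from every worker serializes on the one lock, which is exactly the scalability bottleneck called out in the work-sharing v/s work-stealing comparison. Names here are illustrative:

```c
#include <pthread.h>
#include <stddef.h>

/* Work-sharing's global task pool as a mutex-protected LIFO list.
 * Every worker's push_task_to_runtime/pop_task_from_runtime call
 * serializes on the single pool_lock. */
typedef struct pool_node {
    void *task;
    struct pool_node *next;
} pool_node;

static pool_node *pool_head = NULL;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

void pool_push(pool_node *n) {
    pthread_mutex_lock(&pool_lock);
    n->next = pool_head;
    pool_head = n;
    pthread_mutex_unlock(&pool_lock);
}

pool_node *pool_pop(void) {
    pthread_mutex_lock(&pool_lock);
    pool_node *n = pool_head;
    if (n != NULL)
        pool_head = n->next;
    pthread_mutex_unlock(&pool_lock);
    return n;
}
```

With many workers hammering pool_lock, lock contention grows with core count, regardless of how cheap each push or pop is.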

Work-Stealing Runtime System


[Figure: each worker W1, W2, W3 owns a deque; the owner pushes and pops
tasks at the head end, while idle workers steal from the tail end.
Locality? Why?]
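The two ends of the figure's deque can be shown with a deliberately simplified, single-threaded sketch: the owner pushes and pops at the head (keeping the recently created, cache-warm tasks local), while a thief takes the oldest task from the tail. A real work-stealing deque (e.g., Chase-Lev) does this with atomic operations; this sketch omits all synchronization:

```c
#include <stddef.h>

/* Single-threaded illustration of a work-stealing deque's two ends.
 * head: owner's push/pop end (LIFO, good locality)
 * tail: thieves' steal end (FIFO, oldest tasks first) */
#define DEQUE_CAP 1024
typedef struct {
    void *buf[DEQUE_CAP];
    int head;   /* next free slot at the owner's end   */
    int tail;   /* oldest task, targeted by steals     */
} deque_t;

void dq_init(deque_t *d) { d->head = d->tail = 0; }

void dq_push(deque_t *d, void *task) {   /* owner only */
    d->buf[d->head++ % DEQUE_CAP] = task;
}

void *dq_pop(deque_t *d) {               /* owner only: newest task */
    if (d->head == d->tail) return NULL;
    return d->buf[--d->head % DEQUE_CAP];
}

void *dq_steal(deque_t *d) {             /* thief: oldest task */
    if (d->head == d->tail) return NULL;
    return d->buf[d->tail++ % DEQUE_CAP];
}
```

Popping LIFO keeps the owner working on data still in its cache, while stealing FIFO hands thieves the largest remaining subtrees of work, which is why the two operations target opposite ends.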

Work-Sharing v/s Work-Stealing
● Work-sharing
o A busy worker re-distributes tasks eagerly
o Easy implementation through a global task pool
o Access to the global pool needs to be synchronized: a scalability
bottleneck
● Work-stealing
o A busy worker pays only a small overhead to enable stealing
§ A lock is required for pop and steal only when a single task remains on
the deque (feasible only by using atomic operations)
§ Idle workers steal tasks from busy workers
o Distributed task pools
o Better scalability


Supported on a Wide Range of Architectures
● Multiprocessor System-on-Chip
● Supercomputers

Supported/Used by Several Companies/Projects
● Twitter


Reading Materials
● https://doi.org/10.1007/s11227-018-2238-4
● https://gee.cs.oswego.edu/dl/papers/fj.pdf


Next Lecture (#06)
● Introduction to User Level Threads
● Project deadline-1 will be announced tonight, with a deadline of
one week
