Lecture 05: Performance in Parallel Programming
Vivek Kumar
Computer Science and Engineering
IIIT Delhi
[email protected]
CSE513: Parallel Runtimes for Modern Processors © Vivek Kumar
Last Lecture
Today’s Lecture
● Parallel runtime system for task-scheduling
● Work-sharing
● Work-stealing
The task-based parallel programming model and its underlying runtime system will be referred to throughout this course
Mapping the Linguistic Interface to the Parallel Runtime
● Compiler-based runtimes
o User code is translated to runtime code and then compiled using a native compiler (e.g., gcc)
o Compiler maintenance is a costly affair, and it is not easy to adopt new features from mainstream languages
o Using a standard debugger (e.g., gdb) is not possible, as the line-number information in the symbol table refers to the compiler-generated code and not to the user-written code
o However, the compiler-based approach provides several opportunities for code optimizations
● Library-based runtimes (our focus)
o Removes all the drawbacks of a compiler-based approach
Task-Based Parallel Programming Model
● High productivity due to serial elision
o Removing all async and finish constructs results in a valid sequential program
o Several existing frameworks support this programming model, although their tasking APIs are named differently
● Uses an underlying high-performance parallel runtime system for load balancing of dynamically created asynchronous tasks
Popular options for a simple task-based parallel programming model:

                 Java Fork/Join   Cilk         OpenMP                        HClib[1]       TBB    C++11
Serial elision   No               Yes          Yes                           Yes            No     Yes
Tasking API                       spawn-sync   #pragma omp task / taskwait   async-finish          async-future
Performance      Limited          High         Limited                       High           High   No
[1] http://habanero-rice.github.io/hclib/
Mapping the Linguistic Interface to Library Based Parallel Runtime
Runtime APIs:

#include <runtime-API.h>
main() {
  init_runtime();      // initialize the runtime and associated data structures
  finish {
    async (S1);
    S2;
  }
  finalize_runtime();  // release runtime resources
}
The finish scope itself maps to a pair of runtime calls: start_finish() at the opening of the finish scope and end_finish() at its close:

#include <runtime-API.h>
main() {
  init_runtime();      // initialize the runtime and associated data structures
  start_finish();      // runtime equivalent of starting a finish scope
  async (S1);
  S2;
  end_finish();        // runtime equivalent of closing a finish scope
  finalize_runtime();  // release runtime resources
}
init_runtime() creates the pool of worker threads; each worker spins looking for tasks until shutdown:

volatile bool shutdown = false;

void init_runtime() {
  int size = runtime_pool_size();
  // the master is thread 0, so only size-1 helpers are created
  for(int i=1; i<size; i++) {
    pthread_create(worker_routine);
  }
}

void worker_routine() {
  while( !shutdown ) {
    find_and_execute_task();
  }
}
start_finish() resets the counter that tracks the number of pending tasks in the finish scope:

volatile int finish_counter = 0;

void start_finish() {
  finish_counter = 0; // reset
}

Note: in the case of nested finishes (e.g., Fibonacci), we need a better way to manage finish scopes. Recall that in Fibonacci every fib(n) call created a new finish, which ultimately creates a tree of finishes.
async() registers the new task with the current finish scope and hands it to the runtime. The runtime stores a pointer to the task passed to async(); to ensure the pointer remains valid during task execution, the task is copied to the heap:

void async(task) {
  lock_finish();
  finish_counter++; // concurrent access
  unlock_finish();
  // copy the task to the heap so its pointer stays valid
  void* p = malloc(task_size);
  memcpy(p, task, task_size);
  push_task_to_runtime(&p); // thread-safe
  return;
}

Note: there are better ways to increment the finish counter than doing it inside locks.
end_finish() cannot return until every task in the scope has completed, so instead of idling, the calling worker helps execute tasks:

void end_finish() {
  while(finish_counter != 0) {
    find_and_execute_task();
  }
}

void find_and_execute_task() {
  task = pop_task_from_runtime(); // pop_task_from_runtime is thread-safe
  if(task != NULL) {
    execute_task(task);
    free(task);
    lock_finish();
    finish_counter--;
    unlock_finish();
  }
}

Note: there are better ways to decrement the finish counter than doing it inside locks.
finalize_runtime() signals shutdown and waits for the helper threads to exit:

void finalize_runtime() {
  // all spinning workers will exit worker_routine
  shutdown = true;
  int size = runtime_pool_size();
  // master waits for the helpers to join
  for(int i=1; i<size; i++) {
    pthread_join(thread[i]);
  }
}
How to Store Tasks in the Runtime?
● push_task_to_runtime()
● pop_task_from_runtime()
The data structures used to store tasks in a thread-pool based runtime play a very important role in determining the runtime's scalability and performance
Parallel Runtime for Task Scheduling
● There are several different implementations of parallel runtimes, but at their core almost all of them use either a work-sharing or a work-stealing runtime underneath
● Task-based parallel runtime systems primarily use work-stealing runtimes
Work-Sharing Runtime System
[Figure: a single centralized task pool shared by all workers; W1 and W2 push tasks into the pool while W3 pops from it. Locality? Why?]
Work-Stealing Runtime System
[Figure: each worker (W1, W2, W3) owns a deque; the owner pushes and pops tasks at the head end, while idle workers steal from the tail end. Locality? Why?]
Work-Sharing vs. Work-Stealing
● Work-sharing
o A busy worker redistributes tasks eagerly
o Easy to implement through a global task pool
o Access to the global pool must be synchronized: a scalability bottleneck
● Work-stealing
o A busy worker pays only a small overhead to enable stealing
§ A lock is required for pop and steal only when a single task remains on the deque (feasible only by using atomic operations)
§ Idle workers steal tasks from busy workers
o Distributed task pools
o Better scalability
Supported on a Wide Range of Architectures
● Multiprocessor System-on-Chip
● Supercomputers
Supported/Used by Several Companies/Projects
● Twitter
Reading Materials
● https://doi.org/10.1007/s11227-018-2238-4
● https://gee.cs.oswego.edu/dl/papers/fj.pdf
Next Lecture (#06)
● Introduction to User Level Threads
● Project deliverable-1 will be announced tonight, with a one-week deadline