ADVANCED
PROGRAMMING
TECHNIQUES
VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY
LECTURE 6: Parallel Programming (cont.)
CONTENTS
> Sequential matrix multiplication
> Parallelization of matrix multiplication
> Optimizing parallel matrix multiplication
> Sequential quicksort
> Parallel quicksort
Sequential Matrix Multiplication
> Numerous real-life applications
▪ Systems of linear equations, PCA, PageRank, DNA sequencing, etc.
> Sequential algorithm
▪ Computational complexity: Θ(n^3) (see the sketch below)
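For reference, a minimal sketch of the triple-loop algorithm (the function name is illustrative; matrices are assumed row-major n × n, stored as flat arrays):

/* C = A * B; the three nested loops perform n^3 multiply-adds */
void matrix_multiply_serial(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}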
Parallelization of Matrix Multiplication
> Idea: data parallelism → parallel block multiplication.
▪ Partition each n × n matrix into four (n/2) × (n/2) blocks:
  [C11 C12]   [A11 A12] [B11 B12]
  [C21 C22] = [A21 A22] [B21 B22]
▪ The submatrices C11 = A11 B11 + A12 B21, C12 = A11 B12 + A12 B22,
  C21 = A21 B11 + A22 B21, C22 = A21 B12 + A22 B22 are computed simultaneously.
▪ Computational complexity: Θ(n^3)
Parallel Matrix Multiplication Algorithm
Strassen’s Algorithm
▪ Computational complexity: Θ(n^2.807)
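Where 2.807 comes from: plain block multiplication needs 8 recursive half-size multiplications, while Strassen combines the blocks so that 7 suffice (at the cost of extra additions), so the recurrence becomes

T(n) = 7·T(n/2) + Θ(n^2)  ⟹  T(n) = Θ(n^(log₂ 7)) = Θ(n^2.807)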
Is This the Fastest?
Issues with Memory Hierarchy
[Figure: the memory hierarchy — on-chip control, datapath, registers, ITLB/DTLB, and first-level instruction/data caches (SRAM); then a second-level cache, main memory (DRAM), and secondary memory (disk)]
Speed (# cycles): ½'s → 1's → 10's → 100's → 10,000's
Size (bytes): 100's → 10K's → M's → G's → T's
> Computers employ multi-level caches to exploit data locality
▪ If data is not efficiently used in cache, performance drops significantly.
> Issues with naïve parallel implementation
▪ Cache inefficiency: each row of A and each column of B may exceed cache capacity.
▪ Poor spatial locality: matrix B is accessed column-wise.
▪ False sharing: distinct threads independently update data within the same
cache line (sketched below), causing repeated cache-line invalidations and unnecessary memory traffic.
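To make the false-sharing pattern concrete, a minimal sketch with illustrative names (the padded fix appears on a later slide); the elements of counters share a handful of cache lines, so writes by different threads keep invalidating each other:

#define NUM_THREADS 8

int counters[NUM_THREADS];          /* adjacent ints share a 64-byte cache line */

void *worker(void *arg) {
    int id = *(int *) arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id]++;             /* each write invalidates the line in other cores */
    return NULL;
}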
Common Cache Optimization Techniques (1)
> Cache Blocking (Tiling):
▪ Divide data and computations into smaller blocks that fit within the cache.
✓ E.g., BLOCK_SIZE = 32 or 64 works well for a 32 KB L1 cache.
▪ Process these blocks sequentially, maximizing data reuse within the cache.
#include <pthread.h>

#define BLOCK_SIZE 64   // Example block size

// Illustrative task descriptor (the slide elides its exact definition)
struct thread_data { int *A, *B, *C; int n, start_i, end_i; };

void *matrix_multiply_block(void *arg) {
    // Cache the task parameters in local variables
    struct thread_data *td = (struct thread_data *) arg;
    int *A = td->A, *B = td->B, *C = td->C;
    int n = td->n, start_i = td->start_i, end_i = td->end_i;

    for (int ii = start_i; ii < end_i; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
                /* Process each block sequentially, reusing the tiles while cached */
                for (int i = ii; i < ii + BLOCK_SIZE && i < end_i; i++)
                    for (int j = jj; j < jj + BLOCK_SIZE && j < n; j++)
                        for (int k = kk; k < kk + BLOCK_SIZE && k < n; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
    return NULL;
}
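As a rough sizing check (assuming 4-byte int elements, as in the sketch above): a tile of each of A, B, and C should stay resident, so one aims for 3 × BLOCK_SIZE² × sizeof(int) ≤ L1 size; BLOCK_SIZE = 32 gives 3 × 32 × 32 × 4 B = 12 KB, which fits comfortably in a 32 KB L1 data cache.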
Common Cache Optimization Techniques (2)
> Loop Interchange
▪ Reorder nested loops to ensure consecutive memory accesses are close
together, improving data locality.
/* Before (i-j-k order): B is accessed column-wise, with poor spatial locality */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        C[i][j] = 0;
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }

/* After (i-k-j order): B is accessed row-wise; C must be zeroed beforehand */
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];
> Data Alignment and Padding:
▪ Align data on cache line boundaries or pad data structures to prevent cache
line splits and avoid false sharing.
#define CACHE_LINE_SIZE 64
#define NUM_THREADS 8   // Illustrative; the slide elides this constant

// Pad each element to a full cache line so no two threads share one
typedef struct {
    int value;
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} padded_int;

padded_int shared_data[NUM_THREADS];   // Array of padded integers

void *thread_func(void *arg) {
    int thread_id = *(int *)arg;
    shared_data[thread_id].value++;    // Each thread modifies its own padded element
    return NULL;
}
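With a C11 compiler, an alternative to manual padding is to request the alignment directly; the compiler then rounds the struct size up to a full cache line:

#include <stdalign.h>

typedef struct { alignas(64) int value; } aligned_int;   /* sizeof(aligned_int) == 64 */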
Quicksort Parallelization
> Sort set of N random numbers
> Multiple possible algorithms
▪ Use quicksort for parallelization
> Sequential quicksort of set of values X
▪ Choose “pivot” p from X
▪ Rearrange X into
✓ L: Values ≤ p
✓ R: Values ≥ p
▪ Recursively sort L to get L’
▪ Recursively sort R to get R’
▪ Return L’ : p : R’
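The slides that follow rely on two helpers, partition and qsort_serial; a minimal sketch of both, assuming data_t is a numeric type (here long) and taking the last element as the pivot (the lecture's actual pivot choice may differ):

#include <stddef.h>

typedef long data_t;   /* assumed element type */

/* Rearrange base[0..nele-1] around the pivot; return the pivot's final
   index m, so that base[0..m-1] <= base[m] <= base[m+1..nele-1] */
static size_t partition(data_t *base, size_t nele) {
    data_t p = base[nele - 1];
    size_t m = 0;
    for (size_t i = 0; i + 1 < nele; i++)
        if (base[i] <= p) {
            data_t tmp = base[i]; base[i] = base[m]; base[m] = tmp;
            m++;
        }
    base[nele - 1] = base[m];
    base[m] = p;
    return m;
}

void qsort_serial(data_t *base, size_t nele) {
    if (nele <= 1) return;                      /* 0 or 1 element: already sorted */
    size_t m = partition(base, nele);
    qsort_serial(base, m);                      /* sort L */
    qsort_serial(base + m + 1, nele - m - 1);   /* sort R */
}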
Sequential Quicksort
[Figure: recursion tree for sequential quicksort — X is split around pivot p into L and R; L is recursively split (L2, p2, R2, …) and R likewise (L3, p3, R3, …) until every sublist is sorted]
Parallel Quicksort
[Figure: the same recursion tree, but the L and R subproblems at each level (L2, p2, R2 and L3, p3, R3, …) are sorted concurrently by separate threads]
> If N ≤ Nthresh, do sequential quicksort
> Else
▪ Choose “pivot” p from X
▪ Rearrange X into L and R
▪ Recursively spawn separate threads
✓ Sort L to get L’
✓ Sort R to get R’
▪ Return L’ : p : R’
Thread Structure: Sorting Tasks
> Task: sort subrange of data
▪ Specify as:
✓ base: Starting address
✓ nele: Number of elements in subrange
> Run as separate thread
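Reading ahead to the code on the next slides, the task descriptor plausibly looks like this (a guess for illustration; the lecture's actual definition may carry more bookkeeping):

typedef struct {
    data_t *base;        /* base: starting address of the subrange */
    size_t nele;         /* nele: number of elements in the subrange */
    task_queue_ptr tq;   /* task queue used by join_tasks to await completion */
} sort_task_t;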
Small Sort Task Operation
> Sort subrange using serial quicksort
Large Sort Task Operation
[Figure: a large task partitions its subrange X into L : p : R, then spawns two new sort tasks, one for L and one for R]
Top-Level Function (Simplified)
void tqsort(data_t *base, size_t nele) {
init_task(nele);
global_base = base;
global_end = global_base + nele - 1;
task_queue_ptr tq = new_task_queue();
tqsort_helper(base, nele, tq);
join_tasks(tq);
free_task_queue(tq);
}
> Steps:
▪ Sets up data structures
▪ Calls recursive sort routine
▪ Keeps joining threads until none left
▪ Frees data structures
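A hypothetical driver, assuming the declarations above are in scope (the array size matches the experiment reported later; value generation is illustrative):

#include <stdlib.h>

int main(void) {
    size_t nele = (size_t)1 << 27;              /* 2^27 random values */
    data_t *data = malloc(nele * sizeof(data_t));
    for (size_t i = 0; i < nele; i++)
        data[i] = rand();
    tqsort(data, nele);                         /* multi-threaded quicksort */
    free(data);
    return 0;
}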
Recursive Sort Routine (Simplified)
/* Multi-threaded quicksort */
static void tqsort_helper(data_t *base, size_t nele,
task_queue_ptr tq) {
if (nele <= nele_max_sort_serial) {
/* Use sequential sort */
qsort_serial(base, nele);
return;
}
sort_task_t *t = new_task(base, nele, tq);
spawn_task(tq, sort_thread, (void *) t);
}
> Idea:
▪ Small partition: Sort serially
▪ Large partition: Spawn new sort task
Sort Task Thread (Simplified)
/* Thread routine for many-threaded quicksort */
static void *sort_thread(void *vargp) {
sort_task_t *t = (sort_task_t *) vargp;
data_t *base = t->base;
size_t nele = t->nele;
task_queue_ptr tq = t->tq;
free(vargp);
size_t m = partition(base, nele);
if (m > 1)
tqsort_helper(base, m, tq);
if (nele-1 > m+1)
tqsort_helper(base+m+1, nele-m-1, tq);
return NULL;
}
▪ Get task parameters and free the task struct
▪ Perform the partitioning step
▪ Call the recursive sort routine on each partition
▪ Partitions with fewer than two elements are already sorted, so recursion is skipped for them
Parallel Quicksort Performance
▪ Serial fraction: fraction of the input size below which the sort switches to serial quicksort
▪ Sort 2^27 (134,217,728) random values
▪ Best speedup = 6.84X
NEXT LECTURE
[Flipped class] Parallel programming (cont.)
> Pre-class
▪ Study pre-class materials on Canvas
> In class
▪ Reinforcement/enrichment discussion
> Post-class
▪ Homework
▪ Consultation (if needed)