
ADVANCED PROGRAMMING TECHNIQUES

VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY


LECTURE 6: Parallel Programming (cont.)

CONTENTS

> Sequential matrix multiplication


> Parallelization of matrix multiplication
> Optimizing parallel matrix multiplication
> Sequential quicksort
> Parallel quicksort
Sequential Matrix Multiplication
> Numerous real-life applications
▪ Systems of linear equations, PCA, PageRank, DNA sequencing, etc.

> Sequential algorithm (sketched in C below)

▪ Computational complexity: Θ(n³)
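
A minimal sketch of the sequential triple-loop algorithm (assuming square n × n matrices stored in row-major order; the names here are illustrative):

#include <stddef.h>  /* size_t */

/* Sequential matrix multiplication: C = A * B, all n x n, row-major.
   The three nested loops give the Θ(n³) running time noted above. */
void matmul_seq(const double *A, const double *B, double *C, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}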


Parallelization of Matrix Multiplication
> Idea: data parallelism → parallel block multiplication.

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] × [ B21 B22 ]

where each block is an (n/2) × (n/2) submatrix.

▪ The submatrices C11 = A11 B11 + A12 B21, C12 = A11 B12 + A12 B22, C21 = A21 B11 + A22 B21, C22 = A21 B12 + A22 B22 are computed simultaneously (see the pthreads sketch below).
▪ Computational complexity: Θ(n³)
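
A minimal pthreads sketch of this idea, one thread per quadrant of C (the helper names and the quad_arg struct are assumptions, not from the slides):

#include <pthread.h>
#include <stddef.h>

/* Arguments for one quadrant of C in an n x n row-major multiply. */
typedef struct {
    const double *A, *B;  /* full input matrices */
    double *C;            /* full output matrix */
    size_t n;             /* full dimension (assumed even) */
    size_t ci, cj;        /* block-row/block-column of the quadrant (0 or 1) */
} quad_arg;

static void *mul_quadrant(void *vargp) {
    quad_arg *q = (quad_arg *)vargp;
    size_t h = q->n / 2;
    for (size_t i = q->ci * h; i < (q->ci + 1) * h; i++)
        for (size_t j = q->cj * h; j < (q->cj + 1) * h; j++) {
            double sum = 0.0;
            /* C(ci,cj) = A(ci,0)B(0,cj) + A(ci,1)B(1,cj): running k over
               the full range covers both block products at once. */
            for (size_t k = 0; k < q->n; k++)
                sum += q->A[i * q->n + k] * q->B[k * q->n + j];
            q->C[i * q->n + j] = sum;
        }
    return NULL;
}

/* Spawn four threads, one per quadrant of C, and wait for all of them. */
void matmul_quadrants(const double *A, const double *B, double *C, size_t n) {
    pthread_t tid[4];
    quad_arg args[4];
    for (int q = 0; q < 4; q++) {
        args[q] = (quad_arg){A, B, C, n, (size_t)(q / 2), (size_t)(q % 2)};
        pthread_create(&tid[q], NULL, mul_quadrant, &args[q]);
    }
    for (int q = 0; q < 4; q++)
        pthread_join(tid[q], NULL);
}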
Parallel Matrix Multiplication Algorithm
Strassen’s Algorithm

▪ Computational complexity: Θ(n^log₂7) ≈ Θ(n^2.807), obtained by replacing the 8 block multiplications of the divide-and-conquer scheme with 7.


Is This the Fastest?
Issues with Memory Hierarchy

[Figure: the memory hierarchy: on-chip control, datapath and registers; separate L1 instruction/data caches with instruction/data TLBs; a second-level cache (SRAM); main memory (DRAM); secondary memory (disk).]

Level:           Registers   L1 Cache   L2 Cache   Main Memory   Disk
Speed (cycles):  ~½          1's        10's       100's         10,000's
Size (bytes):    100's       10K's      M's        G's           T's

> Computers employ multi-level caches to exploit data locality.

▪ If cached data is not reused effectively, performance drops significantly.

> Issues with naïve parallel implementation


▪ Cache inefficiency: each row of 𝐴 & column of 𝐵 may exceed cache capacity.
▪ Poor spatial locality: matrix 𝐵 is accessed column-wise.
▪ False sharing: distinct threads independently update data within the same
cache line, leading to cache invalidations and unnecessary memory access.
Common Cache Optimization Techniques (1)
> Cache Blocking (Tiling):
▪ Divide data and computations into smaller blocks that fit within the cache.
✓ E.g., BLOCK_SIZE = 32 or 64 suits a 32 KB L1 data cache.
▪ Process these blocks sequentially, maximizing data reuse within the cache.

// Preprocessor directives: #include…

#define BLOCK_SIZE 64 // Example block size

struct thread_data { int *A, *B, *C; int n, start_i, end_i; }; // Layout assumed from the usage below

void *matrix_multiply_block(void *arg) {
    // Cache the task parameters in local variables
    struct thread_data *t = (struct thread_data *)arg;
    int *A = t->A, *B = t->B, *C = t->C;
    int n = t->n, start_i = t->start_i, end_i = t->end_i;
    for (int ii = start_i; ii < end_i; ii += BLOCK_SIZE)
        for (int jj = 0; jj < n; jj += BLOCK_SIZE)
            for (int kk = 0; kk < n; kk += BLOCK_SIZE)
                /* Process each block sequentially */
                for (int i = ii; i < ii + BLOCK_SIZE && i < end_i; i++)
                    for (int j = jj; j < jj + BLOCK_SIZE && j < n; j++)
                        for (int k = kk; k < kk + BLOCK_SIZE && k < n; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
    return NULL;
}
Common Cache Optimization Techniques (2)
> Loop Interchange
▪ Reorder nested loops to ensure consecutive memory accesses are close
together, improving data locality.
// i-k-j order: the j and k loops of the naive i-j-k version are interchanged
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++)
        C[i][j] = 0;
    for (int k = 0; k < N; k++) {
        for (int j = 0; j < N; j++) {       // innermost loop now walks j, so
            C[i][j] += A[i][k] * B[k][j];   // B[k][j] and C[i][j] are row-wise
        }
    }
}

> Data alignment and padding:


▪ Align data on cache line boundaries or pad data structures to prevent cache
line splits and avoid false sharing.

#define CACHE_LINE_SIZE 64
#define NUM_THREADS 4 // Example thread count

typedef struct {
    int value;
    char padding[CACHE_LINE_SIZE - sizeof(int)]; // Pad to a full cache line
} padded_int;

padded_int shared_data[NUM_THREADS]; // Array of padded integers

void *thread_func(void *arg) {
    int thread_id = *(int *)arg;
    shared_data[thread_id].value++; // Each thread touches only its own line
    return NULL;
}
Quicksort Parallelization
> Sort set of N random numbers
> Multiple possible algorithms
▪ Use quicksort for parallelization

> Sequential quicksort of a set of values X (see the C sketch below)

▪ Choose “pivot” p from X
▪ Rearrange X into
✓ L: values ≤ p
✓ R: values ≥ p
▪ Recursively sort L to get L’
▪ Recursively sort R to get R’
▪ Return L’ : p : R’
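
A minimal C sketch of these steps (in-place, choosing the last element as the pivot; all names are illustrative):

#include <stddef.h>

static void swap_vals(long *a, long *b) { long t = *a; *a = *b; *b = t; }

/* Sequential quicksort of the nele-element array base[]. */
void qsort_seq(long *base, size_t nele) {
    if (nele <= 1)
        return;
    long p = base[nele - 1];               /* choose pivot p from X */
    size_t m = 0;
    for (size_t i = 0; i + 1 < nele; i++)  /* rearrange X into L and R */
        if (base[i] <= p)
            swap_vals(&base[i], &base[m++]);
    swap_vals(&base[m], &base[nele - 1]);  /* pivot lands between L and R */
    qsort_seq(base, m);                    /* recursively sort L */
    qsort_seq(base + m + 1, nele - m - 1); /* recursively sort R */
}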
Sequential Quicksort

[Figure: recursion tree. X is partitioned around pivot p into L and R; L is then partitioned around p2 into L2 and R2, R around p3 into L3 and R3, and so on, one subproblem at a time.]
Parallel Quicksort
[Figure: the same recursion tree, but after the first partition the L and R subtrees are sorted concurrently by separate threads.]

> If N ≤ Nthresh, do sequential quicksort


> Else
▪ Choose “pivot” p from X
▪ Rearrange X into L and R
▪ Recursively spawn separate threads
✓ Sort L to get L’
✓ Sort R to get R’
▪ Return L’ : p : R’
Thread Structure: Sorting Tasks

[Figure: the data array divided into subranges, each assigned to its own task thread.]

> Task: sort subrange of data


▪ Specify as:
✓ base: Starting address
✓ nele: Number of elements in subrange

> Run as separate thread (a possible task record is sketched below)
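
A plausible layout for the task record, consistent with the sort_thread code later in the lecture (the field names base, nele and tq appear there; the exact definition is not on the slides):

/* One sorting task: sort the subrange base[0 .. nele-1]. */
typedef struct {
    data_t *base;       /* starting address of the subrange */
    size_t nele;        /* number of elements in the subrange */
    task_queue_ptr tq;  /* task queue used to track spawned threads */
} sort_task_t;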


Small Sort Task Operation

[Figure: one task thread sorting its small subrange in place.]

> Sort subrange using serial quicksort


Large Sort Task Operation

[Figure: a large task partitions its subrange X into L, p, R, then spawns two new tasks, one to sort L and one to sort R.]
Top-Level Function (Simplified)

void tqsort(data_t *base, size_t nele) {
    init_task(nele);
    global_base = base;
    global_end = global_base + nele - 1;
    task_queue_ptr tq = new_task_queue();
    tqsort_helper(base, nele, tq);
    join_tasks(tq);
    free_task_queue(tq);
}

> Steps:
▪ Sets up data structures
▪ Calls recursive sort routine
▪ Keeps joining threads until none left
▪ Frees data structures (a hypothetical driver is sketched below)
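
For illustration, a hypothetical driver (assuming data_t is an integer type such as long; the input size and the use of rand() are assumptions, not from the slides):

#include <stdlib.h>

#define NELE (1 << 20)  /* example input size */

int main(void) {
    data_t *data = malloc(NELE * sizeof(data_t));
    if (data == NULL)
        return 1;
    for (size_t i = 0; i < NELE; i++)
        data[i] = (data_t)rand();  /* random values, as in the experiment */
    tqsort(data, NELE);            /* multi-threaded sort */
    free(data);
    return 0;
}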
Recursive Sort Routine (Simplified)

/* Multi-threaded quicksort */
static void tqsort_helper(data_t *base, size_t nele,
                          task_queue_ptr tq) {
    if (nele <= nele_max_sort_serial) {
        /* Use sequential sort */
        qsort_serial(base, nele);
        return;
    }
    sort_task_t *t = new_task(base, nele, tq);
    spawn_task(tq, sort_thread, (void *) t);
}

> Idea:
▪ Small partition: Sort serially
▪ Large partition: Spawn new sort task
Sort Task Thread (Simplified)
/* Thread routine for many-threaded quicksort */
static void *sort_thread(void *vargp) {
    sort_task_t *t = (sort_task_t *) vargp;
    data_t *base = t->base;
    size_t nele = t->nele;
    task_queue_ptr tq = t->tq;
    free(vargp);
    size_t m = partition(base, nele);
    if (m > 1)
        tqsort_helper(base, m, tq);
    if (nele - 1 > m + 1)
        tqsort_helper(base + m + 1, nele - m - 1, tq);
    return NULL;
}

▪ Get task parameters


▪ Perform partitioning step
▪ Call recursive sort routine on each partition (partition is sketched below)
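
The partition helper is not shown on the slides; a standard Lomuto-style sketch consistent with the call site above (left part in [0, m), pivot at index m, right part in [m+1, nele), essentially the same rearrangement step as in the sequential sketch earlier) could be:

/* Partition base[0 .. nele-1] around pivot base[nele-1]; return its index. */
static size_t partition(data_t *base, size_t nele) {
    data_t pivot = base[nele - 1];
    size_t m = 0;
    for (size_t i = 0; i + 1 < nele; i++) {
        if (base[i] <= pivot) {
            data_t tmp = base[i];  /* swap base[i] into the left part */
            base[i] = base[m];
            base[m++] = tmp;
        }
    }
    base[nele - 1] = base[m];  /* move the pivot into its final slot */
    base[m] = pivot;
    return m;
}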
Parallel Quicksort Performance

▪ Serial fraction: the fraction of the input size at or below which the sort falls back to serial quicksort

▪ Sorted 2²⁷ (134,217,728) random values
▪ Best speedup = 6.84×
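
One common lens on this speedup ceiling (an addition, not from the slides) is Amdahl's law: if a fraction s of the total work (the initial partitions, task management) stays sequential, the speedup on p cores is bounded by

S(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s}

so even a modest sequential share, say s = 0.1, caps the speedup at 10× no matter how many cores are used.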
NEXT LECTURE

[Flipped class] Parallel programming (cont.)


> Pre-class
▪ Study pre-class materials on Canvas

> In class
▪ Reinforcement/enrichment discussion

> Post-class


▪ Homework
▪ Consultation (if needed)
