ADVANCED
PROGRAMMING
TECHNIQUES
VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY
LECTURE 6: Parallel Programming (cont.)
CONTENTS
> Sequential matrix multiplication
> Parallelization of matrix multiplication
> Optimizing parallel matrix multiplication
> Sequential quicksort
> Parallel quicksort
Sequential Matrix Multiplication
> Numerous real-life applications
▪ Systems of linear equations, PCA, PageRank, DNA sequencing, etc.
> Sequential algorithm
▪ Computational complexity: Θ(n^3) (see the sketch below)
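For reference, a minimal sketch of the triple-loop algorithm (the function name is illustrative; matrices are assumed row-major n × n, stored as flat arrays):

/* C = A * B; the three nested loops perform n^3 multiply-adds */
void matrix_multiply_serial(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}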
Parallelization of Matrix Multiplication
> Idea: data parallelism → parallel block multiplication.
▪ Partition each n × n matrix into four (n/2) × (n/2) blocks:
  [C11 C12]   [A11 A12] [B11 B12]
  [C21 C22] = [A21 A22] [B21 B22]
▪ The submatrices C11 = A11 B11 + A12 B21, C12 = A11 B12 + A12 B22,
  C21 = A21 B11 + A22 B21, C22 = A21 B12 + A22 B22 are computed simultaneously.
▪ Computational complexity: Θ(n^3)
Parallel Matrix Multiplication Algorithm
Strassen’s Algorithm
▪ Computational complexity: Θ(n^2.807)
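Where 2.807 comes from: plain block multiplication needs 8 recursive half-size multiplications, while Strassen combines the blocks so that 7 suffice (at the cost of extra additions), so the recurrence becomes

T(n) = 7·T(n/2) + Θ(n^2)  ⟹  T(n) = Θ(n^(log₂ 7)) = Θ(n^2.807)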
Is This the Fastest?
Issues with Memory Hierarchy
[Figure: the memory hierarchy — on-chip control, datapath, registers, ITLB/DTLB, and first-level instruction/data caches (SRAM); then a second-level cache, main memory (DRAM), and secondary memory (disk)]
Speed (# cycles): ½'s → 1's → 10's → 100's → 10,000's
Size (bytes): 100's → 10K's → M's → G's → T's
> Computers employ multi-level caches to exploit data locality
▪ If data is not efficiently used in cache, performance drops significantly.
> Issues with naïve parallel implementation
▪ Cache inefficiency: each row of A and each column of B may exceed cache capacity.
▪ Poor spatial locality: matrix B is accessed column-wise.
▪ False sharing: distinct threads independently update data within the same
cache line (sketched below), causing repeated cache-line invalidations and unnecessary memory traffic.
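To make the false-sharing pattern concrete, a minimal sketch with illustrative names (the padded fix appears on a later slide); the elements of counters share a handful of cache lines, so writes by different threads keep invalidating each other:

#define NUM_THREADS 8

int counters[NUM_THREADS];          /* adjacent ints share a 64-byte cache line */

void *worker(void *arg) {
    int id = *(int *) arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id]++;             /* each write invalidates the line in other cores */
    return NULL;
}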
Common Cache Optimization Techniques (1)
> Cache Blocking (Tiling):
▪ Divide data and computations into smaller blocks that fit within the cache.
✓ E.g., BLOCK_SIZE = 32 or 64 works well for a 32 KB L1 cache.
▪ Process these blocks sequentially, maximizing data reuse within the cache.
#include <pthread.h>

#define BLOCK_SIZE 64   // Example block size

// Illustrative task descriptor (the slide elides its exact definition)
struct thread_data { int *A, *B, *C; int n, start_i, end_i; };

void *matrix_multiply_block(void *arg) {
    // Cache the task parameters in local variables
    struct thread_data *td = (struct thread_data *) arg;
    int *A = td->A, *B = td->B, *C = td->C;
    int n = td->n, start_i = td->start_i, end_i = td->end_i;

    for (int ii = start_i; ii < end_i; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
                /* Process each block sequentially, reusing the tiles while cached */
                for (int i = ii; i < ii + BLOCK_SIZE && i < end_i; i++)
                    for (int j = jj; j < jj + BLOCK_SIZE && j < n; j++)
                        for (int k = kk; k < kk + BLOCK_SIZE && k < n; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
    return NULL;
}
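As a rough sizing check (assuming 4-byte int elements, as in the sketch above): a tile of each of A, B, and C should stay resident, so one aims for 3 × BLOCK_SIZE² × sizeof(int) ≤ L1 size; BLOCK_SIZE = 32 gives 3 × 32 × 32 × 4 B = 12 KB, which fits comfortably in a 32 KB L1 data cache.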
Common Cache Optimization Techniques (2)
> Loop Interchange
▪ Reorder nested loops to ensure consecutive memory accesses are close
together, improving data locality.
/* Before (i-j-k order): B is accessed column-wise, with poor spatial locality */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        C[i][j] = 0;
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }

/* After (i-k-j order): B is accessed row-wise; C must be zeroed beforehand */
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];
> Data Alignment and Padding:
▪ Align data on cache line boundaries or pad data structures to prevent cache
line splits and avoid false sharing.
#define CACHE_LINE_SIZE 64
#define NUM_THREADS 8   // Illustrative; the slide elides this constant

// Pad each element to a full cache line so no two threads share one
typedef struct {
    int value;
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} padded_int;

padded_int shared_data[NUM_THREADS];   // Array of padded integers

void *thread_func(void *arg) {
    int thread_id = *(int *)arg;
    shared_data[thread_id].value++;    // Each thread modifies its own padded element
    return NULL;
}
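With a C11 compiler, an alternative to manual padding is to request the alignment directly; the compiler then rounds the struct size up to a full cache line:

#include <stdalign.h>

typedef struct { alignas(64) int value; } aligned_int;   /* sizeof(aligned_int) == 64 */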
Quicksort Parallelization
> Sort set of N random numbers
> Multiple possible algorithms
▪ Use quicksort for parallelization
> Sequential quicksort of set of values X
▪ Choose “pivot” p from X
▪ Rearrange X into
✓ L: Values ≤ p
✓ R: Values ≥ p
▪ Recursively sort L to get L’
▪ Recursively sort R to get R’
▪ Return L’ : p : R’
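The slides that follow rely on two helpers, partition and qsort_serial; a minimal sketch of both, assuming data_t is a numeric type (here long) and taking the last element as the pivot (the lecture's actual pivot choice may differ):

#include <stddef.h>

typedef long data_t;   /* assumed element type */

/* Rearrange base[0..nele-1] around the pivot; return the pivot's final
   index m, so that base[0..m-1] <= base[m] <= base[m+1..nele-1] */
static size_t partition(data_t *base, size_t nele) {
    data_t p = base[nele - 1];
    size_t m = 0;
    for (size_t i = 0; i + 1 < nele; i++)
        if (base[i] <= p) {
            data_t tmp = base[i]; base[i] = base[m]; base[m] = tmp;
            m++;
        }
    base[nele - 1] = base[m];
    base[m] = p;
    return m;
}

void qsort_serial(data_t *base, size_t nele) {
    if (nele <= 1) return;                      /* 0 or 1 element: already sorted */
    size_t m = partition(base, nele);
    qsort_serial(base, m);                      /* sort L */
    qsort_serial(base + m + 1, nele - m - 1);   /* sort R */
}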
Sequential Quicksort
[Figure: recursion tree for sequential quicksort — X is split around pivot p into L and R; L is recursively split (L2, p2, R2, …) and R likewise (L3, p3, R3, …) until every sublist is sorted]
Parallel Quicksort
[Figure: the same recursion tree, but the L and R subproblems at each level (L2, p2, R2 and L3, p3, R3, …) are sorted concurrently by separate threads]
> If N ≤ Nthresh, do sequential quicksort
> Else
▪ Choose “pivot” p from X
▪ Rearrange X into L and R
▪ Recursively spawn separate threads
✓ Sort L to get L’
✓ Sort R to get R’
▪ Return L’ : p : R’
Thread Structure: Sorting Tasks
> Task: sort subrange of data
▪ Specify as:
✓ base: Starting address
✓ nele: Number of elements in subrange
> Run as separate thread
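Reading ahead to the code on the next slides, the task descriptor plausibly looks like this (a guess for illustration; the lecture's actual definition may carry more bookkeeping):

typedef struct {
    data_t *base;        /* base: starting address of the subrange */
    size_t nele;         /* nele: number of elements in the subrange */
    task_queue_ptr tq;   /* task queue used by join_tasks to await completion */
} sort_task_t;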
Small Sort Task Operation
> Sort subrange using serial quicksort
Large Sort Task Operation
[Figure: a large task partitions its subrange X into L : p : R, then spawns two new sort tasks, one for L and one for R]
Top-Level Function (Simplified)
void tqsort(data_t *base, size_t nele) {
init_task(nele);
global_base = base;
global_end = global_base + nele - 1;
task_queue_ptr tq = new_task_queue();
tqsort_helper(base, nele, tq);
join_tasks(tq);
free_task_queue(tq);
}
> Steps:
▪ Sets up data structures
▪ Calls recursive sort routine
▪ Keeps joining threads until none left
▪ Frees data structures
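A hypothetical driver, assuming the declarations above are in scope (the array size matches the experiment reported later; value generation is illustrative):

#include <stdlib.h>

int main(void) {
    size_t nele = (size_t)1 << 27;              /* 2^27 random values */
    data_t *data = malloc(nele * sizeof(data_t));
    for (size_t i = 0; i < nele; i++)
        data[i] = rand();
    tqsort(data, nele);                         /* multi-threaded quicksort */
    free(data);
    return 0;
}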
Recursive Sort Routine (Simplified)
/* Multi-threaded quicksort */
static void tqsort_helper(data_t *base, size_t nele,
task_queue_ptr tq) {
if (nele <= nele_max_sort_serial) {
/* Use sequential sort */
qsort_serial(base, nele);
return;
}
sort_task_t *t = new_task(base, nele, tq);
spawn_task(tq, sort_thread, (void *) t);
}
> Idea:
▪ Small partition: Sort serially
▪ Large partition: Spawn new sort task
Sort Task Thread (Simplified)
/* Thread routine for many-threaded quicksort */
static void *sort_thread(void *vargp) {
sort_task_t *t = (sort_task_t *) vargp;
data_t *base = t->base;
size_t nele = t->nele;
task_queue_ptr tq = t->tq;
free(vargp);
size_t m = partition(base, nele);
if (m > 1)
tqsort_helper(base, m, tq);
if (nele-1 > m+1)
tqsort_helper(base+m+1, nele-m-1, tq);
return NULL;
}
▪ Get task parameters and free the task struct
▪ Perform the partitioning step
▪ Call the recursive sort routine on each partition
▪ Partitions with fewer than two elements are already sorted, so recursion is skipped for them
Parallel Quicksort Performance
▪ Serial fraction: fraction of the input size below which the sort switches to serial quicksort
▪ Sort 2^27 (134,217,728) random values
▪ Best speedup = 6.84X
NEXT LECTURE
[Flipped class] Parallel programming (cont.)
> Pre-class
▪ Study pre-class materials on Canvas
> In class
▪ Reinforcement/enrichment discussion
> Post-class
▪ Homework
▪ Consultation (if needed)