1. Difference between SIMD and MIMD Architectures
1 SIMD (Single Instruction, Multiple Data): One instruction stream applied to many data elements in
lockstep.
2 MIMD (Multiple Instruction, Multiple Data): Multiple independent processors execute different
instructions on different data.
3 Control: SIMD uses a single control unit; MIMD has multiple independent control units.
4 Synchrony: SIMD executes the same instruction simultaneously; MIMD executes asynchronously.
5 Use cases: SIMD for data-parallel tasks (e.g., vector math, image filtering); MIMD for task-parallel or
general-purpose multiprocessing (OS, servers).
6 Programming: SIMD is simpler for regular data-parallel (SPMD-style) loops; MIMD requires explicit concurrency control (threads, MPI).
7 Hardware: SIMD hardware is simpler; MIMD requires richer interconnect and coherence mechanisms.
8 Real-world SIMD examples: GPUs (shader cores), SIMD instructions (Intel AVX, ARM NEON).
9 Advantages of SIMD: High throughput, energy-efficient for data-parallel loops.
10 Real-world MIMD examples: Multi-core CPUs (Linux servers), distributed-memory clusters.
11 Advantages of MIMD: Flexibility for heterogeneous tasks, good for task-level parallelism.
2. Classification of Parallel Computers
1 By memory organization: Shared-memory (UMA, NUMA), Distributed-memory (clusters), Hybrid.
2 By control/instruction streams (Flynn’s taxonomy): SISD, SIMD, MISD, MIMD.
3 By coupling: Tightly coupled (multicore), Loosely coupled (clusters).
4 By programming model: Shared memory (threads, OpenMP), Distributed memory (MPI),
Data-parallel (GPU, MapReduce).
5 Architecture examples:
6 • Shared-memory multiprocessor: CPUs connected via coherence fabric to shared memory.
7 • Distributed-memory cluster: Each node has CPU+RAM, connected via a network.
3. Synchronization Mechanism in Shared-Memory Systems
1 Race condition: Occurs when multiple threads access shared data without synchronization.
2 Common mechanisms: Mutex, Spinlock, Semaphores, Barriers, Atomic operations.
3 Mutex ensures only one thread enters a critical section at a time.
4 Barriers synchronize all threads at a certain point.
5 Atomic operations provide lock-free synchronization.
6 Best practices: Keep critical sections short, avoid deadlocks, prefer lock-free methods when possible.
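A minimal sketch of these mechanisms in C++ (the counter, thread count, and iteration count are illustrative): a mutex-protected critical section next to a lock-free atomic increment, with thread joins acting as a simple barrier.

    #include <atomic>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;                       // protects locked_count
    long locked_count = 0;              // shared data guarded by the mutex
    std::atomic<long> atomic_count{0};  // lock-free alternative

    void worker(int iters) {
        for (int i = 0; i < iters; ++i) {
            {
                std::lock_guard<std::mutex> guard(m);  // only one thread at a time here
                ++locked_count;                        // keep the critical section short
            }
            atomic_count.fetch_add(1, std::memory_order_relaxed);  // atomic, no lock
        }
    }

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) threads.emplace_back(worker, 100000);
        for (auto &t : threads) t.join();  // join acts as a simple barrier here
        std::printf("locked=%ld atomic=%ld\n", locked_count, atomic_count.load());
    }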
4. Impact of Cache Coherence in Shared-Memory Systems
1 Cache coherence ensures consistency across multiple caches.
2 Positive impact: Transparent programming model, faster access with caches.
3 Negative impact: Coherence traffic, false sharing, scalability limits.
4 Protocols: Write-invalidate (MESI), Write-update.
5 Mitigation: Align variables, use thread-local data, minimize sharing.
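A minimal sketch of the false-sharing mitigation in point 5, assuming a 64-byte cache line (typical on x86) and C++17 for over-aligned allocation: each thread updates its own counter, and alignas pads the counters onto separate cache lines so one thread's writes do not invalidate the others' cached copies.

    #include <cstdio>
    #include <thread>
    #include <vector>

    struct alignas(64) PaddedCounter {  // one counter per cache line (64 bytes assumed)
        long value = 0;
    };

    int main() {
        const int nthreads = 4;
        std::vector<PaddedCounter> counters(nthreads);
        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; ++t)
            threads.emplace_back([&counters, t] {
                for (int i = 0; i < 1000000; ++i)
                    ++counters[t].value;  // each thread touches only its own line
            });
        for (auto &t : threads) t.join();
        long total = 0;
        for (auto &c : counters) total += c.value;
        std::printf("total=%ld\n", total);
    }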
5. Routing in a 4×4 Mesh
1 Mesh topology: Nodes connected to up/down/left/right neighbors.
2 Routing algorithm: XY routing – move along X, then Y.
3 Example: (0,0) → (3,0) → (3,3): 3 hops along X, then 3 along Y, 6 hops in total (see the sketch below).
4 Alternative: YX routing or adaptive routing to reduce congestion.
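A minimal sketch of deterministic XY routing with (x, y) coordinates, reproducing the route from point 3: correct the X coordinate fully, then Y, counting hops.

    #include <cstdio>

    // Route from (sx, sy) to (dx, dy) on a mesh: X dimension first, then Y.
    int xy_route(int sx, int sy, int dx, int dy) {
        int hops = 0;
        while (sx != dx) { sx += (dx > sx) ? 1 : -1; ++hops; std::printf("(%d,%d) ", sx, sy); }
        while (sy != dy) { sy += (dy > sy) ? 1 : -1; ++hops; std::printf("(%d,%d) ", sx, sy); }
        return hops;
    }

    int main() {
        int hops = xy_route(0, 0, 3, 3);    // prints (1,0) (2,0) (3,0) (3,1) (3,2) (3,3)
        std::printf("-> %d hops\n", hops);  // 6 hops on the 4x4 mesh
    }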
6. Shared-Memory vs Distributed-Memory Systems
1 UMA: Equal memory access latency (e.g., small SMPs).
2 NUMA: Latency depends on locality (e.g., multi-socket servers).
3 Distributed-memory: Each node has private memory, uses message passing (MPI).
4 Pros/cons: Shared memory – simple but limited scalability; Distributed – scalable but complex
programming.
7. Performance of Interconnection Networks
1 Bus: Low cost, poor scalability.
2 Crossbar: High bandwidth, expensive, O(N²) hardware.
3 Ring: Moderate cost, O(N) latency.
4 Mesh: Scalable, moderate latency, used in NoCs and large systems.
8. Scalability Calculation Example
1 Tserial = 40s, Tparallel = 10s, p=8.
2 Speedup S = 40/10 = 4.
3 Efficiency E = 4/8 = 0.5 (50%).
4 Interpretation: only half of the available processor capacity is utilized.
9. Simple CUDA Kernel for Vector Addition
1 Kernel: Each thread computes one element C[i] = A[i] + B[i].
2 Threads organized into blocks (e.g., 256 threads per block).
3 Number of blocks in the grid: (n + threadsPerBlock - 1) / threadsPerBlock, rounded up so every element is covered.
4 Guard condition (if i < n) ensures threads outside the range do nothing (see the sketch below).
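A minimal CUDA sketch of the kernel described above; the array size, launch configuration, and use of unified memory (cudaMallocManaged) are illustrative choices for brevity.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n)                                      // guard: skip out-of-range threads
            C[i] = A[i] + B[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *A, *B, *C;
        cudaMallocManaged(&A, bytes);
        cudaMallocManaged(&B, bytes);
        cudaMallocManaged(&C, bytes);
        for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        vecAdd<<<blocks, threadsPerBlock>>>(A, B, C, n);
        cudaDeviceSynchronize();

        std::printf("C[0] = %.1f\n", C[0]);  // expect 3.0
        cudaFree(A); cudaFree(B); cudaFree(C);
    }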
10. Amdahl’s Law Example
1 Parallel fraction = 80%, Serial = 20%, p=8.
2 Speedup S = 1 / (0.2 + 0.8/8) = 1 / 0.3 ≈ 3.33.
3 Even with 8 processors, speedup is limited by the 20% serial part; as p grows, it is bounded by 1/0.2 = 5.
11. Example Program with Serial/Parallel Analysis
1 Tserial=24 ms, Tparallel=4 ms, p=8.
2 Observed speedup = 24/4 = 6.
3 Serial fraction (Karp–Flatt metric) = (1/S - 1/p) / (1 - 1/p) = (1/6 - 1/8) / (1 - 1/8) ≈ 4.76%; parallel fraction ≈ 95.2% (computed in the sketch below).
4 Program design: Partition workload, balance threads, minimize synchronization.
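A short sketch of the formulas behind sections 8, 10, and 11 (speedup, efficiency, Amdahl's law, Karp–Flatt serial fraction), applied to the numbers from the worked examples above.

    #include <cstdio>

    int main() {
        // Section 11 data: Tserial = 24 ms, Tparallel = 4 ms on p = 8 processors.
        double Ts = 24.0, Tp = 4.0, p = 8.0;
        double S = Ts / Tp;  // observed speedup = 6
        double E = S / p;    // efficiency = 0.75
        // Karp-Flatt metric: experimentally determined serial fraction.
        double e = (1.0 / S - 1.0 / p) / (1.0 - 1.0 / p);  // ~0.0476
        std::printf("speedup=%.2f efficiency=%.2f serial fraction=%.4f\n", S, E, e);

        // Amdahl's law (section 10): 80% parallel, 20% serial, p = 8.
        double fpar = 0.8;
        double amdahl = 1.0 / ((1.0 - fpar) + fpar / p);  // ~3.33
        std::printf("Amdahl speedup=%.2f, upper bound %.2f as p grows\n",
                    amdahl, 1.0 / (1.0 - fpar));
    }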
12. Scalability in MIMD Systems
1 Strong scalability: Runtime decreases with more processors for fixed problem size.
2 Weak scalability: Runtime remains constant with proportional increase in workload and processors.
3 Factors: Communication, synchronization, load balancing, cache coherence.
4 Strategies: Reduce communication, improve locality, balance load.
13. Message Passing vs Shared Memory
1 MPI: Explicit communication, scalable to large clusters, overhead from message latency.
2 Shared-memory (Pthreads/OpenMP): Easier programming, limited scalability due to memory
bandwidth and coherence issues.
3 Hybrid MPI+OpenMP combines strengths for multi-node parallelism.
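A minimal hybrid sketch of point 3 (one MPI process per node, OpenMP threads inside each process); the requested thread-support level and the printed format are illustrative, and it assumes compilation with an MPI compiler wrapper plus -fopenmp.

    #include <cstdio>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);  // threaded MPI
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            // Distributed memory between ranks, shared memory between threads.
            std::printf("rank %d/%d, thread %d/%d\n",
                        rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }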
14. Flynn’s Taxonomy
1 SISD: Single Instruction Single Data (uniprocessor).
2 SIMD: Single Instruction Multiple Data (vector processors, GPUs).
3 MISD: Multiple Instruction Single Data (rare, fault-tolerance pipelines).
4 MIMD: Multiple Instruction Multiple Data (multicore, clusters).
15. Vector Processors
1 Operate on entire vectors with one instruction.
2 Use vector registers to hold multiple data elements.
3 Highly pipelined functional units for throughput.
4 Efficient for scientific computing and linear algebra.
5 Limitations: Less efficient for irregular data access patterns.
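A minimal sketch of the vector idea using the x86 AVX intrinsics mentioned in section 1 (assumes an AVX-capable CPU and compilation with -mavx; the eight-element arrays are illustrative): one instruction performs eight single-precision additions.

    #include <cstdio>
    #include <immintrin.h>

    int main() {
        alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        alignas(32) float c[8];

        __m256 va = _mm256_load_ps(a);      // load 8 floats into a 256-bit vector register
        __m256 vb = _mm256_load_ps(b);
        __m256 vc = _mm256_add_ps(va, vb);  // one instruction, eight additions
        _mm256_store_ps(c, vc);

        for (int i = 0; i < 8; ++i) std::printf("%g ", c[i]);
        std::printf("\n");
    }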
16. Cache Coherence Approaches
1 Snooping-based coherence: Caches monitor bus and invalidate/update on writes (e.g., MESI protocol).
2 Directory-based coherence: Directory tracks sharers and manages invalidations/updates.
3 Snooping is simple but not scalable; Directory scales better but requires extra storage.
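A deliberately simplified sketch of the MESI write-invalidate transitions in point 1; real protocols also handle data supply, write-back, and bus races, so the states and events here are only for illustration.

    #include <cstdio>

    enum class State { Modified, Exclusive, Shared, Invalid };
    enum class Event { LocalRead, LocalWrite, SnoopRead, SnoopWrite };

    // Next state of one cache line for a local access or a snooped bus event.
    State next(State s, Event e, bool otherSharers) {
        switch (e) {
        case Event::LocalRead:   // read miss fetches the line; hits keep their state
            return (s == State::Invalid)
                       ? (otherSharers ? State::Shared : State::Exclusive) : s;
        case Event::LocalWrite:  // writing requires ownership; other copies are invalidated
            return State::Modified;
        case Event::SnoopRead:   // another cache reads the same line
            return (s == State::Invalid) ? s : State::Shared;
        case Event::SnoopWrite:  // another cache writes: our copy becomes stale
            return State::Invalid;
        }
        return s;
    }

    int main() {
        State s = State::Invalid;
        s = next(s, Event::LocalRead, false);   // -> Exclusive (no other sharers)
        s = next(s, Event::LocalWrite, false);  // -> Modified
        s = next(s, Event::SnoopRead, false);   // -> Shared (line now shared)
        s = next(s, Event::SnoopWrite, false);  // -> Invalid (invalidated by remote write)
        std::printf("final state = %d\n", static_cast<int>(s));
    }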
17. Scalability in MIMD Systems (Expanded)
1 Granularity: Fine-grained vs coarse-grained tasks.
2 Communication: All-to-all is costly, nearest-neighbor scales better.
3 Topology: Low-diameter networks improve scalability.
4 Software: Efficient runtime, thread management, load balancing.
5 Metrics: Speedup, efficiency, isoefficiency function.
18. Assumptions and Rules for I/O in Parallel Programs
1 Shared file systems may serialize access.
2 I/O is slower than memory operations.
3 Best practices: Batch I/O, use collective I/O (MPI-IO), avoid small writes, use parallel file systems,
manage consistency with locks.
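A minimal MPI-IO sketch of the collective-I/O recommendation (the file name out.dat and the block size are illustrative): each rank writes one disjoint block of a shared file with a single collective call instead of many small independent writes.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1024;             // ints written per rank
        std::vector<int> buf(count, rank);  // each rank writes its own rank id

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);  // disjoint regions
        MPI_File_write_at_all(fh, offset, buf.data(), count, MPI_INT,
                              MPI_STATUS_IGNORE);  // collective: every rank participates

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }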
19. GPU Programming Overview
1 GPUs contain many cores optimized for data-parallel workloads.
2 Programming models: CUDA, OpenCL, Vulkan compute, high-level libraries.
3 Execution model: Threads organized into blocks and grids.
4 Memory hierarchy: Global, shared, registers, caches.
5 Performance: Maximize occupancy, coalesce memory accesses, avoid warp divergence.
6 Use cases: Linear algebra, deep learning, image processing.
20. Performance of MIMD Systems
1 Factors: Processor speed, memory hierarchy, interconnect, cache coherence, synchronization, I/O.
2 Metrics: Speedup, efficiency, throughput, latency, scalability.
3 Optimizations: Improve locality, reduce synchronization, use efficient communication, optimize
scheduling.
Technologies Used in Big Data Environments
1 In-Memory Analytics: Stores data in RAM for faster access and insights.
2 In-Database Processing: Analytics performed inside the database, avoiding export overhead.
3 Symmetric Multiprocessor System (SMP): Multiple processors share memory and I/O under one OS.
4 Massively Parallel Processing (MPP): Independent processors with local memory working in parallel.
5 Parallel vs Distributed Systems: parallel = tightly coupled processors sharing memory; distributed = loosely coupled nodes communicating over a network.
6 Shared Nothing Architecture: No shared memory or disk among processors; contrasts with Shared
Memory and Shared Disk.