Student Friendly Notes Module2

2.4.5 GPU Programming
GPUs work with a CPU host that manages memory, I/O, and program startup. They can run
thousands of threads, which execute in SIMD (Single Instruction, Multiple Data) groups. Each GPU
processor has small, fast caches and a block of shared memory. Performance drops when threads in
the same SIMD group take different branches, because the threads not on the branch currently being
executed sit idle. The hardware scheduler manages threads efficiently, but programmers must still
minimize branching.
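
A minimal CUDA sketch of this branching problem (the kernel name, array, and launch sizes are illustrative, not from the notes): even- and odd-indexed threads take different branches, so each SIMD group executes both paths in turn while the non-matching threads idle.

// Sketch of branch divergence: even and odd threads take different
// branches, so the hardware runs the two paths one after the other
// while the threads not on the current path sit idle.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = 2.0f * x[i];      // even-indexed threads
        else
            x[i] = x[i] + 1.0f;      // odd-indexed threads
    }
}

int main(void) {
    const int n = 1024;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    divergent<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}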

2.4.6 Programming Hybrid Systems


Hybrid systems combine a shared-memory API (within a node) with a distributed-memory API (between
nodes). They are mainly used in high-performance applications such as scientific simulations.
Development is more complex, so many programmers prefer a single distributed-memory API across the
whole system for simplicity. Hybrid models are powerful but harder to write and maintain.
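
A minimal hybrid sketch under the usual assumption of MPI for the distributed-memory part and OpenMP for the shared-memory part (compile, for example, with an MPI wrapper plus an OpenMP flag; the program itself is illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

// Hybrid sketch: MPI runs one process per node (distributed memory),
// OpenMP runs several threads inside each process (shared memory).
int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        // threads share the process's memory; processes communicate via MPI
        printf("process %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
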
2.5.1 Input and Output in MIMD Systems
Parallel I/O involves multiple processes or threads accessing the console, disks, or other devices at
the same time. Most parallel programs do little I/O, but when several processes or threads call
printf or scanf, the behavior is nondeterministic. To avoid confusion, usually only one process or
thread handles input, and when many processes or threads print together their output may appear in a
mixed or jumbled order.
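
A minimal MPI sketch of the usual convention (the use of process 0 and the variable name n are illustrative): only one process reads stdin and then shares the value, while output from all processes may interleave unpredictably.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 // only one process touches stdin
        printf("Enter n: ");
        fflush(stdout);
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   // share the input

    // every process printing at once gives nondeterministic ordering
    printf("process %d got n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}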

2.5.2 Input and Output on GPUs
In GPU programming, the host CPU usually performs all input and output. GPU threads can write to
stdout during debugging, but the order of their output is unpredictable. GPU threads have no access
to stdin, stderr, or secondary storage, so I/O is centralized on the CPU.
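
A small CUDA sketch of debug printing from device code (the kernel name and launch configuration are illustrative); the lines may appear in any order, and all other I/O stays on the host:

#include <cstdio>

__global__ void debug_print(void) {
    // device-side printf is allowed, but the output order is unpredictable
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    debug_print<<<2, 4>>>();
    cudaDeviceSynchronize();   // wait for the kernel and flush its printf output
    return 0;
}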

2.6.1 Speedup and Efficiency in MIMD Systems


Speedup: S = Tserial / Tparallel. Ideally the speedup equals the number of cores p (linear speedup).
Efficiency: E = S / p = Tserial / (p × Tparallel), which shows how well the cores are being used. In
practice, parallel overhead (mutex locks, communication delays) keeps S below p, and as more cores
are added the overhead grows, so efficiency drops. Plots of speedup and efficiency for different
problem sizes show that both usually improve as the problem size grows.
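
A worked example with made-up timings (the numbers are illustrative, not from the notes):

#include <stdio.h>

int main(void) {
    double t_serial   = 24.0;  // hypothetical serial run time in seconds
    double t_parallel =  4.0;  // hypothetical run time on p cores
    int    p          =  8;    // number of cores

    double S = t_serial / t_parallel;  // speedup = 6.0 (< p because of overhead)
    double E = S / p;                  // efficiency = 0.75

    printf("speedup = %.2f, efficiency = %.2f\n", S, E);
    return 0;
}
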
2.6.2 Amdahl’s Law
Amdahl’s Law states that the maximum speedup of a program is limited by its serial part. If a
fraction r of the run time is inherently serial, the speedup on p cores is at most 1 / (r + (1 - r)/p),
which approaches 1/r as p grows. For example, if 10% of the code is serial (r = 0.1), the maximum
speedup is at most 10, no matter how many cores are used. Even with perfect parallelization of the
rest, serial sections cap performance, which is why minimizing serial code is crucial for scalability.
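
A short sketch that evaluates the Amdahl bound for a 10% serial fraction, showing the speedup approaching but never exceeding 10:

#include <stdio.h>

int main(void) {
    double r = 0.1;                              // serial fraction (10%)
    for (int p = 1; p <= 1024; p *= 4) {
        double s = 1.0 / (r + (1.0 - r) / p);    // Amdahl's law bound
        printf("p = %4d   speedup <= %.2f\n", p, s);
    }
    return 0;
}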

2.6.3 Scalability in MIMD Systems


A program is scalable if its efficiency can be kept constant as the number of processes or threads
increases (possibly by also increasing the problem size). Strong scalability: efficiency stays
constant without increasing the problem size. Weak scalability: efficiency stays constant when the
problem size grows at the same rate as the number of processes. Example: if the number of processes
is multiplied by k, the problem size must also be multiplied by k for weak scalability.
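
A small numeric sketch under an assumed cost model (Tserial = n, Tparallel = n/p + 1, so E = n / (n + p); the model and numbers are illustrative): scaling n and p by the same factor k leaves the efficiency unchanged, which is exactly weak scalability.

#include <stdio.h>

int main(void) {
    double n = 1000.0;   // base problem size (illustrative)
    double p = 10.0;     // base number of processes (illustrative)

    for (int k = 1; k <= 16; k *= 2) {
        // assumed model: E = Tserial / (p * Tparallel) = n / (n + p)
        double eff = (k * n) / (k * n + k * p);
        printf("k = %2d   efficiency = %.3f\n", k, eff);
    }
    return 0;
}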

2.6.4 Taking Timings of MIMD Programs


Timings measure how fast a parallel program runs. Wall-clock time is preferred over CPU time
because it includes time spent waiting (for messages, locks, and so on). A barrier is used to
synchronize the processes or threads just before timing starts, and the reported time is usually the
maximum elapsed time over all processes. The program is run several times and the minimum of those
times is reported, since it is the least distorted by interference from the system.
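
A common MPI timing pattern matching this description (a sketch; Work() is a placeholder for the code being timed):

#include <mpi.h>
#include <stdio.h>

void Work(void) { /* ... code being timed ... */ }

int main(int argc, char *argv[]) {
    int rank;
    double start, elapsed, max_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);      // start everyone together
    start = MPI_Wtime();              // wall-clock time, not CPU time
    Work();
    elapsed = MPI_Wtime() - start;

    // the slowest process determines the parallel run time
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("elapsed time = %e seconds\n", max_elapsed);

    MPI_Finalize();
    return 0;
}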

2.6.5 GPU Performance


GPU performance is usually reported as the speedup of a GPU program over a CPU program, but the
notions of linear speedup and per-core efficiency do not carry over directly. Scalability for GPUs is
defined informally: a scalable program runs faster when given a larger, more powerful GPU. Amdahl’s
Law still applies when part of the program runs serially on the CPU. Performance is measured with
timers from either the CPU or the GPU API, and even small speedups can be useful in real applications.
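
A sketch of GPU-side timing with CUDA events (dummy_kernel and the launch configuration are placeholders):

#include <cstdio>

__global__ void dummy_kernel(void) { }

int main(void) {
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<128, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until the kernel has finished

    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time = %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}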
