Computer Architecture
• Refers to the abstract model and design of a computer system.
• It includes the instruction set architecture (ISA), data formats, addressing modes, and
how instructions are executed.
• Example: The x86 or ARM instruction set used by modern CPUs.
Computer Organization
• Refers to the physical implementation of a computer system.
• It focuses on how hardware components like the control unit, ALU, memory, and I/O are
connected and work together.
• Example: How a processor executes an instruction using control signals and registers.
3. Registers
• Small, fast storage units inside the CPU.
• Types: general-purpose and special-purpose (specific) registers, described in detail in Section 2.1.
ARM Processors:
• ARM1 (1985): Developed by Acorn Computers.
• Known for RISC (Reduced Instruction Set Computing).
• Focus on low power, high efficiency.
• Widely used in mobile phones, embedded systems, and IoT devices.
Performance Metrics:
• Throughput: Number of tasks a system can complete in a given time (e.g., MIPS).
Factors Affecting Performance:
1. Architecture Design
o Factors include pipelining, cache size, branch prediction, and bus width.
o CISC (Complex Instruction Set Computing) has more powerful instructions but
can be slower.
o RISC (Reduced Instruction Set Computing) uses simpler instructions for faster
execution.
What is Benchmarking?
• A method of measuring and comparing the performance of processors using standard
programs or tests.
Use of Benchmarking
• Helps users and manufacturers evaluate CPU performance.
• Used to:
o Identify bottlenecks.
o Optimize performance.
Examples of Benchmarks:
• SPEC (Standard Performance Evaluation Corporation): Measures CPU and system
performance.
• Geekbench, Cinebench, PassMark: Used in real-world and synthetic testing scenarios.
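To make the idea concrete, here is a minimal synthetic benchmark sketch in C (not a standard suite such as SPEC): it times a fixed arithmetic workload with the standard clock() function and reports a rough operations-per-second figure. The workload size and the modulo operation are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Minimal synthetic benchmark sketch: time a fixed arithmetic workload and
   report elapsed CPU time. Real suites use far more representative workloads. */
int main(void) {
    const long N = 100000000L;      /* workload size (arbitrary) */
    volatile long sum = 0;          /* volatile so the loop is not optimized away */
    clock_t start = clock();

    for (long i = 0; i < N; i++)
        sum += i % 7;

    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("Completed %ld operations in %.3f s (%.1f M ops/s)\n",
           N, seconds, (N / seconds) / 1e6);
    return 0;
}
```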
1. Memory-Mapped I/O:
• Device registers are assigned addresses within the normal memory address space, so the CPU accesses them with ordinary load/store instructions.
Importance:
• Ensures data exchange between processor and peripheral devices.
• Requires I/O controllers and ports.
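A minimal sketch of how memory-mapped I/O looks in C, assuming a hypothetical UART whose transmit register is mapped at address 0x40000000 (both the device and the address are illustrative, not taken from these notes):

```c
#include <stdint.h>

/* Hypothetical device register: a UART transmit-data register assumed to be
   memory-mapped at 0x40000000. 'volatile' tells the compiler every access
   must really happen, because it reaches hardware rather than ordinary RAM. */
#define UART_TX (*(volatile uint8_t *)0x40000000u)

void uart_send(uint8_t byte) {
    UART_TX = byte;   /* an ordinary store, but it goes to the I/O device */
}
```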
2. Interrupt-Driven I/O
• Concept: Processor is interrupted when an I/O device needs attention.
• Allows the CPU to perform other tasks until the device is ready.
• Steps:
o The device signals the CPU with an interrupt when it is ready.
o The CPU saves its current state and runs the interrupt service routine.
o After servicing the device, the CPU resumes the interrupted task.
• Benefit:
o Better multitasking.
2.1 Types of Computer Architecture (Including Harvard and Von Neumann Architectures)
Computer architecture refers to the design principles and structure of a computer system. There
are several types of architectures, but two foundational ones are Von Neumann and Harvard
architectures. These architectures define how memory, data, and instructions interact within a
system.
1. Von Neumann Architecture
Definition:
Proposed by John Von Neumann in 1945, this architecture uses a single memory for storing both
instructions and data.
Key Features:
• Single memory space for both data and programs.
• Instructions are fetched and executed sequentially.
• Uses a single bus for data and instructions, leading to bottlenecks.
• Simpler and cost-effective design.
2. Harvard Architecture
Definition:
This architecture uses separate memory for storing instructions and data, allowing simultaneous
access.
Key Features:
• Separate storage and buses for instructions and data.
• Allows simultaneous fetching of instructions and reading/writing of data.
• More efficient and faster than Von Neumann for certain applications.
• More complex and costly due to dual memory systems.
Advantages:
• No bottleneck between data and instruction transfer.
• Better performance in real-time and embedded systems.
Applications:
• Common in microcontrollers, digital signal processors (DSPs), and embedded systems.
4. Parallel Architecture:
• Uses multiple processing units working simultaneously so that many instructions or data items are handled at the same time.
The Instruction Cycle (Fetch-Decode-Execute)
Definition:
The instruction cycle, also called the fetch-decode-execute cycle, is the basic operational process
of a computer. It refers to the sequence of steps the CPU follows to fetch an instruction from
memory, decode it, and execute it.
This cycle repeats continuously while the computer is running.
1. Fetch
o The Control Unit (CU) retrieves the next instruction from memory.
2. Decode
o The Control Unit decodes the binary code into a form that the CPU can understand
(e.g., operation type, operands).
3. Execute
o The Arithmetic Logic Unit (ALU) or appropriate part of the CPU carries out the
instruction.
o Operations may include arithmetic (e.g., addition), logic (e.g., AND), data
movement (e.g., load/store), or control (e.g., jump).
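The cycle can be illustrated with a toy simulator in C. The two-field instruction format, the opcodes, and the small program below are invented purely for illustration; real instruction sets are far richer.

```c
#include <stdio.h>

/* Toy fetch-decode-execute loop for a made-up machine with an accumulator. */
enum { OP_LOAD, OP_ADD, OP_PRINT, OP_HALT };

typedef struct { int opcode; int operand; } Instr;

int main(void) {
    Instr program[] = {             /* the "memory" holding instructions */
        { OP_LOAD, 5 }, { OP_ADD, 3 }, { OP_PRINT, 0 }, { OP_HALT, 0 }
    };
    int pc = 0;                     /* Program Counter */
    int acc = 0;                    /* Accumulator */

    for (;;) {
        Instr ir = program[pc++];   /* FETCH: read the instruction, advance the PC */
        switch (ir.opcode) {        /* DECODE: select the action from the opcode   */
        case OP_LOAD:  acc = ir.operand;           break;   /* EXECUTE */
        case OP_ADD:   acc += ir.operand;          break;
        case OP_PRINT: printf("ACC = %d\n", acc);  break;
        case OP_HALT:  return 0;
        }
    }
}
```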
Key registers used in the cycle:
• Program Counter (PC): Holds the memory address of the next instruction.
• Memory Data Register (MDR): Holds the data being transferred to/from memory.
Definition:
A register set refers to a collection of high-speed, small-sized memory units located inside the
Central Processing Unit (CPU). Registers temporarily hold data, instructions, addresses, and
intermediate results that the CPU uses during processing.
Registers are faster than RAM and are crucial for CPU operations.
Characteristics of Registers:
• Very fast access time (faster than cache and RAM).
• Small in size, typically measured in bits (e.g., 8-bit, 16-bit, 32-bit, or 64-bit).
• Located directly inside the CPU.
• Used during instruction execution, arithmetic operations, memory access, and control
operations.
• Program Counter (PC): Holds the address of the next instruction to be fetched.
• Status Register (Flag Register): Contains flags that indicate the status of operations (e.g., Zero, Carry, Overflow).
1. Instruction Execution:
o Registers like IR, PC, and ACC play key roles in fetching and executing
instructions.
2. Data Storage:
o General-purpose registers and the accumulator temporarily hold operands and intermediate results.
3. Address Handling:
o Registers like MAR and PC handle memory addressing and program sequencing.
4. Control Flow:
o Registers like SP and status flags are essential in managing procedure calls, loops,
and branching.
Example in Practice:
When the CPU executes ADD R1, R2, the PC supplies the instruction's address, the IR holds the fetched instruction, R1 and R2 supply the operands, and the ALU result is written back to a register.
2.1 Describe different CPU Registers Including General purpose and Specific Registers
CPU Registers
Registers are small, fast storage locations inside the Central Processing Unit (CPU). They are
used to store data, addresses, instructions, and intermediate results during processing.
Registers are essential to the functioning of the instruction cycle and overall CPU operations.
1. General-Purpose Registers
These are flexible, multipurpose registers used by the CPU to perform arithmetic, logic, and data
manipulation operations. They temporarily hold operands and intermediate values during
program execution.
• R0 - Rn: Typically labeled R0, R1, R2, ..., Rn depending on the architecture (e.g., 8 or 16 registers in RISC). Used for holding temporary data.
• AX (Accumulator Register): Used in arithmetic and logic operations. Common in x86 architecture.
• DX (Data Register): Used for I/O operations and extended precision arithmetic.
Note: In RISC (e.g., ARM) architectures, general-purpose registers are usually named R0 to R15.
2. Special-Purpose Registers (Specific Registers)
These registers serve dedicated roles in the CPU for control and memory operations, instruction
execution, and status reporting.
• MAR (Memory Address Register): Holds the address of the memory location to read from or write to.
• FLAGS / PSW (Flag Register or Program Status Word): Holds condition flags (e.g., Zero, Carry, Sign, Overflow) that affect instruction flow and decision-making.
Common Status Flags:
• Overflow (O): Set if the result of an operation exceeds the data size.
Each CPU architecture (like x86, ARM, MIPS, etc.) has its own unique instruction set.
1. Data Transfer Instructions
o Examples:
▪ MOV R1, R2 – Copy contents of R2 into R1
▪ LOAD A, [1000] – Load data from memory location 1000 into register A
▪ STORE A, [1000] – Store contents of A into memory location 1000
2. Arithmetic Instructions
o Examples:
▪ ADD R1, R2 – Add R1 and R2
▪ SUB R3, R4 – Subtract contents of R4 from R3
3. Logical Instructions
o Examples:
▪ AND R1, R2
▪ OR R3, R4
▪ XOR R5, R6
▪ NOT R1
4. Control Transfer Instructions
o Examples:
▪ JMP 200 – Jump to instruction at address 200
5. Input/Output Instructions
o Examples:
▪ IN R1, PORT1 – Input from port1 to R1
▪ OUT PORT2, R2 – Output contents of R2 to port2
6. Shift and Rotate Instructions
o Examples:
▪ SHL R1, 1 – Shift bits of R1 left by 1
▪ SHR R2, 2 – Shift bits of R2 right by 2
▪ ROL R3 – Rotate bits of R3 left
Instruction format styles:
• RISC:
o Fixed-length instructions
o Faster execution
o Example: ARM, MIPS
• CISC:
o Variable-length instructions
1. Data Transfer Instructions
Purpose:
To move data between registers, memory, and I/O devices.
Examples:
• MOV R1, R2: Copy contents of R2 into R1.
• LOAD R1, [1000]: Load data from memory location 1000 into R1.
• STORE R1, [1000]: Store contents of R1 into memory location 1000.
2. Arithmetic Instructions
Purpose:
To perform mathematical operations such as addition, subtraction, multiplication, and division.
Examples:
• ADD R1, R2: Add R1 and R2.
• SUB R3, R4: Subtract contents of R4 from R3.
3. Logical Instructions
Purpose:
To perform bitwise operations and logical comparisons.
Examples:
• AND R1, R2: Perform bitwise AND on R1 and R2.
• OR R1, R2: Perform bitwise OR.
• NOT R1: Invert bits of R1.
4. Control Flow Instructions
Purpose:
To alter the flow of execution based on conditions or jump to subroutines.
Examples:
• JMP 200: Jump to instruction at address 200.
• CALL 300: Call subroutine at 300.
• RET: Return from subroutine.
• JZ, JNZ, JE, JNE: Conditional jumps based on flag status.
5. Input/Output (I/O) Instructions
Purpose:
To enable communication with external devices such as keyboards, screens, printers, and
sensors.
Use in Program Development:
• Reading user input.
• Displaying output to the screen or other devices.
• Sending and receiving data in embedded or real-time systems.
Examples:
• IN R1, PORT1: Input data from port1 to R1.
• OUT PORT2, R1: Output contents of R1 to port2.
6. Shift and Rotate Instructions
Purpose:
To shift or rotate the bits in a register, often used for fast multiplication/division by powers of two and for bit manipulation.
Examples:
• SHL R1, 1: Shift bits in R1 to the left.
• SHR R1, 1: Shift bits in R1 to the right.
• ROL R1: Rotate bits of R1 left.
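For reference, logical and shift instructions map almost directly onto C's bitwise operators, as the sketch below shows (rotate has no direct C operator and is usually built from shifts or a compiler intrinsic):

```c
#include <stdio.h>

/* Each C operator below typically compiles down to one of the instruction
   types listed above (AND, OR, XOR, NOT, SHL, SHR). */
int main(void) {
    unsigned int r1 = 0xF0, r2 = 0x3C;

    printf("AND: 0x%X\n", r1 & r2);          /* bitwise AND            */
    printf("OR : 0x%X\n", r1 | r2);          /* bitwise OR             */
    printf("XOR: 0x%X\n", r1 ^ r2);          /* bitwise XOR            */
    printf("NOT: 0x%X\n", (~r1) & 0xFF);     /* bitwise NOT (8-bit mask) */
    printf("SHL: 0x%X\n", r1 << 1);          /* shift left by 1        */
    printf("SHR: 0x%X\n", r2 >> 2);          /* shift right by 2       */
    return 0;
}
```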
Definition:
The Instruction Set Architecture (ISA) is the interface between hardware and software of a
computer. It defines how the processor understands and executes instructions—including the
types of instructions available, their format, and how they interact with memory and registers.
In simple terms, the ISA is the programmer's view of the computer hardware, and it tells how
a CPU should behave in response to specific binary commands.
Key Components of an ISA:
1. Instruction Format
o Typically includes:
▪ Opcode: Operation code (e.g., ADD, LOAD).
▪ Operands: Registers or memory addresses involved in the operation.
▪ Addressing Mode: How operands are accessed (e.g., direct, immediate).
2. Instruction Types
o Data transfer, arithmetic, logical, control, and I/O instructions.
3. Data Types
o Defines the size and type of data the CPU can process (e.g., 8-bit, 16-bit, 32-bit
integers, floating-point numbers).
4. Registers
o Describes the number, type, and function of CPU registers available (e.g., general-
purpose, program counter, stack pointer).
5. Addressing Modes
o Defines how operands are located (e.g., immediate, direct, indirect, indexed).
6. Memory Model
o Describes how memory is accessed and managed (e.g., byte-addressable vs. word-
addressable).
Importance of ISA:
• Software Compatibility: Programs compiled for a specific ISA will only run on
processors that support that ISA.
• Hardware Flexibility: Allows hardware engineers to design different implementations of
the same ISA with varying performance and cost.
• Compiler Design: The compiler relies on ISA to generate machine code that the CPU can
understand.
a. RISC (Reduced Instruction Set Computing)
Definition:
RISC is an ISA design philosophy that uses a small number of simple instructions, each designed
to execute in a single clock cycle.
Key Characteristics:
• Fixed-length instructions
• Fewer instruction types
• Load/store architecture: Memory is accessed only via specific load and store instructions.
• Simplified addressing modes
• Emphasizes compiler optimization and hardware pipelining
Advantages:
• Faster execution due to simple instructions
• Easier to implement pipelining and parallelism
• Lower power consumption
Disadvantages:
• Programs may require more instructions
• More demand on memory bandwidth
b. CISC (Complex Instruction Set Computing)
Definition:
CISC uses a large and rich set of instructions, where each instruction may perform multiple
operations (e.g., memory access + arithmetic + condition check).
Key Characteristics:
• Variable-length instructions
• Supports complex operations with fewer instructions
• Rich addressing modes (immediate, direct, indirect, indexed)
• Emphasizes instruction-level efficiency
Advantages:
• Programs can be smaller in size (fewer instructions)
• Reduces the complexity of compiler design
• Good for code density in memory-constrained systems
Disadvantages:
• Slower instruction decoding
• More difficult to pipeline
• Higher power consumption
c. Stack-Based Architecture
• Uses a stack to hold intermediate values rather than registers
• Simple design but limited flexibility
• Example: Java Virtual Machine (JVM)
Definition:
Addressing modes are the methods used by the CPU to locate the operands (data values) required
for executing an instruction. In simple terms, addressing modes define how and where the CPU
should look for the data to be processed.
Each addressing mode offers a different way to access data from memory, registers, or directly
from the instruction itself.
Types of Addressing Modes:
• Immediate: The operand value is given directly in the instruction.
• Direct: The instruction contains the memory address of the operand.
• Indirect: The instruction points to a register or memory location that holds the operand's address.
• Indexed: The operand address is formed by adding an index (offset) to a base address.
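The C sketch below gives rough high-level analogies for these modes; the mapping is illustrative only, since real addressing modes are encoded in machine instructions, and the "memory" array simply stands in for data memory.

```c
#include <stdio.h>

/* Rough high-level analogies for common addressing modes (illustrative only). */
int memory[16];

int main(void) {
    memory[10] = 42;

    int r1;
    int *ptr = &memory[10];
    int index = 2;

    r1 = 5;                 /* Immediate: the operand (5) is part of the instruction    */
    r1 = memory[10];        /* Direct: the instruction names the address (10)           */
    r1 = *ptr;              /* Indirect: a register/pointer holds the operand's address */
    r1 = memory[8 + index]; /* Indexed: base address (8) plus an index register         */

    printf("r1 = %d\n", r1);
    return 0;
}
```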
The Arithmetic and Logic Unit (ALU) is a core component of the Central Processing Unit
(CPU) responsible for carrying out arithmetic operations (like addition and subtraction) and
logic operations (like AND, OR, NOT). It is the part of the CPU where actual data processing
takes place.
1. Arithmetic Operations
o Addition
o Subtraction
o Multiplication
o Division
2. Logic Operations
o AND, OR, NOT (bitwise and logical comparisons)
Components of an ALU:
• Arithmetic and logic circuits that compute results, plus a flag (status) register that records conditions produced by each operation.
Status Flags:
• Carry (C): Set if there is a carry out of the most significant bit in addition.
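The following C sketch shows how an ALU could derive Carry, Zero, Sign, and Overflow flags for an 8-bit addition; the function and variable names are illustrative, but the flag conditions follow the usual conventions.

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of flag generation for an 8-bit addition. */
void add8(uint8_t a, uint8_t b) {
    uint16_t wide   = (uint16_t)a + b;   /* keep the 9th bit to detect carry */
    uint8_t  result = (uint8_t)wide;

    int carry    = (wide > 0xFF);                          /* C: carry out of bit 7 */
    int zero     = (result == 0);                          /* Z: result is zero     */
    int sign     = (result & 0x80) != 0;                   /* S: sign bit set       */
    int overflow = (~(a ^ b) & (a ^ result) & 0x80) != 0;  /* O: signed overflow    */

    printf("%d + %d = %d  C=%d Z=%d S=%d O=%d\n", a, b, result, carry, zero, sign, overflow);
}

int main(void) {
    add8(200, 100);   /* unsigned carry */
    add8(100, 100);   /* signed overflow: 100+100 wraps to -56 in 8-bit two's complement */
    return 0;
}
```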
Definition:
A Graphical Processing Unit (GPU) is a specialized electronic circuit designed to accelerate the
processing of images, videos, and complex graphical computations. Unlike the Central
Processing Unit (CPU), which is designed for general-purpose tasks, a GPU is optimized for
parallel processing—making it extremely effective for tasks that involve large-scale data
operations, such as rendering graphics, video editing, and machine learning.
1. Rendering Graphics: The primary purpose of a GPU is to convert data into images by
performing calculations related to lighting, shading, textures, and object positioning in
2D/3D spaces.
2. Parallel Data Processing: GPUs can process thousands of threads simultaneously,
which makes them ideal for performing repetitive tasks across large data sets.
3. Acceleration of Complex Computations: Modern GPUs are also used for scientific
computing, AI training, cryptocurrency mining, and more.
CPU vs. GPU:
• Thread Handling: a CPU handles a few threads efficiently, while a GPU handles thousands of threads in parallel.
Components of a GPU:
2. Video Memory (VRAM):
o High-speed memory (e.g., GDDR6) used to store textures, data, and intermediate results.
3. Shader Units:
o Perform pixel shading, vertex calculations, lighting, and other rendering effects.
Types of GPUs:
1. Integrated GPUs:
o Built into the same chip/package as the CPU and share system memory; common in laptops and budget systems.
2. Dedicated (Discrete) GPUs:
o Separate graphics cards with their own video memory, offering higher performance.
Applications of GPUs:
1. Gaming:
2. Scientific Computing:
4. Cryptocurrency Mining:
5. Medical Imaging:
GPU Programming Interfaces:
• OpenCL:
o Open standard for parallel programming of diverse platforms including GPUs and
CPUs.
• DirectX / OpenGL / Vulkan:
o Graphics APIs used by games and applications to access GPU rendering features.
Definition:
The Control Unit (CU) is a fundamental component of the Central Processing Unit (CPU)
responsible for directing the operations of the processor. It controls how data moves within the
CPU, to and from memory, and between input/output devices. Essentially, it acts as the "brain
within the brain"—coordinating and sequencing all actions taken by the computer.
Functions of the Control Unit include:
1. Instruction Fetching: Retrieves instructions from memory.
2. Instruction Decoding: Interprets each instruction to determine the required operation.
3. Data Flow Control: Directs the movement of data within the CPU, to and from memory, and between I/O devices.
4. Control Signals: Generates and sends signals to control the flow of data and the operation of hardware components.
Basic operation:
1. Fetch the next instruction from memory.
2. Decode it to determine the operation and operands.
3. Execute by sending control signals to the appropriate units (ALU, registers, memory, I/O).
4. Repeat for the next instruction.
Types of Control Units:
1. Hardwired Control Unit
Definition:
A control unit where control signals are generated by hardware logic circuits (gates, flip-flops,
decoders, etc.).
Features:
• Fast and efficient.
• Difficult to modify (not flexible).
• Best suited for simple and fixed instruction sets.
Advantages:
• High performance (speed).
• Low latency signal generation.
Disadvantages:
• Hard to change or upgrade.
• Complex to design for large instruction sets.
Use Cases:
• Used in high-speed processors or embedded systems with fixed functionality.
2. Microprogrammed Control Unit
Definition:
A control unit where control signals are generated by executing microinstructions (microprograms) stored in control memory.
Components:
• Control memory: Stores microprograms.
• Control address register: Points to the current microinstruction.
• Control data register: Holds the microinstruction being executed.
Advantages:
• Easier to design and modify.
• Supports complex and large instruction sets.
Disadvantages:
• Slower due to memory fetches.
• Higher control memory overhead.
Use Cases:
• Used in CISC (Complex Instruction Set Computer) architectures like Intel x86.
Comparison (Hardwired vs. Microprogrammed):
• Complexity: Complex for large instruction sets (hardwired) vs. simpler and more scalable (microprogrammed).
• Instruction Set: Best for small/fixed sets (hardwired) vs. suitable for complex instructions (microprogrammed).
Key Control Unit Component:
• Control Signal Generator: Generates control signals for all other CPU components.
The Control Unit (CU) is the component of the CPU (Central Processing Unit) that directs the
operations of the processor by generating control signals to guide the execution of instructions.
There are two primary design approaches for building control units:
1. Hardwired Control Unit
Definition:
A Hardwired Control Unit uses combinational logic circuits (like gates, flip-flops, and
decoders) to generate control signals directly based on the instruction opcode and the current
timing step.
How It Works:
• The control logic is implemented with hardware circuits.
• The instruction is decoded, and based on its type and the timing signals, control signals
are activated.
• The circuit is pre-designed for a specific instruction set.
Advantages:
• Very fast (ideal for high-speed processors).
• No memory needed for control instructions.
Disadvantages:
• Difficult to modify or upgrade (not flexible).
• Complex for large instruction sets (hard to manage as logic becomes dense).
2. Microprogrammed Control Unit
Definition:
A Microprogrammed Control Unit uses a set of microinstructions stored in a special memory
called the control memory. These microinstructions are fetched and executed to generate the
required control signals.
How It Works:
• The control unit contains a control memory with microprograms.
• Each machine instruction maps to a sequence of microinstructions.
• The control memory outputs microinstructions which generate control signals.
Advantages:
• Easier to modify and extend (microcode can be changed).
• Simplifies complex instruction sets (ideal for CISC architectures).
Disadvantages:
• Slower than hardwired control units due to microinstruction fetch time.
• Consumes more memory for microprogram storage.
Definition:
In computer systems, various types of memory and storage devices are used to hold data
temporarily or permanently. Each type has specific characteristics such as speed, size, cost,
volatility, and accessibility that make it suitable for particular tasks.
Below is a detailed explanation of the key characteristics of different types of memory and storage
devices.
1. Registers
Characteristics:
• Speed: Fastest memory (1 nanosecond or less).
• Volatility: Volatile (data is lost when power is off).
• Size: Very small (few bytes).
• Location: Inside the CPU.
• Purpose: Holds data currently being processed, such as instructions and immediate results.
• Accessibility: Directly accessed by the CPU without delay.
2. Cache Memory
Characteristics:
• Speed: Very fast (faster than RAM, slower than registers).
• Volatility: Volatile.
• Size: Small (KB to a few MB).
• Location: Between CPU and main memory.
• Purpose: Stores frequently accessed data and instructions to reduce memory access time.
• Types: L1, L2, and L3 cache (by level and proximity to the CPU core).
3. RAM (Main Memory)
Characteristics:
• Speed: Fast (slower than cache).
• Volatility: Volatile.
• Size: Medium (GBs, e.g., 4GB – 64GB).
• Location: On the motherboard.
• Purpose: Temporarily stores data and programs that are in use.
• Types: SRAM and DRAM (main memory typically uses DRAM).
4. ROM (Read-Only Memory)
Characteristics:
• Speed: Slower than RAM.
• Volatility: Non-volatile (retains data when power is off).
• Size: Small (KB to MB).
• Purpose: Stores firmware and bootloader.
• Types: PROM, EPROM, EEPROM.
Optical Discs (CD/DVD)
Characteristics:
• Speed: Slow (compared to HDD/SSD).
• Volatility: Non-volatile.
• Size: Small to medium (700MB for CD, 4.7GB for DVD).
• Purpose: Distribution of media, backup.
• Durability: Can degrade over time; sensitive to scratches.
• Usage: Becoming obsolete due to flash storage and cloud.
8. Cloud Storage
Characteristics:
• Speed: Depends on internet connection.
• Volatility: Non-volatile.
• Size: Virtually unlimited.
• Purpose: Online storage, backup, remote access.
• Durability: Data stored in data centers with redundancy.
• Cost: Subscription-based or free with limits.
4.3 How Data is Stored and Retrieved in Different Memory and Storage Devices
Overview:
In computer systems, data storage and retrieval vary depending on the type of memory or storage
device involved. These processes are influenced by the device’s structure, access method, speed,
and volatility. Understanding how data is stored and retrieved in different types of memory helps
in optimizing system performance and reliability.
1. Registers (CPU Registers)
Storage:
• Data is stored directly using electrical flip-flops or latches within the CPU.
• Each register is identified by name (e.g., AX, BX, PC).
Retrieval:
• CPU directly accesses and retrieves data from registers during instruction execution.
• Access is instantaneous and requires no addressing mechanism.
Speed: Fastest
Volatility: Volatile (data lost when power is off)
2. Cache Memory
Storage:
• Uses SRAM (Static RAM) technology.
• Data and instructions that are frequently accessed from RAM are automatically stored in
cache.
Retrieval:
• When the CPU requests data:
o If it is present in the cache (a cache hit), it is returned immediately.
o If not (a cache miss), it is fetched from RAM and a copy is placed in the cache.
3. RAM (Main Memory)
Storage:
• Data is stored in memory cells, each with a unique address.
• The CPU uses address buses to write data to specific memory locations.
Retrieval:
• The CPU sends a memory address via the address bus.
• The RAM locates the data and sends it back via the data bus.
4. ROM (Read-Only Memory)
Retrieval:
• Data is retrieved using the address lines, like RAM.
• The CPU reads the content but cannot modify it during execution.
5. Secondary Storage
A. Hard Disk Drive (HDD)
Storage:
• Data is stored magnetically on rotating platters in sectors and tracks.
• A read/write head moves to the correct track to write bits as magnetic patterns.
Retrieval:
• The read/write head locates the correct track and sector.
• Data is read magnetically and passed to RAM via the I/O interface.
Access Method: Semi-sequential/random access
Volatility: Non-volatile
Speed: Relatively slow due to mechanical parts
B. Solid-State Drive (SSD)
Storage:
• Uses flash memory (NAND chips) to store data as electrical charges in cells.
• Organized into blocks and pages.
Retrieval:
• Controller fetches the data electrically without moving parts.
• Fast access to specific pages.
6. Optical Discs (CD/DVD)
Storage:
• Data is stored as pits and lands on a reflective disk surface.
• A laser burns these patterns during writing.
Retrieval:
• A laser beam scans the surface.
• Reflection differences (pits and lands) are translated into binary data.
7. Flash Memory (USB Drives, Memory Cards)
Storage:
• Data is stored by trapping electrons in floating gate transistors.
• Each bit is stored as a charge state (1 or 0).
Retrieval:
• Controller reads voltage levels to determine stored bits.
• Access is block-based.
8. Cloud Storage
Storage:
• Data is stored on remote servers in data centers.
• Uses a combination of SSDs, HDDs, and backup tapes.
Retrieval:
• Data is accessed over the internet via protocols (e.g., HTTP, FTP).
• Requires authentication and encryption for security.
4.4 Performance Differences Among Various Types of Memory and Storage Devices
Overview:
The performance of memory and storage devices in a computer system depends on several factors,
including speed (latency and bandwidth), capacity, cost, volatility, and power consumption.
These differences affect how quickly data can be accessed, how much data can be stored, and how
efficiently a system operates.
Key Performance Metrics:
• Access Time (Latency): Time taken to read/write a single piece of data. Lower is better.
• Bandwidth: Amount of data that can be transferred per second. Higher is better.
• Cost per Bit: Expense to store one bit of data. Lower is more cost-effective for large storage.
1. CPU Registers
• Speed: Fastest (nanoseconds or less)
• Latency: Almost zero (directly accessed by CPU)
• Bandwidth: Extremely high
• Capacity: Very limited (bytes)
• Use Case: Immediate data used by the CPU
• Cost: Very high per bit
9. Cloud Storage
• Speed: Depends on internet speed
• Latency: High (due to network)
• Bandwidth: Varies with internet and provider
• Capacity: Virtually unlimited
• Use Case: Remote access, backup, collaboration
• Cost: Subscription-based
Definition:
Virtual Memory is a memory management technique that allows a computer to compensate
for physical memory (RAM) limitations by temporarily transferring data from RAM to disk
storage. It enables a system to execute large programs or multiple programs simultaneously,
even if the physical RAM is insufficient.
Key Concepts:
1. Address Translation:
o The Memory Management Unit (MMU) maps the virtual addresses used by programs to physical addresses in RAM.
2. Paging:
o Memory is divided into fixed-size blocks: pages (virtual memory) and frames (physical memory).
3. Page Table:
o A data structure that keeps track of the mapping between virtual pages and physical frames.
4. Page Fault:
o Occurs when a program references a page that is not currently in RAM.
o The operating system retrieves the page from disk into RAM.
o Resumes execution.
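A minimal sketch of paged address translation in C, assuming 4 KB pages and a flat, single-level page table; real MMUs use multi-level tables, TLBs, and hardware support, so this only illustrates the page-number/offset split and the page-fault case.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 16u

/* page_table[virtual page] = physical frame number, or -1 if not in RAM */
long translate(const int32_t page_table[], uint32_t vaddr) {
    uint32_t page   = vaddr / PAGE_SIZE;    /* virtual page number  */
    uint32_t offset = vaddr % PAGE_SIZE;    /* offset within a page */

    if (page >= NUM_PAGES || page_table[page] < 0)
        return -1;                          /* page fault: the OS would load the page from disk */

    return (long)page_table[page] * PAGE_SIZE + offset;   /* physical address */
}

int main(void) {
    int32_t page_table[NUM_PAGES];
    for (uint32_t i = 0; i < NUM_PAGES; i++) page_table[i] = -1;  /* nothing resident yet */
    page_table[1] = 7;                      /* virtual page 1 is held in physical frame 7 */

    printf("0x1ABC -> %ld\n", translate(page_table, 0x1ABC));   /* hit: 7*4096 + 0xABC    */
    printf("0x2ABC -> %ld\n", translate(page_table, 0x2ABC));   /* miss: -1 => page fault */
    return 0;
}
```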
Uses of Virtual Memory
5. Simplified Programming
• Programmers can write programs as if infinite memory is available.
Advantages:
• Efficient memory usage: Only the necessary parts of programs are loaded into RAM.
• Simplified program development: Developers don't need to worry about exact memory availability.
Disadvantages:
• Slower than physical RAM: Disk access is much slower than RAM.
• Page faults cause delays: Excessive page swapping (thrashing) degrades performance.
• Requires disk space: Part of the disk is needed for a swap file or page file.
Real-Life Example:
If your computer has 4 GB of RAM and you're running programs that require 6 GB, virtual
memory will use disk space (e.g., 2 GB) as an extension of RAM, ensuring smooth operation —
though with reduced speed compared to real RAM.
5.1 Instruction Level Parallelism (ILP) and Its Importance in Increasing Computing
Performance
Instruction-Level Parallelism (ILP) is the ability of a processor to execute multiple independent instructions at the same time. In simple terms, ILP means doing more work at once within the processor by overlapping the execution of instructions.
5.2 Hardware and Software Techniques Used to Increase Instruction-Level Parallelism (ILP)
Hardware techniques are implemented within the processor architecture to dynamically identify
and exploit ILP during program execution.
1. Pipelining
• Breaks instruction execution into multiple stages (e.g., fetch, decode, execute, write-back).
• Allows overlapping execution of multiple instructions at different stages.
Result: Increases throughput and reduces instruction latency.
2. Superscalar Architecture
• Allows multiple instructions to be fetched, decoded, and executed per clock cycle.
• Uses multiple functional units (e.g., ALUs, FPUs).
Example: A dual-issue processor can process two instructions per cycle.
3. Out-of-Order Execution
• CPU executes instructions out of the original program order as long as data
dependencies are not violated.
• Keeps execution units busy and improves performance.
4. Branch Prediction
• Predicts the outcome of conditional branches (e.g., if-else) before they are executed.
• Ensures the pipeline stays filled by speculatively executing instructions.
5. Register Renaming
• Resolves false data dependencies by assigning new registers to avoid naming conflicts.
• Prevents Write-After-Read (WAR) and Write-After-Write (WAW) hazards.
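The idea can be sketched in C-like form with illustrative variable names: giving the second write a fresh destination removes the name conflict, so both statements can execute independently.

```c
#include <stdio.h>

/* Register renaming sketch. In the original sequence there is a WAR hazard on
   'a': the write in step 2 must not occur before the read in step 1. Renaming
   the destination to a fresh name ('a2') removes the conflict. */
int main(void) {
    int a = 1, c = 10;

    int b  = a + 5;     /* step 1: reads a                         */
    int a2 = c + 2;     /* step 2: renamed write (was: a = c + 2)  */

    printf("b=%d a2=%d\n", b, a2);   /* the hardware could now run both steps in parallel */
    return 0;
}
```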
6. Speculative Execution
• Executes instructions before it is certain they are needed (based on branch prediction).
• If the prediction is correct, the results are kept; if not, they are discarded.
Software techniques are applied by compilers and programmers to optimize code and expose
more parallelism.
1. Instruction Scheduling
• Compilers rearrange instructions to avoid stalls due to data hazards or delays.
• Example: Placing independent instructions between a load and its dependent instruction.
2. Loop Unrolling
• Reduces loop control overhead by executing multiple iterations in one loop pass.
• Increases the number of independent instructions for parallel execution.
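A simple C sketch of loop unrolling; the factor of four and the separate partial sums are illustrative choices that reduce loop overhead and expose independent additions the hardware can overlap.

```c
/* Straightforward loop: one addition and one branch test per element. */
long sum_rolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four (assumes n is a multiple of 4 to keep the sketch short):
   each pass does the work of four original iterations, and the four partial
   sums are independent of each other. */
long sum_unrolled(const int *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;   /* combine the independent partial sums */
}
```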
3. Software Pipelining
• Reorganizes loops so that different iterations are overlapped.
• Independent operations from multiple loop iterations are interleaved.
4. Function Inlining
• Reduces function call overhead and exposes more optimization opportunities.
• Allows better scheduling across what would have been function call boundaries.
Introduction:
Example: While waiting for the result of an instruction, other independent instructions can be
executed.
3. Increased Instruction Throughput
• Throughput is the number of instructions completed per cycle.
• A high degree of ILP allows multiple instructions per cycle, especially in superscalar
and out-of-order processors.
Effect: More instructions are retired per second, leading to higher performance.
1. Frequent Stalls
• Due to data or control dependencies, the pipeline stages wait for previous instructions to
complete.
3. Increased Latency
• Execution time for each instruction increases, as the pipeline isn’t fully optimized.
Illustrative Example:
Introduction
Two key dynamic techniques are used here:
1. Out-of-Order Execution
2. Speculative Execution
These techniques allow the processor to continue executing instructions even when the natural
(program) order would require it to wait due to dependencies or uncertainties (like branches).
A. Out-of-Order Execution
Definition:
Out-of-Order Execution is a technique where the processor executes instructions as soon as their
operands are available, not necessarily in the order they appear in the program. This
increases instruction throughput by avoiding unnecessary stalls.
How It Works:
1. Fetch and Decode: Instructions are fetched and decoded in program order.
2. Issue: Decoded instructions wait in a buffer until their operands are ready.
3. Dependency Check: The processor checks if the operands (data) needed for an instruction
are available.
4. Execution: If ready, the instruction is sent to the execution unit, even if earlier instructions
are not yet complete.
B. Speculative Execution
Definition:
Speculative Execution is a technique where the processor guesses the outcome of instructions
(typically branches) and executes subsequent instructions ahead of time. If the guess is correct,
execution continues without interruption. If incorrect, the speculative work is discarded.
How It Works:
1. Branch Prediction:
o The CPU predicts whether a branch will be taken before the condition is actually evaluated.
2. Speculative Execution:
o The CPU continues executing instructions beyond the branch as if the prediction was correct.
3. Commit or Rollback:
o If the prediction was correct, the speculative results are committed (kept).
o If wrong, the CPU rolls back and executes the correct path.
Introduction
Dependencies between instructions fall into three main categories, based on how they affect
parallel execution:
1. True Data Dependencies (Read-After-Write, RAW)
• An instruction needs a value produced by an earlier instruction.
Example:
1. A = B + C ; Instruction 1
2. D = A + E ; Instruction 2 reads the result of Instruction 1
• Implication: Instruction 2 cannot execute until Instruction 1 completes.
2. Name Dependencies
These dependencies exist because of shared register names, not because of actual data flow.
a. Write-After-Read (WAR)
• An instruction writes to a register after another instruction has read it.
Example:
1. B = A + 5 ; Reads A
2. A = C + 2 ; Writes A (must wait until the read above has happened)
b. Write-After-Write ( WAW)
• Two instructions write to the same register; the second must not overwrite the first
prematurely.
Example:
1. A = B + 1
2. A = C + 2
• Implication: WAR and WAW can often be resolved by register renaming.
• Impact on ILP: Can cause unnecessary stalls if not handled.
3. Control Dependencies
• Occur when an instruction depends on the outcome of a branch or conditional statement.
Example:
1. if (x > 0)
2. y = x + 1;
• The instruction in line 2 depends on whether the branch in line 1 is taken.
• Implication: Until the condition is evaluated, the processor cannot know which path to
execute.
• Impact on ILP: Limits the number of instructions that can be issued speculatively.
5.6 Flynn’s Classification of Parallel Computers
Introduction
Flynn's taxonomy classifies computers into four categories, using the concept of Instruction
Stream (IS) and Data Stream (DS):
1. SISD (Single Instruction, Single Data)
o Example: basic microcontrollers
2. SIMD (Single Instruction, Multiple Data)
o One instruction operates on many data items at once (e.g., vector/array processing).
3. MISD (Multiple Instruction, Single Data)
o Example: real-time systems with redundancy for fault detection (e.g., space shuttles)
4. MIMD (Multiple Instruction, Multiple Data)
o Example: parallel servers
Definition:
A multiprocessor system is a computer system with two or more processors (CPUs) that share
a common memory and are capable of executing multiple instructions simultaneously. These
processors collaborate to perform computational tasks more efficiently.
Types of Multiprocessor Systems
• Symmetric Multiprocessing (SMP): All processors have equal access to shared memory and operate under a single OS.
• Asymmetric Multiprocessing (AMP): One processor is the master and controls the others; often used in embedded systems.
Advantages of Multiprocessors
• Increased system throughput and performance.
• Faster execution of parallel tasks.
• Better reliability and fault tolerance.
• Efficient resource utilization in multitasking environments.
Definition:
Thread-Level Parallelism (TLP) is the ability of a processor to execute multiple threads
simultaneously, either on separate cores or interleaved on the same core. A thread is the
smallest unit of a program that can be scheduled for execution.
TLP vs ILP: ILP overlaps individual instructions within a single thread, while TLP runs entire threads concurrently; the two can be combined.
Types of Multithreading:
• Coarse-Grained Multithreading: Switches threads only when a stall occurs (e.g., memory access delay).
• Fine-Grained Multithreading: Switches threads every clock cycle, reducing idle time.
Advantages of TLP
• Better utilization of CPU resources.
• Improved responsiveness in multi-user/multitasking systems.
• Increased performance for parallel workloads (e.g., web servers, simulations).
• Enables smooth execution of background tasks.
A multiprocessor system is a computer system with two or more processing units (CPUs or
cores) that operate simultaneously. The architecture of such systems determines how processors
are connected, how they communicate, and how memory is shared. These architectural designs
impact performance, scalability, complexity, and the ability to execute parallel tasks efficiently.
Multiprocessor systems are generally classified based on how they manage memory and
coordinate processors. The major architectures are:
1. Shared Memory Architecture
Description:
In shared memory architecture, all processors have access to a common global memory.
Processors communicate and exchange data by reading and writing to the shared memory.
Characteristics:
• All CPUs access the same address space.
• Communication is implicit via memory operations.
• Requires memory synchronization mechanisms (e.g., locks, semaphores).
Memory Access Types:
• Non-Uniform Memory Access (NUMA): Access time varies depending on which processor accesses which memory segment.
Advantages:
• Easy to program (since memory is globally accessible).
• Efficient for tasks with frequent communication between processors.
Disadvantages:
• Scalability issues due to contention for shared memory.
• Complexity in managing cache coherence.
2. Distributed Memory Architecture
Description:
Each processor has its own private local memory. Processors communicate by explicit message
passing rather than shared memory.
Characteristics:
• No global memory.
• Communication is done using interconnection networks.
• Used in cluster computing and massively parallel processors (MPPs).
Advantages:
• Highly scalable.
• Avoids memory contention and bottlenecks.
Disadvantages:
• Complex programming model (requires explicit communication).
• Data locality must be managed manually.
Examples:
• MPI-based (Message Passing Interface) systems.
• Beowulf clusters.
• Supercomputers like Cray and IBM Blue Gene.
3. Hybrid (Shared + Distributed) Memory Architecture
Description:
Combines both shared and distributed memory approaches. Often implemented in multi-core
clusters where each node has shared memory (among cores), and nodes communicate via message
passing.
Characteristics:
• Intra-node communication uses shared memory.
• Inter-node communication uses distributed memory.
Advantages:
• Balances ease of programming (within nodes) and scalability (across nodes).
• Efficient resource utilization.
Disadvantages:
• Increased system complexity.
• Requires hybrid programming models (e.g., OpenMP + MPI).
4. Symmetric Multiprocessing (SMP)
Characteristics:
• Shared memory.
• Single operating system instance.
• Common in desktop, server, and small multiprocessor systems.
Advantages:
• Simplified design.
• Efficient for a small number of processors.
Disadvantages:
• Not scalable beyond a certain number of processors.
5. Asymmetric Multiprocessing (AMP)
Description:
One processor is the master and controls the system; others are slaves that perform specific tasks.
Characteristics:
• Used in embedded systems or systems with mixed workloads.
• Limited or no memory sharing.
Advantages:
• Simpler task division.
• Useful in real-time or specialized systems.
Disadvantages:
• Poor fault tolerance.
• Not suitable for general-purpose parallel computing.
6.3 Concept of Cache Coherence and Memory Consistency in Multiprocessor Systems
Introduction
In multiprocessor systems, where multiple CPUs (or cores) have their own private caches and
share a main memory, ensuring data consistency becomes a major challenge. This leads to the
concept of:
• Cache Coherence
• Memory Consistency
These mechanisms help ensure that all processors see the correct and updated value of shared
variables and maintain the expected behavior of memory operations across processors.
A. Cache Coherence
Definition: Cache coherence ensures that all cached copies of a shared memory location reflect its most recent write.
Example: Consider two processors P1 and P2, both having a cached copy of variable X = 5.
• P1 updates X to 10 in its cache.
• P2 continues to read X as 5 from its own cache.
This creates an inconsistency — a cache coherence problem.
Cache Coherence Protocols:
• Write-through: Writes are immediately passed to the main memory and other caches are updated.
• Snoopy Protocol: All caches monitor (or "snoop") a common bus to observe and react to memory actions.
• Directory-based Protocol: A centralized directory keeps track of where copies of data exist and coordinates coherence.
These protocols define states and transitions for cache blocks based on reads/writes by other
processors.
B. Memory Consistency
While cache coherence ensures correctness of individual memory locations, memory consistency
defines the order in which memory operations (reads/writes) become visible across processors.
Memory Consistency Models:
• Strict Consistency: A read returns the most recent write instantly (hard to implement).
• Sequential Consistency: The result is the same as if operations were executed in some sequential order.
6.4 Role of Parallel Programming and How to Exploit Thread-Level Parallelism (TLP)
Introduction
Modern computing systems are built with multiple cores and processors. To fully utilize their
power, we need to write programs that can execute tasks concurrently — this is where parallel
programming and thread-level parallelism (TLP) come in.
• Parallel Programming is the technique of writing code that runs multiple instructions
or tasks simultaneously.
• Thread-Level Parallelism (TLP) refers to executing multiple threads of a program
concurrently to improve speed, efficiency, and performance.
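A minimal sketch of thread-level parallelism in C using POSIX threads (compile with -pthread); the array, the two-way split, and the names are illustrative choices, not part of the notes.

```c
#include <pthread.h>
#include <stdio.h>

/* Two threads sum independent halves of an array concurrently. */
#define N 1000000

static int data[N];

typedef struct { int start, end; long sum; } Task;

static void *partial_sum(void *arg) {
    Task *t = (Task *)arg;
    for (int i = t->start; i < t->end; i++)
        t->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;

    Task tasks[2] = { { 0, N / 2, 0 }, { N / 2, N, 0 } };
    pthread_t threads[2];

    for (int i = 0; i < 2; i++)                      /* fork: run both halves in parallel */
        pthread_create(&threads[i], NULL, partial_sum, &tasks[i]);
    for (int i = 0; i < 2; i++)                      /* join: wait for both threads */
        pthread_join(threads[i], NULL);

    printf("total = %ld\n", tasks[0].sum + tasks[1].sum);   /* prints 1000000 */
    return 0;
}
```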
Role of Parallel Programming in Modern Computing
3. Scalability
• Well-designed parallel programs can scale across many cores or nodes, making them
suitable for:
o Cloud computing
o Real-time systems
o Scientific simulations
o Video rendering
o Machine learning
o Web servers