HIGH PERFORMANCE COMPUTING – UNIT-1
Chapter 1: Modern Processor
1.1 Stored-program computer architecture
1.2 General-purpose cache-based microprocessor architecture.
1.2.1 Performance metrics and benchmarks.
1.2.2 Transistors galore: Moore’s Law.
1.2.3 Pipelining
1.2.4 Superscalarity
1.2.5 SIMD
1.3 Memory hierarchies
1.3.1 Cache
1.3.2 Cache mapping
1.3.3 Pre-fetch
1.4 Multicore processors
1.5 Multithreaded processors
1.6 Vector processors
Stored Program Architecture:
Before Stored Program Architecture:
Early computers like ENIAC used hard-wired programs:
Programs were not stored in memory.
Changing a program required manually rewiring the hardware.
This process was time-consuming and error-prone.
Evolution with EDVAC:
EDVAC (Electronic Discrete Variable Automatic Computer) was one of the
first computers to use stored program architecture.
It was proposed in 1945 by John von Neumann, who suggested that both
instructions and data should be stored in the same memory.
Hence, the architecture is often called the Von Neumann Architecture.
What is Stored Program Architecture?
A computer model where:
Program instructions (code) and data are both stored in main memory
(RAM).
The CPU fetches and executes instructions sequentially using a common
bus.
This model is used by almost all general-purpose computers today.
Based on SISD Model:
SISD: Single Instruction, Single Data
A single processor executes one instruction at a time on one data item.
Describes most traditional serial processors.
Represents the basic sequential processing model.
Why is Stored Program Architecture Important?
Flexible programming: Programs can be loaded, modified, and executed
easily.
Automatic execution: No need to rewire hardware to run different programs.
Efficient use of hardware: One memory for both code and data reduces
complexity.
Laid the foundation of modern computing.
How is Stored Program Architecture Structured?
Main Components:
CPU (Central Processing Unit):
Includes ALU, registers, and control unit
Memory:
Stores data and instructions
I/O System:
Handles input/output devices
Instruction Cycle:
Fetch: Get the instruction from memory
Decode: Identify the operation
Execute: Perform the operation (e.g., add, load, store)
Von Neumann Bottleneck
Instructions and data use the same bus for communication with memory.
Only one access can happen at a time (either data or instruction).
This causes a bottleneck and slows down performance, especially in high-speed computing.

General-Purpose Cache-Based Microprocessor Architecture
Why the name?
General purpose because these microprocessors are designed to execute a
wide range of applications, including scientific computing, everyday
software, and operating systems.
Cache-based indicates that the design includes multiple levels of cache
memory (like L1, L2) to reduce memory latency and increase speed.
The term microprocessor refers to a CPU implemented on a single chip.
What is a General-Purpose Cache-Based Microprocessor Architecture?
It is a hardware architecture for CPUs that:
Implements the stored-program digital computer model
Includes arithmetic units (for FP and INT operations), registers,
caches, and control logic
Executes code using a structured pipeline and execution units
Though extremely complex, only a small portion of the chip actually
performs computations (INT/FP units); the rest supports data movement and
control.
Components:
Main Memory
Memory Interface
L2 unified cache
L1 Data cache
L1 instruction Cache
Memory Queue
INT/FP Queue
FP register File
INT register File
Shift Mask
INT Operation
LD: Load (data transfer, memory to register)
ST: Store (data transfer, register to memory)
FP mult: Floating-Point Multiply
FP add: Floating-Point Add
Example :
LOAD R1, [R2 + 8]: Load the value at memory address R2 + 8 into register R1
Components Used:
L1 Instruction Cache – fetches the LOAD instruction.
INT Reg. File – provides the address base (value of R2).
Memory Queue – queues the load request.
L1 Data Cache – checks if data is cached.
L2 Unified Cache / Main Memory – accessed if L1 cache misses.
LD Unit – performs the actual data fetch.
INT Reg. File – stores the result in R1.
Transistors galore: Moore’s Law
Computers were used for scientific computing even before the personal-computer era.
Every ~2 years, the number of transistors in chips doubles.
More transistors = more complex logic = better performance.
Even though fabrication processes have changed dramatically (e.g., feature sizes shrank from 90 nm to 5 nm), the doubling trend has stayed on track.
More transistors allowed: better CPUs, more cores, more cache, faster instructions.
In practice, Moore’s Law has meant that as transistor counts increase, achievable performance increases as well.
Advanced Techniques Enabled by More Transistors:
Pipelined Functional Units
Superscalar Architecture
Data parallelism through SIMD (Single Instruction, Multiple Data)
Out-of-Order Execution
Larger caches
Simplified Instruction Set
Pipelined Functional Units :
The term “pipelined” means that multiple steps of an operation can be in
progress simultaneously, each working on a different input.
"Functional units" refer to hardware blocks like adders, multipliers, etc.,
inside the CPU that perform specific operations.
So, pipelined functional units are those CPU parts that break down complex
operations into smaller steps, allowing concurrent execution at different
stages.