GPU Architecture and Programming NPTEL Week 1 Assignment

The document discusses GPU architectures and programming, highlighting the evolution from VGA controllers to complex GPUs with programmable software. It covers key concepts such as the classic 5-stage RISC pipeline, memory hierarchy, and instruction level parallelism (ILP). Additionally, it outlines a course organization for teaching GPU architectures, including topics like CUDA programming and efficient neural network training.

The classic 5-stage RISC pipeline The Memory Hierarchy Instruction Level Parallelism (ILP)

GPU Architectures and Programming

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

December 5, 2019

- Fifteen years ago, graphics on a PC were performed by a video graphics array (VGA) controller.
- VGA controllers evolved into more complex hardware that accelerated graphics functions.
- Early GPUs and their associated drivers implemented the OpenGL and DirectX models (APIs) of graphics processing.
- Over time, functionality that was fixed in hardware became programmable in software.
Figure: Historical PC, showing the CPU, front side bus, north bridge, memory, PCI bus, south bridge, LAN, UART, VGA controller, framebuffer memory, and VGA display. Reproduced from Patterson and Hennessy, "Computer Organization and Design".

Figure: GPU architecture, showing the host CPU, bridge, and system memory; the GPU host interface; input assembler, Clip/Setup/Raster/ZCull, and HD video processor; vertex, pixel, and compute work distribution; TPCs, each holding SMs (I-cache, MT issue, C-cache, SPs, SFUs, shared memory) and a texture unit with Tex L1; the interconnection network; ROP/L2 units with DRAM; and the display interface. Reproduced from Hennessy and Patterson.

Course Organization

Topic                                            Week     Hours
Review of basic COA w.r.t. performance           1        2
Intro to GPU architectures                       2        3
Intro to CUDA programming                        3        2
Multi-dimensional data and synchronization       4        2
Warp Scheduling and Divergence                   5        2
Memory Access Coalescing                         6        2
Optimizing Reduction Kernels                     7        3
Kernel Fusion, Thread and Block Coarsening       8        3
OpenCL - runtime system                          9        3
OpenCL - heterogeneous computing                 10       2
Efficient Neural Network Training/Inferencing    11-12    6

Section 1

The classic 5-stage RISC pipeline


Basic RISC architecture

- The operation of a processor is characterized by a fetch ⇒ decode ⇒ execute cycle.
- RISC and CISC are two different philosophies of computing hardware design.
- RISC/CISC: Reduced/Complex Instruction Set Computing
- CISC approach: complete a task with as few instructions (instrs) as possible
- A CISC instruction: MUL addr1 addr2 addr3
- Equivalent RISC: LOAD R2 addr2 ; LOAD R3 addr3 ; MUL R1 R2 R3 ; STORE addr1 R1
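
The contrast between the two instruction styles can be sketched in Python. This is a minimal illustration, assuming a toy machine whose memory and register file are plain dictionaries; the function names `cisc_mul` and `risc_mul` are hypothetical and not part of any real ISA.

```python
def cisc_mul(mem, dst, a, b):
    """One memory-to-memory instruction: MUL dst a b."""
    mem[dst] = mem[a] * mem[b]
    return 1  # number of instructions executed


def risc_mul(mem, dst, a, b):
    """Load/store sequence: operands must pass through registers."""
    regs = {}
    regs["R2"] = mem[a]                    # LOAD  R2 addr2
    regs["R3"] = mem[b]                    # LOAD  R3 addr3
    regs["R1"] = regs["R2"] * regs["R3"]   # MUL   R1 R2 R3
    mem[dst] = regs["R1"]                  # STORE addr1 R1
    return 4  # number of instructions executed


mem_cisc = {"addr1": 0, "addr2": 6, "addr3": 7}
mem_risc = dict(mem_cisc)
n_cisc = cisc_mul(mem_cisc, "addr1", "addr2", "addr3")
n_risc = risc_mul(mem_risc, "addr1", "addr2", "addr3")
print(mem_cisc["addr1"], mem_risc["addr1"], n_cisc, n_risc)
```

Both versions leave memory in the same state; the RISC version uses more, but simpler, instructions.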

CISC vs RISC

CISC features
- Older ISA
- Multi-cycle instructions, HW-intensive design
- Efficient RAM usage
- Instructions are complex and of variable length, lots of them
- Micro-code support
- Compound addressing modes

RISC features
- Ideas emerged in the 1980s
- Single-cycle instructions, SW-intensive design
- Heavy RAM usage, large register file
- Small number of simple fixed-length instructions
- Fewer addressing modes

Elementary CPU Datapath

Figure: Elementary single-cycle datapath with PC, instruction memory, register file (R-Reg-1, R-Reg-2, W-Reg, Data-1, Data-2), sign extend, shift left 2, two adders, ALU (Zero, Result), data memory, and multiplexers; control signals PCSrc, ALUSrc, ALU operation, MemRead, MemWrite, MemToReg, RegWrite.

- The datapath fetches an instruction, decodes it, and executes it
- Control logic generates suitable activation signals
- Different instructions execute with different delays


Single cycle implementation of datapath

- The choice of clock rate is limited by the instruction with the maximum delay
- Options: choose a clock period no shorter than the latency of the 'slowest' instruction, or
- choose variable periods for different instructions, which is not practical
- Alternative: break the instruction execution cycle into a series of basic steps
- Basic steps have less delay, so choose a fast clock and use it to execute one basic step at a time
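
A small worked example of the clock-period argument above. The per-step delays and the list of steps per instruction class are assumed values chosen for illustration, not figures from the slides.

```python
# Hypothetical per-step delays in picoseconds (assumed for illustration).
STEP_DELAY = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Steps each instruction class needs (a common textbook simplification).
STEPS = {
    "lw":     ["IF", "ID", "EX", "MEM", "WB"],
    "sw":     ["IF", "ID", "EX", "MEM"],
    "R-type": ["IF", "ID", "EX", "WB"],
    "beq":    ["IF", "ID", "EX"],
}


def latency(instr):
    """Total delay of an instruction if all its steps run back to back."""
    return sum(STEP_DELAY[s] for s in STEPS[instr])


# Single-cycle design: the clock must fit the slowest instruction.
single_cycle_period = max(latency(i) for i in STEPS)

# Multi-cycle design: the clock must fit only the slowest basic step.
multi_cycle_period = max(STEP_DELAY.values())


def multi_cycle_time(instr):
    """Execution time when each basic step takes one (fast) clock cycle."""
    return len(STEPS[instr]) * multi_cycle_period


for name in STEPS:
    print(name, single_cycle_period, multi_cycle_time(name))
```

Under these assumed delays, a branch finishes sooner in the multi-cycle design, while a load takes slightly longer because every step is rounded up to the clock period.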

Multi-cycle instructions

A basic stage represents one of the following states in the execution of an instruction:
- Fetch (IF): IR ⇐ Memory[PC]; PC ⇐ PC + 4
- Decode (ID): understand instruction semantics
- Execute (EX): based on instruction type
  - Arithmetic/logical operation, memory address computation, or branch condition computation
- Memory (MEM): for load/store instructions, read/write data from/to memory
- Writeback (WB): update register file

Pipelining

- Operate IF→ID→EX→MEM→WB in parallel for a sequence of instructions
- Every basic stage is always processing some instruction
- Ideal scenario: one instruction completes in every clock cycle
- Practical issues: pipeline hazards
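
The ideal overlap can be sketched as a timing table. This is a minimal sketch assuming no hazards and one stage per cycle; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(num_instrs):
    """Return {instr_index: {cycle: stage}} for a hazard-free pipeline.

    Instruction k is fetched in cycle k + 1 and advances one stage per cycle.
    """
    return {
        i: {i + s + 1: stage for s, stage in enumerate(STAGES)}
        for i in range(num_instrs)
    }


def total_cycles(num_instrs):
    """Cycles to finish all instructions: fill time plus one per instruction."""
    return num_instrs + len(STAGES) - 1


def render(num_instrs):
    """Text diagram with one row per instruction and one column per cycle."""
    sched = ideal_schedule(num_instrs)
    rows = []
    for i in range(num_instrs):
        cells = [sched[i].get(c, "").ljust(4)
                 for c in range(1, total_cycles(num_instrs) + 1)]
        rows.append(f"I{i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


print(render(4))
```

Once the pipeline is full, one instruction leaves WB per cycle, so n instructions take n + 4 cycles instead of 5n.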

Structural hazard

- Consider a sequence of 4 lw (load-word) instructions
- When the first instruction reads its data from memory (MEM stage), the fourth instruction is itself being fetched from memory (IF stage)
- This is a structural hazard: the pipeline needs to stall for lack of resources if the hardware cannot support multiple memory reads in parallel

Data Hazard : MIPS example

- sub $2, $1, $3; and $12, $2, $5: Read after Write (RAW) on $2
- If 'sub' is in the IF stage in the (i+1)-th clock cycle, $2 is updated in the (i+5)-th cycle
- 'and' is in the EX stage in the (i+4)-th cycle, when the updated value of $2 is not yet ready
- Solution: 'sub' computes the value for $2 in the (i+3)-th cycle,
  - this may be forwarded directly to the execution of 'and'
  - suitable logic is needed to detect the hazard and the forwarding requirement
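
The cycle arithmetic above can be checked with a few lines of Python. This is a sketch assuming the five-stage pipeline with one stage per cycle; the base cycle `i` is set to 0 only so the numbers can be printed.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycle_of(stage, fetch_cycle):
    """Clock cycle in which an instruction fetched at fetch_cycle is in stage."""
    return fetch_cycle + STAGES.index(stage)


i = 0                        # symbolic base cycle
sub_fetch = i + 1            # 'sub' is in IF in cycle i + 1
and_fetch = i + 2            # 'and' follows one cycle later

sub_writes_reg = cycle_of("WB", sub_fetch)    # i + 5: $2 updated in register file
sub_alu_done = cycle_of("EX", sub_fetch)      # i + 3: value of $2 computed
and_needs_value = cycle_of("EX", and_fetch)   # i + 4: 'and' uses $2 in its ALU

# Hazard: the consumer needs the value before the register file has it.
raw_hazard = and_needs_value < sub_writes_reg
# Forwarding works because the ALU result exists before it is needed.
forwarding_resolves = sub_alu_done < and_needs_value

print(sub_writes_reg, sub_alu_done, and_needs_value, raw_hazard, forwarding_resolves)
```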

Control hazards

- Branch decisions: the branch condition needs evaluation (beq $1, $2, offset)
- The branch decision is known only in the MEM stage
- Optimization: assume the branch is not taken and operate the pipeline normally
- When the decision evaluates to true (taken), execute the branch and flush the intermediate instructions from the pipeline
- Sophisticated schemes: use branch prediction HW (predict a branch decision based on branch history table content)
Section 2

The Memory Hierarchy


Multi-level Arrangement

Figure: Near to CPU is faster


Principle of locality

- Temporal locality: if an item is referenced, it will tend to be referenced again soon
- Spatial locality: if an item is referenced, items at nearby addresses will tend to be referenced soon
- Hence, computer memory is hierarchically organized
  - The register file provides the fastest access
  - Cache memory uses (fast) SRAM (static random access memory)
  - Main memory uses (slow) DRAM (dynamic random access memory), which is less costly per bit than SRAM

Cache Mapping

Figure: Direct-mapped cache with eight blocks (indices 000 to 111); memory blocks 000001, 000101, 001001, 001101, …, 110001, 110101, 111001, 111101 are shown mapping into the cache.

- Direct mapped: cache block address = (memory block address) modulo (number of cache blocks in the cache)
- Block = minimum unit of information that can be either present or not present
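
The modulo rule can be tried directly. This is a minimal sketch assuming an 8-block cache and 5-bit block addresses picked for illustration; with a power-of-two number of blocks the index is simply the low-order bits of the block address.

```python
def direct_mapped_index(block_address, num_cache_blocks):
    """Cache block index for a memory block in a direct-mapped cache."""
    return block_address % num_cache_blocks


NUM_BLOCKS = 8  # assumed cache size in blocks
example_addresses = [0b00001, 0b01001, 0b10001, 0b11001, 0b00101]

for address in example_addresses:
    index = direct_mapped_index(address, NUM_BLOCKS)
    print(f"memory block {address:05b} -> cache block {index:03b}")
```

Blocks 00001, 01001, 10001, and 11001 all compete for cache block 001, which is why a tag is needed to tell them apart.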


Cache Blocks

- Larger blocks lower the miss rate by exploiting spatial locality, but they also increase the miss penalty
- Nothing is free: with very large block sizes the cache holds too few blocks, and eventually the miss rate goes up
- Handling a cache miss (on instruction fetch):
  - Send the original PC value (current PC – 4) to the memory
  - Read from main memory and write the fetched data into the cache entry

Cache write policy

- Handling consistency: always write the data into both the memory and the cache (write-through)
- This is a conservative policy and slows things down
- Use a write buffer to hold data waiting to be written to memory, so the processor stalls only when the buffer is full. The buffer size can be decided by memory speed
- Alternative policy, write-back: writes are made only in the cache. Main memory is updated only during cache block replacement
- Write-back offers better performance in case of frequent writes, but is more complex to implement
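
The difference in memory traffic between the two policies can be sketched with a toy model. It assumes a direct-mapped cache with one word per block, write-allocate on a miss, and a final flush of dirty blocks for write-back; the function name is hypothetical.

```python
def count_memory_writes(writes, num_blocks, policy):
    """Count main-memory writes for a stream of written block addresses."""
    cache = {}       # cache index -> (block_address, dirty_flag)
    mem_writes = 0
    for block in writes:
        idx = block % num_blocks
        if policy == "write-through":
            cache[idx] = (block, False)
            mem_writes += 1                 # every write also goes to memory
        else:                               # write-back
            resident = cache.get(idx)
            if resident and resident[0] != block and resident[1]:
                mem_writes += 1             # evicted dirty block is written out
            cache[idx] = (block, True)
    if policy == "write-back":
        # Flush blocks that are still dirty at the end of the run.
        mem_writes += sum(1 for _, dirty in cache.values() if dirty)
    return mem_writes


stream = [0] * 10 + [4] * 10   # blocks 0 and 4 share index 0 in a 4-block cache
print(count_memory_writes(stream, 4, "write-through"))
print(count_memory_writes(stream, 4, "write-back"))
```

With repeated writes to the same block, write-back touches memory only on replacement, which is the case where it pays off.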

Memory System

- Memory chips are designed to read/write more than one word in parallel (hiding latency)
- Use a wide bus to allow parallel access to all words in a block
- Or keep a bus of standard width (= memory word length = register size) and connect it to multiple memory units in parallel (memory banks)
- Why? Bus transmission is fast; memory read/write is slow

Cache Mapping: alternate schemes

- Fully associative: a block can be placed in any location in the cache (large HW requirement for fast parallel search)
- Practical only for caches with a small number of blocks
- Optimizing in the middle: set-associative cache
- An n-way set-associative cache consists of a number of sets, each of which consists of n blocks
- Set number = (memory block number) modulo (number of sets in the cache)
- Inside a set, the tags of all the elements must be searched
- Increasing associativity decreases the miss rate up to a point, but increases the hit time

Cache replacement policy

- In a direct-mapped cache, a new block can go to exactly one location
- In a fully associative cache, a new block can potentially replace any existing block. How to resolve?
- In a set-associative cache, a new block can potentially replace any existing block inside the matching set. How to resolve?
- Least Recently Used (LRU) policy: the block replaced is the one that has been unused for the longest time
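
Set-associative placement with LRU replacement can be sketched as follows. This is a minimal model that tracks only block addresses (no data or tags); the class and function names are hypothetical. Direct-mapped and fully associative caches fall out as the special cases of 1 way and 1 set.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """n-way set-associative cache with LRU replacement (addresses only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; the oldest entry is least recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        current = self.sets[block % self.num_sets]
        if block in current:
            current.move_to_end(block)     # mark as most recently used
            return True
        if len(current) == self.ways:
            current.popitem(last=False)    # evict the least recently used block
        current[block] = True
        return False


def miss_count(blocks, num_sets, ways):
    cache = SetAssociativeCache(num_sets, ways)
    return sum(not cache.access(b) for b in blocks)


trace = [0, 8, 0, 6, 8]                  # example block-address trace
print(miss_count(trace, 4, 1))           # direct mapped, 4 blocks
print(miss_count(trace, 2, 2))           # 2-way set associative, 4 blocks
print(miss_count(trace, 1, 4))           # fully associative, 4 blocks
```

For this trace and a four-block cache, the miss count drops as associativity increases.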
Section 3

Instruction Level Parallelism (ILP)


Actual Pipeline CPI

Pipeline cycles per instruction (CPI) = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
- Handling hazards requires both architectural and compiler techniques
- Data hazard types while executing instruction i followed by j in a pipeline:
  - RAW: j tries to read a source before i writes it, so j incorrectly gets the old value
  - WAW: j tries to write an operand before it is written by i. This will not happen in a simple RISC pipeline, but can in pipelines that write in more than one basic stage or allow an instruction to proceed even when a previous instruction is stalled
  - WAR: j tries to write a destination before it is read by i. This can happen when instructions are reordered
  - RAR: not a hazard
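
The three hazard types can be detected from register names alone. The sketch below assumes each instruction is represented as a `(dest, sources)` pair, a hypothetical encoding chosen for illustration. It reports potential hazards; whether one actually causes a stall depends on the pipeline.

```python
def hazards(i, j):
    """Classify potential data hazards between instruction i and a later j.

    Each instruction is a (dest, sources) pair, e.g. ("$2", ("$1", "$3")).
    Use None as dest for instructions that write no register.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = set()
    if i_dest is not None and i_dest in j_srcs:
        found.add("RAW")     # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        found.add("WAR")     # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        found.add("WAW")     # both write the same register
    return found


sub_instr = ("$2", ("$1", "$3"))     # sub $2, $1, $3
and_instr = ("$12", ("$2", "$5"))    # and $12, $2, $5
print(hazards(sub_instr, and_instr))
```

Two instructions that only read a common register produce an empty set, matching the point that RAR is not a hazard.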

Compiler Techniques for ILP

To keep a pipeline full, a compiler can find sequences of unrelated instructions that can be overlapped.

    for (i=100; i>=0; i=i-1)
        x[i] = x[i] + s;

Unoptimized MIPS:

    Loop: L.D    F0,0(R1)     ;F0=array element
          ADD.D  F4,F0,F2     ;add scalar in F2
          S.D    F4,0(R1)     ;store result
          DADDUI R1,R1,#-8    ;decrement pointer by 8 bytes (per DW) //loop overhead
          BNE    R1,R2,Loop   ;branch R1!=R2 //branch decision


Unrolling: eliminates three branches and three decrements of R1 (Hennessy and Patterson)

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop

Costs of unrolling:
- Code size increases, which can cause more instruction cache misses
- More values are live at once, which increases register pressure
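
The same transformation can be written at source level. This is a sketch of the structure only: Python will not run the unrolled form faster, and the cleanup loop for trip counts that are not a multiple of four is an addition not shown in the assembly above.

```python
def add_scalar(x, s):
    """Reference loop: one element, one decrement, one branch per iteration."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x


def add_scalar_unrolled4(x, s):
    """Loop unrolled by four: one decrement and one branch per four elements."""
    i = len(x) - 1
    while i >= 3:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    while i >= 0:            # cleanup for the leftover 0 to 3 elements
        x[i] = x[i] + s
        i -= 1
    return x


data = [float(v) for v in range(101)]   # i = 100 down to 0, as in the C loop
print(add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5))
```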

Branch Prediction assisted ILP

General single-level predictor with 2-bit saturating counter

Figure: State diagram with four states (strongly not taken, weakly not taken, weakly taken, strongly taken). Each taken branch moves the counter one state towards strongly taken; each not-taken branch moves it one state towards strongly not taken.

- A conditional jump has to deviate twice from its past behaviour before the prediction changes
- Exercise: consider a sequence of alternating decisions in a loop and calculate the performance improvement over a 1-bit saturating counter
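
The exercise can be explored with a short simulation. This is a sketch assuming the counter starts in its strongest not-taken state; the result for alternating outcomes depends on that starting state (a 2-bit counter started in a weak state can mispredict every time).

```python
def mispredictions(outcomes, bits, state=0):
    """Count mispredictions of an n-bit saturating counter.

    outcomes: iterable of booleans (True = taken).
    state: initial counter value; 0 is the strongest 'not taken' state.
    """
    top = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # predict taken when state >= threshold
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # Saturating update: move one step towards the observed outcome.
        state = min(top, state + 1) if taken else max(0, state - 1)
    return misses


alternating = [True, False] * 10          # T, N, T, N, ...
loop_branch = ([True] * 9 + [False]) * 3  # taken 9 times, then loop exit

print(mispredictions(alternating, bits=1), mispredictions(alternating, bits=2))
print(mispredictions(loop_branch, bits=1), mispredictions(loop_branch, bits=2))
```

Under this starting state the 1-bit counter mispredicts every alternating outcome, while the 2-bit counter mispredicts half of them.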


Hierarchical Prediction

How about generalizing the idea of prediction to larger branch histories?
- Store an m-length history of a branch: 2^m possibilities
- For each possibility use an n-bit predictor: (m, n) prediction scheme
- A two-level predictor with m-bit history can predict any repetitive sequence with any period if all m-bit sub-sequences are different
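
An (m, n) scheme for a single branch can be sketched as below. The assumptions are mine: the history register starts at all not-taken, every counter starts at its strongest not-taken state, and m = 0 degenerates to a single n-bit counter.

```python
def two_level_mispredictions(outcomes, m, n=2):
    """Mispredictions of an (m, n) predictor for a single branch.

    The last m outcomes select one of 2**m n-bit saturating counters.
    """
    top = (1 << n) - 1
    threshold = 1 << (n - 1)
    counters = [0] * (1 << m)    # one counter per possible history
    history = 0                  # last m outcomes packed into an integer
    mask = (1 << m) - 1
    misses = 0
    for taken in outcomes:
        value = counters[history]
        if (value >= threshold) != taken:
            misses += 1
        counters[history] = min(top, value + 1) if taken else max(0, value - 1)
        history = ((history << 1) | int(taken)) & mask
    return misses


pattern = [True, True, False]    # period 3; its 2-bit sub-sequences all differ
print(two_level_mispredictions(pattern * 50, m=2))
print(two_level_mispredictions(pattern * 50, m=0))
```

After a short warm-up the 2-bit history predicts this pattern perfectly, while a single 2-bit counter keeps mispredicting the not-taken outcome.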

Dynamic Scheduling for ILP

- Simple pipelines execute instructions in order:

      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F12,F8,F14

- SUB.D suffers because ADD.D stalls on its dependence on DIV.D (through F0), even though SUB.D depends on neither
- A different ordering avoids the stall in this case
- Out-of-order execution brings in the possibility of WAR and WAW hazards

Robert Tomasulo developed an algorithm that allows out-of-order execution while minimizing hazards: it tracks when the operands of instructions become available, to minimize RAW hazards, and uses register renaming to minimize WAW and WAR hazards.
Original code:

    DIV F0,F2,F4
    ADD F6,F0,F8     //RAW on F0 (from DIV)
    S   F6,0(R1)     //RAW on F6 (from ADD)
    SUB F8,F10,F14   //WAR on F8 (read by ADD)
    MUL F6,F10,F8    //WAR on F6 (read by S), WAW on F6 (written by ADD), RAW on F8 (from SUB)

- RAW is due to a true data dependency and stalls an in-order pipeline
- WAR/WAW constrain out-of-order execution

⇒ After register renaming, with S and T as temporary registers:

    DIV F0,F2,F4
    ADD S,F0,F8
    S   S,0(R1)
    SUB T,F10,F14
    MUL F6,F10,T

- S removes the WAR of MUL
- S removes the WAW of MUL
- T removes the WAR of SUB
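
Renaming can be sketched mechanically. The version below is deliberately simpler than the hand renaming on the slide: it assumes an unlimited supply of hypothetical physical registers `P0`, `P1`, … and gives every destination a fresh one, whereas the slide renames only the two registers that cause WAR/WAW hazards. It is not Tomasulo's algorithm, only the renaming idea it relies on.

```python
def rename(instrs):
    """Give every destination a fresh name; sources read the latest name.

    instrs: list of (op, dest, sources); dest is None for stores.
    """
    latest = {}      # architectural register -> current physical name
    renamed = []
    for n, (op, dest, srcs) in enumerate(instrs):
        new_srcs = tuple(latest.get(r, r) for r in srcs)   # keeps RAW links
        new_dest = None
        if dest is not None:
            new_dest = f"P{n}"          # hypothetical physical register
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV", "F0", ("F2", "F4")),
    ("ADD", "F6", ("F0", "F8")),
    ("S",   None, ("F6", "R1")),
    ("SUB", "F8", ("F10", "F14")),
    ("MUL", "F6", ("F10", "F8")),
]

renamed_program = rename(program)
for instr in renamed_program:
    print(instr)
```

Every destination is now unique and no later instruction overwrites a register that an earlier one still reads, so only the RAW dependences remain.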


ILP Using Multiple Issue and Static Scheduling

Multiple-issue processors allow multiple instructions to be issued in a clock cycle
- VLIW (very long instruction word): parallel instructions are statically scheduled by the compiler; issues a fixed number of instructions formatted as one large instruction
- Statically scheduled superscalar: issues a varying rather than a fixed number of instructions (compiler decided) per clock, in-order execution
- Dynamically scheduled superscalar: issues a varying rather than a fixed number of instructions (hardware decided) per clock, out-of-order execution

For large issue widths, VLIW (with multiple independent FUs) is preferred over statically scheduled superscalar