
COMPUTER ARCHITECTURE
An Embedded Approach

Ian McLoughlin
School of Computer Engineering
Nanyang Technological University

Singapore • Boston • Burr Ridge, IL • Dubuque, IA • Madison, WI • New York • San Francisco
St. Louis • Bangkok • Kuala Lumpur • Lisbon • London • Madrid
Mexico City • Milan • Montreal • New Delhi • Seoul • Sydney • Taipei • Toronto

The McGraw-Hill Companies
Computer Architecture: An Embedded Approach

McGraw-Hill Higher Education

Copyright © 2011 by McGraw-Hill Education (Asia). All rights reserved. No part of this
publication may be reproduced or distributed in any form or by any means, or stored in
a database or retrieval system without the prior written permission of the publisher,
including, but not limited to, in any network or other electronic storage or transmission,
or broadcast for distance learning.

Cover image © iStockphoto.com

When ordering this title, use ISBN 978-007-131118-2 or MHID 0-007-131118-1

Printed in Singapore
About the Author

Ian McLoughlin is an Associate Professor in the School of Computer


Engineering, Nanyang Technological University, Singapore. His back-
ground includes work for industry, government and academia across three
continents over the past 20 years. He is an engineer, and has designed or
worked on systems that can be found in space, flying in the troposphere,
empowering the global telecommunications network, underwater, in daily
use by emergency services and embedded within consumer devices. For
his work on rural telecommunications solutions, he won the inaugural IEE
Innovation in Engineering Award in 2005 with his team from Tait Electron-
ics Ltd, Christchurch, New Zealand. He is a member of IET, senior mem-
ber of IEEE, a Chartered Engineer in the UK and an Ingenieur Europeen
(Eur. Ing.)
Contents

List of Boxes
Preface
Acknowledgments
Walk Through

Chapter 1: Introduction
1.1 Book Organisation
1.2 Evolution
1.3 Computer Generations
    1.3.1 First Generation
    1.3.2 Second Generation
    1.3.3 Third Generation
    1.3.4 Fourth Generation
    1.3.5 Fifth Generation
1.4 Cloud, Pervasive, Grid and Massively Parallel Computers
1.5 Where To From Here?
1.6 Summary

Chapter 2: Foundations
2.1 Computer Organisation
    2.1.1 Flynn’s Classification
    2.1.2 Connection Arrangements
    2.1.3 Layered View of Computer Organisation
2.2 Computer Fundamentals
2.3 Number Formats
    2.3.1 Unsigned Binary
    2.3.2 Sign-Magnitude
    2.3.3 One’s Complement
    2.3.4 Two’s Complement
    2.3.5 Excess-n
    2.3.6 Binary-Coded Decimal
    2.3.7 Fractional Notation
    2.3.8 Sign Extension
2.4 Arithmetic
    2.4.1 Addition
    2.4.2 The Parallel Carry-Propagate Adder
    2.4.3 Carry Look-Ahead
    2.4.4 Subtraction
2.5 Multiplication
    2.5.1 Repeated Addition
    2.5.2 Partial Products
    2.5.3 Shift-Add Method
    2.5.4 Booth and Robertson’s Methods
2.6 Division
    2.6.1 Repeated Subtraction
2.7 Working with Fractional Number Formats
    2.7.1 Arithmetic with Fractional Numbers
    2.7.2 Multiplication and Division of Fractional Numbers
2.8 Floating Point
    2.8.1 Generalised Floating Point
    2.8.2 IEEE754 Floating Point
    2.8.3 IEEE754 Modes
    2.8.4 IEEE754 Number Ranges
2.9 Floating Point Processing
    2.9.1 Addition and Subtraction of IEEE754 Numbers
    2.9.2 Multiplication and Division of IEEE754 Numbers
    2.9.3 IEEE754 Intermediate Formats
    2.9.4 Rounding
2.10 Summary

Chapter 3: CPU Basics
3.1 What Is a Computer?
3.2 Making the Computer Work for You
    3.2.1 Program Storage
    3.2.2 Memory Hierarchy
    3.2.3 Program Transfer
    3.2.4 Control Unit
    3.2.5 Microcode
    3.2.6 RISC vs CISC Approaches
    3.2.7 Example Processors
3.3 Instruction Handling
    3.3.1 The Instruction Set
    3.3.2 Instruction Fetch and Decode
    3.3.3 Compressed Instruction Sets
    3.3.4 Addressing Modes
    3.3.5 Stack Machines and Reverse Polish Notation
3.4 Data Handling
    3.4.1 Data Formats and Representations
    3.4.2 Data Flows
    3.4.3 Data Storage
    3.4.4 Internal Data
    3.4.5 Data Processing
3.5 A Top-Down View
    3.5.1 Computer Capabilities
    3.5.2 Performance Measures, Statistics and Lies
    3.5.3 Assessing Performance
3.6 Summary

Chapter 4: Processor Internals
4.1 Internal Bus Architecture
    4.1.1 A Programmer’s Perspective
    4.1.2 Split Interconnection Arrangements
    4.1.3 ADSP21xx Bus Arrangement
    4.1.4 Simultaneous Data and Program Memory Access
    4.1.5 Dual-Bus Architectures
    4.1.6 Single-Bus Architectures
4.2 Arithmetic Logic Unit
    4.2.1 ALU Functionality
    4.2.2 ALU Design
4.3 Memory Management Unit
    4.3.1 The Need for Virtual Memory
    4.3.2 MMU Operation
    4.3.3 Retirement Algorithms
    4.3.4 Internal Fragmentation and Segmentation
    4.3.5 External Fragmentation
    4.3.6 Advanced MMUs
    4.3.7 Memory Protection
4.4 Cache
    4.4.1 Direct Cache
    4.4.2 Set-Associative Cache
    4.4.3 Full-Associative Cache
    4.4.4 Locality Principles
    4.4.5 Cache Replacement Algorithms
    4.4.6 Cache Performance
    4.4.7 Cache Coherency
4.5 Co-Processors
4.6 Floating Point Unit
    4.6.1 Floating Point Emulation
4.7 Streaming SIMD Extensions (SSE) and Multimedia Extensions
    4.7.1 Multimedia Extensions (MMX)
    4.7.2 MMX Implementation
    4.7.3 Use of MMX
    4.7.4 Streaming SIMD Extensions (SSE)
    4.7.5 Using SSE and MMX
4.8 Co-Processing in Embedded Systems
4.9 Summary

Chapter 5: Enhancing CPU Performance
5.1 Speed-Ups
5.2 Pipelining
    5.2.1 Multi-Function Pipelines
    5.2.2 Dynamic Pipelines
    5.2.3 Changing Mode in a Pipeline
    5.2.4 Data Dependency Hazard
    5.2.5 Conditional Hazards
    5.2.6 Conditional Branches
    5.2.7 Compile-Time Pipeline Remedies
    5.2.8 Relative Branching
    5.2.9 Instruction-Set Pipeline Remedies
    5.2.10 Runtime Pipeline Remedies
5.3 Complex and Reduced Instruction Set Computers
5.4 Superscalar Architectures
    5.4.1 Simple Superscalar
    5.4.2 Multiple-Issue Superscalar
    5.4.3 Superscalar Performance
5.5 Instructions Per Cycle
    5.5.1 IPC of Different Architectures
    5.5.2 Measuring IPC
5.6 Hardware Acceleration
    5.6.1 Zero-Overhead Loops
    5.6.2 Address Handling Hardware
    5.6.3 Shadow Registers
5.7 Branch Prediction
    5.7.1 The Need for Branch Prediction
    5.7.2 Single T-bit Predictor
    5.7.3 Two-Bit Predictor
    5.7.4 The Counter and Shift Registers as Predictors
    5.7.5 Local Branch Predictor
    5.7.6 Global Branch Predictor
    5.7.7 The Gselect Predictor
    5.7.8 The Gshare Predictor
    5.7.9 Hybrid Predictors
    5.7.10 Branch Target Buffer
    5.7.11 Basic Blocks
    5.7.12 Branch Prediction Summary
5.8 Parallel Machines
    5.8.1 Evolution of SISD to MIMD
    5.8.2 Parallelism for Raw Performance
    5.8.3 More on Parallel Processing
5.9 Tomasulo’s Algorithm
    5.9.1 The Rationale Behind Tomasulo’s Algorithm
    5.9.2 An Example Tomasulo System
    5.9.3 Tomasulo’s Algorithm in Embedded Systems
5.10 Summary

Chapter 6: Externals
6.1 Interfacing Using a Bus
    6.1.1 Bus Control Signals
    6.1.2 Direct Memory Access
6.2 Parallel Bus Specifications
6.3 Standard Interfaces
    6.3.1 System Control Interfaces
    6.3.2 System Data Buses
    6.3.3 Input/Output Buses
    6.3.4 Peripheral Device Buses
    6.3.5 Interface to Networking Devices
6.4 Real-Time Issues
    6.4.1 External Stimuli
    6.4.2 Interrupts
    6.4.3 Real-Time Definitions
    6.4.4 Temporal Scope
    6.4.5 Hardware Architecture Support for Real-Time Operating Systems
6.5 Interrupts and Interrupt Handling
    6.5.1 The Importance of Interrupts
    6.5.2 The Interrupt Process
    6.5.3 Advanced Interrupt Handling
    6.5.4 Sharing Interrupts
    6.5.5 Re-Entrant Code
    6.5.6 Software Interrupts
6.6 Wireless
    6.6.1 Wireless Technology
    6.6.2 Wireless Interfacing
    6.6.3 Issues Relating to Wireless
6.7 Summary

Chapter 7: Practical Embedded CPUs
7.1 Introduction
7.2 Microprocessors are Core Plus More
7.3 Required Functionality
7.4 Clocking
    7.4.1 Clock Generation
7.5 Clocks and Power
    7.5.1 Propagation Delay
    7.5.2 The Trouble with Current
    7.5.3 Solutions for Clock Issues
    7.5.4 Low-Power Design
7.6 Memory
    7.6.1 Early Computer Memory
    7.6.2 Read-Only Memory
    7.6.3 Random Access Memory
7.7 Pages and Overlays
7.8 Memory in Embedded Systems
    7.8.1 Non-Volatile Memory
    7.8.2 Volatile Memory
    7.8.3 Other Memory
7.9 Test and Verification
    7.9.1 Integrated Circuit Design and Manufacture Problems
    7.9.2 Built-in Self-Test
    7.9.3 Joint Test Action Group
7.10 Error Detection and Correction
7.11 Watchdog Timers and Reset Supervision
    7.11.1 Reset Supervisors and Brownout Detectors
7.12 Reverse Engineering
    7.12.1 The Reverse Engineering Process
    7.12.2 Detailed Physical Layout
7.13 Preventing Reverse Engineering
    7.13.1 Passive Obfuscation of Stored Programs
    7.13.2 Programmable Logic Families
    7.13.3 Active RE Mitigation
    7.13.4 Active RE Mitigation Classification
7.14 Summary

Chapter 8: CPU Design
8.1 Soft-Core Processors
    8.1.1 Microprocessors are More Than Cores
    8.1.2 The Advantages of Soft-Core Processors
8.2 Hardware-Software Co-Design
8.3 Off-The-Shelf Cores
8.4 Making Our Own Soft Core
8.5 CPU Design Specification
    8.5.1 CPU Architecture
    8.5.2 Buses
    8.5.3 Storage of Program and Data
    8.5.4 Logical Operations
    8.5.5 Instruction Handling
    8.5.6 System Control
8.6 Instruction Set
    8.6.1 CPU Control
8.7 CPU Implementation
    8.7.1 The Importance of Testing
    8.7.2 Defining Operations and States: defs.v
    8.7.3 Starting Small: counter.v
    8.7.4 CPU Control: state.v
    8.7.5 Program and Variable Storage: ram.v
    8.7.6 The Stack: stack.v
    8.7.7 Arithmetic, Logic and Multiply Unit: alu.v
    8.7.8 Tying It All Together: tinycpu.v
8.8 CPU Testing and Operation
8.9 CPU Programming and Use
    8.9.1 Writing TinyCPU Programs
    8.9.2 TinyCPU Programming Tools
8.10 Summary

Chapter 9: The Future
9.1 Single-Bit Architectures
    9.1.1 Bit-Serial Addition
    9.1.2 Bit-Serial Subtraction
    9.1.3 Bit-Serial Logic and Processing
9.2 Very-Long Instruction Word Architectures
    9.2.1 The VLIW Rationale
    9.2.2 Difficulties with VLIW
9.3 Parallel and Massively Parallel Machines
    9.3.1 Clusters of Big Machines
    9.3.2 Clusters of Small Machines
    9.3.3 Parallel and Cluster Processing Considerations
    9.3.4 Interconnection Strategies
9.4 Asynchronous Processors
    9.4.1 Data Flow Control
    9.4.2 Avoiding Pipeline Hazards
9.5 Alternative Number Format Systems
    9.5.1 Multiple-Valued Logic
    9.5.2 Signed Digit Number Representation
9.6 Optical Computation
    9.6.1 The Electro-Optical Full Adder
    9.6.2 The Electro-Optical Backplane
9.7 Science Fiction or Future Reality?
    9.7.1 Distributed Computing
    9.7.2 Wetware
9.8 Summary

Appendix A: Standard Notation for Memory Size
    Examples

Appendix B: Open Systems Interconnection Model
    B.1 Introduction
    B.2 The OSI Layers
    B.3 Summary

Appendix C: Exploring Trade-Offs in Cache Size and Arrangement
    C.1 Introduction
    C.2 Preparation
    C.3 Installing Cacti and Dinero
    C.4 Meet the Tools
    C.5 Experimenting with Different Trade-Offs
    C.6 Further Information in Cache Design

Appendix D: Wireless Technology for Embedded Computers
    D.1 Introduction
    D.2 802.11a, b and g
        D.2.1 802.11a/b/g Solutions for Embedded Systems
    D.3 802.11n
        D.3.1 Draft 802.11n Solutions for Embedded Systems
    D.4 802.20
    D.5 802.16
        D.5.1 802.16 Solutions
    D.6 Bluetooth
        D.6.1 Bluetooth Solutions
    D.7 GSM
        D.7.1 GSM Solutions
    D.8 GPRS
    D.9 ZigBee
        D.9.1 ZigBee Solutions
    D.10 Wireless USB
        D.10.1 Wireless USB Solutions
    D.11 Near Field Communication
        D.11.1 NFC Solutions
    D.12 WiBro
    D.13 Wireless Device Summary
    D.14 Application Example
    D.15 Summary

Appendix E: Tools for Compiling and Simulating TinyCPU
    E.1 Preparation and Obtaining Software
    E.2 How to Compile and Simulate Your Verilog
    E.3 How to View Simulation Outputs
    E.4 Advanced Test Benches
    E.5 Summary

Appendix F: Tools for Compiling and Assembling Code for TinyCPU
    F.1 Introduction
    F.2 The Assembly Process
    F.3 The Assembler
    F.4 Example Program Assembly
    F.5 The Compiler
    F.6 Summary

Index

List of Boxes

2.1  Worked endianness example 1
2.2  Worked endianness example 2
2.3  Worked endianness example 3
2.4  Worked endianness example 4
2.5  What is a number format?
2.6  Negative two’s complement numbers
2.7  Worked examples of number conversion
2.8  Is binary a fractional number format?
2.9  Fractional format worked example
2.10 Sign extension worked example
2.11 Exercise for the reader
2.12 Worked example
2.13 Exercise for the reader
2.14 Exercise for the reader
2.15 Worked examples of two’s complement multiplication
2.16 Exercise for the reader
2.17 Booth’s method worked example
2.18 Long division worked example
2.19 Worked examples of fractional representation
2.20 Worked example of fractional division
2.21 IEEE754 normalised mode worked example 1
2.22 IEEE754 normalised mode worked example 2
2.23 Exercise for the reader
2.24 IEEE754 denormalised mode worked example
2.25 IEEE754 infinity and other ‘numbers’
2.26 Worked example: converting decimal to floating point
2.27 Floating point arithmetic worked example
2.28 IEEE754 arithmetic worked example
3.1  How the ARM was designed
3.2  Illustrating conditionals and the S bit in the ARM
3.3  Condition codes in the ARM processor
3.4  Understanding the MOV instruction in the ARM
3.5  A Huffman coding illustration
3.6  Recoding RPN instructions to minimise stack space
3.7  Data types in embedded systems
3.8  Standardised performance
4.1  Exploring ALU propagation delays
4.2  MMU worked example
4.3  Trapping software errors in the C programming language
4.4  Cache example: the Intel Pentium Pro
4.5  Direct cache example
4.6  Set-associative cache example
4.7  Cache replacement algorithm worked example 1
4.8  Cache replacement algorithm worked example 2
4.9  Access efficiency example
4.10 MESI protocol worked example
4.11 An alternative approach: FPU on the ARM processor
5.1  Pipeline speed-up
5.2  WAW hazard
5.3  Conditional flags
5.4  Branch prediction
5.5  Speculative execution
5.6  Relative branching
5.7  Scoreboarding
5.8  ZOL worked examples
5.9  Address generation in the ARM
5.10 Aliasing in local prediction
6.1  DMA in a commercial processor
6.2  Bus settings for peripheral connectivity
6.3  The trouble with ISA
6.4  Scheduling priorities
6.5  ARM interrupt timing calculation
6.6  Memory remapping during boot
7.1  Configurable I/O pins on the MSP430
7.2  Pin control on the MSP430
7.3  NAND and NOR flash memory
7.4  Memory map in the MSP430
7.5  Using JTAG for finding a soldering fault
7.6  Using JTAG for booting a CPU
7.7  Hamming (7, 4) encoding example
7.8  Hamming (7, 4) encoding example using matrices
7.9  Bus line pin swapping
9.1  Example of a VLIW hardware
9.2  Examples of parallel processing machines
9.3  Example of a CSD number

Preface

There are a great many computer architecture texts in print at any one
time. Many famous authors have tried their hands at writing in this area;
however, computers constitute a rapidly advancing and fluid field, so
few books can hope to keep up without constant revisions. Above all,
the rapidity of the shift towards embedded computing systems has left
many authors, and texts, foundering in the wake. Some texts persist in
regarding computers in the same light as the room-sized machines of the
1950s and 1960s. Many more view computers in the light of the desktop and
server machines of the 1980s and 1990s. A handful acknowledge that the
vast majority of computers in modern use are embedded within everyday
objects. Few acknowledge that the future is embedded: there will come a
time when the concept of a desktop computer seems as anachronistic as
the punched card machines of 50 years ago.
This text is facing squarely towards the embedded future. Topics re-
lated to embedded processors are handled alongside the more traditional
topics of other texts and, wherever possible, examples from the embedded
world are highlighted.
The target audience for this book consists of three groups of people.
Firstly, undergraduate students of computer architecture-related courses,
typically those in their third year. Secondly, master’s level students re-
quiring a refresher in computer architecture before embarking on a more
in-depth study. Thirdly, industrial engineers. As reconfigurable logic cir-
cuits, especially FPGAs (field programmable gate arrays) are becoming
larger, faster and cheaper, there is increasing interest in soft-core comput-
ers — that is CPUs designed by engineers for specific tasks. For perhaps
the first time in history, these tools allow ordinary engineers the opportu-
nity to design and build their own custom computers. Digesting this text
will provide engineers with a solid platform of knowledge to understand
the traditional and contemporary techniques and trade-offs in computer
architecture — the art of computer design.
This text has been written from the bottom up without basing it on
an existing book. This allows it to avoid many of the historical blind al-
leys and irrelevant side shows in computer evolution, leading to a more
precisely defined focus. This is not just a computer architecture book with
an extra chapter on embedded systems. It is a fresh and integrated look at the computer
architecture of today, which is built upon the foundation and history of bigger and older
machines, but which is definitely driving towards greater levels of integration within
embedded systems.
This book aims to be an easy-access and readable text. Plenty of diagrams are
given to explain tricky concepts, and many explanatory boxes are provided throughout,
containing extra worked examples, interesting snippets of information and additional
explanations to augment the main text. Apart from covering all of the main items in
the typical computer architecture theory curriculum that are of relevance to embedded
engineers (but excluding tape storage, Winchester drives and supercomputer design),
the book contains a wealth of practical information for the target audience — even the
opportunity to build and test out a custom soft-core processor.
SI units are used throughout the book, including the newer ‘kibibyte’ and ‘mebibyte’
measures for computer memory (explained in Appendix A). Each of the main curricu-
lum chapters includes end-of-chapter problems, with answers available in an instruc-
tor’s manual. All examples, and much more material including recommendations for
further reading, are available on the associated website at
www.mheducation.asia/olc/mcloughlin.

Ian McLoughlin
Acknowledgements

Thanks are due most of all to my patient wife, Kwai Yoke, and children
Wesley and Vanessa for allowing me the time to write this book. Tom Scott,
Benjamin Premkumar, Stefan Lendnal and Adrian Busch gave me plenty
of encouragement at times when I needed it (this text took form over a long
drawn out five-year period). Doug McConnell was an inspiration as was
the late Sir Angus Tait — most of the book was written while I worked as
Principal Engineer in Group Research, Tait Electronics Ltd, Christchurch,
New Zealand. This company is the largest electronics research and devel-
opment company in Oceania, founded 30 years ago by Angus Tait at age
55 — an age at which most people wind down to retirement. Not Angus
Tait: he still went to work every day to guide the company, until he passed
away in August 2007.
Thanks are also reluctantly given to my computer architecture, ad-
vanced computer architecture and computer peripherals students at
Nanyang Technological University (NTU), for asking me difficult ques-
tions, stretching my knowledge and through that motivating me to teach
better. Associate Professor Lee Keok Kee kick-started me into gathering
materials for this book, and I would also like to acknowledge my many
other friends and colleagues in NTU, and also past colleagues in Tait Elec-
tronics Ltd, Simoco, The University of Birmingham, HMGCC and GEC
Hirst Research Centre. Thanks are also due to Gerald Bok and colleagues
at McGraw-Hill, especially to Doreen Ng and the editorial team for their
professionalism and hard work in turning the manuscript into a beautiful
book.
Most importantly, I wish to acknowledge my mother who constantly
encouraged me along the way — not just of writing this book, but through-
out my entire lifetime. Her high expectations led, eventually to my enter-
ing academia, and she has always been most enthusiastic regarding my
forays into writing; thank you Mum. However, above all I want to give
glory to the God who made me, protected me, nurtured me, gave his son
to save me, and will eventually welcome me into His presence. All that I
am, accomplish, obtain and achieve, I owe to Him.
Walk Through

(This section showcases sample pages from the book, annotated to highlight its features.)

Explanatory boxes containing extra worked examples and interesting snippets of
information to augment the main text (for example, Box 2.17, a worked example of
Booth’s method, and Box 3.1, describing how the ARM was designed).

A wealth of practical information, including the opportunity to build and test out a
custom soft-core processor (TinyCPU, developed in Chapter 8).

Each chapter ends with a set of 20 problems.

Attention is given to industrially-relevant embedded systems and issues relating to
modern microprocessors and system-on-chip devices (for example, Box 7.1 on the
configurable I/O pins of the MSP430).

Plenty of diagrams to explain tricky concepts.

Appendices E and F cover the tools for compiling and simulating TinyCPU, and the
tools for compiling and assembling code for TinyCPU.
Chapter 1

Introduction

1.1 Book Organisation
Computers have evolved a long way: from Charles Babbage’s analytical
machine of 1834 (Figure 1.1 shows a drawing of his difference engine, an
earlier, fully working mathematical processing machine of similar design)
to the supercomputers of today, the story has been one of ever-increasing
processing power, complexity and miniaturisation.
Surprisingly, many techniques of Babbage’s day (as well as the early
electrical computers of the 1940s) can still be found in today’s systems,
demonstrating the amazing foresight of those early pioneers. Unfortu-
nately, these links with the past are not always positive — today’s Intel
desktop processors contain performance-limiting evolutionary throw-
backs to the 8086 processor and beyond. With the benefit of hindsight,
we have the opportunity to look back through computing history, and
identify many short-lived evolutionary branches that seemed, at the time,
to be promising paths to future progress, but which quickly disappeared.
Sometimes these may reappear years later in specialised machines, but
more often they are little more than historical curiosities.
What seems likely then is that the computers of tomorrow will be built
on the techniques used in those of today. A snapshot of current techniques
(as any computing text has to be) needs to recognise this fact, rather than
presenting the technology as being set in stone.
This book will loosely follow the evolutionary trend. Early chapters
will focus on computer fundamentals. Mastery of these fundamentals will
allow a student to construct a working computer on paper, however slow
and inefficient their design might be if constructed. These early chapters
will be followed by a consideration of the architectural speed-ups and
advanced techniques in use today. These are separated from the fundamentals
because some of them may turn out to be the current ‘evolutionary
blind alleys’, but nevertheless they are some of the techniques currently
driving Moore’s Law so quickly forward.

Figure 1.1: A portion of Babbage’s analytical difference engine, as drawn in
Harper’s new monthly magazine, Vol. 30, Issue 175, p. 34, 1864. The original
engine documents, and a working reconstruction, can be seen today in the
London Science Museum.

Every now and then something completely revolutionary happens
in computer architecture — these break the evolutionary trend and consign
many past techniques that gave incremental performance increases,
to oblivion. Without a crystal ball this book will not attempt to identify these technolo-
gies, but that will not prevent us from making an informed guess, in the final chapter,
about advanced technologies which may spark a revolution in the field of computing
over the next few decades.

1.2 Evolution
The concept of evolution of animals is controversial: to date there has been no scien-
tific proof of the theory, yet many choose to believe in it. Some prefer a ‘wait and see’
approach, hoping that science will eventually catch up, while others choose to believe
in an all-powerful yet unseen creator. Moving away from animals, to man-made
devices, the fact that computers have followed an evolutionary path of improvement is
quite obvious and unquestioned. While there have been rare disruptive breakthroughs,
computing history is full of many small incremental improvements over the years.
Of course, something as complex as a computer requires an intelligent engineer to
have designed it. We can often identify the engineers by name, especially those who
have made significant improvements (a few of them are still alive today to tell us about
it). Furthermore, the design and history of the pioneering machines, often constructed
at great expense, should have been very well documented.
So in computing, one would expect the history of development to be very definite;
there should be little confusion and controversy regarding the pioneering machines
from half a century ago. Unfortunately, that is not the case: there exists a very wide
range of opinions, with little agreement upon exact dates, contributions and ‘firsts’.

Figure 1.2: One of ten Colossus computers in use during the Second World War
(courtesy of the Bletchley Park Trust: www.bletchleypark.org.uk).

Just pick up any two books on computer architecture or computer history and compare
them. For our present purposes, we will begin the modern era of computing with the
invisible giant, Colossus.
Colossus (shown in Figure 1.2), built by engineer Tommy Flowers in 1943 and pro-
grammed by Alan Turing and colleagues in Bletchley Park, is now generally credited
with being the world’s first programmable electronic computer. This was built in Eng-
land during the Second World War as part of the (ultimately successful) code-breaking
effort against the German Enigma code. Unfortunately, Colossus fell under the British
Official Secrets Act and remained hidden for 50 years. All papers relating to it were
ordered destroyed after the war, when Prime Minister Winston Churchill (with a typi-
cally descriptive — although secret — pronouncement) ordered the machines to be ‘bro-
ken into pieces no larger than a man’s hand’. Plans and schematics were burned by the
designers and its codebreaker operators sworn to secrecy under peril of imprisonment,
or worse, for treason.
The action to hide this machine was successful. Despite the occasional unverified
rumour over the years, the existence of Colossus was only revealed publicly when
the few remaining documents were de-classified in the year 2000 and a government
report containing the information was released. For this reason, Colossus is not even
mentioned in many descriptions of computer history: an entire generation of computer
architects had never even heard about it.
However, there were other very well-known and reported machines of similar
vintage to Colossus that began operation in the years that followed. One of the most
famous, ENIAC (Electronic Numerical Integrator And Computer), was commissioned
and built in the USA. While Colossus remained totally hidden, ENIAC, operational by
1944, apparently snapped up worldwide patents to digital computing devices. Many
textbook authors, not knowing anything about Colossus, have hailed ENIAC as the first
modern computer. In fact, apart from being operational earlier, Colossus, being binary,
was more like today’s computers than ENIAC, which was decimal. However, neither
were easily reprogrammable, requiring adjustments to switch settings and change wire
plug positions, respectively.
Amazingly, Charles Babbage’s analytical engine of over a century earlier, being
digital rather than analogue and fully programmable, was in some ways more advanced
than these first electronic computers. Babbage even designed a printer peripheral that
could literally ‘write out’ the results of numerical computations. Babbage’s machine
also had a full programming language that could handle loops and conditional branch-
ing. This led Babbage’s friend, Ada Byron, Countess of Lovelace (the child of famous
poet Lord Byron), who worked on the machine, to write the world’s first computer
program. Possibly the first and last time in history that poetry and programming came
together.
Between the difference engine and Colossus, the computing field was not totally
deserted: German Konrad Zuse had an electrical computer working around 1940/1941,
based on relays (therefore classified as electrical rather than electronic). Another cred-
itable early attempt at building an electronic computer was the construction of the
Atanasoff-Berry machine at Iowa State College, USA in 1941. Although not program-
mable and plagued by unreliability, this demonstrated several early concepts and
undoubtedly advanced the state of the art in computing.
The advent of the transistorised computer is a similar area of confusion. The
transistor, invented at Bell Labs, USA in 1948, was low power and small-—ideal character-
istics for building a computer (although the early transistors were somewhat less reliable
than valves¹). The first transistor-based machine was actually Manchester University’s
Transistor Computer running in 1953, although several texts again mis-attribute this
honour to the TX-0 at Massachusetts Institute of Technology, USA in 1956.
Finally, confusion reigns over the first stored-program computer (as opposed to the
ones programmed by plugging wires in different holes or flipping switches). This was
probably Manchester University’s Small-Scale Experimental Machine or SSEM (known
affectionately as the ‘Baby’), which successfully ran a stored program in 1948.
Another early stored-program computer, Maurice Wilkes’ EDSAC (Electronic De-
lay Storage Automatic Calculator), began operation at Cambridge University in May
1949. The equally famous US Army EDVAC (Electronic Discrete Variable Automatic
Computer) machine was also a stored-program binary device of the same era, al-
though it was not operational until 1951-1952 (despite construction starting in 1944).

¹ Glass thermionic valves containing tiny filament electrodes in a partial vacuum were the basic logic
switches used in most early computers. Valves are known as ‘vacuum tubes’ or simply ‘tubes’ in
North America. Interestingly, although they are now defunct in computing, today they are
sought-after items for very high-end audio amplification equipment.

Table 1.1: Prominent machines in the evolution of computer technology.

Year   Location     Name                      First
1834   London       Difference engine
1943   Bletchley    Colossus                  Electronic computer
1948   Manchester   SSEM (Baby)               Stored-program computer
1951   MIT          Whirlwind 1               Real-time I/O
1953   Manchester   The transistor computer   Transistorised computer
1971   California   Intel 4004
1979   Cambridge    ZX-79                     Cheap home computer
1981   New York     IBM PC
1987   Cambridge    Acorn A400 series         High-street RISC PC
1990   New York     IBM RS6000                Superscalar RISC computer
1998   California                             Computer based on a language

Clearly then, given the three areas of confusion, the history of computers is not
as straightforward as it seems. Manchester University played a prominent but very
low-key role and has been overlooked by many computer historians. Manchester also
produced the world’s first commercial computer, the Ferranti Mark 1 in 1951,² but
ultimately, the computer business became centred elsewhere.

Table 1.1 identifies a handful of world firsts in computing, along with the year
they were reported to have become operational.

The table shows the progression in computer technology and goes a long way
towards explaining how today’s computer is very much evolutionary rather than rev-
olutionary, although one wonders what happened to the 1960s.

² The Ferranti Mark 1 was followed closely by the LEO computer (which was derived from EDSAC),
running accounting programs for the ubiquitous Lyons Tea Houses from Spring 1951 onwards.

1.3 Computer Generations

Sometimes computers, just like humans, are described in terms of their generation.
This is a classification built up over the years, based mostly around the construction
method, computing logic devices and usage of computers.

Anyone who saw computer magazine advertisements in the 1980s may remember
how manufacturers cashed in on these generations and repeatedly advertised new
products as fifth generation. Thankfully this practice has abated, and it seems that, in
terms of generations at least, the computing world is going through a plateau at the
moment. In the following sections, we will examine the five generations of computers.

1.3.1 First Generation


• Based on vacuum tubes, usually occupying an entire room.
• Short MTBF (Mean Time Between Failures); only a few minutes between failures.
• Used base-10 arithmetic.
• Programming may be by switch or cable, or hard wired.
• No programming languages above basic machine code.
• Many were stored program. Introduction of von Neumann architecture.

The best known example, the ENIAC, consumed over 100 kW of power yet could
only deliver around 500 additions per second. This monster used 18,000 valves, weighed
30 tonnes and occupied 1300 square metres. The user interface (typical for machines
of this generation) is shown in Figure 1.3. ENIAC was designed by the US Army for
solving ballistic equations as a means of calculating artillery firing tables.
The Colossus computer was equally vast and was dedicated — at least in its early
years — to code breaking: number crunching that broke the powerful and secret Enigma
code, contributing to the Allied victory in the Second World War. However, it is sad
that one of the first German messages decoded was something like ‘we’re going to
bomb Coventry’. Not wanting to alert the enemy that the code had been cracked, the
government decided not to warn the inhabitants, many of whom were later killed or
injured as the bombs rained down over that city.

Figure 1.3: Two women operating the ENIAC’s main control panel (US Army photo).

1.3.2 Second Generation


• Transistor-based, but still heavy and large.
• Much better reliability.
• Generally used binary logic.
• Punched card or tape used for program entry.
• Support for early high-level languages.
• Often bus-based systems.

The CDC6000 of the time was renowned for its intelligent peripherals. But it
is another example, the PDP-1 with 4Ki words of RAM running at up to 0.2 MHz,
that is perhaps the best known. This remarkable machine led the now sadly defunct
Digital Equipment Corporation (DEC) to prominence. The PDP-1 was available at a
price tag of around US$100k, but had available an impressive array of peripherals:
light pen, EYEBALL digital camera, quadrophonic sound output, telephone interface,
several disc storage devices, a printer, keyboard interface and a console display. The
PDP-1 with several of its peripherals is shown occupying almost an entire room in
Figure 1.4.

Figure 1.4: PDP-1 (photograph courtesy of Lawrence Livermore National Laboratory
and found on www.computer-history.info).

1.3.3 Third Generation

• Utilised integrated circuits.
• Good reliability.
• Emulation possible (microprograms).
• Multi-programming, multi-tasking and time sharing.
• High-level languages common, some attempts at user interface design.
• Use of virtual memory and operating systems.

The very popular and versatile IBM System/360 boasted up to 512 kibibytes of 8-bit
memory and ran at 4 MHz. It was a register-based computer with a pipelined central
processing unit (CPU) architecture and memory access scheme that would probably
appear familiar to programmers today. IBM constructed many variants of the basic
machine for different users, and most importantly opted for a microcode design that
could easily emulate other instruction sets: this guaranteed backwards compatibility
for users of second generation computers (users who had invested very significant
sums of money in their machines). Modified and miniaturised, five of these computers
perform number crunching in the NASA space shuttles.
Although not quite room-sized, the basic S/360 was still a physically large device
as Figure 1.5 illustrates.

1.3.4 Fourth Generation


• Used VLSI (very large-scale integration) integrated circuits.
• Highly reliable and fast.
• Possible to integrate the entire CPU on a single chip.
• DOS and CP/M operating systems and beyond.
• These are today’s computers.

Examples are profuse, including all desktop and notebook computers. The Phoebe,
a culmination of Acorn’s innovative RISC-based architecture and advanced windowing
operating system, is shown in Figure 1.6. Sadly, the company did not survive long
enough to market this machine — perhaps a consequence of making the machine bright
yellow. Apple, by contrast, displayed more marketing genius by originally releasing
their 333 MHz iMac with a choice of five flavours (colours), although more recently
they have reverted to an all-white, black or aluminium product line-up (some of the
newer range of iMacs are shown in Figure 1.7).

Figure 1.5: IBM System/360 (photograph by Ben Franske, from the Wikipedia IBM
System/360 page).

Figure 1.6: Acorn Phoebe (picture from publicity material © Acorn Computers,
1998—http://acorn.chriswhy.co.uk/AcornPics/phoeb2.html).

1.3.5 Fifth Generation


• Natural interaction between humans and computers.
• Very high-level programming languages — maybe even programming in English.
• May appear intelligent to the user.

Figure 1.7: The Apple iMac range: stylish and user-friendly machines running a
reliable UNIX-based operating system (photograph courtesy of Apple).

There are no confirmed examples at the time of writing. When such examples
arrive, it is quite possible that there will be nothing worth photographing: hundreds
of tiny embedded computers distributed around us and not a beige (or yellow) box in
sight.
Not really fifth generation, but the selection of the desirable and well-engineered
Apple iMac computers (see Figure 1.7) may indicate the future: stylish and user-centric
machines. Or, perhaps it is Apple’s smaller but equally desirable iPhone (shown in
Figure 1.9), reputed to contain eight separate ARM processor cores, or their equally
impressive iPad, that will herald the coming of the fifth generation?

1.4 Cloud, Pervasive, Grid and Massively Parallel Computers


Consider the history of computers. In the beginning these were room-sized machines,
whether mechanical or electrical, serviced by a dedicated staff of technicians. Relentless
technological progress allowed electrical valve-based hardware to be replaced with
smaller transistors. The room-sized computer started to shrink. Integrated circuits were
then invented to carry multiple transistors, starting with hundreds, then thousands
and beyond. The 8-bit MOS Technology Inc./ Rockwell 6502 processor released in 1975
contained around 4000 transistors in a 40-pin dual in line package (DIP). By 2008, Intel
had reached 2 billion transistors on a single chip.
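
As a rough back-of-the-envelope illustration, using only the two transistor counts quoted above, this corresponds to a doubling of transistor count roughly every couple of years:

\[
\frac{2 \times 10^{9}}{4000} = 5 \times 10^{5} \approx 2^{18.9},
\qquad
\frac{2008 - 1975}{18.9} \approx 1.7 \text{ years per doubling,}
\]

which sits comfortably alongside the roughly two-year doubling period popularly associated with Moore’s Law.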
The story thus far has been room-sized computers shrinking, first to several
refrigerator-sized units, then to a single unit. Further shrinkage into a desktop box her-
alded the era of the personal computer (PC). PCs in turn became smaller. ‘Luggables’
appeared in the early 1980s, then portables, laptops, notebooks and palm computers.
Today, it is possible to purchase a fully embedded computer with sensors and CMOS
camera within a capsule that can be swallowed to aid in medical diagnosis.
So is this a story of one-way miniaturisation? Well, the answer has to be ‘no’ because
computers have also become larger in some respects. The benefits of networking, such
as Internet access, allow computers to easily link up, and potentially to share compu-
tation resource between themselves. What were once single computing jobs can now
be parallelised across multiple computing elements or computer clusters, even in geo-
graphically diverse configurations (this type of massive parallelism will be discussed in
Section 9.3).
So, given that the tasks that we need to have performed can either be executed on a single small box or spread around and shared among several machines surrounding us (including embedded ones), the question becomes, how do we define 'a computer' — is it the box itself, or is it the 'thing' that executes the task?

Fifty years ago, it was easy to define because 'the computer' was in the computer room. Today, a single beige box resting on my desk may well contain two or more CPUs, each of which may contain several computing cores, and yet I refer to that box in the singular as my 'computer'. When I perform a web search query, it will probably be sent to Google where it is processed by a 'server farm' containing upwards of 10,000 computer elements (each one just like my desktop PC). When such a server farm co-operates to perform processing, it is classed as a supercomputer, again in the singular.

Figure 1.8  The beautiful MareNostrum installation developed by the Barcelona Supercomputing Center in the Torre Girona chapel (picture courtesy of Barcelona Supercomputing Center, www.bsc.es).

So the computer has become large again, and yet consists of many smaller individual computing elements. One leading example of a large collection of computers working together is the Barcelona Supercomputer, the MareNostrum, installed in the Torre Girona chapel in Barcelona, and shown in Figure 1.8.

1.5 Where To From Here?

The process of miniaturisation is set to continue. More and more products, devices and systems contain embedded computers and there is no sign that this trend will die out. Computer speeds also will continue to increase. After all, there is a pretty amazing track record to this: consider the numbers in Table 1.2, showing how computers have progressed in speed since the earliest days — remembering of course that the various definitions of the word 'computer' have changed several times throughout.

Pause for a moment and consider the sheer magnitude of this progress. In almost no other sphere of life can we see such an incredible, and sustained, performance improvement. Given this track record, we can probably safely leave the miniaturisation and performance improvement process to major industry players such as ARM, Intel and AMD.

Table 1.2  The amazing progression of computer calculating speeds from the earliest days (data provided courtesy of Professor Jack Dongarra, University of Tennessee, USA).

    Year    Floating point operations per second, FLOPS
    1941    1
    1945    100
    1949    1000 (1 KiloFLOPS, kFLOPS)
    1951    10,000
    1961    100,000
    1964    1,000,000 (1 MegaFLOPS, MFLOPS)
    1968    10,000,000
    1975    100,000,000
    1987    1,000,000,000 (1 GigaFLOPS, GFLOPS)
    1992    10,000,000,000
    1993    100,000,000,000
    1997    1,000,000,000,000 (1 TeraFLOPS, TFLOPS)
    2000    10,000,000,000,000
    2007    478,000,000,000,000 (478 TFLOPS)
    2009    1,100,000,000,000,000 (1.1 PetaFLOPS)

Or can we? Despite the miniaturisation, we have seen that (super) computers are
getting bigger — and more power hungry. Parallel computing has emerged as the main
technique of choice in building the world’s fastest computers. The days of a central
computer facility, the mainframe, could well be returning. The difference being that the
mainframe may now be located in a different country to its users, with mixed wireless
and Internet accessibility to those users. Perhaps the mainframes should be located in
cold countries where excess heat can go towards warming nearby homes?
Since the technology to separate bulk computing from the point at which that computing power is needed already exists and, with the possible exception of wireless connectivity, is now mature, the controlling factors in the continued advance of this model are services and software.
However, this does not mean that it is time to abandon the advance and improvement of computers and their architecture (which would mean you can stop reading here), but it does mean that the focus may change. From big and powerful to small and low power. From large-scale number crunching to embedded and application specific.
Returning to the educational aims of this book for a moment, engineers work-
ing on computer systems have traditionally asked questions such as ‘what processor
shall I use in my system?’ and ‘how do I get this processor to work in my system?’
This book provides the background necessary to enable answers to be found to both
of these questions. In addition, it allows new questions to be asked, and answered,
such as: ‘Should I create a new processor specifically for my system, and if so, how?’
or ‘Should I use a simple CPU and connect to a remote server, or do all processing
internally?’
That computing is now primarily an embedded engineering discipline, despite the
existence of many huge supercomputers like the MareNostrum, is due to the pervasive-
ness of computer technology within embedded and consumer devices. Consider the
case of the iPhone, shown in Figure 1.9, which reportedly contains something like nine
separate microprocessors, with eight of them ARM-based. So in answer to the question
of where to from here, we can predict two ongoing trends: towards fewer but bigger
clusters of large computers, and towards more and smaller personalised computing
devices.
Also, it would probably help your career prospects to learn a little about the ubiq-
uitous ARM along the way.

Figure 1.9  The Apple iPhone, reputed to contain eight separate ARM processors in a sleek body with integral touchscreen (photograph courtesy of Apple).


1.6 Summary
You, the reader, may not build the world’s fastest supercomputer (or maybe you will,
who knows?), but hopefully you will be designing or programming some amazing
embedded systems in future.
This chapter has presented a historical perspective of computing: relentless
forward progress, many huge leaps in technology and understanding, but millions
of small incremental improvements. Isaac Newton famously remarked in a letter to his rival Robert Hooke that, 'if I have seen further it is by standing on ye shoulders of Giants'.
This could not be more true of most computer designers. You cannot really get
closer to standing on the shoulders of giants than when you use an existing computer
to design the next one!
With this perspective behind you, and confident of ongoing future progress in this
field, it is now time to learn the techniques (and some secrets) from the designers of
the computing systems of the past few decades. The following chapters will begin this
process by covering basic and foundational techniques, before considering speed-ups
and performance enhancing techniques of computers — whether desktop machines or
embedded systems. Later, we will spend more time investigating embedded systems
themselves, even taking the opportunity to build our own embedded CPU. Finally,
we will look further into the future to try and identify some promising, but unusual,
techniques on the horizon of the computing world.
CHAPTER 2

Foundations

This chapter introduces the background information needed to appreciate


the design of a modern central processing unit (CPU). We will consider for-
malised methods of computer organisation and classification, define many
of the terms used to describe computer systems, discuss computer arith-
metic and data representation, and look at a few of the structural building
blocks we will encounter later when analysing computer systems.

2.1 Computer Organisation
What does a computer consist of? How are the elements connected? In
order to answer these questions, we need to first recognise that there
exists a vast range of possibilities inherent in the structure of a computer.
Looking at some of today's desktop computers, many of the peripheral elements traditionally connected around a CPU are subsumed within the same integrated circuit (IC) package; this would not be recognisable as a computer to the early pioneers. However, the main historic computer elements are usually still present — even if they are not at first immediately
identifiable. In embedded systems the trend is more apparent — system-
on-chip (SoC) processors that integrate almost all required functions on a
single chip are now predominant.
Secondly, not all computers are organised in the same way, or have the
same requirements. After all, they could range in size from a room-sized
supercomputer, to a wristwatch-based personal digital assistant (PDA) or
smaller.
Despite the range of possibilities, most systems comprise functional blocks with a degree of similarity. The placement of these blocks inside or outside the CPU chip is a design or cost consideration, and the interconnections between them (both internal and external) are generally parallel buses, the width and speed of which are also design or cost considerations. There may be multiple copies of each functional block present or multiple interconnections between some blocks.

With such variety, there is a need to classify the range of architectural possibilities in some way. It was Michael Flynn who first devised a comprehensive classification scheme for describing such systems in 1966.

Figure 2.1  An illustration of Flynn's taxonomy of SISD, SIMD, MISD and MIMD processing. These four classifications show the relationship between instructions and data being acted upon at a snapshot in time (because the taxonomy actually refers to streams of data and instructions rather than individual items).

2.1.1 Flynn's Classification


The widely used Flynn's Classification scheme categorises computers based on the number of instruction streams and the number of data streams that are present.

An instruction stream can be thought of as a command to a data processing unit to modify data (in a data stream) passing through the unit. This is represented diagrammatically in Figure 2.1 which shows four examples of different connection arrangements. These are namely:

• Single instruction, single data stream (SISD) — A traditional computer containing a single CPU receiving its instructions from a stored program in memory and acting on a single data stream (shown in this case as one instruction acting upon one item of data).
• Single instruction, multiple data streams (SIMD) — A single instruction stream acting on more than one item of data. For example, given the numbers 4, 5 and 3, 2, a single instruction to perform two separate additions of 4 + 5 and 3 + 2 would be SIMD. An example of this arrangement is an array or vector processing system which can perform identical operations on different data items in parallel.
• Multiple instruction, single data stream (MISD) — A rare combination of overspecified multiple instructions acting on a single data stream. This redundancy could possibly be useful in fault-tolerant systems.
• Multiple instruction, multiple data streams (MIMD) — These systems are arranged similarly to multiple SISD systems. In fact, a common example of an MIMD system is a multi-processor computer such as the Sun Enterprise servers.

Although Flynn originally designed his taxonomy to describe processor-level arrangements, the same considerations can equally be applied to units within a processor. For example, Intel's multimedia extensions (MMX), found on Pentium processors and extended later as streaming SIMD extensions (SSE), are an example of a SIMD arrangement. They allow a single instruction to be issued which can cause an operation on multiple data items (such as eight simultaneous additions on different pairs of data). We will cover MMX along with SSE later in Section 4.7.

2.1.2 Connection Arrangements


Another common description of processor architectures is based on whether the pro-
gram instructions and data are handled together or separately.

• Von Neumann systems are those that share resources for storage and transfer of
data and instructions. Many modern computers fall into this category by virtue of
storing programs and data in shared memory, and using a single bus to transfer
them from memory to the CPU. Shared bus bandwidth tends to mean that such a
system has limited performance, but its advantages are simpler design and lower
cost.
• Harvard architecture systems have separate data and instruction storage and trans-
fer. Since instruction and data transfer can be simultaneous, such systems can offer
high performance.
• Other architectures include systems with multiple dedicated buses (such as the ADSP2181 internal buses), shared data/instruction address bus but separate data buses, or similar. Chapter 4 will introduce and explain internal bus arrangements further.

Some CPUs such as the DEC/Intel StrongARM are advertised as being Harvard architecture, although they interface to shared memory via a single bus. In this case, the StrongARM is a Harvard architecture internally because it contains separate blocks of internal data and instruction cache memory, although it has an external von Neumann connection arrangement.

Figure 2.2  A layered view of computer organisation and structure. Successive layers run from binary valued logic (layer 0), through the CPU microarchitecture (layer 1) and the CPU instruction set (layer 2), up to the higher software layers; the connections between layers are hardware execution, hardware decode and interpretation, BIOS calls, OS APIs and SWIs, translation through assembly, and translation through compilation.

2.1.3 Layered View of Computer Organisation


It is sometimes useful to consider a computer system as a number of interlinked layers.
This is illustrated in Figure 2.2 in which the operation of connecting between layers is
described, also as a hierarchy of operations.
From the bottom up, any CPU can be viewed as a collection of gates performing
logical operations. These logical operations must be controlled to perform the required
function by microprograms or a state machine, where the sequence of micro-operations
is specified by one or more instructions from the instruction set. Instructions are issued
either directly from a user program or from predefined basic input output system (BIOS)
or operating system functions.
Interestingly, this layer-like model is a reflection of the Open Systems Interconnection (OSI) model, applied to computer hardware and software (the OSI model is covered in Appendix B).
2.2 Computer Fundamentals
The computer systems described in this book, such as the SISD machine discussed in Section 2.1.1, generally comprise a number of discrete functional units interconnected by buses. Some of these units will now be briefly introduced, before being covered in detail in subsequent chapters:

• Central processing unit (CPU) — The part of a computer that controls operation through interpretation of instructions and through built-in behaviour. It handles input/output functions and performs arithmetical and logical operations on data (in other words, contains an ALU). In recent times, CPU has begun to refer to a physical IC which, in some cases, actually contains all parts necessary to function as a standalone computer.
• Arithmetic logic unit (ALU) — This component of the CPU performs simple arithmetic and logical operations such as add, subtract, AND, OR. It is an asynchronous unit which takes two data inputs from parallel connected registers or bus(es) and outputs either directly to a register or is connected through a tristate buffer to a bus. In addition, it has a control input to select which function to perform, and interfaces to a status register. It handles fixed point binary (and occasionally BCD) numbers only and is located on-chip in modern processors.
• Floating point unit (FPU) — Either an on-chip unit or an external co-processor, it performs arithmetic on floating point numbers. The particular floating point format supported in most modern FPUs is called IEEE754. It is usually comparatively slow (it can take tens or hundreds of instruction cycles to perform a calculation) and its interface is to the main CPU through special floating point registers.
• Memory management unit (MMU) — This component provides a layer of abstraction between how the processor addresses memory and how that memory is physically arranged. This abstraction is termed virtual memory. The MMU translates a virtual address that the processor needs to access into a real physical address in memory. The processor typically sees a large linear continuous address space in memory, with the MMU hiding a physical memory organisation which may be of different sizes (larger or smaller), non-continuous or consisting partly of RAM and partly of hard disc storage.

In addition, there are a number of items that we will include in our discussion that are
useful to define now, prior to being covered in detail later:

• Register — On-chip¹ storage locations that are directly wired to internal CPU buses to allow extremely fast access (often in one instruction cycle). The distinction blurs between this and on-chip memory for some CPUs and the stack in the picoJava II processor.
• Tristate buffer — A device to enable or disable driving a bus. It is usually placed between a register and a bus to control when the bus will be driven by that register. The first two states are when the tristate drives the bus voltage to be either logic high or logic low; the third (tri-) state is high impedance, meaning that the device does not drive the bus at all.

¹ Originally, these were separate hardware devices, but are now exclusively incorporated on-chip for convenience and access speed reasons.

• Complex Instruction Set Computer (CISC) — Think of any useful operation and directly insert this into the CPU hardware. Do not worry how big, power hungry or slow this will make the CPU; you will end up with a CISC machine. Early VAX machines reputedly included instructions that could take over 2000 clock cycles to execute.
• Reduced Instruction Set Computer (RISC) — CPUs are limited by their slowest internal components and by silicon size. Based on the premise that 80% of instructions use only 20% of execution time and the remaining 20% use up 80% of the chip area, CPUs are reduced to contain the 80% most useful instructions. Sometimes a working definition of RISC means 'supporting a set of less than 100 instructions'. It is also significant to note an emerging trend where a RISC CPU core emulates a CISC machine.
• Instruction cycle — This refers to the time taken to fetch an instruction, decode it, process it and return the result. This may be one or more periods of the main clock cycle (derived from an external oscillator). For RISC processors, instructions typically execute in a single clock cycle. For CISC processors, some instructions take a lot longer.
• Big or little endian — Big endian means that the most significant byte is presented first. It is used in processors such as 68000 and SPARC. Little endian means that

Box 2.1  Worked endiness example 1

Q. Given a 32-bit word stored in a 16-bit architecture memory system as shown below, and given that the stored word is made up of least significant byte (LSB), second byte (B1), third byte (B2) and most significant byte (MSB), is the following a little or big endian representation?

    1 |  MSB  |  B2   |
    0 |  B1   |  LSB  |
      15.....8 7.....0

In the diagram, the memory line (in 16-bit words) is given on the left, and the bit positions are shown below.
A. Checking for little endian first, we identify the lowest byte-wise memory address and count upwards. In this case, the lowest address line is 0 and the lowest byte starts at bit 0. The next byte up in memory starts at bit 8 and is still at line 0. This is followed by line 1 bit 0 and finally line 1 bit 8. Counting the contents from lowest byte address upwards, we get {LSB, B1, B2, MSB}. Since this order DOES follow the least-to-most byte format it must be little endian.

Box 2.2  Worked endiness example 2

Q. A 32-bit word is stored as shown below. Is this a little or big endian representation?

    |  LSB  |  B1   |  B2   |  MSB  |
    31....24 23...16 15....8 7.....0

A. First identify the lowest byte-wise memory address. This is clearly address line 0, starting at bit 0. Next is address line 0, bit 8 and so on. Counting from least to most and writing out the contents we get {MSB, B2, B1, LSB}. This order does NOT follow the least-to-most byte format, so it is not little endian. Therefore it must be big endian.

the least significant byte is presented first, as used by the Intel x86 family. Some
processors (such as the ARM7) allow for switchable ‘endiness’.
Unfortunately, endiness is complicated by the variable memory-width of
modern computers. It was easier when everything was byte-wide, but now there is
an added dimension of difficulty. Given an unknown system, it is probably easier
to check first whether it is little endian, and if not, classify it as big endian, rather
than working the other way around. Boxes 2.1, 2.2, 2.3 and 2.4 explore this issue in
detail.
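In software, the simplest way to discover the endiness of the machine you are running on is to store a known multi-byte value and look at which byte sits at the lowest address. The following C fragment is a minimal sketch of that test (it assumes 8-bit bytes and a 32-bit unsigned int, which is typical but not guaranteed by the C language itself):

    #include <stdio.h>

    int main(void)
    {
        unsigned int word = 0x11223344;               /* MSB = 0x11, LSB = 0x44         */
        unsigned char *byte = (unsigned char *)&word; /* view the same storage bytewise */

        /* Little endian places the LSB at the lowest byte address;
           big endian places the MSB there instead.                 */
        if (byte[0] == 0x44)
            printf("little endian\n");
        else if (byte[0] == 0x11)
            printf("big endian\n");
        else
            printf("unusual byte ordering\n");
        return 0;
    }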

Box 2.3  Worked endiness example 3

Q. Given the memory map shown below, write in the boxes the 32-bit number represented by MSB, B1, B2 and LSB bytes using a little endian representation.

    28 |       |       |       |       |
    24 |       |       |       |       |
    20 |       |       |       |       |
       0......7 8....15 16...23 24...31

A. Little endian is always easier: its LSB is at the lowest byte-address and then we count upwards in memory to the MSB. First, we need to identify the location of the lowest byte-address in memory. In this case, note the bit positions written along the bottom — they start from left and increment towards the right. Lowest address of those shown is therefore address 20 and bit 0. Next byte will be address 20, bit 8 onwards and so on. The end result should then be:

    28 |       |       |       |       |
    24 |       |       |       |       |
    20 |  LSB  |  B1   |  B2   |  MSB  |
       0......7 8....15 16...23 24...31

Note: Look also at the addresses. Instead of being consecutive locations, incrementing by one each line (as the other examples showed), these addresses jump by 4 bytes each line. This indicates that memory is byte-addressed instead of word-addressed. This is typical of ARM processors which, despite having a 32-bit wide memory, address each byte in memory separately.

Box 2.4  Worked endiness example 4

Q. Given the memory map shown below, write in the boxes the 16-bit number represented by MSB and LSB bytes using a big endian representation.

    50 |       |
    51 |       |
    52 |       |
       7......0

A. Again, we need to identify which is the lowest byte address in the memory pictured, and then place the MSB there since we are big endian. In this case, the memory map is written from the top down — a common format from some processor manufacturers. The top position is the lowest address, and we count downwards. Since memory is byte-wide, this is relatively easy. The answer is thus:

    50 |  MSB  |
    51 |  LSB  |
    52 |       |
       7......0

Box 2.5  What is a number format?

We are all aware of decimal format, either integer as in the number 123 or fractional as in 1.23, which are both examples of base 10.
In fact, there are an infinite number of ways to represent any number (an infinite number of different bases), but only a few of these are common. Apart from decimal, the hexadecimal format (base 16) is used frequently in software, as is binary (base 2) in hardware, which we employ in all examples here.

2.3 Number Formats

Modern computers are remarkably homogeneous in their approach to arithmetic


and logical data processing: most utilise the same number format and can be
classified by how many bits of data are operated on simultaneously (known as the
data width). They even tend to employ similar techniques for number handling.
This was not the case in early computers where a profusion of non-standard data
widths and formats abounded, most of which are purely of historical significance
today.
It could be argued that seven (or fewer) binary number formats remain in use
today. Box 2.5 discusses exactly what constitutes a number format, but in order for us
to consider processing hardware later in this chapter, it is useful to now review the main
formats that we will be encountering as we progress through the book.

2.3.1 Unsigned Binary


In unsigned binary, each bit in a data word is weighted by the appropriate power of two corresponding to its position. For example, the 8-bit binary word 00110101b is equivalent to 53 decimal. The trailing b is sometimes present to indicate it is a binary number; reading from the right to the left, the number evaluates to

    1(2^0) + 0(2^1) + 1(2^2) + 0(2^3) + 1(2^4) + 1(2^5) + 0(2^6) + 0(2^7)

In general, the value v of an n-bit binary number x, where x[i] is the i-th bit reading from the right to the left, starting from bit 0, is

    v = Σ_{i=0}^{n-1} x[i] · 2^i
The unsigned binary format is easy for humans to read after a little practice, and
is handled efficiently by computer.

2.3.2 Sign-Magnitude
This format reserves the most significant bit (MSB) to convey polarity (called the ‘sign
bit’), and then uses unsigned binary notation for its remaining least significant bits to
convey magnitude. By convention, an MSB of 0 indicates a positive number while an
MSB of 1 indicates a negative number.
For example, the 4-bit sign-magnitude number 1001 is -1 and the 8-bit number 10001111b is equivalent to -(8 + 4 + 2 + 1) = -15 decimal.

2.3.3 One's Complement


This format has largely been replaced by two’s complement but can still occasionally
be found. Again, the MSB conveys polarity while the remaining bits indicate magnitude.
However, if the number is negative (i.e. the sign bit is 1), the polarity of the magnitude
bits is reversed.
For example, the 8-bit one's complement number 11110111 is equal to -8 decimal.

2.3.4 Two’s Complement


This is undoubtedly the most common signed number format in modern computers.
It has achieved predominance for efficiency reasons: identical digital hardware for
arithmetic handling of unsigned numbers can be used for two’s complement num-
bers. Again, the MSB conveys polarity, and positive numbers are similar in form to
unsigned binary. However a negative two’s complement number has magnitude bits
that are formed by taking the one’s complement and adding 1 (Box 2.6 provides a binary
example of this method which is literally ‘taking the two’s complement of a number’).
For example, the 4-bit two's complement number represented by binary digits 1011 is equal to -8 + 2 + 1 = -5 decimal, and the 8-bit two's complement number 10001010 is equal to -128 + 8 + 2 = -118 decimal.
It is undoubtedly harder for humans to read negative two's complement numbers than some of the other formats mentioned above, but this is a small price to pay for reduced hardware complexity. Box 2.7 provides some examples of two's complement number formation, for both positive and negative values.
2)
2.3.5 Excess-n

This representation will crop up later when we discuss floating point. In this format, a number v is stored as the unsigned binary value v + n. An example is
Foundations

Box 2.6  Negative two's complement numbers

Negative two's complement numbers can be easily formed in practice by taking the one's complement of the binary magnitude then adding 1. As an example, suppose we wish to write -44 in 8-bit two's complement:

    Start by writing +44 in 7-bit binary:                010 1100
    Next, flip all bits (take the one's complement):     101 0011
    Add 1 to the least significant bit position:         101 0100
    Finally, insert the sign bit (1 for negative):      1101 0100

If you are not used to writing binary numbers, try to write them in groups of 4. That way it is easier to line up the columns, and it aids in the conversion to hexadecimal (since a group of 4 bits corresponds to a single hex digit).
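The 'flip the bits then add 1' recipe in Box 2.6 maps directly onto two machine-level operations. As a minimal C sketch (assuming the usual 8-bit byte), the same -44 example can be reproduced as follows:

    #include <stdio.h>

    int main(void)
    {
        unsigned char x = 44;                          /* +44 = 0010 1100              */
        unsigned char neg = (unsigned char)(~x + 1);   /* one's complement, then add 1 */

        printf("bit pattern: 0x%02X\n", (unsigned)neg);         /* 0xD4 = 1101 0100    */
        printf("interpreted as two's complement: %d\n",
               (int)neg - 256);                        /* 212 - 256 = -44              */
        return 0;
    }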

Box 2.7  Worked examples of number conversion

Q1. Write the decimal value 23 as a two's complement 8-bit binary number.
A1. We can start by drawing the bit weightings of an 8-bit two's complement number. Starting from the left, we begin with the sign bit:

    -128 | 64 | 32 | 16 |  8 |  4 |  2 |  1

The sign bit is only set if the number we want to write is negative. In this case, it is positive so write a zero there. Next we look at 64. If our number is greater than 64 we would write a '1' here, but it is not so we write a zero. The same goes for 32, so now we have:

       0 |  0 |  0 |    |    |    |    |

Moving on to 16, we find that our number (23) is bigger than 16, and so we subtract 16 from the number to leave 23 - 16 = 7. A '1' goes in the 16 box.
Next, we compare our remainder with 8. The remainder is smaller so a '0' goes in the 8 box. Moving on to 4, our remainder is bigger than this so we subtract 4 to make a new remainder 7 - 4 = 3 and write a '1' in the 4 box. Continuing with 2 and 1, both get '1's in their boxes. The final answer is thus:

       0 |  0 |  0 |  1 |  0 |  1 |  1 |  1

Q2. Write the decimal value -100 as a two's complement 8-bit binary number.
A2. Again looking at the number line above, we realise that, as a negative number, we need a '1' in the -128 box. Doing the sum -100 - (-128) or -100 + 128 leaves a remainder of 28. The rest of the numbers act as normal — a '0' in the 64 box, a '0' in the 32 box, then a '1' in the 16 box. The remainder will then be 28 - 16 = 12. Continuing, there will be a '1' in the 8 box, remainder 4, then a '1' in the 4 box and '0's beyond that:

       1 |  0 |  0 |  1 |  1 |  1 |  0 |  0

Note: The only really easy things to see, at a glance, about two's complement numbers are whether they are negative or not (a '1' in the most significant position) and whether they are odd or not (a '1' in the least significant position).

the excess-127 representation in 8 bits, which can represent any number between —127
and +128 (stored in binary bit-patterns that look like the unsigned values 0 and 255
respectively).
This format can be a little confusing to students. As examples, the 8-bit excess-127
binary number 00000000 equals —127 (which is found by working out the unsigned
binary value, in this case zero, and then subtracting 127 from it). Another example is
11000010 which in binary would be 128 + 64 + 2 = 194, but since it is excess-127 we
subtract 127 from the result to give 194 — 127 = 67 decimal.
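Because the format is just an offset, encoding and decoding excess-127 values is a one-line operation in each direction. The small C sketch below illustrates this (the helper names are ours, purely for illustration):

    #include <stdio.h>

    /* Store a value in the range -127..+128 as an excess-127 bit pattern */
    static unsigned char to_excess127(int v)    { return (unsigned char)(v + 127); }

    /* Recover the signed value from the stored pattern */
    static int from_excess127(unsigned char s)  { return (int)s - 127; }

    int main(void)
    {
        printf("-127 is stored as %d\n", to_excess127(-127));     /* prints 0       */
        printf("0xC2 decodes to %d\n", from_excess127(0xC2));     /* 194 - 127 = 67 */
        return 0;
    }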

2.3.6 Binary-Coded Decimal


Binary-coded decimal (BCD) was used extensively in early computers. It is fairly easy
for humans to read in practice, because each decimal digit (0 to 9) of a number to
be stored in BCD is encoded using a group of four binary digits. Thus, 73 in dec-
imal is stored as 0111 0011 in BCD. Four binary digits can store a value from 0
to 15 so there are also some binary patterns that are not used in BCD. Ultimately,
BCD has been superseded because it is neither efficient in storage nor easy to design
hardware for.
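To see how mechanical the encoding is, the short C sketch below packs a two-digit decimal value into BCD, one 4-bit group per digit (an illustrative fragment only, not a general BCD library):

    #include <stdio.h>

    /* Pack a value 0..99 into BCD: tens digit in the upper nibble,
       units digit in the lower nibble.                             */
    static unsigned char to_bcd(unsigned int v)
    {
        return (unsigned char)(((v / 10) << 4) | (v % 10));
    }

    int main(void)
    {
        /* 73 decimal becomes 0111 0011 in BCD, printed here as 0x73 */
        printf("73 in BCD = 0x%02X\n", (unsigned)to_bcd(73));
        return 0;
    }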

2.3.7 Fractional Notation


This can actually apply to any binary notation (in fact to decimal too — see Box 2.8) but usually applies to unsigned or two's complement numbers within the computer architecture field. It is strictly a conceptual interpretation of the numbers where the usual bit weighting of 2^0 for the LSB, 2^1 for the next bit, 2^2 for the 3rd bit and so on is replaced by a scaled weighting pattern. In some digital signal processing (DSP) circles, fractional notation is described as Q-format. Otherwise, fractional notation binary is typically described as (m.n) format where m is the number of digits before the imaginary radix (in decimal, the radix is known as the decimal point, but when dealing with another number base we cannot refer to it as a 'decimal' point, so we call it the radix) and n is the number of digits after it.

Box 2.8  Is binary a fractional number format?

Remember that there is nothing special about binary — it is simply a way of writing a number in base 2 instead of base 10 (decimal) that we are familiar with.
Just as we can write fractional numbers in decimal (such as 9.54) as well as integers
(such as 19), we can also write any other base number in fractional as well as integer
format. So far, we have only considered integer binary format, however, it is also
important to realise that fractional binary format is used extensively in areas such as
digital signal processing.

Fractional format worked example

2.9 Q: Write the decimal value 12.625 as a (7.9) fractional format two’s complement binary
Box
number.
A: First, start by looking at the bit weightings of the (7.9) format:

AWA
Aina
where the weightings below 1/8 have been removed for space reasons. Next, we realise
that the number is positive, so there is a ‘0’ in the —64 box. We then scan from left to
right in exactly the same way as for a standard two’s complement representation (or
unsigned binary for that matter), using the weights shown above.
It turns out that 12.625 = 8 + 4+ 0.5 + 0.125 and so the result will be:

TPP EPPA P LPP PP)


An example of two 8-bit binary number arrangements, in unsigned and in (6.2) format, is shown below: the weightings run from 2^7 down to 2^0 for unsigned, but from 2^5 down to 2^-2 in (6.2) format, so the same bit pattern that reads as 97 unsigned reads as 97/4 = 24.25 in (6.2) format.

Refer to Box 2.9 for more examples of fractional format numbers in binary.
The beauty of fractional notation applied to unsigned or two's complement numbers is that the values are handled in hardware exactly the same way as the non-fractional equivalents: it is simply a programming abstraction.
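In practice this abstraction amounts to nothing more than an implied scaling by 2^n. The hedged C sketch below converts to and from (6.2) format for unsigned values (the function names are ours, and the conversion simply truncates any precision below 1/4):

    #include <stdio.h>

    #define FRAC_BITS 2      /* the 'n' of the (6.2) format */

    /* Scale up to store, scale down to read back: that is all that
       fractional (m.n) notation really is.                          */
    static unsigned char to_q62(double v)    { return (unsigned char)(v * (1 << FRAC_BITS)); }
    static double from_q62(unsigned char q)  { return (double)q / (1 << FRAC_BITS); }

    int main(void)
    {
        unsigned char q = to_q62(24.25);     /* stored pattern is 97, i.e. 0110 0001 */
        printf("24.25 is stored as %u and reads back as %.2f\n",
               (unsigned)q, from_q62(q));
        return 0;
    }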
2.3.8 Sign Extension

This is the name given to the process by which a signed two's complement number of a particular width is extended in width to a larger number of bits. For example,
=)
converting an 8-bit number to a 16-bit number. While this is done occasionally as an <4
28
Chapter 2

explicit operation specified by a programmer, it is more commonly performed as part


of operations such as addition and multiplication.
Sign extension can be illustrated in the case of moving from a 4-bit to an 8-bit
two’s complement binary number. First, write the 4-bit two’s complement number
1010 in 8-bit two’s complement.
If we are considering signed numbers, we know that the 4-bit number involves bit
weightings of [—8, 4, 2, 1] while the 8-bit weightings are [—128, 64, 32, 16, 8,4, 2, 1]. For
the 4-bit number, the value 1010 is clearly

    -8 + 2 = -6
If we were to simply write the 8-bit value as a 4-bit number padded with zeros
as in 00001010, then, referring to the 8-bit weightings, the value that this represents
would be

8+2=10
This is clearly incorrect. If we were then to note that a negative number requires
the sign bit set and responded by simply toggling the sign bit to give 10001010 then
the value would become
    -128 + 8 + 2 = -118

This is again incorrect. In fact, in order to achieve the extension from 4 to 8 bits
correctly, it is necessary that not only the original MSB must be set correctly, but every
additional bit that we have added (every bit to the left of the original MSB) must also
be set to the same value as the original MSB. The sign bit has thus been extended to
give 11111010 with a value of
    -128 + 64 + 32 + 16 + 8 + 2 = -6
Finally, a correct result is achieved. Another example of sign extension is given in
Box 2.10.
There is evidently no difficulty with positive two’s complement numbers, but the
sign extension rule can still be applied (it has no effect, but makes a hardware design
easier if it applies to all numbers rather than just some).
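The rule is equally simple to express in software: copy the old MSB into every newly added bit position. The C sketch below extends a 4-bit two's complement value, held in the low bits of a byte, to 8 bits (an illustrative fragment with hand-rolled bit manipulation):

    #include <stdio.h>

    /* Sign extend a 4-bit two's complement value (in bits 3..0 of x) to 8 bits. */
    static unsigned char sign_extend_4_to_8(unsigned char x)
    {
        x &= 0x0F;               /* keep only the original 4 bits                   */
        if (x & 0x08)            /* if the old MSB (sign bit) is set...             */
            x |= 0xF0;           /* ...fill every added bit with a copy of that MSB */
        return x;
    }

    int main(void)
    {
        unsigned char y = sign_extend_4_to_8(0x0A);        /* 1010 = -6 in 4 bits           */
        int value = (y & 0x80) ? (int)y - 256 : (int)y;    /* read as 8-bit two's complement */
        printf("0x%02X = %d\n", (unsigned)y, value);       /* prints 0xFA = -6              */
        return 0;
    }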

Box 2.10  Sign extension worked example

Q: Write the value -4 in 4-bit two's complement notation. Copy the most significant bit (MSB) four times to the left. Read off the result as an 8-bit two's complement number.
A: 1100 (-8 + 4 + 0 + 0 = -4).
MSB is 1, so copying this to the left four times gives 11111100.
Reading off in 8-bit signed binary, (-128 + 64 + 32 + 16 + 8 + 4) = -4.
For further thought: Repeat the exercise with a positive number such as 3. Does the method still apply equally for positive numbers?

2.4 Arithmetic
This section considers the hardware capable of performing the addition or subtraction
of two binary numbers. This functionality is used within the arithmetic logic unit (ALU)
in almost all processors, which also handles basic logic functions such as AND, OR,
NOT and so on. The ALU is described as a CPU functional unit later in Section 4.2.

2.4.1 Addition
Binary arithmetic is accomplished bitwise with a possible carry from the adjacent less
significant bit calculation. In hardware, a full adder calculates the addition of two bits
and a carry in and generates a result with an additional carry output.
A full adder is shown symbolically in Figure 2.3, where each arrow represents a
single logic bit. A half adder is similar, but does not have any provision for the carry in.
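The behaviour of one full adder stage can be captured in two logic expressions, one for the sum and one for the carry out. The following C fragment is a behavioural sketch of a single bit position (it models the logic only, not any real hardware timing):

    #include <stdio.h>

    /* One-bit full adder: every argument is 0 or 1. */
    static void full_adder(int x, int y, int cin, int *sum, int *cout)
    {
        *sum  = x ^ y ^ cin;                    /* result bit              */
        *cout = (x & y) | (cin & (x ^ y));      /* carry to the next stage */
    }

    int main(void)
    {
        int s, c;
        full_adder(1, 1, 1, &s, &c);               /* 1 + 1 + 1 = 11 binary     */
        printf("sum = %d, carry = %d\n", s, c);    /* prints sum = 1, carry = 1 */
        return 0;
    }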

2.4.2 The Parallel Carry-Propagate Adder


To create an 8-bit parallel adder, the full adder hardware would typically be repeated
eight times for each of the input bits although the least significant bit position could
use the slightly simpler half adder, as shown in Figure 2.4.
In Figure 2.4, x[7:0] and y[7:0] are the two input bytes and z[7:0] is the output
byte. Cout is the final carry output. For the case of adding unsigned numbers, when
Cout is set it indicates that the calculation has resulted in a number that is too large to
be represented in 8 bits. For example, we know that the largest magnitude unsigned
number that can be represented in 8 bits is 2° —1 = 255. If two large numbers such as 200
and 100 are added together, the result (300) cannot fit into 8 bits. In this case, the carry
would be set on the adder and the result (z) would hold the remainder 300 — 256 = 44.
The topmost Cout therefore doubles as an overflow indicator when adding un-
signed numbers: if it is set following a calculation, this indicates that the result cannot
be represented using the number of bits present in the adder. Some further thoughts on
this are explored in Box 2.11.

Figure 2.3  A full adder, showing two bits being added, together with a carry in, and the output of a single bit with carry.

Figure 2.4  The carry-propagate or ripple-carry adder constructed from a sequence of full adders plus one half adder.

Box 2.11  Exercise for the reader

How does the topmost Cout signal from an adder behave when adding signed two's complement numbers?

1. Try working by hand using a 4-bit adder. With 4-bit two's complement numbers the representable range is -8 to +7.
2. Try adding some values such as 2 + 8 = ?, 2 + (-8) = ?, 7 + 7 = ? and (-8) + (-8) = ?
3. What do you conclude about the Cout signal: does it mean the same for signed two's complement numbers as it does when adding unsigned numbers?

This behaviour and the add mechanism is common to almost any binary adder.
Although the parallel adder appears to be a relatively efficient structure and even works
in a similar way to a human calculating binary addition by hand (or perhaps using an
abacus), it suffers from a major speed limitation that bars its use in most microprocessor
ALUs: carry propagation.
Given that the input numbers are presented to the adder simultaneously, one mea-
sure of the adder speed is the length of time required to calculate the output. Each full
or half adder in the chain is relatively quick: both the carry out and the result will be
available a few nanoseconds after the carry in and input bits are presented (for modern
hardware). The problem is that the least significant half adder (adder 0) must finish
calculating before the next bit calculation (adder 1) can start. This is because adder 1
needs to get the carry from adder 0 before it can complete its calculation, and that carry
is not valid until adder 0 finishes. Adder 1 then supplies its carry to adder 2 and so on.
Further up the chain, adder 6 will only supply its carry to adder 7 a significant length
of time after the input words were first presented to the adder.
A worked example of calculating an entire ripple-carry adder propagation delay
is presented in Box 2.12. It is important because, if such an adder were present in a
synchronous machine, this propagation delay may well be the part of the system that
limits the maximum system clock speed.

2.4.3 Carry Look-Ahead


In order to speed up the parallel adder described above, a method is required to supply
the carry inputs to adders as early as possible.
This is achieved with a carry predictor, which is a piece of combinational logic that
calculates the carry values directly. In fact, it can supply carry values to each adder in the
chain at the same time, with approximately the same propagation delay as a single half
adder. A carry predictor is shown in Figure 2.5 for a 3-bit adder. It is interesting to note
the formation of the logic equations describing the carry look-ahead units (see Box 2.13).
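For reference, the conventional way to write this logic (and a useful head start on Box 2.13) is in terms of generate and propagate signals, Gi = xi AND yi and Pi = xi XOR yi, from which every carry can be formed directly from the inputs. The C sketch below models a 3-bit carry look-ahead adder in this way; it is one common formulation rather than the only possible design:

    #include <stdio.h>

    /* 3-bit carry look-ahead: generate Gi = xi AND yi, propagate Pi = xi XOR yi,
       then each carry comes straight from the inputs and the carry in:
           C1 = G0 | (P0 & C0)
           C2 = G1 | (P1 & G0) | (P1 & P0 & C0)
           C3 = G2 | (P2 & G1) | (P2 & P1 & G0) | (P2 & P1 & P0 & C0)           */
    int main(void)
    {
        int x[3] = {1, 1, 0}, y[3] = {1, 0, 1}, c0 = 0;   /* x = 011b = 3, y = 101b = 5 */
        int g[3], p[3], z[3], i;

        for (i = 0; i < 3; i++) {
            g[i] = x[i] & y[i];
            p[i] = x[i] ^ y[i];
        }
        int c1 = g[0] | (p[0] & c0);
        int c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
        int c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);

        z[0] = p[0] ^ c0;  z[1] = p[1] ^ c1;  z[2] = p[2] ^ c2;
        printf("z = %d%d%d, carry out = %d\n", z[2], z[1], z[0], c3);  /* 3 + 5 = 8: z = 000, carry 1 */
        return 0;
    }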
2.4.4 Subtraction

Similar to addition, subtraction is performed bitwise. But when performing subtraction, do we need to consider the result from neighbouring bits? The answer is yes, but these

Box 2.12  Worked example

Q: The adders and half adders used in a 4-bit parallel carry-propagate adder are specified as follows:
Time from last input bit (x or y) or carry in to result z: 15ns
Time from last input bit (x or y) or carry in to carry out: 12 ns
If input words x[3:0] and y[3:0] are presented and stable at time 0, how long will it be
before the 4-bit output of the adder is guaranteed stable and correct?
A: Starting from the least significant end of the chain, adder 0 receives stable inputs at
time 0. Its result z is then ready at 15 ns and its carry is ready at 12ns. Adder 1 requires
this carry in order to begin its own calculation, so this only starts at 12 ns. It takes until
24ns before it can provide a correct carry result to adder 2 and this will not provide
a carry to adder 3 until 36ns. Adder 3 then begins its calculation. Its output z is then
ready at 51 ns and its carry out is ready at 48 ns. So even though the adders themselves
are fairly quick, when chained, they require 51 ns to calculate the result.
Note: The phrase ‘begins its calculation’ when applied to the full or half adders may
be misleading. They are actually combinational logic blocks. A change of state at the
input will take some time (up to 15 ns in this case) to propagate through to the output.
Since they are combinational logic, they are always ‘processing’ input data and their
outputs are always active. However, from the specification, we know that the outputs
are only guaranteed correct 15ns or 12 ns after the inputs are correctly presented (for
result z and carry out respectively).

are now linked through ‘borrows’ from higher bits, rather than ‘carries’ from lower bits.
This is problematic in the same way as addition.
In terms of computational hardware, a specialised subtracter would be required
if it were not for the fact that addition and subtraction can be interchanged in many

Figure 2.5  The carry look-ahead adder constructed from several full adders and carry predict logic.

Box 2.13  Exercise for the reader

1. Write the logic equation of a single-bit full adder.
2. Extend this to a 3-bit adder as shown above.
3. Re-arrange the equations to give C0 and C1 in terms of the inputs (rather than any carry ins). Note that the number of basic calculations required to give C1 is small, and thus the propagation delay through gates required to do this calculation is also small.
4. Now extend the equations to derive C2. How many calculation steps are needed for this? Is it more than for C1? Can you deduce anything about the scaling of this method to longer adder chains (thinking in terms of propagation delay and also logic complexity)?

number formats. As an example, consider the decimal calculation 99 — 23 = 76 which


can be written in an alternative arrangement as 99 + (—23) giving an identical result.
Although the result is identical, it is achieved by performing an addition rather
than a subtraction, and changing the sign of the second operand. Many commercial
ALUs work in a similar fashion: they contain only adding circuitry and a mechanism to
change the sign of one operand. As we have seen in Section 2.3.4, changing the sign of a
two’s complement number is relatively easy: first, change the sign of every bit and then
add 1 to the least significant bit position. Adding 1 to the LSB is the same as setting the
carry input for that adder to 1.
Needless to say, this is easily achieved in hardware with a circuit such as the sub-
traction logic shown in Figure 2.6. In this circuit, the exclusive-OR gate acting on input
operand y is used to change the sign of each bit (an exclusive-OR acts as a switched
inverter in that if one input is held high, every bit present on the other input will be
inverted, otherwise it will be unchanged). If the circuit is performing a subtraction, the
add/subtract line is held high, one operand is negated and C;,, is also set high — this
has the effect of adding 1 to the least significant bit.
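The same trick is easy to mimic in software: invert every bit of the second operand and force the carry in to 1. The C sketch below does this explicitly for 8-bit operands as an illustration of the circuit's behaviour (it is not intended as sensible application code):

    #include <stdio.h>

    /* Add or subtract 8-bit values using only an adder, an XOR stage and the
       carry in, as in Figure 2.6: to subtract, invert y and force Cin to 1.  */
    static unsigned char add_sub(unsigned char x, unsigned char y, int subtract)
    {
        unsigned char yy  = subtract ? (unsigned char)(y ^ 0xFF) : y;   /* XOR stage       */
        unsigned int  cin = subtract ? 1u : 0u;                         /* forced carry in */
        return (unsigned char)(x + yy + cin);                           /* the adder       */
    }

    int main(void)
    {
        printf("99 - 23 = %d\n", add_sub(99, 23, 1));    /* prints 76  */
        printf("99 + 23 = %d\n", add_sub(99, 23, 0));    /* prints 122 */
        return 0;
    }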

Figure 2.6  Subtraction logic consisting basically of an adder with external exclusive-OR gates.

There is one further area of subtraction that needs to be explored, and that is
overflow: when performing an addition, you will recall that the topmost Cout can
be used to indicate an overflow condition. This is no longer true when performing
subtractions as some examples on 4-bit two’s complement numbers will reveal:
    0010 + 1110 = ?            i.e. 2 + (-2) = ?
    0010 + 1110 = 0000         with Cout set

Clearly, the result should be an easily-represented zero, and yet the Cout signal is
set. Consider another example where we would normally expect an overflow:
    0111 + 0110 = ?            i.e. 7 + 6 = ?
    0111 + 0110 = 1101         Answer = -3 ?

Again, the result should not be —3, it should be 13. Evidently, the circuitry shown is
not sufficient alone, and some account needs to be taken of the values being processed.
The answer is that the sign bits must be examined prior to adding, and the result checked
based on this. This is not computationally hard - a simple look-up table will suffice:
positive + positive = positive
positive + negative = unknown
negative + positive = unknown
negative + negative = negative

For the mixed calculation (one positive and one negative number), the sign of the
answer is unknown, but is not problematic since by definition it can never result in an
overflow (think of it this way: the negative number will reduce the size of the positive
number, but the most it can do would be if the positive number is zero, in which case
the answer is the same as the negative input, and the inputs themselves do not include
carry flags).
For the case of two positive numbers being added, the result sign bit should be 0. If
it is not, then an overflow has occurred. For the case of two negative numbers, the result
sign bit should be 1, and if it is not an overflow has occurred. It can be seen therefore
that the value of Cout alone is not enough to indicate that an overflow has occurred. In
most processors, a separate overflow flag is provided, set through consideration of the
sign bits as we have seen. Consider the worked example in Box 2.14.
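That sign-bit check is only a few gates in hardware, and only a few operators in software. The C sketch below flags two's complement overflow for an 8-bit addition by comparing operand and result sign bits (one common formulation; real processors often use the equivalent test on the carries into and out of the MSB):

    #include <stdio.h>

    /* Two's complement overflow on an 8-bit addition: it can only occur when both
       operands have the same sign and the result's sign differs from theirs.      */
    static int add_overflows(unsigned char x, unsigned char y)
    {
        unsigned char z = (unsigned char)(x + y);
        return ((x ^ y) & 0x80) == 0 &&     /* operands share a sign bit...         */
               ((x ^ z) & 0x80) != 0;       /* ...but the result's sign bit differs */
    }

    int main(void)
    {
        printf("7 + 6       : overflow = %d\n", add_overflows(7, 6));      /* 0: fits in 8 bits */
        printf("100 + 100   : overflow = %d\n", add_overflows(100, 100));  /* 1: 200 > +127     */
        printf("-100 + -100 : overflow = %d\n",
               add_overflows((unsigned char)-100, (unsigned char)-100));   /* 1: -200 < -128    */
        return 0;
    }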

Box 2.14  Exercise for the reader

Try extending the argument in the text to a subtraction. Using 4-bit two's complement signed number format, perform a few additions, then a few subtractions. Verify that all of the subtractions a - b can be performed in binary as a + (-b). Verify that the Cout signal does not indicate an overflow condition.
Perform the additions -5 + -5 and -5 + -1 and look at the sign bit and carry bits of the result. Can you conclude that the Cout signal is useless, or can it be used to increase the bit range of the result?

2.5 Multiplication
In the early days of microprocessors, multiplication was too complex to be performed
in logic within the CPU and hence required an external unit. Even when it was finally
squeezed onto the same piece of silicon, it was a tight fit: the multiply hardware in
early ARM processors occupied more silicon area than the entire ARM CPU core.
In more recent times, however, manufacturers have tuned multipliers to the target
application. For fast real-time embedded processors (perhaps an ARM7 in a GSM cell-
phone handling speech coding), there is a need to perform multiplications as quickly
as possible and hence a fast multiplier will be used. This will evidently occupy a large
silicon area compared to a slower multi-cycle multiplier used on a non real-time pro-
cessor (such as the ARM610 which was designed to power desktop computers in the
early 1990s, and to be the brains of the Apple Newton — the world’s first PDA).
There are many methods of performing the multiplication m x n at various rates
(and with various complexities). Some of the more typical methods are listed here:

1. Repeated addition (add m to itself n times).
2. Add shifted partial products.
3. Split n into a sequence of adds and left shifts applied to m.
4. Booth and Robertson's methods.

Each of these will be considered in the following subsections in turn. There are, of
course, other more esoteric methods as this is an active research area. Interestingly,
some methods may perform estimation rather than calculation, or involve loss of preci-
sion in the result. These would include converting operands to the logarithmic domain
and then adding them, or using an alternative or redundant number format.
Alternative number formats are briefly described in Section 9.5, but when it comes
to hardware for performing binary calculations, there are so many alternatives that it
will be impossible to describe them all.

2.5.1 Repeated Addition


The simplest method of performing a multiplication is one of the smallest in imple-
mentation complexity and silicon area but at the cost of being slow. When multiplying
integers m x n the pseudo-code looks like:

    set register A ← n
    set register B ← 0
    loop while (A ← A − 1) ≥ 0
        B ← B + m

Since this involves a loop that repeats n times, the execution time is dependent on the value of n. However, if n is small, the result, B, is formed early.
If we consider that a 32-bit number can represent an integer with value in excess of two billion, we realise that many iterations of the loop might be necessary: it could imply a rather long execution time.
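A direct C rendering of the repeated addition idea is given below for unsigned operands (purely illustrative); notice that the loop count is the value of n, not the number of bits used to store it:

    #include <stdio.h>

    /* Multiply by repeated addition: add m into an accumulator n times. */
    static unsigned int mul_repeat(unsigned int m, unsigned int n)
    {
        unsigned int b = 0;
        while (n-- > 0)
            b += m;
        return b;
    }

    int main(void)
    {
        printf("9 * 11 = %u\n", mul_repeat(9, 11));   /* prints 99 */
        return 0;
    }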

2.5.2 Partial Products


Instead of iterating based on the magnitude of n (as in the repeated addition method
above), the partial products method iterates based on the number of bits in number n.
Each bit in the number n is examined in turn, from least to most significant. If a
bit is set, then a partial product derived from number m shifted left to line up with the
bit being examined, is accumulated. In multiplier terminology, the two numbers are
termed multiplier and multiplicand although we also know for decimal numbers that it
does not matter which way the multiplication is performed since (m x n) = (n x m).
Here is a partial products example:

         1001        multiplicand 9
         1011        multiplier 11
         1001        (since multiplier bit 0 = 1, write 9 shifted left by 0 bits)
        1001         (since multiplier bit 1 = 1, write 9 shifted left by 1 bit)
       0000          (since multiplier bit 2 = 0, write 0 shifted left by 2 bits)
      1001           (since multiplier bit 3 = 1, write 9 shifted left by 3 bits)
     01100011        result = 99 (sum of the partial products)
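For unsigned operands the whole scheme is only a few lines of code: examine each multiplier bit in turn and, when it is set, accumulate the multiplicand shifted into position. The C sketch below is a minimal illustration that ignores the signed-number complications discussed next:

    #include <stdio.h>

    /* Unsigned partial-product multiplication: for each set bit i of the
       multiplier, accumulate (multiplicand << i). Illustrative only; it
       assumes the operands are small enough for the shifts not to overflow. */
    static unsigned int mul_partial(unsigned int multiplicand, unsigned int multiplier)
    {
        unsigned int result = 0;
        int i;
        for (i = 0; i < 32 && (multiplier >> i) != 0; i++)
            if ((multiplier >> i) & 1u)
                result += multiplicand << i;
        return result;
    }

    int main(void)
    {
        printf("9 * 11 = %u\n", mul_partial(9, 11));   /* 1001b x 1011b = 99 */
        return 0;
    }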

The situation is complicated slightly when it comes to working with two’s comple-
ment signed numbers, firstly in that the most significant bit of the multiplier represents
sign, and secondly in that sign extension must be used (see Section 2.3.4).
For the signed case, all partial products have to be sign extended to the length of the
result (which by default would be the sum of the lengths of the input representations
minus 1 to account for the sign bit, such that a 6-bit signed number plus a 7-bit signed
number would require 12 bits to represent the result).
Since each partial product corresponds to one bit of the multiplier and is shifted to
account for the multiplier bit weighting, the partial product corresponding to the MSB
is a special case: the bit weighting is negative and this partial product must therefore be
subtracted from the accumulator rather than added. This is shown in the flowchart of
Figure 2.7, where it is assumed that the grey-coloured two’s complement accumulate
blocks are able to take account of sign extension.
To understand the process better, it is useful to attempt some simple binary multipli-
cation by hand using those methods; the reader can follow some examples in Box 2.15.
In reality, the accumulation of partial products may be more efficiently performed
in the reverse direction (i.e. looping down rather than looping up). In the best case
this would also remove the need to treat the partial product of the multiplier sign bit
differently (since this is not accumulated, it is merely the value in the accumulator
before additions begin, thus allowing its sign to be negated during the load-in process).
Figure 2.8 illustrates a block diagram of an alternative partial product multiplication method for unsigned numbers only (although extending this method to two's complement is a relatively simple task). The figure shows the sequence of operations to be taken once the set-up (operand loading) is complete.
The set-up phase resets the accumulator Q to zero and loads both multiplier and multiplicand into the correct locations. In step 1 the least significant bit of the multiplier is tested. If this is a 1 (step 2) then the multiplicand is added to the accumulator (step 3).
Chapter 2

Figure 2.7  A flowchart showing the steps for performing partial product multiplication. On entry, M is the multiplicand and Q is the multiplier (both n bits wide); the result is built up in register A (2n bits) by accumulating A = A + (M << count) × Q[count] for each value of count, with the partial product for the MSB of Q subtracted rather than added because that bit is the sign bit. The grey single-bit multiplication boxes are simply implemented as switched accumulation functions.

Step 4 occurs regardless of the two previous conditional steps, and shifts the entire
accumulator one bit to the right. The system loops n times (using control logic which
is not shown) before terminating with the answer in the long register.
Consider the differences between this and the original flowchart of Figure 2.7 in
terms of the number of registers needed, bus wires, connections, switches, adder size
and control logic involved.

Figure 2.8  A bit-level block diagram of signed partial product multiplication using an accumulator: when the tested multiplier bit is 1, an n-bit adder adds the multiplicand into accumulator Q (steps 2 and 3), and the entire (2n+1)-bit register is then shifted one bit to the right (step 4), leaving the final result in the long register.



Box 2.15  Worked examples of two's complement multiplication

Look at -5 × 4 (signed):

        1011        multiplicand -5
        0100        multiplier 4
    00000000        (multiplier bit 0 = 0: write 0 shifted left by 0 bits, sign extended)
    00000000        (multiplier bit 1 = 0: write 0 shifted left by 1 bit, sign extended)
    11101100        (multiplier bit 2 = 1: write -5 shifted left by 2 bits, sign extended)
    00000000        (multiplier bit 3 = 0: write 0 shifted left by 3 bits, sign extended)
    --------
    11101100        result = -128 + 64 + 32 + 8 + 4 = -20

Similarly, let us look at 4 × -5 (signed):

        0100        multiplicand 4
        1011        multiplier -5
    00000100        (multiplier bit 0 = 1: write 4 shifted left by 0 bits, sign extended)
    00001000        (multiplier bit 1 = 1: write 4 shifted left by 1 bit, sign extended)
    00000000        (multiplier bit 2 = 0: write 0 shifted left by 2 bits, sign extended)
    00100000        (multiplier bit 3 = 1: write 4 shifted left by 3 bits, sign extended)

But the last partial product corresponds to the sign bit of the multiplier and so needs to be subtracted. What we will do is change its sign by flipping all the bits and adding 1 (00100000 → flip → 11011111 → +1 → 11100000). We then simply add this to the other partial products. This gives:

    00000100
    00001000
    00000000
    11100000
    --------
    11101100        result = -20

As we can see the result is the same. We have illustrated the cases of needing sign extension and of handling a negative multiplier causing the final partial product to be subtracted instead of added.

Interestingly, this method of multiplication, including the right shift method (which
divides a number by two), was reportedly used by Russian peasants for hundreds
of years, allowing them to perform quite complex decimal multiplies with ease. The
algorithm starts with the two numbers to be multiplied, A and B, written at the head
of two columns respectively. We will give as an example, 31 multiplied by 17:
B     A
17    31

Working downwards, divide the B column by two on each line, discarding the fractional part, until 1 is reached. Fill the A column similarly, but double the number on each successive line:

B     A
17    31
8     62
4     124
2     248
1     496

Next, simply add up all of the numbers in the A column that correspond to odd
numbers in the B column. In this example, only 17 and 1 are odd in the B column,
therefore the final answer will be 31 + 496 = 527, which is of course correct.
Note that the alternatives given in this section are by no means the only partial
product hardware designs available, and far from being the only multiplication methods
available (even among Russian peasants).
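The halving-and-doubling procedure above translates directly into code. The following C sketch is an illustration only (the function name is our own, and 64-bit unsigned operands are assumed so that the doubling column does not overflow for modest inputs).

#include <stdint.h>

/* Sketch of the 'Russian peasant' method: halve b (discarding fractions),
   double a, and accumulate a whenever b is odd. */
uint64_t peasant_mul(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    while (b > 0) {
        if (b & 1)       /* b is odd: add the current entry from the A column */
            sum += a;
        a <<= 1;         /* double A                                          */
        b >>= 1;         /* halve B, dropping the fractional part             */
    }
    return sum;          /* peasant_mul(31, 17) returns 527                   */
}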

2.5.3 Shift-Add Method


The shift-add method relies on the fact that, for binary numbers, a shift left by one bit
is equivalent to multiplying by two. A shift left by two bits is equivalent to multiplying
by four and so on.
Using this property to perform a multiply operation will not avert the issue encoun-
tered when applying the repeated addition method in that the number of operations
depends on the value of the multiplier rather than the number of bits in the multiplier
word. For this reason, this method is not normally found as a general multiplier in
commercial processors. However, it can be very efficient where the multiplier is fixed
and close to a power of two. For this reason, it is often used in digital filters (devices
that perform a sequence of multiplications) with predetermined multiplier values.
This method is also easy to implement as a fixed filter in FPGA*-based designs
since in this case moving from one adder to the next is simply wiring up two logic
elements (logic cells), and a right shift can be accomplished simply by wiring output
bits 0, 1, 2, ... of one cell to input bits 1, 2, 3, ... on the next.
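As a small illustration of the shift-add idea (our own example, not taken from any particular processor), multiplying by a fixed constant such as 10 needs only two shifts and one addition, which is exactly the structure a fixed FPGA filter tap or an optimising compiler might generate.

/* Sketch: x * 10 formed as (x << 3) + (x << 1), i.e. 8x + 2x, using shifts and an add. */
static inline unsigned mul_by_10(unsigned x)
{
    return (x << 3) + (x << 1);
}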

2.5.4 Booth and Robertson’s Methods


Booth's method is similar to partial products in that the multiplier bits are scanned from
right to left and a shifted version of the multiplicand added or subtracted depending on
the value of the multiplier bits. The difference is that the multiplier bits are examined
in pairs rather than singly. An extension of this method examines 4 bits in parallel, and
in Robertson’s method, an entire byte in parallel.
The advantage of these methods is that they are extremely fast. However, the logic
required becomes complex as the number of bits considered in parallel increases.
The trick in Booth’s method is to define a rule by which the multiplicand is sub-
tracted or added depending on the values of each pair of bits in the multiplier. If two
consecutive bits from the multiplier are designated as Xi and Xi−1, when the multiplier is scanned from i = 0, then the action taken upon detecting each possible combination of two bits is as shown in Table 2.1.

* FPGA: field programmable gate array; a flexible, programmable logic device.

Table 2.1  Predefined rules for bit-pair scanning in Booth's method.

Xi    Xi−1    action
0     0       no action
0     1       add shifted multiplicand
1     0       subtract shifted multiplicand
1     1       no action

When a multiplicand is added or subtracted to/from an accumulator, it is first


shifted left by i bit positions, just as it is done in partial products. This process can be
examined in detail by following the examples in Boxes 2.16 and 2.17.

Box 2.16  Booth's method worked example

Consider 9 × 10 (unsigned):

1001      multiplicand 9
1010      multiplier 10

          (i = 0, no action since bit pair = 00: bit 0 and a hidden zero)
−1001     (i = 1, subtract multiplicand << 1 since bit pair = 10)
+1001     (i = 2, add multiplicand << 2 since bit pair = 01)
−1001     (i = 3, subtract multiplicand << 3 since bit pair = 10)
+1001     (i = 4, add multiplicand << 4 since bit pair = 01)
          (i = 5 and onwards, no action since all bit pairs = 00)

The result is therefore obtained as the summation of the following:

 10010000
− 1001000
+  100100
−   10010

Or by converting the subtractions into additions (see Section 2.4.4):

 10010000
+10111000
+00100100
+11101110
=01011010 (discarding the carry out)

Result: 1011010 = 64 + 16 + 8 + 2 = 90 (correct)

Box 2.17  Booth's method worked example

Consider −9 × 11 (signed):

11110111     multiplicand −9
00001011     multiplier 11

−11110111    (i = 0, subtract multiplicand since bit pair = 10)
 0000000     (i = 1, no action since bit pair = 11)
+110111      (i = 2, add multiplicand << 2 since bit pair = 01)
−10111       (i = 3, subtract multiplicand << 3 since bit pair = 10)
+0111        (i = 4, add multiplicand << 4 since bit pair = 01)
 000         (i = 5 and onwards, no action since all bit pairs = 00)

The result is therefore obtained as the summation of the following:

−11110111
+11011100
−10111000
+01110000

Or by converting the subtractions into additions (see Section 2.4.4):

 00001001
+11011100
+01001000
+01110000
=10011101 + carry (discarded)

Result: 10011101 = −128 + 16 + 8 + 4 + 1 = −99 (correct)

It is important to note that when i=0, the bits considered are the least significant
bit of the multiplier and a hidden zero. Thus, when the least significant bit of the
multiplier is a ‘1’, the multiplicand must be subtracted (i.e. treated as a ‘10’ instead).
This can be seen in the second worked example (Box 2.17).
There are two points worth mentioning here. First, when dealing with two’s com-
plement signed operands, the partial products must be sign extended in the same way
as the full partial product multiplier.
Second, when scanning from right to left, the hidden bit at the right-hand side
means that the first pair of non-equal bits that is encountered will always be a ‘10’,
indicating a subtraction. This regularity may be useful when designing a hardware
implementation.
Even for someone who has been doing binary arithmetic for many years, the
preparation of this book highlighted how easy it can be to make very trivial binary
addition mistakes. If you are required to do this as part of an examination, always
double-check your binary arithmetic. Getting it right the first time is not as simple as it
may seem.
As mentioned previously, Booth extended his method into examination of 4 bits at
a time, using a look-up-table type approach, and Robertson took this one step further by
building an 8-bit look-up table. These methods are in fact common in various modern
processors, although they require considerable resources in silicon.
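A radix-2 Booth multiplier is also straightforward to express in software for checking worked examples such as those in Boxes 2.16 and 2.17. The sketch below assumes 8-bit signed operands and a deliberately wide accumulator; it scans the bit pairs (Xi, Xi−1) with a hidden zero below bit 0 and applies the rules of Table 2.1, but it does not model any particular hardware implementation.

#include <stdint.h>

/* Sketch of radix-2 Booth multiplication for 8-bit signed operands. */
int16_t booth_mul8(int8_t multiplicand, int8_t multiplier)
{
    int32_t acc = 0;                                /* accumulator, deliberately wide */
    int32_t m   = multiplicand;                     /* sign-extended multiplicand     */
    int prev = 0;                                   /* hidden zero below bit 0        */

    for (int i = 0; i < 8; i++) {
        int cur = ((uint8_t)multiplier >> i) & 1;   /* Xi                             */
        if (cur == 0 && prev == 1)
            acc += m * (1 << i);                    /* bit pair 01: add multiplicand << i      */
        else if (cur == 1 && prev == 0)
            acc -= m * (1 << i);                    /* bit pair 10: subtract multiplicand << i */
        /* bit pairs 00 and 11: no action */
        prev = cur;
    }
    return (int16_t)acc;                            /* booth_mul8(-9, 11) returns -99 */
}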

Division
For many years, commodity CPUs and even DSPs did not implement hardware divi-
sion due to the complexity of silicon required to implement it. Analog Devices DSPs
and several others did include a DIV instruction, but this was generally only a hardware
assistance for the very basic primary-school method of repeated subtraction.

2.6.1 Repeated Subtraction


Since division is the process of deciding how many times a divisor M ‘goes’ into a
dividend Q (where the answer is the quotient Q/M), then it is possible to simply count
how many times M can be subtracted from Q until the remainder is less than M.
For example, in performing 13/4, we could illustrate this loop:
iteration i = 1, remainder r = 13 − 4 = 9;
iteration i = 2, remainder r = 9 − 4 = 5;
iteration i = 3, remainder r = 5 − 4 = 1;
Remainder 1 is less than divisor 4 so the answer is 3 with remainder 1.

When working in binary the process is identical and perhaps best performed as
long division as in the worked example in Box 2.18.
So now the question is, how to handle signed integer division? Answer: The most
efficient method is probably to note the signs of both operands, convert both to unsigned
integers, perform the division and then apply the correct sign afterwards. Division uses
the same sign rules as multiplication in that the answer is only negative if the signs of
the operands differ.
The division process for one popular microprocessor can be seen in the flowchart of
Figure 2.9. A close examination of this may prompt some questions such as: ‘Why shift
both A and Q left at each iteration?’ and ‘Why perform an addition of Q = Q + M inside
the loop?’ These questions may be answered by considering how the operations are
performed using registers within a CPU. This will be left as a pencil-and-paper exercise
for the reader to follow the operation of the algorithm for one example division, perhaps
of two 6-bit numbers: this exercise will help to clarify how this system works.
Just note that at the completion of the algorithm, register A holds the answer, with
any remainder being in register Q. The algorithm will have iterated for n cycles where
n is the number of bits in the input words. As always, it is entirely possible to derive
other flowcharts that work differently, for example, some will even iterate and scan
through the bits in the opposite direction.
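The shift-and-subtract idea behind binary long division is also easy to express in C. The sketch below is written in the spirit of the discussion above rather than as an exact model of the Figure 2.9 flowchart; the names and 16-bit width are arbitrary, and the caller must ensure the divisor is non-zero.

#include <stdint.h>

/* Sketch of unsigned binary long division by shift-and-subtract.
   Returns the quotient and writes the remainder through *rem. */
uint16_t long_div16(uint16_t q, uint16_t m, uint16_t *rem)
{
    uint32_t r = 0;                         /* running remainder                  */
    uint16_t quot = 0;

    for (int i = 15; i >= 0; i--) {
        r = (r << 1) | ((q >> i) & 1u);     /* bring down the next dividend bit   */
        if (r >= m) {                       /* divisor 'found' at this position   */
            r -= m;
            quot |= (uint16_t)(1u << i);    /* set the corresponding quotient bit */
        }
    }
    *rem = (uint16_t)r;
    return quot;                            /* 23/5 gives quotient 4, remainder 3 */
}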

Box 2.18  Long division worked example

Consider 23 ÷ 5 (unsigned).
First, write the values in the long division format:

101 | 010111
divisor  dividend

Then, starting from the most significant end (left) and working towards the least significant end (right), scan each bit position in the dividend to see if the divisor can be 'found' in the dividend. In each case if it is not found, write a '0' in the corresponding position above the dividend, and look at the next bit. After three iterations, we would have:

      000          (quotient)
101 | 010111

But now, at the current bit position in the dividend, 101 can be found. We thus write 101 below the dividend and a '1' above the dividend at the correct bit position. Then subtract the divisor (at that bit position) from the dividend to form a new dividend:

      0001
101 | 010111
       −101
     = 000011

Next, we continue working from left to right but this time looking at the new dividend for the divisor. In this case it is not found; after scanning all bit positions we are left with:

      000100
101 | 010111
       −101
     = 000011

The answer is seen above: the quotient is 000100 with a remainder of 000011. Since we were dividing 23 by 5, we expect an answer of 4 (correct) and a remainder of 3 (also correct).

Figure 2.9  A flowchart of a division algorithm. On entry, M is the divisor and Q is the dividend for the n-bit division Q/M; the set-up stage performs A = 0, count = 0, M = M << n, and each iteration includes the shifts A << 1, Q << 1. On exit, A is the quotient and Q is the remainder.

Working with Fractional Number Formats


Section 2.3.7 introduced the representation of fractional numbers using Q-format
notation. Although there are many reasons for requiring fractional notation, one
major reason is in digital signal processing, where a long digital filter may require
hundreds or thousands of multiply-accumulate operations before a result is
determined.
Imagine if some of the filter ‘weights’ (the fixed values in the filter that the input
numbers are multiplied by) are very small. In this case, after multiplying by these small
values many times, the result could be tiny, rounded down to zero by the number
format used. On the other hand, if some filter weights are large, the result of many
multiplications could be huge, resulting in an overflow. Understandably, this makes
designing such filters a very sensitive balancing act.
Fortunately, there is a reasonable and efficient solution: ensure that the operands
used are in fractional format, and are less than, but as close to, 1.0 as possible. The
rationale being that anything multiplied by a number that is less than or equal to 1.0
cannot be greater than itself. We are thus assured that the result of multiplying two
of these numbers will never result in an overflow. Similarly, anything multiplied by a
value slightly less than 1.0 does not become significantly smaller, hence results are less likely to quickly round down to zero.
This is possible numerically because we are only multiplying and adding in a filter and these are linear processes: (a × b + c) has the same result as (10a × b + 10c)/10.
Remember again that the actual fractional format used is not relevant to the hardware used to perform the calculations. It is only an abstraction that the software engineer must keep in mind. This will be illustrated with various examples as we progress through this chapter.

2.7.1 Arithmetic with Fractional Numbers
Addition can always be performed on two fractional format numbers, but the correct
answer will only be achieved when the formats of each operand are identical. The
format of the answer will be that of the operands:
(m.n) + (m.n) = (m.n)
(m.n) — (m.n) = (m.n)

Box 2.19  Worked examples of fractional representation

Question 1: Represent 1.75 and 1.25 in (2.2) format fractional notation, perform an addition between the two and determine the result.

Answer: First calculate the bit weightings for (2.2) format notation: we need two
digits to the right and two digits to the left of the radix point. Digits to the left are
integer weighting, are powers of 2 and start with 1. Digits to the right are fractional,
are 1 over powers of 2 and start with 1/2:

2     1     0.5     0.25
We can decompose 1.75 into 1+ 0.5 + 0.25 and 1.25 into 1 + 0.25 and write them in (2.2)
binary format as 0111 and 0101.
The binary addition of these operands results in 1100. Is this correct?
1100 in (2.2) format equals 2 + 1 = 3. Of course 1.75 + 1.25 = 3 so yes, the answer is
correct.
Next, we will illustrate what happens when something goes wrong.
Question 2: Represent 1.75 in (2.2) format fractional notation, represent 0.625 in (1.3)
format fractional notation, perform an addition between the two and determine the
result.
Answer: 1.75 was represented in question 1 and is 0111.
(1.3) format fractional notation has weightings 1, 0.5, 0.25, 0.125 and thus if we decom-
pose 0.625 into 0.5 + 0.125 we get a binary pattern 0101.
Next, we perform the addition 0111 + 0101 which gives the answer 1100.
However, we do not know the fractional format of the result. Let us speculate
whether this is (2.2) or (1.3) format by working out the decimal value in each case.
In (2.2) format the result is 2+1 = 3 and in (1.3) the result is 1+0.5 = 1.5. However,
the answer should be 1.75 + 0.625 = 2.375. Clearly, this does not match either of the
potential answers.
What we should have done was change one of them so they were both in the same
format before we performed the addition.
Note: Did you see that the binary patterns of both examples are identical? It is only our
interpretation of those bit-patterns that changed between examples. Using different
interpretations in this way can cause the same bit-pattern to have multiple meanings —
but the hardware used to perform the calculation does not need to change.

The arithmetic of such fractional format numbers is illustrated with two examples
in Box 2.19.

2.7.2 Multiplication and Division of Fractional Numbers


In the case of multiplication, there is more flexibility in that the operands can have
different fractional formats, and the fractional format of the answer is derived from
those of the operands:
(m.n) × (p.q) = ((m+p).(n+q))
It is evident that the number of bits in the answer of the multiplication will be the
sum of the number of bits in the two operands, and this is expected from what we
already know of multiplier hardware from Section 2.5.
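As an illustration of this rule (using a format choice of our own rather than one from the text), multiplying two (1.15) operands yields a (2.30) result in a 32-bit register, which software commonly renormalises with an arithmetic right shift.

#include <stdint.h>

/* Sketch: (1.15) x (1.15) = (2.30); shifting right by 15 gives a (1.15)
   approximation of the product. The corner case (-1.0) x (-1.0) overflows and
   real DSP code would saturate it; an arithmetic right shift is assumed. */
int16_t qmul_1_15(int16_t a, int16_t b)
{
    int32_t full = (int32_t)a * (int32_t)b;   /* full-precision (2.30) product */
    return (int16_t)(full >> 15);             /* renormalise towards (1.15)    */
}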
Division is rather more complex. In fact, the best way to perform division is first
to remove the radix points of both numbers by shifting both radix positions, in step,
digit by digit to the right until they are below the least significant digit of the largest
operand, extending the smaller operand where appropriate. The division then occurs
in standard binary fashion.
The worked example in Box 2.20 illustrates how fractional division is done.

Box 2.20  Worked example of fractional division

Consider 11.000 ÷ 01.00 (unsigned).
This has the trivial meaning of 3 ÷ 1. The first step in performing this operation is to shift the radix point of both numbers, in step, one position to the right:

110.00 ÷ 010.0

This is insufficient since the numbers still contain the radix so we repeat one step:

1100.0 ÷ 0100.

This is still insufficient so we perform the step again, extending the 0100 as a side-effect of removing the radix from 1100.0, as follows:

11000. ÷ 01000.

The division then occurs as a standard binary division:

01000 | 11000

Continuing the long-hand binary division:

        00011
01000 | 11000

The answer is 11, which, as decimal value 3, is correct.

Looking at the worked example, it is clear that the actual division is no more com-
plex than standard binary arithmetic; however, consideration of the radix position may
be problematic. In fact, it requires some careful coding on the part of the programmer.
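One common coding pattern (our own illustration, not from the text) is to pre-scale the dividend before an integer division so that the quotient comes out in the desired fractional format.

#include <stdint.h>

/* Sketch: dividing one (16.16) value by another while keeping a (16.16) result,
   by scaling the dividend up by 2^16 first. The caller must ensure b is non-zero
   and that the true quotient fits within the (16.16) range. */
int32_t qdiv_16_16(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * 65536) / b);
}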

Floating Point
Floating point numbers are similar to fractional format binary but they have additional
flexibility in that the position of the radix point is variable (and is stored as part of
the number itself). It is this flexibility that allows floating point numbers to encode an
enormous range of values with relatively few bits.

2.8.1 Generalised Floating Point


A floating point number is one which has a mantissa S (or fractional part) and an exponent E (or power). There is probably also a sign (σ) such that the value of the number represented in base B is given as:

n = σ × S × B^E

Or more correctly, considering the sign to be binary, with 1 indicating negative and 0 indicating positive, such that:

n = (−1)^σ × S × B^E
An example in base 10 would be 2.3 × 10^6 which we know is just a shorthand method of writing 2,300,000. In fact, this illustrates one of the main benefits of floating point: floating point numbers generally require less writing (and in binary require fewer digits) than the decimal (or binary) values they represent.
In binary the difference is that B = 2 rather than 10 and thus the example will typically be something like 01001111 × 2^6 which, if the mantissa (01001111) is unsigned, becomes:

01001111 × 2^6 = 79₁₀ × 64₁₀ = 5056₁₀ = 1001111000000

where the subscript 10 indicates a decimal value.


Of course, the final value is the same as the mantissa, but shifted by the exponent
(in the same way that we added five-zeros to the base 10 example above).
Normally, all the bits that constitute a floating point number (σ, S and E) are stored
in the same location such that they total a convenient number of bits such as 16, 32,
64 or 128. Some bit-level manipulation is therefore required when processing them to
separate out the three different parts from the full stored number.

2.8.2 IEEE754 Floating Point



Although there are many possible floating point formats and various implemented examples scattered through computing history, IEEE standard 754, which emerged in the 1970s, has become by far the most popular, adopted by all major CPU manufacturers. It is generally considered in the trade that IEEE754 is a well thought-out and efficient floating point format, and as a consequence is highly regarded.

It is not the intention of this text to describe the entire IEEE754 standard, but we
will cover some of its more common features. We will consider single and double
precision formats which fit into 32-bit and 64-bit storage locations respectively. In the
C programming language these would normally correspond to float and double data
types:

Name                Bits    Sign, σ    Exponent, E    Mantissa, S
Single precision     32        1            8             23
Double precision     64        1           11             52

In addition, further bits can be added to the representation during intermediate


calculation stages (in a hardware floating point unit) to ensure overall accuracy is main-
tained, as will be described later in Section 2.9.3.
Despite all 32 bits or 64 bits being used for sign, mantissa and exponent, the IEEE754 format manages to cleverly signify four alternative states using unique bit-patterns that do not occur for normal numbers. These are shown in the following table, beginning with the default state called 'normalised':

Name                    σ         E                          S
Normalised              1 or 0    not all zero or all one    any
Zero                    1 or 0    all zero                   all zero
Infinity                1 or 0    all one                    all zero
Not a Number (NaN)      1 or 0    all one                    non-zero
Denormalised            1 or 0    all zero                   non-zero

When an IEEE754 number is written, we typically write the bits from left to right in the order (σ, E, S) as shown below:

[ σ | E          | S                              ]


where the box represents a 32-bit or 64-bit binary number that we know, or are told,
contains an IEEE754 format value. All examples given in this book will only use the
32-bit single-precision format to save paper.
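Because the three fields share one 32-bit word, a little bit-level manipulation recovers them. The C sketch below is illustrative only; it uses memcpy to reinterpret the stored bits and assumes a platform on which float is IEEE754 single precision.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Sketch: extract the (sigma, E, S) fields of an IEEE754 single-precision value. */
void print_ieee754_fields(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the stored bit-pattern */
    unsigned sigma = bits >> 31;               /* 1-bit sign                         */
    unsigned E     = (bits >> 23) & 0xFFu;     /* 8-bit biased exponent              */
    uint32_t S     = bits & 0x7FFFFFu;         /* 23-bit mantissa bit-pattern        */
    printf("sigma=%u  E=%u  S=0x%06X\n", sigma, E, (unsigned)S);
}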

2.8.3 IEEE754 Modes


In the discussion that follows, we will use S to represent the mantissa bit-pattern in unsigned fractional (0.23) or (0.52) format, E to represent the exponent bit-pattern in unsigned binary format and σ to represent the sign. Note that they have different meanings in the five IEEE754 modes.
In this way, an IEEE754 number written in a box such as this:

0 | 10110010 | 11100000000000000000000

would be said to have σ = 0 and therefore positive sign,
E = 128 + 32 + 16 + 2 = 178 and
S = 0.5 + 0.25 + 0.125 = 0.875

We will maintain this naming convention for E and S throughout. So henceforth


the words ‘mantissa’ and ‘exponent’ are used to indicate the meaning of the written
bit-pattern, whereas S and E are the actual values written down in binary.
For example, an S bit-pattern with value 0.875d (such as the one above) might mean the mantissa is 0.875 or
is 1.875 or is irrelevant (not a number, NaN). The actual meaning of the written bit-
patterns changes with mode as we shall see below.

2.8.3.1 Normalised Mode


This is the number format that most non-zero numbers will be represented in. It is the
one mode whereby the number format can truly be called ‘floating point’. In this mode,
the number represented by the bit-patterns (σ, E, S) is given by:

n = (−1)^σ × (1 + S) × 2^(E−127)

where it can be seen firstly that the exponent is in an excess-127 notation (introduced
in Section 2.3.5) and secondly that the mantissa needs to have a ‘1’ added to it. In other
words, the mantissa is equal to S + 1 and we know that S was written in (0.23) format.
All this may be very confusing, so we will return to the example IEEE754 number
and use it in the worked example in Box 2.21, and give a second example in Box 2.22.
Many of our example numbers have long tails of zeros. We can obtain an idea
about the basic precision of IEEE754 by considering what difference would result if the
least significant bit at the end of one of those tails is flipped from a ‘0’ to a ‘1’. Box 2.23
provides a guide as to how we can investigate the effect.
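When checking such examples by hand, it can also help to evaluate the normalised-mode formula numerically. The small helper below is a convenience sketch only; double precision is ample for cross-checking single-precision values.

#include <math.h>

/* Sketch: evaluate n = (-1)^sigma x (1 + S) x 2^(E - 127) for normalised mode. */
double normalised_value(unsigned sigma, unsigned E, double S)
{
    return (sigma ? -1.0 : 1.0) * (1.0 + S) * pow(2.0, (double)E - 127.0);
}

/* For example, sigma = 0, E = 178, S = 0.875 gives approximately 4.222e15 (Box 2.21). */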

Box 2.21  IEEE754 normalised mode worked example 1

Given the following binary value representing an IEEE754 number, determine its decimal value.

0 | 10110010 | 11100000000000000000000

First of all, we note that here σ = 0 and therefore the value has positive sign. We also note that the number is in normalised mode. Therefore:

E = 128 + 32 + 16 + 2 = 178 and
S = 0.5 + 0.25 + 0.125 = 0.875

Using the formula for normalised mode numbers, we can calculate the value that this conveys:

n = (−1)^0 × (1 + 0.875) × 2^(178−127)
  = 1.875 × 2^51
  = 4.222 × 10^15

As we can see, the result of the worked example is a fairly large number, illustrating the ability of floating point formats to represent some quite big values.

Box 2.22  IEEE754 normalised mode worked example 2

Given the following binary value representing an IEEE754 number, determine its decimal value.

1 | 00001100 | 01010000000000000000000

In this case, σ = 1 and the value therefore has negative sign, and the remaining bit-patterns give:

E = 8 + 4 = 12 and
S = 1/4 + 1/16 = 0.3125

Using the formula for normalised mode numbers, we can calculate the value that this conveys:

n = (−1)^1 × (1 + 0.3125) × 2^(12−127)
  = −1.3125 × 2^−115
  ≈ −3.16 × 10^−35
This time the result is a very small number. This illustrates the enormous range of
numbers possible with floating point, and also the fact that all through the represented
number range (explored further in Section 2.8.4), precision is maintained.

Box 2.23  Exercise for the reader

Notice in the worked examples (Boxes 2.21 and 2.22) that our 23-bit long mantissa values began with a few 1's but tailed off to a long string of 0's at the end. This was done to reduce the difficulty in calculating the value of the mantissa because, as a (0.23) fractional format number, the weightings at the left-hand end are easier to deal with, having value 0.5, 0.25, 0.125 and so on. In fact, as we move to the right the bit weights quickly become quite difficult to write down.
The exercise in this case is to repeat one of the worked examples, but with the least significant bit of the mantissa set to 1. If the weighting for the most significant mantissa bit is 2^−1 (0.5) and for the next bit is 2^−2 (0.25), what will be the weighting for the least significant bit?
When this is added into the answer, what difference if any does it make to the written result?
The real question now is, does this indicate anything about the precision of IEEE754 numbers?

2.8.3.2 Denormalised Mode

Some numbers have such small magnitude that IEEE754 cannot represent them. Generalised floating point would round these values down to zero, but IEEE754 has a
special denormalised mode that is able to extend the represented numbers downwards
in magnitude towards zero — gracefully decreasing precision until zero is reached.
Denormalised mode is not actually floating point because the exponent (which is
the part of the number that specifies the radix point) is set to all zeros and thus no longer
‘floats’. However, this mode, in allowing range extension is an important advantage of
TEEE754 numbers.
In this mode, the number represented by the bit-patterns (σ, E, S) is given by:

n = (−1)^σ × S × 2^−126
It can be seen firstly that the exponent is fixed as mentioned above, and secondly
that we no longer need to add a‘1’ to the mantissa. The reason for this will be apparent
when we explore number ranges in Section 2.8.4.
Since the exponent is fixed, the bit-pattern is always all-zero and the mantissa non-
zero. A worked example will help to clear up any confusion, and this is provided in
Box 2.24.
Since denormalised numbers extend the range of IEEE754 downwards, they will
always have very small magnitude.

2.8.3.3 Other Mode Numbers


Zero, infinity and NaN are identified by their special bit-patterns. These can all be
positive as well as negative, and require special handling in hardware (see Box 2.25).

Box 2.24  IEEE754 denormalised mode worked example

Given the following binary value representing an IEEE754 number, determine its decimal value.

0 | 00000000 | 11010000000000000000000

Firstly, we note that since σ = 0, the number represented by these bit-patterns therefore has positive sign.
E = 0 so we look at the mode table in Section 2.8.2 to see whether we are dealing with a zero or a denormalised number. We actually need to examine the mantissa to decide which it is (a zero must have a zero mantissa, otherwise it is a denormalised number).
Looking at the mantissa we see it is non-zero and therefore a denormalised mode number:

S = 0.5 + 0.25 + 0.0625 = 0.8125

Using the formula for denormalised mode numbers, we can calculate the value that this conveys:

n = (−1)^0 × 0.8125 × 2^−126
  = 9.5509 × 10^−39

Box 2.25  IEEE754 infinity and other 'numbers'

Infinity is most commonly generated by a divide-by-zero or by a normalised mode
overflow. Infinity can be positive or negative to indicate the direction from which the
overflow occurs.
NaN, indicating Not-a-Number, is generated by an undefined mathematical op-
eration such as infinity multiplied by zero or zero divided by zero.
Zero itself may indicate an operation that really did result in zero, for example
(2—2), or it could result from an underflow, when the result is too small to be represented
even by denormalised mode, in which case the meaning of + /— zero indicates whether
the un-representable number was slightly above or slightly below zero.

2.8.4 IEEE754 Number Ranges


One excellent way of understanding IEEE754 is through the construction of a number
line that represents the ranges possible in the format. The following number line, rep-
resenting an unsigned 8-bit number, will illustrate what this involves:

Minimum magnitude = 0                         Maximum magnitude = 2^8 − 1 = 255
Accuracy (distance between number steps) = 1

Three parameters are indicated which describe the format. The first is the smallest
magnitude number (0000 0000), the second is the largest magnitude number (1111
1111) and the final is the accuracy. Accuracy is defined as the distance between steps
in the format. In this case, the numbers count upwards as integers: 1, 2, 3, 4,5, ...255
and so the step size is simply 1.
Now, we will undertake to define a number line for IEEE754 format in the same way. To simplify matters we will consider positive numbers, but we will look at both normalised and denormalised modes, although only for the single-precision case.
Normalised mode requires that E is not all-zero or all-one, but S can take any value and the actual value represented is:

n = (−1)^σ × (1 + S) × 2^(E−127)

If we look for the smallest magnitude normalised mode number, we need to find
the smallest S and smallest E possible. The smallest S is simply 0, but the smallest E

cannot be 0 (because that would denote denormalised or zero mode), so it has to be 00000001 instead:

0 | 00000001 | 00000000000000000000000

Inserting these values into the formula and assuming a positive sign gives us:

min norm = (1 + 0) × 2^(1−127) = 1 × 2^−126 = 1.175 × 10^−38

Next, looking for the largest magnitude number, we remember that S can be any-
thing, but E cannot be 11111111 (because that would put it into infinity or NaN
modes). So we choose the largest E as 11111110 and the largest S as being all-one.
Considering E first, the value equates to 254. However, S is slightly harder to
evaluate:

S = 111 1111 1111 1111 1111 1111

But realising that this is (0.23) format and is slightly less than 1.0 in value, we can
see that if we add a binary 1 to the least significant digit then all the binary 1’s in the
word would ripple-carry to zero as the carry is passed up the chain and we would get
a value like this:
 111 1111 1111 1111 1111 1111

+000 0000 0000 0000 0000 0001


=1000 0000 0000 0000 0000 0000

We can use this fact; knowing that there are 23 bits, the bit weight of the first most significant bit is 2^−1, the weight of the second most significant bit is 2^−2 and so on. Then the twenty-third most significant bit (which is actually the least significant bit) must have a weight of 2^−23.
Therefore, the value of S has to be (1.0 − 2^−23) since adding 2^−23 to it would make it exactly equal 1.0:

S = 0.111 1111 1111 1111 1111 1111 = 1.0 − 2^−23

Putting all that into the formula we have:

max norm = (1 + 1 − 2^−23) × 2^(254−127) = (2 − 2^−23) × 2^127 ≈ 3.403 × 10^38


What about number accuracy? If we look at the numbers we have found we will
realise that accuracy is not constant. The smallest bit is always 2^−23 times the exponent
across the entire range.
Finally, starting a number line for normalised mode, we get:

min 1.175 × 10^−38                                      max 3.403 × 10^38
Accuracy (distance between number steps) = 2^−23 × 2^(E−127)
Since the sign bit changes only the sign and does not affect the magnitude, the
range line must be a mirror image for negative numbers.
Denormalised mode can be handled in a similar way, although by definition the
exponent is always zero and the value of the number represented is:
n = (−1)^σ × S × 2^−126

Remembering that a mantissa of zero is disallowed, the smallest denormalised


number has just the least significant mantissa bit set:

0 | 00000000 | 00000000000000000000001

And therefore S has a value of 2^−23, following the argument for the normalised mode maximum number. The formula becomes:

min denorm = 2^−23 × 2^−126 = 2^−149 ≈ 1.401 × 10^−45

As for the largest denormalised number, this is simply the number where S is a maximum. Looking at the mode table in Section 2.8.2 we see it can be all-ones:

0 | 00000000 | 11111111111111111111111

Again using the same argument as the normalised maximum value case, this has a meaning of (1 − 2^−23), giving a value of:

max denorm = (1 − 2^−23) × 2^−126 ≈ 2^−126 = 1.175 × 10^−38

Now to work out the number accuracy: in this case since the exponent is fixed, the
accuracy is simply given by the value of the mantissa least significant bit multiplied by
the exponent:
2^−23 × 2^−126

min 1.401 × 10^−45                                      max 1.175 × 10^−38
Accuracy (distance between number steps) = 2^−23 × 2^−126

Putting the number lines together, we see the huge range spanned by IEEE754
single-precision numbers. Remember that this is actually only half of the real number
line that has positive as well as negative sides:

Zero     Denormalised                               Normalised
0        1.401 × 10^−45 to 1.175 × 10^−38           1.175 × 10^−38 to 3.403 × 10^38
         accuracy 2^−23 × 2^−126                    accuracy 2^−23 × 2^(E−127)

The number line becomes useful when we want to convert decimal numbers to IEEE754 floating point. It tells us which mode we should use, whether zero, denormalised, normalised or infinity. To illustrate this, follow a worked example of conversion from decimal to floating point in Box 2.26.
There will be more examples of such conversions in Sections 2.9.1 and 2.9.2.

Box 2.26  Worked example: converting decimal to floating point

Q. Write decimal value 11 in IEEE754 single-precision format.

A. Looking at our number line in Section 2.8.4 we can see that this value lies squarely in the normalised number range, so we are looking for a normalised number of the form:

n = (−1)^σ × (1 + S) × 2^(E−127)

To obtain this, it is first necessary to write N = 11 decimal as A × 2^B where A is equivalent to (1 + S). Knowing that 0 ≤ S < 1 it follows that 1 ≤ A < 2. Probably the easiest way is to take the number N and repeatedly halve it until we get a value A between 1 and 2:
This gives 11 followed by 5.5 followed by 2.75 and finally 1.375.
So A = 1.375 and therefore N = 1.375 × 2^3, and it does not take too much work to see that B is determined by the number of times we had to halve the original number N: in this case, 3. Therefore our number is: n = (−1)^0 × 1.375 × 2^3
Examining the formula for normalised numbers, we see that this requires:

σ = 0
E = 130 (so that E − 127 = 3)
S = 0.375 (so that 1 + S = 1.375)

Finding a binary bit-pattern for E gives 128 + 2 or 10000010 and since 0.375 is easily represented as 0.25 + 0.125 then the full number is:

0 | 10000010 | 01100000000000000000000
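The repeated-halving step used in Box 2.26 is essentially what the standard C library function frexp() performs, so it can be used to cross-check such conversions. Note that frexp() returns a mantissa in the range [0.5, 1), so one extra doubling is needed to reach the 1 ≤ A < 2 form used above; this is only a checking aid, not part of the IEEE754 standard.

#include <math.h>
#include <stdio.h>

/* Sketch: recover A and B such that value = A x 2^B with 1 <= A < 2. */
int main(void)
{
    double value = 11.0;
    int e;
    double m = frexp(value, &e);    /* value = m x 2^e, with 0.5 <= m < 1 */
    double A = m * 2.0;             /* scale into the 1 <= A < 2 range    */
    int    B = e - 1;
    printf("%g = %g x 2^%d\n", value, A, B);   /* prints 11 = 1.375 x 2^3 */
    return 0;
}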

Floating Point Processing


Up to now we have considered only the representation of floating point numbers, in
particular the IEEE754 standard. Such a representation is only useful if it is possible to
process the numbers to perform tasks, and this is considered further here.
In many computer systems, floating point processing is accomplished through
the use of special-purpose hardware called a floating point co-processor or floating
point unit (FPU). In fact, even though this is often included on-chip in commercial CPUs, it is normally still accessed as a co-processor rather than as part of the main processor.
For computers that do not have hardware floating point support, software emulation is widely available, and apart from longer execution times (refer to Section 4.6.1), the user may be unaware of where the float calculations are being done, whether in hardware or software. Most floating point support (whether hardware or software) is based on the IEEE754 standard, although there are occasional software options to increase calculation speed at the expense of the full IEEE754 accuracy.

IEEE754 number processing involves the following steps:

1. Receive operands.
2. Check for number format modes. If the value is fixed, immediately generate the
answer from a look-up table.
3. Convert exponents and mantissas if necessary.
4. Perform operation.
5. Convert back to valid IEEE754 number format. Keep the most significant 1 of
the mantissa as close to the left as possible, for reasons of maintaining maximum
precision.

2.9.1 Addition and Subtraction of IEEE754 Numbers
In generalised floating point, the exponents of the numbers must all be the same before
addition or subtraction can occur. This is similar to ensuring fractional format (m.n) +
(r.s) has m = r and n = s before adding, as we saw in Section 2.7.1.
For example, consider the decimal numbers 0.824 × 10^2 + 0.992 × 10^4. In order to do this addition easily, we must have both exponents equal — then we simply add the mantissas. But do we convert both exponents to be 10^2 or do we convert both to be 10^4, or even choose something in between such as 10^3?
In answering this question, first, let us consider how to convert an exponent downwards. We know that 10^3 is the same as 10 × 10^2 and 10^4 is the same as 100 × 10^2. Since we are talking about decimal, we multiply the mantissa by the base value of 10 every time we decrement the exponent. Performing this in our calculation would give us the sum:

0.824 × 10^2 + 99.2 × 10^2

Converting up is the converse: 10^2 is the same as 0.01 × 10^4 and would result in the sum:

0.00824 × 10^4 + 0.992 × 10^4


On paper, in decimal, the value of both expressions is identical, but in binary, in
hardware, this may not be true. So the question remains: which action do we take? Do
we convert the smaller exponent to match the bigger one or the bigger exponent to
match the smaller one, or move to something in the middle?
The answer is, firstly, we do not want to convert both numbers because that is
introducing extra work, and secondly when we consider the bit-fields of binary num-
bers and knowing that by making an exponent smaller the mantissa has to get bigger
it becomes evident that there is a danger of the mantissa overflowing if it becomes too
big. We therefore opt to never increase a mantissa. This means we have to increase the
smaller exponent and scale its mantissa correspondingly:
0.00824 × 10^4 + 0.992 × 10^4

This is termed equalising the exponents or normalising the operands. Later, we


will see that methods exist to help prevent the mantissa from disappearing by being
rounded down to zero during this process.

Once the exponents are equal, we can perform an addition on the mantissas:

0.00824 × 10^4 + 0.992 × 10^4 = (0.00824 + 0.992) × 10^4


IEEE754 addition and subtraction are similar to the decimal case except that since
the base is 2, the action of increasing one exponent to be the same value as the other
causes the mantissa of that number to be reduced by a factor of 2 for each integer
increase in exponent. The reduction by a factor of 2, in binary, is accomplished by a
right shift.
There is also one other factor we must consider and that is the format of the resulting
number. Remember that in normalised mode the mantissa bit-pattern cannot be greater
than 1. Well, if the result of a calculation on the mantissa becomes too big then we must
right shift the mantissa and consequently increment the exponent.
Similarly, if the mantissa becomes small it must be shifted left and the exponent
decremented. These factors will be explored through a worked example in Box 2.27.
We can now take the process further. Having determined how to equate the expo-
nents prior to performing arithmetic, we can tie that in with our knowledge of IEEE754
format and perform these operations directly on IEEE754 format numbers themselves.
Referring to the worked example in Box 2.27, we can now write the IEEE754 bit-
patterns of the numbers and perform the conversion in Box 2.28.
Subtraction is similar to addition — all steps remain the same except the mantissas
are subtracted as appropriate. Of course, we still have to consider overflow on the result
mantissa because we could be subtracting two negative numbers, such that the result
is larger than either original operand.

2.9.2 Multiplication and Division of IEEE754 Numbers


For multiplication and division we do not need to normalise the operands first, but
we do need to perform two calculations on the numbers, one for the mantissas
and one for the exponents. The following relationships hold for these operations on
base B numbers:
(A × B^C) × (D × B^E) = (A × D) × B^(C+E)
(A × B^C) / (D × B^E) = (A / D) × B^(C−E)
Another decimal example will illustrate the point:

(0.824 × 10^2) × (0.992 × 10^4) = (0.824 × 0.992) × 10^(2+4) = 0.817408 × 10^6

Once again, in the case of IEEE754 format numbers the result must be converted to a correct representation and special results (zero, infinity, NaN) checked for.

2.9.3 IEEE754 Intermediate Formats

Although a particular IEEE754 calculation may have IEEE754 operands as input and as output, there are cases where the output will be numerically incorrect unless there is greater precision within the calculation. A short example subtraction on 9-bit numbers

Box 2.27  Floating point arithmetic worked example

Q. Convert decimal values 20 and 120 to IEEE754 format, add them and convert the result back to decimal.

A. Looking at our number line from Section 2.8.4 we realise that both values lie in the normalised number range of IEEE754, but initially we will simply consider a generic A × 2^B format. Furthermore, we will not look at the exact IEEE754 bit-patterns here. Simply remember that A = (1 + S) and B = (E − 127).
Starting with 20 we divide repeatedly by 2 until we get a value between 1 and 2: 10, 5, 2.5, 1.25 and so A = 1.25. We divided four times so B = 4.
120 similarly divides down to 60, 30, 15, 7.5, 3.75, 1.875 so A = 1.875. Since we divided six times, B = 6.
The information is inserted into the following table. We do not need to derive the E and S bit-patterns at this stage; we are more concerned with their interpretation:

σ    B    A         Binary value      Decimal value
0    4    1.25      1.25 × 2^4        20
0    6    1.875     1.875 × 2^6       120

Next step is to equalise the exponents. As discussed in the text, we have to make both equal the largest exponent value, reducing the mantissa of the smaller number as appropriate.
1.25 × 2^4 thus becomes 0.625 × 2^5 and then 0.3125 × 2^6 to reform the operands into the following:

σ    B    A         Binary value      Decimal value
0    6    0.3125    0.3125 × 2^6      20
0    6    1.875     1.875 × 2^6       120

Since both exponents are identical, it is now possible to proceed by adding the mantissas to form a result:

σ    B    A         Binary value      Decimal value
0    6    2.1875    2.1875 × 2^6      140

However, this is not a valid representation for IEEE754 because the mantissa value is too large. Remember the (1 + S) in the formula? Well, A = (1 + S) < 2 is our constraint. If both operands were IEEE754-compliant then we should be able to guarantee that no more than one shift is needed to put it right, so we shift the A value right by one binary digit and then increment B:

σ    B    A          Binary value       Decimal value
0    7    1.09375    1.09375 × 2^7      140

A check on a calculator will reveal that 1.09375 × 2^7 is indeed the correct answer, giving us a decimal value of 140.

Box 2.28  IEEE754 arithmetic worked example

First, we begin with the normalised mode formula:

n = (−1)^σ × (1 + S) × 2^(E−127)

Begin with the value of 20 decimal. In the previous worked example, it was determined to be 1.25 × 2^4. Slotting this into the formula reveals that (1 + S) = 1.25 and so S = 0.25, (E − 127) = 4 and thus E = 131. This is represented below:

0 | 10000011 | 01000000000000000000000

120 decimal was 1.875 × 2^6 which gives us S = 0.875 and E = 133:

0 | 10000101 | 11100000000000000000000

The result of the addition was 1.09375 × 2^7 such that S = 0.09375 and E = 134.
Since 0.09375 is not an obvious fraction of 2, we can use a longhand method to determine the bit-patterns. In this, we repeatedly multiply the value by 2, subtracting 1 whenever the result is equal to or bigger than 1, and ending when the remainder is zero:

0.09375 × 2 = 0.1875
0.1875  × 2 = 0.375
0.375   × 2 = 0.75
0.75    × 2 = 1.5   (≥ 1: subtract 1, leaving 0.5)
0.5     × 2 = 1.0   (≥ 1: subtract 1, leaving 0)

We subtracted 1 on iterations 4 and 5. We make use of this by setting the fourth and fifth bits from the left to 1. In fact, we could have used this method for the first two numbers, but they were too easy:

0 | 10000110 | 00011000000000000000000

will illustrate this:

  1.0000 0000   × 2^1    A
− 1.1111 1111   × 2^0    B

Before we can proceed with the subtraction it will of course be necessary to normalise the numbers to the same exponent. We do this by increasing the smaller one as we have done in Section 2.9.1:

  1.0000 0000   × 2^1    A
− 0.1111 1111   × 2^1    B

Now we can proceed with the calculation. The result:

  0.0000 0001   × 2^1    C

Then shift the mantissa left as far as possible:

  1.0000 0000   × 2^−7

Let us look at the actual numbers that we have used. Operand A has value 2.0 and
operand B has value (2.0 − 2^−8) which in decimal is 1.99609375. So the result should be:

2.0 — 1.99609375 = 0.00390625

However, the result from our calculation is 1 × 2^−7 or 0.0078125. There is obviously
a problem somewhere.
Now let us repeat the calculation but this time adding something called a guard
bit during the intermediate stages. This effectively extends the length of the mantissa
by adding another digit at the least significant end. We start at the point where the
numbers have been normalised. Note the extra digit:
  1.0000 0000 0   × 2^1    A
− 1.1111 1111 0   × 2^0    B

Next shifting to normalise the exponents, the LSB of B shifts into the guard bit
when we shift the number right by 1 bit:

  1.0000 0000 0   × 2^1    A
− 0.1111 1111 1   × 2^1    B

and subtract to get the following result:

  0.0000 0000 1   × 2^1    C

Then shift the mantissa left as far as possible:

  1.0000 0000 0   × 2^−8

Notice that in line C this time the most significant (only) 1 occurred in the guard bit
whereas previously it was located at the bit above that. The normalised value is now
1 × 2^−8 or 0.00390625, a correct answer this time.
Although this example showed generalised 8-bit floating point numbers, the prin-
ciple is the same for IEEE754 numbers.
The example above showed a loss of precision error causing an incorrect result
during a subtraction. Of course, the same error could occur during an addition since A —
B is the same as A + (—B). But can it also occur during multiplication and division? It is
left as an exercise for the reader to try and find a simple example that demonstrates this.
In IEEE754 terminology, more than one guard bit is used and the method is called extended intermediate format. It is standardised with the following bit widths:

Name                          Bits    σ    Exponent, E    Mantissa, S
Extended single precision      43     1        11             31
Extended double precision      79     1        15             63

Obviously it becomes awkward to handle 43-bit and 79-bit numbers in computers


that are based around 8-bit binary number sizes, but this should not normally be an
issue because extended intermediate format is designed for use within a hardware
floating point unit during a calculation. The input numbers and output numbers will
still be 32 bits or 64 bits only.

2.9.4 Rounding
Sometimes an extended intermediate value needs to be rounded in order to represent
it in a desired output format. At other times a format conversion from double to single
precision may require rounding. Rounding can be necessary for both fixed and floating
point number calculations at times.
There is more than one method of performing numeric rounding and many com-
puter systems will support one or more of these methods under operating system
control:

• Round to nearest (most common) — Round to the nearest representable value and if two values are equally near, default to the one with LSB = 0, for example 1.1 to 1, 1.5 to 2 and 2.5 to 2.
• Round towards +ve — Round towards the most positive number, for example −1.2 to −1 and 2.2 to 3.
• Round towards −ve — Round towards the most negative number, for example −1.2 to −2 and 2.2 to 2.
• Round towards 0 — Equivalent to always truncating the number, for example −1.2 to −1 and 2.2 to 2.
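On a C99 system these rounding modes can usually be selected through the <fenv.h> interface, as in the sketch below; which modes are actually supported is implementation-defined, and nearbyint() rounds according to whichever mode is currently selected.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Sketch: demonstrating the four rounding modes via the C99 <fenv.h> interface. */
int main(void)
{
    fesetround(FE_TONEAREST);   printf("%g\n", nearbyint(2.5));    /* 2 (ties go to even LSB) */
    fesetround(FE_UPWARD);      printf("%g\n", nearbyint(-1.2));   /* -1                      */
    fesetround(FE_DOWNWARD);    printf("%g\n", nearbyint(-1.2));   /* -2                      */
    fesetround(FE_TOWARDZERO);  printf("%g\n", nearbyint(-1.2));   /* -1                      */
    return 0;
}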

For very high-precision computation, it is possible to perform each calculation


twice, rounding towards negative and rounding towards positive respectively during
each iteration. The average of the two results could be the answer (at least in a linear
system). Even if a high-precision answer is not obtained using this method, the differ-
ence between the two answers obtained will give a good indication of the numerical
accuracy involved in the calculations.

Summary
This chapter, entitled ‘Foundations’, has really begun our journey inside the computer
— whether that is a room-sized mainframe, a grey desktop box or a tiny embedded
system. It is foundational too, since almost all computers, whatever their size, are based
upon similar principles. They use the same number formats, perform the same type
of calculations such as addition, subtraction, multiplication and division. The main
differences that we have seen are that there exist some faster methods to carry out these
operations, but at the cost of increased complexity, size and usually power consumption.
We began the chapter by considering the definition of a computer and what it
contains. We introduced the useful classification of computer types (or CPUs) by Flynn,

viewed them in terms of their connectivity and the layers of functionality that they
contain. We then refreshed our knowledge of number formats and the basic operations,
before going into a little more detail about how these calculations are achieved.
Having covered the foundations here, the next chapter will focus on how to achieve
the connectivity and calculations that we know are required — how to fit these functional
units together, write and store a program and control the internal operation required
in a working CPU.
Problems

2.1 A programmer wrote a C language program to store 4 bytes (b0, b1, b2, b3) to
consecutive memory locations and ran this on a little endian computer with
32-bit wide memory. If he examined the memory after running his program,
would he see something like A or B in the diagrams below?

bit 31                                     bit 0
A:  | b3 | b2 | b1 | b0 |
B:  | b0 | b1 | b2 | b3 |

2.2 Complete the following table (for 8-bit binary numbers), indicating any in-
stances where conversion is impossible for the given value:

Value    Unsigned    Two's complement    Sign-magnitude    Excess 127
123
−15
193
−127

2.3 With a two's complement (2.30) format number, how do we represent the value
0.783203125? Can this be represented exactly with (a) 32 bits, (b) 16 bits and
(c) 8 bits?

2.4 One BCD digit consists of 4 bits. Starting with a 4-bit ripple-carry adder, modify
this with extra single-bit adders and logic gates to create an adder that can add
two BCD digits and produce a BCD sum. Extend the design so that it can add
two 4-digit BCD numbers.

2.5 Using partial products (long multiplication), manually multiply the two 4-bit
binary numbers X = 1011 and Y = 1101 assuming they are unsigned numbers.

2.6 Repeat the previous multiplication using Booth’s algorithm.

2.7 If ADD, SHIFT and compare operations each require a single CPU cycle to com-
plete, how many CPU cycles are needed to perform the calculation in Problem
2.5? Compare this with the steps of Booth’s method in Problem 2.6. Also would
Booth’s algorithm become more efficient for a larger word width?

2.8 Consider a RISC CPU that has an instruction named ‘MUL’ that can multiply the
contents of two registers and store the result into a third register. The registers
are 32-bits wide, and the stored result is the top 32 bits of the 64-bit logical result

(remember that 32 bits x 32 bits should give 64 bits). However, the programmer
wants to determine the full 64-bit result. How can he obtain this? (Hint: You will
need to do more than one multiply, and also a few ANDs and adds to get the result).
Verify your method, and determine how many instructions are needed.

2.9 If we multiply the two (2.6) format unsigned numbers X = 11010000 and


Y = 01110000 then we should get a (4.12) format result. We can shift the result
two digits left, giving (2.14) [i-e. effectively removing the top 2 bits] and then
truncate it to (2.6) [by discarding the lower 8 bits]. Will this cause an overflow,
and will the truncation lose any bits?

2.10 Consider the IEEE754 single-precision floating point standard.


(a) Express the value of the stored number N in terms of its storage bits (o, E,
S) for the following cases:
ll Soir ae
Hie r= 20) suiccesstully,
tll, =O =< IF = Ve5
Vee Uo)
% ES@S=C

(b) Express the following values in IEEE754 single-precision normalised


format:
ee=—lpy48)
Lily

2.11 Can a standard exponent/mantissa floating point number format represent zero
in more than one way? Can IEEE754 represent zero in more than one way? If
so, explain any differences between the representations.

2.12 Use the division flowchart of Figure 2.9 to obtain the quotient and remainder
values for the unsigned 5-bit binary division Q/M where Q = 10101b and
M = 00011b.

2.13 Use the multiplication flowchart from Figure 2.7 to perform partial product
multiplication of two 5-bit unsigned binary numbers 00110 and 00101. De-
termine the number of registers used, their sizes and their content during each
iteration.

2.14 Repeat the previous problem using the multiplication block diagram of Fig-
ure 2.8, to compare and contrast the two approaches in terms of efficiency,
number of steps, number of registers and so on.
2.15 Consider the following calculation in the C programming language:

0.25 + (float)(9 * 43)


Assuming that integers are represented by 16-bit binary numbers and the floats
are in 32-bit IEEE754 single-precision representation, follow the numerical
steps involved in performing this calculation to yield a result in IEEE754 format.

2.16 How would Michael Flynn classify a processor that has an instruction able to
simultaneously right shift by one bit position every byte stored in a group of
five internal registers?

2.17 Justify whether self-modifying code (that is, software that can modify its own
instructions by rewriting part of its code) would fit better in a von Neumann or
Harvard architecture system.

2.18 Using a 16-bit processor and only a single result register, follow the process
to add the (2.14) format unsigned number X = 01.11000000000000 and the
(1.15) format unsigned number Y = 0. 110000000000000. What format would
the result need to be in to avoid overflow? Is there any loss of precision caused
by the calculation in this case?

2.19 Identify the IEEE754 modes of the following numbers:

10100010 10100000000000000000000000
0 00000000 10100000000000000000000000
al etertes tive 104 00000000000000000000000000

2.20 What would be the mantissa and the exponent of the result of the following
base 7 calculation, expressed in base 7?

(3 x 78)/(6 x 7*)
Hint: You do not need to use a calculator to obtain the answer.

2.21 Using partial products (long multiplication), manually multiply the two 6-bit
binary numbers X = 100100 and Y = 101010 assuming they are signed.

2.22 Repeat the previous multiplication by swapping the multiplier and multipli-
cand (i.e. multiply the two 6-bit signed binary numbers X = 101010 and
Y = 100100). Compare the number of additions that are required to perform

the partial product summation. Is it possible to simplify the process by swapping


multiplier and multiplicand, and if so why?

2.23 Repeat the previous two multiplications using Booth’s method. Is there any
difference in the number of partial product additions when the multiplier and
multiplicand are swapped?

2.24 Referring to Section 2.9, determine the number of basic integer addition, shift
and multiplication operations required to perform a single-precision IEEE754
floating point normalised mode multiply, and compare this with the basic op-
erations required to perform a (2.30) x (2.30) multiply. Ignore extended inter-
mediate mode, overflow and saturation effects and assume the floating point
numbers have different exponent values.

2.25 How many computational operations are required to perform an 8-bit division
using repeated subtraction?
CHAPTER 3

CPU Basics

In this chapter, we begin looking at a cohesive unit which we can call a


computer — specifically its brains, the central processing unit (CPU). This
chapter will very much present a traditional view of computer architecture.
It will neither consider state-of-the-art extensions and speed-ups which
Chapter 5 will cover nor look too deeply at individual functional units
within a computer which Chapter 4 will cover.
Rather, this chapter will concentrate on what a computer is comprised
of, how it is organised and controlled and how it is programmed.

3.1 What Is a Computer?
When the general public refer to a computer, they generally envisage a
beige-coloured box with monitor, keyboard and mouse. While the box
they imagine does contain a computer, we know there is a whole lot more
in there.
The ‘computer’ part of the system is the CPU, memory subsystem
and any required buses — in fact those items that allow it to function as
a stored-program digital computer. It does not require a graphics card,
wireless interface card, hard disc or sound system in order to compute
and execute stored programs.
The stored-program digital computer is basically just a very flexible,
but generally quite basic, calculating and data transfer machine that is
programmable to perform the required functions.
These days, most people in the developed world will be surrounded
by tens, if not hundreds, of computers. These may be inside microwaves,
toasters, cellphones, MP3 players, even electronic door locks. It has been
estimated that a luxury car contains well over 100 processors, and even
an entry model may contain over 40 separate devices. In one surprising example encountered recently, a new double-sized electric blanket was promoted as containing four dedicated microprocessors — one active and one backup device for dual independent controls on each side. With usage on this scale it becomes easy to imagine that the ‘future is embedded’. The contents of this chapter apply whether the computer is room-size or the size of an ant.

3.2 Making the Computer Work for You


As we have seen, at its most basic level a computer is simply a unit able to transfer data
and perform logical operations. All higher-level computational functions are a sequence
or combination of these basic data moves and logic operations. Various units inside the
computer are dedicated to performing different tasks, and these are fairly standard
building blocks used by most computers. For example, an arithmetic logic unit (ALU)
performs arithmetic operations, while a bus transfers data from one point to another.
Obviously, some method is needed for directing the computer — deciding when and
where to move data and which logic operations to perform using these building blocks.
The computer (comprising its internal units and buses) must be programmed to perform
the work that we wish it to undertake.
As a first step, the work required needs to be divided into a sequence of available
operations. Such a sequence is called a program and each operation is commanded
through an instruction plus operands. The list of supported operations in a computer
defines its instruction set.

3.2.1 Program Storage


Instructions clustered into a program need to be stored in a way that is accessible to the
computer. The very first electronic computers were programmed by plugging wires into
different holes. Later, manual switches were used and then automated with punched
card readers. Punched and then magnetic tape were invented, but whatever the storage
format a new program was entered by hand each time after power-up.
Modern computers store programs on magnetic disc, ROM, EEPROM, flash
memory or similar media. Programs are often read from their storage device into RAM
before execution for performance reasons: RAM is faster than most mass-storage
devices.
Items stored in memory need to have a location that is accessible. Their storage
place also needs to be identified in order to be accessed. Early computer designers
termed the storage location an address, since this allows the CPU to select and access
any particular item of information or program code which reside at unique addresses.
The most efficient way to do this has been for the CPU to notify the memory storage
device of the address it requires, wait for the content of that address to be accessed and
then read in the value from the device interface some time later.
As you may know, CPUs are programmed at the lowest level in machine code
instructions which are fixed (in most RISC devices such as ARM, PIC or MIPS), or
variable length sequences of bytes (as in several CISC devices such as Motorola
68000). It is a bunch of these instructions, in some particular program sequence, that instructs a computer to perform required tasks.
For these sequences of instructions to do something useful, they probably require access to some data which requires processing. This historically encouraged a separation between program and data storage spaces, particularly since the two types of information have different characteristics: programs are typically sequential and read-only

whereas data may require read/write access and may be accessed either sequentially
or in a random fashion.

3.2.2 Memory Hierarchy


Storage locations within a computer can all be defined as ‘memory’, because once
written to they remember the value written. Often, however, we reserve this term for
referring to solid-state RAM and ROM rather than registers, CDs and so on. Whatever
the naming convention, storage is defined by various trade-offs and technology choices
that include the following characteristics:

• Cost.
• Density (bytes per cm³).
• Power efficiency (nanojoules per write, read or second of storage time).
• Access speed (including seek time and average access time).
• Access size (byte, word, page, etc.).
• Volatility (i.e. data lost when the device is unpowered).
• Reliability (does it have moving parts? does it age?).
• CPU overhead to manage it.

These factors lead to a hierarchy of memory as shown in the pyramid in Figure 3.1,
for both a large desktop/server and a typical embedded system. Two items shown will
be explored subsequently in Chapter 4: the memory management unit (MMU) and
cache. However, for the present discussion notice that registers — temporary storage
locations very close to the CPU functional units — are the fastest, but most expensive

Figure 3.1
A pyramidal diagram illustrating the hierarchy of memory in terms of speed, size, cost and so on for embedded systems (on the left) and traditional desktop computers (on the right). Registers at the apex are higher speed, closer to the CPU and more costly; the base of the pyramid offers the highest capacity at the lowest cost.

resource (and are therefore generally few in number, ranging from 1, 2 or 3 in simple
microcontrollers up to 128 or more in some large UNIX servers).
Moving down the pyramid, cost per byte decreases (and thus the amount provided
tends to increase), but the penalty is that access speed also decreases. A computer,
whether embedded, desktop or supercomputer, almost always comprises several of
the levels in the hierarchy:

• Registers – Store temporary program variables, counters, status information, return addresses, stack pointers and so on.
• RAM – Hold stack, variables, data to be processed and often a temporary store of program code itself.
• Non-volatile memory such as flash, EPROM or hard disc – Store programs to be executed, particularly important after initial power-up boot time when volatile RAM memory would be empty.

Other levels are there for convenience or speed reasons, and since there are so many
levels in the hierarchy, there are several places capable of storing required items of
information. Thus, a convenient means is required to transfer information between
locations as and when required.

3.2.3 Program Transfer


For reading a program from external storage into RAM, an I/O interface such as IDE
(integrated drive electronics — a very popular interface for hard discs), SCSI (small com-
puter systems interface — which can address discs, scanners and many other devices),
other parallel buses or serial buses (such as USB) are used (these interfaces will be
explained later in Sections 6.3.2 and 6.3.4).
The connection between the RAM and CPU, and also between CPU and I/O devices
is via a bus, and this transfers a program, a byte or word at a time. RAM may be external
or internal to the physical integrated circuit (IC) on which the CPU resides.
When an instruction from a program is read from RAM into a CPU it needs to be
decoded and then executed. Since different units inside the CPU perform different tasks,
data to be processed needs to be directed to a unit able to perform the required function.
To convey information around the inner parts of a CPU, there needs to be an internal
bus between an instruction fetch/decode unit and the various processing units, and
perhaps a bus to collect the result from each processing unit, and place it somewhere.
Often, data to be processed is already available in internal registers (and in particular, many modern CPUs, called load-store, constrain their architecture so data being processed must come from registers). This data is transported from registers to processing units via buses. Results will then be sent back to registers, again by bus. It is often convenient to group all internal registers together into a bank. In addition, in a regular architecture machine every processing unit will be connected to this bank of registers, again using a bus.
In Chapter 4, we will look at computer buses in a different way as we examine many of the functional blocks found in modern CPUs and consider the effect of different

bus arrangements on performance. Here, we can be content with the assumption that
such things as internal buses do exist.
Given a (possibly quite complex) bus interconnection network inside a CPU, plus
multiple internal functional units and registers that connect to this, the question arises
as to what arbitrates and controls data transfers across and between the buses.

3.2.4 Control Unit


Multiple buses, registers, various functional units, memories, I/O ports and so on, need
to be controlled. This is the job of the imaginatively named control unit. Most operations
require there to be a well-defined process flow within a CPU, such as:

• Fetch instruction.
• Decode instruction.
• Execute instruction.
• Save result (if any) of instruction.
Furthermore, there needs to be a method of ensuring that these steps occur and do
so in the correct order. This presupposes the need to have a set of control wires and
signals within a device from some control unit to each of the on-chip units that must
be controlled.
In early processors, the control unit was a simple finite state machine (FSM) end-
lessly stepping through one of several predefined states. Control wires ran from this to
each of the endpoints requiring control in a spider-web of wires and interconnects. We
will see this method in more detail when we design our own processor in Chapter 8.
Control is not only needed to fetch and distribute instructions, it is also needed
for carrying out the actions of single instructions. Consider the case of performing a
simple data transfer from register A to register B (LDR B, A) across a single 32-bit bus
as shown in Figure 3.2.
The two triangles within the figure are tristate buffers — devices similar to a switch
in that when the control signal is enabled, signals can pass through the buffer but when
the control signal is disabled, signals do not pass through the buffer. This is used in a
bus (for example) to decide which register is allowed to drive the bus wires. Only a
single register can drive a bus at any one time, so all other tristates connected to that
bus must remain turned off.
Bearing this in mind, the actions that need to be taken for a data transfer are
summarised here:
1. Turn off any tristate buffers driving the bus (in this case de-assert ena1 to 4).
2. Assert ena2 to turn on the 32-bit tristate, driving the content of register A onto the shared bus.
3. Assert ena3 to feed the bus data into register B.
4. De-assert ena3 to lock the bus data into register B.
5. De-assert ena2 to free up the bus for other operations.

Figure 3.2
A block diagram of a very simple computer control unit showing two registers, each with selectable tristate buffers and a single 32-bit bus connecting all ports.
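The sequencing can be made concrete in a few lines of C. The model below is only a sketch: the structure, the field names and the clock_tick() helper are invented for this illustration and describe behaviour, not real hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* A software model of the register-to-register transfer of Figure 3.2.     */
    struct bus_model {
        uint32_t bus;            /* the shared 32-bit bus                       */
        uint32_t reg_a, reg_b;   /* the two registers                           */
        int ena2, ena3;          /* ena2 drives A onto the bus; ena3 latches B  */
    };

    static void clock_tick(struct bus_model *m)
    {
        if (m->ena2) m->bus = m->reg_a;   /* only an enabled tristate drives the bus */
        if (m->ena3) m->reg_b = m->bus;   /* an enabled register input captures it   */
    }

    int main(void)
    {
        struct bus_model m = { .reg_a = 0xCAFE, .reg_b = 0 };

        m.ena2 = 0; m.ena3 = 0; clock_tick(&m);   /* 1: nothing drives the bus           */
        m.ena2 = 1;             clock_tick(&m);   /* 2: register A drives the shared bus */
        m.ena3 = 1;             clock_tick(&m);   /* 3: bus value fed into register B    */
        m.ena3 = 0;             clock_tick(&m);   /* 4: value locked into register B     */
        m.ena2 = 0;             clock_tick(&m);   /* 5: bus freed for other operations   */

        printf("B = 0x%x\n", m.reg_b);            /* prints 0xcafe                       */
        return 0;
    }

The model also makes the ordering constraint visible: if step 3 were issued before step 2 had taken effect, register B would simply capture whatever stale value happened to be sitting on the bus.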

Perhaps the details of the process will differ from device to device (in particular
the enable signals are usually edge-triggered on different clock edges), but something
like this process is needed — in the order given — and more importantly sufficient time
is required between stages for:

• 1 to 2 – Wait for the ‘off’ signal to propagate along the control wires, hit the tristate buffers and for them to act on it.
• 2 to 3 – Wait for the bus voltage to stabilise (i.e. the content of register A to be reflected by the bus voltage levels).
• 3 to 4 – Give the register sufficient time to capture the bus value.
• 4 to 5 – Wait for the control signal to hit the register and the register to stop ‘looking at the bus’ before the bus can be freed for another purpose.

Sometimes the waiting time is most important. In modern processors it is counted


in system clock cycles, with each stage of the process being allocated a single or poten-
tially more cycles.
Figure 3.3 illustrates cycle-by-cycle timing for the case of one clock between ac-
tions, showing the sequence of events at each stage in the process. It is evident that a
synchronous control system is needed to carry out a sequence of actions involved in
even the most simple of CPU instructions.
Not all instructions need to step through the same states. Some instructions, such
as those that return no result, can be terminated early. Those instructions could either
be supported by allowing a state machine to continue running through all states (but
with dummy actions for the unused states), or be supported by early termination or
custom state transitions. In the example given, such an instruction would not need to
complete five states before finishing.

Figure 3.3
An illustration of the cycle-by-cycle timing of the simple control unit that was shown in Figure 3.2 as it transfers data from register A to register B. Darker lines indicate that the particular bus or signal is active at that time.

Some instructions are likely to need specialised handling that extends the state
machine further. CPU designers generally cater for this by increasing the complexity
of the state machine to handle such exceptions to the rule, all in the quest to increase
runtime efficiency.
Over the years, more and more weird and wonderful instructions have been intro-
duced. It does not take a genius to figure out where they all have ended up — more and
more complex state machines! In some cases, the CPU control unit became the most
complex part of the design and required up to half of the on-chip area. In other cases,
the state machine was so complex that it was itself implemented as another CPU — in
effect a simpler processor handling the control needs of a larger and more complex one.
In IC design terms (as in many other fields), complexity is known to lead to errors and
for these reasons alternatives were researched.
So far, we have only considered the case of handling different instructions within a
processor. Now, let us consider the actual task of distributing the control signals across
larger and ever-growing IC sizes with increasing numbers of internal bus interconnects,
larger register banks, more functional units and a larger degree of clocking complexity
and flexibility. It is to be expected that a larger degree of the internal processor routing
logic (i.e. wires that traverse the device from one side to another) is going to be needed.
This presents difficulties beyond the complexity of instruction control. It turns out that
in a silicon IC, the interconnects that can reach across an entire chip are a scarce resource:
these are normally reserved for fast data buses. The need to utilise more and more of
these for dedicated control purposes has provided another impetus to the research of
alternative control strategies.
Three general methodologies resulted, namely distributed control, self-timed control
and simplification (increased regularity). The main example of distributed control is in
the use of microcode, explored in Section 3.2.5. An example of simplification is in the
move to RISC processors, explored in Section 3.2.6. Let us briefly examine each control
method.
Figure 3.4 shows part of the internals of a very simple CPU. There are four
registers in a bank and two arithmetic logic units (ALUs) all connected through two

Figure 3.4
A block diagram of the centralised control wiring required for a very simple CPU. The triangle symbols denote tristate buffers.

Figure 3.5
A small control unit is shown in this diagram wired to the input-select logic for a bank of four registers.

shared data buses. At each point of bus entry/exit there is a tristate buffer. Each bus,
tristate, register and ALU port is several bits wide.
Evidently, the thin control wires emanating from the control unit are many, even for
such a simple system. These are used to control each of the tristate buffers and the mode
of the ALUs (which can perform several selectable functions). Some, such as register-
select logic, are not shown. In Chapter 4 and beyond, different bus arrangements will be
discussed, but control signals such as these will not be shown in subsequent chapters:
diagrams simply become too complicated.
One simplification that can be introduced is the use of a control bus or several
control buses. Instead of two control signals needed for each register as in Figure 3.4,
the fact that each data bus can only carry a single value at a time can be exploited to
need only a 2-bit selection bus to drive each data bus (i.e. 4-bit total control for the
system shown). This is termed a register-select bus. Such an approach may not seem
particularly beneficial in a four-register system, but with 32 registers it would reduce
the number of register-select control wires from 64 to 6. A small example is shown in
Figure 3.5.
The number of wires emanating from the control unit to the register bank in Fig-
ure 3.5 is four. These are decoded in the register bank itself to select the appropriate
register. This is not necessarily minimising logic, but is minimising the number of con-
nections around the CPU.
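The saving comes from binary encoding: the individual enable for each of n registers can be recreated locally, inside the register bank, from a select code of roughly log2 n bits. The C fragment below sketches only the decoding idea; the function name and the one-hot output format are chosen for this illustration rather than taken from any particular design.

    #include <stdint.h>
    #include <stdio.h>

    /* Expand a binary register-select code into a one-hot enable pattern,
       one bit per register. With 32 registers only a narrow select code has
       to be routed across the chip; the 32 individual enables stay local.  */
    static uint32_t decode_select(unsigned sel, unsigned num_regs)
    {
        return (sel < num_regs) ? (1u << sel) : 0u;
    }

    int main(void)
    {
        /* Selecting register 5 of 32 asserts exactly one of the 32 enables. */
        printf("enables = 0x%08x\n", decode_select(5, 32));
        return 0;
    }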
To summarise, control is needed for arbitration of internal buses, for initiating the
fetch, decoding and handling of instructions, for interactions with the outside world
(such as I/O interfacing) and pretty much everything sequential in a CPU, which is a
great deal. Control may even extend to handling external memory, and the next chapter carries an important example of this in the memory management unit.
Self-timed control is an alternative strategy that distributes control throughout a CPU, following from the observation that most instructions need to follow a common ‘control path’ through a processor — fetch, decode, execute and store. And during execution, the process is also fairly common — drive some registers onto buses, drive values from buses into one or more functional units, then some time later allow the result to be collected (again using one or more buses) and latched back into registers.

Self-timed control in this instance does not imply an asynchronous system since
each block is synchronous, albeit to a faster clock (note that self-timing is used within
some esoteric asynchronous systems which we will explore in Chapter 9, but in this
case we are only dealing with synchronous logic).
A centralised control unit could specify in turn ‘fetch now’ then ‘decode now’ then
‘execute now’ and finally ‘store now’. This would require control connections from the
IC areas responsible for each of these tasks, back to the central control unit. However,
the self-timed strategy requires the control unit to simply start the process of instruction
fetch. The signal ‘decode now’ would be triggered from the fetch unit and not from a
central location. Similarly, ‘execute now’ would be a signal generated by the decode
unit and passed to the execute unit. In this way, a control interconnect is needed from
each unit to the next unit, but not all going to a single central location. In effect, the
control signals are actually following the data paths, something that becomes even more
effective in a pipelined machine (which will be covered in Chapter 5).
The two alternative approaches of centralised and self-timed control are shown
in the flowcharts of Figure 3.6. In this case, data buses are not shown which would
originate from external memory and traverse the fetch, decode, execute and store
(FDES) string. On the left is shown a control unit with four control buses, each one
linked to the enable inputs of the four separate units. At the relevant times as specified
in an internal state machine, the control unit will initiate operations in the FDES units.
Depending upon the instruction being processed, the control unit state machine
may need to operate the FDES differently (perhaps a longer execution stage or skip the
store). This knowledge must be encoded within the control unit, which must remember
every combination of operations for every unit connected to it.
The state machine must firstly contain detailed information on the timings and
requirements of each unit. It must also keep track of potentially multiple instructions
progressing simultaneously through these units.
On the right-hand side, a self-timed system is shown: the control unit still initiates
the process, but in this case each subsequent unit is initiated from the previous unit as

Figure 3.6
Control flowcharts of the alternative strategies of centralised control (left) and self-timed control (right).

and when necessary. Since the units themselves initiate the next step, the data buses
(not shown) are assumed to have the correct information at the correct times.
Depending upon the instruction being processed, units may decide to skip them-
selves and pass the request directly to the next unit. Each unit must thus encode the
knowledge of its own responsibilities and timings.
Perhaps more problematic is the need to convey different information to the various
units. For example, the execute unit needs to know what function is to be performed —
is it an AND, OR, SUB and so on. It does not need to know where to store the result
from the execution — this information is needed by the store unit which in turn does
not need to know what function was performed. In the self-timed case, either a full
record of needed information is passed between units, with units only reading the
items relevant to them, or there is a centralised store for such information. The choice of
implementation strategy depends upon complexity and performance requirements.

3.2.5 Microcode
As CPUs grew and became more complex, they ended up as an amalgamation of basic
required functionality, with assorted customised one-off instructions, some of which
were past their sell-by-date, an example from the 1980s being the binary-coded-decimal
handling instructions of the Intel 8086, 80386, 80486 processors required for backwards
compatibility with several decades-old legacy business software. The commercial drive
was for greater processing speed, and that was achieved partly through increasing clock
rates and partly through performing more functions with a single instruction.
Much of this drive for complex instructions came from the disparity between the
speed of external and internal memory. Internal memory cost an enormous amount of
money, but was up to 1000 times faster than external memory. A big bottleneck was
dragging an instruction from external memory into the processor. It therefore made
perfect sense to create a single complex instruction that replaced a sequence of 100
separate smaller instructions.
In fact, it was possible to think in terms of tokens. The external program was written
in tokens (instructions), fed slowly into the CPU, each causing a longer sequence of in-
ternal operations. Each token could launch a sequence of internal operations, and these
internal operations in turn were really programs, written in microcode. Microcode was
the basic instruction set of these processors, but often did not particularly resemble the
external instructions. Every external instruction would be translated into a microcode
program or microprogram, upon entering the CPU.
Microprogramming, as a technique, was actually invented by Maurice Wilkes in the early 1950s at Cambridge University, although one of the IBM System/360 family of computers was probably the first commercial machine to implement this technology.
Some of the microcoding concepts are illustrated in Figure 3.7 where an external program in slow memory is being executed by the CPU. The current program counter (PC) is pointing at the instruction DEC A, presumably a command to decrement register A. This is fetched by the CPU and decoded into a sequence of microcode instructions

Figure 3.7
A block diagram of an instruction being fetched from slow external memory, decoded inside a CPU and executed as a sequence of much simpler microcode instructions.

to load register X from A, then load register Y with 1, then subtract Y from X and finally
to store the result back in A.
The small four-instruction microprogram that the DEC instruction launches is con-
tained entirely inside the CPU, in fast, on-chip read only memory (ROM), and requires
an internal microprogram counter. None of this is visible from the ‘outside world’ of
the external program which may not even know that registers X, Y and Z exist inside
the CPU.
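The flavour of this expansion can be captured in C. The sketch below is purely illustrative: the micro-operation names, the ROM table and the interpreter loop are invented for this example and are not taken from any real microcoded device, but they show how a single external token (DEC A) steps an internal microprogram counter through a sequence of simpler operations.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical micro-operations; the names are illustrative only. */
    typedef enum { UOP_LOAD_X_FROM_A, UOP_LOAD_Y_IMM1,
                   UOP_SUB_X_Y, UOP_STORE_X_TO_A, UOP_END } uop_t;

    /* Microprogram ROM: the external 'DEC A' token expands to four internal steps. */
    static const uop_t dec_a_microprogram[] = {
        UOP_LOAD_X_FROM_A, UOP_LOAD_Y_IMM1, UOP_SUB_X_Y, UOP_STORE_X_TO_A, UOP_END
    };

    int main(void)
    {
        int32_t A = 7, X = 0, Y = 0;     /* architectural register A, hidden registers X and Y */

        /* The microprogram counter steps through the on-chip ROM. */
        for (const uop_t *upc = dec_a_microprogram; *upc != UOP_END; ++upc) {
            switch (*upc) {
            case UOP_LOAD_X_FROM_A: X = A;     break;
            case UOP_LOAD_Y_IMM1:   Y = 1;     break;
            case UOP_SUB_X_Y:       X = X - Y; break;
            case UOP_STORE_X_TO_A:  A = X;     break;
            default:                           break;
            }
        }
        printf("A after DEC A: %d\n", A);      /* prints 6 */
        return 0;
    }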
Extending this approach further led to a processor which used nanocode: external
programs would be converted to a microprogram of microcode instructions, each of
which would in turn translate to a nanoprogram of nanocode instructions! Despite
the elegance of this Cat-in-the-Hat technique, there were decreasing returns with the
microcode approach. It relied upon the fact that external memory was a bottleneck.
In the days when external random access memory (RAM) was expensive and slow,
ply but internal ROM was very fast, this was undoubtedly true. But then advances in
o RAM technology, including static RAM (SRAM), dynamic RAM (DRAM) and then
=
ee
®
synchronous dynamic ram (SDRAM) all chipped away at the speed advantages of
-_
p=|
o.
ROM that by the 1990s there was little difference between the technologies.
E With minimal speed advantage, the popularity of microcode began to wane.
°
O An exception was where the benefits of instruction translation were required. This
®
A
a
_
feature is inherent in the microcode approach, and allows a CPU of one type to use the
oD instruction set of another machine.
=ars In the late 1990s, processors were being developed that were internally RISC ma-
2)
= chines, but which could execute CISC instruction sets (see next section). Nowhere was
Th
CPU Basics

this advantage more clear than with the humble x86 series of processors. With a design
heritage harking back to 1971, these CPUs had to not only guarantee backwards code
compatibility by executing an ancient and poorly-optimised CISC instruction set, but
had to do this faster than competing processors. The old-fashioned CISC instructions
that entered some of these processors would be translated into sequences of much
faster optimised RISC-style assembler. The RISC instructions thus took the place of
modern-day microcode.
A further advantage of the microcode translation was the design of a processor that
could mimic other devices. Such a device could execute an ARM program as if it were
a native ARM processor, and then switch to executing Texas Instruments DSP code as
if it were a TI DSP: the ultimate approach to being all CPUs to all programmers.
Despite such niche markets, the driving factors behind microcode disappeared,
and it became less popular in the 1980s. The trend was constantly towards doing more,
and doing it faster: Moore’s Law in full swing.

3.2.6 RISC vs CISC Approaches


The ideas behind RISC (Reduced Instruction Set Computer) and CISC (Complex In-
struction Set Computer) have been mentioned briefly in Section 2.2. The CISC archi-
tecture encompasses many complicated and powerful instructions, whereas the RISC
architecture concentrates on a smaller subset of common useful instructions which it
handles extremely fast. Even when complex operations are synthesised through mul-
tiple RISC instructions they will be as fast, or faster, than if encoded directly as a CISC
instruction.
This concept is illustrated in Figure 3.8 showing two programs — one running on a
RISC machine, with its fast one-cycle per instruction operation completing a program of
12 instructions (A to L) in 12 clock cycles. Below that is a CISC computer with its longer
clock cycle (because the hardware is more complicated and thus slower) completing
the same process in roughly the same number of clock cycles, but in this case using only
five complex instructions instead of the 12 in the RISC machine. Since the clock cycles
are longer, it completes the task slower than the RISC machine. This is typically the
case, although conditions do sometimes exist, especially for smaller programs, where
the CISC processor can complete its program faster.
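The comparison rests on the familiar relation that execution time equals instruction count multiplied by cycles per instruction multiplied by clock period. The short calculation below uses the 12 and 5 instruction counts from Figure 3.8, but the clock periods and the CISC cycles-per-instruction figure are invented purely to illustrate the trade-off; they are not measurements of real devices.

    #include <stdio.h>

    /* Back-of-envelope timing comparison in the spirit of Figure 3.8. */
    int main(void)
    {
        /* RISC: 12 simple instructions, 1 cycle each, short clock period.      */
        double risc_cycles    = 12 * 1.0;
        double risc_period_ns = 10.0;      /* assumed 100 MHz clock             */

        /* CISC: 5 complex instructions, several cycles each, longer period.    */
        double cisc_cycles    = 5 * 2.5;   /* assumed average of 2.5 cycles     */
        double cisc_period_ns = 15.0;      /* assumed slower clock              */

        printf("RISC time: %.1f ns\n", risc_cycles * risc_period_ns);   /* 120.0 */
        printf("CISC time: %.1f ns\n", cisc_cycles * cisc_period_ns);   /* 187.5 */
        return 0;
    }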

Figure 3.8
A diagram illustrating the difference in size, speed and functionality of CISC and RISC instructions. RISC instructions (top) are uniformly small, each occupying a single CPU cycle, indicated by the vertical tick marks. By contrast, the CISC instructions (bottom) require multiple cycles to execute and often accomplish more per instruction than in the RISC case.

However, this account does not quite describe the two approaches in context and
for that we require a little hindsight. Taking a historical perspective, early computers
were operated by the designers of the machines themselves. Designers knew what
basic operations were required in their programs and catered for these directly in hard-
ware. As hardware became more capable, it became possible to add instructions to the
computer that could perform functions that would otherwise require time-consuming
strings of instructions.
As time progressed, computer programmers concentrated on software develop-
ment, and computer architects specialised in the hardware aspects. Programmers would
then approach architects asking for custom instructions to make their programs faster.
Architects often complied, but sometimes took the initiative to add what they imagined
were useful instructions, but which left the programmers scratching their heads.
By the mid-1980s, various design groups, most notably at Berkeley and then Stan-
ford universities in the USA, began to question the prevailing design ethos. They were
probably prompted in this by groundbreaking work performed quietly at IBM, in which
less complex machines that could clock much faster because of simple and regular de-
sign, were investigated. These machines demonstrated that simple instructions could be
processed very quickly. Even though sometimes a few RISC instructions were needed to
perform the same operation as a single-CISC instruction, a RISC program was typically
still significantly faster overall.
The name Reduced Instruction Set Computer pays tribute to the simplicity of the
original designs, although there was no actual reason to reduce the size of the instruc-
tion set, just to reduce the complexity of the instructions. Groups that popularised RISC
technology produced, in turn, the RISC I, RISC II and MIPS processors. These evolved
into commercial devices delivering powerful workstation performance where back-
wards compatibility with x86 code was not required, namely the SPARC and MIPS
devices.
In the meantime, over in Cambridge in the UK, a tiny design group at Acorn Com-
puters Ltd, the highly successful producer of the 6502-powered BBC microcomputer
range (that contributed to the UK having the highest rate of computer ownership in the
world), had designed their own processor, based on the earliest Berkeley work. This
Acorn RISC Machine, the ARM1, was designed on a 2-MHz BBC microcomputer running
BASIC. Acorn wrote their own silicon design tools for this processor which was very
soon followed by the ARM2, which became the world’s first commercial RISC process-
ing chip. This powered the novel Acorn Archimedes range of computers. By 2002, ARM,
now renamed Advanced RISC Machine, became the world's top-selling 32-bit processor claiming 76.8% of the market. By mid-2005, over 2.5 billion ARM processor-powered products had been sold, and by the start of 2009 that had increased to be more than one sold for every person on the planet. The popularity of the ARM processor continues to increase. Box 3.1 briefly explores the background to the development of the amazing ARM processor.
While Intel rode the wave of the desktop personal computer boom, the ARM architecture is riding the much larger wave of the embedded processor boom. CPUs are now

Box 3.1   How the ARM was designed

In the mid-1980s, groundbreaking British computer company Acorn, with a contract
from the British Broadcasting Corporation (BBC) to design and market BBC micro-
computers was looking for a way to move beyond their hugely successful 8-bit BBC
microcomputers. These were powered by the lean and efficient Rockwell 6502 proces-
sors. The BBC initiatives had encouraged computer use in the UK so much that there
were reportedly far more computers per capita in England than anywhere else in the
world. Sir Clive Sinclair’s ZX Spectrum for example, had sold 4 million units by the
time sales of the IBM PC had reached 1 million units. Acorn is also reputed to have
sold over 1 million BBC computers overall.
In the early explosion of the ‘computer revolution’ it quickly became apparent
to Acorn that 16-bit processors from companies such as Intel and Motorola were not
powerful enough to meet their projected future needs — needs which included releasing
the world’s first multi-tasking graphical desktop operating system in the late 1980s
(later some observers would conclude that this was copied by Microsoft as the basis
for Windows 95, XP and beyond).
In typical pioneering fashion, Acorn decided that, since nothing good enough
was available, they would create their own processor. They designed the ARM1 and
its support ICs (such as MEMC and VIDC) within two years despite having never
developed any silicon previously.
Acorn wanted a machine with a regular architecture – similar to the 6502, but vastly
more powerful. They chose to use the RISC approach, but revisited their software needs
by analysing operating system code to determine most used instructions which they
then optimised for the ARM processor. The same approach yielded an instruction set
(see Section 3.3) and its coding. Later, much needed additions were the multiply and
multiply-accumulate instructions.
This heritage leaves the globally successful ARM processor with a direct link back
to the UK Government-funded BBC initiatives: the ARM software interrupt, supervi-
sor modes, fast interrupt, no microcode, static pipeline, load-store architecture are all
derived either from the hardware or the software architectures adopted by Acorn.

inside almost every electronic product and most of these are ARM-based. Meanwhile,
Acorn itself no longer exists, having self-destructed in 1999.

3.2.7 Example Processors

Over the years, since the IBM research group published their initial results, the RISC approach has impacted almost every sphere of processor design. In particular, the ARM RISC processor family now dominates the world of embedded systems. Therefore, in this book almost all assembly language code examples are given in ARM assembler format. For example:

    ADD R0, R1, R2

adds together the contents of registers R1 and R2 and stores the result in
register R0.
Today, although it is easy to find examples of ‘pure’ RISC processors such as
the ARM and MIPS, even the die-hard CISC devices (such as Motorola/Freescale
68000/Coldfire and some of the Intel x86 range) are now implemented with CISC-
to-RISC hardware translators and internal RISC cores. Pure CISC processors do not
seem to be popular these days. For this reason, when referring to CISC processors we
define a pseudo-ARM assembler format, rather than use the format from any particular
CISC device:
    ADD A, B, C

adds together registers B and C, placing the result in register A. Usually, examples
in this text are identified as being RISC or CISC, and can otherwise be differentiated
because the RISC examples use ARM-style registers RO to R15 whereas CISC examples
use alphabetical registers A, B, C and so on. Some special-purpose registers are also
mentioned in later sections; SP is the stack pointer, LR is the link register.
The only exception to the use of pseudo-ARM instructions in this book is in
discussions relating to the Analog Devices ADSP21xx processor and a single Texas
Instruments TMS320 example. The ADSP in particular uses an assembly language
that is structurally similar to the C programming language, and therefore quite eas-
ily readable. These exceptions will be highlighted at the time the code segments are
presented.
Note that some processors, most notably the 68000, would actually specify the
destination register last instead of first as in the ARM. However, in this book the desti-
nation register is always specified ARM style, and any comment is written following a
semicolon (‘;’):
    SUB R3, R2, R1     ; R3 = R2 - R1

Sometimes the destination and first source register are the same:

    ADD A, A, B        ; A = A + B

or perhaps there is only a single source register:


    NOT B, F           ; B = not F
or maybe no source register:
    B R3               ; jump to address contained in R3

Generally the instructions themselves are self-explanatory (ADD, AND, SUB and so on). The following section will provide more examples and detail on the ARM instruction format, including a list of all instruction families.
Beware, in the ARM, the destination register is specified first for all instructions apart from the store to memory instruction and its variants:

    STR R1, [R3]

This would store the content of register R1 into the memory address held in R3.

3.3 Instruction Handling
As mentioned in Section 3.2, computers are operated through sequences of instructions
known as programs. The generic term for such programs is software. Various schemes
exist for creating software through writing in high-level languages (HLL), where each
HLL command is made up of a sequence of perhaps several tens of CPU instructions.
In low-level languages, typically each command invokes few, or perhaps only a single
CPU operation.
If we define a CPU operation to be some data move or logical transaction by the
CPU, an instruction is a command to the CPU from a program (which results in one or
more CPU operations). A HLL command is made up of one or more instructions, and
a stored program is a list of such instructions.
In some computers, a single instruction can be used to invoke multiple CPU opera-
tions. This may be required for performance reasons, especially where the rate at which
a program can be read from external memory is far slower than the speed at which the
processor can execute the operations. In fact, this thinking led in the past to the idea of
microcode (explored in Section 3.2.5).
Machine code is the name given to (usually) binary numerical identifiers that cor-
respond to known actions in a CPU. This may mean, for example, that when exam-
ining program memory, hexadecimal byte 0x4E followed by byte 0xA8 might repre-
sent two instructions in an 8-bit processor, or a single instruction, 0x4EA8, in a 16-bit
processor. In modern processors, programmers are very rarely required to deal with
the underlying binary numerical identifiers that the processor understands, but han-
dle these through a set of abbreviated mnemonics called assembly language or assem-
bly code. It is this code that is produced and stored when compiling a HLL into an
executable.
The instruction set is a list of the possible assembly language mnemonics. It is a list
of all instructions supported by a particular CPU.

3.3.1 The Instruction Set


The instruction set describes the set of operations that the CPU is capable of perform-
ing, with each operation being encoded through an instruction which is part of the set.
Some instructions require one or more operands (for example, ADD A, B, C where A,
B and C are called the source and destination operands and may be immediate values,
registers, memory locations or others depending on the addressing modes available —
see Section 3.3.4). Often, there is a restriction placed on operand type or range, for example, a shift instruction may be limited to the maximum shift allowed by the shift hardware.
The instruction set contains every instruction and thus describes the full capability of the processor hardware. The set may be broken into groups based upon which processor unit they involve such as the following defined for the ADSP2181 processor:

Instruction group    Example operations within the group
ALU                  add, subtract, AND, OR, etc.
MAC                  multiply, multiply-accumulate, etc.
SHIFT                arithmetic/logical shift left/right, derive exponent, etc.
MOVE                 register/register, memory/register, register/memory, I/O, etc.
PROGRAM FLOW         branch/jump, call, return, do loops, etc.
MISC                 idle mode, NOP, stack control, configuration, etc.

Many processors would add an FPU or MMX group to those defined, but the
ADSP2181 is a fixed point only processor with no multimedia extensions.
The instruction set for the ARM processor, specifically the ARM7, is shown for
reference in Figure 3.9 (note this shows the ARM mode instructions and does not include
the 16-bit Thumb mode that many ARM processors also support). Notations used in
the instruction set table include the following:

• S in bit 20 indicates the instruction should update condition flags upon completion (see Box 3.2).
• S in bit 6/22 indicates a transfer instruction should restore the status register.
• U signed/unsigned for multiply and up/down for data transfer index modifications.
• I an indicator bit used to select immediate addressing.

Figure 3.9
The ARM instruction set in a tabulated format. Columns are headed by the instruction word bit they contain. All 14 classes of instruction available in this version of the ARM instruction set are shown: undefined, data processing, branch, multiply (MUL), long multiply, single data transfer (LDR/STR), halfword transfer (register and immediate offset forms), block transfer, branch and exchange (BX), single data swap and the coprocessor instructions LDC, CDP and MCR. The condition field occupies bits 31 to 28 of every instruction.

• A accumulate/do not accumulate answer.
• B unsigned byte/word.
• W write back.
• L load/store.
• P pre- and post-increment and decrement operators.
• R indicates one of the 16 registers.
• CR indicates a co-processor register (one of eight co-processors that can be identified).

Many of these modifiers are specific to the ARM processor and will not be con-
sidered further in the text. However, we shall look in more detail at the ‘S’ bit and
the addressing capabilities (see Section 3.3.4). The interested reader is referred to the
ARM Ltd website! where further explanations and documents are available. The in-
struction set varies slightly among ARM processors. The version shown above is the
more common ARM7TDMI version.?
Recently, ARM have completed a rebranding exercise in which their processors
are now known as Cortex devices. The original ARM7, ARM9 and ARM11 devices are
termed ‘classic’. Most likely, this move has been an effort to counter the fragmentation
of the huge ARM market in which one basic architecture (the ARM) was required to
span a very wide and diverse set of needs, ranging from tiny and slow sensor systems
to larger and faster handheld computers. At the time of writing, the new processors are
classed into three ranges which better subdivide the traditional strength areas for ARM
devices:
Cortex-A series processors are application-oriented. They have the in-built hard-
ware support suited for running rich modern operating systems such as Linux, with
graphically rich user interfaces such as Apple’s iOS and Google’s Android. The pro-
cessing power of these runs from the efficient Cortex-A5, through the A8, A9 and up to
the highest performance Cortex-A15 device. All support ARM, Thumb and Thumb-2
instructions sets (Thumb-2 reportedly improves upon Thumb in terms of performance
and compactness).
Cortex-R series devices are targeted to real-time systems that also have significant
performance requirements. These include smartphone handsets, media players and
cameras. The ARM company is also promoting Cortex-R for automotive and medical
systems; ones in which reliability and hard real-time response are often important.
These probably do not require complex and rich operating systems, just small, hard
and fast real-time arrangements. At the time of writing, only the Cortex-R4 is available,
and has already found its way into many real-time systems in use worldwide.
Cortex-M family processors are at the lower end of the range for use in very cost-sensitive and low power systems. It could be argued that these are for traditional microcontroller-type applications that probably do not need advanced operating system support (and possibly do not need any OS). These are for applications that do not have rich user interface requirements, and for which the clock speed will be no more than several tens of MHz. At the time of writing, the Cortex-M0 is the entry device, beyond which the M3 and M4 provide increasing levels of performance.

¹ http://www.arm.com
² This information was extracted from ARM Ltd Open Access document DDI 0029E.
Although most variants of the ARM7 support a 16-bit Thumb mode (see Sec-
tion 3.3.3), all ARM7 devices support the standard fixed length 32-bit instructions
shown above. It can be seen that, as in the ADSP21xx, there are various groups of
instructions, such as data processing, multiply or branch. With 15 instruction groups,
4 bits are needed to represent the instruction group and further bits are used within
this to represent the exact instruction in each group.
Notice the fixed condition bits available for every instruction. No matter which
instruction is being used, these bits are located at the same position in the instruction
word. This regularity aids in instruction decoding within the processor. It is important
to note that the consequence of this is that every instruction can operate conditionally.
This is unusual, and among common modern processors is found only in the ARM: most
other processors support conditional branch instructions only. In the ARM, the S bit
within many instruction words controls whether that instruction can change condition
codes on completion (see Box 3.2). These two features, when used in conjunction with
each other, are very flexible and efficient.
Also, note that for every instruction, the destination register (if required) is in the
same place in the instruction word. This further regularity also simplifies the decoding
process.

3.3.2 Instruction Fetch and Decode


In a modern computer system, programs being executed normally reside in RAM (they
may have been copied there from hard disc or flash memory). A memory controller,
usually part of a memory management unit that we will explore in Section 4.3, controls
external RAM and handles memory accesses on behalf of the CPU.
Within the CPU, an instruction fetch and decode unit (IFDU or simply IFU) retrieves
the next instruction to be executed at each instruction cycle. The next instruction is
identified by an address pointer, which is held in a program counter (PC) in nearly every processor in use today. This program counter is normally incremented automatically after an instruction is retrieved, but is overridden with a new value when a jump or a branch occurs. These items are illustrated in Figure 3.10.
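A minimal software model of this fetch behaviour is sketched below. The toy ‘instruction encoding’ (a top byte of 0xEA marking a branch whose low 24 bits give a word address) is invented for the sketch and should not be read as a real instruction format; the point is simply that the PC normally increments, and a branch overwrites it.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy instruction memory: the third word branches back to address 0. */
    static const uint32_t program_memory[] = {
        0x00000001, 0x00000002, 0xEA000000, 0x00000003
    };

    static uint32_t step(uint32_t pc)
    {
        uint32_t instruction = program_memory[pc / 4];   /* fetch the word the PC points at   */

        if ((instruction >> 24) == 0xEA)                 /* decoded as a branch?              */
            return (instruction & 0x00FFFFFF) * 4;       /* PC overridden with branch target  */

        return pc + 4;                                   /* otherwise advance to the next word */
    }

    int main(void)
    {
        uint32_t pc = 0;
        for (int i = 0; i < 6; ++i) {                    /* trace a few fetch cycles          */
            printf("PC = %u\n", pc);
            pc = step(pc);
        }
        return 0;
    }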

Figure 3.10
A diagram showing the connectivity of the memory controller in a typical CPU system.

Box 3.2   Illustrating conditionals and the S bit in the ARM

Consider the efficiency of the ARM processor compared to a mythical standard RISC processor that does not allow conditional operation for every instruction.
The instruction mnemonics used are similar to those of the ARM (but not com-
pletely realistic). First, we will examine the program on the standard RISC processor
that adds the numbers in registers R0 and R1 and then, depending on the answer,
either places a 0 in register R2 (if the result is less than 0) or places a 1 in register R2
otherwise.
    ADDS R0, R0, R1
    BLT  pos1          (branch if less than 0)
    MOV  R2, #1
    B    pos2
pos1    MOV  R2, #0
pos2    ...

The program occupies five instructions and will always require a branch no
matter what registers R0 and R1 contain on entry.
The following code segment reproduces the same behaviour for the ARM pro-
cessor, but uses conditional moves to replace the branch. In this case, R0 and R1 are
added. The S after the ADD mnemonic indicates that the result of the addition should
update the internal condition flags. Next, a value 1 is loaded into R2 if the result of the
last condition-code-setting instruction is less than 0. A value 0 is loaded into R2 if the
result is greater than or equal to 0.

    ADDS  R0, R0, R1
    MOVLT R2, #1
    MOVGE R2, #0

The ARM version is obviously shorter — only three instructions are required, and
in this case no branches are needed. It is this mechanism that allows ARM programs
to be efficient whereas RISC processors are traditionally known for less efficient code
density. In higher level languages, the structure that leads to this code arrangement is
very common:

IF condition THEN
    action 1
ELSE
    action 2
2
Once the instruction fetch and decode unit reads an instruction, it begins to decode that instruction which then flows through steps as shown in the flowchart of Figure 3.11.


Figure 3.11
A flowchart of instruction processing for a typical processor: fetch instruction, decode instruction, fetch operand, execute instruction.

Instruction Decode


In the ARM, because all instructions can be conditional, the IFU first looks at the condi-
tion code bits encoded in the instruction and compares these bitwise with the current
condition flags in the processor status register. If the conditions required by the instruc-
tion do not match the current condition flags, then the instruction is dumped and the
next one retrieved instead.
In the ARM, the simplicity of the instruction set means that the conditional bits of
each retrieved word can simply be ANDed with status register bits 28 to 31 (that encode
the current condition flags). Box 3.3 explains the quite extensive set of conditional codes
available in the ARM.
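The decision can be sketched in C as follows. The helper name and the sample instruction word are ours, only a few of the sixteen condition codes of Box 3.3 are spelled out, and the flag positions assume the usual N, Z, C, V ordering in bits 31 down to 28.

    #include <stdint.h>
    #include <stdio.h>

    #define FLAG_N (1u << 31)   /* negative flag                        */
    #define FLAG_Z (1u << 30)   /* zero flag                            */
    #define FLAG_V (1u << 28)   /* overflow flag                        */

    /* Extract the 4-bit condition field (bits 31..28) of an instruction
       word and test it against the current flags.                      */
    static int condition_passes(uint32_t instruction, uint32_t status)
    {
        unsigned cond = instruction >> 28;
        int n = !!(status & FLAG_N), z = !!(status & FLAG_Z), v = !!(status & FLAG_V);

        switch (cond) {
        case 0x0: return z;        /* EQ: equal            (Z = 1)      */
        case 0x1: return !z;       /* NE: not equal        (Z = 0)      */
        case 0xA: return n == v;   /* GE: greater or equal (N = V)      */
        case 0xB: return n != v;   /* LT: less than        (N differs)  */
        case 0xE: return 1;        /* AL: always execute                */
        default:  return 1;        /* remaining codes omitted here      */
        }
    }

    int main(void)
    {
        uint32_t sample = 0xB3A02001;   /* a word whose condition field is LT (1011) */
        printf("executes: %d\n", condition_passes(sample, FLAG_N));  /* N=1, V=0: LT passes */
        return 0;
    }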
Looking again at the ARM instruction set, it can be seen that the destination register
(for instructions that have a destination) is located in the same place in each instruction
word. On decode, the IFU simply takes these 4 bits (used to address the 16 registers)
and applies them as a register bank destination address.

Fetch Operand


Evidently, the value of the operand is not always encoded in the instruction word
itself. The ARM and many other RISC processors are simplified by being load-store
architectures where operands in memory cannot be used directly in an operation —
they have to be transferred into a register first. The exception is with immediate values
which are encoded as part of several data processing instructions, such as MOV (see the
example in Box 3.4).
So the ARM normally prepares operands for an operation either by decoding an
immediate value from the instruction word or by selecting one or more source, and one
destination register. The exception is the load (LDR) and store (STR) instructions that
explicitly move 32-bit values between memory and a register.
In many other processors, normally CISC rather than RISC, it is possible to execute
an instruction that performs some operation on the contents of a memory address and
stores the result back into another memory address. Evidently in such a processor, the
action of moving operands around will require one or two memory accesses. Since
RISC processors aim to complete each instruction within a single clock cycle if possible, this has been disallowed.
i= Bere, Branching
2
_—
2)
The branch instruction group in the ARM instruction set is, as expected, all conditional —
2
=
a as indeed are branch instructions in nearly all other processors. In a branch instruction,

= bits 24 to 27 are the unique identifiers that indicate an instruction in the branch group.
87
CPU Basics

Box 3.3   Condition codes in the ARM processor

The ARM, as we have seen in Figure 3.9, reserves 4 bits (bits 31, 30, 29 and 28) for condition codes in every instruction. This means that every machine code instruction can be conditional (although when written in assembly language there may be some instructions which do not take conditionals).
Normally, the condition code is appended to the instruction. Thus, an ADDGT is an
ADD instruction that only executes when the condition flags in the processor indicate
that the result of the last instruction which set the condition flags is greater than 0.
The full set of ARM conditionals is shown in the table below (although strictly
the last two are unconditional conditionals!).

Condition nibble Condition code Meaning Conditional on

0000 E® equal The

0001 NE not equal Zi—\)

0010 CS carry set C=1

0011 CG carry clear @—()

sapere jak MI minus Ne 1

0101 PL plus N= 0

0110 VS overflow set Veaet

‘eh my : ME overflow clear Vie=0)

1000 : HI higher C=1, 20

1001 | | 1s lower or same C20; 2=1

1010 i GE greater or equal IN = V.

1011 ae UT at less than INS, Vi

1100 I E nee greater than NSW Z=0


D
Le
1101 F LE less than or equal (Nee2V ron Zia 1 oO
=
ce}
as
1110 ; AL Meee always ers tie set (1 i=
2
U
as

2
=4)
=
88
Chapter 3

The L bit distinguishes between a jump or a call (branch-and-link in ARM terminology,


where link means that the address to return to is placed in the link register LR, which
is R14, when the branch occurs). Apart from the 4 bits needed to define the instruction
type, 4 bits are needed for condition codes. So there are only 24 bits remaining. These 24
bits are called the offset. They indicate where to branch to — which instruction address
should be placed in the program counter.
Since the ARM is a 32-bit processor, instruction words are 32 bits wide. However,
memory is only byte wide, such that one instruction spans four consecutive memory
locations. The ARM designers have specified that instructions cannot start just any-
where, they can only start on 4-byte boundaries: addresses 0, 4, 8, 12 and so on. So the
offset refers to blocks of 4 bytes.
Now there are two general methods of indicating addresses to branch to in com-
puter architecture; these are absolute and relative. Absolute specifies a complete memory
address, whereas relative specifies a certain number of locations forwards or backwards
from the current position. As computer memory spaces have become larger, specifying
absolute addresses has become inefficient — the principle of locality (Section 4.4.4) indi-
cates that branch distances will usually be quite small, requiring fewer bits to specify
than an entire absolute jump address (which would in fact take 28 bits in the ARM).
Back to the ARM, the jump address is termed an offset, which means it must be a
jump relative to the current program counter location. With a 24-bit offset, a branch can
indicate a jump range of 2^24 words in memory, which is 64 MiB. Of course, the offset
has to be signed to allow a jump backwards (as in a loop) as well as forwards, and so
this means a +/- 32 MiB jump span.
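The offset arithmetic itself is simple enough to express as a short C sketch (illustrative only: the function name is made up, and the small pipeline-related adjustment that a real ARM applies to the program counter value is ignored here):

#include <stdint.h>

/* Sketch: form a branch target from the 24-bit signed word offset held in
   the bottom of the instruction. The offset counts 4-byte instruction
   words, so it is sign-extended and shifted left by two before being
   added to the program counter value.                                    */
static uint32_t branch_target(uint32_t pc, uint32_t instr)
{
    uint32_t offset = instr & 0x00FFFFFFu;     /* lower 24 bits            */
    if (offset & 0x00800000u)                  /* sign bit of the field    */
        offset |= 0xFF000000u;                 /* sign-extend to 32 bits   */
    return pc + (offset << 2);                 /* words -> bytes, +/-32 MiB */
}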
Is the limited branch range a limitation? Not normally. Despite rampant code bloat,
even at the time of writing, single programs are not usually 64 MiB long. It is thus likely
that the ARM's designers have catered for the vast majority of jump requirements with
the instruction.
However, a 32-bit memory bus allows up to 4 GiB of memory space to be addressed,
which is far larger than the capability of address jumps. So, if a 70 MiB jump is required,
how could it be accomplished?
In this case, the ARM has a branch and exchange instruction. To use this, the des-
tination address is first loaded into a register (which is 32 bits in size), and then this
instruction can be issued to jump to the address held in that register. Of course, the
question arises as to how the register itself can be loaded with a 32-bit number. Sec-
tion 3.3.4 will discuss addressing modes, one of which is the immediate constant — a
number encoded as part of the instruction. Box 3.4 will also consider how immediate
values can be loaded with the MOV instruction.

3.3.2.4 Immediate Constant
The issue is that, with a 32-bit instruction word, it is not possible to convey a 32-bit
constant as well as bits specifying condition, destination register, S bit and so on. An
immediate constant (a value encoded within the instruction word) has to be less than
32 bits.

Box 3.4    Understanding the MOV instruction in the ARM

The MOV is 32-bits long like all ARM instructions. Its structure is shown below.

| 4-bit cond | 0 | 0 | 1 | opcode | S | Rn | Rd | 4-bit rotation | 8-bit value |
or
| 4-bit cond | 0 | 0 | 0 | opcode | S | Rn | Rd | immediate/register shift & Rm |

The 4-bit condition code is common with all other ARM instructions, the opcode defines
the exact instruction in the data processing class, Rn is the first operand register, Rd is
the destination register and, in the second form (selected when bit 25 = 0), Rm supplies the second operand.
We will concentrate on the top form of the command, where an 8-bit immediate
constant and 4-bit rotation are supplied (the actual rotation to be applied is twice the
value supplied here). Where the opcode specifies a MOV instruction, the immediate,
rotated by the degree specified is loaded into the destination register. Here are some
examples:

MOV R5, #0xFF      ; Rd = 5, Rn = 0, rotation = 0, value = 0xFF

MOV R2, #0x2180    ; Rd = 2, Rn = 0, value = 0x86, rotated to give 0x86 << 6 = 0x2180

Note: For these MOV instructions, Rn is always set to 0 since it is unused.


Question: How can the processor set a register to 0xF0FFFFFF?

Answer: The programmer would probably write:

MOV R0, #0xF0FFFFFF

However, the assembler would be likely to complain ('number too big for immediate
constant' or similar) since the 32-bit value that is specified cannot fit into an 8-bit
immediate field no matter what degree of shift is applied. Some assemblers and more
experienced programmers would know that they can simply convert the instruction
to a 'move NOT' instead:

MVN R0, #0x0F000000   ; Rd = 0, Rn = 0, value = 0x0F, rotated to give 0x0F << 24 = 0x0F000000

As you can see, despite the relatively small immediate value size that can be accommodated
within the instruction field, the combination of instruction flexibility and shift value
can actually encode quite a wide variety of constants.

In the case of the ARM, immediate constants are loaded into a register with the
MOV instruction (in the data processing instruction group). An immediate value can
be located inside the section labelled 'Operand 2' in the ARM instruction set (Figure 3.9).
However, not all of the operand is used for holding the constant. In fact, only
an 8-bit immediate value is catered for, with the remaining 4 bits used to specify a
rotation.

So, although the processor has 32-bit registers, only an 8-bit number can be loaded.
However, due to the rotation mechanism (the 4-bit rotation field selects one of 16 even
rotation amounts), a large variety of numbers can result. Box 3.4 looks in
detail at the bitfields present in the ARM processor MOV instruction, to see how these
impact the flexibility of one variant of the instruction.
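Whether or not a given 32-bit constant can be represented this way is easy to test in software. The sketch below assumes the rotate-right form used by the actual instruction encoding (a rotation field of n applies a rotation of 2n bit positions); the function name is illustrative:

#include <stdint.h>
#include <stdbool.h>

/* Sketch: test whether a 32-bit constant can be built from an 8-bit value
   and a 4-bit rotation field, where the rotation applied is twice the
   field value (so only even rotate amounts are available).              */
static bool encodable_as_immediate(uint32_t value)
{
    for (uint32_t rot = 0; rot < 16; rot++) {
        uint32_t r = 2 * rot;
        /* rotate the constant left by r: if what remains fits in 8 bits,
           then (8-bit value, rot) is a valid encoding                   */
        uint32_t rotated = (r == 0) ? value
                                    : ((value << r) | (value >> (32 - r)));
        if (rotated <= 0xFF)
            return true;
    }
    return false;
}

/* encodable_as_immediate(0xFF)       -> true  (value 0xFF, no rotation)
   encodable_as_immediate(0x0F000000) -> true  (as used by the MVN in Box 3.4)
   encodable_as_immediate(0xF0FFFFFF) -> false (hence the MVN trick)          */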
Many processors work differently. They generally allow at least a 16-bit constant to
be loaded immediately and the 16 bits are encoded as part of the instruction word. CISC
processors often have variable length instructions or use two consecutive instructions.
A variable length instruction may be 16-bits long when only an 8-bit constant is to
be loaded, or 32-bits long when a 16-bit or 24-bit constant is loaded. Variable length
instructions require the instruction fetch unit to be fairly complex, and thus a simpler
method of achieving a similar result is to use two consecutive instructions. The
first instruction may mean ‘load the next instruction value to register R2’ so that the
IFU simply reads the next value directly into the register rather than trying to decode it.
This evidently means that some instructions require two instruction cycles to execute,
and imposes a timing penalty, especially in pipelined processors (Section 5.2).
For the example of the ARM processor, although the restriction in immediate values
exists, in practice many constants can be encoded with an 8-bit value and a shift so that
this does not translate into a significant performance bottleneck. The ADSP2181 handles
immediate loads in a similar fashion and has been designed for high-speed single-cycle
operation.

3.3.3 Compressed Instruction Sets


Especially in processors with variable length instructions, Huffman encoding is used to
improve processor efficiency. In fact, as we shall see later, similar ideas can be used even
within a fixed length processor, but in this case not for efficiency reasons.
Huffman encoding is based on the principle of reducing the size of the most com-
mon instructions and increasing the size of the least common instructions to result in an
average size reduction. Obviously, this requires knowledge of the probability of instruc-
tions occurring and then allowing the size of the encoded word used to represent those
instructions to be inversely proportional to their probability. An example of Huffman
coding applied to instruction set design is provided in Box 3.5.
It should be noted that in the real world, one particular application may exhibit
very different instruction probability statistics compared to the average.
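The merging procedure itself is mechanical enough to automate. The following C sketch computes the code lengths (not the actual bit patterns) for the instruction mix that is worked through by hand in Box 3.5, by repeatedly combining the two least probable groups; the names and layout here are illustrative only:

#include <stdio.h>

/* Sketch: compute Huffman code lengths by repeatedly merging the two
   least probable groups; every symbol in a merged group gains one bit.
   The probabilities are those used in the Box 3.5 example.              */
#define N 5

int main(void)
{
    const char *name[N] = { "MOV", "ADD", "SUB", "CALL", "AND" };
    double      prob[N] = { 0.5,   0.3,   0.08,  0.06,   0.06  };
    double      group[N];           /* current probability of each group   */
    int         bits[N]  = { 0 };   /* code length assigned to each symbol */
    int         member[N];          /* which group each symbol belongs to  */

    for (int i = 0; i < N; i++) { group[i] = prob[i]; member[i] = i; }

    for (int merges = 0; merges < N - 1; merges++) {
        int lo1 = -1, lo2 = -1;                 /* two least probable groups */
        for (int g = 0; g < N; g++) {
            if (group[g] < 0) continue;         /* group already merged away */
            if (lo1 < 0 || group[g] < group[lo1]) { lo2 = lo1; lo1 = g; }
            else if (lo2 < 0 || group[g] < group[lo2]) { lo2 = g; }
        }
        group[lo2] += group[lo1];               /* merge lo1 into lo2        */
        group[lo1] = -1.0;
        for (int s = 0; s < N; s++) {
            if (member[s] == lo1 || member[s] == lo2) bits[s]++;
            if (member[s] == lo1) member[s] = lo2;
        }
    }

    for (int s = 0; s < N; s++)
        printf("%-4s probability %.2f -> %d bit(s)\n", name[s], prob[s], bits[s]);
    return 0;
}

Running this reproduces the lengths derived in the box: 1 bit for MOV, 2 for ADD, 3 for SUB and 4 each for CALL and AND.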
Many ARM processors contain an alternative 16-bit instruction set called the
Thumb. This was designed to improve code density. Note however that even though a
given memory size can support twice as many Thumb instructions compared to 32-bit
ARM instructions, on average more Thumb instructions are required to perform the
same function as the underlying ARM instructions which they map to once decoded
(this is mainly because there are fewer different Thumb instructions to choose from).
The process by which ARM engineers designed the Thumb instruction set is note-
worthy since they used a similar idea to Huffman coding. ARM engineers examined
a database of example application code and calculated the number of uses of each

Box 3.5    A Huffman coding illustration

An example processor has five instructions for which an analysis of the 1000 instruction
software program that it runs reveals the following occurrences:

CALL 60, ADD 300, SUB 80, AND 60, MOV 500

If an equal number of bits were used to represent each instruction in this instruction
set, 3 bits would be needed (since that would allow up to eight possibilities). Ignoring
any operands, 1000 x 3 bits = 3000 bits are required to represent that program.
The processor designers wish to use Huffman coding to reduce the program size.
First, they calculate the probability of each instruction (by dividing each occurrence
by the total number of instructions):

CALL 0.06, ADD 0.3, SUB 0.08, AND 0.06, MOV 0.5

Next, these are ordered in a list in terms of probability. The lowest two probabilities
are combined and the list re-ordered:

MOV 0.5       MOV 0.5
ADD 0.3       ADD 0.3
SUB 0.08      C/A 0.12
CALL 0.06     SUB 0.08
AND 0.06

This process is then repeated until finally there are only two choices left:

MOV 0.5       MOV 0.5       MOV 0.5       MOV 0.5
ADD 0.3       ADD 0.3       ADD 0.3       C/A/S/A 0.5
SUB 0.08      C/A 0.12      C/A/S 0.2
CALL 0.06     SUB 0.08
AND 0.06

Next, traverse the tree from right to left. The bottom two entries in each column are
numbered: the upper value is designated binary '1' and the lower is binary '0', and
these numbers must be written down when tracing through. Any other column entry
can simply be followed left without writing anything more until the original instruction
on the left-hand side is reached.
For example, in the right-hand column, a '1' indicates MOV, a '0' indicates any
one of CALL/AND/SUB/ADD. Moving left, a '01' now indicates an ADD whereas a
'00' is the prefix for any of CALL/AND/SUB. In the next column, '001' indicates either
CALL or AND and '000' indicates SUB. Writing all of these out gives the following:

MOV is '1', ADD is '01', SUB is '000', CALL is '0011', and AND is '0010'. If we look
at the number of bits used to represent each instruction, we can see that the most
common instruction (MOV) is represented by a single bit whereas the least common
(AND) needs 4 bits, so the encoding method seems to have worked in representing the
most common instructions with fewer bits. Using the original number of occurrences
of each instruction and the number of Huffman bits, we can calculate the new program
size:

(500 x 1) + (300 x 2) + (80 x 3) + (60 x 4) + (60 x 4) = 1820

This is significantly fewer than the 3000 bits we calculated for a fixed 3-bit representation.

instruction. Only the most common instructions were made available in Thumb mode.
The binary encoding within the fixed 16-bit word used to represent an instruction is
length coded based on the number of bits required for the other operands.
Some features of the Thumb instruction set are as follows:

• There is only one conditional instruction (an offset branch).
• There is no 'S' flag. Most Thumb instructions will update condition flags
  automatically.
• The destination register is usually the same as one of the source registers (in ARM
  mode the destination and source are almost always specified separately).
• All instructions are 16 bits (but register and internal bus width is still 32 bits).
• The addressing mode for immediate and offset addresses is significantly
  limited.
• Most instructions can only access the lower 8 registers (of 16).

The Thumb instruction set is significantly more complicated than the ARM instruction
set, although the decoding process (from Thumb instruction fetched from memory to
ARM instruction ready to be executed inside the processor) is automatic and very fast.
The following are some example instructions:

16-bit binary instruction bit pattern                       Instruction name          Example

1101 | Condition (4 bits) | Offset (8 bits)                 Conditional branch        BLT loop
11100 | Offset (11 bits)                                    Branch                    B main
01001 | Destination register (3 bits) | Offset (8 bits)     Load memory to register   LDR R3, [PC, #10]
101100001 | Immediate (7 bits)                              Add to stack              ADD SP, SP, #23

From the limited examples shown here, it can be seen that the few most significant
bits identify the instruction. These actually range from 3 bits to 9 bits in length across
the entire instruction set. In the case of the ADD instruction shown, the register it
operates on is fixed: it is an add to stack only — the flexibility and regularity of the ARM
instruction set, where almost all instructions operate on any registers, is lost — but the
most common operations found in software are catered for.
It should be noted at this point that the Thumb instruction set, being 16 bits wide,
really operates at its best when the interface to external memory is 16 bits, in which
case each ARM instruction would require two memory cycles to be retrieved (and thus
the processor would run half as fast as it should), whereas the Thumb code could be
executed at full speed.

3.3.4 Addressing Modes


Addressing modes describe the various methods of identifying an operand within an
instruction. Instructions specify many operations, which may have no operands, one,
two or three operands. There may, very exceptionally, be instructions with greater than
three operands. In most modern processors, common examples of non-zero operands
are as follows:

Type             Example                     Operand

Single operand   B address                   Address, may be given directly, may be an offset
                                             from current position or may be an address in a
                                             register or memory location.

Two operands     NOT destination, source     Destination or source may be registers, memory
                                             addresses or memory locations specified by
                                             registers. The source may also be a numeric value.

Three operands   ADD destination, source,    Destination or source may be registers, memory
                 source                      addresses or memory locations specified by
                                             registers. The source may also be a numeric value.

Of course, not all possible operand types are suitable for all instructions, and even
so may not be available on some processors (for example RISC processors, being load-
store, typically limit the operands of arithmetic instructions to registers, whereas in
CISC processors they may be located in memory or elsewhere). A final point to note
is the assumption in the two bottom examples above that the first operand written is
the destination — which is true for ARM assembly language, but is reversed for some
other processors (see Section 3.2.7). This can be a real cause for confusion when writing
assembler code for different processors (and is an occupational hazard for computer
architecture lecturers/authors).
The term addressing mode refers to the method of specifying a load or store ad-
dress, using one of several different techniques. The following table lists the common
addressing modes, with ARM-style assembly language examples (although it should
be noted that PUSH does not exist in the ARM instruction set, only in the Thumb).

Name                           Example            Explanation

Immediate addressing           MOV R0, #0x1000    Move hexadecimal value 0x1000 to register R0

Absolute addressing            LDR R0, #0x20      Load whatever is in memory at address 0x20 into R0

Register direct addressing     NOT R0, R1         Take content of R1, NOT it and store inside R0

Register indirect addressing   LDR R0, [R1]       If R1 contains value 0x123, then retrieve contents of
                                                  memory location 0x123, and place it in R0

Stack addressing               PUSH R0            In this case, the contents of R0 are pushed onto the stack
                                                  (and the assumption is of only one stack)

The following extensions and combinations of the basic idea are also common:

Name                                     Example                     Explanation (if R1 = 1 and R2 = 2)

Register indirect with                   LDR R0, [R1, #5]            The second operand is the memory
immediate offset                                                     address 1 + 5 = 6

Register indirect with                   STR R0, [R1, R2]            The second operand is the memory
register indirect index                                              address 1 + 2 = 3

Register indirect with                   LDR R0, [R1, R2, #3]        The second operand is the memory
register indirect index and                                          address 1 + 2 + 3 = 6
immediate offset

Register indirect with                   STR R0, [R1, R2, LSL #2]    The second operand is the memory
immediate scaled register                                            address 1 + (2 << 2) = 9
indirect index

Various processors, including the ARM and the ADSP2181, also offer an automatic
way to update registers after they have been used to perform offset addressing. For ex-
ample, a register indirect access with immediate offset could leave the register used in
the access updated after addition of the offset. This is shown in the following examples
where R1 = 22:

LDR R0, [R1], #5    ; Load R0 with content of memory address 22 and then
                    ; set R1 = 22 + 5 = 27

LDR R0, [R1, #5]!   ; Set R1 = 22 + 5 = 27 and then load R0 with content
                    ; of memory address 27

Note that it is not our intention here to teach the details of the ARM instruction set,
but merely to use it as a teaching aid for the underlying addressing techniques.*
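For readers who think in C, the effect of these two auto-update modes is loosely similar to post- and pre-incrementing a pointer. The fragment below is only an analogy (a compiler may or may not actually emit these addressing modes, and C pointer arithmetic scales by the element size rather than using raw byte offsets as above):

#include <stdint.h>

/* Loose analogy only: 'use then update' versus 'update then use'.       */
void walk(uint32_t *p)
{
    uint32_t r0;

    r0 = *p;            /* use the current address first...               */
    p  = p + 5;         /* ...then step it on   -- like LDR R0, [R1], #5  */

    p  = p + 5;         /* step the address first...                      */
    r0 = *p;            /* ...then use it       -- like LDR R0, [R1, #5]! */

    (void)r0;           /* silence 'unused' warnings in this sketch       */
}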
It is instructive to analyse the limitations that caused CPU designers to provide
certain levels of functionality within a processor — and this is rarely more revealing
than in the consideration of the instruction set. In this regard, CISC processors are
more interesting. Some examples are given below from an imaginary CISC processor,
where main memory locations mA, mB and mC are used for absolute operand stor-
age, and a RISC processor, where registers RO, R1 and R2 are used for register direct
addressing:

• CISC processor: ADD mA, mB, mC    ; mA = mB + mC

In this case, once the CPU has read and decoded the instruction, it must read the
content of two further memory locations to retrieve the operand values mB and
mC, and this probably requires two memory bus cycles. These values must then
be transferred by internal bus to the ALU as they are retrieved (and since this is
sequential, only one bus is needed). Once the ALU has calculated the result, this is
transferred by bus to a memory interface for writing back to main memory location
mA.
The instruction overhead is three external memory cycles in addition to the
ALU operation time. External memory cycles are usually far slower than internal
ALU operations, and so this is clearly a bottleneck. There is only a need for one
internal bus in this processor.
The instruction word must hold three absolute addresses. With 32-bit
memory, this equates to 96 bits, making a very long instruction word. This could
be reduced through offset/relative addressing, but would probably still be too big
for a 32-bit instruction word.
• RISC processor: ADD R0, R1, R2    ; R0 = R1 + R2
The same operation is now performed with registers. All of the operand values
are already inside the CPU, which means they can be accessed quickly. Once the
instruction has been read and decoded, register R1 is allowed to drive one internal
operand bus and register R2 is allowed to drive the other internal operand bus
simultaneously. Both operands are thus conveyed to the ALU in a single very fast
internal bus cycle. Once the ALU has calculated the result, an internal results bus
will collect the result. RO will be listening to this bus and, at the appropriate time,
latch the result value from the bus.
The instruction overhead is two fast internal bus cycles in addition to the ALU
operation time. In our example description, the CPU must contain three internal
buses: two to simultaneously transfer both operands and one to collect the result.
Other alternative arrangements are equally possible.

3 Those who do wish to learn the ARM instruction set are recommended to refer to the book ARM
System Architecture, by Steve Furber (one of the original inventors of the ARM processor).

The instruction word needs to contain three register values. However, with a
bank of 32 registers, only 5 bits are needed to specify each register, and so 15 bits
are used in total. This would easily allow the operation to be encoded in a 32-bit
instruction.
• CISC processor: ADD mA, mB    ; mA = mA + mB
Similar to the first example, the CPU must read two external memory locations
to retrieve the operand values, requiring two memory bus cycles. It also needs to
transfer the result back to memory and thus execution time is unchanged.
However, the instruction word this time only needs to contain two absolute
addresses instead of three. This would be achievable in a real system, especially if
an absolute value is used for the first operand address and an offset used for the
second one.
• CISC processor: ADD mB    ; ACC = mB + ACC
The CISC processors of the 1980s and earlier commonly utilised accumulators.
These were general-purpose registers (the forerunners of the register bank) that
were used as an operand for all arithmetic and data mode operations and to hold the
result of those operations. The other operand was almost always an absolute value
from memory. In this case, the instruction requires a single value to be loaded from
memory prior to the addition and thus involves a single external memory bus cycle.
The instruction word needs to only contain a single absolute memory value,
which could be achieved by loading a second instruction word containing the ad-
dress (thus requiring two instruction fetches to be performed prior to instruction
execution).
• Stack processor: ADD
This is a special case (that will be explored further in the next section and specifi-
cally in Chapter 8) where a CPU pops the top two stack entries, adds them together
and pushes the result back onto the stack. This needs to access a stack which would
be quick if it were an internal memory storage block, however, a stack would more
normally be located in off-chip memory. The main benefit with the stack approach
is that the instruction does not need to encode any absolute memory addresses.
Theoretically, this can make for an extremely small instruction width.

3.3.5 Stack Machines and Reverse Polish Notation


People generally employ infix notation to represent an operation written on paper (such
as a + b ÷ c), where an agreed fixed precedence of operators (that can be overridden
using parentheses) determines the order in which the various operations occur. Polish
notation (note: not reverse Polish notation) was invented by Polish mathematician Jan

4 Many readers may remember being taught the BODMAS acronym as an aid to remembering
precedence during primary school mathematics. BODMAS stands for Brackets, Orders (e.g. powers
and square roots), Division, Multiplication, Addition and Subtraction: see
http://www.malton.n-yorks.sch.uk/MathsWeb/reference/bodmas.html

Lukasiewicz in the 1920s to place the operator before the operands, thus it is a prefix
notation. By specifying the operand in this way, operator precedence is unimportant
and parentheses are not required.
Reverse Polish notation (RPN) by contrast is a postfix notation where the order of
the equation completely defines the precedence. This was created during the 1950s and
1960s as an aid to working with a stack-based architecture. It was subsequently intro-
duced and loved (or hated) by two generations of Hewett-Packard electronic calculator
users.
An example of RPN is b c ÷ a +, where the operands b and c are given first fol-
lowed by the command to divide them and hold the result. Then operand a is loaded
followed by the command to add the previous result to a and store the new result
somewhere. Some further examples are shown below and in Figure 3.12.

Infix          Postfix
a x b          a b x
a + b - c      a b + c -
(a + b) + c    a b + c +

Considering the operations taking place, it becomes evident that using a stack is a
very efficient method of performing RPN operations. A stack in this case is a storage
device with a single entry /exit point. Numbers can be pushed onto the ‘top’ of the stack
and then popped back off the ‘top’ of it. It is a last-in first-out (LIFO) construct.
An example of a stack operation performing ab + is shown in Figure 3.12, reading
from left to right. Some things to note are that only a single push occurs in each step
(likely to each take a single cycle in a stack-based processor) although the number of
pops required to feed an operation is determined by the number of operands required.
For example, an ADD requires two operands, so two POPs are used to load those to
the ALU. The result of each operation is PUSHed back onto the top of the stack.

Figure 3.12
[Figure: four stack states shown left to right — an empty stack; 'a' pushed on the stack;
'b' pushed on the stack; the top two values popped off and added, with the result pushed
back on the stack.]
An illustration of the concept of stack processing. Two operands are pushed in
turn onto the stack and the ALU then executes, popping the operands, calculating the
sum and then pushing the result back onto the stack.

Box 3.6    Recoding RPN instructions to minimise stack space

Consider the infix expression a + (b x c) which can also be written as (b x c)
+ a since the order of addition is unimportant to the final result.
For each expression, write the equation in postfix notation and write out the se-
quence of stack operations that would be required to execute it. Consider the stack
usage for each expression.
It should be clear that writing the equation one way involves the use of a maximum
stack depth of three locations, whereas the alternative way results in a stack depth of
only two locations.
It appears that the order of the postfix expression can have a significant impact on
the stack resources (and hence hardware resources) needed, although it will not alter
the number of steps needed to find a solution.
Not all infix expressions are insensitive to order. Addition and multiplication are,
whereas division and subtraction are most definitely not.

It is also interesting to consider the use of such a stack machine performing com-
plex programming tasks. It seems efficient for simple operations, but sometimes it is
possible that the final state of the stack after a sequence of operations may not have the
correct results located on the top of the stack. This may be exacerbated by multi-tasking
or interrupt service routines. There must be a way of re-ordering the stack, such as
popping items out and storing into main memory, and then pushing them back in a
different order. This could be a very time-consuming process and impacts heavily on
the overall performance of a stack machine. This process is also explored in Box 3.6
where re-ordering is performed to minimise stack usage.
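The stack behaviour described above, including the stack-depth point raised in Box 3.6, can be demonstrated with a small C sketch that evaluates a postfix string and records the deepest the stack ever becomes. The operand values bound to a, b and c are arbitrary examples and the function names are illustrative:

#include <stdio.h>

/* Sketch: evaluate a postfix (RPN) string of single-letter operands and
   + - * / operators on a small stack, recording the maximum depth used.  */
#define STACK_SIZE 8

static double value_of(char name) {          /* a = 2, b = 6, c = 3        */
    return name == 'a' ? 2.0 : name == 'b' ? 6.0 : 3.0;
}

static double eval_postfix(const char *expr, int *max_depth)
{
    double stack[STACK_SIZE];
    int top = 0;
    *max_depth = 0;

    for (const char *p = expr; *p; p++) {
        if (*p == ' ') continue;
        if (*p >= 'a' && *p <= 'z') {
            stack[top++] = value_of(*p);      /* PUSH an operand            */
        } else {
            double rhs = stack[--top];        /* POP two operands,          */
            double lhs = stack[--top];        /* apply the operator,        */
            switch (*p) {
            case '+': stack[top++] = lhs + rhs; break;
            case '-': stack[top++] = lhs - rhs; break;
            case '*': stack[top++] = lhs * rhs; break;
            case '/': stack[top++] = lhs / rhs; break;
            }                                 /* and PUSH the result        */
        }
        if (top > *max_depth) *max_depth = top;
    }
    return stack[--top];
}

int main(void)
{
    int depth;
    printf("a b c * + = %g (max stack depth %d)\n",
           eval_postfix("a b c * +", &depth), depth);   /* depth 3 */
    printf("b c * a + = %g (max stack depth %d)\n",
           eval_postfix("b c * a +", &depth), depth);   /* depth 2 */
    return 0;
}

Both orderings produce the same answer, but the second needs a stack only two entries deep, which is the observation Box 3.6 asks the reader to make.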

Data Handling
This chapter, up to now, has concentrated on CPU basics — what a computer is and what
it fundamentally consists of. We have mentioned instructions, programs and so on. As
part of this, Section 3.3 considered instruction handling, including some variations on
a theme, as well as the important sub-topic of addressing modes.
Later, Section 3.5 will present a top-down view of computers. However, in between
these two extremes of high-level overview and low-level detail, there is a more philo-
sophical question regarding the purpose of computers. We can consider a ‘black box’
perspective as an example.’ Having a black box perspective, we view a computer as a
unit that modifies some input data to produce some output data.

5 For those who have not encountered this term, a 'black box' is the name given to something that,
when considered as a unit, is defined solely in terms of its inputs and outputs. It does not matter
what is inside the box as long as it produces the correct output given the correct input.

Both input and output data could take many forms: commands, knowledge, sensor
data, multimedia and so on. For some systems, input data could consist of a single
trigger event. Output data could likewise consist of an actuator switch signal. This is
the case in control systems, which often operate with a need for real-time processing
of data (real-time issues are considered in depth in Section 6.4). Some systems are data
rich — either input or output may consist of dense streams of data, such as digital audio
or video. These systems may also need to operate in real time. However, the majority of
computer systems are probably general-purpose machines capable of performing both
control and data processing tasks with little regard to real-time issues.
The common theme here is clearly data: computers process data, whether that is a
single bit trigger for which timing is critical, or a 1 Tbyte block of multimedia data that
completes processing in several minutes. This section is dedicated to this important
aspect of computers: what data is, how it is presented, stored and processed.

3.4.1 Data Formats and Representations


We have discussed number formats in general in Section 2.3, including those of most
relevance to computers (unsigned binary, two’s complement and so on). Whatever
format is in use, the width of the number — the number of bits occupied by one number —
can be adjusted by computer architects to either increase the largest magnitude number
that can be stored or to increase the precision. Typically, since computers are byte-based,
number sizes are in multiples of 8 bits.
Most CPUs have a natural size data format which is determined by the width
of the internal buses, for example byte-wide in the old 6502 processor and 32-bits
wide in the ARM. Although the ARM can also handle bytes and 16-bit half-words, it
accesses main memory in 32-bit units, and thus handles 32-bit values no slower than the
handling of bytes. Registers, memory locations, most operands and so on are 32 bits in
the ARM.
Programmers typically handle data in memory or in registers through a high-level
language such as C. Although some programming languages tightly define the num-
ber format used by data types within the language, that is not really the case for the
C programming language, apart from the definition of a byte, which is always 8-bits
in size.
Usually, although it is actually at the discretion of the particular C compiler in use,
the int data type generally matches the natural size of the processor for machines of
16-bit word size and above. Thus, an int in a 16-bit machine will normally be a 16-bit
number, whereas it will tend to be 64 bits in a 64-bit machine.
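A quick way to discover what a particular compiler and target actually provide is simply to print the sizes; the output differs from machine to machine, and the sketch assumes 8-bit bytes:

#include <stdio.h>

/* Quick check of how wide the basic types are on whatever machine this
   is compiled for -- the output will differ between compilers/targets.  */
int main(void)
{
    printf("char      : %u bits\n", (unsigned)(8 * sizeof(char)));
    printf("short     : %u bits\n", (unsigned)(8 * sizeof(short)));
    printf("int       : %u bits\n", (unsigned)(8 * sizeof(int)));
    printf("long      : %u bits\n", (unsigned)(8 * sizeof(long)));
    printf("long long : %u bits\n", (unsigned)(8 * sizeof(long long)));
    return 0;
}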
Programmers beware: if you wish to write portable code, ensure that there are no
assumptions made about the exact size of an int, short and so on. Table 3.1 illustrates
the width of several data types for the common gcc compiler targeting different
processors.6 Concerns over the changing nature of some of the original C language data

6 Note that some compiler implementations will differ, or may not comply to ISO or ANSI C language
specifications.

Table 3.1

Comparison of C programming language data type sizes for CPUs ranging from 8 bits to 64 bits.
Note how some of the data types change size between processors, while others remain the same.
For a particular implementation, these sizes are usually defined by maximum and minimum
representable number specifications in the configuration header file types.h. Remember also that
the byte order may change between big and little endian processors (see Section 2.2).

C name          8-bit CPU   16-bit CPU   32-bit CPU   64-bit CPU

char            8           8            8            8
byte            8           8            8            8
short           16          16           16           16
int             16          16           32           64
long int        32          32           32           64
long long int   64          64           64           64
float           32          32           32           32
double          64          64           64           64
long double     compiler specific — may be 128, 96, 80 or 64 bits

types has led to many developers adopting specific-sized data types, described further
in Box 3.7.
Of course, experienced programmers will know that any integer data type in the
C programming language (i.e. the top six rows in the table) can be specified as either
signed or unsigned. The default (if neither is specified) data types are signed two’s
complement.
The long int and long long int canalso be specified as just long and long
long respectively. On all but the largest machines these will require multiple memory
locations for storage.
The char type normally contains a 7-bit useful value, complying with the ASCII
standard (American Standard Code for Information Interchange), shown in
Table 3.2. Any top-bit-set character (i.e. a char where bit 8 is non-zero) would
be interpreted as an extended ASCII character (ASCII characters that are not
shown in the figure). Interestingly, characters lower than decimal 32 (space) and
including decimal 127 (delete), are non-printable characters having special values
related to their original definitions for teletype terminals. For example, ASCII
character 7, \a, is the bell character, which would cause a 'beep' sound when
printed. A brief web search can easily reveal the meanings of other special ASCII
characters.
ASCII was excellent when computers were effectively confined to English (or Amer-
ican) speakers, but not particularly useful for other languages. Hence, significant effort

Box 3.7    Data types in embedded systems

Although general programs written in languages such as C and C++ will make use
of the standard data types shown in Table 3.1, this can cause confusion when porting
code. If a programmer makes an implicit assumption regarding the size of a particular
data type, this assumption may no longer be correct when the code is compiled on a
different processor.
The situation was actually far worse in the days before the widespread adop-
tion of the gcc compiler — many compilers had different compilation modes such as
‘large memory model’ and ‘small memory model’ which could result in the num-
ber of bits used to represent variables changing (even gcc has command switches
which can change this, but are not often used). Cross compiling for embedded sys-
tems, where the target machine may differ from the host compilation machine, makes
it doubly important to ensure that any code tested on the host performs similarly on
the target.
Perhaps the simplest way to achieve this, and to remain mindful of the limitations
of different data types, is to directly specify the size of each type when declaring
variables. In the C99 programming language (the version of C formalised in 1999) the
definitions have been made for us in the <stdint.h> header file:

Size   Unsigned   Signed

8      uint8_t    int8_t
16     uint16_t   int16_t
32     uint32_t   int32_t
64     uint64_t   int64_t

The 64-bit definitions (and other odd sizes such as 24 bits) may exist for a par-
ticular processor implementation but not for others. Of course, if it exists, it will
occupy the sizes given, but otherwise these are optional, so for some machines the
compiler will not support anything but the main 8-, 16- and 32-bit definitions. Writ-
ers of code for embedded systems will likely encounter these safer type declara-
tions more often than those writing desktop machine software. The author would
encourage embedded systems developers to use the specific-sized types wherever
possible.
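In practice this simply means declaring variables with the fixed-width names rather than the traditional ones, for example (the variable names here are arbitrary):

#include <stdint.h>

/* Using the <stdint.h> fixed-width types so the widths do not change
   when the code is recompiled for a different target.                  */
uint8_t  status_flags;     /* exactly 8 bits, unsigned  */
int16_t  temperature;      /* exactly 16 bits, signed   */
uint32_t sample_count;     /* exactly 32 bits, unsigned */

/* Contrast with 'unsigned long count;' whose width depends on the CPU
   and compiler, as Table 3.1 shows.                                    */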

has been paid over many years to define different character encodings for other lan-
guages. Perhaps the ultimate challenge has been Chinese which has around 13,000
pictograms (individual 'letters'): clearly an 8-bit data type is not able to encode written
Chinese. Many solutions have appeared over the past two decades, most of which use

Table 3.2

The American Standard Code for Information Interchange, 7-bit ASCII table, showing the charac-
ter (or name/identifier for non-printable characters) and the representative code in decimal and
hexadecimal.

Char   Dec  Hex  | Char   Dec  Hex  | Char  Dec  Hex  | Char   Dec  Hex
\0     0    0x00 | (spc)  32   0x20 | @     64   0x40 | `      96   0x60
(soh)  1    0x01 | !      33   0x21 | A     65   0x41 | a      97   0x61
(stx)  2    0x02 | "      34   0x22 | B     66   0x42 | b      98   0x62
(etx)  3    0x03 | #      35   0x23 | C     67   0x43 | c      99   0x63
(eot)  4    0x04 | $      36   0x24 | D     68   0x44 | d      100  0x64
(enq)  5    0x05 | %      37   0x25 | E     69   0x45 | e      101  0x65
(ack)  6    0x06 | &      38   0x26 | F     70   0x46 | f      102  0x66
\a     7    0x07 | '      39   0x27 | G     71   0x47 | g      103  0x67
\b     8    0x08 | (      40   0x28 | H     72   0x48 | h      104  0x68
\t     9    0x09 | )      41   0x29 | I     73   0x49 | i      105  0x69
\n     10   0x0a | *      42   0x2a | J     74   0x4a | j      106  0x6a
(vt)   11   0x0b | +      43   0x2b | K     75   0x4b | k      107  0x6b
\f     12   0x0c | ,      44   0x2c | L     76   0x4c | l      108  0x6c
\r     13   0x0d | -      45   0x2d | M     77   0x4d | m      109  0x6d
(so)   14   0x0e | .      46   0x2e | N     78   0x4e | n      110  0x6e
(si)   15   0x0f | /      47   0x2f | O     79   0x4f | o      111  0x6f
(dle)  16   0x10 | 0      48   0x30 | P     80   0x50 | p      112  0x70
(dc1)  17   0x11 | 1      49   0x31 | Q     81   0x51 | q      113  0x71
(dc2)  18   0x12 | 2      50   0x32 | R     82   0x52 | r      114  0x72
(dc3)  19   0x13 | 3      51   0x33 | S     83   0x53 | s      115  0x73
(dc4)  20   0x14 | 4      52   0x34 | T     84   0x54 | t      116  0x74
(nak)  21   0x15 | 5      53   0x35 | U     85   0x55 | u      117  0x75
(syn)  22   0x16 | 6      54   0x36 | V     86   0x56 | v      118  0x76
(etb)  23   0x17 | 7      55   0x37 | W     87   0x57 | w      119  0x77
(can)  24   0x18 | 8      56   0x38 | X     88   0x58 | x      120  0x78

Table 3.2

(Continued)

Char   Dec  Hex  | Char   Dec  Hex  | Char  Dec  Hex  | Char   Dec  Hex
(em)   25   0x19 | 9      57   0x39 | Y     89   0x59 | y      121  0x79
(sub)  26   0x1a | :      58   0x3a | Z     90   0x5a | z      122  0x7a
(esc)  27   0x1b | ;      59   0x3b | [     91   0x5b | {      123  0x7b
(fs)   28   0x1c | <      60   0x3c | \     92   0x5c | |      124  0x7c
(gs)   29   0x1d | =      61   0x3d | ]     93   0x5d | }      125  0x7d
(rs)   30   0x1e | >      62   0x3e | ^     94   0x5e | ~      126  0x7e
(us)   31   0x1f | ?      63   0x3f | _     95   0x5f | (del)  127  0x7f

two or more sequential bytes to hold a single character. The current de-facto standard
encoding is called unicode, which has various ‘flavours’ but which can use up to four
sequential bytes to encode the vast majority of characters, including Chinese, Japanese,
Korean and so on.
Although the detail of this encoding system is beyond the scope of this book,
the implications are not: early computers were byte-sized and were naturally able to
handle byte-sized ASCII characters. These days, it requires a 32-bit machine to handle
a 4-byte unicode character in a single operation. Similarly, early interfacing methods
such as the PC parallel and serial ports (see Chapter 6) were byte-based. Memory
accesses have often been byte-based. The argument has been that a byte is a convenient
size for simple counting and for text processing. However, this argument no longer
applies in many cases. Where non-English alphabet systems are concerned, a byte-
sized processing system is nothing more than a historical curiosity.
One final point to note concerning data sizes is the uniformity of the float and
double types. This uniformity is related to the ubiquity of the IEEE754 standard, and
the fact that the majority of hardware floating point units comply with the standard
(this will be explained a little more in Section 4.6).

3.4.2 Data Flows


Again adopting a black-box view, a computer takes input, processes it and generates
output. Evidently the requirements of that data are important, in terms of timeliness,
quantity, quality and so on.
Today's computers, and especially many consumer electronic embedded systems,
are heavily user-centric. This means that input, output or both need to interact with a
human being. Some data also tends to be quite voluminous (video, audio and so on).
Buses, which we will consider more fully in Section 6.1, need to be sized to cope with

the required data flows, and systems should also consider human needs. For example,
the human sensory organs are often far more sensitive to sudden discontinuities than
they are to continuous errors (noise). It is usually more annoying for listeners to hear
music from a CD player which skips than it is to listen to music in the presence of
background noise. Similarly with video: skipped frames can be more annoying than a
slightly noisy picture.
Most of the important real-time issues will be explored in Section 6.4. However, at
this point, we need to stress that computer architects should bear in mind the use to
which their systems will be put. Embedded computer architects may have an advantage
in that their systems are less flexible and more generic, and thus better able to satisfy
users. Unfortunately, they also suffer the considerable disadvantage that size, cost and
power limitations are more severe, and thus require finer balancing of trade-offs in
design.
Technically speaking, data flows through computers on pathways called buses.
This data may originate from external devices or some form of data store, be processed
in some way by a CPU or co-processor, and then output similarly either to another
external device or data store.

3.4.3 Data Storage


The memory hierarchy of Figure 3.1 highlights the difference in memory provision
between embedded systems and typical desktop or server systems: with the exception
of some iPod-like devices, data storage in embedded systems is usually flash-memory-
based. In desktop systems it is often stored on hard disc (for short-term storage), or
tape/CDROM or DVD (for backup storage).
Data ‘inside’ a computer is located within RAM, cache memory, registers and so
on. From a programmer’s perspective it is either in registers or in main memory (since
cache memory is usually deliberately invisible to a programmer). Data enters memory
from external devices or hard discs over buses (Section 6.1) either individually a byte
or word at a time, or in bursts, perhaps using a scheme such as direct memory access
(DMA - see Section 6.1.2). Large amounts of data occupy pages of memory, handled
by a memory management unit (MMU - covered in Section 4.3), and small amounts
may exist in fixed variable locations or in a system stack. Since embedded systems
often use parallel bus-connected flash memory devices, data in such systems is already
directly accessible by the main processor and thus is considered already ‘inside’ the
computer.
Data is brought into a CPU from memory for processing, and again may
be conveyed as individual items or as a block. For load-store machines (Sec-
tion 3.2.3), data to be processed must first be loaded into individual registers
since all processing operations take input only from registers and output only to
registers. Some specialised machines (such as vector processors) can handle
blocks of data directly and some machines have dedicated co-processing units that
can access memory directly, without requiring the CPU to handle loading and
storing.

3.4.4 Internal Data


When compiling C code, the compiler decides how to handle program variables. Some
variables, usually the most often accessed ones, will occupy registers during the time
that they are being accessed. However, most processors have insufficient registers for
more than a handful of variables to be catered for in this way.
Global variables have a dedicated memory address during the execution of a pro-
gram, but other variables are stored in a memory stack. That means that when a program
contains a statement such as ‘i++’ and i is a local variable which the compiler decides
cannot remain in a register, the compiler dedicates a particular location in the stack
to the variable. The pseudo-machine code instructions to execute this statement on a
load-store machine would thus be as follows:

1. Load the data item at the particular stack offset corresponding to variable i into a
register.
2. Increment the value stored in that register.
3. Save that register content to the stack offset that it was retrieved from.

If there was a subsequent decision to be made on variable i (such as if i > 100
then ...) the compiler knows that i is already occupying a register, so it will re-use
that register in the subsequent comparison and decision. Some variables, as we have
mentioned, can remain in registers throughout a calculation. It all depends upon how
many registers are available, how many variables are in use and how frequently these
are accessed.
Actually the programmer has little control over which variables are to be stored in
registers and which are to be kept in a stack, although the C programming language
keyword register asks the compiler to keep a variable in a register if possible. For
example, if we wanted to maintain i in a register (if possible), we would have declared
i as:
register int i = 0;

Spill code is the name given to the few machine code instructions that a compiler
adds to a program to load-store variables between memory and registers. Since memory
accesses are far slower than register accesses, spill code not only slightly increases the
size of a program, it also adversely affects execution speed. Minimising spill code has
long been a target of compiler researchers and computer architects worldwide.

3.4.5 Data Processing


Adding two 8-bit integers in an 8-bit processor is always going to be a simple proposi-
tion, and adding two 8-bit numbers in a 32-bit processor is also relatively simple’ since
both arithmetic operations can be performed with a single instruction.

7 Remember though that sign extension (Section 2.3.8) would need to be performed when placing
8-bit values into 32-bit registers; otherwise negative two's complement numbers may be incorrectly
interpreted in the ALU!

This single instruction is normally accomplished very easily in hardware: send


the two operands from registers to an ALU and then load the result back into another
register.
The situation becomes more interesting when processing larger numbers in a
smaller processor and when performing more complex processing. Let us consider
three possibilities in turn: operating on numbers that are larger than the width of the
processor, floating point in a fixed point CPU and complex numbers.

3.4.5.1 Big Numbers on Small CPUs


Since the C programming language can define 32-bit or even 64-bit data types, it fol-
lows that any C compiler for 8-bit, 16-bit or even 32-bit CPUs must be able to sup-
port arithmetic and logical operations on numbers larger than the natural size of the
processor.
First of all, note that many processors with a certain data bus width actually support
higher precision arithmetic. For example, most ARM processors are able to perform a
multiplication between two 32-bit numbers. We know that the maximum size of the
result of such an operation could be 64 bits. The original ARM multiplier would allow
only the lower 32-bit part of that result to be stored to the destination register. However,
a ‘long multiply’ instruction on newer ARM processors allows the full 64-bit result to
be stored to two 32-bit destination registers. Evidently, the operation to store the results
will take twice as long to complete (but this is a lot less time than trying to determine
the upper 32 bits from the operation using other methods).
Let us examine how we can perform a 64-bit multiply on an ARM processor that
does not have a ‘long’ multiply instruction (although please note that this may not be
the fastest way to do it):

Load operand 1 lower 16 bits to R1


Load operand 1 upper 16 bits to R2
Load operand 2 lower 16 bits to R3
Load operand 2 upper 16 bits to R4
R0 = R1 x R3
R0 = R0 + (R2 x R3) << 16
R0 = R0 + (R1 x R4) << 16
R0 = R0 + (R2 x R4) << 32

This is illustrated diagrammatically in Figure 3.13, where the loading is shown as a


set-up stage and the multiplication and adding are shown as an operation stage. Within
this stage, four multiplications, three shifts and three additions need to be performed
to calculate the result.
The clear message here is that the lack of a single 'long' multiply instruction will
entail several additional operations, and possibly registers, to replace it. Of course, there
are slightly faster or lower-overhead schemes than the particular one shown, that can
work in certain cases. However, for general-purpose multiplication none of these can
better the use of a single instruction.

Figure 3.13
[Figure: set-up phase — operand A is split into A[15:0] (R1) and A[31:16] (R2), and operand B
into B[15:0] (R3) and B[31:16] (R4); calculation phase — the four partial products A[15:0] x B[15:0],
A[15:0] x B[31:16], A[31:16] x B[15:0] and A[31:16] x B[31:16] are shifted and summed into the
64-bit result R0.]
A block diagram illustrating the set-up and calculation stages of the multi-step procedure nec-
essary to perform a 32-bit x 32-bit = 64-bit multiplication using multiply hardware only capable
of returning a 32-bit result (i.e. 16-bit x 16-bit = 32-bit hardware).

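The partial-product scheme of Figure 3.13 can be written out directly in C. The sketch below returns the 64-bit result as two 32-bit halves so that, apart from composing the carries, no arithmetic wider than 32 bits is needed; the function name is illustrative:

#include <stdint.h>

/* Sketch of the partial-product method of Figure 3.13: a 32 x 32-bit
   multiplication built from 16 x 16 = 32-bit multiplies, with the 64-bit
   result returned as two 32-bit halves (hi:lo).                          */
void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;       /* set-up phase      */
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;                        /* A[15:0]  x B[15:0]  */
    uint32_t p1 = a_lo * b_hi;                        /* A[15:0]  x B[31:16] */
    uint32_t p2 = a_hi * b_lo;                        /* A[31:16] x B[15:0]  */
    uint32_t p3 = a_hi * b_hi;                        /* A[31:16] x B[31:16] */

    /* accumulate the middle terms, keeping track of any carry out        */
    uint32_t mid   = p1 + p2;
    uint32_t carry = (mid < p1) ? 1u : 0u;            /* overflowed 32 bits? */

    *lo = p0 + (mid << 16);
    *hi = p3 + (mid >> 16) + (carry << 16)
             + ((*lo < p0) ? 1u : 0u);                /* carry from low half */
}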
Logical operations on longer data words are quite simple: split the operands, pro-
cess the logical operations on each part separately and then re-assemble the result. This
is because a logical operation on one bit in a binary word does not have any impact
upon the neighbouring bits.
Arithmetic operations require a little more thought than logical operations (but are
simpler than multiplication or division). The issue with arithmetic operations is that of
overflow: the result of adding two 16-bit numbers may be 17 bits in length. The extra
bit (carry) must therefore be taken into consideration when performing the addition of
the split numbers. Usually, that will involve calculating the lower half of the split first
and then adding this (with carry) to the result of the upper half.

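For example, a 64-bit addition performed with 32-bit arithmetic might look like the following sketch (function name illustrative), where the carry out of the low halves is detected by checking for wrap-around:

#include <stdint.h>

/* Sketch: adding two 64-bit values held as 32-bit halves on a machine
   with only 32-bit arithmetic. The lower halves are added first and any
   carry out is folded into the sum of the upper halves.                  */
void add64(uint32_t a_hi, uint32_t a_lo,
           uint32_t b_hi, uint32_t b_lo,
           uint32_t *sum_hi, uint32_t *sum_lo)
{
    *sum_lo = a_lo + b_lo;                        /* low halves first      */
    uint32_t carry = (*sum_lo < a_lo) ? 1u : 0u;  /* wrapped past 32 bits? */
    *sum_hi = a_hi + b_hi + carry;                /* carry into upper half */
}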
3.4.5.2 Floating Point on Fixed Point CPUs


We discussed floating point numbers extensively in Section 2.8 and floating point pro-
cessing in Section 2.9. Most of the time when we discuss “floating point’ in computer
architecture, we mean IEEE754 standard floating point. In fact, this is natural because
most hardware floating point units implement the IEEE754 standard.
Processors without floating point capabilities either rely upon compiler support to
translate each C programming language floating point operation into much longer and
slower fixed point subroutines, or compile into floating point machine code instructions
which are then trapped by the processor. This trapping is in effect an interrupt triggered
by the receipt of an instruction that the processor cannot handle. The interrupt service
code then has the responsibility of performing the particular floating point operation
using fixed point code before returning back to normal execution. This is known as
floating point emulation (FPE), and is examined further in Section 4.6.1. The first ap-
proach can only be used if the programmer knows, at the time of compilation, whether
an FPU will be present or not, and thus may not be suitable for general software such
as on a personal computer.
When a hardware FPU is not included in a system, the FPE alternative (or compiler
alternative) probably will not implement the full IEEE754 standard since this would
make it quite slow. Thus, the code will end up being potentially less accurate (as well
as a lot slower) than the programmer might expect.
Let us refer back to Sections 2.9.1 and 2.9.2 where we had considered the
addition/subtraction and multiplication of floating point numbers: the addition/
subtraction process required a normalisation procedure, whereas the multiplication
process required a straightforward calculation, albeit one containing several
subcomputations.
Taking the simple multiplication from Section 2.9.2 as an example:

(A x B^C) x (D x B^E) = (A x D) x B^(C+E)

For a machine equipped with FPU, (A x B^C) and (D x B^E) would be single 32-bit
(for float) or 64-bit (for double) values. These would be loaded into two FPU registers,
a single instruction issued to perform the multiplication and the answer retrieved from
a destination FPU register. By contrast, for a machine without FPU, several fixed point
operations would be required instead:

a) Split off mantissa and exponent A and C and store in R1 and R2 respectively.
b) Split off mantissa and exponent D and E and store in R3 and R4 respectively.
c) Calculate the new mantissa: R1 x R3.
d) Calculate the new exponent: R2 + R4.
e) Normalise the result.
f) Recombine and store in IEEE754 format.

Clearly, the single FPU instruction is preferable to the several fixed point operations
that are needed to replace it.

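To make steps (a) to (f) concrete, the sketch below performs a single-precision multiplication using only integer operations, assuming the standard IEEE754 layout of a 32-bit float. It deliberately ignores rounding, subnormals, infinities and NaNs — exactly the kind of corner-cutting that makes a fast emulation less accurate than the full standard — and the 64-bit mantissa product would itself have to be synthesised on a 32-bit machine, for example with the partial-product method shown earlier. The function name is illustrative:

#include <stdint.h>
#include <string.h>

/* Sketch of steps (a)-(f) for IEEE754 single precision, integer-only.
   Rounding, subnormals, infinities and NaNs are all ignored.             */
float soft_fmul(float x, float y)
{
    uint32_t xb, yb;
    memcpy(&xb, &x, sizeof xb);                     /* reinterpret the bits */
    memcpy(&yb, &y, sizeof yb);

    uint32_t sign = (xb ^ yb) & 0x80000000u;        /* result sign          */
    int32_t  xe = (xb >> 23) & 0xFF, ye = (yb >> 23) & 0xFF;
    uint32_t xm = (xb & 0x7FFFFFu) | 0x800000u;     /* restore implicit 1   */
    uint32_t ym = (yb & 0x7FFFFFu) | 0x800000u;

    uint64_t m = (uint64_t)xm * ym;                 /* new mantissa         */
    int32_t  e = xe + ye - 127;                     /* new (biased) exponent*/

    if (m & (1ull << 47)) { m >>= 24; e += 1; }     /* normalise: product   */
    else                  { m >>= 23; }             /* lies in [1, 4)       */

    uint32_t rb = sign | ((uint32_t)e << 23) | ((uint32_t)m & 0x7FFFFFu);
    float r;
    memcpy(&r, &rb, sizeof r);                      /* recombine            */
    return r;
}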
3.4.5.3 Complex Numbers


Complex numbers, of the form (a + j.b) where j = √-1, are frequently used in scientific
systems and also in radio communications systems. Almost all CPUs lack support for
complex numbers and few programming languages cater for them.
Complex number calculations on a system with hardware handling only real num-
bers, just like floating point performed with fixed precision arithmetic, requires a few
steps. Consider the multiplication and addition of two complex numbers:

(a + j.b) x (c + j.d) = (a.c - d.b) + j(a.d + b.c)
(a + j.b) + (c + j.d) = (a + c) + j(b + d)

The complex multiplication needs four real multiplications and two additions.
The complex addition is a little simpler, requiring only two real additions. This will re-
quire the programmer (or compiler) splitting the operation into steps of several simpler
instructions.
A processor with hardware support for complex numbers would possess a single
instruction capable of performing these operations. The underlying hardware architec-
ture would actually need to perform all of the splitting, suboperations and separate
multiplies, but this would be handled very quickly within the CPU without requiring
separate loads, stores and data moves.
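Written out in C (with a hand-rolled structure, since we are assuming no direct language support), the two operations above reduce to exactly the real-number steps just counted; the type and function names are illustrative:

/* Sketch: the complex operations above expressed as the real-number
   steps a compiler or programmer must generate when there is no direct
   hardware (or language) support for complex arithmetic.                */
typedef struct { double re, im; } complex_t;       /* represents re + j.im */

complex_t cmul(complex_t x, complex_t y)
{
    complex_t r;
    r.re = x.re * y.re - x.im * y.im;   /* four multiplies and two        */
    r.im = x.re * y.im + x.im * y.re;   /* add/subtract operations        */
    return r;
}

complex_t cadd(complex_t x, complex_t y)
{
    complex_t r = { x.re + y.re, x.im + y.im };     /* two real additions */
    return r;
}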

A Top-Down View

3.5.1 Computer Capabilities


Looking at various processors available today, there is a profusion of features, clock
speeds, bit widths, instruction sets and so on. The question arises as to what is needed
in a particular computer. Some capabilities are examined below.

3.5.1.1 Functionality
Given that all computable functions can be performed by some sequence of logic op-
erations, the main reason why not all functions are computed in such a way (i.e. as a
possibly long sequence of logical operations), is related to efficiency — how long does
such a function take to complete, and what hardware resources are required? There is
some trade-off in that making a computer simpler can allow faster clock speeds. This
argument led to the advent of RISC processors which, being simpler, clock faster — at
the expense of having to perform some functions longhand that would be built into a
CISC computer as single instructions.
8 The notable exception is FORTRAN (FORmula TRANslation), the general-purpose compiled
c
>
language introduced by IBM in the mid-1950s. FORTAN, updated several times since (the latest fe)

being 2003), has natively supported a complex number data type for over 50 years. Among modern
a
a.
languages, there has been some promotion of Java as a scientific language, with a complex number 2
extension. Unfortunately, Java is currently significantly slower to execute than FORTRAN. Sg

However, it is generally pragmatic to consider how often a particular function is


required in software when deciding how to implement it. Put simply, if a function
is required very frequently during everyday use, then maybe it is useful to build a
dedicated hardware unit to handle it quickly. In this way, an ALU is included in all
modern processors and almost all have hardware multiplier units.
Not only functional operations, but flexibility in the instruction set is an important
feature. For example, there may be time-saving instructions available in one design
but not another, even when these do not require large amounts of hardware support.
Examples are the universality of conditional instructions in the ARM instruction set
(Section 3.3.1) and zero-overhead loop instructions in some digital signal processors
(shown later in Section 5.6.1).
The internal architecture of a CPU, namely the number of buses, registers and their
organisation, is also an important consideration for performance. Generally speak-
ing, more buses means more data items can travel around a device simultaneously,
and thus better performance. Likewise, more registers support more software vari-
ables that would otherwise need to be stored in slower memory, and again improves
performance.

3.5.1.2 Clock Speed


A higher clock speed does not always mean faster operation. For example, it is
relatively easy to design a fast ALU, but not at all trivial to design a fast multiply unit.
When comparing two processors, clock speed alone is not sufficient to decide which
is faster. There are many factors such as functionality, bus bandwidth, memory speeds
and so on, which must be considered: in effect, asking ‘what can be accomplished each
clock cycle?’ This question is considered in the next section.

3.5.1.3 Bit Widths


Until recently, the vast majority of CPU sales were for 4-bit processors, destined to
be used in watches, calculators and so on. These days the profusion of mostly 32-
bit processors (generally ARM-based) used in cellphones and network appliances is
tipping the balance towards wider processors.
Although it might seem a wider processor will result in faster computation, this
is only true if the data types being computed make use of the extra width. High-end
servers with 64-bit or even 128-bit architectures are available, but if these are being used
to handle text (such as 7-bit or 8-bit ASCII or even 16-bit Unicode), the extra width may
well be wasted.

3.5.1.4 Memory Provision


The memory connected to a processor is often critical in determining operation speed.
Not only the speed of memory access, but the width (together specifying a bandwidth
in bits per second) and technology are just as important. Other aspects include burst
mode access, paging or packetisation and single-edged or double-edged clocking.
On-chip memory also may not always be single-cycle access, but it is likely to be
faster than off-chip memory. Given a particular software task that must be run, the

amount of memory provided on-chip, and off-chip, must be considered. A cache (Sec-
tion 4.4) in particular is used to maximise the use of faster memory, and the complexity
of hardware memory units tends to influence how well memory use is optimised. In
terms of memory, the way software is written and compiled can also result in more
efficient use of hardware resources.

3.5.2 Performance Measures, Statistics and Lies


In order to determine exactly how fast a computer operates, the simplest general-
purpose measure is simply how many instructions it can process per second.
MIPS (millions of instructions per second) measures the speed at which instructions
or operations, can be handled. This is a useful low-level measure, but it does not really
relate to how powerful a computer is: the operations themselves may be very simple
such that multiple operations are required to perform a useful task. In other words, a
simple computer with a high MIPS rating (such as a RISC processor) may handle real-
world tasks slower than a computer with a lower MIPS rating but with instructions that
each can perform more work (such as a CISC processor). The bogomips rating, calculated
at boot-up on Linux PCs, is a famous attempt to gauge a MIPS score in software — but
is unfortunately notoriously inaccurate.
MIPS as a measure is therefore made up of two components, clock frequency f (in
Hz) and CPI (cycles per instruction) such that:

MIPS = f / (CPI × 10^6)
More generally, for a particular program containing P instructions, the completion
time is going to be:

T_complete = (P × CPI) / f

So completion time reduces when CPI is low, f is high or most obviously P is low
(i.e. a shorter program will probably execute faster than a longer one). The trade-off
between P and CPI in computer architecture is a revisit of the RISC vs CISC debate,
while ever-increasing clock frequency is the story of modern CPUs.
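As a quick worked illustration (the figures are invented for this example rather than taken from any particular device): a program of P = 2 × 10^6 instructions running on a 100 MHz processor with an average CPI of 1.25 completes in T_complete = (2 × 10^6 × 1.25) / (100 × 10^6) = 25 ms, while the same processor sustains (100 × 10^6) / (1.25 × 10^6) = 80 MIPS. Halving CPI or doubling f halves the completion time, exactly as the equation suggests.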
The task of minimising CPI is another aspect of modern computer systems. Up
until the 1980s, CPI would be greater than 2, perhaps as much as several hundreds
in CISC machines. The RISC approach began to shift CPI downwards, with the aim of
achieving a CPI of unity. The ARM family of processors typically achieve a CPI of about
1.1, and other RISC processors can do a little better than this.
Later, the advent of superscalar architectures led to CPI values of below unity,
through allowing several instructions to execute simultaneously. This, and the inverse
of CPI (called IPC) will be explored later in Section 5.5.1.
Sometimes floating point performance is an important attribute and this is measured
in MFLOPS (millions of floating point operations per second). In recent times,
GFLOPS readings are more commonly quoted, meaning thousands of MFLOPS, and
even petaFLOPS, PFLOPS. These values are more indicative of actual performance
than MIPS since we are counting useful calculation operations rather than the low-
level instructions which comprise them.

Box 3.8   Standardised performance

In the mid-1980s, the computer industry worldwide saw an unprecedented level of
competition between vendors. This was not simply a two-entry race between AMD
and Intel. It included thousands of manufacturers selling enormously differing ma-
chines — alternative architectures, different memory, tens of CPU types, custom oper-
ating systems, 8 bits, 16 bits and even some more unusual choices.
In the UK, companies such as Sinclair, Acorn, Oric, Amstrad, Research Machines,
Apricot, Dragon, ICL, Ferranti, Tandy, Triumph-Adler and more battled in the market-
place against IBM, Apple, Compaq, DEC, Atari, Commodore and others. Claims and
counterclaims regarding performance littered the advertisements and sales brochures
available at that time. However, with no standard and no baseline, claims were often
dubious to say the least.
In response, the British Standards Institute (BSI) published a performance stan-
dard for computers — testing useful tasks such as integer calculation, floating point
calculation, branching performance and graphics as well as disc reads and writes.
However, at that time the programming language of choice was BASIC (Beginner's
All-purpose Symbolic Instruction Code), and hence the standards were written in this
language! From today’s point of view, the graphics and disc tests are also dated: the
‘graphics’ test was actually text being written to the screen or VDU (visual display unit)
in the parlance of the time. This was important for many users interested in nothing
more than word-processing. Also disc reads and writes were to floppy discs — a great
advance on the tape drives used for most home computers at the time — hard discs
(usually known as Winchester drives in those days) were simply too expensive and
not even supported on most machines available at the time. Far more common was
saving programs to cassette tape.
Today, computer magazines and websites test new hardware and software with a
battery of tests far removed from the BSI standard, but following the same rationale.
Thus, measures such as ‘refresh rate for playing Quake III’ and ‘time taken to sort
1 million rows of random numbers in a spreadsheet’ are to be found. Other more
standard, but often not freely available, tests exist but these are less commonly applied:
after all, most users are more interested in playing Quake than in how quickly they
can calculate π to 100 decimal places.

Benchmarks are so important that several companies exist to provide such services
(Box 3.8 explores the background and necessity of having such benchmarks). BDTi is
one example, which publishes comparative speeds for several digital signal processors
(DSPs). Their measures are skewed towards outright calculating performance, some-
thing which is the mainstay of the DSP market.
Otherwise, SPECint and SPECfp benchmarks compute integer and floating point
performance directly. These are obtainable in source code format from the Standard
Performance Evaluation Corporation (SPEC) for a fee, and can be compiled on an

architecture to assess its performance. Each measure is calculated from a set of algo-
rithms that have to be run, and results combined. Generally, a year is provided to
indicate test version. Thus, SPECint92 is the 1992 version of the integer standard.
The SPEC measures themselves incorporate two earlier measures known as Dhrystone
and Whetstone, which measure integer and floating point performance respectively
(Whetstone dates from the early 1970s, Dhrystone from the mid-1980s). Many other performance metrics exist and may be
used to assess performance for various tasks (such as graphics rendering, real-time
performance, byte handling and so on).
Unfortunately, it is a well-known fact that, given any single performance measure,
computer architects can tweak an architecture to yield a high score at the expense of
other, unmeasured, operations. Furthermore, none of these measures really reflect the
overall completion time of anything but the simplest tasks running in isolation. So
many issues intervene in the real world to confuse results, such as interrupted tasks,
operating system calls, varying memory speeds, disc speeds, multi-tasking and cache.
In computing, a cache (covered in detail in Section 4.4) is a small block of very
fast memory provided on a system which has far slower main memory. Any program
running directly from the cache will obviously execute quicker than one running from
slow main memory. Why this is relevant is that in the past, at least one processor vendor
has deliberately designed a cache just big enough to hold an entire performance measure
algorithm (i.e. the entire SPECint or Dhrystone program) so that it runs much faster
than it does on a competitor’s machine.
In such an example, if the main memory were set to run ten times slower, the
performance measure result would not change since the measuring program runs from
the cache, not main memory. Obviously, such a performance measure is not realistic.
In fact, such a machine would yield a faster performance score than a competitor with
a smaller cache but significantly faster main memory — one which would in reality
probably perform real-world tasks much quicker.
Given significant performance-altering factors such as those we have mentioned, it
is clear that the world of benchmarking is fraught with difficulty. A system designer is
thus urged to be careful. In practice, this may mean understanding device operation in
detail, building in large safety margins or testing final code in-situ before committing
to a device. Although it is rare in industrial projects for software to be available and
working before hardware is complete, if such an opportunity arises, the approach of
in-situ testing is very much recommended.

3.5.3 Assessing Performance

Section 6.4.4 will discuss completion times and execution performance for real-time
and multi-tasking systems, but here we consider estimation of performance. In order
to underscore the need for accurate performance estimation, here is an example from
industry:

Several years ago, an embedded design group needed hardware to run an algorithm re-
quiring 12 MIPS of processing power. A 32-bit CPU rated at providing 40 MIPS when
clocked at 40 MHz was chosen to execute this. In an attempt to reduce design risks, the

designers obtained a development board, loaded a Dhrystone measure on to this and


checked actual performance themselves before committing to that processor as the design
choice.
During the design process, they realised that on-chip memory was insufficient for the
needs of their software and hence added external DRAM memory. Due to the small size
of the CPU package and the low number of pins, the external memory bus was limited to
being 16-bits wide. External memory accesses were therefore 16-bits wide instead of 32-bits
wide.
Having completed their hardware design and built the system, they loaded up the
code and found it would not execute in the time required. Where had they gone
wrong?
Firstly, the Dhrystone measure fitted into fast on-chip memory and so could run
at full speed, whereas their wanted algorithm was too large to fit into on-chip mem-
ory and therefore had to be stored in DRAM instead. Not only were the DRAM ac-
cesses themselves slower than internal memory accesses, but DRAM needed a ‘time out’
occasionally to refresh itself. During that time-out, all memory accesses by the CPU were
stalled.
Finally, the 16-bit interface meant that two memory reads were now required to fetch
each 32-bit instruction — two 16-bit accesses were also required to read in every 32-bit data
word. This meant that, when executing a program from DRAM, the CPU needed to spend
half of its time idle. Every even cycle it would fetch the first half of the instruction. In
the odd cycle it would fetch the second half of the instruction, and only then begin to
process it.
The 16-bit interface effectively dropped the 40 MIPS down to 20 MIPS, and the lower
speed of the DRAM accesses plus refresh time reduced the 20 MIPS performance further
to around 9 MIPS.
The solutions were unpleasant: either switch to using very fast external memory
(SRAM) which was perhaps 20 times as expensive, or upgrade to another CPU with
either faster speed or a wider external memory interface, or both. Designers chose
neither — they added a second CPU alongside the first to handle some of the processing
tasks.

This example underscores the necessity of matching performance requirements


to hardware. In general, there are two approaches to this. The first one is through a
clear understanding of the architecture, and the second is through careful evaluation of
the architecture. In both cases, the architecture referred to is not only that of a central
processor; it includes other important peripheral elements.
Gaining a clear understanding of software requirements means having fixed soft-
ware that needs to be run on a system, analysing that software to identify its contents
(particularly any bottlenecks) and then matching the results of that analysis to available
hardware. At the simplest level this might mean avoiding an integer-only CPU when
most calculations need to be done in floating point.
This approach is commonly taken for DSP systems, and will include a close look at
memory transfers, placement of variable blocks into different memory areas that can be
accessed simultaneously (Section 4.1.4), input and output bottlenecks and mathematical
operations which are typically the major strength of such processors. Slow set-up, user

interface and control code are generally ignored in such calculations, except in the sizing
of overall program memory requirements.
At this point it is useful to note that most, if not all, software developments end
up overrunning initial program memory use estimates. Clever coding can often bring
down data memory use and can reduce processing requirements, but can seldom save
significant amounts of program memory. Unlike desktop computer designers, embed-
ded designers do not have the luxury of providing for RAM expansion: this must be
fixed at design time. In such cases, it is wise to significantly overestimate memory needs
up-front.
The second approach mentioned of matching required performance to hardware,
is through careful evaluation. This does not require detailed architectural understanding,
but does require detailed levels of testing. Ideally, the final runtime software should be
executed on candidate hardware to evaluate how much CPU time it requires. A list of
other tasks to be performed should also be made and checked to see whether those can
fit into whatever spare processing time remains. Software profiling tools (such as GNU
gprof) will identify any bottlenecks in the runtime code and make clear which software
routines require large amounts of CPU time.
It is important to run any test a number of times (but do not average the results if
timing is critical — take the maximum worst case), to increase program size sufficiently
to force it out of the cache or on-chip memory, if appropriate, and to enable whatever
interrupts and ancillary tasks might be needed in the final system.
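A minimal C sketch of this advice is shown below. The function run_algorithm() is a placeholder for the code under test, and the use of the POSIX clock_gettime() call with CLOCK_MONOTONIC is an assumption about the host environment rather than anything prescribed by the text. The harness times the routine several times and keeps the worst case rather than the mean.

#include <time.h>

extern void run_algorithm(void);      /* hypothetical routine under test */

/* Time the routine 'runs' times and return the worst-case duration in seconds. */
double worst_case_seconds(int runs)
{
    double worst = 0.0;
    for (int i = 0; i < runs; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_algorithm();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (s > worst)
            worst = s;                /* keep the maximum, not the average */
    }
    return worst;
}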
If, as is sometimes the case, the target software is already running on another ma-
chine, it is possible to compare its execution on that machine to execution on another —
but only after considering all important architectural factors as discussed in these last
two chapters. In such instances, compiling and comparing a suite of standard bench-
marks on both machines will help, assuming that the benchmarks chosen are ones of
relevance to the target software.
The world is full of examples where designers have estimated processor perfor-
mance and/or memory requirements incorrectly (including one example designed
for an Asian industrial manufacturer in 1999 by the author: a portable MP3 player
that could only replay seven seconds of MP3 audio at a time, due to unexpectedly
low memory bus bandwidth. Luckily, a faster speed grade processor became
available).
You have been warned! Beware the pitfalls of performance estimation, evaluation
and measurement. Above all, remember to read the small print below manufacturers’
performance claims.

Summary
In this chapter, the basics of the microprocessor have been covered, starting with the
functionality of a CPU, the ability to control this with a program and the need to transfer
this program (and store it somewhere).

A control unit needs to keep a processor on track, managing operations and ex-
ceptions, and being directed in turn by the computer program through a sequence of
instructions. Control units can be centralised, or distributed with timing from a state
machine, a microcode engine or using self-timed logic.
Each instruction in a program is part of an allowable instruction set that (depend-
ing on your point of view) describes the operations capable of being performed by that
processor, or which specifies the microprocessor behaviour. Such behaviour includes
data transfer through internal buses to various functional units. Having laid the foun-
dation for CPU design here and in the previous chapter, in Chapter 4, we will delve into
the internal arrangements and functional units of most mainstream CPUs and attempt
to relate that to the programmer’s experience.
Problems

3.1 If the assembler instruction LSL means ‘logical shift left’, LSR means ‘logical
shift right’, ASL means ‘arithmetic shift left’ and ASR means ‘arithmetic shift
right’ then what are the results of performing these operations on the following
signed 16-bit numbers?
a. 0x00CA ASR 1
b. 0x00CA LSR 2
c. 0xFF0F LSL 1
d. 0xFF0F LSR 1
e. 0xFF0F ASR 1
f. 0xFF0F ASL 1

3.2 An analysis of representative code for a RISC processor with only eight instruc-
tions finds the following occurrences of those instructions:

Instruction    Number of occurrences
ADD            30
AND            22
LDR            68
MOV            100
NOT            15
ORR            10
STR            60
SUB            6

a. If each instruction (excluding operands) is 6-bits long, how many bits does
the program occupy?
b. Use the information in the table to design a Huffman coding for the
processor.
c. Calculate the number of bits needed to store the program using the Huffman
coded instruction set.

3.3 Show the sequence of stack PUSHes and POPs during the execution of the
following Reverse Polish notation (RPN) operations and translate each into
infix notation:
Ase Osa
D.ccalb ch cx
Gan ab. cdsi phar
Consider the maximum depth of stack required to perform these operations.

3.4 A ROT (rotate) instruction is similar to a shift, except that it wraps around —


when shifting right, each bit that drops off the LSB end of the word is moved
around to become the new MSB. When shifting left, each MSB that drops off is
moved around to become the new LSB.
The ROT argument is positive for left shifts and negative for right shifts.
So, imagine a processor that has a ROT instruction but no shift. How can
we do arithmetic and logical shifting?

3.5 Translate the following infix operations to Reverse Polish notation (RPN):

a. (A and B) or C
b. (A and B) or (C and D)
c. ((A or B) and C) + D
d. C + {pow(A, B) × D}
e. See if you can perform the following translation in three different ways:
   {C + pow(A, B)} × D

3.6 Calculate the maximum stack usage (depth) for each of the three answers to
part (e) above.

3.7 Translate the following Reverse Polish notations to infix:


aa, AB+C+-Dx
b, —ABCDE-+ * x—
ce. DC not and BA ++

3.8 Given the following segment of ARM assembler, rewrite the code to use condi-
tional ADDS to remove the need for any branch instructions.

DD SRO PRLS RS

step2
step3 TOP

3.9 In ARM assembly language, determine the least number of instructions in each


case to perform the following immediate loads (hint: use the MOV instruction):
a. Load a value 0x12340001 to register RO
b. Load a value 0x00000700 to register R1
c. Load a value 0xFFFF0FF0 to register R2

3.10 Identify the sequence of operations in a RISC processor that is required to add


the contents of two memory addresses m1 and m2 and store the result to a third
address m3.

3.11 Scientists discover a new type of silicon memory cell. Semiconductor engineers
design this into a new memory chip. Identify six factors that computer architects
would look at when deciding whether to adopt this new technology for mass
storage in an embedded video player.

3.12 Consider the following instructions and decide whether they are from a RISC
or CISC processor:
a. MPX: Multiply the content of two memory locations, then add the result to
an accumulator.
b. BCDD: Perform a binary-coded decimal division on two registers, format
the result in scientific notation and store as ASCII to a memory block ready
for display to the screen.
c. SUB: Subtract one operand from another and return the result as a third
operand. The operands and result are register contents only.
d. LDIV Rc, Ra, Rb: Perform a 100-cycle-long division of Ra/Rb and place the
result in register Rc.

3.13 Write an approximate microcode program sequence to perform any two of the


instructions from the previous problem. Assume an internal RISC-style archi-
tecture.

3.14 What is a load-store architecture? Why would computer designers adopt such
an idea?

3.15 In a simple computer pipeline, what process normally follows the instruction
fetch stage?

3.16 For a fictitious 32-bit processor, the hexadecimal machine code instruction for
the assembler command to store a word 0x1234 in memory location 0x9876
looks like this:
0x0F00 1234 088D 9876
By examining the machine code instruction, determine whether this processor
is likely to be capable of absolute addressing. Justify your answer.

3.17 Another fictitious processor, this time an 8-bit CPU, has eight registers. Is it
possible to have instructions in this processor that specify two operand registers
and a separate result register?

3.18 Assuming ARM-style assembly language (but not necessarily an ARM proces-
sor), identify the type of addressing represented in the following instructions:
MOV R8, #0x128
AND
SHIR. Ieee os Nile
dt
AND R4, R5, R4
LDRRG (URS peRObe iol 2)
LDR RZ Ry Ole aor

sp Sle Roy
Cemeoean wlinsr ees 0)|

3.19 Which processor is likely to be faster at processing 32-bit floating point data: a
900 MHz 32-bit floating point CPU or a 2 GHz 16-bit integer-only CPU?

3.20 When writing code in the C programming language on different processors, is


a byte always represented as 8 bits? How about the short and int — what
size are these, and are they always the same?
CHAPTER 4

Processor Internals

Chapter 2 has covered much of the low-level numerical calculations per-


formed by a computer and also dealt with the definitions of computer func-
tional units and classifications of some connectivities. In Chapter 3, this
information has been formed into cohesive units with different functions
that are able to execute sequences of instructions as specified by a pro-
grammer, since we know that computers, and indeed CPUs, can be divided
logically into a number of functional units performing different tasks.
This chapter will extend beyond the basic high-level discussion of
what goes into a CPU and focus on the largest, most prominent and most
important of the internal units that are commonly found in modern proces-
sors. We will look in more detail at what tasks those units perform and how
they are able to do so. This discussion mainly covers the ALU, FPU, MMU
and memory cache unit. However, before embarking upon that discussion,
we will first consider the issue of how the units are wired up through buses.
It is time to assess the actual architecture — specifically the intercon-
nected bus structure — of units within a CPU.

4.1 Internal Bus Architecture

4.1.1 A Programmer’s Perspective


From a programmer’s perspective, the internal bus architecture of a pro-
cessor can be seen in two main, but related, ways. The first is in the degree
of flexibility of register use. This is evident in the set of possible registers
that can be used as operands in a particular instruction: in the ARM for
instance, where a register operand is allowed, any register from its register
bank can be named:
ADD R0, R1, R2     ; R0 = R1 + R2

Any register could be used — we could even use the same register:

ADD R0, R0, R0     ; R0 = R0 + R0

Many processors do not have this flexibility or are less regular. Sec-
ondly, there is the issue of how much work can be performed in a sin-
gle instruction cycle. This is normally implicit in the instruction set itself.

Figure 4.1   A schematic diagram of an ALU and a bank of registers interconnected
with a three-bus arrangement.

Again looking at the ARM, there are at most two register input operands and a single
register result operand associated with any arithmetic or logic instruction:
ADD RO, R1, R2 ;RO=R1+R2
With regard to the means of transporting data from a register to the ALU and back
again: if this all happens in a single cycle, it implies that both the input and the output
have their own buses (since only one operand can travel along one bus at any time).
One bus will convey the content of R1, another will convey the content of R2 and yet
another will convey the result from the ALU back to register RO.
Taking the two observations together implies that all registers connect to all buses,
and there are at least three main internal buses.
The arrangement concerning registers and ALU that we can deduce from a brief
examination of the instruction set is shown in Figure 4.1. This is actually a simplified
schematic of the ARM processor internal interconnection arrangement. The arrows in-
dicate controllable tristate buffers, acting as gates controlling read and write access be-
tween the registers and the buses. Control logic (described in Section 3.2.4) is not shown.

4.1.2 Split Interconnection Arrangements


The ARM is justly famed for its regularity and simplicity. Some other processors are
less friendly to low-level programmers: where the ARM has a bank of 16 identical
registers with identical connectivity,’ it is more usual to assign special meanings to sets
of registers. One common arrangement is to dedicate several address registers to holding
and handling addresses, whereas the remainder are data registers. It is easy where there
is such a split to imagine an internal address bus that only connects to those registers
dedicated to handling addresses. In the ARM, where every register can hold an address
(since it uses indirect addressing, explained in Section 3.3.4), every register must also
have connectivity to the internal address bus.

1 In fact, registers R14 and R15 are the link register and program counter respectively. These
understandably require connections that other registers will lack, which are not really evident
through examining the instruction set. Registers also vary in their shadowing arrangements.

Figure 4.2   A schematic diagram of an ALU, a MAC and a bank of registers interconnected
with a three-bus arrangement. The ability to convey two operands simultaneously
to a single functional unit is highlighted.

In some processors, such as the ADSP21xx, there is no bank of registers — there


are instead specific registers associated with the input and output of each processing
element. This means that when using a particular instruction, the low-level programmer
has to remember (or look up in the programming manual) which registers are allowed.
Sometimes an instruction has to be wasted to switch a value from one register to another
to perform a particular function — although clever instruction set design means that
these inefficiencies are quite rare. These days, such architectures are uncommon among
general-purpose processors, but are still found in some digital signal processors (DSPs)
such as the ADSP21xx² family.
So, why would designers go to such trouble and complicate the instruction set?
The answer requires us to take a snapshot of the internals of a processor as it performs
some function. In this case, we will look at the ARM as it performs the following two
instructions, using hardware which is shown diagrammatically in Figure 4.2.
MUL R0, R1, R2     ; R0 = R1 × R2
ADD R4, R5, R6     ; R4 = R5 + R6
The snapshot of time represented in Figure 4.2 shows data being output from R1
and R2 simultaneously on the two operand buses (indicated in dark colour), flowing
into the multiply-accumulate unit (MAC), and the result flowing over the results bus
back into register RO.
The thing to note during this snapshot is that the registers from R3 onwards and
the ALU are all sitting idle. When CPU designers see resources sitting idle, they tend
to wonder if it is possible to utilise them — in this instance, to see if there is a way of
using the ALU and the MAC simultaneously. One answer is to partition the design as
shown in Figure 4.3.

2 The ‘xx’ means that there are various serial numbers in the ADSP21 family which share these
characteristics, such as the ADSP2181, ADSP2191 and so on.



Figure 4.3   A schematic diagram of an ALU, a MAC and a bank of registers interconnected
with a three-bus arrangement. This is similar in resource use to the hardware illustrated
in Figure 4.2, although in this case bus partitioning has been used to allow the two
functional units to transfer their operands simultaneously.

In the arrangement shown, both the MAC and the ALU have their own buses —
both input and result, and by extension, their own set of preferred registers. Thus, as
long as the programmer remembers to use RO to R3 when dealing with the MAC, and
R4 to R7 when dealing with the ALU, both of the example instructions:
MUL R0, R1, R2     ; R0 = R1 × R2
ADD R4, R5, R6     ; R4 = R5 + R6
can be performed simultaneously in a single cycle.
This process is probably the underlying thinking below the design of the ADSP21xx
hardware, squeezed by designers for every last drop of performance gain.

4.1.3 ADSP21xx Bus Arrangement


In the ADSP21xx hardware, every processing element is limited to receiving its input
from only a few registers and outputting a result to another small set. This means there
are many internal buses and many operations can be performed very quickly in parallel.
A simplified diagram of some of the many internal buses within the ADSP21xx
is shown in Figure 4.4, a simplified diagram of the ADSP internal bus arrangement. In
this figure, PMA is program memory address and DMA is

data memory address. Both are address buses that index into the two blocks of memory
(program and data) which also indicate that this is basically a Harvard architecture
processor (see Section 2.1.2). However, it actually goes a step further in its partitioning of
address spaces. PMD and DMD are program and data memory data buses respectively.
Note the bus sizes: not only does this ADSP have a complex internal bus interconnection
arrangement, but the bus width and width of the interconnects differ.
The diagram shows that the ALU and the MAC, but not the shifter, can receive
input operands from the 24-bit PMD bus, but all can receive input and output from the
16-bit DMD bus.

4.1.4 Simultaneous Data and Program Memory Access


A topic that is very important in areas such as signal processing is the consideration
of how fast external data can be brought into a computer, processed and then output.
Signal processors typically operate on streams of such data, whether such data is high-
fidelity audio or wideband wireless signals.
Signal processing operations tend to be some form of digital filter. This can be
considered as a time series of samples, x[0], x[1], x[2] and so on, being the input values at
time instant 0 (which we can think of as ‘now’), one sample previously and two samples
previously respectively. y[0], y[1], y[2] are the output values at those corresponding
times. If this were audio data, then x and y would be audio samples, probably 16 bits, and
if they were sampled at 48 kHz the time instants would each be 1/48000 ≈ 21 μs apart.
Without delving too deeply into digital signal processing (DSP), we can say there
are two general filter equations: the finite impulse response (FIR) filter and the infi-
nite impulse response filter (IIR). FIR outputs are obtained by multiplying each of the
previous n samples by some predetermined values and then adding them up. Mathe-
matically, this is written:
y[0] = Σ_{i=0}^{n−1} a[i] × x[i]
So the current output y[0] depends on n previous input values multiplied by the
filter coefficients a[] and then summed together. The number of previous values defines
the order of the filter. A tenth order filter would be defined by setting n = 10 and
predetermining ten a[] values. An adaptive FIR filter would be one in which the a[]
values are changed from time to time.
The IIR filter, by contrast, makes the output value dependent upon all previous
outputs as well as previous inputs:

y[0] = Σ_{i=0}^{n−1} a[i] × x[i] + Σ_{i=1}^{n−1} b[i] × y[i]

This includes the use of a further set of filter coefficients, b[]. IIR filters can also be
adaptive and are generally able to perform the same work as FIR filters but with a lower
order (which means a smaller value of n). This strong filtering action comes at a price,
and that is mainly observed by IIR filters becoming unstable if not designed carefully.

Figure 4.5   A block diagram of Harvard architecture internal memory access in a DSP,
augmented by the ability to add external shared memory.

The art of designing high-performance digital signal processors is to make these


equations able to operate as quickly as possible, with the goal of being able to calculate
a value y[0] in as few clock cycles as possible. Looking back at the equation for the FIR
filter, we can see that most of the work is done by the following low-level operation:

ACC = ACC + (a[i] × x[i])

The act of multiplying two values and adding to something already there is called
multiply-accumulate, which uses an accumulator, usually abbreviated to ‘ACC’.
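As a brief illustrative sketch in C (the data types and function name are assumed for this example, not taken from any particular device), an n-th order FIR filter is simply this multiply-accumulate repeated n times; on a DSP with split buses, placing a[] and x[] in separate memory blocks allows both operands of each iteration to be fetched in the same cycle.

#include <stdint.h>

/* Compute one FIR output sample: ACC = ACC + (a[i] * x[i]) for i = 0..n-1.
 * a[] holds the filter coefficients, x[] the most recent n input samples. */
int32_t fir(const int16_t *a, const int16_t *x, int n)
{
    int32_t acc = 0;                       /* the accumulator, ACC */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * x[i];       /* one multiply-accumulate per tap */
    return acc;
}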
Now we need to relate that function to the hardware of a digital signal processor.
There are many subtleties that could be discussed here, and using this operation, but
in this case one of the most important aspects is the memory access arrangements.
Consider the block diagram in Figure 4.5 showing a digital signal processor con-
taining a CPU, two memory blocks and a block of external shared memory. The device
seems to have an internal Harvard architecture (separate program and data memory
and buses), but connects externally to a block of shared memory. This type of arrange-
ment is very common, with the internal memory being static RAM (SRAM), and some-
times having SDRAM (synchronous dynamic RAM) externally for the main reason that
it is far less expensive than SRAM (refer to Section 7.6 for details on memory technolo-
gies and their features).
On-chip memory uses short internal buses and is generally extremely fast, some-
times accessing instructions in a single cycle. Occasionally, a block of two-cycle memory
is also provided. This is twice as slow as single-cycle memory since it requires two clock
cycles between requesting data and it being made available.
Ignoring the memory speed for now, and referring back to the multiply-accumulate
example, we need to feed the multiplier with two values: one being a predetermined
coefficient, a[], and the other being an input data value x[]. Given a shared bus, these
two values cannot be obtained/transferred simultaneously. However, given the internal
split buses in the diagram, they can both be fetched together and begin to be multiplied
in a single cycle — if obtained from the separate on-chip memory blocks. Overall, this
will probably be a multi-cycle operation: one cycle to load and decode the instruction,
the cycle following that to load the operands, and then one or more cycles to operate on
those. However, given fast single-cycle on-chip memory it is possible for the operand
fetch to occur as part of an internal instruction cycle.



Usually, anything that traverses an off-chip bus is slow compared to data following
on-chip paths, and this is one major driving factor behind the use of cache memory
(explored later in Section 4.4). Where the external memory device is SDRAM there will
almost always be an on-chip cache to alleviate the issue so that however fast SDRAM
is, there is always a two- or three-cycle latency between requesting a single memory
value and it being provided.

4.1.5 Dual-Bus Architectures


Taking a step backwards, a large hardware saving is made by minimising the number
of buses: buses are bundles of parallel wires that must be routed through an integrated
circuit, which cost in terms of buffers, registers and interconnects. They are expensive
in silicon area and consume prime ‘real estate’ on chip. It is entirely possible to reduce
area (and thus cost) by moving to a two-bus architecture and beyond that to a single-
bus architecture (Section 4.1.6).
This is one case where our investigation does not parallel computer architecture
evolution. The reason is that using a three-bus architecture is actually more sensible than
using a single bus and easier to explain. Tricks are required when buses are fewer —
tricks that have been used in silicon before the 1980s but which nevertheless complicate
the simple view of a bus as a path between the source and destination of operands and
results. All examples in this section and the next are fictitious: they present something
like the ARM architecture, but with different bus arrangements. Original reduced bus
designs, such as the venerable 6502 processor, did not have the luxury of a register
bank, let alone a multiplier. Therein lies the problem: silicon area was too limited to
allow a nice architecture or sometimes even a time-efficient architecture. In many cases,
it was simply sufficient that the design could be manufactured and could work. With
space for only three general registers, the 6502 designers were never going to be able
to shoehorn in another parallel bus — they would have added some more registers
instead.
Figure 4.6 presents a register bank connected to an ALU using a two-bus arrange-
ment. There are three registers or latches shown clustered around the ALU (actually
making this very similar to the 6502 — ignoring the larger register bank of course).
In order for this, and the following examples to make sense, it is necessary to
remember something about the ALU. That is the propagation delay timings. When we

Figure 4.6   A dual-bus connection between an ALU and a register bank.

present stable electrical signals at the two input arms of the ALU, we need to wait for
a certain length of time before the answer appearing at the bottom of the ALU is valid.
Some control logic (not shown) would be present to instruct the ALU as to exactly what
arithmetic or logic operation it should be performing, and this is assumed constant
here. But the length of time we have to wait depends on the exact operation being
performed — and the maximum (worst case) time is the one that determines how fast
we can clock the circuitry based around this ALU. In a modern system, this delay may
be something like one or two nanoseconds.
That delay is accounted for, but the problem here is that there is effectively no
minimum delay: what this means is that as soon as one of the input signals is removed
or changes, the result can start to become corrupted. The consequence of this is that
the input operands must remain in place driving the ALU as the result is collected and
stored. Only then can the input operands change, or be removed.
Hence, the registers on the ALU input arms. Without at least one register there is no
way for a two-bus architecture to drive an ALU with input operands and simultaneously
collect the result. With one or two registers present there are several alternatives that
may save on hardware slightly, but the more general is the following sequence of events
performing:
ADD RO, R1, R2 ;RO=R1+R2
Each numbered step is at a monotonically increasing time instant:

1. Set up system, clear buses and set ALU functionality switch to ‘ADD’.
2. Allow register R1 to drive bus 1 (by turning on register output buffer) and register
R2 to drive bus 2 (by turning on register output buffer).
3. Latch bus 1 value into first ALU operand register and bus 2 value into second ALU
operand register.
4. Turn off R1 register output buffer (bus 1 becomes free) and R2 register output buffer
(bus 2 becomes free).
   Wait for worst case propagation delay through ALU.
5. Latch ALU result into ALU output buffer.
6. Allow ALU output buffer to drive one bus.
7. Latch content of that bus into register R0.
8. Turn off ALU output buffer (both buses become free and the system is ready to
perform the next operation).
It can be seen that the very simple ADD command actually comprises a number of
steps that must be performed in hardware. These steps add up to something like eight
time periods ignoring ALU propagation delay. In a three-bus design (Section 4.1.1),
such an add would require only three time periods.
The complexity of these steps even for a simple ADD instruction goes some way
towards explaining the importance of a control unit inside a CPU to manage this process
(Section 3.2.4). Can you imagine the control complexity needed for a large multi-cycle
CISC instruction?

Figure 4.7   A single-bus connection between an ALU and a register bank.

4.1.6 Single-Bus Architectures


The case of a single-bus architecture can be extrapolated from the section above. Again
using a fictitious ARM-style processor as an example, the architecture may look similar
to that shown in Figure 4.7.
Note the architectural simplicity of the design, which belies the operational com-
plexity of the multi-step operation of such a system. Again we consider adding R0 =
R1 + R2, with each numbered step being at a monotonically increasing time instant.
1. Set up system and set ALU functionality switch to ‘ADD’.
2. Allow register R1 to drive bus (by turning on the register output buffer).
3. Latch bus value into the first ALU operand register.
4. Turn off register output buffer for R1, then allow register R2 to drive the bus (by
turning on its register output buffer).
5. Latch bus value into the second ALU operand register.
6. Turn off register output buffer for R2.
   Wait for worst-case propagation delay through ALU.
7. Latch ALU result into ALU output buffer.
8. Allow ALU output buffer to drive the bus.
9. Latch content of the bus into register R0.
10. Turn off ALU output buffer (bus becomes free and the system is ready to perform
the next operation).
Comparing the sequence above to that for a two-bus system in Section 4.1.5, the
two extra steps and the resulting reduction in efficiency are noticeable. One common
improvement made historically to single-bus architectures was the addition of a very
short and inexpensive result feedback bus, as shown in Figure 4.8.

Figure 4.8   A single-bus connection between an ALU and a register bank as in Figure 4.7,
but augmented with a single feedback link from ALU output to one of the ALU input
latches.

Again there are several alternative arrangements to perform this functionality, but
all allow the result of an ALU calculation to be fed back to the input of one arm of
the ALU. This would be useful when performing accumulation or when following
one arithmetic or logical operation after another. In this case, the register on the left-
hand arm of the ALU became known as the accumulator. It was the basis for almost
every operation, the most used register in the entire system, the programmer’s friend.
Older low-level programmers came to know and love the accumulator: many mourned
its death, killed by RISC and CISC advancements alike. This quote from well-known
New Zealand engineering management guru Adrian Busch sums it all up: ‘If it hasn’t
got an accumulator, it isn’t a real CPU.’

4.2 Arithmetic Logic Unit

4.2.1 ALU Functionality


Clearly, an arithmetic logic unit (ALU) is the part of a computer capable of performing
arithmetic and logical operations. But what exactly are these? An example of ALU
operations defined from the instruction sets of two common processors may give some
indication:
e ADSP2181 — Add, subtract, increment, decrement, AND, OR, EOR, pass/clear,
negate, NOT, absolute, set bit, test bit, toggle bit. There are limits on which registers
can be used as input and only two registers are available for output.
e ARM7 — Add, subtract, increment, decrement, AND, OR, EOR, pass/clear, NOT.
Any register can be used as input and any register as output.
In general, the ALU performs bitwise logical operations, tests and addition or subtrac-
tion. There may be other functions performed by the ALU that are derivatives of these,
and using multiple ALU operations a great deal of other functions could be performed.
A basic ALU, performing addition or subtraction, can be constructed from a num-
ber of single-bit slices operating in a chain, similar (in the add/subtract case) to the
carry-propagate adder of Section 2.4.2 and illustrated in Figure 4.9. In this case, where
control or function-select logic is not shown, eight separate single-bit ALUs operate
bit-wise with carry on two input bytes to generate a result byte. The operation being
performed is a bit-wise combination of the two input bytes, with the carry from each
1-bit ALU passed along the chain to its neighbour.

Figure 4.9   A block diagram of the parallel bitwise functional chain of 1-bit units that
comprise a byte-wide ALU.

Some 4-bit examples of ALU operations are given below:

1001   AND   1110   =   1000   Bitwise and
0011   AND   1010   =   0010   Bitwise and
1100   OR    0001   =   1101   Bitwise or
0001   OR    1001   =   1001   Bitwise or
0001   ADD   0001   =   0010   Addition
0100   ADD   1000   =   1100   Addition
0111   ADD   0001   =   1000   Addition
       NOT   1001   =   0110   Negation
0101   SUB   0001   =   0100   Subtraction
0110   EOR   1100   =   1010   Exclusive-OR

From the background work in Chapter 2, we know that addition and subtraction
are not parallel bit-wise operations. By that, we mean the nth bit result of an addition
depends not only on the nth bits of each input operand, but also on all previous bits,
n−1, n−2, ... 0. In fact, arithmetic operations between two values in general are not
accomplished in a bit-parallel manner, but logical operations between two values are.
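To make the distinction concrete, the following C sketch (an illustration only, not a description of any particular device, with invented function names) models the bit-slice arrangement of Figure 4.9: the logical functions are computed independently for each bit position, whereas ADD must pass a carry from each slice to the next.

#include <stdint.h>

/* One 1-bit ALU slice. Function select: 0 = AND, 1 = OR, 2 = EOR, 3 = ADD. */
typedef struct { unsigned result; unsigned carry_out; } slice_out_t;

static slice_out_t alu_slice(unsigned a, unsigned b, unsigned carry_in, int fn)
{
    slice_out_t o = { 0, 0 };
    switch (fn) {
    case 0: o.result = a & b;                 break;   /* bitwise, no carry */
    case 1: o.result = a | b;                 break;
    case 2: o.result = a ^ b;                 break;
    case 3: o.result = a ^ b ^ carry_in;               /* full adder        */
            o.carry_out = (a & b) | (carry_in & (a ^ b));
            break;
    }
    return o;
}

/* An 8-bit ALU built from eight chained slices, as in Figure 4.9. */
uint8_t alu8(uint8_t a, uint8_t b, int fn, unsigned *carry_out)
{
    uint8_t result = 0;
    unsigned carry = 0;                        /* ripples from bit 0 upwards */
    for (int n = 0; n < 8; n++) {
        slice_out_t o = alu_slice((a >> n) & 1, (b >> n) & 1, carry, fn);
        result |= (uint8_t)(o.result << n);
        carry = o.carry_out;
    }
    *carry_out = carry;
    return result;
}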
Knowing what types of functions an ALU performs in typical devices and having
looked at some examples, it may now be instructive to perform a low-level design of
an ALU to explore how it operates.

4.2.2 ALU Design


The block symbol traditionally used for an ALU is shown in Figure 4.10 with n-bit
input operands A and B and n-bit result output indicated.
Function select is normally a bit-parallel control interface that identifies with the
ALU operation being performed. Status information includes whether the answer is
positive, negative, equal to zero, includes a carry or is an overflow. In some processors,
these values are abbreviated to N, Z, O³ and C.

Before           Operation             Afterwards
R1     R2                              R0                              Flags
5      5         SUB R0, R1, R2        0                               Z
8      10        SUB R0, R1, R2        −2                              N

Assume that the registers are 8 bits for the next two. An 8-bit register can store numbers
from 0 to 255 unsigned or −128 to 127 in two's complement signed binary.

127    1         ADD R0, R1, R2        128 (unsigned), −128 (signed)   V
−1     1         ADD R0, R1, R2        0                               C, Z

3 ‘V’ is often used to represent the overflow flag instead of ‘O’, which might be confused with a zero.

Figure 4.10   The block symbol normally used to represent an ALU, showing n-bit operand
inputs A and B, function-select logic, and finally both n-bit result output and status flag
outputs.