Computer Architecture: An Embedded Approach

Ian McLoughlin
School of Computer Engineering
Nanyang Technological University
Singapore * Boston * Burr Ridge, IL * Dubuque, IA * Madison, WI * New York * San Francisco
St. Louis * Bangkok * Kuala Lumpur * Lisbon * London * Madrid
Mexico City * Milan * Montreal * New Delhi * Seoul * Sydney * Taipei * Toronto
The McGraw-Hill Companies
Computer Architecture: An Embedded Approach
Copyright © 2011 by McGraw-Hill Education (Asia). All rights reserved. No part of this
publication may be reproduced or distributed in any form or by any means, or stored in
a database or retrieval system without the prior written permission of the publisher,
including, but not limited to, in any network or other electronic storage or transmission,
or broadcast for distance learning.
Printed in Singapore
Contents
List of Boxes
Preface
Acknowledgments
Walk Through
Chapter 1: Introduction
1.1 Book Organisation
1.2 Evolution
1.3 Computer Generations
1.3.1 First Generation
1.3.2 Second Generation
1.3.3 Third Generation
1.3.4 Fourth Generation
1.3.5 Fifth Generation
Chapter 2: Foundations
2.1 Computer Organisation
2.1.1 Flynn's Classification
2.1.2 Connection Arrangements
2.1.3 Layered View of Computer Organisation
2.2 Computer Fundamentals
2.3 Number Formats
2.3.1 Unsigned Binary
2.3.2 Sign-Magnitude
2.3.3 One's Complement
2.3.4 Two's Complement
2.3.5 Excess-n
2.3.6 Binary-Coded Decimal
2.3.7 Fractional Notation
2.3.8 Sign Extension
2.4 Arithmetic 29
2.4.1 Addition 29
2.4.2 The Parallel Carry-Propagate Adder 29
2.4.3 Carry Look-Ahead 30
2.4.4 Subtraction 30
2.5 Multiplication 34
2.5.1 Repeated Addition 34
2.5.2 Partial Products 35
2.5.3 Shift-Add Method 38
2.5.4 Booth and Robertson's Methods 38
2.6 Division 41
2.6.1 Repeated Subtraction 41
2.7 Working with Fractional Number Formats 43
2.7.1 Arithmetic with Fractional Numbers 44
2.7.2 Multiplication and Division of Fractional Numbers 45
2.8 Floating Point 46
2.8.1 Generalised Floating Point 46
2.8.2 IEEE754 Floating Point 46
2.8.3 IEEE754 Modes 47
2.8.4 IEEE754 Number Ranges 51
2.9 Floating Point Processing 54
2.9.1 Addition and Subtraction of IEEE754 Numbers 55
2.9.2 Multiplication and Division of IEEE754 Numbers 56
2.9.3 IEEE754 Intermediate Formats 56
2.9.4 Rounding 60
2.10 Summary 60

Appendix F: Tools for Compiling and Assembling Code for TinyCPU 484
F.1 Introduction 484
F.2 The Assembly Process 484
F.3 The Assembler 485
F.4 Example Program Assembly 488
F.5 The Compiler 489
F.6 Summary 490

Index 491
List of Boxes
Preface
There are a great many computer architecture texts in print at any one
time. Many famous authors have tried their hands at writing in this area,
however, computers constitute a rapidly advancing and fluid field, so
few books can hope to keep up without constant revisions. Above all,
the rapidity of the shift towards embedded computing systems has left
many authors, and texts, foundering in the wake. Some texts persist in
regarding computers in the same light as the room-sized machines of the
1950s and 1960s. Many more view computers in the light of the desktop and
server machines of the 1980s and 1990s. A handful acknowledge that the
vast majority of computers in modern use are embedded within everyday
objects. Few acknowledge that the future is embedded: there will come a
time when the concept of a desktop computer seems as anachronistic as
the punched card machines of 50 years ago.
This text is facing squarely towards the embedded future. Topics re-
lated to embedded processors are handled alongside the more traditional
topics of other texts and, wherever possible, examples from the embedded
world are highlighted.
The target audience for this book consists of three groups of people.
Firstly, undergraduate students of computer architecture-related courses,
typically those in their third year. Secondly, master’s level students re-
quiring a refresher in computer architecture before embarking on a more
in-depth study. Thirdly, industrial engineers. As reconfigurable logic cir-
cuits, especially FPGAs (field programmable gate arrays) are becoming
larger, faster and cheaper, there is increasing interest in soft-core comput-
ers — that is CPUs designed by engineers for specific tasks. For perhaps
the first time in history, these tools allow ordinary engineers the opportu-
nity to design and build their own custom computers. Digesting this text
will provide engineers with a solid platform of knowledge to understand
the traditional and contemporary techniques and trade-offs in computer
architecture — the art of computer design.
This text has been written from the bottom up without basing it on
an existing book. This allows it to avoid many of the historical blind al-
leys and irrelevant side shows in computer evolution, leading to a more
precisely defined focus. This is not just a computer architecture book with
an extra chapter on embedded systems. It is a fresh and integrated look at the computer
architecture of today, which is built upon the foundation and history of bigger and older
machines, but which is definitely driving towards greater levels of integration within
embedded systems.
This book aims to be an easy-access and readable text. Plenty of diagrams are
given to explain tricky concepts, and many explanatory boxes are provided throughout,
containing extra worked examples, interesting snippets of information and additional
explanations to augment the main text. Apart from covering all of the main items in
the typical computer architecture theory curriculum that are of relevance to embedded
engineers (but excluding tape storage, Winchester drives and supercomputer design),
the book contains a wealth of practical information for the target audience — even the
opportunity to build and test out a custom soft-core processor.
SI units are used throughout the book, including the newer ‘kibibyte’ and ‘mebibyte’
measures for computer memory (explained in Appendix A). Each of the main curricu-
lum chapters includes end-of-chapter problems, with answers available in an instruc-
tor’s manual. All examples, and much more material including recommendations for
further reading, are available on the associated website at www.mheducation.asia/olc/
mcloughlin.
Ian McLoughlin
Acknowledgements
Thanks are due most of all to my patient wife, Kwai Yoke, and children
Wesley and Vanessa for allowing me the time to write this book. Tom Scott,
Benjamin Premkumar, Stefan Lendnal and Adrian Busch gave me plenty
of encouragement at times when I needed it (this text took form over a long
drawn out five-year period). Doug McConnell was an inspiration as was
the late Sir Angus Tait — most of the book was written while I worked as
Principal Engineer in Group Research, Tait Electronics Ltd, Christchurch,
New Zealand. This company is the largest electronics research and devel-
opment company in Oceania, founded 30 years ago by Angus Tait at age
55 — an age at which most people wind down to retirement. Not Angus
Tait: he still went to work every day to guide the company, until he passed
away in August 2007.
Thanks are also reluctantly given to my computer architecture, ad-
vanced computer architecture and computer peripherals students at
Nanyang Technological University (NTU), for asking me difficult ques-
tions, stretching my knowledge and through that motivating me to teach
better. Associate Professor Lee Keok Kee kick-started me into gathering
materials for this book, and I would also like to acknowledge my many
other friends and colleagues in NTU, and also past colleagues in Tait Elec-
tronics Ltd, Simoco, The University of Birmingham, HMGCC and GEC
Hirst Research Centre. Thanks are also due to Gerald Bok and colleagues
at McGraw-Hill, especially to Doreen Ng and the editorial team for their
professionalism and hard work in turning the manuscript into a beautiful
book.
Most importantly, I wish to acknowledge my mother who constantly
encouraged me along the way — not just of writing this book, but through-
out my entire lifetime. Her high expectations led, eventually to my enter-
ing academia, and she has always been most enthusiastic regarding my
forays into writing; thank you Mum. However, above all I want to give
glory to the God who made me, protected me, nurtured me, gave his son
to save me, and will eventually welcome me into His presence. All that I
am, accomplish, obtain and achieve, I owe to Him.
Walk Through

Explanatory boxes containing extra worked examples and interesting snippets of
information to augment the main text. (The sample pages reproduce Box 2.17, a worked
example of Booth's method for -9 x 11, and Box 3.1, 'How the ARM was designed'.)

A wealth of practical information, including the opportunity to build and test out a
custom soft-core processor. (The sample pages from the CPU design chapter introduce
TinyCPU, whose source code and design are used with the kind permission of Professor
Nakano of Hiroshima University.)

Each chapter ends with a set of 20 problems.

Attention is given to industrially-relevant embedded systems and issues relating to
modern microprocessors and system-on-chip devices. (The sample problems cover rate
monotonic scheduling, the choice between parallel-connected and series-connected data
memory storage, and the read cycle timing of the Atmel AT29LV512 flash memory
device shown in Figure 6.9.)

Plenty of diagrams to explain tricky concepts. (The sample page reproduces Box 7.1 on
the configurable I/O pins of the Texas Instruments MSP430F1611.)

Appendices E and F on TinyCPU. (Appendix E, 'Tools for Compiling and Simulating
TinyCPU', presents a simple open source simulation flow based on Icarus Verilog and
the GTKWave waveform viewer; Appendix F, 'Tools for Compiling and Assembling Code
for TinyCPU', presents the assembler in full and briefly discusses the C compiler.)
Introduction
Book Organisation
Computers have evolved a long way: from Charles Babbage’s analytical
machine of 1834 (Figure 1.1 shows a drawing of his difference engine, an
earlier, fully working mathematical processing machine of similar design)
to the supercomputers of today, the story has been one of ever-increasing
processing power, complexity and miniaturisation.
Surprisingly, many techniques of Babbage’s day (as well as the early
electrical computers of the 1940s) can still be found in today’s systems,
demonstrating the amazing foresight of those early pioneers. Unfortu-
nately, these links with the past are not always positive — today’s Intel
desktop processors contain performance-limiting evolutionary throw-
backs to the 8086 processor and beyond. With the benefit of hindsight,
we have the opportunity to look back through computing history, and
identify many short-lived evolutionary branches that seemed, at the time,
to be promising paths to future progress, but which quickly disappeared.
Sometimes these may reappear years later in specialised machines, but
more often they are little more than historical curiosities.
What seems likely then is that the computers of tomorrow will be built
on the techniques used in those of today. A snapshot of current techniques
(as any computing text has to be) needs to recognise this fact, rather than
presenting the technology as being set in stone.
This book will loosely follow the evolutionary trend. Early chapters
will focus on computer fundamentals. Mastery of these fundamentals will
allow a student to construct a working computer on paper, however slow
and inefficient their design might be if constructed. These early chapters
will be followed by a consideration of the architectural speed-ups and ad-
vanced techniques in use today. These are separated from the fundamen-
tals because some of them may turn out to be the current ‘evolutionary
blind alleys’, but nevertheless they are some of the techniques currently
driving Moore’s Law so quickly forward.
Every now and then something completely revolutionary happens
in computer architecture — these break the evolutionary trend and consign
many past techniques that gave incremental performance increases,
to oblivion. Without a crystal ball this book will not attempt to identify these technolo-
gies, but that will not prevent us from making an informed guess, in the final chapter,
about advanced technologies which may spark a revolution in the field of computing
over the next few decades.
Evolution
The concept of evolution of animals is controversial: to date there has been no scien-
tific proof of the theory, yet many choose to believe in it. Some prefer a ‘wait and see’
approach, hoping that science will eventually catch up, while others choose to believe
in an all-powerful yet unseen creator. Moving away from animals, to man-made
devices, the fact that computers have followed an evolutionary path of improvement is
quite obvious and unquestioned. While there have been rare disruptive breakthroughs,
computing history is full of many small incremental improvements over the years.
Of course, something as complex as a computer requires an intelligent engineer to
have designed it. We can often identify the engineers by name, especially those who
have made significant improvements (a few of them are still alive today to tell us about
it). Furthermore, the design and history of the pioneering machines, often constructed
at great expense, should have been very well documented.
So in computing, one would expect the history of development to be very definite;
there should be little confusion and controversy regarding the pioneering machines
from half a century ago. Unfortunately, that is not the case: there exists a very wide
range of opinions, with little agreement upon exact dates, contributions and ‘firsts’.
Figure 1.2: One of ten Colossus computers in use during the Second World War (courtesy
of the Bletchley Park Trust: www.bletchleypark.org.uk).
Just pick up any two books on computer architecture or computer history and compare
them. For our present purposes, we will begin the modern era of computing with the
invisible giant, Colossus.
Colossus (shown in Figure 1.2), built by engineer Tommy Flowers in 1943 and pro-
grammed by Alan Turing and colleagues in Bletchley Park, is now generally credited
with being the world’s first programmable electronic computer. This was built in Eng-
land during the Second World War as part of the (ultimately successful) code-breaking
effort against the German Enigma code. Unfortunately, Colossus fell under the British
Official Secrets Act and remained hidden for 50 years. All papers relating to it were
ordered destroyed after the war, when Prime Minister Winston Churchill (with a typi-
cally descriptive — although secret — pronouncement) ordered the machines to be ‘bro-
ken into pieces no larger than a man’s hand’. Plans and schematics were burned by the
designers and its codebreaker operators sworn to secrecy under peril of imprisonment,
or worse, for treason.
The action to hide this machine was successful. Despite the occasional unverified
rumour over the years, the existence of Colossus was only revealed publicly when
the few remaining documents were de-classified in the year 2000 and a government
report containing the information was released. For this reason, Colossus is not even
mentioned in many descriptions of computer history: an entire generation of computer
architects had never even heard about it.
However, there were other very well-known and reported machines of similar
vintage to Colossus that began operation in the years that followed. One of the most
famous, ENIAC (Electronic Numerical Integrator And Computer), was commissioned
and built in the USA. While Colossus remained totally hidden, ENIAC, operational by
1944, apparently snapped up worldwide patents to digital computing devices. Many
textbook authors, not knowing anything about Colossus, have hailed ENIAC as the first
modern computer. In fact, apart from being operational earlier, Colossus, being binary,
was more like today’s computers than ENIAC, which was decimal. However, neither
were easily reprogrammable, requiring adjustments to switch settings and change wire
plug positions, respectively.
Amazingly, Charles Babbage’s analytical engine of over a century earlier, being
digital rather than analogue and fully programmable, was in some ways more advanced
than these first electronic computers. Babbage even designed a printer peripheral that
could literally ‘write out’ the results of numerical computations. Babbage’s machine
also had a full programming language that could handle loops and conditional branch-
ing. This led Babbage’s friend, Ada Byron, Countess of Lovelace (the child of famous
poet Lord Byron), who worked on the machine, to write the world’s first computer
program. Possibly the first and last time in history that poetry and programming came
together.
Between the difference engine and Colossus, the computing field was not totally
deserted: German Konrad Zuse had an electrical computer working around 1940/1941,
based on relays (therefore classified as electrical rather than electronic). Another cred-
itable early attempt at building an electronic computer was the construction of the
Atanasoff-Berry machine at Iowa State College, USA in 1941. Although not program-
mable and plagued by unreliability, this demonstrated several early concepts and
undoubtedly advanced the state of the art in computing.
The advent of the transistorised computer is a similar area of confusion. The
transistor, invented at Bell Labs, USA in 1948, was low power and small-—ideal character-
istics for building a computer (although the early transistors were somewhat less reliable
than valves'). The first transistor-based machine was actually Manchester University’s
Transistor Computer running in 1953, although several texts again mis-attribute this
honour to the TX-0 at Massachusetts Institute of Technology, USA in 1956.
Finally, confusion reigns over the first stored-program computer (as opposed to the
ones programmed by plugging wires in different holes or flipping switches). This was
probably Manchester University’s Small-Scale Experimental Machine or SSEM (known
affectionately as the ‘Baby’), which successfully ran a stored program in 1948.
Another early stored-program computer, Maurice Wilkes’ EDSAC (Electronic De-
lay Storage Automatic Calculator), began operation at Cambridge University in May
1949. The equally famous US Army EDVAC (Electronic Discrete Variable Automatic
Computer) machine it was also a stored-program binary device of the same era, al-
though it was not operational until 1951-1952 (despite construction starting in 1944).
¹ Glass thermionic valves containing tiny filament electrodes in a partial vacuum were the basic logic
switches used in most early computers. Valves are known as ‘vacuum tubes’ or simply ‘tubes’ in
North America. Interestingly, although they are now defunct in computing, today they are
sought-after items for very high-end audio amplification equipment.
Table 1.1
Clearly then, given the three areas of confusion, the history of computers is not
as straightforward as it seems. Manchester University played a prominent but very
low-key role and has been overlooked by many computer historians. Manchester also
produced the world’s first commercial computer, the Ferranti Mark 1 in 1951,² but
ultimately, the computer business became centred elsewhere.
Table 1.1 identifies a handful of world firsts in computing, along with the year
they were reported to have become operational.
The table shows the progression in computer technology and goes a long way
towards explaining how today’s computer is very much evolutionary rather than rev-
olutionary, although one wonders what happened to the 1960s.
Computer Generations
Sometimes computers, just like humans, are described in terms of their generation.
This is a classification built up over the years, based mostly around the construction
method, computing logic devices and usage of computers.
Anyone who saw computer magazine advertisements in the 1980s may remember
how manufacturers cashed in on these generations and repeatedly advertised new
products as fifth generation. Thankfully this practice has abated, and it seems that, in
terms of generations at least, the computing world is going through a plateau at the
moment. In the following sections, we will examine the five generations of computers.

² The Ferranti Mark 1 was followed closely by the LEO computer (which was derived from EDSAC),
running accounting programs for the ubiquitous Lyons Tea Houses from Spring 1951 onwards.

First Generation
The best known example, the ENIAC, consumed over 100 kW of power yet could
only deliver around 500 additions per second. This monster used 1800 valves, weighed
30 tonnes and occupied 1300 square metres. The user interface (typical for machines
of this generation) is shown in Figure 1.3. ENIAC was designed by the US Army for
solving ballistic equations as a means of calculating artillery firing tables.
The Colossus computer was equally vast and was dedicated — at least in its early
years — to code breaking: number crunching that broke the powerful and secret Enigma
code, contributing to the Allied victory in the Second World War. However, it is sad
that one of the first German messages decoded was something like ‘we’re going to
bomb Coventry’. Not wanting to alert the enemy that the code had been cracked, the
government decided not to warn the inhabitants, many of whom were later killed or
injured as the bombs rained down over that city.
Figure 1.3: Two women operating the ENIAC’s main control panel (US Army photo).

Second Generation
The CDC6000 of the time was renowned for its intelligent peripherals. But it
is another example, the PDP-1 with 4Ki words of RAM running at up to 0.2 MHz,
that is perhaps the best known. This remarkable machine led the now sadly defunct
Digital Equipment Corporation (DEC) to prominence. The PDP-1 was available at a
price tag of around US$100k, but had available an impressive array of peripherals:
light pen, EYEBALL digital camera, quadrophonic sound output, telephone interface,
several disc storage devices, a printer, keyboard interface and a console display. The
PDP-1 with several of its peripherals are shown occupying almost an entire room in
Figure 1.4.
Figure 1.4: The PDP-1 (photograph courtesy of Lawrence Livermore National Laboratory and
found on www.computer-history.info).

Third Generation
The very popular and versatile IBM System/360 boasted up to 512 kibibytes of 8-bit
memory and ran at 4 MHz. It was a register-based computer with a pipelined central
processing unit (CPU) architecture and memory access scheme that would probably
appear familiar to programmers today. IBM constructed many variants of the basic
machine for different users, and most importantly opted for a microcode design that
could easily emulate other instruction sets: this guaranteed backwards compatibility
for users of second generation computers (users who had invested very significant
sums of money in their machines). Modified and miniaturised, five of these computers
perform number crunching in the NASA space shuttles.
Although not quite room-sized, the basic S/360 was still a physically large device
as Figure 1.5 illustrates.

Figure 1.5: The IBM System/360.

Fourth Generation

Examples are profuse, including all desktop and notebook computers. The Phoebe,
a culmination of Acorn’s innovative RISC-based architecture and advanced windowing
operating system, is shown in Figure 1.6. Sadly, the company did not survive long
enough to market this machine — perhaps a consequence of making the machine bright
yellow. Apple, by contrast, displayed more marketing genius by originally releasing
their 333 MHz iMac with a choice of five flavours (colours), although more recently
they have reverted to an all-white, black or aluminium product line-up (some of the
newer range of iMacs are shown in Figure 1.7).
Figure 1.7: The Apple iMac range: stylish and user-friendly machines running a reliable
UNIX-based operating system (photograph courtesy of Apple).

Fifth Generation
There are no confirmed examples at the time of writing. When such examples
arrive, it is quite possible that there will be nothing worth photographing: hundreds
of tiny embedded computers distributed around us and not a beige (or yellow) box in
sight.
Not really fifth generation, but the selection of the desirable and well-engineered
Apple iMac computers (see Figure 1.7) may indicate the future: stylish and user-centric
machines. Or, perhaps it is Apple’s smaller but equally desirable iPhone (shown in
Figure 1.9), reputed to contain eight separate ARM processor cores, or their equally
impressive iPad, that will herald the coming of the fifth generation?
Where To From Here?
The process of miniaturisation is set to continue. More and more products, devices and
systems contain embedded computers and there is no sign that this trend will die out.
Computer speeds also will continue to increase. After all, there is a pretty amazing
track record to this: consider the numbers in Table 1.2, showing how computers have
progressed in speed since the earliest days — remembering of course that the various
Pause for a moment and consider the sheer magnitude of this progress. In almost
no other sphere of life can we see such an incredible, and sustained, performance
improvement. Given this track record, we can probably safely leave the miniaturisation
and performance improvement process to major industry players such as ARM, Intel
and AMD.
Table 1.2
The amazing progression of computer calculating speeds from the earliest days (data provided
courtesy of Professor Jack Dongarra, University of Tennessee, USA).

Year    Calculating speed (FLOPS)
1941    1
1945    100
1949    1000 (1 KiloFLOPS, kFLOPS)
1951    10,000
1961    100,000
1964    1,000,000 (1 MegaFLOPS, MFLOPS)
1968    10,000,000
1975    100,000,000
1987    1,000,000,000 (1 GigaFLOPS, GFLOPS)
1992    10,000,000,000
1993    100,000,000,000
2000    10,000,000,000,000
Or can we? Despite the miniaturisation, we have seen that (super) computers are
getting bigger — and more power hungry. Parallel computing has emerged as the main
technique of choice in building the world’s fastest computers. The days of a central
computer facility, the mainframe, could well be returning. The difference being that the
mainframe may now be located in a different country to its users, with mixed wireless
and Internet accessibility to those users. Perhaps the mainframes should be located in
cold countries where excess heat can go towards warming nearby homes?
Since the technology to separate bulk computing from the point at which that
computer power is needed mostly exists today, and with the possible exception of
wireless connectivity, is now mature, the controlling factors in the continued advance
of this model are services and software.
However, this does not mean that it is time to abandon the advance and improve-
ment of computers and their architecture (which would mean you can stop reading
here), but it does mean that the focus may change. From big and powerful to small
and low power. From large-scale number crunching to embedded and application
specific.
Returning to the educational aims of this book for a moment, engineers work-
ing on computer systems have traditionally asked questions such as ‘what processor
shall I use in my system?’ and ‘how do I get this processor to work in my system?’
This book provides the background necessary to enable answers to be found to both
of these questions. In addition, it allows new questions to be asked, and answered,
such as: ‘Should I create a new processor specifically for my system, and if so, how?’
or ‘Should I use a simple CPU and connect to a remote server, or do all processing
internally?’
That computing is now primarily an embedded engineering discipline, despite the
existence of many huge supercomputers like the MareNostrum, is due to the pervasive-
ness of computer technology within embedded and consumer devices. Consider the
case of the iPhone, shown in Figure 1.9, which reportedly contains something like nine
separate microprocessors, with eight of them ARM-based. So in answer to the question
of where to from here, we can predict two ongoing trends: towards fewer but bigger
clusters of large computers, and towards more and smaller personalised computing
devices.
Also, it would probably help your career prospects to learn a little about the ubiq-
uitous ARM along the way.
Summary
You, the reader, may not build the world’s fastest supercomputer (or maybe you will,
who knows?), but hopefully you will be designing or programming some amazing
embedded systems in future.
This chapter has presented a historical perspective of computing: relentless
forward progress, many huge leaps in technology and understanding, but millions
of small incremental improvements. Isaac Newton famously remarked in a letter to
his rival Robert Hooke that, ‘if I have seen further it is by standing on ye shoulders of
Giants’.
This could not be more true of most computer designers. You cannot really get
closer to standing on the shoulders of giants than when you use an existing computer
to design the next one!
With this perspective behind you, and confident of ongoing future progress in this
field, it is now time to learn the techniques (and some secrets) from the designers of
the computing systems of the past few decades. The following chapters will begin this
process by covering basic and foundational techniques, before considering speed-ups
and performance enhancing techniques of computers — whether desktop machines or
embedded systems. Later, we will spend more time investigating embedded systems
themselves, even taking the opportunity to build our own embedded CPU. Finally,
we will look further into the future to try and identify some promising, but unusual,
techniques on the horizon of the computing world.
Foundations
Computer Organisation
What does a computer consist of? How are the elements connected? In
order to answer these questions, we need to first recognise that there
exists a vast range of possibilities inherent in the structure of a computer.
Looking at some of today’s desktop computers many of the peripheral
elements traditionally connected around a CPU are subsumed within the
same Integrated Circuit (IC) package; this would not be recognisable as
a computer to the early pioneers. However the main, historic, computer
elements are usually still present — even if they are not at first immediately
identifiable. In embedded systems the trend is more apparent — system-
on-chip (SoC) processors that integrate almost all required functions on a
single chip are now predominant.
Secondly, not all computers are organised in the same way, or have the
same requirements. After all, they could range in size from a room-sized
supercomputer, to a wristwatch-based personal digital assistant (PDA) or
smaller.
Despite the range of possibilities, most systems comprise functional
blocks with a degree of similarity. The placement of these blocks inside or
outside the CPU chip is a design or cost consideration, and the intercon-
nections between them (both internal and external) are generally parallel
buses, the width and speed of which are also design or cost considerations.
There may be multiple copies of each functional block present or mul-
tiple interconnections between some blocks.
With such variety, there is a need to classify the range of architec-
tural possibilities in some way. It was Michael Flynn who first devised a
comprehensive classification scheme for describing such systems in 1966. O
Figure 2.1

° Single instruction, single data stream (SISD) - A single instruction stream acting
on a single data stream (shown in this case as one instruction acting upon one item
of data).
° Single instruction, multiple data streams (SIMD) - A single instruction stream
acting on more than one item of data. For example, given the numbers 4, 5 and
3, 2, a single instruction to perform two separate additions of 4+ 5 and 3 + 2
would be SIMD. An example of this arrangement is an array or vector process-
ing system which can perform identical operations on different data items in
parallel.
¢ Multiple instruction, single data stream (MISD) -— A rare combination of overspec-
ified multiple instructions acting on a single data stream. This redundancy could
possibly be useful in fault-tolerant systems.
e Multiple instruction, multiple data streams (MIMD) — These systems are arranged
similarly to multiple SISD systems. In fact, a common example of an MIMD system
is a multi-processor computer such as the Sun Enterprise servers.
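To make the SIMD category above concrete, the short C sketch below uses the GCC
vector extension so that a single add expression is applied to four data elements at
once. This is an illustration only and is not taken from the original text: the vector type
v4si and the surrounding code assume a GCC-compatible compiler.

    /* SISD versus SIMD, sketched with the GCC vector extension.
       Illustrative only; the names here are not from the book. */
    #include <stdio.h>

    typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit ints */

    int main(void)
    {
        int a[4] = {4, 3, 7, 1}, b[4] = {5, 2, 6, 8}, c[4];

        /* SISD view: one instruction operates on one item of data at a time */
        for (int i = 0; i < 4; i++)
            c[i] = a[i] + b[i];

        /* SIMD view: one add expression covers four data elements, which the
           compiler is free to turn into a single vector add instruction */
        v4si va = {4, 3, 7, 1}, vb = {5, 2, 6, 8};
        v4si vc = va + vb;

        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
        printf("%d %d %d %d\n", vc[0], vc[1], vc[2], vc[3]);
        return 0;
    }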
¢ Von Neumann systems are those that share resources for storage and transfer of
data and instructions. Many modern computers fall into this category by virtue of
storing programs and data in shared memory, and using a single bus to transfer
them from memory to the CPU. Shared bus bandwidth tends to mean that such a
system has limited performance, but its advantages are simpler design and lower
cost.
e Harvard architecture systems have separate data and instruction storage and trans-
fer. Since instruction and data transfer can be simultaneous, such systems can offer
high performance.
¢ Other architectures include systems with multiple dedicated buses (such as the
ADSP2181 internal buses), shared data/instruction address bus but separate data
buses or similar. Chapter 4 will introduce and explain internal bus arrangements
further.
Some CPUs such as the DEC/Intel StrongARM are advertised as being Harvard ar-
chitecture, although they interface to shared memory via a single bus. In this case, the
StrongARM is a Harvard architecture internally because it contains separate blocks of
internal data and instruction cache memory, although it has an external von Neumann
connection arrangement.

Layered view of computer organisation:
    translation through compilation
    translation through assembly
    BIOS calls, OS APIs, SWIs
    CPU microarchitecture
    hardware execution
° Central processing unit (CPU) - The part of a computer that controls operation
through interpretation of instructions and through built-in behaviour. It handles
input/output functions and performs arithmetical and logical operations on data
(in other words, contains an ALU). In recent times, CPU has begun to refer to a
physical IC which, in some cases, actually contains all parts necessary to function
as a standalone computer.
e Arithmetic logic unit (ALU) — This component of the CPU performs simple arith-
metic and logical operations such as add, subtract, AND, OR. It is an asynchronous
unit which takes two data inputs from parallel connected registers or bus(es) and
outputs either direct to a register or is connected through a tristate buffer to a bus.
In addition, it has a control input to select which function to perform, and interfaces
to a status register. It handles fixed point binary (and occasionally BCD) numbers
only and is located on-chip in modern processors.
¢ Floating point unit (FPU)— Either an on-chip or an external co-processor, it per-
forms arithmetic on floating point numbers. The particular floating point format
supported in most modern FPUs is called IEEE754. It is usually comparatively
slow (can take tens or hundreds of instruction cycles to perform a calculation) and
its interface is to the main CPU through special floating point registers.
¢ Memory management unit (MMU) — This component provides a layer of abstrac-
tion between how the processor addresses memory and how that memory is phys-
ically arranged. This abstraction is termed virtual memory. The MMU translates
a virtual address that the processor needs to access into a real physical address in
memory. The processor typically sees a large linear continuous address space in
memory, with the MMU hiding a physical memory organisation which may be of
different sizes (larger or smaller), non-continuous or consisting partly of RAM and
partly of hard disc storage.
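The address translation described for the MMU can be sketched in a few lines of C.
This is a minimal model only, assuming a single-level page table and 4 KiB pages; the
names page_table, PAGE_SHIFT and translate are illustrative and not from the book.

    /* Minimal sketch of MMU-style virtual-to-physical address translation,
       assuming a single-level page table and 4 KiB pages. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                 /* 4 KiB pages: low 12 bits are the offset */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_PAGES  16                 /* a tiny 64 KiB virtual space */

    /* page_table[v] holds the physical frame number for virtual page v */
    static uint32_t page_table[NUM_PAGES] = { 7, 3, 0, 9 };

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpage  = vaddr >> PAGE_SHIFT;       /* which virtual page?   */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* position in that page */
        uint32_t frame  = page_table[vpage];         /* physical frame number */
        return (frame << PAGE_SHIFT) | offset;       /* physical address      */
    }

    int main(void)
    {
        uint32_t va = 0x1234;                        /* virtual page 1, offset 0x234 */
        printf("virtual 0x%04x -> physical 0x%04x\n", va, translate(va));
        return 0;
    }

A real MMU also handles protection bits, multi-level tables and pages that are not
resident in memory, but the page-number and offset arithmetic is exactly as above.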
In addition, there are a number of items that we will include in our discussion that are
useful to define now, prior to being covered in detail later:
¢ Register - On-chip’ storage locations that are directly wired to internal CPU buses
to allow extremely fast access (often in one instruction cycle). The distinction blurs
between this and on-chip memory for some CPUs and the stack in the picoJava II
processor.
e Tristate buffer — A device to enable or disable driving a bus. It is usually placed
between a register and a bus to control when the bus will be driven by that register.
The first two states are when the tristate drives the bus voltage to be either logic
high or logic low; the third (tri-) state is high impedance, meaning that the device
does not drive the bus at all.

¹ Originally, these were separate hardware devices, but are now exclusively incorporated on-chip for
convenience and access speed reasons.
¢ Complex Instruction Set Computer (CISC) — Think of any useful operation and
directly insert this into the CPU hardware. Do not worry how big, power hungry
or slow this will make the CPU; you will end up with a CISC machine. Early VAX
machines reputedly included instructions that could take over 2000 clock cycles to
execute.
e Reduced Instruction Set Computer (RISC) - CPUs are limited by their slowest
internal components and by silicon size. Based on the premise that 80% of instruc-
tions use only 20% execution time and the remaining 20% use up 80% of the chip
area, CPUs are reduced to contain the 80% most useful instructions. Sometimes a
working definition of RISC means ‘supporting a set of less than 100 instructions’.
It is also significant to note an emerging trend where a RISC CPU core emulates a
CISC machine.
¢ Instruction cycle — This refers to the time taken to fetch an instruction, decode
it, process it and return the result. This may be one or more periods of the main
clock cycle (derived from an external oscillator). For RISC processors, instructions
typically execute in a single clock cycle. For CISC processors, some instructions
take a lot longer.
¢ Big or little endian — Big endian means that the most significant byte is presented
first. It is used in processors such as 68000 and SPARC. Little endian means that
Box 2.1
Q. Given a 32-bit word stored in a 16-bit architecture memory system as shown below,
and given that the stored word is made up of least significant byte (LSB), second byte
(B1), third byte (B2) and most significant byte (MSB), is the following a little or big
endian representation?
  1  |   MSB   |   B2    |
  0  |   B1    |   LSB   |
     15        8 7       0
In the diagram, the memory line (in 16-bit words) is given on the left, and the bit
positions are shown below.
A. Checking for little endian first, we identify the lowest byte-wise memory address
and count upwards. In this case, the lowest address line is 0 and the lowest byte starts
at bit 0. The next byte up in memory starts at bit 8 and is still at line 0. This is followed
by line 1 bit 0 and finally line 1 bit 8. Counting the contents from lowest byte address
upwards, we get {LSB, B1, B2, MSB}. Since this order DOES follow the least-to-most
byte format it must be little endian.
Box 2.2
Q. A 32-bit word is stored as shown below. Is this a little or big endian representation?

  0  |   LSB   |   B1    |   B2    |   MSB   |
     31       24 23     16 15      8 7       0
A. First identify the lowest byte-wise memory address. This is clearly address line 0,
starting at bit 0. Next is address line 0, bit 8 and so on. Counting from least to most
and writing out the contents we get {MSB, B2, B1, LSB}. This order does NOT fol-
low the least-to-most byte format, so it is not little endian. Therefore it must be big
endian.
the least significant byte is presented first, as used by the Intel x86 family. Some
processors (such as the ARM7) allow for switchable ‘endiness’.
Unfortunately, endiness is complicated by the variable memory-width of
modern computers. It was easier when everything was byte-wide, but now there is
an added dimension of difficulty. Given an unknown system, it is probably easier
to check first whether it is little endian, and if not, classify it as big endian, rather
than working the other way around. Boxes 2.1, 2.2, 2.3 and 2.4 explore this issue in
detail.
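A short C sketch can make the distinction concrete: store a 32-bit value with known
bytes and inspect which byte ends up at the lowest address. This fragment is an
illustration only and is not taken from the book.

    /* Check the endianness of the machine this code runs on. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x0A0B0C0D;            /* MSB = 0x0A, LSB = 0x0D        */
        uint8_t *bytes = (uint8_t *)&word;     /* view the same storage by byte */

        if (bytes[0] == 0x0D)
            printf("little endian: the LSB is at the lowest address\n");
        else
            printf("big endian: the MSB is at the lowest address\n");
        return 0;
    }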
Box 2.3
Q. Given the memory map shown below, write in the boxes the 32-bit number repre-
sented by MSB, B1, B2 and LSB bytes using a little endian representation.

  28  |      |      |      |      |
  24  |      |      |      |      |
  20  |      |      |      |      |

A. Little endian places the least significant byte at the lowest byte address, so counting
upwards from address 20 the boxes are filled as:

  28  |      |      |      |      |
  24  |      |      |      |      |
  20  | LSB  |  B1  |  B2  | MSB  |
Box 2.4
Q. Given the memory map shown below, write in the boxes the 16-bit number repre-
sented by MSB and LSB bytes using a big endian representation.

  50  |       |
  51  |       |
  52  |       |

A. Again, we need to identify which is the lowest byte address in the memory
pictured, and then place the MSB there since we are big endian. In this case, the memory
map is written from top down - a common format from some processor manufacturers.
The top position is the lowest address, and we count downwards. Since memory is
byte-wide, this is relatively easy. The answer is thus:

  50  |  MSB  |
  51  |  LSB  |
  52  |       |
Number Formats
In general, the value v of an n-bit binary number x, where x[i] is the i'th bit reading
from the right to the left, starting from bit 0, is

    v = Σ x[i] × 2^i    (summing i from 0 to n−1)
The unsigned binary format is easy for humans to read after a little practice, and
is handled efficiently by computer.
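The formula above maps directly onto a few lines of C. The sketch below (not from the
book) evaluates an n-bit unsigned binary number supplied as an array of bits, with x[0]
being the least significant bit.

    /* Evaluate v = sum of x[i] * 2^i for i = 0 .. n-1. Illustrative only. */
    #include <stdio.h>

    unsigned int binary_value(const int x[], int n)
    {
        unsigned int v = 0;
        for (int i = 0; i < n; i++)
            v += (unsigned int)x[i] << i;      /* x[i] * 2^i */
        return v;
    }

    int main(void)
    {
        int x[8] = {1, 0, 1, 0, 1, 1, 0, 0};   /* bit 0 first: 00110101b */
        printf("%u\n", binary_value(x, 8));    /* prints 53 */
        return 0;
    }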
2.3.2 Sign-Magnitude
This format reserves the most significant bit (MSB) to convey polarity (called the ‘sign
bit’), and then uses unsigned binary notation for its remaining least significant bits to
convey magnitude. By convention, an MSB of 0 indicates a positive number while an
MSB of 1 indicates a negative number.
For example, the 4-bit sign-magnitude number 1001 is −1 and the 8-bit number
10001111b is equivalent to −(8 + 4 + 2 + 1) = −15 decimal.
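As an aside not found in the original text, decoding a sign-magnitude byte in C is a
matter of masking off the sign bit; the helper name below is illustrative.

    /* Decode an 8-bit sign-magnitude value: bit 7 is the sign, bits 6..0 the
       magnitude. Illustrative sketch only. */
    #include <stdint.h>
    #include <stdio.h>

    int sign_magnitude_decode(uint8_t x)
    {
        int magnitude = x & 0x7F;                    /* lower seven bits       */
        return (x & 0x80) ? -magnitude : magnitude;  /* bit 7 selects the sign */
    }

    int main(void)
    {
        printf("%d\n", sign_magnitude_decode(0x8F)); /* 10001111b -> -15 */
        printf("%d\n", sign_magnitude_decode(0x0F)); /* 00001111b -> +15 */
        return 0;
    }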
2.3.5 Excess-n
This representation will crop up later when we discuss floating point. In this for-
mat, a number v is stored as the unsigned binary value v + n.
Box 2.6
Negative two’s complement numbers can be easily formed in practice by taking the
one’s complement of the binary magnitude then adding 1. As an example, suppose
we wish to write −44 in 8-bit two’s complement:

    +44 in unsigned binary:       0010 1100
    one's complement (invert):    1101 0011
    add 1:                        1101 0100   (= −44)

If you are not used to writing binary numbers, try to write them in groups of 4. That
way it is easier to line up the columns, and it aids in the conversion to hexadecimal
(since a group of 4 bits corresponds to a single hex digit).
Box 2.7
Q1. Write the decimal value 23 as a two’s complement 8-bit binary number.
A1. We can start by drawing the bit weightings of an 8-bit two’s complement number.
Starting from the left, we begin with the sign bit.

    -128   64   32   16   8   4   2   1

The sign bit is only set if the number we want to write is negative. In this case, it is
positive so write a zero there. Next we look at 64. If our number is greater than 64 we
would write a ‘1’ here, but it is not so we write a zero. The same goes for 32, so now
we have:

     0    0    0    .    .    .   .   .

Moving on to 16, we find that our number (23) is bigger than 16, and so we subtract 16
from the number to leave 23 − 16 = 7. A ‘1’ goes in the 16 box.
Next, we compare our remainder with 8. The remainder is smaller so a ‘0’ goes in
the 8 box. Moving on to 4, our remainder is bigger than this so we subtract 4 to make
a new remainder 7 − 4 = 3 and write a ‘1’ in the 4 box. Continuing with 2 and 1, both
get ‘1’s in their boxes. The final answer is thus:

     0    0    0    1    0    1   1   1
Q2. Write the decimal value −100 as a two’s complement 8-bit binary number.
A2. This time the number is negative, so a ‘1’ goes in the −128 box, leaving a
remainder of 28. The rest of the numbers act as normal — a ‘0’ in the 64 box, a ‘0’ in the
32 box, then a ‘1’ in the 16 box. The remainder will then be 28 − 16 = 12. Continuing,
there will be ‘1’ in the 8 box, remainder 4, then a ‘1’ in the 4 box and ‘0’s beyond
that:

     1    0    0    1    1    1   0   0
Note: The only really easy things to see, at a glance, about two’s complement numbers
are whether they are negative or not (a ‘1’ in the most significant position) and whether
they are odd or not (a ‘1’ in the least significant position).
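The procedure worked through in Box 2.7 can be checked mechanically. The sketch
below is illustrative rather than taken from the book: it prints the 8-bit two's
complement pattern of a small signed value by reading off the bit weightings -128,
64, 32, 16, 8, 4, 2, 1.

    /* Print the 8-bit two's complement representation of a value in the range
       -128..127 using the bit weightings. Illustrative only. */
    #include <stdio.h>

    void print_twos_complement8(int value)
    {
        int remainder = value;
        int weights[8] = {-128, 64, 32, 16, 8, 4, 2, 1};

        for (int i = 0; i < 8; i++) {
            /* The sign weight is negative, so it is taken when the value is
               negative; every other weight is taken when it fits the remainder. */
            int take = (i == 0) ? (remainder < 0) : (remainder >= weights[i]);
            putchar(take ? '1' : '0');
            if (take)
                remainder -= weights[i];
        }
        printf("  (%d)\n", value);
    }

    int main(void)
    {
        print_twos_complement8(23);     /* 00010111, as in Box 2.7 */
        print_twos_complement8(-100);   /* 10011100                */
        print_twos_complement8(-44);    /* 11010100, as in Box 2.6 */
        return 0;
    }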
An example is the excess-127 representation in 8 bits, which can represent any number between −127
and +128 (stored in binary bit-patterns that look like the unsigned values 0 and 255
respectively).
This format can be a little confusing to students. As examples, the 8-bit excess-127
binary number 00000000 equals —127 (which is found by working out the unsigned
binary value, in this case zero, and then subtracting 127 from it). Another example is
11000010 which in binary would be 128 + 64 + 2 = 194, but since it is excess-127 we
subtract 127 from the result to give 194 — 127 = 67 decimal.
Fractional binary numbers are often described as (m.n) format, where m is the number of digits before the imaginary radix (in decimal, the radix is known as the decimal point, but when dealing with another number base we cannot refer to it as a 'decimal' point, so we call it the radix) and n is the number of digits after it.
Box 2.8
Remember that there is nothing special about binary — it is simply a way of writing a number in base 2 instead of the base 10 (decimal) that we are familiar with.
Just as we can write fractional numbers in decimal (such as 9.54) as well as integers (such as 19), we can also write numbers in any other base in fractional as well as integer format. So far we have only considered integer binary format; however, it is also important to realise that fractional binary format is used extensively in areas such as digital signal processing.
Box 2.9
Q: Write the decimal value 12.625 as a (7.9) fractional format two's complement binary number.
A: First, start by looking at the bit weightings of the (7.9) format:

  −64   32   16   8   4   2   1  .  1/2   1/4   1/8   ...

where the weightings below 1/8 have been removed for space reasons. Next, we realise that the number is positive, so there is a '0' in the −64 box. We then scan from left to right in exactly the same way as for a standard two's complement representation (or unsigned binary for that matter), using the weights shown above.
It turns out that 12.625 = 8 + 4 + 0.5 + 0.125 and so the result will be 0001100.101000000, i.e. the 16-bit pattern 0001 1001 0100 0000.
Refer to Box 2.9 for more examples of fractional format numbers in binary.
The beauty of fractional notation applied to unsigned or two's complement numbers is that the values are handled in hardware exactly the same way as the non-fractional equivalents: it is simply a programming abstraction.
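In code, this abstraction simply means keeping the scaled value in an ordinary integer. A hedged C sketch follows (the (7.9) choice and the function names are ours):

    #include <stdint.h>

    /* Store an (m.n) fixed-point value in a plain integer by scaling by 2^n.  */
    /* The hardware only ever sees the integer; the (7.9) reading is ours.     */
    #define FRAC_BITS 9

    int16_t to_fixed_7_9(double value)   { return (int16_t)(value * (1 << FRAC_BITS)); }
    double  from_fixed_7_9(int16_t word) { return (double)word / (1 << FRAC_BITS); }

    /* to_fixed_7_9(12.625) == 6464 == 0x1940 == 0001 1001 0100 0000b,         */
    /* exactly the bit-pattern derived in Box 2.9.                             */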
2.3.8 Sign Extension
This is the name given to the process by which a signed two's complement number of a particular width is extended in width to a larger number of bits, for example, converting an 8-bit number to a 16-bit number. While this is done occasionally as an explicit operation, it is more often required whenever a narrow signed value must take part in wider arithmetic.
Consider the 4-bit two's complement number 1010, which has the value −8 + 2 = −6.
If we were to simply write this 4-bit value as an 8-bit number padded with zeros, as in 00001010, then, referring to the 8-bit weightings, the value that this represents would be

  8 + 2 = 10

This is clearly incorrect. If we were then to note that a negative number requires the sign bit set and responded by simply toggling the sign bit to give 10001010, then the value would become

  −128 + 8 + 2 = −118

This is again incorrect. In fact, in order to achieve the extension from 4 to 8 bits correctly, it is necessary that not only the original MSB must be set correctly, but every additional bit that we have added (every bit to the left of the original MSB) must also be set to the same value as the original MSB. The sign bit has thus been extended to give 11111010, with a value of

  −128 + 64 + 32 + 16 + 8 + 2 = −6

Finally, a correct result is achieved. Another example of sign extension is given in Box 2.10.
There is evidently no difficulty with positive two’s complement numbers, but the
sign extension rule can still be applied (it has no effect, but makes a hardware design
easier if it applies to all numbers rather than just some).
Box 2.10
Q: Write the value −4 in 4-bit two's complement notation. Copy the most significant bit (MSB) four times to the left. Read off the result as an 8-bit two's complement number.
A: −4 is 1100 (−8 + 4 + 0 + 0).
The MSB is 1, so copying this to the left four times gives 11111100.
Reading off in 8-bit signed binary, (−128 + 64 + 32 + 16 + 8 + 4) = −4.
For further thought: Repeat the exercise with a positive number such as 3. Does the method still apply equally for positive numbers?
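Sign extension can also be written in a few lines of C. This is a rough sketch of our own (it assumes 1 <= n <= 32), not code from the text:

    #include <stdint.h>

    /* Sign-extend the low n bits of 'value' to a full 32-bit signed integer by
     * copying the original MSB into every added bit position.                  */
    int32_t sign_extend(uint32_t value, int n)
    {
        uint32_t msb_mask = 1u << (n - 1);
        if (value & msb_mask)                     /* original MSB is 1: negative */
            value |= ~((msb_mask << 1) - 1u);     /* set every bit above the MSB */
        return (int32_t)value;
    }

    /* sign_extend(0xA, 4) == -6   (1010b -> ...11111010b)                       */
    /* sign_extend(0xC, 4) == -4   (1100b -> ...11111100b, as in Box 2.10)       */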
2.4 Arithmetic
This section considers the hardware capable of performing the addition or subtraction
of two binary numbers. This functionality is used within the arithmetic logic unit (ALU)
in almost all processors, which also handles basic logic functions such as AND, OR,
NOT and so on. The ALU is described as a CPU functional unit later in Section 4.2.
2.4.1 Addition
Binary arithmetic is accomplished bitwise with a possible carry from the adjacent less
significant bit calculation. In hardware, a full adder calculates the addition of two bits
and a carry in and generates a result with an additional carry output.
A full adder is shown symbolically in Figure 2.3, where each arrow represents a
single logic bit. A half adder is similar, but does not have any provision for the carry in.
Figure 2.3: A full adder, showing two bits being added, together with a carry in, and the output of a single bit with carry.

Figure: The carry-propagate or ripple-carry adder, constructed from a sequence of full adders plus one half adder, producing result bits z7 to z0.
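To make the bitwise mechanics concrete, here is a small C sketch (ours, not the book's) of a full adder and an n-bit ripple-carry chain built from it:

    /* One full adder: adds bits a and b plus carry-in, returns the sum bit and
     * writes the carry-out through the pointer.                                */
    static int full_adder(int a, int b, int cin, int *cout)
    {
        *cout = (a & b) | (a & cin) | (b & cin);   /* majority function          */
        return a ^ b ^ cin;                        /* sum bit                    */
    }

    /* Ripple-carry addition of two n-bit words, working from the LSB upwards.  */
    unsigned ripple_add(unsigned x, unsigned y, int n, int *carry_out)
    {
        unsigned z = 0;
        int carry = 0;                     /* stage 0 behaves as a half adder    */
        for (int i = 0; i < n; i++) {
            int s = full_adder((x >> i) & 1, (y >> i) & 1, carry, &carry);
            z |= (unsigned)s << i;
        }
        *carry_out = carry;
        return z;
    }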
1. Try working by hand using a 4-bit adder. With 4-bit two’s complement numbers
the representable range is —8 to +7.
2. Try adding some values such as 2+8 =?,2+(—8) =?,7+7 =? and (—8) + (—8) =?
3. What do you conclude about the Cout signal: does it mean the same for signed
two’s complement numbers as it does when adding unsigned numbers?
This behaviour and the add mechanism are common to almost any binary adder.
Although the parallel adder appears to be a relatively efficient structure and even works
in a similar way to a human calculating binary addition by hand (or perhaps using an
abacus), it suffers from a major speed limitation that bars its use in most microprocessor
ALUs: carry propagation.
Given that the input numbers are presented to the adder simultaneously, one mea-
sure of the adder speed is the length of time required to calculate the output. Each full
or half adder in the chain is relatively quick: both the carry out and the result will be
available a few nanoseconds after the carry in and input bits are presented (for modern
hardware). The problem is that the least significant half adder (adder 0) must finish
calculating before the next bit calculation (adder 1) can start. This is because adder 1
needs to get the carry from adder 0 before it can complete its calculation, and that carry
is not valid until adder 0 finishes. Adder 1 then supplies its carry to adder 2 and so on.
Further up the chain, adder 6 will only supply its carry to adder 7 a significant length
of time after the input words were first presented to the adder.
A worked example of calculating an entire ripple-carry adder propagation delay
is presented in Box 2.12. It is important because, if such an adder were present in a
synchronous machine, this propagation delay may well be the part of the system that
limits the maximum system clock speed.
Box 2.12 Worked example
Q: The adders and half adders used in a 4-bit parallel carry-propagate adder are specified as follows:
Time from last input bit (x or y) or carry in to result z: 15ns
Time from last input bit (x or y) or carry in to carry out: 12 ns
If input words x[3:0] and y[3:0] are presented and stable at time 0, how long will it be
before the 4-bit output of the adder is guaranteed stable and correct?
A: Starting from the least significant end of the chain, adder 0 receives stable inputs at
time 0. Its result z is then ready at 15 ns and its carry is ready at 12ns. Adder 1 requires
this carry in order to begin its own calculation, so this only starts at 12 ns. It takes until
24ns before it can provide a correct carry result to adder 2 and this will not provide
a carry to adder 3 until 36ns. Adder 3 then begins its calculation. Its output z is then
ready at 51 ns and its carry out is ready at 48 ns. So even though the adders themselves
are fairly quick, when chained, they require 51 ns to calculate the result.
Note: The phrase ‘begins its calculation’ when applied to the full or half adders may
be misleading. They are actually combinational logic blocks. A change of state at the
input will take some time (up to 15 ns in this case) to propagate through to the output.
Since they are combinational logic, they are always ‘processing’ input data and their
outputs are always active. However, from the specification, we know that the outputs
are only guaranteed correct 15ns or 12 ns after the inputs are correctly presented (for
result z and carry out respectively).
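The worst-case settling time generalises easily; as a small sketch of our own, using the figures from Box 2.12:

    /* Worst-case settling time of an n-stage ripple-carry adder, given the
     * carry-in -> carry-out delay and the carry-in -> sum delay of each stage.  */
    double ripple_adder_delay_ns(int n, double t_carry, double t_sum)
    {
        /* The carry must ripple through the first n-1 stages, then the last    */
        /* stage still needs t_sum to produce its result bit.                   */
        return (n - 1) * t_carry + t_sum;
    }

    /* ripple_adder_delay_ns(4, 12.0, 15.0) == 51.0, as in Box 2.12.            */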
2.4.4 Subtraction
In a subtraction, successive bit calculations are now linked through 'borrows' from higher bits, rather than 'carries' from lower bits. This is problematic in the same way as addition.
In terms of computational hardware, a specialised subtracter would be required if it were not for the fact that addition and subtraction can be interchanged in many cases: subtracting y is the same as adding the two's complement of y, which can be formed by inverting every bit of y and adding 1 through the carry input.
Figure: Subtraction logic consisting basically of an adder with external exclusive-OR gates, in which a single add/subtract control wire drives the XOR gates on the n-bit y[0..n−1] bus and the adder carry-in, with Cout produced as usual.
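The XOR trick in the figure can be mirrored in C. The sketch below uses our own naming: when the add/subtract line is 1, each y bit is inverted and the carry-in is forced to 1, which adds the two's complement of y.

    #include <stdint.h>

    /* Combined 8-bit adder/subtracter in the style of the figure.              */
    uint8_t add_sub_8bit(uint8_t x, uint8_t y, int subtract, int *carry_out)
    {
        uint8_t y_xored = subtract ? (uint8_t)~y : y;   /* the XOR gates        */
        unsigned wide = (unsigned)x + y_xored + (subtract ? 1u : 0u);
        *carry_out = (wide >> 8) & 1u;
        return (uint8_t)wide;
    }

    /* add_sub_8bit(5, 3, 1, &c) returns 2 with c == 1, i.e. 5 - 3 = 2.         */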
There is one further area of subtraction that needs to be explored, and that is
overflow: when performing an addition, you will recall that the topmost Cout can
be used to indicate an overflow condition. This is no longer true when performing
subtractions as some examples on 4-bit two’s complement numbers will reveal:
  0010 + 1110 = ?            2 + (−2) = ?
  0010 + 1110 = 0000, carry out = 1

Clearly, the result should be an easily-represented zero, and yet the Cout signal is set. Consider another example where we would normally expect an overflow:

  0111 + 0110 = ?            7 + 6 = ?
  0111 + 0110 = 1101         Answer = −3?
Again, the result should not be —3, it should be 13. Evidently, the circuitry shown is
not sufficient alone, and some account needs to be taken of the values being processed.
The answer is that the sign bits must be examined prior to adding, and the result checked
based on this. This is not computationally hard - a simple look-up table will suffice:
positive + positive = positive
positive + negative = unknown
negative + positive = unknown
negative + negative = negative
For the mixed calculation (one positive and one negative number), the sign of the
answer is unknown, but is not problematic since by definition it can never result in an
overflow (think of it this way: the negative number will reduce the size of the positive
number, but the most it can do would be if the positive number is zero, in which case
the answer is the same as the negative input, and the inputs themselves do not include
carry flags).
For the case of two positive numbers being added, the result sign bit should be 0. If
it is not, then an overflow has occurred. For the case of two negative numbers, the result
sign bit should be 1, and if it is not an overflow has occurred. It can be seen therefore
that the value of Cout alone is not enough to indicate that an overflow has occurred. In
most processors, a separate overflow flag is provided, set through consideration of the
sign bits as we have seen. Consider the worked example in Box 2.14.
Box 2.14
Try extending the argument in the text to a subtraction. Using 4-bit two's complement signed number format, perform a few additions, then a few subtractions. Verify that all of the subtractions a − b can be performed in binary as a + (−b). Verify that the Cout signal does not indicate an overflow condition.
Perform the additions −5 + −5 and −5 + −1 and look at the sign bit and carry bits of the result. Can you conclude that the Cout signal is useless, or can it be used to increase the bit range of the result?
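The sign-based overflow check described above can be written as a short C sketch (names are ours):

    #include <stdint.h>

    /* Signed overflow check for an n-bit two's complement addition: overflow   */
    /* can only occur when both operands share a sign and the result's sign     */
    /* differs from it.                                                         */
    int add_overflows(uint32_t a, uint32_t b, uint32_t sum, int n)
    {
        uint32_t sign = 1u << (n - 1);
        int sa = (a & sign) != 0, sb = (b & sign) != 0, ss = (sum & sign) != 0;
        return (sa == sb) && (ss != sa);
    }

    /* 4-bit examples: 0111 + 0110 = 1101 -> add_overflows(7, 6, 13, 4) == 1,   */
    /* while 0010 + 1110 = (1)0000 -> add_overflows(2, 14, 0, 4) == 0.          */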
2.5 Multiplication
In the early days of microprocessors, multiplication was too complex to be performed
in logic within the CPU and hence required an external unit. Even when it was finally
squeezed onto the same piece of silicon, it was a tight fit: the multiply hardware in
early ARM processors occupied more silicon area than the entire ARM CPU core.
In more recent times, however, manufacturers have tuned multipliers to the target
application. For fast real-time embedded processors (perhaps an ARM7 in a GSM cell-
phone handling speech coding), there is a need to perform multiplications as quickly
as possible and hence a fast multiplier will be used. This will evidently occupy a large
silicon area compared to a slower multi-cycle multiplier used on a non real-time pro-
cessor (such as the ARM610 which was designed to power desktop computers in the
early 1990s, and to be the brains of the Apple Newton — the world’s first PDA).
There are many methods of performing the multiplication m × n at various rates (and with various complexities). Some of the more typical methods are listed here:
• Repeated addition.
• Partial products.
• The shift-add method.
• Booth's and Robertson's methods.
Each of these will be considered in the following subsections in turn. There are, of
course, other more esoteric methods as this is an active research area. Interestingly,
some methods may perform estimation rather than calculation, or involve loss of preci-
sion in the result. These would include converting operands to the logarithmic domain
and then adding them, or using an alternative or redundant number format.
Alternative number formats are briefly described in Section 9.5, but when it comes
to hardware for performing binary calculations, there are so many alternatives that it
will be impossible to describe them all.
2.5.1 Repeated Addition
The simplest approach forms the product by adding m into an accumulator n times. Since this involves a loop that repeats n times, the execution time is dependent on the value of n. However, if n is small, the result, B, is formed early.
If we consider that a 32-bit number can represent an integer with value in excess of two billion, we realise that many iterations of the loop might be necessary: it could imply a rather long execution time.
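A sketch of the repeated-addition idea in C (our own rendering of the loop being described):

    /* Multiply m by n using repeated addition: the loop runs n times, so the
     * execution time grows with the value of n (not its bit width).            */
    unsigned multiply_repeated_add(unsigned m, unsigned n)
    {
        unsigned b = 0;                 /* the accumulating result, B           */
        while (n-- > 0)
            b += m;
        return b;
    }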
The situation is complicated slightly when it comes to working with two’s comple-
ment signed numbers, firstly in that the most significant bit of the multiplier represents
sign, and secondly in that sign extension must be used (see Section 2.3.4).
For the signed case, all partial products have to be sign extended to the length of the result (which by default would be the sum of the lengths of the input representations minus 1 to account for the sign bit, such that a 6-bit signed number multiplied by a 7-bit signed number would require 12 bits to represent the result).
Since each partial product corresponds to one bit of the multiplier and is shifted to
account for the multiplier bit weighting, the partial product corresponding to the MSB
is a special case: the bit weighting is negative and this partial product must therefore be
subtracted from the accumulator rather than added. This is shown in the flowchart of
Figure 2.7, where it is assumed that the grey-coloured two’s complement accumulate
blocks are able to take account of sign extension.
To understand the process better, it is useful to attempt some simple binary multipli-
cation by hand using those methods; the reader can follow some examples in Box 2.15.
In reality, the accumulation of partial products may be more efficiently performed
in the reverse direction (i.e. looping down rather than looping up). In the best case
this would also remove the need to treat the partial product of the multiplier sign bit
differently (since this is not accumulated, it is merely the value in the accumulator
before additions begin, thus allowing its sign to be negated during the load-in process).
Figure 2.8 illustrates a block diagram of an alternative partial product multiplication method for unsigned numbers only (although extending this method to two's complement is a relatively simple task). The figure shows the sequence of operations to be taken once the set-up (operand loading) is complete.
The set-up phase resets the accumulator Q to zero and loads both multiplier and multiplicand into the correct locations. In step 1 the least significant bit of the multiplier is tested. If this is a 1 (step 2) then the multiplicand is added to the accumulator (step 3).
Figure 2.7: Flowchart of partial product multiplication; the accumulate step at each iteration is A = A + (M << count) × Q[count].
Step 4 occurs regardless of the two previous conditional steps, and shifts the entire
accumulator one bit to the right. The system loops n times (using control logic which
is not shown) before terminating with the answer in the long register.
Consider the differences between this and the original flowchart of Figure 2.7 in
terms of the number of registers needed, bus wires, connections, switches, adder size
and control logic involved.
Figure 2.8: Block diagram of the shift-add multiplier; step 2 triggers an add of the multiplicand when bit B0 = 1, and the accumulator Q holds the final result.
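The behaviour of the shift-add scheme in Figure 2.8 can be sketched in C as follows (a simplified unsigned model, with register names of our own choosing):

    #include <stdint.h>

    /* Shift-add multiplication of two unsigned n-bit words, modelled on the
     * Figure 2.8 sequence: test the multiplier LSB, conditionally add the
     * multiplicand into the top half, then shift the whole accumulator right.  */
    uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier, int n)
    {
        uint64_t acc = multiplier;                  /* multiplier in the low half  */
        for (int i = 0; i < n; i++) {
            if (acc & 1u)                           /* step 1/2: test bit B0       */
                acc += (uint64_t)multiplicand << n; /* step 3: add into top half   */
            acc >>= 1;                              /* step 4: shift right         */
        }
        return acc;                                 /* product in the long register */
    }

    /* shift_add_multiply(4, 5, 4) == 20                                            */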
Box 2.15
Worked example: signed partial product multiplication of 4 × (−5).

  0100        multiplicand = 4
  1011        multiplier = −5

   00000100   (since multiplier bit 0 = 1, write 4 shifted left by 0 bits & sign extend)
  +00001000   (since multiplier bit 1 = 1, write 4 shifted left by 1 bit & sign extend)
  +00000000   (since multiplier bit 2 = 0, write 0 shifted left by 2 bits & sign extend)
  −00100000   (since multiplier bit 3 = 1, write 4 shifted left by 3 bits; this bit has negative weight)
  =11101100   result = −128 + 64 + 32 + 8 + 4 = −20

But the last term needs to be subtracted. What we will do is change its sign by flipping all the bits and adding 1 (00100000 → flip → 11011111 → +1 → 11100000). We then simply add this to the other partial products. This gives:

   00000100
  +00001000
  +00000000
  +11100000
  =11101100   result = −20
As we can see the result is the same. We have illustrated the cases of needing sign
extension and of handling a negative multiplier causing the final partial product to be
subtracted instead of added.
Interestingly, this method of multiplication, including the right shift method (which
divides a number by two), was reportedly used by Russian peasants for hundreds
of years, allowing them to perform quite complex decimal multiplies with ease. The
algorithm starts with the two numbers to be multiplied, A and B, written at the head
of two columns respectively. We will give as an example, 31 multiplied by 17:
  B = 17    A = 31

Working downwards, divide the B column by two each line, discarding the fractional part until 1 is reached. Fill the A column similarly, but double the number on each successive line:
  B      A
  17     31
   8     62
   4    124
   2    248
   1    496
Next, simply add up all of the numbers in the A column that correspond to odd
numbers in the B column. In this example, only 17 and 1 are odd in the B column,
therefore the final answer will be 31 + 496 = 527, which is of course correct.
Note that the alternatives given in this section are by no means the only partial
product hardware designs available, and far from being the only multiplication methods
available (even among Russian peasants).
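The peasant method translates directly into a few lines of C (a sketch with our own names):

    /* Russian peasant multiplication: halve B, double A, and accumulate A     */
    /* whenever B is odd. Equivalent to scanning the bits of B from the LSB.   */
    unsigned long peasant_multiply(unsigned long a, unsigned long b)
    {
        unsigned long result = 0;
        while (b > 0) {
            if (b & 1ul)        /* odd entry in the B column: add the A column  */
                result += a;
            b >>= 1;            /* halve B, discarding the fractional part      */
            a <<= 1;            /* double A                                      */
        }
        return result;
    }

    /* peasant_multiply(31, 17) == 527                                          */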
2.5.4 Booth and Robertson's Methods
Booth's method scans the multiplier from right to left, examining at each position i the pair formed by bit x_i and the bit to its right, x_{i−1} (with a hidden zero to the right of bit 0), and taking the action given in Table 2.1.

Table 2.1
  x_i   x_{i−1}   action
   0       0      no action
   0       1      add the multiplicand, shifted left by i
   1       0      subtract the multiplicand, shifted left by i
   1       1      no action
Box 2.16
Consider 9 × 10:

  1001    multiplicand 9
  1010    multiplier 10

            (i=0, no action since bit pair = 0 and a hidden zero)
  −1001    (i=1, subtract multiplicand << 1 since bit pair = 10)
  +1001    (i=2, add multiplicand << 2 since bit pair = 01)
  −1001    (i=3, subtract multiplicand << 3 since bit pair = 10)
  +1001    (i=4, add multiplicand << 4 since bit pair = 01)
            (i=5 and onwards, no action since all bit pairs = 00)

The result is obtained by summing the shifted, sign-extended terms:

   10010000
  +10111000
  +00100100
  +11101110

Result:
  01011010 = 64 + 16 + 8 + 2 = 90 (correct)
Box 2.17
Consider −9 × 11 (signed):

  11110111    multiplicand −9
  00001011    multiplier 11

  −11110111   (i=0, subtract multiplicand since bit pair = 10)
              (i=1, no action since bit pair = 11)
  +11011100   (i=2, add multiplicand << 2 since bit pair = 01)
  −10111000   (i=3, subtract multiplicand << 3 since bit pair = 10)
  +01110000   (i=4, add multiplicand << 4 since bit pair = 01)
              (i=5 and onwards, no action since all bit pairs = 00)

The result is therefore obtained as the summation of the following:

  −11110111
  +11011100
  −10111000
  +01110000

Result:
  10011101 = −128 + 16 + 8 + 4 + 1 = −99 (correct)
It is important to note that when i=0, the bits considered are the least significant
bit of the multiplier and a hidden zero. Thus, when the least significant bit of the
multiplier is a ‘1’, the multiplicand must be subtracted (i.e. treated as a ‘10’ instead).
This can be seen in the second worked example (Box 2.17).
There are two points worth mentioning here. First, when dealing with two’s com-
plement signed operands, the partial products must be sign extended in the same way
as the full partial product multiplier.
Second, when scanning from right to left, the hidden bit at the right-hand side
means that the first pair of non-equal bits that is encountered will always be a ‘10’,
indicating a subtraction. This regularity may be useful when designing a hardware implementation.
Even for someone who has been doing binary arithmetic for many years, the preparation of this book highlighted how easy it can be to make very trivial binary addition mistakes. If you are required to do this as part of an examination, always
double-check your binary arithmetic. Getting it right the first time is not as simple as it
may seem.
As mentioned previously, Booth extended his method into examination of 4 bits at
a time, using a look-up-table type approach, and Robertson took this one step further by
building an 8-bit look-up table. These methods are in fact common in various modern
processors, although they require considerable resources in silicon.
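A radix-2 Booth multiplier following Table 2.1 can be sketched in C as below (our own signed model; bit pairs are the current multiplier bit and the bit to its right, with a hidden zero at i = 0):

    #include <stdint.h>

    /* Radix-2 Booth multiplication of two 8-bit signed values.                  */
    /* Bit pair 10 -> subtract multiplicand << i, 01 -> add, 00/11 -> no action. */
    int32_t booth_multiply_8bit(int8_t multiplicand, int8_t multiplier)
    {
        int32_t acc = 0;
        uint8_t q = (uint8_t)multiplier;
        int prev = 0;                              /* the hidden zero at i = 0   */
        for (int i = 0; i < 8; i++) {
            int cur = (q >> i) & 1;
            if (cur == 1 && prev == 0)             /* bit pair 10                */
                acc -= (int32_t)multiplicand << i;
            else if (cur == 0 && prev == 1)        /* bit pair 01                */
                acc += (int32_t)multiplicand << i;
            prev = cur;
        }
        return acc;
    }

    /* booth_multiply_8bit(9, 10) == 90 and booth_multiply_8bit(-9, 11) == -99,  */
    /* matching Boxes 2.16 and 2.17.                                             */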
2.6 Division
For many years, commodity CPUs and even DSPs did not implement hardware divi-
sion due to the complexity of silicon required to implement it. Analog Devices DSPs
and several others did include a DIV instruction, but this was generally only a hardware
assistance for the very basic primary-school method of repeated subtraction.
When working in binary the process is identical and perhaps best performed as
long division as in the worked example in Box 2.18.
So now the question is, how to handle signed integer division? Answer: The most
efficient method is probably to note the signs of both operands, convert both to unsigned
integers, perform the division and then apply the correct sign afterwards. Division uses
the same sign rules as multiplication in that the answer is only negative if the signs of
the operands differ.
The division process for one popular microprocessor can be seen in the flowchart of
Figure 2.9. A close examination of this may prompt some questions such as: ‘Why shift
both A and Q left at each iteration?’ and ‘Why perform an addition of Q = Q + M inside
the loop?’ These questions may be answered by considering how the operations are
performed using registers within a CPU. This will be left as a pencil-and-paper exercise
for the reader to follow the operation of the algorithm for one example division, perhaps
of two 6-bit numbers: this exercise will help to clarify how this system works.
Just note that at the completion of the algorithm, register A holds the answer, with
any remainder being in register Q. The algorithm will have iterated for n cycles where
n is the number of bits in the input words. As always, it is entirely possible to derive
other flowcharts that work differently, for example, some will even iterate and scan through the bits in the opposite direction.
Box 2.18
Worked example: binary long division of 010111 (decimal 23) by 101 (decimal 5).

    101       010111
  divisor     dividend

Then, starting from the most significant end (left) and working towards the least significant end (right), scan each bit position in the dividend to see if the divisor can be 'found' in the dividend. In each case if it is not found, write a '0' in the corresponding position above the dividend, and look at the next bit. After three iterations, we would have:

        000        (quotient)
  101 ) 010111

But now, at the current bit position in the dividend, 101 can be found. We thus write 101 below the dividend and a '1' above the dividend at the correct bit position. Then subtract the divisor (at that bit position) from the dividend to form a new dividend:

        0001
  101 ) 010111
       −010100
       =000011

Next, we continue working from left to right but this time looking at the new dividend for the divisor. In this case it is not found; after scanning all bit positions we are left with:

        000100
  101 ) 010111
       −010100
       =000011

The answer is seen above: the quotient is 000100 with a remainder of 000011. Since we were dividing 23 by 5, we expect an answer of 4 (correct) and a remainder of 3 (also correct).
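A C sketch of this style of binary long division (our own rendering of the procedure in Box 2.18):

    #include <stdint.h>

    /* Binary long division of an n-bit unsigned dividend: scan from the MSB,   */
    /* subtracting the shifted divisor wherever it 'fits', and building the     */
    /* quotient one bit at a time.                                              */
    uint32_t long_divide(uint32_t dividend, uint32_t divisor, int n,
                         uint32_t *remainder)
    {
        uint32_t quotient = 0;
        for (int i = n - 1; i >= 0; i--) {
            uint64_t shifted = (uint64_t)divisor << i;
            if (shifted <= dividend) {             /* divisor 'found' here       */
                dividend -= (uint32_t)shifted;
                quotient |= 1u << i;
            }
        }
        *remainder = dividend;
        return quotient;
    }

    /* long_divide(23, 5, 6, &r) returns 4 with r == 3.                          */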
Figure 2.9: Division flowchart for one popular microprocessor. At each iteration, A and Q are shifted left (A<<1, Q<<1); on exit, A is the quotient and Q is the remainder.
This is possible numerically because we are only multiplying and adding in a filter, and these are linear processes: (a × b + c) has the same result as (10a × b + 10c)/10.
Remember again that the actual fractional format used is not relevant to the hardware used to perform the calculations. It is only an abstraction that the software engineer must keep in mind. This will be illustrated with various examples as we proceed.

2.7.1 Arithmetic with Fractional Numbers
Addition can always be performed on two fractional format numbers, but the correct
answer will only be achieved when the formats of each operand are identical. The
format of the answer will be that of the operands:
(m.n) + (m.n) = (m.n)
(m.n) − (m.n) = (m.n)
Box 2.19
Question 1: Represent 1.75 and 1.25 in (2.2) format fractional notation, perform an addition between the two and determine the result.
Answer: First calculate the bit weightings for (2.2) format notation: we need two digits to the right and two digits to the left of the radix point. Digits to the left are integer weightings, are powers of 2 and start with 1. Digits to the right are fractional, are 1 over powers of 2 and start with 1/2:

  2   1  .  1/2   1/4
We can decompose 1.75 into 1+ 0.5 + 0.25 and 1.25 into 1 + 0.25 and write them in (2.2)
binary format as 0111 and 0101.
The binary addition of these operands results in 1100. Is this correct?
1100 in (2.2) format equals 2 + 1 = 3. Of course 1.75 + 1.25 = 3 so yes, the answer is
correct.
Next, we will illustrate what happens when something goes wrong.
Question 2: Represent 1.75 in (2.2) format fractional notation, represent 0.625 in (1.3)
format fractional notation, perform an addition between the two and determine the
result.
Answer: 1.75 was represented in question 1 and is 0111.
(1.3) format fractional notation has weightings 1, 0.5, 0.25, 0.125 and thus if we decom-
pose 0.625 into 0.5 + 0.125 we get a binary pattern 0101.
Next, we perform the addition 0111 + 0101 which gives the answer 1100.
However, we do not know the fractional format of the result. Let us speculate
whether this is (2.2) or (1.3) format by working out the decimal value in each case.
In (2.2) format the result is 2+1 = 3 and in (1.3) the result is 1+0.5 = 1.5. However,
the answer should be 1.75 + 0.625 = 2.375. Clearly, this does not match either of the
potential answers.
What we should have done was change one of them so they were both in the same
format before we performed the addition.
Note: Did you see that the binary patterns of both examples are identical? It is only our
interpretation of those bit-patterns that changed between examples. Using different
interpretations in this way can cause the same bit-pattern to have multiple meanings —
but the hardware used to perform the calculation does not need to change.
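The fix implied by Box 2.19, aligning both operands to a common format before adding, might look like this in C (a sketch; the common format chosen here is ours):

    #include <stdint.h>

    /* Add a (2.2) value and a (1.3) value by first aligning both to a common   */
    /* (2.3) format: the (2.2) operand gains one fraction bit (shift left by 1). */
    /* Overflow of the integer part is ignored in this sketch.                   */
    uint8_t add_2_2_and_1_3(uint8_t a_2_2, uint8_t b_1_3)
    {
        uint8_t a_2_3 = (uint8_t)(a_2_2 << 1);   /* (2.2) -> (2.3)               */
        return (uint8_t)(a_2_3 + b_1_3);         /* result is in (2.3) format    */
    }

    /* 1.75 in (2.2) is 0111b; 0.625 in (1.3) is 0101b.                          */
    /* add_2_2_and_1_3(0x7, 0x5) == 0x13 == 10011b, i.e. 2.375 in (2.3) format.  */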
The arithmetic of such fractional format numbers is illustrated with two examples
in Box 2.19.
Looking at the worked example, it is clear that the actual division is no more com-
plex than standard binary arithmetic; however, consideration of the radix position may
be problematic. In fact, it requires some careful coding on the part of the programmer.
2.8 Floating Point
Floating point numbers are similar to fractional format binary but they have additional
flexibility in that the position of the radix point is variable (and is stored as part of
the number itself). It is this flexibility that allows floating point numbers to encode an
enormous range of values with relatively few bits.
Or more correctly, considering the sign to be a binary value σ, with 1 indicating negative and 0 indicating positive, such that:

  n = (−1)^σ × S × B^E
An example in base 10 would be 2.3 × 10^6, which we know is just a shorthand method of writing 2,300,000. In fact, this illustrates one of the main benefits of floating
point: floating point numbers generally require less writing (and in binary require fewer
digits) than the decimal (or binary) values they represent.
In binary the difference is that B = 2 rather than 10, and thus the example will typically be something like 01001111 × 2^e which, if the mantissa (01001111) is unsigned, becomes 79 × 2^e.
It is not the intention of this text to describe the entire IEEE754 standard, but we
will cover some of its more common features. We will consider single and double
precision formats which fit into 32-bit and 64-bit storage locations respectively. In the
C programming language these would normally correspond to float and double data
types.
  Name                 σ        E                         S
  Normalised           1 or 0   not all zero or all one   any
  Zero                 1 or 0   all zero                  all zero
  Infinity             1 or 0   all one                   all zero
  Not a Number (NaN)   1 or 0   all one                   non-zero
  Denormalised         1 or 0   all zero                  non-zero
When an IEEE754 number is written, we typically write the bits from left to right in the order (σ, E, S) as shown below:

  0 10110010 11100000000000000000000

This would be said to have σ = 0 and therefore positive sign,
  E = 128 + 32 + 16 + 2 = 178, and
  S = 0.5 + 0.25 + 0.125 = 0.875
where it can be seen firstly that the exponent is in an excess-127 notation (introduced
in Section 2.3.5) and secondly that the mantissa needs to have a ‘1’ added to it. In other
words, the mantissa is equal to S + 1 and we know that S was written in (0.23) format.
All this may be very confusing, so we will return to the example IEEE754 number
and use it in the worked example in Box 2.21, and give a second example in Box 2.22.
Many of our example numbers have long tails of zeros. We can obtain an idea
about the basic precision of IEEE754 by considering what difference would result if the
least significant bit at the end of one of those tails is flipped from a ‘0’ to a ‘1’. Box 2.23
provides a guide as to how we can investigate the effect.
Box 2.21
First of all, we note that here σ = 0 and therefore the value has positive sign. We also note that the number is in normalised mode. Therefore:
  E = 128 + 32 + 16 + 2 = 178
and
  S = 0.5 + 0.25 + 0.125 = 0.875
Using the formula for normalised mode numbers, we can calculate the value that this conveys:
  n = (−1)^0 × (1 + 0.875) × 2^(178−127)
    = 1.875 × 2^51
    = 4.222 × 10^15
As we can see, the result of the worked example is a fairly large number, illustrating the ability of floating point formats to represent some quite big values.
Box 2.22
Given the IEEE754 number 1 00001100 01010000000000000000000:
In this case, σ = 1 and the number therefore has negative sign, and the remaining bit-patterns give:
  E = 8 + 4 = 12 and
  S = 1/4 + 1/16 = 0.3125
Using the formula for normalised mode numbers, we can calculate the value that this conveys:
  n = (−1)^1 × (1 + 0.3125) × 2^(12−127)
    = −1.3125 × 2^−115
    ≈ −3.16 × 10^−35
This time the result is a very small number. This illustrates the enormous range of numbers possible with floating point, and also the fact that all through the represented number range (explored further in Section 2.8.4), precision is maintained.
Below the smallest normalised numbers, IEEE754 provides a special denormalised mode that is able to extend the represented numbers downwards in magnitude towards zero — gracefully decreasing precision until zero is reached.
Denormalised mode is not actually floating point because the exponent (which is
the part of the number that specifies the radix point) is set to all zeros and thus no longer
'floats'. However, this mode, in allowing range extension, is an important advantage of IEEE754 numbers.
In this mode, the number represented by the bit-patterns (σ, E, S) is given by:

  n = (−1)^σ × S × 2^−126
It can be seen firstly that the exponent is fixed as mentioned above, and secondly
that we no longer need to add a‘1’ to the mantissa. The reason for this will be apparent
when we explore number ranges in Section 2.8.4.
Since the exponent is fixed, the bit-pattern is always all-zero and the mantissa non-
zero. A worked example will help to clear up any confusion, and this is provided in
Box 2.24.
Since denormalised numbers extend the range of IEEE754 downwards, they will
always have very small magnitude.
Box 2.24
Given the following binary value representing an IEEE754 number, determine its decimal value:

  0 00000000 11010000000000000000000

The exponent is all zero and the mantissa non-zero, so this is a denormalised number with σ = 0 and S = 0.5 + 0.25 + 0.0625 = 0.8125. Using the formula for denormalised mode numbers, we can calculate the value that this conveys:
  n = (−1)^0 × 0.8125 × 2^−126
    = 9.5509 × 10^−39
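A small C decoder for the single-precision fields (our own sketch; it mirrors the normalised and denormalised formulas above and ignores infinity and NaN):

    #include <stdint.h>
    #include <math.h>
    #include <stdio.h>

    /* Decode a 32-bit IEEE754 single-precision pattern by hand, using the      */
    /* normalised formula (-1)^s * (1+S) * 2^(E-127) and the denormalised       */
    /* formula (-1)^s * S * 2^-126. Infinity and NaN are not handled here.      */
    double decode_ieee754_single(uint32_t word)
    {
        int      sign     = (word >> 31) & 1u;
        unsigned exponent = (word >> 23) & 0xFFu;
        double   s        = (word & 0x7FFFFFu) / (double)(1u << 23); /* (0.23)  */
        double   value;

        if (exponent == 0)
            value = s * pow(2.0, -126);                    /* denormalised/zero */
        else
            value = (1.0 + s) * pow(2.0, (int)exponent - 127);   /* normalised  */

        return sign ? -value : value;
    }

    int main(void)
    {
        /* 0 10110010 11100000000000000000000  ->  1.875 x 2^51 (Box 2.21)      */
        printf("%g\n", decode_ieee754_single(0x59700000u));
        return 0;
    }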
Three parameters are indicated which describe the format. The first is the smallest
magnitude number (0000 0000), the second is the largest magnitude number (1111
1111) and the final is the accuracy. Accuracy is defined as the distance between steps
in the format. In this case, the numbers count upwards as integers: 1, 2, 3, 4,5, ...255
and so the step size is simply 1.
Now, we will undertake to define a number line for IEEE754 format in the same way. To simplify matters we will consider positive numbers, but we will look at both normalised and denormalised modes, although only for the single-precision case.
Normalised mode requires that E is not all-zero or all-one, but S can take any value, and the actual value represented is:

  n = (−1)^σ × (1 + S) × 2^(E−127)
If we look for the smallest magnitude normalised mode number, we need to find the smallest S and smallest E possible. The smallest S is simply 0, but the smallest E cannot be 0 (because that would denote denormalised or zero mode), so it has to be 00000001 instead:

  0 | 00000001 | 00000000000000000000000
Inserting these values into the formula and assuming a positive sign gives us n = (1 + 0) × 2^(1−127) = 2^−126, which is approximately 1.18 × 10^−38.
Next, looking for the largest magnitude number, we remember that S can be any-
thing, but E cannot be 11111111 (because that would put it into infinity or NaN
modes). So we choose the largest E as 11111110 and the largest S as being all-one.
Considering E first, the value equates to 254. However, S is slightly harder to evaluate:

  S = 0.11111111111111111111111   (twenty-three 1's)

But realising that this is (0.23) format and is slightly less than 1.0 in value, we can see that if we add a binary 1 to the least significant digit then all the binary 1's in the word would ripple-carry to zero as the carry is passed up the chain, and we would get a value like this:

  1.00000000000000000000000

We can use this fact: knowing that there are 23 bits, the bit weight of the first most significant bit is 2^−1, the weight of the second most significant bit is 2^−2 and so on. Then the twenty-third most significant bit (which is actually the least significant bit) must have a weight of 2^−23.
Therefore, the value of S has to be (1.0 − 2^−23), since adding 2^−23 to it would make it exactly equal 1.0:

  n = (1 + 1 − 2^−23) × 2^(254−127) = (2 − 2^−23) × 2^127 ≈ 3.4 × 10^38
Turning to denormalised mode, the smallest magnitude number has the smallest possible non-zero S (a single '1' in the least significant mantissa position), and therefore a value of 2^−23, following the argument for the normalised mode maximum number. The formula becomes:

  n = 2^−23 × 2^−126 = 2^−149 ≈ 1.4 × 10^−45
As for the largest denormalised number, this is simply the number where S is a maximum. Looking at the mode table in Section 2.8.2, we see it can be all-ones:

  n = (1 − 2^−23) × 2^−126 ≈ 1.18 × 10^−38
Now to work out the number accuracy: in this case since the exponent is fixed, the
accuracy is simply given by the value of the mantissa least significant bit multiplied by
the exponent:
  2^−23 × 2^−126 = 2^−149
Putting the number lines together, we see the huge range spanned by IEEE754 single-precision numbers. Remember that this is actually only half of the real number line, which has positive as well as negative sides.
The number line becomes useful when we want to convert decimal numbers to IEEE754 floating point. It tells us which mode we should use, whether zero, denormalised, normalised or infinity. To illustrate this, follow the worked example of conversion below.
Worked example: converting decimal to floating point
To represent the value 11, which is 1.375 × 2^3, we need:
  σ = 0
  E = 130 (so that E − 127 = 3)
  S = 0.375 (so that 1 + S = 1.375)
Finding a binary bit-pattern for E gives 128 + 2 or 10000010, and since 0.375 is easily represented as 0.25 + 0.125, the full number is:

  0 10000010 01100000000000000000000
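A hand conversion like this can be checked against the machine's own encoding; a quick C sketch of our own, viewing the bit-pattern of a float through memcpy:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = 11.0f;                 /* 1.375 x 2^3, as in the worked example */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* well-defined way to view the pattern  */
        printf("0x%08X\n", bits);        /* prints 0x41300000:                    */
                                         /* 0 10000010 01100000000000000000000    */
        return 0;
    }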
1. Receive operands.
2. Check for number format modes. If the value is fixed, immediately generate the
answer from a look-up table.
3. Convert exponents and mantissas if necessary.
4. Perform operation.
5. Convert back to valid IEEE754 number format. Keep the most significant 1 of
the mantissa as close to the left as possible, for reasons of maintaining maximum
precision.
2.9.1 Addition and Subtraction of IEEE754 Numbers
In generalised floating point, the exponents of the numbers must all be the same before addition or subtraction can occur. This is similar to ensuring that the fractional format addition (m.n) + (r.s) has n = r and m = s before adding, as we saw in Section 2.7.1.
For example, consider the decimal sum 0.824 × 10^6 + 0.992 × 10^4. In order to do this addition easily, we must have both exponents equal — then we simply add the mantissas. But do we convert both exponents to be 10^6, or do we convert both to be 10^4, or even choose something in between such as 10^5?
In answering this question, first let us consider how to convert an exponent downwards. We know that 10^6 is the same as 10 × 10^5, and also the same as 100 × 10^4. Since we are talking about decimal, we multiply the mantissa by the base value of 10 every time we decrement the exponent. Performing this in our calculation would give us the sum:

  82.4 × 10^4 + 0.992 × 10^4

Once the exponents are equal, we can perform an addition on the mantissas, giving 83.392 × 10^4, which would then be renormalised.

2.9.2 Multiplication and Division of IEEE754 Numbers
For multiplication, by contrast, there is no need to equalise the exponents: the mantissas are multiplied and the exponents added:

  (0.824 × 0.992) × 10^(6+4) = 0.817408 × 10^10
Once again, in the case of IEEE754 format numbers the result must be converted to a correct representation and special results (zero, infinity, NaN) checked for.

2.9.3 IEEE754 Intermediate Formats
Although a particular IEEE754 calculation may have IEEE754 operands as input and as output, there are cases where the output will be numerically incorrect unless there is greater precision within the calculation. A short example subtraction on 9-bit numbers will illustrate the problem.
Box 2.27
Q. Convert decimal values 20 and 120 to IEEE754 format, add them and convert the result back to decimal.
A. Looking at our number line from Section 2.8.4 we realise that both values lie in the normalised number range of IEEE754, but initially we will simply consider a generic A × 2^B format. Furthermore, we will not look at the exact IEEE754 bit-patterns here. Simply remember that A = (1 + S) and B = (E − 127).
Starting with 20 we divide repeatedly by 2 until we get a remainder between 1 and 2: 10, 5, 2.5, 1.25 and so A = 1.25. We divided four times so B = 4.
120 similarly divides down to 60, 30, 15, 7.5, 3.75, 1.875 so A = 1.875. Since we divided six times, B = 6.
The information is inserted into the following table. We do not need to derive the E and S bit-patterns at this stage; we are more concerned with their interpretation:

  Value                    A        B
  20                       1.25     4
  120                      1.875    6
  20, re-aligned to B = 6  0.3125   6
  sum                      2.1875   6

However, this is not a valid representation for IEEE754 because the mantissa value is too large. Remember the (1 + S) in the formula? Well, A = (1 + S) < 2 is our constraint. If both operands were IEEE754-compliant then we should be able to guarantee that no more than one shift is needed to put it right, so we shift the A value right by one position (halving it) and add one to the exponent. A check on a calculator will reveal that 1.09375 × 2^7 is indeed the correct answer, giving us a decimal value of 140.
In IEEE754 bit-patterns, the two operands 20 and 120 are:

  0 | 10000011 | 01000000000000000000000
  0 | 10000101 | 11100000000000000000000
The result of the addition was 1.09375 × 2^7, such that S = 0.09375 and E = 134. Since 0.09375 is not an obvious fraction of 2, we can use a longhand method to determine the bit-patterns. In this, we repeatedly multiply the value by 2, subtracting 1 whenever the result is equal to or bigger than 1, and ending when the remainder is zero:

  1:  0.09375 × 2 = 0.1875
  2:  0.1875  × 2 = 0.375
  3:  0.375   × 2 = 0.75
  4:  0.75    × 2 = 1.5    (≥ 1, so subtract 1, leaving 0.5)
  5:  0.5     × 2 = 1.0    (≥ 1, so subtract 1, leaving 0)
We subtracted 1 on iterations 4 and 5. We make use of this by setting the fourth and
fifth bits from the left to 1. In fact, we could have used this method for the first two
numbers, but they were too easy:
  0 | 10000110 | 00011000000000000000000
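The repeated-doubling procedure is easy to automate; a small C sketch of our own follows:

    /* Longhand conversion of a fraction (0 <= f < 1) to its first n mantissa   */
    /* bits: repeatedly double, emitting a 1 and subtracting it whenever the    */
    /* running value reaches 1.0. 'out' must hold at least n+1 chars.           */
    void fraction_to_bits(double f, int n, char *out)
    {
        for (int i = 0; i < n; i++) {
            f *= 2.0;
            if (f >= 1.0) { out[i] = '1'; f -= 1.0; }
            else          { out[i] = '0'; }
        }
        out[n] = '\0';
    }

    /* fraction_to_bits(0.09375, 8, buf) leaves "00011000" in buf: bits 4 and 5 */
    /* are set, exactly as derived above.                                        */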
   1.0000 0000 × 2^1    A
 − 1.1111 1111 × 2^0    B

Before we can proceed with the subtraction it will of course be necessary to normalise the numbers to the same exponent. We do this by increasing the smaller one as we have done in Section 2.9.1:

   1.0000 0000 × 2^1    A
 − 0.1111 1111 × 2^1    B

Now we can proceed with the calculation. The result:

   0.0000 0001 × 2^1    C
Let us look at the actual numbers that we have used. Operand A has value 2.0 and operand B has value (2.0 − 2^−8), which in decimal is 1.99609375. So the result should be 2.0 − 1.99609375 = 0.00390625, which is 2^−8.
However, the result from our calculation is 1 × 2^−7, or 0.0078125. There is obviously a problem somewhere.
Now let us repeat the calculation but this time adding something called a guard bit during the intermediate stages. This effectively extends the length of the mantissa by adding another digit at the least significant end. We start at the point where the numbers have been normalised. Note the extra digit:

   1.0000 0000 0 × 2^1    A
 − 1.1111 1111 0 × 2^0    B

Next, shifting to normalise the exponents, the LSB of B shifts into the guard bit when we shift the number right by 1 bit:

   1.0000 0000 0 × 2^1    A
 − 0.1111 1111 1 × 2^1    B
 = 0.0000 0000 1 × 2^1    C

Notice that in line C this time the most significant (only) 1 occurred in the guard bit, whereas previously it was located at the bit above that. The normalised value is now 1 × 2^−8, or 0.00390625, a correct answer this time.
Although this example showed generalised 8-bit floating point numbers, the prin-
ciple is the same for IEEE754 numbers.
The example above showed a loss of precision error causing an incorrect result
during a subtraction. Of course, the same error could occur during an addition since A —
B is the same as A + (—B). But can it also occur during multiplication and division? It is
left as an exercise for the reader to try and find a simple example that demonstrates this.
In IEEE754 terminology, more than one guard bit is used and the method is called extended intermediate format. It is standardised with fixed intermediate bit widths for the single and double precision formats.
2.9.4 Rounding
Sometimes an extended intermediate value needs to be rounded in order to represent
it in a desired output format. At other times a format conversion from double to single
precision may require rounding. Rounding can be necessary for both fixed and floating
point number calculations at times.
There is more than one method of performing numeric rounding and many com-
puter systems will support one or more of these methods under operating system
control:
• Round to nearest (most common) — Round to the nearest representable value and, if two values are equally near, default to the one with LSB = 0, for example 1.1 to 1 and 2.5 to 2.
• Round towards +ve — Round towards the most positive number, for example −1.2 to −1 and 2.2 to 3.
• Round towards −ve — Round towards the most negative number, for example −1.2 to −2 and 2.2 to 2.
• Round towards 0 — Equivalent to always truncating the number, for example −1.2 to −1 and 2.2 to 2.
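These modes can be selected at runtime in C through the standard <fenv.h> interface; the short demonstration below is our own (the FE_UPWARD, FE_DOWNWARD and FE_TOWARDZERO macros are only defined on platforms that support those modes):

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>
    #pragma STDC FENV_ACCESS ON

    /* rint() rounds according to whichever mode is currently in force. */
    static void show(int mode, const char *name)
    {
        fesetround(mode);
        printf("%-12s  -1.2 -> %4.1f   2.2 -> %4.1f\n",
               name, rint(-1.2), rint(2.2));
    }

    int main(void)
    {
        show(FE_TONEAREST,  "to nearest");
        show(FE_UPWARD,     "towards +ve");
        show(FE_DOWNWARD,   "towards -ve");
        show(FE_TOWARDZERO, "towards 0");
        return 0;
    }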
2.10 Summary
This chapter, entitled ‘Foundations’, has really begun our journey inside the computer
— whether that is a room-sized mainframe, a grey desktop box or a tiny embedded
system. It is foundational too, since almost all computers, whatever their size, are based
upon similar principles. They use the same number formats, perform the same type
of calculations such as addition, subtraction, multiplication and division. The main
differences that we have seen are that there exist some faster methods to carry out these
operations, but at the cost of increased complexity, size and usually power consumption.
We began the chapter by considering the definition of a computer and what it
contains. We introduced the useful classification of computer types (or CPUs) by Flynn,
viewed them in terms of their connectivity and the layers of functionality that they
contain. We then refreshed our knowledge of number formats and the basic operations,
before going into a little more detail about how these calculations are achieved.
Having covered the foundations here, the next chapter will focus on how to achieve
the connectivity and calculations that we know are required — how to fit these functional
units together, write and store a program and control the internal operation required
in a working CPU.
2.1  A programmer wrote a C language program to store 4 bytes (b0, b1, b2, b3) to consecutive memory locations and ran this on a little endian computer with 32-bit wide memory. If he examined the memory after running his program, would he see something like A or B in the diagrams below?

       bit 31              bit 0
  A:  | b3 | b2 | b1 | b0 |
  B:  | b0 | b1 | b2 | b3 |
2.2  Complete the following table (for 8-bit binary numbers), indicating any instances where conversion is impossible for the given value:

  Value   Unsigned   Two's complement   Sign-magnitude   Excess-127
   123
   −15
   193
  −127
2.3  With a two's complement (2.30) format number, how do we represent the value 0.783203125? Can this be represented exactly with (a) 32 bits, (b) 16 bits and (c) 8 bits?
2.4 One BCD digit consists of 4 bits. Starting with a 4-bit ripple-carry adder, modify
this with extra single-bit adders and logic gates to create an adder that can add
two BCD digits and produce a BCD sum. Extend the design so that it can add
two 4-digit BCD numbers.
2.5  Using partial products (long multiplication), manually multiply the two 4-bit binary numbers X = 1011 and Y = 1101 assuming they are unsigned numbers.
2.7  If ADD, SHIFT and compare operations each require a single CPU cycle to complete, how many CPU cycles are needed to perform the calculation in Problem 2.5? Compare this with the steps of Booth's method in Problem 2.6. Also, would Booth's algorithm become more efficient for a larger word width?
2.8 Consider a RISC CPU that has an instruction named ‘MUL’ that can multiply the
contents of two registers and store the result into a third register. The registers
are 32-bits wide, and the stored result is the top 32 bits of the 64-bit logical result
63
Foundations
(remember that 32 bits x 32 bits should give 64 bits). However, the programmer
wants to determine the full 64-bit result. How can he obtain this? (Hint: You will
need to do more than one multiply, and also a few ANDs and adds to get the result).
Verify your method, and determine how many instructions are needed.
2.11  Can a standard exponent/mantissa floating point number format represent zero in more than one way? Can IEEE754 represent zero in more than one way? If so, explain any differences between the representations.
2.12  Use the division flowchart of Figure 2.9 to obtain the quotient and remainder values for the unsigned 5-bit binary division Q/M where Q = 10101b and M = 00011b.
2.13 Use the multiplication flowchart from Figure 2.7 to perform partial product
multiplication of two 5-bit unsigned binary numbers 00110 and 00101. De-
termine the number of registers used, their sizes and their content during each
iteration.
2.14 Repeat the previous problem using the multiplication block diagram of Fig-
ure 2.8, to compare and contrast the two approaches in terms of efficiency,
number of steps, number of registers and so on.
2.16 How would Michael Flynn classify a processor that has an instruction able to
simultaneously right shift by one bit position every byte stored in a group of
five internal registers?
2.17 Justify whether self-modifying code (that is, software that can modify its own
instructions by rewriting part of its code) would fit better in a von Neumann or
Harvard architecture system.
2.18  Using a 16-bit processor and only a single result register, follow the process to add the (2.14) format unsigned number X = 01.11000000000000 and the (1.15) format unsigned number Y = 0.110000000000000. What format would the result need to be in to avoid overflow? Is there any loss of precision caused by the calculation in this case?
2.19  Determine the decimal values conveyed by the following IEEE754 bit-patterns:

  0 10100010 10100000000000000000000
  0 00000000 10100000000000000000000
  1 11111111 00000000000000000000000
2.20  What would be the mantissa and the exponent of the result of the following base 7 calculation, expressed in base 7?

  (3 × 7^8)/(6 × 7^4)

Hint: You do not need to use a calculator to obtain the answer.
2.21 Using partial products (long multiplication), manually multiply the two 6-bit
binary numbers X = 100100 and Y = 101010 assuming they are signed.
2.22  Repeat the previous multiplication by swapping the multiplier and multiplicand (i.e. multiply the two 6-bit signed binary numbers X = 101010 and Y = 100100). Compare the number of additions that are required to perform the multiplication in each case.
2.23 Repeat the previous two multiplications using Booth’s method. Is there any
difference in the number of partial product additions when the multiplier and
multiplicand are swapped?
2.24  Referring to Section 2.9, determine the number of basic integer addition, shift and multiplication operations required to perform a single-precision IEEE754 floating point normalised mode multiply, and compare this with the basic operations required to perform a (2.30) × (2.30) multiply. Ignore extended intermediate mode, overflow and saturation effects and assume the floating point numbers have different exponent values.
2.25 How many computational operations are required to perform an 8-bit division
using repeated subtraction?
CHAPTER 3
CPU Basics
What Is a Computer?
When the general public refer to a computer, they generally envisage a
beige-coloured box with monitor, keyboard and mouse. While the box
they imagine does contain a computer, we know there is a whole lot more
in there.
The ‘computer’ part of the system is the CPU, memory subsystem
and any required buses — in fact those items that allow it to function as
a stored-program digital computer. It does not require a graphics card,
wireless interface card, hard disc or sound system in order to compute
and execute stored programs.
The stored-program digital computer is basically just a very flexible,
but generally quite basic, calculating and data transfer machine that is
programmable to perform the required functions.
These days, most people in the developed world will be surrounded
by tens, if not hundreds, of computers. These may be inside microwaves,
toasters, cellphones, MP3 players, even electronic door locks. It has been
estimated that a luxury car contains well over 100 processors, and even
an entry model may contain over 40 separate devices. In one surprising example encountered recently, a new double-sized electric blanket was promoted as containing four dedicated microprocessors — one active and one backup device for dual independent controls on each side. With usage on this scale it becomes easy to imagine that the 'future is embedded'. The contents of this chapter apply whether the computer is room-size or the size of an ant.
whereas data may require read/write access and may be accessed either sequentially or in a random fashion.
A number of factors characterise each type of memory:
• Cost.
• Density (bytes per cm³).
• Power efficiency (nanojoules per write, read or second of storage time).
• Access speed (including seek time and average access time).
• Access size (byte, word, page, etc.).
• Volatility (i.e. data lost when the device is unpowered).
• Reliability (does it have moving parts? does it age?).
• CPU overhead to manage it.
These factors lead to a hierarchy of memory as shown in the pyramid in Figure 3.1,
for both a large desktop/server and a typical embedded system. Two items shown will
be explored subsequently in Chapter 4: the memory management unit (MMU) and
cache. However, for the present discussion notice that registers — temporary storage
locations very close to the CPU functional units — are the fastest, but most expensive
Figure 3.1: A pyramidal diagram illustrating the hierarchy of memory in terms of speed, size, cost and so on for embedded systems (on the left) and traditional desktop/server computers (on the right). Registers sit at the top (higher speed, closer to the CPU, more costly); the lower levels offer the highest capacity at the lowest cost.
resource (and are therefore generally few in number, ranging from 1, 2 or 3 in simple
microcontrollers up to 128 or more in some large UNIX servers).
Moving down the pyramid, cost per byte decreases (and thus the amount provided
tends to increase), but the penalty is that access speed also decreases. A computer,
whether embedded, desktop or supercomputer, almost always comprises several of
the levels in the hierarchy.
Other levels are there for convenience or speed reasons, and since there are so many
levels in the hierarchy, there are several places capable of storing required items of
information. Thus, a convenient means is required to transfer information between
locations as and when required.
processed must come from registers). This data is transported from registers to processing units via buses. Results will then be sent back to registers, again by bus. It is often convenient to group all internal registers together into a bank. In addition, in a regular architecture machine every processing unit will be connected to this bank of registers; Chapter 4 will explore the effect of different bus arrangements on performance. Here, we can be content with the assumption that
such things as internal buses do exist.
Given a (possibly quite complex) bus interconnection network inside a CPU, plus
multiple internal functional units and registers that connect to this, the question arises
as to what arbitrates and controls data transfers across and between the buses.
Consider the steps that must be carried out for every instruction:
• Fetch instruction.
• Decode instruction.
• Execute instruction.
• Save result (if any) of instruction.
Furthermore, there needs to be a method of ensuring that these steps occur and do
so in the correct order. This presupposes the need to have a set of control wires and
signals within a device from some control unit to each of the on-chip units that must
be controlled.
In early processors, the control unit was a simple finite state machine (FSM) end-
lessly stepping through one of several predefined states. Control wires ran from this to
each of the endpoints requiring control in a spider-web of wires and interconnects. We
will see this method in more detail when we design our own processor in Chapter 8.
Control is not only needed to fetch and distribute instructions, it is also needed
for carrying out the actions of single instructions. Consider the case of performing a
simple data transfer from register A to register B (LDR B, A) across a single 32-bit bus
as shown in Figure 3.2.
The two triangles within the figure are tristate buffers — devices similar to a switch
in that when the control signal is enabled, signals can pass through the buffer but when
the control signal is disabled, signals do not pass through the buffer. This is used in a
bus (for example) to decide which register is allowed to drive the bus wires. Only a
single register can drive a bus at any one time, so all other tristates connected to that
bus must remain turned off.
Figure 3.2: A block diagram of a very simple computer control unit showing two registers, each with selectable tristate buffers and a single 32-bit bus connecting all ports.

Bearing this in mind, the actions that need to be taken for a data transfer are summarised here:
1. Turn off any tristate buffers driving the bus (in this case de-assert ena1 to ena4).
2. Assert ena2 to turn on the 32-bit tristate, driving the content of register A onto the
shared bus.
3. Assert ena3 to feed the bus data into register B.
4. De-assert ena3 to lock the bus data into register B.
5. De-assert ena2 to free up the bus for other operations.
Perhaps the details of the process will differ from device to device (in particular
the enable signals are usually edge-triggered on different clock edges), but something
like this process is needed — in the order given — and more importantly sufficient time
is required between stages for:
• 1 to 2 — Wait for the 'off' signal to propagate along the control wires, hit the tristate buffers and for them to act on it.
• 2 to 3 — Wait for the bus voltage to stabilise (i.e. the content of register A to be reflected by the bus voltage levels).
• 3 to 4 — Give the register sufficient time to capture the bus value.
• 4 to 5 — Wait for the control signal to hit the register and the register to stop 'looking at the bus' before the bus can be freed for another purpose.
Figure 3.3: An illustration of the cycle-by-cycle timing of the simple control unit that was shown in Figure 3.2 as it transfers data from register A to register B. Darker lines indicate that the particular bus or signal is active at that time.
Some instructions are likely to need specialised handling that extends the state
machine further. CPU designers generally cater for this by increasing the complexity
of the state machine to handle such exceptions to the rule, all in the quest to increase
runtime efficiency.
Over the years, more and more weird and wonderful instructions have been intro-
duced. It does not take a genius to figure out where they all have ended up — more and
more complex state machines! In some cases, the CPU control unit became the most
complex part of the design and required up to half of the on-chip area. In other cases,
the state machine was so complex that it was itself implemented as another CPU — in
effect a simpler processor handling the control needs of a larger and more complex one.
In IC design terms (as in many other fields), complexity is known to lead to errors and
for these reasons alternatives were researched.
So far, we have only considered the case of handling different instructions within a
processor. Now, let us consider the actual task of distributing the control signals across
larger and ever-growing IC sizes with increasing numbers of internal bus interconnects,
larger register banks, more functional units and a larger degree of clocking complexity
and flexibility. It is to be expected that a larger degree of the internal processor routing
logic (i.e. wires that traverse the device from one side to another) is going to be needed.
This presents difficulties beyond the complexity of instruction control. It turns out that
in a silicon IC, the interconnects that can reach across an entire chip are a scarce resource:
these are normally reserved for fast data buses. The need to utilise more and more of
these for dedicated control purposes has provided another impetus to the research of
alternative control strategies.
Three general methodologies resulted, namely distributed control, self-timed control
and simplification (increased regularity). The main example of distributed control is in
the use of microcode, explored in Section 3.2.5. An example of simplification is in the
move to RISC processors, explored in Section 3.2.6. Let us briefly examine each control
method.
Figure 3.4 shows part of the internals of a very simple CPU. There are four
registers in a bank and two arithmetic logic units (ALUs) all connected through two
shared data buses. At each point of bus entry/exit there is a tristate buffer. Each bus, tristate, register and ALU port is several bits wide.

Figure 3.4
A block diagram of the centralised control wiring required for a very simple CPU.
Evidently, the thin control wires emanating from the control unit are many, even for
such a simple system. These are used to control each of the tristate buffers and the mode
of the ALUs (which can perform several selectable functions). Some, such as register-
select logic, are not shown. In Chapter 4 and beyond, different bus arrangements will be
discussed, but control signals such as these will not be shown in subsequent chapters:
diagrams simply become too complicated.
One simplification that can be introduced is the use of a control bus or several
control buses. Instead of two control signals needed for each register as in Figure 3.4,
the fact that each data bus can only carry a single value at a time can be exploited to
need only a 2-bit selection bus to drive each data bus (i.e. 4-bit total control for the
system shown). This is termed a register-select bus. Such an approach may not seem
particularly beneficial in a four-register system, but with 32 registers it would reduce
the number of register-select control wires from 64 to 6. A small example is shown in
Figure 3.5.
The number of wires emanating from the control unit to the register bank in Fig-
ure 3.5 is four. These are decoded in the register bank itself to select the appropriate
register. This is not necessarily minimising logic, but is minimising the number of con-
nections around the CPU.
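The saving comes from encoding: a decoder inside the register bank expands the few select wires back into individual enables. A minimal C sketch of that decoding step (the register count and signal names here are illustrative only):

#include <stdio.h>

#define NUM_REGS 32   /* 32 registers: 5 select wires instead of 32 enable wires */

int main(void)
{
    unsigned select = 17;                  /* value driven on the register-select bus */
    unsigned enable[NUM_REGS] = { 0 };     /* one-hot enables generated inside the bank */

    for (unsigned i = 0; i < NUM_REGS; i++)
        enable[i] = (i == select);         /* exactly one register drives the data bus */

    printf("register %u enabled\n", select);
    return 0;
}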
To summarise, control is needed for arbitration of internal buses, for initiating the
fetch, decoding and handling of instructions, for interactions with the outside world
(such as I/O interfacing) and pretty much everything sequential in a CPU, which is a
great deal. Control may even extend to handling external memory, and the next chapter carries an important example of this in the memory management unit.
execution, the process is also fairly common — drive some registers onto buses, drive values from buses into one or more functional units, then some time later allow the result to be collected (again using one or more buses) and latched back into registers.
Self-timed control in this instance does not imply an asynchronous system since
each block is synchronous, albeit to a faster clock (note that self-timing is used within
some esoteric asynchronous systems which we will explore in Chapter 9, but in this
case we are only dealing with synchronous logic).
A centralised control unit could specify in turn ‘fetch now’ then ‘decode now’ then
‘execute now’ and finally ‘store now’. This would require control connections from the
IC areas responsible for each of these tasks, back to the central control unit. However,
the self-timed strategy requires the control unit to simply start the process of instruction
fetch. The signal ‘decode now’ would be triggered from the fetch unit and not from a
central location. Similarly, ‘execute now’ would be a signal generated by the decode
unit and passed to the execute unit. In this way, a control interconnect is needed from
each unit to the next unit, but not all going to a single central location. In effect, the
control signals are actually following the data paths, something that becomes even more
effective in a pipelined machine (which will be covered in Chapter 5).
The two alternative approaches of centralised and self-timed control are shown
in the flowcharts of Figure 3.6. In this case, data buses are not shown which would
originate from external memory and traverse the fetch, decode, execute and store
(FDES) string. On the left is shown a control unit with four control buses, each one
linked to the enable inputs of the four separate units. At the relevant times as specified
in an internal state machine, the control unit will initiate operations in the FDES units.
Depending upon the instruction being processed, the control unit state machine
may need to operate the FDES differently (perhaps a longer execution stage or skip the
store). This knowledge must be encoded within the control unit, which must remember
every combination of operations for every unit connected to it.
The state machine must firstly contain detailed information on the timings and
requirements of each unit. It must also keep track of potentially multiple instructions
progressing simultaneously through these units.
Figure 3.6
Control flowcharts of the alternative strategies of centralised control (left) and self-timed control (right).

On the right-hand side of Figure 3.6, a self-timed system is shown: the control unit still initiates the process, but in this case each subsequent unit is initiated from the previous unit as and when necessary. Since the units themselves initiate the next step, the data buses (not shown) are assumed to have the correct information at the correct times.
Depending upon the instruction being processed, units may decide to skip them-
selves and pass the request directly to the next unit. Each unit must thus encode the
knowledge of its own responsibilities and timings.
Perhaps more problematic is the need to convey different information to the various
units. For example, the execute unit needs to know what function is to be performed —
is itan AND, OR, SUB and so on. It does not need to know where to store the result
from the execution — this information is needed by the store unit which in turn does
not need to know what function was performed. In the self-timed case, either a full
record of needed information is passed between units, with units only reading the
items relevant to them, or there is a centralised store for such information. The choice of
implementation strategy depends upon complexity and performance requirements.
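The contrast between the two strategies can be caricatured in C, with each self-timed unit triggering its successor while the centralised controller sequences everything itself. The function names are invented for the illustration:

#include <stdio.h>

/* Self-timed: each unit starts the next one on its own. */
static void store_unit(void)   { printf("store\n"); }
static void execute_unit(void) { printf("execute\n"); store_unit();   }
static void decode_unit(void)  { printf("decode\n");  execute_unit(); }
static void fetch_unit(void)   { printf("fetch\n");   decode_unit();  }

/* Centralised: one control unit issues every 'now' signal itself. */
static void central_control(void)
{
    printf("fetch\ndecode\nexecute\nstore\n");
}

int main(void)
{
    central_control();   /* every stage driven from one central location */
    fetch_unit();        /* control unit only starts the fetch; the rest chains on */
    return 0;
}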
3.2.5 Microcode
As CPUs grew and became more complex, they ended up as an amalgamation of basic
required functionality, with assorted customised one-off instructions, some of which
were past their sell-by-date, an example from the 1980s being the binary-coded-decimal
handling instructions of the Intel 8086, 80386, 80486 processors required for backwards
compatibility with several decades-old legacy business software. The commercial drive
was for greater processing speed, and that was achieved partly through increasing clock
rates and partly through performing more functions with a single instruction.
Much of this drive for complex instructions came from the disparity between the
speed of external and internal memory. Internal memory cost an enormous amount of
money, but was up to 1000 times faster than external memory. A big bottleneck was
dragging an instruction from external memory into the processor. It therefore made
perfect sense to create a single complex instruction that replaced a sequence of 100
separate smaller instructions.
In fact, it was possible to think in terms of tokens. The external program was written
in tokens (instructions), fed slowly into the CPU, each causing a longer sequence of in-
ternal operations. Each token could launch a sequence of internal operations, and these
internal operations in turn were really programs, written in microcode. Microcode was
the basic instruction set of these processors, but often did not particularly resemble the
external instructions. Every external instruction would be translated into a microcode
program or microprogram, upon entering the CPU.
Microprogramming, as a technique, was actually invented by Maurice Wilkes in the early 1950s at Cambridge University, although one of the IBM System/360 family of computers was probably the first commercial machine to implement this technology.
Some of the microcoding concepts are illustrated in Figure 3.7 where an external program in slow memory is being executed by the CPU. The current program counter (PC) is pointing at the instruction DEC A, presumably a command to decrement register A. This is fetched by the CPU and decoded into a sequence of microcode instructions
to load register X from A, then load register Y with 1, then subtract Y from X and finally
to store the result back in A.
The small four-instruction microprogram that the DEC instruction launches is con-
tained entirely inside the CPU, in fast, on-chip read only memory (ROM), and requires
an internal microprogram counter. None of this is visible from the ‘outside world’ of
the external program which may not even know that registers X, Y and Z exist inside
the CPU.
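One way to picture this arrangement is as a lookup from external opcode to a small table of micro-operations held in on-chip ROM, stepped through by the microprogram counter. The C sketch below is purely illustrative; the micro-operation names are invented and do not correspond to any real device:

#include <stdio.h>

/* Invented micro-operations for the illustration. */
typedef enum { UOP_LOAD_X_A, UOP_LOAD_Y_IMM1, UOP_SUB_X_Y, UOP_STORE_A_X, UOP_END } uop_t;

/* Microprogram launched by the external instruction DEC A, held in fast on-chip ROM. */
static const uop_t dec_a_microprogram[] = {
    UOP_LOAD_X_A,      /* X <- A     */
    UOP_LOAD_Y_IMM1,   /* Y <- 1     */
    UOP_SUB_X_Y,       /* X <- X - Y */
    UOP_STORE_A_X,     /* A <- X     */
    UOP_END
};

int main(void)
{
    int a = 10, x = 0, y = 0;
    /* The internal microprogram counter steps through the ROM. */
    for (const uop_t *upc = dec_a_microprogram; *upc != UOP_END; upc++) {
        switch (*upc) {
        case UOP_LOAD_X_A:    x = a;     break;
        case UOP_LOAD_Y_IMM1: y = 1;     break;
        case UOP_SUB_X_Y:     x = x - y; break;
        case UOP_STORE_A_X:   a = x;     break;
        default:                         break;
        }
    }
    printf("A = %d\n", a);   /* prints 9: DEC A executed as four micro-operations */
    return 0;
}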
Extending this approach further led to a processor which used nanocode: external
programs would be converted to a microprogram of microcode instructions, each of
which would in turn translate to a nanoprogram of nanocode instructions! Despite
the elegance of this Cat-in-the-Hat technique, there were decreasing returns with the
microcode approach. It relied upon the fact that external memory was a bottleneck.
In the days when external random access memory (RAM) was expensive and slow, but internal ROM was very fast, this was undoubtedly true. But then advances in RAM technology, including static RAM (SRAM), dynamic RAM (DRAM) and then synchronous dynamic RAM (SDRAM), all chipped away at the speed advantages of ROM, so that by the 1990s there was little difference between the technologies.
With minimal speed advantage, the popularity of microcode began to wane.
An exception was where the benefits of instruction translation were required. This feature is inherent in the microcode approach, and allows a CPU of one type to use the instruction set of another machine.
In the late 1990s, processors were being developed that were internally RISC machines, but which could execute CISC instruction sets (see next section). Nowhere was
this advantage more clear than with the humble x86 series of processors. With a design
heritage harking back to 1971, these CPUs had to not only guarantee backwards code
compatibility by executing an ancient and poorly-optimised CISC instruction set, but
had to do this faster than competing processors. The old-fashioned CISC instructions
that entered some of these processors would be translated into sequences of much
faster optimised RISC-style assembler. The RISC instructions thus took the place of
modern-day microcode.
A further advantage of the microcode translation was the design of a processor that
could mimic other devices. Such a device could execute an ARM program as if it were
a native ARM processor, and then switch to executing Texas Instruments DSP code as
if it were a TI DSP: the ultimate approach to being all CPUs to all programmers.
Despite such niche markets, the driving factors behind microcode disappeared,
and it became less popular in the 1980s. The trend was constantly towards doing more,
and doing it faster: Moore’s Law in full swing.
Figure 3.8
... instructions (bottom) require multiple cycles to execute and often accomplish more per instruction than in the RISC case.
However, this account does not quite describe the two approaches in context and
for that we require a little hindsight. Taking a historical perspective, early computers
were operated by the designers of the machines themselves. Designers knew what
basic operations were required in their programs and catered for these directly in hard-
ware. As hardware became more capable, it became possible to add instructions to the
computer that could perform functions that would otherwise require time-consuming
strings of instructions.
As time progressed, computer programmers concentrated on software develop-
ment, and computer architects specialised in the hardware aspects. Programmers would
then approach architects asking for custom instructions to make their programs faster.
Architects often complied, but sometimes took the initiative to add what they imagined
were useful instructions, but which left the programmers scratching their heads.
By the mid-1980s, various design groups, most notably at Berkeley and then Stan-
ford universities in the USA, began to question the prevailing design ethos. They were
probably prompted in this by groundbreaking work performed quietly at IBM, in which
less complex machines that could clock much faster because of simple and regular de-
sign, were investigated. These machines demonstrated that simple instructions could be
processed very quickly. Even though sometimes a few RISC instructions were needed to perform the same operation as a single CISC instruction, a RISC program was typically still significantly faster overall.
The name Reduced Instruction Set Computer pays tribute to the simplicity of the original designs, although there was no actual reason to reduce the size of the instruction set, just to reduce the complexity of the instructions. Groups that popularised RISC technology produced, in turn, the RISC I, RISC II and MIPS processors. These evolved
into commercial devices delivering powerful workstation performance where back-
wards compatibility with x86 code was not required, namely the SPARC and MIPS
devices.
In the meantime, over in Cambridge in the UK, a tiny design group at Acorn Com-
puters Ltd, the highly successful producer of the 6502-powered BBC microcomputer
range (that contributed to the UK having the highest rate of computer ownership in the
world), had designed their own processor, based on the earliest Berkeley work. This
Acorn RISC Machine, the ARM1, was designed on a 2-MHz BBC microcomputer running
BASIC. Acorn wrote their own silicon design tools for this processor which was very
soon followed by the ARM2, which became the world’s first commercial RISC process-
ing chip. This powered the novel Acorn Archimedes range of computers. By 2002, ARM,
now renamed Advanced RISC Machine, became the world’s top-selling 32-bit processor, claiming 76.8% of the market. By mid-2005, over 2.5 billion ARM processor-powered products had been sold, and by the start of 2009 that had increased to more than one sold for every person on the planet. The popularity of the ARM processor continues to increase. Box 3.1 briefly explores the background to the development of the amazing ARM processor.
While Intel rode the wave of the desktop personal computer boom, the ARM architecture is riding the much larger wave of the embedded processor boom. CPUs are now
Box 3.1
In the mid-1980s, groundbreaking British computer company Acorn, with a contract
from the British Broadcasting Corporation (BBC) to design and market BBC micro-
computers was looking for a way to move beyond their hugely successful 8-bit BBC
microcomputers. These were powered by the lean and efficient Rockwell 6502 proces-
sors. The BBC initiatives had encouraged computer use in the UK so much that there
were reportedly far more computers per capita in England than anywhere else in the
world. Sir Clive Sinclair’s ZX Spectrum for example, had sold 4 million units by the
time sales of the IBM PC had reached 1 million units. Acorn is also reputed to have
sold over 1 million BBC computers overall.
In the early explosion of the ‘computer revolution’ it quickly became apparent
to Acorn that 16-bit processors from companies such as Intel and Motorola were not
powerful enough to meet their projected future needs — needs which included releasing
the world’s first multi-tasking graphical desktop operating system in the late 1980s
(later some observers would conclude that this was copied by Microsoft as the basis
for Windows 95, XP and beyond).
In typical pioneering fashion, Acorn decided that, since nothing good enough
was available, they would create their own processor. They designed the ARM1 and
its support ICs (such as MEMC and VIDC) within two years despite having never
developed any silicon previously.
Acorn wanted a machine with a regular architecture — similar to the 6502, but vastly
more powerful. They chose to use the RISC approach, but revisited their software needs
by analysing operating system code to determine most used instructions which they
then optimised for the ARM processor. The same approach yielded an instruction set
(see Section 3.3) and its coding. Later, much needed additions were the multiply and
multiply-accumulate instructions.
This heritage leaves the globally successful ARM processor with a direct link back
to the UK Government-funded BBC initiatives: the ARM software interrupt, supervi-
sor modes, fast interrupt, no microcode, static pipeline, load-store architecture are all
derived either from the hardware or the software architectures adopted by Acorn.
inside almost every electronic product and most of these are ARM-based. Meanwhile,
Acorn itself no longer exists, having self-destructed in 1999.
3.2.7 Example Processors
Over the years, since the IBM research group published their initial results, the RISC approach has impacted almost every sphere of processor design. In particular, the ARM RISC processor family now dominates the world of embedded systems. Therefore, in this book almost all assembly language code examples are given in ARM assembler format. For example:

ADD R0, R1, R2
adds together the contents of registers R1 and R2 and stores the result in
register RO.
Today, although it is easy to find examples of ‘pure’ RISC processors such as
the ARM and MIPS, even the die-hard CISC devices (such as Motorola/Freescale
68000/Coldfire and some of the Intel x86 range) are now implemented with CISC-
to-RISC hardware translators and internal RISC cores. Pure CISC processors do not
seem to be popular these days. For this reason, when referring to CISC processors we
define a pseudo-ARM assembler format, rather than use the format from any particular
CISC device:
ADD A, B, C
adds together registers B and C, placing the result in register A. Usually, examples
in this text are identified as being RISC or CISC, and can otherwise be differentiated
because the RISC examples use ARM-style registers RO to R15 whereas CISC examples
use alphabetical registers A, B, C and so on. Some special-purpose registers are also
mentioned in later sections; SP is the stack pointer, LR is the link register.
The only exception to the use of pseudo-ARM instructions in this book is in
discussions relating to the Analog Devices ADSP21xx processor and a single Texas
Instruments TMS320 example. The ADSP in particular uses an assembly language
that is structurally similar to the C programming language, and therefore quite eas-
ily readable. These exceptions will be highlighted at the time the code segments are
presented.
Note that some processors, most notably the 68000, would actually specify the
destination register last instead of first as in the ARM. However, in this book the desti-
nation register is always specified ARM style, and any comment is written following a
semicolon (‘;’):
SUB R3, R2, R1    ; R3 = R2 - R1
Sometimes the destination and first source register are the same:
ADD R1, R1, R2    ; R1 = R1 + R2
3.3 Instruction Handling
As mentioned in Section 3.2, computers are operated through sequences of instructions
known as programs. The generic term for such programs is software. Various schemes
exist for creating software through writing in high-level languages (HLL), where each
HLL command is made up of a sequence of perhaps several tens of CPU instructions.
In low-level languages, typically each command invokes few, or perhaps only a single
CPU operation.
If we define a CPU operation to be some data move or logical transaction by the
CPU, an instruction is a command to the CPU from a program (which results in one or
more CPU operations). A HLL command is made up of one or more instructions, and
a stored program is a list of such instructions.
In some computers, a single instruction can be used to invoke multiple CPU opera-
tions. This may be required for performance reasons, especially where the rate at which
a program can be read from external memory is far slower than the speed at which the
processor can execute the operations. In fact, this thinking led in the past to the idea of
microcode (explored in Section 3.2.5).
Machine code is the name given to (usually) binary numerical identifiers that cor-
respond to known actions in a CPU. This may mean, for example, that when exam-
ining program memory, hexadecimal byte 0x4E followed by byte 0xA8 might repre-
sent two instructions in an 8-bit processor, or a single instruction, 0x4EA8, in a 16-bit
processor. In modern processors, programmers are very rarely required to deal with
the underlying binary numerical identifiers that the processor understands, but han-
dle these through a set of abbreviated mnemonics called assembly language or assem-
bly code. It is this code that is produced and stored when compiling a HLL into an
executable.
The instruction set is a list of the possible assembly language mnemonics. It is a list
of all instructions supported by a particular CPU.
Many processors would add an FPU or MMX group to those defined, but the
ADSP2181 is a fixed point only processor with no multimedia extensions.
The instruction set for the ARM processor, specifically the ARM7, is shown for
reference in Figure 3.9 (note this shows the ARM mode instructions and does not include
the 16-bit Thumb mode that many ARM processors also support). Notations used in
the instruction set table include the following:
Figure 3.9
The ARM instruction set in a tabulated format. Columns are headed by the instruction word bit they contain. All 14 classes of instruction available in this version of the ARM instruction set are shown.
Many of these modifiers are specific to the ARM processor and will not be con-
sidered further in the text. However, we shall look in more detail at the ‘S’ bit and
the addressing capabilities (see Section 3.3.4). The interested reader is referred to the
ARM Ltd website¹ where further explanations and documents are available. The instruction set varies slightly among ARM processors. The version shown above is the more common ARM7TDMI version.²
Recently, ARM have completed a rebranding exercise in which their processors
are now known as Cortex devices. The original ARM7, ARM9 and ARM11 devices are
termed ‘classic’. Most likely, this move has been an effort to counter the fragmentation
of the huge ARM market in which one basic architecture (the ARM) was required to
span a very wide and diverse set of needs, ranging from tiny and slow sensor systems
to larger and faster handheld computers. At the time of writing, the new processors are
classed into three ranges which better subdivide the traditional strength areas for ARM
devices:
Cortex-A series processors are application-oriented. They have the in-built hard-
ware support suited for running rich modern operating systems such as Linux, with
graphically rich user interfaces such as Apple’s iOS and Google’s Android. The pro-
cessing power of these runs from the efficient Cortex-A5, through the A8, A9 and up to
the highest performance Cortex-A15 device. All support ARM, Thumb and Thumb-2
instructions sets (Thumb-2 reportedly improves upon Thumb in terms of performance
and compactness).
Cortex-R series devices are targeted to real-time systems that also have significant
performance requirements. These include smartphone handsets, media players and
cameras. The ARM company is also promoting Cortex-R for automotive and medical
systems; ones in which reliability and hard real-time response are often important.
These probably do not require complex and rich operating systems, just small, hard
and fast real-time arrangements. At the time of writing, only the Cortex-R4 is available,
and has already found its way into many real-time systems in use worldwide.
Cortex-M family processors are at the lower end of the range for use in very cost-sensitive and low power systems. It could be argued that these are for traditional microcontroller-type applications that probably do not need advanced operating system support (and possibly do not need any OS). These are for applications that do not have rich user interface requirements, and for which the clock speed will be no more than several tens of MHz. At the time of writing, the Cortex-M0 is the entry device, beyond which the M3 and M4 provide increasing levels of performance.

1. http://www.arm.com
2. This information was extracted from ARM Ltd Open Access document DDI 0029E.
Although most variants of the ARM7 support a 16-bit Thumb mode (see Sec-
tion 3.3.3), all ARM7 devices support the standard fixed length 32-bit instructions
shown above. It can be seen that, as in the ADSP21xx, there are various groups of
instructions, such as data processing, multiply or branch. With 15 instruction groups,
4 bits are needed to represent the instruction group and further bits are used within
this to represent the exact instruction in each group.
Notice the fixed condition bits available for every instruction. No matter which
instruction is being used, these bits are located at the same position in the instruction
word. This regularity aids in instruction decoding within the processor. It is important
to note that the consequence of this is that every instruction can operate conditionally.
This is unusual, and among common modern processors is found only in the ARM: most
other processors support conditional branch instructions only. In the ARM, the S bit
within many instruction words controls whether that instruction can change condition
codes on completion (see Box 3.2). These two features, when used in conjunction with
each other, are very flexible and efficient.
Also, note that for every instruction, the destination register (if required) is in the
same place in the instruction word. This further regularity also simplifies the decoding
process.
Figure 3.10
Box 3.2
Consider the efficiency of the ARM processor compared to a mythical standard RISC processor that does not allow conditional operation for every instruction.
The instruction mnemonics used are similar to those of the ARM (but not com-
pletely realistic). First, we will examine the program on the standard RISC processor
that adds the numbers in registers RO and R1 and then, depending on the answer,
either places a 0 in register R2 (if the result is less than 0) or places a 1 in register R2
otherwise.
     ADDS R0, R0, R1
     BLT  pos1        (branch if less than 0)
     MOV  R2, #1
     B    pos2
pos1 MOV  R2, #0
pos2 ...
The program occupies five instructions and will always require a branch no
matter what registers RO and R1 contain on entry.
The following code segment reproduces the same behaviour for the ARM pro-
cessor, but uses conditional moves to replace the branch. In this case, RO and R1 are
added. The S after the ADD mnemonic indicates that the result of the addition should
update the internal condition flags. Next, a value 0 is loaded into R2 if the result of the last condition-code-setting instruction is less than 0, and a value 1 is loaded into R2 if the result is greater than or equal to 0.
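A sketch of the three-instruction sequence being described, consistent with the behaviour of the five-instruction program above, is:

ADDS  R0, R0, R1
MOVLT R2, #0
MOVGE R2, #1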
The ARM version is obviously shorter — only three instructions are required, and
in this case no branches are needed. It is this mechanism that allows ARM programs
to be efficient whereas RISC processors are traditionally known for less efficient code
density. In higher level languages, the structure that leads to this code arrangement is
very common:
IF condition THEN
    action 1
ELSE
    action 2
Once the instruction fetch and decode unit reads an instruction, it begins to decode that instruction which then flows through steps as shown in the flowchart of Figure 3.11.
Figure 3.11
fetch instruction → decode instruction → fetch operand → execute instruction
Box 3.3
The ARM, as we have seen in Figure 3.9, reserves 4 bits (bits 31, 30, 29 and 28) for
condition codes in every instruction. This means that every machine code instruction
can be conditional (although when written in assembly language there may be some
instructions which do not take conditionals).
Normally, the condition code is appended to the instruction. Thus, an ADDGT is an
ADD instruction that only executes when the condition flags in the processor indicate
that the result of the last instruction which set the condition flags is greater than 0.
The full set of ARM conditionals is shown in the table below (although strictly
the last two are unconditional conditionals!).
0101   PL   plus   N = 0
Box 3.4
The MOV is 32 bits long like all ARM instructions. Its structure is shown below.

4-bit cond | 0 | 0 | 1 | opcode | S | Rn | Rd | 4-bit rotation | 8-bit value

or
The 4-bit condition code is common with all other ARM instructions, the opcode defines the exact instruction in the data processing class, Rn is the first operand register and Rd is the destination register; when bit 25 = 0, the second operand is instead taken from a third register, Rm.
We will concentrate on the top form of the command, where an 8-bit immediate
constant and 4-bit rotation are supplied (the actual rotation to be applied is twice the
value supplied here). Where the opcode specifies a MOV instruction, the immediate,
rotated by the degree specified is loaded into the destination register. Here are some
examples:
However, the assembler would be likely to complain (‘number too big for immediate
constant’ or similar) since the 32-bit value that is specified cannot fit into an 8-bit
register no matter what degree of shift is required. Some assemblers and more
experienced programmers would know that they can simply convert the instruction
to a ‘move NOT’ instead:
MVN RO, #0x0F000000 ;Rd=0, Rn=0, rotation = 12, value = 0x0F
As you can see, despite the relatively small immediate value size that can be accommo-
dated within the instruction field, this allied with the instruction flexibility and shift
value, can actually encode quite a wide variety of constants.
In the case of the ARM, immediate constants are loaded into a register with the MOV instruction (in the data processing instruction group). An immediate value can be located inside the section labelled ‘Operand 2’ in the ARM instruction set (Figure 3.9). However, not all of the operand is used for holding the constant. In fact, only an 8-bit immediate value is catered for, with the remaining 4 bits used to specify a rotation.
So, although the processor has 32-bit registers, only an 8-bit number can be loaded.
However, due to the rotation mechanism (with 4 bits for rotation this can specify 15
positions either left or right), a large variety of numbers can result. Box 3.4 looks in
detail at the bitfields present in the ARM processor MOV instruction, to see how these
impact the flexibility of one variant of the instruction.
Many processors work differently. They generally allow at least a 16-bit constant to
be loaded immediately and the 16 bits are encoded as part of the instruction word. CISC
processors often have variable length instructions or use two consecutive instructions.
A variable length instruction may be 16-bits long when only an 8-bit constant is to
be loaded, or 32-bits long when a 16-bit or 24-bit constant is loaded. Variable length
instructions require the instruction fetch unit to be fairly complex, and thus a more
simple method of achieving a similar result is to use two consecutive instructions. The
first instruction may mean ‘load the next instruction value to register R2’ so that the
IFU simply reads the next value directly into the register rather than trying to decode it.
This evidently means that some instructions require two instruction cycles to execute,
and imposes a timing penalty, especially in pipelined processors (Section 5.2).
For the example of the ARM processor, although the restriction in immediate values
exists, in practice many constants can be encoded with an 8-bit value and a shift so that
this does not translate into a significant performance bottleneck. The ADSP2181 handles
immediate loads in a similar fashion and has been designed for high-speed single-cycle
operation.
Box 3.5
An example processor has five instructions for which an analysis of the 1000 instruction software program that it runs reveals the following occurrences:

CALL 60, ADD 300, SUB 80, AND 60, MOV 500

If an equal number of bits were used to represent each instruction in this instruction set, 3 bits would be needed (since that would allow up to eight possibilities). Ignoring
any operands, 1000 x 3 bits = 3000 bits are required to represent that program.
The processor designers wish to use Huffman coding to reduce the program size.
First, they calculate the probability of each instruction (by dividing each occurrence
by the total number of instructions):
Next, these are ordered in a list in terms of probability. The lowest two probabilities
are combined and the list re-ordered:
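The remaining merging steps can be reproduced mechanically. The C sketch below, using the occurrence counts listed above, repeatedly merges the two least frequent entries and then reads off each instruction's code length and the total program size for comparison with the 3000-bit fixed-length encoding:

#include <stdio.h>

#define NSYM 5

int main(void)
{
    const char *name[NSYM] = { "CALL", "ADD", "SUB", "AND", "MOV" };
    int count[NSYM] = { 60, 300, 80, 60, 500 };

    /* Up to 2*NSYM-1 tree nodes: leaves first, then internal merge nodes. */
    int weight[2 * NSYM - 1];
    int parent[2 * NSYM - 1];
    int active[2 * NSYM - 1];
    int n = NSYM;

    for (int i = 0; i < NSYM; i++) {
        weight[i] = count[i];
        parent[i] = -1;
        active[i] = 1;
    }

    /* Huffman construction: repeatedly merge the two smallest active weights. */
    while (n < 2 * NSYM - 1) {
        int a = -1, b = -1;
        for (int i = 0; i < n; i++) {
            if (!active[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[n] = weight[a] + weight[b];
        parent[n] = -1;
        active[n] = 1;
        parent[a] = parent[b] = n;
        active[a] = active[b] = 0;
        n++;
    }

    /* Code length of each symbol = depth of its leaf in the merge tree. */
    long total = 0;
    for (int i = 0; i < NSYM; i++) {
        int len = 0;
        for (int j = i; parent[j] >= 0; j = parent[j]) len++;
        printf("%-4s count %3d -> %d-bit code\n", name[i], count[i], len);
        total += (long)count[i] * len;
    }
    printf("Total: %ld bits (vs 3000 bits at 3 bits per instruction)\n", total);
    return 0;
}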
instruction. Only the most common instructions were made available in Thumb mode.
The binary encoding within the fixed 16-bit word used to represent an instruction is
length coded based on the number of bits required for the other operands.
Some features of the Thumb instruction set are as follows:
The Thumb instruction set is significantly more complicated than the ARM instruction
set, although the decoding process (from Thumb instruction fetched from memory to
ARM instruction ready to be executed inside the processor) is automatic and very fast.
The following are some example instructions:
From the limited examples shown here, it can be seen that the few most significant
bits identify the instruction. These actually range from 3 bits to 9 bits in length across
the entire instruction set. In the case of the ADD instruction shown, the register it
operates on is fixed: it is an add to stack only — the flexibility and regularity of the ARM
instruction set, where almost all instructions operate on any registers, is lost — but the
most common operations found in software are catered for.
It should be noted at this point that the Thumb instruction set, being 16 bits wide,
really operates at its best when the interface to external memory is 16 bits, in which
case each ARM instruction would require two memory cycles to be retrieved (and thus
the processor would run half as fast as it should), whereas the Thumb code could be
executed at full speed.
Of course, not all possible operand types are suitable for all instructions, and even
so may not be available on some processors (for example RISC processors, being load-
store, typically limit the operands of arithmetic instructions to registers, whereas in
CISC processors they may be located in memory or elsewhere). A final point to note is the assumption in the two bottom examples above that the first operand written is the destination — which is true for ARM assembly language, but is reversed for some other processors (see Section 3.2.7). This can be a real cause for confusion when writing assembler code for different processors (and is an occupational hazard for computer architecture lecturers/authors).
The term addressing mode refers to the method of specifying a load or store address, using one of several different techniques. The following table lists the common
Immediate addressing MOV RO, #0x1000 Move hexadecimal value 0x1000 to register RO
Absolute addressing LDR RO, #0x20 Load whatever is in memory at address 0x20 into RO
Register direct NOT RO, R1 Take content of R1, NOT it and store inside RO
addressing
Register indirect LDR RO, [R1] If R1 contains value 0x123, then retrieve contents of
addressing memory location 0x123, and place it in RO
Stack addressing PUSH RO In this case, the contents of RO are pushed onto the stack
(and the assumption is of only one stack)
The following extensions and combinations of the basic idea are also common:
Register indirect with immediate offset: LDR R0, [R1, #5] - the second operand is the memory address R1 + 5 = 6.
Register indirect with register indirect index: STR R0, [R1, R2] - the second operand is the memory address R1 + R2 = 3.
Register indirect with register indirect index and immediate offset: LDR R0, [R1, R2, #3] - the second operand is the memory address R1 + R2 + 3 = 6.
Register indirect with immediate scaled register indirect index: STR R0, [R1, R2, LSL #2] - the second operand is the memory address R1 + (R2 << 2) = 9.
(These examples assume R1 = 1 and R2 = 2.)
Various processors, including the ARM and the ADSP2181, also offer an automatic way to update registers after they have been used to perform offset addressing. For example, a register indirect access with immediate offset could leave the register used in the access updated after addition of the offset. This is shown in the following examples where R1 = 22:

LDR R0, [R1], #5     Load R0 with content of memory address 22 and then set R1 = 22 + 5 = 27
LDR R0, [R1, #5]!    Set R1 = 22 + 5 = 27 and then load R0 with content of memory address 27
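For readers more comfortable in C, the two auto-update forms correspond closely to post-increment and pre-increment pointer access. The sketch below mirrors the two loads above, with the caveat that the offset here is counted in array elements rather than bytes and the names are purely illustrative:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t mem[40] = { 0 };
    mem[22] = 100;
    mem[27] = 200;

    uint32_t *r1;
    uint32_t r0;

    /* Post-indexed, like LDR R0, [R1], #5 : use the address first, then update it. */
    r1 = &mem[22];
    r0 = *r1;          /* loads mem[22] = 100 */
    r1 += 5;           /* R1 now points at element 27 */
    printf("post-indexed: r0 = %u\n", r0);

    /* Pre-indexed with writeback, like LDR R0, [R1, #5]! : update first, then use. */
    r1 = &mem[22];
    r1 += 5;           /* R1 updated to element 27 before the access */
    r0 = *r1;          /* loads mem[27] = 200 */
    printf("pre-indexed:  r0 = %u\n", r0);

    return 0;
}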
Note that it is not our intention here to teach the details of the ARM instruction set,
but merely to use it as a teaching aid for the underlying addressing techniques.³
It is instructive to analyse the limitations that caused CPU designers to provide
certain levels of functionality within a processor — and this is rarely more revealing
than in the consideration of the instruction set. In this regard, CISC processors are
more interesting. Some examples are given below from an imaginary CISC processor,
where main memory locations mA, mB and mC are used for absolute operand stor-
age, and a RISC processor, where registers RO, R1 and R2 are used for register direct
addressing:
buses: two to simultaneously transfer both operands and one to collect the result. Other alternative arrangements are equally possible.

3. Those who do wish to learn the ARM instruction set are recommended to refer to the book ARM System Architecture, by Steve Furber (one of the original inventors of the ARM processor).
The instruction word needs to contain three register values. However, with a
bank of 32 registers, only 5 bits are needed to specify each register, and so 15 bits
are used in total. This would easily allow the operation to be encoded in a 32-bit
instruction.
CISC processor: ADD mA, mB ;mA=mA+mB
Similar to the first example, the CPU must read two external memory locations
to retrieve the operand values, requiring two memory bus cycles. It also needs to
transfer the result back to memory and thus execution time is unchanged.
However, the instruction word this time only needs to contain two absolute
addresses instead of three. This would be achievable in a real system, especially if
an absolute value is used for the first operand address and an offset used for the
second one.
CISC processor: ADD mB ;ACC=mB+ACC
The CISC processors of the 1980s and earlier commonly utilised accumulators.
These were general-purpose registers (the forerunners of the register bank) that
were used as an operand for all arithmetic and data mode operations and to hold the
result of those operations. The other operand was almost always an absolute value
from memory. In this case, the instruction requires a single value to be loaded from
memory prior to the addition and thus involves a single external memory bus cycle.
The instruction word needs to only contain a single absolute memory value,
which could be achieved by loading a second instruction word containing the ad-
dress (thus requiring two instruction fetches to be performed prior to instruction
execution).
Stack processor: ADD
This is a special case (that will be explored further in the next section and specifi-
cally in Chapter 8) where a CPU pops the top two stack entries, adds them together
and pushes the result back onto the stack. This needs to access a stack which would
be quick if it were an internal memory storage block, however, a stack would more
normally be located in off-chip memory. The main benefit with the stack approach
is that the instruction does not need to encode any absolute memory addresses.
Theoretically, this can make for an extremely small instruction width.
Polish notation was devised by the Polish logician Jan Łukasiewicz in the 1920s to place the operator before the operands, thus it is a prefix
notation. By specifying the operand in this way, operator precedence is unimportant
and parentheses are not required.
Reverse Polish notation (RPN) by contrast is a postfix notation where the order of
the equation completely defines the precedence. This was created during the 1950s and
1960s as an aid to working with a stack-based architecture. It was subsequently intro-
duced and loved (or hated) by two generations of Hewlett-Packard electronic calculator
users.
An example of RPN is b c ÷ a +, where the operands b and c are given first followed by the command to divide them and hold the result. Then operand a is loaded followed by the command to add the previous result to a and store the new result somewhere. Some further examples are shown below and in Figure 3.12.

a + b − c    becomes    a b + c −
(a + b) + c    becomes    a b + c +
Considering the operations taking place, it becomes evident that using a stack is a
very efficient method of performing RPN operations. A stack in this case is a storage
device with a single entry /exit point. Numbers can be pushed onto the ‘top’ of the stack
and then popped back off the ‘top’ of it. It is a last-in first-out (LIFO) construct.
An example of a stack operation performing ab + is shown in Figure 3.12, reading
from left to right. Some things to note are that only a single push occurs in each step
(likely to each take a single cycle in a stack-based processor) although the number of
pops required to feed an operation is determined by the number of operands required.
For example, an ADD requires two operands, so two POPs are used to load those to
the ALU. The result of each operation is PUSHed back onto the top of the stack.
Figure 3.12
Operands are pushed in turn onto the stack and the ALU then executes, popping the operands, calculating the sum and then pushing the result back onto the stack.
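A stack-based evaluator for postfix expressions takes only a few lines of code. The C sketch below evaluates a b + (with example values standing in for the named operands) exactly as described: operands are pushed, and an operation pops what it needs and pushes its result:

#include <stdio.h>

#define STACK_SIZE 16

static int stack[STACK_SIZE];
static int sp = 0;                        /* next free slot; stack grows upwards */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

int main(void)
{
    /* Evaluate the postfix sequence: a b +  (with a = 7, b = 5) */
    push(7);                  /* push a */
    push(5);                  /* push b */

    int b = pop();            /* ADD pops two operands ...      */
    int a = pop();
    push(a + b);              /* ... and pushes the result back */

    int result = pop();
    printf("result = %d, stack depth now %d\n", result, sp);   /* result = 12 */
    return 0;
}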
Box 3.6
Recoding RPN instructions to minimise stack space
Consider the infix expression a + (b × c) which can also be written as (b × c) + a since the order of addition is unimportant to the final result.
For each expression, write the equation in postfix notation and write out the se-
quence of stack operations that would be required to execute it. Consider the stack
usage for each expression.
It should be clear that writing the equation one way involves the use of amaximum
stack depth of three locations, whereas the alternative way results in a stack depth of
only two locations.
It appears that the order of the postfix expression can have a significant impact on
the stack resources (and hence hardware resources) needed, although it will not alter
the number of steps needed to find a solution.
Not all infix expressions are insensitive to order. Addition and multiplication are,
whereas division and subtraction are most definitely not.
It is also interesting to consider the use of such a stack machine performing com-
plex programming tasks. It seems efficient for simple operations, but sometimes it is
possible that the final state of the stack after a sequence of operations may not have the
correct results located on the top of the stack. This may be exacerbated by multi-tasking
or interrupt service routines. There must be a way of re-ordering the stack, such as
popping items out and storing into main memory, and then pushing them back in a
different order. This could be a very time-consuming process and impacts heavily on
the overall performance of a stack machine. This process is also explored in Box 3.6
where re-ordering is performed to minimise stack usage.
3.4 Data Handling
This chapter, up to now, has concentrated on CPU basics — what a computer is and what
it fundamentally consists of. We have mentioned instructions, programs and so on. As
part of this, Section 3.3 considered instruction handling, including some variations on
a theme, as well as the important sub-topic of addressing modes.
Later, Section 3.5 will present a top-down view of computers. However, in between
these two extremes of high-level overview and low-level detail, there is a more philo-
sophical question regarding the purpose of computers. We can consider a ‘black box’
perspective as an example.’ Having a black box perspective, we view a computer as a
unit that modifies some input data to produce some output data.

For those who have not encountered this term, a ‘black box’ is the name given to something that, when considered as a unit, is defined solely in terms of its inputs and outputs. It does not matter what is inside the box as long as it produces the correct output given the correct input.
Both input and output data could take many forms: commands, knowledge, sensor
data, multimedia and so on. For some systems, input data could consist of a single
trigger event. Output data could likewise consist of an actuator switch signal. This is
the case in control systems, which often operate with a need for real-time processing
of data (real-time issues are considered in depth in Section 6.4). Some systems are data
rich — either input or output may consist of dense streams of data, such as digital audio
or video. These systems may also need to operate in real time. However, the majority of
computer systems are probably general-purpose machines capable of performing both
control and data processing tasks with little regard to real-time issues.
The common theme here is clearly data: computers process data, whether that is a
single bit trigger for which timing is critical, or a 1 Tbyte block of multimedia data that
completes processing in several minutes. This section is dedicated to this important
aspect of computers: what data is, how it is presented, stored and processed.
Table 3.1
Comparison of C programming language data type sizes for CPUs ranging from 8 bits to 64 bits.
Note how some of the data types change size between processors, while others remain the same.
For a particular implementation, these sizes are usually defined by maximum and minimum
representable number specifications in the configuration header file types.h. Remember also that
the byte order may change between big and little endian processors (see Section 2.2).
Type             8-bit CPU   16-bit CPU   32-bit CPU   64-bit CPU
char                 8           8            8            8
byte                 8           8            8            8
short               16          16           16           16
int                 16          16           32           64
long int            32          32           32           64
long long int       64          64           64           64
float               32          32           32           32
double              64          64           64           64
long double      compiler specific - may be 128, 96, 80 or 64 bits
types has led to many developers adopting specific-sized data types, described further
in Box 3.7.
Of course, experienced programmers will know that any integer data type in the
C programming language (i.e. the top six rows in the table) can be specified as either
signed or unsigned. The default (if neither is specified) data types are signed two’s
complement.
The long int and long long int canalso be specified as just long and long
long respectively. On all but the largest machines these will require multiple memory
locations for storage.
The char type normally contains a 7-bit useful value, complying with the ASCII standard (American Standard Code for Information Interchange), shown in Table 3.2. Any top-bit-set character (i.e. a char where bit 8 is non-zero) would be interpreted as an extended ASCII character (ASCII characters that are not shown in the figure). Interestingly, characters lower than decimal 32 (space) and including decimal 127 (delete), are non-printable characters having special values related to their original definitions for teletype terminals. For example, ASCII character 7, \a, is the bell character, which would cause a ‘beep’ sound when printed. A brief web search can easily reveal the meanings of other special ASCII characters.
ASCII was excellent when computers were effectively confined to English (or American) speakers, but not particularly useful for other languages. Hence, significant effort
Box 3.7
Although general programs written in languages such as C and C++ will make use
of the standard data types shown in Table 3.1, this can cause confusion when porting
code. If a programmer makes an implicit assumption regarding the size of a particular
data type, this assumption may no longer be correct when the code is compiled on a
different processor.
The situation was actually far worse in the days before the widespread adop-
tion of the gcc compiler — many compilers had different compilation modes such as
‘large memory model’ and ‘small memory model’ which could result in the num-
ber of bits used to represent variables changing (even gcc has command switches
which can change this, but are not often used). Cross compiling for embedded sys-
tems, where the target machine may differ from the host compilation machine, makes
it doubly important to ensure that any code tested on the host performs similarly on
the target.
Perhaps the simplest way to achieve this, and to remain mindful of the limitations
of different data types, is to directly specify the size of each type when declaring
variables. In the C99 programming language (the version of C formalised in 1999) the definitions have been made for us in the <stdint.h> header file:

8     int8_t
16    int16_t
32    int32_t
64    int64_t
The 64-bit definitions (and other odd sizes such as 24 bits) may exist for a par-
ticular processor implementation but not for others. Of course, if it exists, it will
occupy the sizes given, but otherwise these are optional, so for some machines the
compiler will not support anything but the main 8-, 16- and 32-bit definitions. Writ-
ers of code for embedded systems will likely encounter these safer type declara-
tions more often than those writing desktop machine software. The author would
encourage embedded systems developers to use the specific-sized types wherever
possible.
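A minimal illustration of this practice, with variable names invented for the example; the sizes printed are what C99 guarantees for these exact-width types on any platform that provides them:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t   counter   = -5;          /* always exactly 8 bits  */
    uint16_t adc_value = 1023;        /* always exactly 16 bits */
    int32_t  position  = -100000;     /* always exactly 32 bits */
    uint64_t timestamp = 123456789u;  /* always exactly 64 bits */

    printf("%zu %zu %zu %zu\n",
           sizeof counter, sizeof adc_value, sizeof position, sizeof timestamp);
    /* prints: 1 2 4 8 regardless of whether the target CPU is 8, 16, 32 or 64 bits */
    return 0;
}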
has been paid over many years to define different character encodings for other languages. Perhaps the ultimate challenge has been Chinese which has around 13,000 pictograms (individual ‘letters’): clearly an 8-bit data type is not able to encode written Chinese. Many solutions have appeared over the past two decades, most of which use two or more sequential bytes to hold a single character. The current de-facto standard encoding is called unicode, which has various ‘flavours’ but which can use up to four sequential bytes to encode the vast majority of characters, including Chinese, Japanese, Korean and so on.

Table 3.2
The American Standard Code for Information Interchange, 7-bit ASCII table, showing the character (or name/identifier for non-printable characters) and the representative code in decimal and hexadecimal.
Although the detail of this encoding system is beyond the scope of this book,
the implications are not: early computers were byte-sized and were naturally able to
handle byte-sized ASCII characters. These days, it requires a 32-bit machine to handle
a 4-byte unicode character in a single operation. Similarly, early interfacing methods
such as the PC parallel and serial ports (see Chapter 6) were byte-based. Memory
accesses have often been byte-based. The argument has been that a byte is a convenient
size for simple counting and for text processing. However, this argument no longer
applies in many cases. Where non-English alphabet systems are concerned, a byte-
sized processing system is nothing more than a historical curiosity.
One final point to note concerning data sizes is the uniformity of the float and
double types. This uniformity is related to the ubiquity of the IEEE754 standard, and
the fact that the majority of hardware floating point units comply with the standard
(this will be explained a little more in Section 4.6).
Buses, which we will consider more fully in Section 6.1, need to be sized to cope with
the required data flows, and systems should also consider human needs. For example,
the human sensory organs are often far more sensitive to sudden discontinuities than
they are to continuous errors (noise). It is usually more annoying for listeners to hear
music from a CD player which skips than it is to listen to music in the presence of
background noise. Similarly with video: skipped frames can be more annoying than a
slightly noisy picture.
Most of the important real-time issues will be explored in Section 6.4. However, at
this point, we need to stress that computer architects should bear in mind the use to
which their systems will be put. Embedded computer architects may have an advantage
in that their systems are less flexible and more generic, and thus better able to satisfy
users. Unfortunately, they also suffer the considerable disadvantage that size, cost and
power limitations are more severe, and thus require finer balancing of trade-offs in
design.
Technically speaking, data flows through computers on pathways called buses.
This data may originate from external devices or some form of data store, be processed
in some way by a CPU or co-processor, and then output similarly either to another
external device or data store.
1. Load the data item at the particular stack offset corresponding to variable i into a
register.
2. Increment the value stored in that register.
3. Save that register content to the stack offset that it was retrieved from.
Spill code is the name given to the few machine code instructions that a compiler
adds to a program to load-store variables between memory and registers. Since memory
accesses are far slower than register accesses, spill code not only slightly increases the
size of a program, it also adversely affects execution speed. Minimising spill code has
long been a target of compiler researchers and computer architects worldwide.
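A hypothetical C fragment makes the point; the pseudo-assembly comments sketch the three spill-code steps listed above (register names and stack offsets are illustrative only, and a real compiler may avoid the spill entirely by keeping i in a register):

```c
/* Sketch: a local variable that the compiler has placed at a stack offset
 * rather than in a register, so every increment needs spill code. */
void count_events(int *events, int n)
{
    int i = 0;                /* assume 'i' lives at a stack offset            */
    while (i < n) {
        events[i] += 1;
        i++;                  /* LDR r0, [sp, #off_i] ; load i from the stack  */
                              /* ADD r0, r0, #1       ; increment in a register */
                              /* STR r0, [sp, #off_i] ; store i back (spill)   */
    }
}
```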
Figure 3.13 A block diagram illustrating the set-up and calculation stages of the multi-step procedure necessary to perform a 32-bit x 32-bit = 64-bit multiplication using multiply hardware only capable of returning a 32-bit result (i.e. 16-bit x 16-bit = 32-bit hardware). In the set-up phase the operands are split into 16-bit halves (R1 = A[15:0], R2 = A[31:16], R3 = B[15:0], R4 = B[31:16]); in the calculation phase the partial products such as A[15:0] x B[15:0], A[15:0] x B[31:16] and A[31:16] x B[15:0] are formed and summed.
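The same split-and-sum scheme can be written directly in C. The sketch below is illustrative only (the function name and the use of a 64-bit result type are assumptions, not from the text) and includes the fourth partial product A[31:16] x B[31:16], which is needed for the full 64-bit result:

```c
#include <stdint.h>

/* Sketch: 32-bit x 32-bit = 64-bit unsigned multiply built from
 * 16-bit x 16-bit = 32-bit multiplications, as in Figure 3.13. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;     /* A[15:0], A[31:16] */
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;     /* B[15:0], B[31:16] */

    uint64_t p0 = (uint64_t)(a_lo * b_lo);           /* A[15:0]  x B[15:0]  */
    uint64_t p1 = (uint64_t)(a_lo * b_hi) << 16;     /* A[15:0]  x B[31:16] */
    uint64_t p2 = (uint64_t)(a_hi * b_lo) << 16;     /* A[31:16] x B[15:0]  */
    uint64_t p3 = (uint64_t)(a_hi * b_hi) << 32;     /* A[31:16] x B[31:16] */

    return p0 + p1 + p2 + p3;                        /* sum of partial products */
}
```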
Logical operations on longer data words are quite simple: split the operands, pro-
cess the logical operations on each part separately and then re-assemble the result. This
is because a logical operation on one bit in a binary word does not have any impact
upon the neighbouring bits.
Arithmetic operations require a little more thought than logical operations (but are simpler than multiplication or division). The issue with arithmetic operations is that of overflow: the result of adding two 16-bit numbers may be 17 bits in length. The extra bit (carry) must therefore be taken into consideration when performing the addition of the split numbers. Usually, that will involve calculating the lower half of the split first and then adding this (with carry) to the result of the upper half.
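A minimal C sketch of this split addition (function and variable names are illustrative), with the carry out of the lower halves propagated into the sum of the upper halves:

```c
#include <stdint.h>

/* Sketch: adding two 32-bit values using only 16-bit additions. */
static uint32_t add32_using_16(uint32_t a, uint32_t b)
{
    uint16_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint16_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t lo_sum = (uint32_t)a_lo + b_lo;            /* may be 17 bits long      */
    uint16_t carry  = (uint16_t)(lo_sum >> 16);         /* the extra (carry) bit    */
    uint16_t hi_sum = (uint16_t)(a_hi + b_hi + carry);  /* upper halves plus carry  */

    return ((uint32_t)hi_sum << 16) | (uint16_t)lo_sum;
}
```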
For a machine equipped with an FPU, (A x B^C) and (D x B^E) would be single 32-bit (for float) or 64-bit (for double) values. These would be loaded into two FPU registers, a single instruction issued to perform the multiplication and the answer retrieved from a destination FPU register. By contrast, for a machine without an FPU, several fixed point operations would be required instead:

a) Split off mantissa and exponent A and C and store in R1 and R2 respectively.
b) Split off mantissa and exponent D and E and store in R3 and R4 respectively.
c) Calculate the new mantissa: R1 x R3.
d) Calculate the new exponent: R2 + R4.
e) Normalise the result.
f) Recombine and store in IEEE754 format.

Clearly, the single FPU instruction is preferable to the several fixed point operations that are needed to replace it.
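The following C sketch shows roughly what those fixed point steps look like for IEEE754 single precision (names are illustrative, and rounding, zeros, infinities, NaNs and subnormals are deliberately ignored to keep the sketch short):

```c
#include <stdint.h>

/* Simplified software multiply of two IEEE754 single-precision values held
 * as raw 32-bit words, mirroring steps (a) to (f) above. */
static uint32_t soft_fmul(uint32_t x, uint32_t y)
{
    uint32_t sign = (x ^ y) & 0x80000000u;              /* sign of the result      */
    int32_t  ex   = (int32_t)((x >> 23) & 0xFF) - 127;  /* split off the exponents */
    int32_t  ey   = (int32_t)((y >> 23) & 0xFF) - 127;
    uint64_t mx   = (x & 0x007FFFFFu) | 0x00800000u;    /* split off the mantissas */
    uint64_t my   = (y & 0x007FFFFFu) | 0x00800000u;    /* (restore the hidden 1)  */

    uint64_t m = mx * my;                               /* new mantissa (48 bits)  */
    int32_t  e = ex + ey;                               /* new exponent            */

    if (m & (1ULL << 47)) { m >>= 24; e += 1; }         /* normalise               */
    else                  { m >>= 23; }

    return sign | ((uint32_t)(e + 127) << 23)           /* recombine in IEEE754    */
                | ((uint32_t)m & 0x007FFFFFu);
}
```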
The complex multiplication needs four real multiplications and two additions.
The complex addition is a little simpler, requiring only two real additions. This will re-
quire the programmer (or compiler) splitting the operation into steps of several simpler
instructions.
A processor with hardware support for complex numbers would possess a single
instruction capable of performing these operations. The underlying hardware architec-
ture would actually need to perform all of the splitting, suboperations and separate
multiplies, but this would be handled very quickly within the CPU without requiring
separate loads, stores and data moves.
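As a sketch of what the programmer or compiler must generate on a CPU without complex-number support (the struct and function names are illustrative):

```c
/* Sketch: (a + jb)(c + jd) decomposed into four real multiplications and
 * two real additions/subtractions, plus a complex addition for comparison. */
typedef struct { float re, im; } complex_f;

static complex_f cmul(complex_f x, complex_f y)
{
    complex_f r;
    r.re = x.re * y.re - x.im * y.im;   /* two multiplies, one subtraction */
    r.im = x.re * y.im + x.im * y.re;   /* two multiplies, one addition    */
    return r;
}

static complex_f cadd(complex_f x, complex_f y)
{
    complex_f r = { x.re + y.re, x.im + y.im };   /* two real additions */
    return r;
}
```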
A Top-Down View

Functionality
Given that all computable functions can be performed by some sequence of logic op-
erations, the main reason why not all functions are computed in such a way (i.e. as a
possibly long sequence of logical operations), is related to efficiency — how long does
such a function take to complete, and what hardware resources are required? There is
some trade-off in that making a computer simpler can allow faster clock speeds. This
argument led to the advent of RISC processors which, being simpler, clock faster — at
the expense of having to perform some functions longhand that would be built into a
CISC computer as single instructions.
8 The notable exception is FORTRAN (FORmula TRANslation), the general-purpose compiled language introduced by IBM in the mid-1950s. FORTRAN, updated several times since (the latest being 2003), has natively supported a complex number data type for over 50 years. Among modern languages, there has been some promotion of Java as a scientific language, with a complex number extension. Unfortunately, Java is currently significantly slower to execute than FORTRAN.
amount of memory provided on-chip, and off-chip, must be considered. A cache (Sec-
tion 4.4) in particular is used to maximise the use of faster memory, and the complexity
of hardware memory units tends to influence how well memory use is optimised. In
terms of memory, the way software is written and compiled can also result in more
efficient use of hardware resources.
So completion time reduces when CPI is low, f is high or, most obviously, P is low (i.e. a shorter program will probably execute faster than a longer one). The trade-off between P and CPI in computer architecture is a revisit of the RISC vs CISC debate, while ever-increasing clock frequency is the story of modern CPUs.
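As a worked illustration of the relationship implied here, taking completion time as T = (P x CPI)/f with figures chosen purely for this example (P = 2 x 10^6 instructions, CPI = 1.25, f = 100 MHz):

\[
T = \frac{P \times \mathrm{CPI}}{f}
  = \frac{(2 \times 10^{6}) \times 1.25}{100 \times 10^{6}\,\mathrm{Hz}}
  = 0.025\,\mathrm{s} = 25\,\mathrm{ms}
\]

Halving CPI or doubling f halves this figure, which is exactly the trade-off space described above.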
The task of minimising CPI is another aspect of modern computer systems. Up
until the 1980s, CPI would be greater than 2, perhaps as much as several hundreds
in CISC machines. The RISC approach began to shift CPI downwards, with the aim of
achieving a CPI of unity. The ARM family of processors typically achieve a CPI of about
1.1, and other RISC processors can do a little better than this.
Later, the advent of superscalar architectures led to CPI values of below unity,
through allowing several instructions to execute simultaneously. This, and the inverse
of CPI (called IPC) will be explored later in Section 5.5.1.
Sometimes floating point performance is an important attribute and this is measured in MFLOPS (millions of floating point operations per second). In recent times, GFLOPS readings (thousands of MFLOPS) are more commonly quoted, and even PFLOPS (petaFLOPS). These values are more indicative of actual performance than MIPS since we are counting useful calculation operations rather than the low-level instructions which comprise them.
Box 3.8 Standardised performance

In the mid-1980s, the computer industry worldwide saw an unprecedented level of competition between vendors. This was not simply a two-entry race between AMD and Intel. It included thousands of manufacturers selling enormously differing machines: alternative architectures, different memory, tens of CPU types, custom operating systems, 8 bits, 16 bits and even some more unusual choices.
In the UK, companies such as Sinclair, Acorn, Oric, Amstrad, Research Machines,
Apricot, Dragon, ICL, Ferranti, Tandy, Triumph-Adler and more battled in the market-
place against IBM, Apple, Compaq, DEC, Atari, Commodore and others. Claims and
counterclaims regarding performance littered the advertisements and sales brochures
available at that time. However, with no standard and no baseline, claims were often
dubious to say the least.
In response, the British Standards Institute (BSI) published a performance stan-
dard for computers — testing useful tasks such as integer calculation, floating point
calculation, branching performance and graphics as well as disc reads and writes.
However, at that time the programming language of choice was BASIC (Beginner's All-purpose Symbolic Instruction Code), and hence the standards were written in this
language! From today’s point of view, the graphics and disc tests are also dated: the
‘graphics’ test was actually text being written to the screen or VDU (visual display unit)
in the parlance of the time. This was important for many users interested in nothing
more than word-processing. Also disc reads and writes were to floppy discs — a great
advance on the tape drives used for most home computers at the time — hard discs
(usually known as Winchester drives in those days) were simply too expensive and
not even supported on most machines available at the time. Far more common was
saving programs to cassette tape.
Today, computer magazines and websites test new hardware and software with a
battery of tests far removed from the BSI standard, but following the same rationale.
Thus, measures such as ‘refresh rate for playing Quake III’ and ‘time taken to sort
1 million rows of random numbers in a spreadsheet’ are to be found. Other more
standard, but often not freely available, tests exist but these are less commonly applied:
after all, most users are more interested in playing Quake than in how quickly they
can calculate π to 100 decimal places.
Benchmarks are so important that several companies exist to provide such services (Box 3.8 explores the background and necessity of having such benchmarks). BDTi is one example which publishes comparative speeds for several digital signal processors (DSPs). Their measures are skewed towards outright calculating performance, something which is the mainstay of the DSP market.

Otherwise, SPECint and SPECfp benchmarks compute integer and floating point performance directly. These are obtainable in source code format from the Standard Performance Evaluation Corporation (SPEC) for a fee, and can be compiled on an
architecture to assess its performance. Each measure is calculated from a set of algo-
rithms that have to be run, and results combined. Generally, a year is provided to
indicate test version. Thus, SPECint92 is the 1992 version of the integer standard.
The SPEC measures themselves incorporate two earlier measures: Dhrystone, a 1980s integer benchmark, and Whetstone, a 1970s floating point benchmark. Many other performance metrics exist and may be
used to assess performance for various tasks (such as graphics rendering, real-time
performance, byte handling and so on).
Unfortunately, it is a well-known fact that, given any single performance measure,
computer architects can tweak an architecture to yield a high score at the expense of
other, unmeasured, operations. Furthermore, none of these measures really reflect the
overall completion time of anything but the simplest tasks running in isolation. So
many issues intervene in the real world to confuse results, such as interrupted tasks,
operating system calls, varying memory speeds, disc speeds, multi-tasking and cache.
In computing, a cache (covered in detail in Section 4.4) is a small block of very
fast memory provided on a system which has far slower main memory. Any program
running directly from the cache will obviously execute quicker than one running from
slow main memory. Why this is relevant is that in the past, at least one processor vendor
has deliberately designed a cache just big enough to hold an entire performance measure
algorithm (i.e. the entire SPECint or Dhrystone program) so that it runs much faster
than it does on a competitor’s machine.
In such an example, if the main memory were set to run ten times slower, the
performance measure result would not change since the measuring program runs from
the cache, not main memory. Obviously, such a performance measure is not realistic.
In fact, such a machine would yield a faster performance score than a competitor with
a smaller cache but significantly faster main memory — one which would in reality
probably perform real-world tasks much quicker.
Given significant performance-altering factors such as those we have mentioned, it
is clear that the world of benchmarking is fraught with difficulty. A system designer is
thus urged to be careful. In practice, this may mean understanding device operation in
detail, building in large safety margins or testing final code in-situ before committing
to a device. Although it is rare in industrial projects for software to be available and
working before hardware is complete, if such an opportunity arises, the approach of
in-situ testing is very much recommended.
Assessing Performance

Section 6.4.4 will discuss completion times and execution performance for real-time and multi-tasking systems, but here we consider estimation of performance. In order to underscore the need for accurate performance estimation, here is an example from industry:

Several years ago, an embedded design group needed hardware to run an algorithm requiring 12 MIPS of processing power. A 32-bit CPU rated at providing 40 MIPS when clocked at 40 MHz was chosen to execute this. In an attempt to reduce design risks, the
This approach is commonly taken for DSP systems, and will include a close look at memory transfers, placement of variable blocks into different memory areas that can be accessed simultaneously (Section 4.1.4), input and output bottlenecks and mathematical operations which are typically the major strength of such processors. Slow set-up, user interface and control code are generally ignored in such calculations, except in the sizing of overall program memory requirements.
At this point it is useful to note that most, if not all, software developments end
up overrunning initial program memory use estimates. Clever coding can often bring
down data memory use and can reduce processing requirements, but can seldom save
significant amounts of program memory. Unlike desktop computer designers, embed-
ded designers do not have the luxury of providing for RAM expansion: this must be
fixed at design time. In such cases, it is wise to significantly overestimate memory needs
up-front.
The second approach mentioned, of matching required performance to hardware,
is through careful evaluation. This does not require detailed architectural understanding,
but does require detailed levels of testing. Ideally, the final runtime software should be
executed on candidate hardware to evaluate how much CPU time it requires. A list of
other tasks to be performed should also be made and checked to see whether those can
fit into whatever spare processing time remains. Software profiling tools (such as GNU
gprof) will identify any bottlenecks in the runtime code and make clear which software
routines require large amounts of CPU time.
It is important to run any test a number of times (but do not average the results if
timing is critical — take the maximum worst case), to increase program size sufficiently
to force it out of the cache or on-chip memory, if appropriate, and to enable whatever
interrupts and ancillary tasks might be needed in the final system.
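A minimal sketch of such a worst-case timing harness, assuming a POSIX system where clock_gettime() is available (run_algorithm() here is only a stand-in for the code under test):

```c
#include <stdio.h>
#include <time.h>

/* Stand-in workload representing the routine being evaluated. */
static void run_algorithm(void)
{
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;
}

int main(void)
{
    struct timespec t0, t1;
    double worst_ms = 0.0;

    for (int run = 0; run < 100; run++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_algorithm();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > worst_ms)
            worst_ms = ms;   /* keep the maximum: the worst case is what matters */
    }
    printf("Worst-case run time over 100 runs: %.3f ms\n", worst_ms);
    return 0;
}
```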
If, as is sometimes the case, the target software is already running on another ma-
chine, it is possible to compare its execution on that machine to execution on another —
but only after considering all important architectural factors as discussed in these last
two chapters. In such instances, compiling and comparing a suite of standard bench-
marks on both machines will help, assuming that the benchmarks chosen are ones of
relevance to the target software.
The world is full of examples where designers have estimated processor perfor-
mance and/or memory requirements incorrectly (including one example designed
for an Asian industrial manufacturer in 1999 by the author: a portable MP3 player
that could only replay seven seconds of MP3 audio at a time, due to unexpectedly
low memory bus bandwidth. Luckily, a faster speed grade processor became
available).
You have been warned! Beware the pitfalls of performance estimation, evaluation
and measurement. Above all, remember to read the small print below manufacturers’
performance claims.
Summary
In this chapter, the basics of the microprocessor have been covered, starting with the
functionality of a CPU, the ability to control this with a program and the need to transfer
this program (and store it somewhere).
A control unit needs to keep a processor on track, managing operations and ex-
ceptions, and being directed in turn by the computer program through a sequence of
instructions. Control units can be centralised, or distributed with timing from a state
machine, a microcode engine or using self-timed logic.
Each instruction in a program is part of an allowable instruction set that (depend-
ing on your point of view) describes the operations capable of being performed by that
processor, or which specifies the microprocessor behaviour. Such behaviour includes
data transfer through internal buses to various functional units. Having laid the foun-
dation for CPU design here and in the previous chapter, in Chapter 4, we will delve into
the internal arrangements and functional units of most mainstream CPUs and attempt
to relate that to the programmer’s experience.
3.1 If the assembler instruction LSL means 'logical shift left', LSR means 'logical shift right', ASL means 'arithmetic shift left' and ASR means 'arithmetic shift right' then what are the results of performing these operations on the following signed 16-bit numbers?

a. 0x00CA ASR 1
b. 0x00CA LSR 2
c. 0xFF0F LSL
d. 0xFF0F LSR
e. 0xFF0F ASR
f. 0xFF0F ASL
3.2 An analysis of representative code for a RISC processor with only eight instructions finds the following occurrences of those instructions:

ADD 30
AND 22
LDR 68
MOV 100
NOT
ORR 10
STR 60
SUB 6

a. If each instruction (excluding operands) is 6 bits long, how many bits does the program occupy?
b. Use the information in the table to design a Huffman coding for the processor.
c. Calculate the number of bits needed to store the program using the Huffman coded instruction set.
Show the sequence of stack PUSHes and POPs during the execution of the following Reverse Polish notation (RPN) operations and translate each into infix notation:
Ase Osa
D.ccalb ch cx
Gan ab. cdsi phar
Consider the maximum depth of stack required to perform these operations.
3.6 Calculate the maximum stack usage (depth) for each of the three answers to
part (e) above.
3.8 Given the following segment of ARM assembler, rewrite the code to use condi-
tional ADDS to remove the need for any branch instructions.
DD SRO PRLS RS
step2
step3 TOP
3.11 Scientists discover a new type of silicon memory cell. Semiconductor engineers
design this into a new memory chip. Identify six factors that computer architects
would look at when deciding whether to adopt this new technology for mass
storage in an embedded video player.
Consider the following instructions and decide whether they are from a RISC or CISC processor:
a. MPX: Multiply the content of two memory locations, then add the result to
an accumulator.
b. BCDD: Perform a binary-coded decimal division on two registers, format
the result in scientific notation and store as ASCII to a memory block ready
for display to the screen.
c. SUB: Subtract one operand from another and return the result as a third
operand. The operands and result are register contents only.
d. LDIV Rc, Ra, Rb: Perform a 100-cycle-long division of Ra/Rb and place the
result in register Rc.
3.14 What is a load-store architecture? Why would computer designers adopt such
an idea?
3.16 For a fictitious 32-bit processor, the hexadecimal machine code instruction for
the assembler command to store a word 0x1234 in memory location 0x9876
looks like this:
0x0F00 1234 088D 9876
By examining the machine code instruction, determine whether this processor
is likely to be capable of absolute addressing. Justify your answer.
3.17 Another fictitious processor, this time an 8-bit CPU, has eight registers. Is it
possible to have instructions in this processor that specify two operand registers
and a separate result register?
3.18 Assuming ARM-style assembly language (but not necessarily an ARM proces-
sor), identify the type of addressing represented in the following instructions:
MOV R8, #0x128
AND
SHIR. Ieee os Nile
dt
AND R4, R5, R4
LDRRG (URS peRObe iol 2)
LDR RZ Ry Ole aor
sp Sle Roy
Cemeoean wlinsr ees 0)|
3.19 Which processor is likely to be faster at processing 32-bit floating point data: a
900 MHz 32-bit floating point CPU or a 2 GHz 16-bit integer-only CPU?
Processor Internals
Many processors do not have this flexibility or are less regular. Secondly, there is the issue of how much work can be performed in a single instruction cycle. This is normally implicit in the instruction set itself.
Again looking at the ARM, there are at most two register input operands and a single
register result operand associated with any arithmetic or logic instruction:
ADD R0, R1, R2 ; R0 = R1 + R2
With regard to the means of transporting data from a register to the ALU and back
again: if this all happens in a single cycle, it implies that both the input and the output
have their own buses (since only one operand can travel along one bus at any time).
One bus will convey the content of R1, another will convey the content of R2 and yet
another will convey the result from the ALU back to register RO.
Taking the two observations together implies that all registers connect to all buses,
and there are at least three main internal buses.
The arrangement concerning registers and ALU that we can deduce from a brief
examination of the instruction set is shown in Figure 4.1. This is actually a simplified
schematic of the ARM processor internal interconnection arrangement. The arrows in-
dicate controllable tristate buffers, acting as gates controlling read and write access be-
tween the registers and the buses. Control logic (described in Section 3.2.4) is not shown.
dedicated to handling addresses. In the ARM, where every register can hold an address (since it uses indirect addressing, explained in Section 3.3.4), every register must also have connectivity to the internal address bus.

1 In fact, registers R14 and R15 are the link register and program counter respectively. These understandably require connections that other registers will lack, which are not really evident through examining the instruction set. Registers also vary in their shadowing arrangements.
Figure 4.2

2 The 'xx' means that there are various serial numbers in the ADSP21 family which share these
In the arrangement shown, both the MAC and the ALU have their own buses, both input and result, and by extension, their own set of preferred registers. Thus, as long as the programmer remembers to use R0 to R3 when dealing with the MAC, and R4 to R7 when dealing with the ALU, both of the example instructions:

MUL R0, R1, R2 ; R0 = R1 x R2
ADD R4, R5, R6 ; R4 = R5 + R6

can be performed simultaneously in a single cycle.

This process is probably the underlying thinking behind the design of the ADSP21xx hardware, squeezed by designers for every last drop of performance gain.
data memory address. Both are address buses that index into the two blocks of memory
(program and data) which also indicate that this is basically a Harvard architecture
processor (see Section 2.1.2). However, it actually goes a step further in its partitioning of
address spaces. PMD and DMD are program and data memory data buses respectively.
Note the bus sizes: not only does this ADSP have a complex internal bus interconnection
arrangement, but the bus width and width of the interconnects differ.
The diagram shows that the ALU and the MAC, but not the shifter, can receive
input operands from the 24-bit PMD bus, but all can receive input and output from the
16-bit DMD bus.
\[
y[k] \;=\; \sum_{i=0}^{n} a[i]\, x[k-i] \;+\; \sum_{i=1}^{n} b[i]\, y[k-i]
\]
This includes the use of a further set of filter coefficients, b[]. IIR filters can also be adaptive and are generally able to perform the same work as FIR filters but with a lower order (which means a smaller value of n). This strong filtering action comes at a price, mainly that IIR filters can become unstable if not designed carefully.
The act of multiplying two values and adding to something already there is called
multiply-accumulate, which uses an accumulator, usually abbreviated to ‘ACC’.
Now we need to relate that function to the hardware of a digital signal processor.
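In C, that multiply-accumulate is simply a loop of the following form, shown here as a sketch of one output sample of an n-tap FIR filter (the a[] coefficient and x[] data names match those used in this section; the function name is illustrative):

```c
/* Sketch: the multiply-accumulate loop at the heart of an FIR filter.
 * Each pass multiplies a coefficient a[i] by a data sample x[k - i] and
 * adds the product to the accumulator ('ACC'). Assumes k >= n - 1 so that
 * x[k - i] stays within the data array. */
static long fir_sample(const int *a, const int *x, int k, int n)
{
    long acc = 0;                        /* the accumulator                 */
    for (int i = 0; i < n; i++)
        acc += (long)a[i] * x[k - i];    /* one multiply-accumulate per tap */
    return acc;
}
```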
There are many subtleties that could be discussed here concerning the use of this operation, but in this case one of the most important aspects is the memory access arrangements.
Consider the block diagram in Figure 4.5 showing a digital signal processor con-
taining a CPU, two memory blocks and a block of external shared memory. The device
seems to have an internal Harvard architecture (separate program and data memory
and buses), but connects externally to a block of shared memory. This type of arrange-
ment is very common, with the internal memory being static RAM (SRAM), and some-
times having SDRAM (synchronous dynamic RAM) externally for the main reason that
it is far less expensive than SRAM (refer to Section 7.6 for details on memory technolo-
gies and their features).
On-chip memory uses short internal buses and is generally extremely fast, some-
times accessing instructions in a single cycle. Occasionally, a block of two-cycle memory
is also provided. This is twice as slow as single-cycle memory since it requires two clock
cycles between requesting data and it being made available.
Ignoring the memory speed for now, and referring back to the multiply-accumulate example, we need to feed the multiplier with two values: one being a predetermined coefficient, a[], and the other being an input data value x[]. Given a shared bus, these two values cannot be obtained/transferred simultaneously. However, given the internal split buses in the diagram, they can both be fetched together and begin to be multiplied in a single cycle, if obtained from the separate on-chip memory blocks. Overall, this will probably be a multi-cycle operation: one cycle to load and decode the instruction, the cycle following that to load the operands, and then one or more cycles to operate on those. However, given fast single-cycle on-chip memory it is possible for the operand
Usually, anything that traverses an off-chip bus is slow compared to data following on-chip paths, and this is one major driving factor behind the use of cache memory (explored later in Section 4.4). Where the external memory device is SDRAM, there will almost always be an on-chip cache to alleviate the issue, since however fast the SDRAM is, there is always a two- or three-cycle latency between requesting a single memory value and it being provided.
present stable electrical signals at the two input arms of the ALU, we need to wait for
a certain length of time before the answer appearing at the bottom of the ALU is valid.
Some control logic (not shown) would be present to instruct the ALU as to exactly what
arithmetic or logic operation it should be performing, and this is assumed constant
here. But the length of time we have to wait depends on the exact operation being
performed — and the maximum (worst case) time is the one that determines how fast
we can clock the circuitry based around this ALU. In a modern system, this delay may
be something like one or two nanoseconds.
That delay is accounted for, but the problem here is that there is effectively no
minimum delay: what this means is that as soon as one of the input signals is removed
or changes, the result can start to become corrupted. The consequence of this is that
the input operands must remain in place driving the ALU as the result is collected and
stored. Only then can the input operands change, or be removed.
Hence, the registers on the ALU input arms. Without at least one register there is no way for a two-bus architecture to drive an ALU with input operands and simultaneously collect the result. With one or two registers present there are several alternatives that may save on hardware slightly, but the more general approach is the following sequence of events when performing:

ADD R0, R1, R2 ; R0 = R1 + R2

Each numbered step is at a monotonically increasing time instant:

1. Set up the system, clear the buses and set the ALU functionality switch to 'ADD'.
2. Allow register R1 to drive bus 1 (by turning on its register output buffer) and register R2 to drive bus 2 (by turning on its register output buffer).
3. Latch the bus 1 value into the first ALU operand register and the bus 2 value into the second ALU operand register.
4. Turn off the R1 register output buffer (bus 1 becomes free) and the R2 register output buffer (bus 2 becomes free).
5. Wait for the worst case propagation delay through the ALU.
6. Latch the ALU result into the ALU output buffer.
7. Allow the ALU output buffer to drive one bus.
8. Latch the content of that bus into register R0.
9. Turn off the ALU output buffer (both buses become free and the system is ready to perform the next operation).

It can be seen that the very simple ADD command actually comprises a number of steps that must be performed in hardware. These steps add up to something like eight time periods, ignoring the ALU propagation delay. In a three-bus design (Section 4.1.1), such an add would require only three time periods.

The complexity of these steps, even for a simple ADD instruction, goes some way towards explaining the importance of a control unit inside a CPU to manage this process (Section 3.2.4). Can you imagine the control complexity needed for a large multi-cycle CISC instruction?
latches.
Again there are several alternative arrangements to perform this functionality, but
all allow the result of an ALU calculation to be fed back to the input of one arm of
the ALU. This would be useful when performing accumulation or when following
one arithmetic or logical operation after another. In this case, the register on the left-
hand arm of the ALU became known as the accumulator. It was the basis for almost
every operation, the most used register in the entire system, the programmer’s friend.
Older low-level programmers came to know and love the accumulator: many mourned
its death, killed by RISC and CISC advancements alike. This quote from well-known
New Zealand engineering management guru Adrian Busch sums it all up: ‘If it hasn’t
got an accumulator, it isn’t a real CPU.’
Figure 4.9 A block diagram of the parallel bitwise functional chain of parallel 1-bit units that comprise the ALU.
From the background work in Chapter 2, we know that addition and subtraction are not parallel bit-wise operations. By that, we mean the nth bit result of an addition depends not only on the nth bits of each input operand, but also on all previous bits, n-1, n-2, ..., 0. In fact, arithmetic operations between two values in general are not accomplished in a bit-parallel manner, but logical operations between two values are.
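A small C sketch makes the distinction concrete: the AND below is formed one bit position at a time with no interaction between positions, whereas the ADD, built from 1-bit stages, must pass a carry from each stage to the next (function names are illustrative):

```c
#include <stdint.h>

/* Sketch: 8-bit AND and ADD built explicitly from 1-bit operations. */
static uint8_t and8(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int n = 0; n < 8; n++)
        r |= (uint8_t)(((a >> n) & 1u) & ((b >> n) & 1u)) << n; /* bit n depends only on bit n */
    return r;
}

static uint8_t add8(uint8_t a, uint8_t b, uint8_t *carry_out)
{
    uint8_t r = 0, carry = 0;
    for (int n = 0; n < 8; n++) {
        uint8_t an  = (a >> n) & 1u, bn = (b >> n) & 1u;
        uint8_t sum = an ^ bn ^ carry;                     /* full-adder sum bit        */
        carry = (an & bn) | (an & carry) | (bn & carry);   /* carry ripples to bit n+1  */
        r |= (uint8_t)(sum << n);
    }
    *carry_out = carry;
    return r;
}
```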
Knowing what types of functions an ALU performs in typical devices and having
looked at some examples, it may now be instructive to perform a low-level design of
an ALU to explore how it operates.
Operands (R1, R2)    Instruction        Result in R0
127, 1               ADD R0, R1, R2     128 (unsigned), -128 (signed)
-1, 1                ADD R0, R1, R2     0

3 'V' is often used to represent the overflow flag instead of 'O', which might be confused with a zero.