Michael Langer Comp 273 2016

COMP 273 0 - binary representation of positive integers Jan. 7, 2016

Welcome to COMP 273. Let’s get started! You all know that computers represent numbers using
0’s and 1’s. But how exactly this works is probably a mystery for most of you. We start out
with a basic and fundamental example, namely how to represent integers. We’ll start with positive
integers, then negative integers, and then we’ll turn to non-integers i.e. real numbers.

Decimal vs. binary representation of positive integers


Up to now in your life, you’ve represented numbers using a decimal representation (the ten digits
from 0,1, ... 9). The reason 10 is special is that we have ten fingers. There is no other reason for
using decimal. There is nothing special otherwise about the number ten.
Computers don’t represent numbers using decimal. Instead, they represent numbers using bi-
nary. All the algorithms you learned in grade school for addition, subtraction, multiplication and
division work for binary as well. Before we review these algorithms, let’s make sure we understand
what binary representations of numbers are. We’ll start with positive integers.
In decimal, we write numbers using digits {0, 1, . . . , 9}, in particular, as sums of powers of ten.
For example,
238ten = 2 ∗ 10^2 + 3 ∗ 10^1 + 8 ∗ 10^0
In binary, we represent numbers using bits {0, 1}, in particular, as a sum of powers of two:
11010two = 1 ∗ 2^4 + 1 ∗ 2^3 + 0 ∗ 2^2 + 1 ∗ 2^1 + 0 ∗ 2^0
I have put little subscripts (ten and two) to indicate that we are using a particular representation
(decimal or binary). We don’t need to always put this subscript in, but sometimes it helps to remind
us of what base we are using. For example, 11010two is not the same thing as 11010ten .
It is trivial to write a decimal number as a sum of powers of ten and it is also trivial to write a
binary number as a sum of powers of two, namely, just as I did above. To convert from a binary
number to a decimal number is also trivial. You need to write the powers of 2 as decimal numbers
and then add up these decimal numbers.
11010two = 16 + 8 + 2
To do so, you need to memorize the powers of 2
2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, 2^6 = 64, 2^7 = 128, 2^8 = 256, 2^9 = 512, 2^10 = 1024, . . .
How do you convert from a decimal number to a binary number? Your first idea would be to
find the largest power of 2 that is less than the number, subtract that power of 2, and then repeat
to find the smaller powers of 2 in the remainder. This works, but you need to know the powers of
2 to use it.
On the next page, I show another algorithm for converting from decimal to binary, and an exam-
ple. The algorithm repeatedly divides the decimal number by 2, and concatenates the remainders.
Why does this work? For any positive number m, let
m = ⌊m/2⌋ ∗ 2 + (m mod 2)
where ⌊m/2⌋ is the quotient and ⌊ ⌋ rounds down (floor). Let (m mod 2) be the remainder, i.e.
division mod 2. In the lecture slides, this was written with slightly different notation for mod – you
should be familiar with both.

last updated: 16th Jan, 2016
When m is already represented as a binary number (i.e. “base 2”), the quotient and remainder
are trivial to obtain. The remainder is the rightmost bit – called the least significant bit (LSB). The
quotient is the number with the LSB chopped off. To convince yourself of this, note that writing a
positive integer as an n bit binary number means that you write it as a sum of powers of 2,
(b_{n−1} b_{n−2} . . . b_2 b_1 b_0)two ≡ Σ_{i=0}^{n−1} b_i 2^i = Σ_{i=1}^{n−1} b_i 2^i + b_0 = 2 × Σ_{i=0}^{n−2} b_{i+1} 2^i + b_0

Here is the conversion algorithm:

Algorithm 1 Convert decimal to binary


INPUT: a number m expressed in base 10 (decimal)
OUTPUT: the number m expressed in base 2 using a bit array b[ ]
i ← 0
while m > 0 do
  b[i] ← m mod 2
  m ← ⌊m/2⌋
  i ← i + 1
end while

For example,

quotient remainder interpretation (not part of algorithm)


241
120 1 241 = 120*2 + 1
60 0 120 = 60*2 + 0
30 0 60 = 30*2 + 0
15 0 30 = 15*2 + 0
7 1 15 = 7*2 + 1
3 1 7 = 3*2 + 1
1 1 3 = 1*2 + 1
0 1 1 = 0*2 + 1
0 0 0 = 0*2 + 0
0 0
: :

Thus,
241ten = 11110001two .
You need to be able to do this for yourself. So practice it!
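The repeated-division procedure translates directly into Java (the language used later in these notes). This is a sketch of my own, not from the notes; the class and method names are made up, and a string stands in for the bit array b[ ]:

```java
public class DecimalToBinary {
    // Repeatedly divide by 2; the remainders, read right to left, are the bits.
    static String toBinary(int m) {
        if (m == 0) return "0";
        StringBuilder bits = new StringBuilder();
        while (m > 0) {
            bits.insert(0, m % 2);  // remainder becomes the next bit, right to left
            m = m / 2;              // integer division computes the floor of m/2
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(241)); // 11110001, as in the worked example
    }
}
```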
If you are not fully clear why the algorithm is correct, consider what happens when you run
the algorithm where you already have the number in binary representation. (The quotient and
remainder of a division by 2 does not depend on how you have represented the number i.e. whether
it is represented in decimal or binary.)


(quotient) (remainder)
11110001
1111000 1
111100 0
11110 0
1111 0
111 1
11 1
1 1
0 1
0 0
0 0
: :

The remainders are simply the bits used in the binary representation of the number!
Final note: The representation has an infinite number of 0’s on the left, which can be truncated.
I mention this because in the heat of an exam you might get confused about which way to order
the 0’s and 1’s. Remember that you get infinitely many 0 remainders when you continue to apply
the trick; this tells you that the remainder bits fill in from right to left.

Performing addition with positive binary numbers


Doing addition with positive integers is as easy with the binary representation as it is with the
decimal representation. Let’s assume an eight bit representation and compute 26 + 27.

00110100 ← carry bits (ignore the 0’s for now)


00011010 ← 26
+ 00011011 ← 27
00110101 ← 53

It’s the same algorithm that you use with decimal, except that you are only allowed 0’s and 1’s.
Whenever the sum in a column is 2 or more, you carry to the next column:

2^i + 2^i = 2^(i+1)
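The column-by-column carry algorithm can be sketched in Java as a sanity check. This is a made-up helper, not from the notes; it carries whenever a column sums to 2 or more and discards the ninth bit, just as the worked example does:

```java
public class BinaryAdd {
    // Add two 8-bit values column by column, exactly as in the worked example.
    static int add8(int a, int b) {
        int sum = 0, carry = 0;
        for (int i = 0; i < 8; i++) {
            int ai = (a >> i) & 1, bi = (b >> i) & 1;
            int s = ai + bi + carry;  // 0, 1, 2, or 3 in this column
            sum |= (s & 1) << i;      // keep the low bit of the column sum
            carry = s >> 1;           // carry 1 when the column sum is 2 or more
        }
        return sum;                   // the ninth-bit carry is discarded (8 bits only)
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(add8(26, 27))); // 110101, i.e. 53
    }
}
```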

We would like to apply the grade school algorithm to do subtraction, multiplication and long division
in binary. However, there are some subtleties in doing so. For example, when you subtract a big
positive number from a small positive number, you end up with a negative number, and I have not
yet told you how to represent negative numbers. That’s what I’ll start with next lecture.



COMP 273 1 - twos complement, floating point, hexadecimal Jan. 11, 2012

“Twos complement” representation of integers


To motivate our representation of negative numbers, let me take you back to your childhood again.
Remember driving in your parents’ car and looking at the odometer.[1] Remember the excitement
you felt when the odometer turned from 009999 to 010000. Even more exciting was to see the
odometer go from 999999 to 000000, if you were so lucky. This odometer model turns out to be the
key to how computers represent negative numbers.
If we are working with six digit decimal numbers only, then we might say that 999999 behaves
the same as -1. The reason is that if we add 1 to 999999 then we get 000000 which is zero.

999999
+ 000001
000000

Notice that we are ignoring a seventh digit in the sum (a carry over). We do so because we are
restricting ourselves to six digits.
Take another six digit number 328769. Let’s look for the number that, when added to 328769,
gives us 000000. By inspection, we can determine this number to be 671231.

328769
+ 671231
000000

Thus, on a six digit odometer, 671231 behaves the same as -328769.


Let’s apply the same idea to binary representations of integers. Consider eight bit numbers.
Let’s look at 26ten = 00011010two . How do we represent −26ten ?

00011010
+ ????????
00000000

To find −26, we need to find the number that, when added to 00011010 gives us 00000000. We use
the following trick. If we “invert” each of the bits of 26 and add it to 26, we get

00011010 ← 26
+ 11100101 ← inverted bits
11111111

This is not quite right, but it’s close. We just need to add 1 more. (Remember the odometer.)

11111111
+ 00000001
00000000

Note we have thrown away the ninth bit because we are restricting ourselves to eight bits only. Thus,

[1] Ok, ok, odometers were used back in the 20th century, before you were born.

last updated: 12th Jan, 2016

00011010 ← 26
11100101 ← inverted bits
+ 00000001 ← +1
00000000

Alternatively, we add 1 to the inverted bit representation and this must give us −26.

11100101 ← inverted bits


+ 00000001 ← +1
11100110

00011010
11100110 ← This is −26ten .
00000000

Thus, to represent the negative of a binary number, we invert each of the bits and then add 1. This
is called the twos complement representation.
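In Java, the invert-and-add-1 rule can be sketched as follows. This is my own illustrative helper, not from the notes: ~ inverts all the bits and & 0xFF keeps only eight of them, which is the “odometer” wrap-around:

```java
public class TwosComplement {
    // Negate an 8-bit value: invert the bits, add 1, keep only the low 8 bits.
    static int negate8(int x) {
        return (~x + 1) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(negate8(0b00011010))); // 11100110, i.e. -26
    }
}
```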

Special cases
One special case to check is that the twos complement of 00000000 is indeed 00000000.

00000000
11111111 ← invert bits
11111111

And adding 1 gets us back to zero. This makes sense, since -0 = 0.


Another special case is the decimal number 128, and again we assume an 8 bit representation. If
you write 128 in 8-bit binary, you get 10000000. Note that taking the twos complement gives you
back the same number 10000000.

10000000
+ 01111111 ← invert bits
11111111

And adding 1 gets us back to zero. So, what is -128 ?

01111111 ← the inverted bits


+ 00000001 ← adding 1
10000000

Note that 128 has the same representation as -128. Of course, we can’t have both: we have to
decide on one. Either 10000000 represents 128 or it represents -128. How does that work?


Unsigned vs. signed numbers


If we treat all the 8 bit numbers as positive, but ignore the carry out of the leftmost bit in our sum
(the most significant bit, or MSB), then adding 1 to the binary number 11111111 (which is 255 in
decimal) takes us back to 0. See the figure below on the left. This representation is called unsigned.
Unsigned numbers are interpreted as positive.
To allow for negative numbers, we use the twos complement representation. Then we have the
situation of the circle on the right. This is called the signed number representation. Note that the
MSB indicates the sign of the number. If the MSB is 0, then the number is non-negative. If the
MSB is 1, then the number is negative.

[Figure: two circles of the 8 bit patterns . . . , 00000010, 00000001, 00000000, 11111111, 11111110, . . . On the unsigned circle these patterns label 0, 1, . . . , 127, 128, . . . , 255; on the signed circle they label 0, 1, . . . , 127, −128, . . . , −2, −1.]

Unsigned and signed n-bit numbers


The set of unsigned n-bit numbers is represented on a circle with 2^n steps. The numbers are
{ 0, 1, 2, . . . , 2^n − 1 }. It is common to use n = 16, 32, 64 or 128, though any value of n is
possible. The signed n bit numbers are represented on a circle with 2^n steps, and these numbers are
{ −2^(n−1), . . . , 0, 1, 2, . . . , 2^(n−1) − 1 }. Signed n bit numbers are represented using twos complement.
For example, if n=8, then the signed numbers are {−128, −127, . . . , 0, 1, 2, . . . , 127} as we saw
earlier. Consider the following table for the 8 bit numbers.

binary signed unsigned


00000000 0 0
00000001 1 1
: : :
01111111 127 127
10000000 -128 128
10000001 -127 129
: : :
11111111 -1 255
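Java’s byte type is an 8 bit signed number, so you can watch both interpretations of one bit pattern side by side (an illustrative sketch of my own, not from the notes):

```java
public class SignedUnsigned {
    public static void main(String[] args) {
        byte b = (byte) 0b10000001;    // the bit pattern 10000001
        System.out.println(b);         // signed interpretation: -127
        System.out.println(b & 0xFF);  // unsigned interpretation: 129
    }
}
```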


If n=16, the corresponding table is:

binary signed unsigned


0000000000000000 0 0
0000000000000001 1 1
: : :
0000000001111111 127 127
0000000010000000 128 128
0000000010000001 129 129
: : :
0111111111111111 2^15 − 1 2^15 − 1
1000000000000000 −2^15 2^15
1000000000000001 −2^15 + 1 2^15 + 1
: : :
1111111101111111 -129 2^16 − 129
1111111110000000 -128 2^16 − 128
1111111110000001 -127 2^16 − 127
: : :
1111111111111111 -1 2^16 − 1

A surprising example! (Java)


Take n = 32. The largest signed integer is thus 2^31 − 1. In Java (and C), the type int defines a 32
bit signed number. Let’s explore the limits on this representation.
First, note that 2^10 = 1024 ≈ 10^3, i.e. one thousand, and 2^20 ≈ 10^6 or one million, and 2^30 ≈ 10^9
or one billion. So, 2^31 ≈ 2,000,000,000 or two billion and in fact 2^31 is a bit more than two billion.
What if we declare:

int j = 4000000000; // 4 billion > 2^31

This gives a compiler error: “The literal of type int is out of range.” The compiler knows that
4,000,000,000 is greater than 2^31 − 1. Now try:

int j = 2000000000; // 2 billion < 2^31


System.out.println( 2 * j );

This prints out -294967296. To understand why these particular digits are printed, you would need
to convert 4000000000 to binary, which I don’t recommend because it is tedious. The point is that
it is a negative number! This can easily happen if you are not careful, and obviously it can lead to
problems.
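If you do need the true product, one option (my suggestion, not from the notes) is to widen to a 64 bit long before multiplying; alternatively, Math.multiplyExact throws an exception on overflow instead of silently wrapping around:

```java
public class OverflowCheck {
    public static void main(String[] args) {
        int j = 2000000000;
        System.out.println(2 * j);    // 32-bit multiply wraps around to a negative int
        System.out.println(2L * j);   // widened to 64-bit long: 4000000000
        try {
            Math.multiplyExact(2, j); // checked 32-bit multiply
        } catch (ArithmeticException e) {
            System.out.println("int overflow detected");
        }
    }
}
```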

Floating point
Let’s next talk about binary representations of fractional numbers, that is, numbers that lie between
the integers. Take a decimal number such as 22.63. We write this as:

22.63 = 2 ∗ 10^1 + 2 ∗ 10^0 + 6 ∗ 10^−1 + 3 ∗ 10^−2


The “.” is called the decimal point. We can use a similar representation for fractional binary
numbers. For example,

(110.011)two = 1 ∗ 2^2 + 1 ∗ 2^1 + 0 ∗ 2^0 + 0 ∗ 2^−1 + 1 ∗ 2^−2 + 1 ∗ 2^−3

where “.” is called the binary point. If we convert to decimal, we get

4 + 2 + 0.25 + 0.125 = 6.375

Binary to decimal conversion


Just as with integers, we can convert a binary number into a decimal number using the brute force
method, namely remember or figure out the powers of 2 and then add up all contributions from
1 bits. This is relatively easy to do for simple examples such as above. What about for examples
that have far more bits to the right of the binary point?

Decimal to binary conversion


Take the example 22.63 above. We can convert the number to the left of the decimal point from
decimal to binary, using the method from lecture 1, namely 22 = (10110)2 . But how do we convert
the fractional part (.63) to binary? The idea is to use the fact that multiplication by 2 in binary
produces a shift to the left by one bit. (i.e. Multiplying by 2 in binary just adds 1 to each exponent
in the sum of powers of 2, and this corresponds to a shift of bits to the left.)
We convert 0.63 to binary as follows. To the left of the binary point, we represent the number
in binary. (Assume the number is positive. We’ll deal with negative numbers next lecture.) To
the right of the binary point, we represent the fractional part in decimal. To go from one line to
the next, we multiply by 2 and divide by 2. Specifically, we multiply the fractional decimal part by
2, and keep track of the number of divisions by 2 by incrementing the exponent of a power of 2.
To the left of the binary point, we shift by one bit and fill the least significant bit with 1 or 0,
depending on whether the (doubled) fractional part is at least 1. You should verify why this
makes sense, namely think what happens when we multiply by 2.
.63
= (1)2 .26 × 2^−1
= (10)2 .52 × 2^−2
= (101)2 .04 × 2^−3
= (1010)2 .08 × 2^−4
= (10100)2 .16 × 2^−5
= (101000)2 .32 × 2^−6
= (1010000)2 .64 × 2^−7
= (10100001)2 .28 × 2^−8
= (101000010)2 .56 × 2^−9
= etc.
Notice that the number of bits on the left of the binary point is 9, corresponding to the shift
by 9 bits which is implied by the exponent of 2^−9. Finishing the example, since 22 in binary is
10110, we have
22.63 ≈ (10110.101000010)2
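The repeated doubling can be sketched in Java. This is a hypothetical helper of my own (fracBits is not from the notes); it peels off one bit of the fractional part per doubling:

```java
public class FracToBinary {
    // Convert the fractional part of a positive decimal to n bits,
    // by repeatedly doubling and peeling off the integer part.
    static String fracBits(double frac, int n) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < n; i++) {
            frac *= 2;
            if (frac >= 1) { bits.append('1'); frac -= 1; }
            else           { bits.append('0'); }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(fracBits(0.63, 9)); // 101000010, matching the worked example
    }
}
```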


We write ≈ because we have rounded down, namely by chopping off any trailing bits. These trailing
bits are of the form Σ_{i=−∞}^{−10} b_i 2^i. Note that we have negative powers of 2 here.
What if we don’t chop off the trailing bits and instead generate bits ad infinitum? What
happens? Interestingly, at some point you will generate a sequence of bits that repeats itself over
and over. To see why, first consider a simple example. We write 0.05 in binary.
.05
= (0) .1 × 2^−1
= (00) .2 × 2^−2
= (000) .4 × 2^−3
= (0000) .8 × 2^−4
= (00001) .6 × 2^−5
= (000011) .2 × 2^−6
= (0000110) .4 × 2^−7
= (00001100) .8 × 2^−8
= (000011001) .6 × 2^−9
= (0000110011) .2 × 2^−10
= etc.
Note this pattern 0011 will be repeated over and over as the part to the right of the binary point
cycles back to “.2”. This gives:
(.05)ten = .000011 0011 0011 · · ·
where the group 0011 repeats ad infinitum.
In a more general case, we have N digits to the right of the binary point. If you “run the algorithm”
I have described, then you will find that the number to the right of the binary/decimal point will
eventually repeat itself. Why? Because there are only 10^N possible numbers of N digits. And once
the algorithm hits an N digit number it has hit before, this defines a loop that will repeat over and
over again. If the algorithm takes k steps to repeat, then k bits are generated. These k bits will
similarly repeat ad infinitum. Of course, the repeating bits might be all 0’s, as in the case of 0.375
which was in the lecture slides. (Do it for yourself if you are studying from these notes.)

Hexadecimal representation of binary strings


When we write down binary strings with lots of bits, we can quickly get lost. No one wants to look
at 16 bit strings, and certainly not at 32 bit strings. Yet we will need to represent such bit strings
often. What can we do?
The most common solution is to use hexadecimal, which is essentially a base 16 representation.
We group bits into 4-tuples (2^4 = 16) starting at the rightmost bit (i.e. least significant bit). Each 4-bit
group can code 16 combinations and we typically write them down as: 0,1,...,9,a,b,c,d,e,f.
The symbol a represents 1010, b represents 1011, c represents 1100, ..., f represents 1111.
You may find yourself saying a represents the decimal number 10 and is written in 4-bit binary
as 1010, b represents the decimal number 11 and is written in 4-bit binary as 1011, ..., f represents
the decimal number 15 and is written in 4-bit binary as 1111. If you do this, then you are first
converting 4-bit binary to decimal, and then converting decimal to hexadecimal. It’s fine to do this,
but it’s not necessary. The 4-tuples don’t always stand for numbers from 0 to 15.


We commonly write hexadecimal numbers as 0x____, where the blank is filled with char-
acters from 0,...,9,a,b,c,d,e,f. For example,

0x2fa3 = 0010 1111 1010 0011.

Sometimes hexadecimal numbers are written with capital letters. In that case, a large X is used as
in the example 0X2FA3.
If the number of bits is not a multiple of 4, then you group starting at the rightmost bit (the
least significant bit). For example, if we have the six bit string 101011, then we represent it as 0x2b.
Note that looking at this hexadecimal representation, we don’t know if it represents 6 bits or 7 or
8, that is, 101011 or 0101011 or 00101011. But usually this is clear from the context.
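Java can do the grouping for you; Integer.toHexString works directly on the underlying bit pattern (a small sketch of my own, not from the notes):

```java
public class HexDemo {
    public static void main(String[] args) {
        // Group bits in 4s from the least significant bit: 0010 1111 1010 0011
        System.out.println(Integer.toHexString(0b0010111110100011)); // 2fa3
        // The six bit string 101011, grouped as 10 1011, gives 0x2b
        System.out.println(Integer.toHexString(0b101011));           // 2b
    }
}
```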



COMP 273 2 - IEEE floating point Jan. 13, 2016

Fixed Point
We are mostly discussing floating point numbers today, but I want to mention briefly that some
computers represent non-integer numbers using fixed point. Fixed point means we have a constant
number of bits (or digits) to the left and right of the binary (or decimal) point.
For example, we might have eight digits to the left of the decimal point and two digits to the
right. An example is 23953223.49. You are familiar with representations that have two digits to the
right. That is how our currency works (but the number of digits to the left is not fixed with currency). An
example in binary is 10.1101 where we have two and four bits to the left and right, respectively.
Note that I used the term ”digits” for base 10, and ”bits” for base 2.
We can represent negative numbers in fixed point using two’s complement. Take the number
6.5 in decimal which is 110.1 in binary. Suppose (arbitrarily) that we represent it with four bits to
the left and four to the right of the binary point. To represent -6.5 in binary, we invert the bits and
add what we need to get back to 0.

0110.1000
1001.0111 ← invert bits
+ 0000.0001 ← add .0001
0000.0000

We get the representation of -6.5 by adding the 2nd and 3rd rows of the sum:

1001.0111 ← invert bits


+ 0000.0001 ← add .0001
1001.1000 ← -6.5 as signed fixed point

Scientific Notation
Let’s return to floating point and consider a number representation called “scientific notation,”
which most of you learned about in high school. You know that you can represent very large
decimal numbers or very small decimal numbers by writing, for example:

300,000,000 = 3 × 10^8 = 3.0E+8

which is the speed of light in meters per second, and

.00000456 = 4.56 × 10^−6 = 4.56E−6

which has no particular significance to my knowledge.


In normalized scientific notation in base 10, there is exactly one digit to the left of the decimal
point and this digit is from 1 to 9 (not 0). Note that the number 0 cannot be written in normalized
scientific notation. Examples of normalized binary numbers are shown below on the right side of the
equations:
(1000.01)2 = 1.00001 × 2^3
(0.111)2 = 1.11 × 2^−1

last updated: 14th Jan, 2016

A normalized binary number has the form:

1.____ × 2^E

where the ____ is called the significand (or mantissa) and E is the exponent. Note that it
is impossible to encode the number 0 in normalized form.
The number of bits in the significand is called the number of significant bits. Note that we don’t
count the first bit when counting the number of significant bits. (In decimal, we do have to specify
the most significant digit since it can be anything from 1 to 9, so it is included in the count.)

IEEE floating point standard


To represent floating point numbers in a computer, we need to make some choices:
• how many bits to use for the significand?

• how many bits to use for the exponent?

• how to represent the exponent?

• how to represent the sign of the number?

• how to represent the number 0 ?


Until the late 1970’s, different computer manufacturers made different choices. This meant that the
same software produced different results when run on different computers. Obviously this was not
desirable. A standard was introduced in 1985 by the IEEE organization.[1] The standard (number
“754”, i.e. there are lots of different standards) consists of two representations for floating point
numbers. One is called single precision and uses 32 bits. The other is called double precision and
uses 64 bits. This representation is used in nearly all computers you will find today.

Single precision
Single precision represents a floating point number using 32 bits (4 bytes, that is, one byte is 8
bits). We number the bits from right to left. Bit 31 is the leftmost and bit 0 is the rightmost.
• bits 0-22 (23 bits) are used for the significand

• bits 23-30 (8 bits) are used for the exponent

• bit 31 (1 bit) is used for the sign (0 positive, 1 negative).


Because the standard uses normalized numbers, the 1 bit to the left of the binary point in
1.____ does not need to be coded since it is always there. The significand only refers to
what is to the right of the binary point.
The coding of the exponent is subtle. You might expect the exponent should be encoded
with two’s complement, since you want to allow positive exponents (large numbers) and negative
exponents (small numbers). This is not the way the IEEE chose to do it, however.
[1] Institute of Electrical and Electronics Engineers, Inc. See [Link]


The two exponent codes (00000000, 11111111) are reserved for special cases.

• The exponent 00000000 is used to represent a set of numbers lying strictly in the tiny interval
(−2^−126, 2^−126) which includes the number 0. These numbers are called denormalized numbers.

1.00000000000000000000000 × 2^−126 ← smallest positive normalized number

0.11111111111111111111111 × 2^−126 ← largest denormalized number
0.11111111111111111111110 × 2^−126
0.11111111111111111111101 × 2^−126
:
0.00000000000000000000010 × 2^−126
0.00000000000000000000000 × 2^−126 ← 0
- 0.00000000000000000000001 × 2^−126
- 0.00000000000000000000010 × 2^−126
:
- 0.11111111111111111111110 × 2^−126
- 0.11111111111111111111111 × 2^−126 ← smallest denormalized number

- 1.00000000000000000000000 × 2^−126 ← largest negative normalized number

Notice that, to represent 0, the exponent bits and significand bits are all 0’s.

• The exponent 11111111 is used to represent either infinity (plus or minus, depending on the
sign bit), or the case that there is no number stored (called “not a number”, or NaN). This
case could arise if your program has declared a floating point variable but you haven’t assigned
it a value yet.
These special cases are distinguished by the significand. If the significand is all 0’s (23 of them),
then the interpretation is ±∞, depending on the sign bit. If the significand is not all zeros,
then the interpretation is NaN.

There are 254 remaining 8-bit exponent codes, namely from 00000001 to 11111110, which are
used for normalized numbers. These codes represent exponents ranging from -126 through 127.

exponent code exponent value


00000001 -126 ← smallest exponent value
00000010 -125
: :
01111110 -1
01111111 0 ← the code for exponent value 0 is the “bias”
10000000 1
10000001 2
: :
11111110 127 ← largest exponent value


To convert from this 8-bit exponent code to the exponent value that it represents, you interpret
the 8-bit code as an unsigned number (from 1 to 254) and then subtract the value 127 (called the
“bias”) which is the code for the exponent value of 0.
How should you think about the discretization of the real line that is defined this way? Think
of consecutive powers of 2, namely [2^e, 2^(e+1)]. That interval is of size 2^e. Partition that interval into
2^23 equal sized pieces, each of size 2^(e−23). Note that the step size between “samples” is constant
within each interval [2^e, 2^(e+1)] but that the step size doubles in each consecutive power of 2 interval.
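Java exposes this step size directly as Math.ulp (a real library method, though this use of it is my own sketch): for a float in [2^e, 2^(e+1)), the gap to the next representable float is 2^(e−23).

```java
public class Spacing {
    public static void main(String[] args) {
        // In [1, 2) the step between floats is 2^-23; in [2, 4) it doubles to 2^-22.
        System.out.println(Math.ulp(1.0f));
        System.out.println(Math.ulp(2.0f));
        System.out.println(Math.ulp(2.0f) == 2 * Math.ulp(1.0f)); // true
    }
}
```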

An Example
How is 8.75 coded up in the IEEE single precision standard? First you show that
8.75 = (100011)2 × 2^−2
using the techniques discussed earlier. Then, you shift the binary point to normalize the number,
(100011)2 × 2^−2 = (1.00011)2 × 2^3
The significand is 00011000000000000000000 where I have underlined the part that came from the
above equation. Note that we have dropped the leftmost 1 bit, because we are using normalized
numbers.
The exponent code is determined by adding 127 to the exponent value, i.e. 3 + 127 = 130, and
writing 130 as an unsigned number (10000010)2 . The sign bit is 0 since we have a positive number.
Thus, 8.75 is encoded with the 32 bits
0 10000010 00011000000000000000000.
We can write it in hexadecimal as 0x410c0000.
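You can check this encoding in Java with Float.floatToIntBits, which returns the raw 32 bit pattern (a quick sketch of my own, not from the notes):

```java
public class FloatBits {
    public static void main(String[] args) {
        // The IEEE 754 single precision bit pattern of 8.75f, in hexadecimal.
        System.out.println(Integer.toHexString(Float.floatToIntBits(8.75f))); // 410c0000
    }
}
```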
Suppose we want to add two single precision floating point numbers, x and y, e.g.
x = 1.101010000000000000000101 × 2^−3
y = 1.001001000010101010101010 × 2^2
where x has a smaller exponent than y. We cannot just add the significands bit by bit, because
then we would be adding terms with different powers of 2. Instead, we need to rewrite one of the
two numbers such that the exponents agree. The usual way to do this is to rewrite the smaller of
the two numbers, that is, rewrite the number with the smaller exponent. So, for the above example,
we rewrite x so that it has the same exponent as y:
x = 1.101010000000000000000101 × 2^−3
= 1.101010000000000000000101 × 2^−5 × 2^2
= 0.00001101010000000000000000101 × 2^2
Notice that x is no longer normalized, that is, it does not have a 1 to the left of the binary point.
Now that x and y have the same exponent, we can add them.
1.001001000010101010101010 × 2^2
+ 0.00001101010000000000000000101 × 2^2
= 1.00110001011010101010101000101 × 2^2


The sum is a normalized number. But the significand is more than 23 bits. If we want to represent
it as single precision float, then we must truncate (or round off, or round up) the extra bits.
A related point, which I discussed last lecture, is that most numbers that have a finite number
of digits (base 10) require an infinite number of bits to represent exactly. This means that when
you try to represent these numbers as floating point in a computer, you will have to approximate
them. This can lead to surprising results. Take the example 0.05 i.e. 1/20.0 that we saw last class, and
consider the following code:

float x = 0;
for (int ct = 0; ct < 20; ct++) {
    x += 1.0 / 20;
    System.out.println( x );
}

It prints out up to eight digits to the right of the decimal point.

0.05
0.1
0.15
0.2
0.25
0.3
0.35000002
0.40000004
0.45000005
0.50000006
etc

None of these numbers is represented exactly in the computer! The computer is just printing the
best approximation it can with eight digits to the right of the decimal point. For example, 0.1
means 0.10000000 and it doesn’t bother writing the seven 0’s on the right because people don’t
generally want to see those 0’s. But that’s just convention.
Why eight digits? The answer is presumably that 2^23 ≈ 10^7, and so the number that is being
approximated with those 23 bits of significand can be approximated just about as well with seven
or eight decimal digits.

Double precision
The 23 bits of significand might seem like a lot at first glance. But 2^23 is only about 8 million. When
we try to represent integers larger than this value, we find that we have problems. We don’t have
enough precision and we have to skip some. (See Exercises 1 Q10c.)
In order to get more precision, we need more bits. The IEEE double precision representation
uses 64 bits instead of 32. 1 bit is used for the sign, 11 bits for the exponent, and remaining 52 bits
for the significand (1 + 11 + 52 = 64). A similar idea is used for the exponent bias except that
now the bias is 1023 (210 − 1), rather than 127 (27 − 1). Conceptually, there is nothing new here
beyond single precision, though.


In the slides, I went through the same examples as I did for single precision. I wrote out the
exponent codes for double precision, and I encoded 8.75 as a double. I suggest you work them out
for yourself, and verify with the slides. I also gave a similar Java example for double.

double x = 0;
for (int ct = 0; ct < 10; ct++) {
    x += 1.0 / 10;
    System.out.println( x );
}

which prints out

0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999

Note that it prints out up to about 16 digits. The reason for 16 is that

2^52 = (2^10)^5 ∗ 2^2 ≈ (10^3)^5 ∗ 4 = 4 ∗ 10^15 ≈ 10^16

and so with 52 bits we can represent about the same number of numbers as we can with 16 digits.
Of course, we are still approximating.
Note that the 3rd line has 17 digits to the right of the decimal point. Also, it seems strange
that the third line is slightly greater than .1 * 3 whereas the 8th-10th lines are slightly less than
.1 * 8, .1 * 9 and .1 * 10 respectively. It is not obvious why this behavior occurs. (Time permitting, I will figure it out
later – or if one of you wishes to figure it out and let me know, please do.)

IEEE converter web site


Here is a useful tool to check out if you want to practice this stuff.
[Link]



COMP 273 3 - combinational logic 1 Jan. 18, 2016

In lectures 1 and 2, we looked at representations of numbers. For the case of integers, we saw
that we could perform addition of two numbers using a binary representation and using the same
algorithm that you used in grade school. I also argued that if you represent negative numbers using
twos complement, then you can do addition with negative numbers as well. In lecture 4, we will see
how to implement the addition algorithm using circuits. Before we can do so, we need to review
some basic logic and use it to build up circuits.

Truth tables
Let A and B be two binary valued variables, that is, A, B each can take values in {0, 1}. If we
associate the value 0 with FALSE and the value 1 with TRUE, then we can make logical expressions
using such binary valued variables. We make such expressions using the binary operators AND,
OR, and the unary operator NOT. Let A · B denote AND(A, B). Let A + B denote OR(A, B). Let
A′ denote NOT(A). The AND and OR operators are defined by the following truth table.

A B A·B A+B
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 1

The unary operator NOT is defined by the truth table:

A A′
0 1
1 0

Since a truth table of two input variables has four rows, it follows that there are 2^4 = 16 possible
output functions (that is, possible columns in the truth table). The operators AND and OR are
just two of these 16 possible output functions. Three other output functions that are commonly
used are NAND (not and), NOR (not or), and XOR (exclusive or). These are defined as follows.

A B   A·B   A+B   (A·B)′   (A+B)′   A⊕B


0 0 0 0 1 1 0
0 1 0 1 1 0 1
1 0 0 1 1 0 1
1 1 1 1 0 0 0

To justify why the symbols “+” and “·” are used to represent OR and AND operators, I would
have to spend some time talking about boolean algebra. This might be of interest to some of you,
but it won’t take us in the direction we want to go in this course. Instead, I will give you a brief
introduction to the topic and then focus on how to use the logical expressions to build up digital
circuits in a computer.

last updated: 19th Jan, 2016 1


Laws of Boolean Algebra


identity A+0=A A·1=A

inverse A + A′ = 1 A · A′ = 0

one and zero A+1=1 A·0=0

commutative A + B = B + A, A·B =B·A

associative (A + B) + C = A + (B + C) (A · B) · C = A · (B · C)

distributive law: A · (B + C) = (A · B) + (A · C) (A · B) + C = (A + C) · (B + C)

De Morgan (A · B)′ = A′ + B′ (A + B)′ = A′ · B′


These laws capture the logical reasoning that you carry out in your head when you evaluate the
truth of expressions formed using OR, NOT, AND. But they are more than this. The laws also
give a set of rules for automatically evaluating and re-writing such expressions. For example, the
commutative law allows you to swap the order of two terms; De Morgan’s Laws allows you to swap
the order of the NOT with either of the OR or AND operators, etc.
I encourage you to memorize the names of the laws. Although I will not examine you on this,
these names are not just used in Boolean Algebra; they are used in Algebra in many other situations.
At the very least, you should convince yourself that you already know the laws of Boolean Algebra
since you use them every day in reasoning about the world. You should understand what
each of the laws says, and you should make sure that you agree with each of the laws. In particular,
note that most of these laws correspond to the rules of addition and multiplication with numbers, but
not all do. The second distributive law generally does not.
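Because each variable takes only the values 0 and 1, any of these laws can also be checked exhaustively by machine. A sketch using Java ints 0/1, with & playing the role of · and | the role of + (the class and helper names are my own):

```java
public class BooleanLaws {
    public static void main(String[] args) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                int notA = 1 - a, notB = 1 - b;   // complement of a 0/1 value
                // De Morgan: (A·B)′ = A′ + B′ and (A+B)′ = A′·B′
                assertEq(1 - (a & b), notA | notB);
                assertEq(1 - (a | b), notA & notB);
                for (int c = 0; c <= 1; c++) {
                    // second distributive law: (A·B) + C = (A+C)·(B+C)
                    assertEq((a & b) | c, (a | c) & (b | c));
                }
            }
        }
        System.out.println("all checked laws hold on every row");
    }
    static void assertEq(int x, int y) {
        if (x != y) throw new AssertionError(x + " != " + y);
    }
}
```

This brute-force check over all rows is exactly how a truth table argument works, just carried out by the machine.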

Example
On the last page I wrote truth tables for basic binary and unary operators. We can also write truth
tables for expressions that are built out of these operators. Here is an example:
Y = (A · B · C)′ · (A · B + A · C)
The last column Y′ is not necessary here, but I’ll discuss it on the next page.

A B C   A·B·C   (A·B·C)′   A·B   A·C   A·B+A·C   Y   Y′


0 0 0 0 1 0 0 0 0 1
0 0 1 0 1 0 0 0 0 1
0 1 0 0 1 0 0 0 0 1
0 1 1 0 1 0 0 0 0 1
1 0 0 0 1 0 0 0 0 1
1 0 1 0 1 0 1 1 1 0
1 1 0 0 1 1 0 1 1 0
1 1 1 1 0 1 1 1 0 1

Sum-of-products and product-of-sums (two level logic)


Logical expressions can get very complicated. If a logical expression has n different variables then
there are 2^n combinations of values of these variables, and hence 2^n rows in the truth table.
However, once the expression is evaluated, there are simple ways to rewrite the expression, as we
show next.
Suppose that Y were some horribly long expression with the A, B, C variables, and that we
computed the values of Y for the various combinations of A, B, C. We think of Y as being the
output and A, B, C as being the input variables. We can write Y in two simple ways. The first is
called a sum-of-products. For the example on the previous page:
Y = A · B′ · C + A · B · C′
since “·” is a product and “+” is a sum. The two terms correspond to the two 1’s in the Y column
of the above table.
The second is called a product-of-sums. To compute the product-of-sums, we use a trick. First,
we write Y′ as a sum-of-products. For the example on the previous page, see the rightmost column:
Y′ = A′·B′·C′ + A′·B′·C + A′·B·C′ + A′·B·C + A·B′·C′ + A·B·C
and then we negate both sides.
Y = (A′·B′·C′)′ · (A′·B′·C)′ · (A′·B·C′)′ · (A′·B·C)′ · (A·B′·C′)′ · (A·B·C)′
We then apply De Morgan’s laws within each term. Here you must convince yourself that de
Morgan’s laws hold for n variables, not just two variables. In the case below, we have three
variables.
Y = (A + B + C) · (A + B + C′) · (A + B′ + C) · (A + B′ + C′) · (A′ + B + C) · (A′ + B′ + C′)
If we have n input variables, then writing our output as a sum-of-products or product-of-sums might
involve as many as 2^n terms, one for each row. However, it has the advantage that it only involves
two levels of binary operations (first OR then AND, or vice-versa).
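The sum-of-products construction is mechanical enough to automate: for each row whose output is 1, emit the minterm that is true on exactly that row. A sketch that builds the expression as a string from an output column (the class and method names, the use of . for AND and ' for NOT, and the row ordering are my own choices):

```java
public class SumOfProducts {
    // output[i] is the Y value for the row whose input bits spell out i in binary
    static String sumOfProducts(int[] output, String[] vars) {
        StringBuilder sb = new StringBuilder();
        int n = vars.length;
        for (int row = 0; row < output.length; row++) {
            if (output[row] == 0) continue;        // only the 1-rows contribute a term
            if (sb.length() > 0) sb.append(" + ");
            for (int v = 0; v < n; v++) {
                int bit = (row >> (n - 1 - v)) & 1; // value of variable v in this row
                sb.append(vars[v]).append(bit == 1 ? "" : "'");
                if (v < n - 1) sb.append(".");
            }
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        // Y column from the example above: 1 only on rows A,B,C = 101 and 110
        int[] y = {0, 0, 0, 0, 0, 1, 1, 0};
        System.out.println(sumOfProducts(y, new String[]{"A", "B", "C"}));
        // prints A.B'.C + A.B.C'
    }
}
```

Running it on the Y column of the example reproduces the two-term sum-of-products derived above.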

Don’t cares
If we have m input variables and n output variables, then the truth table has 2^m rows. However,
sometimes we don’t need to write all the rows because certain combinations of the inputs give the
same output. For example, take
Y = A · B′ · C + A′
The second expression is really the sum of four products which have all combinations of B and
C. Alternatively, the second expression can be thought of as saying “not A, and I don’t care what B
and C are”. We denote such “don’t care’s” in the truth table with x symbols.
A B C Y
0 x x 1
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 0

Logic Gates
Computers solve for expressions “Y = ...” using electronic circuits, called logic gates, which are
composed of resistors, transistors, and other elements. Transistors are basic circuits typically
made from silicon. (To understand how they work, you would need to spend a
few semesters in the physics department.) Transistors were invented in the 1940s at Bell Labs, and
the inventors won the Nobel Prize for it in 1956.
We will not discuss how transistors work. Rather we will work at the gate level. A gate
implements one of the binary operators defined above, or the unary operation NOT. Examples of
the gates we will use are shown below. These are standard symbols and you will need to memorize
them.
[Figure: standard symbols for the AND, NAND, NOT, OR, NOR, and XOR gates]

The inputs and outputs to gates are wires which have one of two voltages (often called low and
high, or 0 and 1). Often we will put more than two inputs into an AND or OR gate. This is allowed
because of the associative law of Boolean algebra. (The underlying physical circuit would have to
be changed, but we don’t concern ourselves with this lower level.) Also note that the commutative
law of Boolean algebra says that the ordering of the wires into a gate doesn’t matter.

Combinational (or Combinatorial) Circuits


Example 1
Consider a circuit that computes the value of the following expression, where A, B, C are three
input variables and Y is the output variable:

Y = A·B+A·C

Whatever binary values are given to the input variables, the output Y will correspond to the truth
value of the logical expression.


[Figure: gate-level circuit for Y = A · B + A · C, with inputs A, B, C and output Y]

The black dot in this figure indicates that input wire A branches into two input wires.

Example 2
We can write the XOR gate using a sum-of-products
Y = A ⊕ B = A · B′ + A′ · B
and the circuit is shown below. (We could have also used a product-of-sums.)

Example 3
Consider a circuit,
Y = S′ · A + S · B
This circuit is called a multiplexor or selector. What does this circuit do? If S is true, then Y is
given the value B whereas if S is false then Y is given the value A. Such a circuit implements the
familiar statement:
if S
then Y := B
else Y := A
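The if/else can be written directly on bits. A minimal Java sketch (0/1 ints stand in for wire values; the class and method names are my own):

```java
public class Mux {
    // Y = S′·A + S·B : selects A when S = 0 and B when S = 1
    static int mux(int s, int a, int b) {
        return ((1 - s) & a) | (s & b);   // (1 - s) plays the role of S′
    }
    public static void main(String[] args) {
        System.out.println(mux(0, 1, 0));  // prints 1: S = 0 selects A
        System.out.println(mux(1, 1, 0));  // prints 0: S = 1 selects B
    }
}
```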

COMP 273 4 - combinational logic 2 Jan. 20, 2016

Read-only memory (ROM) using combinational logic circuits


The truth tables are defined by “input variables” and “output variables”, and we have been thinking
of them as evaluating logical expressions. Another way to think of a combinational circuit is as a
Read Only Memory (ROM). The inputs encode a memory address. The outputs encode the value
stored at the address. We say such a memory is read-only because the gates of the circuit are fixed.
For example, suppose the memory address is specified by the two input variables A1 , A0 , and
the contents at each address are specified by the bits Y2 Y1 Y0 . Here is a truth table:

A1 A0 Y2 Y1 Y0
0 0 0 1 1
0 1 0 0 1
1 0 0 0 0
1 1 1 0 0

and here is the corresponding memory. (Note that the address is not stored in the memory!)

input output
(address) (contents of memory)
00 011
01 001
10 000
11 100

What is new about this concept? You now have to think of the input and output variables in
a different way than you did before. Before, you thought of the input variables as having TRUE
or FALSE values, and you did not associate any particular significance to their order. Thinking
about the circuit as a ROM is quite different. You no longer think of each of the input variables
as TRUE or FALSE. Rather, you think of the input variables as defining a binary number, namely
an address. The order of the input variables now has a significance, since it defines an ordering on
the address space. Similarly, the output variables are a string of bits whose order may or may not
matter.
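In software terms, a ROM behaves like an indexed constant table: the address bits select an entry, and the entry is the stored word. A sketch of the 2-bit-address ROM above (the class name is my own; the array literal is just the output column of the truth table):

```java
public class Rom {
    // CONTENTS[address] holds the three output bits Y2 Y1 Y0, from the truth table above
    static final int[] CONTENTS = {0b011, 0b001, 0b000, 0b100};
    static int read(int a1, int a0) {
        return CONTENTS[(a1 << 1) | a0];   // the input bits form an address, so their order matters
    }
    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(read(1, 1)));  // prints 100
    }
}
```

Note that, just as in the table, the addresses themselves are not stored anywhere; they are implicit in the array indexing.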

Arithmetic Circuits
Last class we discussed gates and circuits and how they could be used to compute the values
of logical expressions, formed by NOT, AND, OR and other operators. Such circuits are called
combinational logic circuits or combinational digital circuits. We will look at several more useful
examples today.
How would we implement addition using these gates? Recall the algorithm from Lecture 1 for
adding two n-bit numbers, A and B :
Cn−1 Cn−2 . . . C2 C1 C0
An−1 An−2 . . . A2 A1 A0
+ Bn−1 Bn−2 . . . B2 B1 B0
———————————
Sn−1 Sn−2 . . . S2 S1 S0

last updated: 11th Mar, 2016. Lecture notes © Michael Langer

The Ci are the carry bits, and Si are the sum bits. Note that C0 = 0 because there is no way to
carry in. (It will be clear later why we are defining this bit.)
How are we interpreting these bits? We are doing “odometer arithmetic” here, so it doesn’t
matter whether we are thinking of the numbers as signed (twos complement) or unsigned. This does
matter when we interpret the results, but it doesn’t matter from the standpoint of the mechanisms
for computing the sum.
The rightmost bit (or least significant bit) S0 of the sum is simple to determine using logical
expressions and combinational circuits, as is the carry out bit C1.
A0 B0 S0 C1
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Thus,
S0 = A0 ⊕ B0
C1 = A0 · B0
The circuit that implements these is called a half adder. It has two inputs and two outputs. You
should be able to draw this circuit yourself. More generally, the values of the k th sum and carry
out bits are given as follows:
Ck Ak Bk Sk Ck+1
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
which we can represent using sum-of-products circuits as defined by these expressions:

Sk = Ak′ · Bk′ · Ck + Ak′ · Bk · Ck′ + Ak · Bk′ · Ck′ + Ak · Bk · Ck

Ck+1 = Ak′ · Bk · Ck + Ak · Bk′ · Ck + Ak · Bk · Ck′ + Ak · Bk · Ck
The box with the “+” below contains the combinational circuitry for implementing the truth table
above. Note that we have used a full adder for the least significant bit (LSB) and set C0 to 0.
This circuit is called a ripple adder. To compute Sn−1 you need to allow the carries to ripple
through from bits 0 up to bit n − 1. This suggests that for large n (say 32 bits for A and B each),
you would have to allow a long delay. Later this lecture, we will discuss how this delay can be
avoided. For now, we turn to other useful combinational circuits.
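The rippling can be simulated in software, one full adder per bit, with each carry feeding the next stage. A sketch (the array representation, with index 0 as the LSB, is my own choice; XOR and majority are equivalent to the two sum-of-products expressions above):

```java
public class RippleAdder {
    // a[k], b[k] are bits, index 0 is the LSB; returns the n sum bits (final carry out discarded)
    static int[] add(int[] a, int[] b) {
        int n = a.length;
        int[] s = new int[n];
        int c = 0;                       // C0 = 0: no carry in to the LSB
        for (int k = 0; k < n; k++) {
            s[k] = a[k] ^ b[k] ^ c;      // Sk: 1 when an odd number of the three inputs is 1
            c = (a[k] & b[k]) | (a[k] & c) | (b[k] & c);  // Ck+1: majority of the three inputs
        }
        return s;
    }
    public static void main(String[] args) {
        int[] a = {1, 1, 0, 0};          // 0011 = 3
        int[] b = {1, 0, 1, 0};          // 0101 = 5
        int[] s = add(a, b);
        for (int k = 3; k >= 0; k--) System.out.print(s[k]);  // prints 1000 = 8
        System.out.println();
    }
}
```

The sequential loop mirrors the hardware delay: the carry for stage k is not available until every earlier stage has been evaluated.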


[Figure: a ripple adder. Each “+” box is a full adder taking Ak, Bk, and carry in Ck, and producing Sk and carry out Ck+1, which feeds the next stage; C0 = 0.]

Encoders
An encoder is a combinational logic circuit whose outputs specify which of a set of possible inputs
or groups of inputs has occurred. The term encoder is used somewhat loosely. (There is no generally
agreed upon formal definition of an encoder that I am aware of.) Let’s look at a few examples.
A common example is a set of m input wires, such that exactly one of them has the value 1.
For example, consider a keyboard with 5 buttons and suppose that the mechanics of these buttons
are such that one and only one of these buttons is pressed down. (You may have seen something
like this on a children’s toy.) Such an encoder might be constructed using the truth table:

A B C D E Y1 Y2 Y3
0 0 0 0 1 0 0 1
0 0 0 1 0 0 1 0
0 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
1 0 0 0 0 1 0 1

Notice that if none of the buttons is pressed, or if multiple buttons are pressed (because you push
hard and break the mechanism), then the above truth table does not specify what the value of the
output variables Yi is. But this is not a problem, as long as the design is such that only one button
can be pressed at any time.


Suppose we allow for multiple buttons to be pressed and we want to explicitly indicate in the
truth table what the output of the circuit will be for each possible input. One way to do it is shown
below. Consider the following truth table and recall that X means “don’t care”. If no buttons are
pressed, the output is 000. If two or more buttons are pressed, then the output is the same as if
only the rightmost of these buttons was pressed.
A B C D E Y1 Y2 Y3 interpretation
0 0 0 0 0 0 0 0 none of the buttons is pressed
X X X X 1 0 0 1 E is pressed (and maybe A,B,C,D)
X X X 1 0 0 1 0 D and not E (but maybe A,B,C)
X X 1 0 0 0 1 1 C and neither E nor D (but maybe A,B)
X 1 0 0 0 1 0 0 etc
1 0 0 0 0 1 0 1 etc

For example, written as a sum-of-products, we would have

Y1 = (A · B′ · C′ · D′ · E′) + (B · C′ · D′ · E′)
Y2 = (D · E′) + (C · D′ · E′)
Y3 = E + (C · D′ · E′) + (A · B′ · C′ · D′ · E′)
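In code, the don't-care rows of this table collapse into an if/else chain, with the rightmost button checked first. A sketch (the class and method names are my own; the 3-bit return codes are the Y1 Y2 Y3 values from the table):

```java
public class PriorityEncoder {
    // inputs are 0/1 button signals; returns the code Y1 Y2 Y3 from the truth table above
    static int encode(int a, int b, int c, int d, int e) {
        if (e == 1) return 0b001;  // E wins regardless of A..D (the X entries)
        if (d == 1) return 0b010;  // D and not E
        if (c == 1) return 0b011;  // C and neither D nor E
        if (b == 1) return 0b100;
        if (a == 1) return 0b101;
        return 0b000;              // no button pressed
    }
    public static void main(String[] args) {
        System.out.println(encode(1, 0, 1, 0, 0));  // C beats A: prints 3, i.e. code 011
    }
}
```

The early returns play exactly the role of the X entries: once a higher-priority button is seen, the remaining inputs are never examined.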

A second example of an encoder that you should be familiar with is an output device (an LCD)
that displays a single digit from {0, 1, ..., 8, 9}. Each digit is defined by “turning on” a subset of the
line segments L1 , · · · , L7 . For example, when all seven of the line segments are on, the output digit
represents “8”, whereas when all lines except L2 are turned on, the output digit represents “0”. In
Exercises 2, you will write out the truth table for such an encoder.
[Figure: ten buttons (one per digit) driving a seven-segment digit display with segments L1 (top), L4 and L6 (upper left and right), L2 (middle), L5 and L7 (lower left and right), L3 (bottom)]

Decoders
A decoder is the opposite of an encoder. It takes a code (a binary number) as input, and turns on
one of several output wires (or, more generally, some subset of the output wires). For example,
here is a truth table for a 1-to-2 decoder:
A Y1 Y2
0 1 0
1 0 1


Here is the truth table for a “2-to-4 decoder”:


A B Y1 Y2 Y3 Y4
0 0 1 0 0 0
0 1 0 1 0 0
1 0 0 0 1 0
1 1 0 0 0 1

You should be able to draw the circuit for these two examples.
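In code, a decoder turns its input number into a one-hot output: exactly one of the 2^n output wires is 1. A sketch of the 2-to-4 decoder (the class name and the 0-based wire indexing, where the table's Y1 is index 0, are my own choices):

```java
public class Decoder {
    // 2-to-4 decoder: returns a one-hot array where entry (2*A + B) is 1
    static int[] decode(int a, int b) {
        int[] y = new int[4];
        y[(a << 1) | b] = 1;       // the inputs are read as a 2-bit number
        return y;
    }
    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(decode(1, 0)));  // [0, 0, 1, 0]
    }
}
```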

Multiplexor (or selector)


One common way that decoders are used is to select from a set of possible input variables. The
circuit that does a selection is called a multiplexor. A multiplexor takes n inputs and chooses among
them. A multiplexor is also called a selector.
A multiplexor requires two types of inputs. The first carries the data that can potentially go
through the circuit. The second carries a code saying which of the data goes through.
The simplest example of a multiplexor has two input wires that carry a bit of data – call these
A and B – and a selector input wire – call it S – that says which of A and B to select. It is called a
1-bit multiplexor. (We saw this example at the end of last lecture.) The three circuits shown below
show this “one bit multiplexor” at three levels of increasing abstraction, from left to right.

[Figure: the one-bit multiplexor at three levels of increasing abstraction: the gate-level circuit with a decoder on S, an intermediate view, and the MUX box symbol with data inputs A (0) and B (1) and select input S]

Suppose instead we had four different inputs (A, B, C, D) and we wanted to select from them.
Then we would use a “two bit multiplexor” circuit as shown below on the left. The MUX symbol
on the right will be used whenever we want to hide the underlying circuitry.
Sometimes the inputs A, B, C, D are not single bits, but may have many bits, e.g. we might be
selecting one of four integers. In this case we might use the same notation shown in the two bit
multiplexor figure, and if we wanted to explicitly show that the inputs had a certain number (say
16) of bits, we would label the wires as such, by putting a little slash in each wire and writing 16.
We will see examples later in the course and the notation will become natural to you.


[Figure: a two-bit multiplexor: a 2-to-4 decoder on the two select lines S gates the four data inputs A, B, C, D onto the output Y; on the right, the MUX box symbol hides the underlying circuitry]

Fast adder
Let’s look at a few examples of how multiplexors can be useful. Recall the ripple adder discussed at
the beginning of the lecture. We can reduce the long delays for the carries to propagate, using the
following trick. Assume n is an even number. Divide the bits into low order bits (0, 1, . . . , n/2 − 1)
and high order bits (n/2, . . . , n − 1). Build one circuit to compute the sum for the low bits. Build
two circuits to compute the sum for the high order bits. Set the carry in bit for the LSB in the
lower order bit circuit to 0. Set the carry in bits for the LSB in the two high order circuits to 0 and
1, respectively. In the figure below, the “+” box is a 16 bit adder, rather than a one bit adder. (I
drew a slightly different looking figure in the lectures.)
The main idea here is that we don’t need to wait for the worst case in which the carries are
passed through the entire circuit (e.g. adding A = 011111 · · · 111 and B = 0000 · · · 0001). Instead,
once the carries pass through the first n/2 full adders and the circuits stabilize, we know which of
the upper order sums we want. We just select it. Thus we have almost cut the computation time
in half. I write “almost” because we also have to do the selection, which takes time.
Note that we could repeat this idea, and speed up the computation of the three n/2 bit additions,
by decomposing each of them into three additions of n/4 bits, and so on. There is a tradeoff here
between time (fast) and space (having to replicate the circuitry for the upper order bits to handle
the two cases of the carry out from the low order bits).


[Figure: a fast adder for 16-bit operands. One adder computes S7, S6, ..., S0 with carry in 0. Two adders compute the high bits from A15, A14, ..., A8 and B15, B14, ..., B8, one with carry in 0 and one with carry in 1; the carry out of the low half drives a multiplexor that selects S15, S14, ..., S8.]

Subtraction
Suppose we wanted to build a circuit for performing subtraction.

An−1 An−2 . . . A2 A1 A0
– Bn−1 Bn−2 . . . B2 B1 B0
———————————-

To perform A − B, we compute A + (−B). This requires us to negate B and then send the result
through the adder.
To negate a number that is represented in twos complement, we invert each of the bits and then
add 1 (lecture 1). To do this with a combinational circuit is easy. We can use a circuit similar to the
adder except that the B bits need to be inverted, giving B′. To add one, we can set the carry in bit C0 to
1. See below on the left. Binvert is the variable that specifies whether we are doing addition (0) or
subtraction (1). Note that we can use this variable both to select B versus B′, and as the value of C0.
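The invert-and-add-one trick is easy to check with Java's machine integers, which are twos complement: for any b, −b equals (~b) + 1, so the same adder computes a − b when Binvert = 1 supplies both the inversion and the carry in. A sketch (the class and method names are my own):

```java
public class Subtractor {
    // A - B computed as A + (~B) + 1, i.e. invert the B bits and set the carry in to 1
    static int subtract(int a, int b) {
        return a + ~b + 1;
    }
    public static void main(String[] args) {
        System.out.println(subtract(8, 3));   // prints 5
        System.out.println(subtract(3, 8));   // prints -5, in twos complement
    }
}
```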


[Figure: left, the adder/subtractor: for each bit i, a multiplexor controlled by Binvert selects Bi or Bi′ before the full adder, and Binvert also supplies the carry in C0. Right, a one-bit ALU slice: a multiplexor controlled by the Operation code (add, and, or) selects among the AND, OR, and adder outputs.]

Arithmetic Logic Unit (ALU)


We can make the circuit even more powerful by also computing the operations AND and OR for
each of the bits. In the figure above right, a dotted square indicates the adder part of the circuit
for the ith bit. The entire local circuit computes four different operations, namely, addition and
subtraction, AND, OR. We select from these four operations using a multiplexor. [MODIFIED
March 11:] Notice the selector for the operation has only three possible values (add, and, or)
rather than four (add, sub, and, or). The reason that “sub” doesn’t have its own value is that
the subtraction operation is specified by the Binvert signal, namely if the operation is subtraction
then Binvert should also be set to 1, and otherwise Binvert should be set to 0.
Such a circuit is called an arithmetic logic unit or ALU. Every computer has (at least) one.


COMP 273 5 - sequential logic 1 Jan. 25, 2016

Sequential Circuits
All of the circuits that I have discussed up to now are combinational digital circuits. For these
circuits, each output is a logical combination of the inputs. We have seen that these circuits can
do arithmetic and other operations. But these circuits are not powerful enough to build a general
purpose computer.
A key element that is missing is memory. Consider two different ways of remembering something.
Suppose I tell you a telephone number. One way for you to remember it would be to write it down,
say on paper. Then later you could just look at the paper. Another way to remember the number
which is good for short term, is just to repeat the number over and over to yourself.
You probably have some intuition about how a computer could write down a number. It could
alter the state of some magnetic material, for example. What is less clear to you is how a computer
could remember a number by repeating it to itself. Indeed this is the mechanism that is often used.
We will now turn to this technique.

RS latch (reset, set)


The basic way to design a circuit that tells itself a value over and over is as follows. Consider a
feedback circuit which is constructed from two NOR gates. Such a circuit is called an RS latch.

[Figure: an RS latch: two cross-coupled NOR gates with inputs R and S and outputs Q and Q′, each output fed back into the other gate]

On the left below is a truth table for a NOR gate, as a reminder. On the right are the values of Q
and Q′ for various input combinations. It may seem strange that we label the two outputs as such,
since if R = S = 1, then Q = 0 and Q′ = 0, which is impossible, and indeed in the slides I used
the symbol Q0 instead of Q′ when introducing the RS latch. The reason it is ok to give the output
labels Q and Q′ is that eventually we will not allow R = S = 1.
The way RS latches are used is that at most one of R and S has value 1. When exactly one of
them has value 1, it determines Q and Q′. When both have value 0, then Q and Q′ keep the values
that they had when one of R and S last had the value 1. See slides for illustration.

A B (A + B)′    R S Q Q′
0 0 1           0 0 hold ← remember
0 1 0           0 1 1 0 ← “set”
1 0 0           1 0 0 1 ← “reset”
1 1 0           1 1 0 0 ← not allowed

last updated: 26th Jan, 2016 1


It is standard to use letters R and S and to call them “reset” and “set”, respectively. The input
R = 1 and S = 0 “resets” the output Q which traditionally means “gives it value 0”. The input
R = 0 and S = 1 “sets” the output Q which means “gives it the value 1”. If you then put R = 0 and
S = 0, you hold the value of Q, i.e. you remember the value. Thus, an RS latch acts as a one-bit
memory, holding (locking, latching shut) the value Q as long as R = 0, S = 0. The feedback in the
circuit is what I said before about repeating a number back to yourself to remember it.
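The feedback loop can be simulated in software by re-evaluating the two NOR gates a few times until they settle; the object fields that survive between calls play the role of the feedback wires. A sketch (the class and method names are my own):

```java
public class RSLatch {
    private int q = 0, qBar = 1;   // state held between calls: the feedback wires
    static int nor(int x, int y) { return (x | y) == 0 ? 1 : 0; }
    // apply inputs R and S (not both 1) and let the cross-coupled NOR gates settle
    void step(int r, int s) {
        for (int i = 0; i < 3; i++) {   // a few evaluation passes are enough to stabilize
            q = nor(r, qBar);           // Q  = NOR(R, Q′)
            qBar = nor(s, q);           // Q′ = NOR(S, Q)
        }
    }
    int q() { return q; }
    public static void main(String[] args) {
        RSLatch latch = new RSLatch();
        latch.step(0, 1);               // set
        latch.step(0, 0);               // hold: R = S = 0
        System.out.println(latch.q());  // prints 1: the latch remembers the set
        latch.step(1, 0);               // reset
        latch.step(0, 0);               // hold
        System.out.println(latch.q());  // prints 0
    }
}
```

The hold calls with R = S = 0 show the memory: the output is determined not by the current inputs but by which of R and S was last 1.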

The clock
All the circuits we’ve seen up to now can change their output values at any time their inputs change.
You can imagine this has undesirable properties. If the outputs of one circuit are fed into other
circuits, the exact timing and ordering of operations will be important. For example, consider the
adder/subtractor we saw earlier. The sequence of carries takes some time, and we don’t want the
inputs to the adder to change (say, to add two new numbers) before all the carries have propagated
and the result from the first sum is finished, and the result stored somewhere.
To prevent this problem, we need a mechanism of synchronizing the inputs and outputs of the
circuits. Synchronization is achieved by a clock. A clock in a computer is a signal (a value on a
wire) that alternates between 0 and 1 at a regular rate. When you say that a processor has a clock
speed of 3 GHz, you mean that the clock cycles between 0 and 1 at a rate of 3 billion times per
second (GHz means “gigahertz”, where “giga” means billion).
To understand how a clock signal is generated, you would need to take a course in electronics
and learn about crystal oscillators. For COMP 273, we will simply assume such a clock signal is
available and that it regularly oscillates between 0 and 1 as shown below. The ON interval has the
same duration as the OFF interval.
[Figure: a clock signal alternating between 0 and 1 over time, with equal ON and OFF intervals]

clocked RS latch
We can combine the clock and the RS latch to give us more control over when we write into the RS
latch, namely we only write when the clock is 1. The circuit below is called a clocked RS latch.

[Figure: a clocked RS latch: the R and S inputs are each ANDed with the clock C before entering the RS latch, which outputs Q and Q′]


When C = 0, the circuit holds its value regardless of any changes in the R or S inputs. When
C = 1, the circuit behaves as the RS latch shown above. Note that this does not yet avoid the
R = S = 1 situation which I mentioned above.
Here is an example of how the output Q can vary as the R and S inputs vary. Note that
transitions in Q can occur at the transition of C from 0 to 1 (in the case that S or R already has
the value 1 at that time), or during the time that C=1, in the case that S or R becomes 1 while
C = 1. One tricky case occurs during the second to last clock pulse. During that pulse, there are
two brief events to note: first S goes to 1 temporarily, and then R goes to 1 temporarily. Note what
happens to Q. When S goes to 1, Q goes to 1, and when S drops back to 0, Q stays on (memory).
Q falls to 0 a short time later, namely when R goes to 1, since then Q is reset.

D latch (D for “data”)


The RS latch allows us to write either a 1 or 0 value by inputing a 1 to S or R, respectively. The
D latch allows us to write the value of a binary variable D. The detailed circuit is shown below on
the left. The symbolic representation commonly used is shown on the right.
If D = 1, then we want to write 1 and so we make the S input be 1 (recall the clocked RS latch).
If D = 0, then we want to write 0 and so we should make the R input be 1. This is done by feeding
D into the S input and feeding the complement of D into the R input.
When C = 0, the outputs of the two AND gates are 0 and so Q holds (latches) its value, regardless
of the value of D. When C = 1, the latch is open and the value of D is written into Q.

[Figure: a D latch: D feeds the S input and D′ feeds the R input of a clocked RS latch. The symbolic representation on the right is a box with inputs D and C and output Q.]


This D latch circuit is extremely important, so let’s repeat the idea. When C = 0, the data input
D does not get written to Q, and the previously written value is held. A write only occurs when C = 1,
and in that case the current value of D is written and can be read. Also notice that, by design, it
is impossible for the R and S inputs to both be 1 here, so our earlier concern about these values
being simultaneously 1 now goes away.
The D latch provides some control on when we can write, but it is still not good enough. When the
computer’s clock signal C is 1, data passes freely through the circuits. Let’s consider an example
of why this is problematic. Suppose we wish to implement an instruction such as

x := x + y;

We could implement it by putting the values of x and y into circuits made out of an array of
sequential circuits. We will discuss the correct way of doing this over the next few lectures. For
now, let’s look at an incorrect way of doing it, namely by representing x and y as binary numbers
and storing the bits of each number in an array of D latches. Say we have somehow managed to
store y in the A array and x in the B array. (Each of the little boxes in the figure below is supposed
to be a D latch. We when say “store a value” we mean that the Q outputs have that value.) We
feed these two arrays into our ALU (arithmetic logic unit) and feed the answer back into the B
array, in order to execute the above instruction. What happens?

When C = 1, the sum bits Si computed by the adder will write their values back into the Bi
units. But as long as C = 1, these newly written values will go right through into the adder again.
They will be summed once more with y, and will loop back and be written into the B array again,
etc. Moreover, the carry bits will take time to propagate, and so we cannot even think of this as a clean
succession of additions. What a mess! When C becomes 0 again, and writes are no longer allowed,
the values that are in the B units are not what we want.


D flip-flop
We need a mechanism that prevents data from being read from a D latch at the same time as it is
being written to a D latch. The mechanism that achieves this is called a D flip-flop. The two basic
types of D flip-flop are shown below. In each case, we have a pair of D latches with the Q output
value of the first being the D input of the second, and with the clock C being inverted for one D
latch but not the other.
[Figure: two D flip-flops, each built from two D latches in series, with the Q output of the first latch feeding the D input of the second. The clock C is inverted at the first latch in the rising edge triggered version, and at the second latch in the falling edge triggered version.]

Here I discuss just the falling edge one. (The argument of what happens for the rising edge one
is very similar.) When C goes from 0 to 1, the first D latch is written to, but the second D latch
holds its previous value because the inverted clock input to the second D latch goes from 1 to 0.
When C drops from 1 to 0, the value that had just been written to the first D latch now is written
to the second D latch whose clock input rises from 0 to 1 and this data value appears in the output
Q. Note that when C drops from 1 to 0, the value that had just been written into the first D latch
is held even though the input to the first D latch might have changed. The reason the previous
value is held is that C is 0. We say this flip-flop is falling edge triggered because the Q value changes
when the clock falls from 1 to 0.
For this mechanism to guarantee the synchronization we desire, the interval of a clock pulse
has to be long enough that the combinational circuits (such as adders, multiplexors, etc) between
flip-flops have finished computing their output values. Consider again the example above for x :=
x + y. Each of the bit memory units in fact would be a D flip flop rather than a D latch. We
would now have synchronization. During a clock pulse (C=1), the adder circuit would compute x
+ y, and at the end of the clock pulse, when C falls from 1 to 0, the result would be written into x.
Below is an example of timing diagram for D and the value stored in a flip flop. I have included
both the rising edge case and falling edge case.


(Timing diagram: the D input and the stored value Q, for a rising edge triggered and a falling edge triggered flip-flop.)

Registers
To perform an addition, you can imagine that the numbers/bits are read from a set of flip-flops (sequential circuit), go through an adder circuit (combinational circuit), and are written into another set of flip-flops, or even possibly the same flip-flops. I gave a sketch of this above, and we'll see how this works in the MIPS computer in coming lectures.
A set of flip-flops that holds the values of some variable is called a register. The key property
of a register is that its flip-flops are written to as a unit. There is no way to selectively write to
the individual flip-flops in a register. On any clock cycle, you either write to every flip-flop in the
register, or you write to none of the flip-flops in the register.
The sketch below shows a four bit register. Each of the square boxes represents one D flip-flop. Here I have written FF in each box to emphasize it's a flip-flop, not a D latch. (In the slides, I didn't write "FF", and in future I won't write "FF".) Each flip-flop has a clock input C, a data input D, and a Q output.

(Figure: a four bit register made of four D flip-flops, with inputs D3, D2, D1, D0 and outputs Q3, Q2, Q1, Q0, all sharing the same clock C.)

Shift registers
A shift register is a special register that can shift bits either to the left or right. We have seen that
shifting bits of an unsigned number to the left multiplies that number by 2, and shifting bits to the
right divides the number by 2. You can imagine it might be useful to have specialized circuits for
computing such shifts.
The figure below shows a four bit shift right register. When we shift right, we need to specify the D input for the leftmost flip-flop. In the figure below, the leftmost bit is filled with the value of variable Q4. You can design the circuit to fill it with whatever you want (0, 1, or even wraparound from Q0, etc.) and select one of these with a multiplexor.


(Figure: a four bit shift right register; the fill value Q4 enters on the left, and the bits Q3, Q2, Q1, Q0 shift toward the right.)

Note that for “right” and “left” to be meaningful, we need to use the common convention that
the least significant bit (that is, Q0 ) is on the right.

Examples
Show a timing diagram of the contents of the above register. Assume the initial values in the register are

(Q3, Q2, Q1, Q0) = (1, 0, 0, 1)

and assume the flip-flops in the register are falling edge triggered. Also assume that, over the clock pulses shown, the value of Q4 is 0. (In general, Q4 is not always 0, but it makes the timing diagram easier to understand if we think of it as 0. Essentially we are just following the initial Q3 value of 1 as this value moves through Q2 to Q1 to Q0.)

(Timing diagram: waveforms of Q0, Q1, Q2, Q3 over four falling clock edges.)
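If it helps, the example can be checked with a few lines of Python (my own sketch, not from the notes; the function name is hypothetical) that mimic the shift register one falling clock edge at a time:

```python
def shift_right(bits, fill=0):
    """State of the shift-right register after one falling clock edge.
    bits is (Q3, Q2, Q1, Q0); fill plays the role of Q4."""
    return (fill,) + bits[:-1]

state = (1, 0, 0, 1)              # initial (Q3, Q2, Q1, Q0)
history = [state]
for _ in range(4):                # four falling edges of the clock
    state = shift_right(state)
    history.append(state)
# the initial 1 in Q3 moves through Q2, Q1, Q0 and then falls off the end
```

This reproduces the timing diagram: the single 1 bit marches rightward, one position per falling edge.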

Let's look at a more general shift register that can shift either left or right, that can be cleared (all bits set to zero), and that can be written to as a unit (write to all bits in one clock cycle). Notice that we have tilted each flip-flop on its side in order to simplify the figure. Also notice that we need a 2 bit selector to specify which of the four operations you want to perform. The same selector is applied to each multiplexor. I have only shown bits 7, 8, 9 but you can imagine extending the figure to other bits.


(Figure: three bits of the general shift register. For each bit i, a multiplexor with 2 bit selector S chooses among shift left, shift right, clear, and load Di; its output feeds the D input of flip-flop i, clocked by C, with output Qi. Bits 9, 8, 7 are shown.)



COMP 273 6 - sequential logic 2 Jan. 27, 2016

Today I will finish off our discussion of registers, and then move on to discuss larger memories.

T flip-flop (toggle)
The circuit below on the left shows a D flip-flop whose data input D comes from the complement of the stored value Q. At every clock cycle, the Q value changes, namely it takes the complemented value. If the D flip-flop is falling edge triggered, then the value changes when the clock goes from 1 to 0, which is the case shown on the right below. If the D flip-flop is rising edge triggered, then the value of Q changes when the clock rises from 0 to 1.

(Figure: a D flip-flop with D tied to the complement of Q, and a timing diagram showing Q alternating 0, 1, 0, 1, ... on successive clock edges, with the complement output alternating oppositely.)

Such a flip-flop is called a T flip-flop, where T is for toggle. Note that the output of a T flip-flop changes at half the rate of its clock input.

Counters and timers


Suppose we make a register out of T flip-flops as shown below. The values in the flip-flops are labelled Qn−1 . . . Q2 Q1 Q0. In the figure, the least significant bit is on the left, to make the diagram easier to read. Note that we are using the Qi value as the clock input for flip-flop i + 1. So, Q0 (on the left) toggles its value once per clock cycle, and more generally Qi+1 toggles its value whenever bit Qi changes from 1 to 0 (falling edge).

(Figure: a chain of T flip-flops; each output Qi drives the clock input C of the next flip-flop, producing Q0, Q1, Q2.)

The timing diagram for this circuit is shown below. Suppose the flip-flops are falling edge triggered. The values Q2 Q1 Q0 represent binary numbers and, at each successive falling edge of the clock, the triplet encodes a binary number. By inspection we can see that the sequence of numbers is 000, 001, 010, 011, 100, 101, 110, 111, which is just the numbers from 0 to 7. Thus we have a counter circuit!

last updated: 8th Feb, 2016 1 lecture notes c Michael Langer



(Timing diagram: Q0 toggles every clock cycle, Q1 at half that rate, and Q2 at a quarter, so Q2 Q1 Q0 counts from 000 up to 111.)
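The count sequence can also be checked in software. Here is a small Python model of this ripple counter (a sketch of my own, not from the notes; names are hypothetical), where each falling edge of the external clock propagates toggles along the chain:

```python
def tick(q):
    """One falling edge of the external clock on a ripple counter.
    q = [Q0, Q1, ...]; bit i+1 toggles when bit i falls from 1 to 0."""
    q = q[:]                       # copy the state
    edge = True                    # the external clock's falling edge
    for i in range(len(q)):
        if not edge:
            break
        edge = (q[i] == 1)         # toggling 1 -> 0 is a falling edge
        q[i] ^= 1                  # toggle this bit
    return q

state = [0, 0, 0]                  # [Q0, Q1, Q2]
counts = []
for _ in range(8):
    state = tick(state)
    counts.append(4 * state[2] + 2 * state[1] + state[0])
# counts runs 1, 2, 3, 4, 5, 6, 7, 0: the circuit counts up and wraps around
```

Notice that the toggle-and-propagate rule is exactly binary increment: a toggle carries into the next bit only when a 1 falls to 0.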

What would happen if we were to use rising edge triggered flip-flops instead? Qi+1 would toggle its value whenever Qi rises from 0 to 1. You can verify for yourself that this produces a counter which counts down (and which wraps around from 000 to 111). We call this a timer since it decreases in value. This is the circuit that is used for the timer on your microwave oven or wristwatch. [ASIDE: in the lecture, someone asked why there is an initial jump from 000 to 111. See if you can answer that question by thinking about what happens on the first rising edge of C.]

(Timing diagram: the rising edge triggered version, in which Q2 Q1 Q0 counts down, wrapping around from 000 to 111.)

Register array
One common way in which we use registers is to hold the values of variables in a program. Suppose we wish to add two variables and store the result in another variable, e.g. x := y + z, or x := x + z. Suppose the value of each of the variables is stored in a register. To perform the addition, we must read the values from two registers, feed the two values into an adder (ALU), and then write the result into a register. All of this is done in one clock cycle.
A few lectures from now, we will use the MIPS processor to illustrate how such things are done.
In MIPS, the registers that are used in your programs are grouped into an array. There are 32
registers in the array – each register stores 32 bits. There is no significance to the fact that the
number of registers is the same as the number of bits in each register.


We next discuss how to read from and write to these 32 registers. We will discuss the various
elements in the following diagram.
(Figure: the register array circuit. The 32 bit outputs of registers 0 through 31 feed two multiplexors, selected by the 5 bit codes ReadReg1 and ReadReg2, which produce the 32 bit outputs ReadData1 and ReadData2. A 5-to-32 decoder on the 5 bit WriteReg code, together with WriteEnable, selects which register the 32 bit WriteData is written into.)

Reading from registers


Suppose we wanted to read the data stored in one of the 32 registers, and feed it into some circuit, say an adder. We can access this data by taking the 32 bit outputs from all 32 registers and feeding them into a multiplexor (selector). We select a register by inputting the 5 bit code of the desired register into this multiplexor (2^5 = 32). For example, if we want register 17 then we input bits 10001. We call this 5 bit code ReadReg1.
Rather than drawing all 32 lines for the output of each of the 32 registers, we just draw one line
from each register and indicate that the line is 32 bits wide. Remember that you cannot access the
individual bits of a register. You have to access all 32 bits at once. So, we might as well just draw
one line for each register.
If we are adding two numbers – or more generally operating on two bit strings contained in two registers – then we need to read from two registers, not just one. (The two registers may be the same, as in y := x + x.) To access two registers simultaneously, we need two 32 bit multiplexors as in the figure above. The two multiplexors select the two 32 bit values and feed them into a 32 bit ALU containing the adder circuit; the ALU is not shown in the figure above, but see the figure on the next page. Each of the two multiplexors needs to be told which register to select, and for this each


needs 5 bits (again 2^5 = 32). Thus a 5 bit coded number must be fed into each of the two multiplexors. In the figure, these are called ReadReg1 and ReadReg2. The 32 bit values that are read out are labelled ReadData1 and ReadData2.

Writing into a register


When we add two numbers, we then want to store the sum. We store the sum in a register. For example, if we were computing x := x + y, we would store the result in the same register as one of the operands. Or, if we were computing y := x + z, then we would write the sum into a different register than the two operand registers.
To write into a register, we take the 32 bit data result of the operation (e.g. the sum bits Si which are output from the adder) and we feed these into the D inputs of the flip-flops of the desired register. We specify which register to write to using 5 bits, labelled WriteReg in the figure. Note that the WriteReg signal is decoded into 2^5 = 32 lines, one of which is 1 and the rest are 0, e.g. if WriteReg has the value 13, i.e. 01101, then the 13th line is 1 and the rest are 0.
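The decoding step can be sketched in a few lines of Python (an illustration of my own, not from the notes; the function name is hypothetical):

```python
def decode_5_to_32(write_reg):
    """5-to-32 decoder: returns the 32 one-hot output lines."""
    return [1 if i == write_reg else 0 for i in range(32)]

lines = decode_5_to_32(0b01101)    # WriteReg = 13
# exactly one line carries a 1, namely line 13; the other 31 lines are 0
```

The one-hot property is the point: whatever 5 bit code comes in, exactly one of the 32 output lines is high.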
We do not necessarily want to write into a register: we will see lots of examples later when we
learn how MIPS works. We can control when we write into a register by requiring that certain
conditions are met. First, we only write into the one selected register. Second, we only write when
we have some signal – call it WriteEnable – that says it is time to write.
As we will see later, there may be many situations in which we do not want to write to a register.
For this reason, we have given the clock input a more general name here: WriteEnable. WriteEnable
would be the output of some other circuit which depends on the computer’s clock, but also on other
variables. Think of
WriteEnable = C · Y1 · Y2 . . .
where the Yi are other variables. Sometimes I will not bother writing the clock explicitly to avoid
clutter in the diagram. [ASIDE: Unfortunately I forgot to mention this important point
in the lecture. I have added a slide mentioning it and also added a slide showing the
figure above which includes the read and write circuits together.]
While the details in the above figures are important to understand, sometimes it is easier not to clutter your head with the details and instead just think of the register array as a “black box.”

(Figure: the register array as a black box, with inputs WriteEnable, ReadReg1, ReadReg2, WriteReg, WriteData and outputs ReadData1, ReadData2.)


Finally, it is natural to classify the inputs and outputs into three types:
1. addresses: ReadReg1 (5 bits), ReadReg2 (5 bits), WriteReg (5 bits)

2. data: WriteData (32 bits), ReadData1 (32 bits), ReadData2 (32 bits)

3. control: WriteEnable (1 bit)


Later when we look more generally at how the computer works, this three way classification will
help us think about what the various signals do.
Let's next turn to a slightly different design for memories. The above design is feasible for small memories such as a register array, but it won't work for larger memories. The reason is that for each flip-flop (bit) in the memory, you would need a separate wire coming out of the square array. As you increase the size of the square array, the number of flip-flops in the array rises at a much faster rate than the length of the perimeter around the array (roughly n^2 versus 4n). There simply will not be enough room for so many wires. Let's look at some alternative approaches.
We have an N × N array of flip-flops. For each flip-flop, we connect the Q output to a line that runs vertically down an entire column. The same line is used for all N flip-flops in each column. Similarly, suppose we were to connect the D input of each flip-flop to a single line that runs vertically up a column. Again, the same line is used for all N flip-flops in each column. (There is no physical difference between down and up. I just use those words so you can imagine which way the data moves.)
In this design, we can write to all the flip-flops in a row by selecting that row (turning on only the RowSelect control for that row). Each flip-flop within that row takes the value on its column's write data line, and there are N such write data lines – one for each column.

(Figure: one memory cell. The clock, MemWrite, and RowSelect controls are combined to form the cell's clock input C; the WriteData and ReadData lines run vertically along the cell's column.)

This design seems very nice (at first glance) because we need only a small number of lines for each row and for each column, which would allow us to scale up to much larger memories. There is a problem with this design, however. For any column, all of the Q outputs of the cells in that column are attached to the read data line. This won't work: we can only allow one flip-flop to put its Q value on any given line. If we tried to put a multiplexor in there, we would be back to the previous solution, which required too many wires.


The tri-state gate (or “tristate buffer”)


The problem can be solved by using another new type of gate called a tri-state gate (more commonly called a tri-state buffer). A tri-state gate has a data input and data output, as well as an enable input. If enable = 1, then the data output is equal to the data input. If enable = 0, then there is no data output. (Saying “there is no data output” is different from saying there is an output value of 0.) If enable = 0, it is as if we have disconnected the wire (temporarily). How the electronics of this works is well beyond the scope of this course, and frankly, knowing how it works would not help you understand the behavior just described (just as knowing how the motor of your electric toothbrush works will not help you to use it). Note that it is called tri-state because the output can be 0, 1, or neither. For more, see [Link]

(Figure: tri-state gate symbol and its two settings. When enable = 1, out equals in; when enable = 0, the output is disconnected.)

The figure below is similar to the one on the previous page, except that now a tri-state gate is used. As before, to write to a flip-flop in a given row, that row must be selected (RowSelect = 1) and the MemWrite (and clock) controls must be 1. Now, we also use the RowSelect control to open the tristate gate and allow the flip-flop shown to put its signal on the ReadData line. Since only a single row is selected, the tristate gates ensure that only one cell in each column puts a value on the ReadData line.

The figure below shows a slightly different design in which the read and write data lines in each column are combined into a single line. Now we can either read from a flip-flop or write into a flip-flop, but not both at once. Both the MemRead control and the RowSelect control are used to control


the tristate buffer. As before, the MemWrite control determines (along with clock and RowSelect)
whether the value on the data line is written into the flip-flop.

(Figure: a memory cell with a single shared data line per column. MemRead and RowSelect control the tristate buffer on the Q output; clock, MemWrite, and RowSelect control whether the value on the data line is written into the cell.)

RAM
In the above discussion, we considered an N ×N array of D flip-flops and we were reading or writing
all N flip-flops in some row. This was motivated by the register application, where we group a set
of bits into a unit and operate in the same way on all bits in that unit. We next consider a different
situation in which we allow ourselves either to read from or write to a single flip-flop in the N × N
array.
To select a column, we need another control line which we call ColumnSelect. This control is
fed into the same clock input for the flip-flop along with the other controls which must be 1 in order
for a write to occur. Only one of the N columns will have its control signal with value 1. The rest
will have value 0.


In the lecture, I discussed an example of the sort of RAM you could buy and put into your computer. The example showed a module with 8 chips of RAM holding 2 GB in total. Since each byte is 8 bits, it follows that there are 2^31 bits on each chip. Let's think of this as a 2^15 × 2^16 array of one bit memories.
[ASIDE: there are many different electronic technologies used for RAM. We won't go into the details. It's fine if you just assume they are the flip-flops we have been discussing – but be aware that there are many ways of getting transistor-based circuits to store one bit memories.]
The figure below shows where the RowSelect and ColumnSelect controls come from. Keep in mind that these signals are 2^16 and 2^15 lines, respectively. Exactly one of the RowSelect and one of the ColumnSelect lines has the value 1 at any time. In the figure below, the one out of 2^31 bits we are indexing (or addressing) is coded as a binary number A30 A29 · · · A1 A0. The 16 bits A30 A29 · · · A15 code the row number and the 15 bits A14 · · · A0 code the column number.
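The address split can be sketched as follows (my own Python illustration, not from the notes; the function name is hypothetical):

```python
def split_address(addr):
    """Split a 31 bit address A30..A0 into row and column select codes."""
    row = addr >> 15               # upper 16 bits A30..A15: one of 2**16 rows
    col = addr & ((1 << 15) - 1)   # lower 15 bits A14..A0: one of 2**15 cols
    return row, col

# each (row, col) pair addresses exactly one of the 2**31 bits on the chip
```

Feeding row into a 16-to-2^16 decoder and col into a 15-to-2^15 decoder produces the one-hot RowSelect and ColumnSelect lines.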

We will return to these RAM designs later in the course.



COMP 273 7 - sequential logic 3 Feb. 1, 2016

Integer multiplication
Suppose we have two unsigned integers, A and B, and we wish to compute their product. Let A be
the multiplicand and B the multiplier:

An−1 . . . A1 A0 multiplicand
× Bn−1 . . . B1 B0 multiplier
P2n−1 ... Pn−1 . . . P1 P0 product

If the multiplier and multiplicand each have n bits, then the product has at most 2n bits, since

(2^n − 1) ∗ (2^n − 1) < 2^(2n) − 1

which you can verify for yourself by multiplying out the left side. For example, if we are multiplying two 32 bit unsigned integers, then we may need as many as 64 bits to represent the product.
Recall your grade school algorithm for multiplying two numbers. We can use the same algorithm if the numbers are written in binary. The reason is that shifting a binary number one bit to the left is the same as multiplying by 2, just as shifting a decimal number's digits left by one multiplies it by ten. In the case of binary, each 1 bit Bi in the multiplier contributes 2^i times the multiplicand, and so we add the shifted multiplicands.

          1001101
  ×       0010111
  ---------------
          1001101
         10011010
        100110100
       0000000000
      10011010000
     000000000000
  + 0000000000000
  ---------------
      11011101011     product

In the grade school algorithm, we compute all the left shifts in advance, building up several rows
of numbers which correspond to the digits of the multiplier. Then, this set of numbers is added
together to yield the product (as above).
Let's consider an alternative algorithm that avoids writing out all the rows with the shifted multiplicands. Instead, the algorithm shifts the multiplicand left (i.e. multiplies it by 2) at each step i. When bit Bi of the multiplier is 1, the shifted multiplicand is added to an accumulated total. When bit Bi is 0, nothing is added to the total.
What hardware could we use to multiply two n = 32 bit unsigned numbers?

• shift registers: the multiplicand is shifted left. If we are multiplying two 32 bit numbers then
we use a 64 bit shift left register for the multiplicand. (We shift the multiplicand to the left
at most 32 times.)

• a 32 bit register for the multiplier. We use a shift right register, and read the least significant
bit (LSB, i.e. bit 0) to decide whether to add the shifted multiplicand on each step or not.

last updated: 19th Apr, 2016 1 lecture notes c Michael Langer


COMP 273 7 - sequential logic 3 Feb. 1, 2016

• a 64 bit adder
• a 64 bit register which accumulates the product
• a counter to keep track of how many times we’ve shifted.

“Algorithm” for multiplying two unsigned integers:


//
// Let P, A, B be the product, multiplicand, and multiplier registers.
//
P=0
load A so lower 32 bits contain multiplicand (upper 32 bits are 0)
load B so it contains multiplier (32 bits)
for counter = 0 to 31 {
if (LSB of B is 1)
P := P + A
endif
shift A left by one bit
shift B right by one bit
}
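The same algorithm is easy to mimic in Python (a sketch of my own, mirroring the registers above; names are hypothetical):

```python
def multiply_unsigned(multiplicand, multiplier, n=32):
    """Shift-and-add multiplication of two n bit unsigned integers."""
    P = 0                          # 2n bit product register
    A = multiplicand               # 2n bit shift-left register
    B = multiplier                 # n bit shift-right register
    for _ in range(n):
        if B & 1:                  # LSB of B decides whether to add
            P = (P + A) & ((1 << 2 * n) - 1)
        A <<= 1                    # shift multiplicand left
        B >>= 1                    # shift multiplier right
    return P

# the worked example above: 1001101two x 0010111two = 77 x 23 = 1771
```

The mask on P plays the role of the product register's fixed 2n bit width; for n bit operands it never actually discards bits, since the product fits in 2n bits.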

(Figure: the multiplier circuit. A 64 bit shift-left register holds the multiplicand A and a 32 bit shift-right register holds the multiplier B. The LSB of B determines whether the adder output P + A is written into the 64 bit product register P, which can also be cleared; each register has a shift/load control, and a counter tracks the 32 iterations.)

We must provide a control signal to the shift registers telling them whether to shift or not at
each clock cycle. On the first clock cycle, we would load the multiplier and multiplicand registers,
rather than shifting the values in these registers. (Where would we load them from? We would load
them from some other registers that are not drawn in the circuit.) On subsequent clock cycles, we
would perform the shift.
Similarly, we must provide a signal to the product register telling it whether to clear the product register (only on the first clock cycle), or whether to write in the value coming out of the ALU (only on subsequent clock cycles, and only when the LSB of the B register is 1).
Finally, notice that all the instructions within the “for loop” can be performed in parallel within
a clock cycle.


Fast integer multiplication (log n argument is not on exam)


In the lecture, I briefly sketched out an alternative scheme which performs n-bit multiplication more quickly. The idea will be familiar to most of you, namely to those who have taken COMP 250. The idea is to take the n rows defined by the grade school multiplication algorithm (see page 1) and use n/2 fast adders to add rows 1 and 2, 3 and 4, 5 and 6, etc. Then the outputs from these adders could be paired and run through n/4 adders, which would give the sum of rows 1 to 4, 5 to 8, etc. The bits would all need to be aligned properly as in the table on page 1, but this doesn't need shift registers. It just requires that certain output lines be fed into the correct input lines, and 0's be fed in where appropriate.
In principle the multiplication can be done in one clock cycle, using only combinational circuits. However, the clock cycle would need to be long to ensure that all signals (carries, etc.) have propagated through this circuit. Since the same clock is used for all gates in the processor (including the ALU), this means we would need a slow clock, which is not what we want.
To avoid the slow clock requirement, it is common to break the multiplication into stages and have a multiplication run over multiple clock cycles. This can be done by using registers to hold intermediate results. In particular, the n/2 sums could be written to n/2 registers at the end of the multiplication's first clock cycle. The next n/4 sums could be written to n/4 registers at the end of the second clock cycle. Those of you who have already done COMP 250 know that log n clock cycles would be needed in total to compute the multiplication. This is still faster than the grade school algorithm, which took n steps.

Integer division (sketch only – not on exams)


When we perform integer division on two positive integers, e.g. 78/21, or more generally “divi-
dend/divisor”, the result is two positive integers, namely a quotient and a remainder.

78 = 3 ∗ 21 + 15

dividend = quotient ∗ divisor + remainder


Suppose we have two 32 bit unsigned integers: a dividend and a divisor. Suppose they are both positive numbers. How do we compute the quotient and the remainder? We can apply the “long division” algorithm we learned in grade school. As with multiplication, we need a set of shift registers. See slides for a rough sketch of the corresponding circuit. The algorithm is not obvious, but here it is for completeness (and fun?).

“Algorithm” for long division:


divisor := 2^n * divisor // Make the divisor bigger than the dividend
quotient := 0
remainder := dividend
for i = 0 to n
shift quotient left by one bit
if (divisor ≤ remainder)
set LSB of quotient to 1
remainder := remainder - divisor
shift divisor to right
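Here is the same algorithm in Python (my own sketch, not from the notes; the function name is hypothetical), which you can use to check the 78/21 example:

```python
def divide_unsigned(dividend, divisor, n=32):
    """Restoring long division of two n bit unsigned integers."""
    divisor <<= n                 # make the divisor bigger than the dividend
    quotient, remainder = 0, dividend
    for _ in range(n + 1):        # i = 0 to n
        quotient <<= 1            # shift quotient left by one bit
        if divisor <= remainder:
            quotient |= 1         # set LSB of quotient
            remainder -= divisor
        divisor >>= 1             # shift divisor right
    return quotient, remainder

# divide_unsigned(78, 21) recovers the example: quotient 3, remainder 15
```

Each iteration asks whether the (shifted) divisor still fits into what remains of the dividend, exactly as in long division by hand.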


A few notes: First, the first line of this algorithm is just the usual first step in long division of placing the divisor to the left of the dividend. When you carry out long division by hand in base 10, you don't think of this as multiplying by 10^n where n is the highest power of the dividend. But this is essentially what you are doing. Second, the comparison “divisor ≤ remainder” can be done by computing the difference (remainder − divisor) and checking whether the result is negative. In a two's complement representation, this amounts to checking the MSB (most significant bit) of the difference: the comparison holds exactly when the MSB is 0.

Floating point addition


Let’s now consider floating point operations: +,-,*,/. Recall the discussion from lecture 2 where we
wanted to add two single precision floating point numbers, x and y.

x = 1.101010000000000000000101 ∗ 2^−3
y = 1.001001000010101010101010 ∗ 2^2

where x has a smaller exponent than y. We cannot just add the significands bit by bit, because
then we would be adding terms with different powers of 2. Instead, we need to rewrite one of the
two numbers such that the exponents agree. e.g. We rewrite x so it has the same exponent as y:

x = 1.101010000000000000000101 ∗ 2^−3
  = 1.101010000000000000000101 ∗ 2^−5 ∗ 2^2
  = 0.00001101010000000000000000101 ∗ 2^2

Now that x and y have the same exponent, we can add them.

  1.001001000010101010101010 ∗ 2^2
+ 0.00001101010000000000000000101 ∗ 2^2
= 1.00110001011010101010101000101 ∗ 2^2

The sum is a normalized number. But the significand is more than 23 bits. If we want to represent
it as single precision float, then we must truncate (or round off, or round up) the extra bits.
It can also happen that the sum of two floats is not normalized. If, instead of y, we had

z = 1.11111000000000000000000 ∗ 2^2

then check for yourself that x + z would not be normalized. So, to represent the result as a single precision float, we would have to normalize again by shifting the significand (right) and truncating, and changing the exponent (by adding the size of the shift).
You should appreciate what is needed to do floating point addition: compare exponents, shift right (whichever of x or y has the smaller exponent), use a big adder, normalize, deal with the case where the result is not normalized, and round off. Details of the circuit are omitted, but you should appreciate how many steps may be involved.
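The bookkeeping can be sketched in Python (my own illustration, not the actual hardware; a number is a pair (s, e) meaning s ∗ 2^e, with s an integer significand). One simplification: real hardware shifts the operand with the smaller exponent to the right, discarding its low bits, whereas this sketch widens the significands instead, so only the final renormalization truncates:

```python
def fp_add(a, b):
    """Add two values of the form (s, e) meaning s * 2**e, s an integer.
    The result is renormalized to a 24 bit significand (truncating)."""
    (s1, e1), (s2, e2) = a, b
    e = min(e1, e2)                    # align to the common (smaller) exponent
    s = (s1 << (e1 - e)) + (s2 << (e2 - e))
    while s >= 1 << 24:                # renormalize: shift right, truncating
        s >>= 1                        # low bits, and bump the exponent
        e += 1
    return s, e

# e.g. 1101two * 2**-3 plus 1001two * 2**2 aligns both to exponent -3
```

The alignment step is exactly the rewriting of x shown above; the while loop is the "result not normalized" case.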


Floating point multiplication and division


Multiplying two floating point numbers is easy. We reduce the problem to integer multiplication,
plus some bookkeeping with the exponents. To see the idea, suppose we work with 3 bits of
significand instead of 23. Here is an example.

x1 = 1.101 ∗ 2^−7 = 1101 ∗ 2^−10
x2 = 1.001 ∗ 2^4 = 1001 ∗ 2^1
Since 1101two ∗ 1001two = 13 ∗ 9 = 117 = 1110101two and 2^−10 ∗ 2^1 = 2^−9, we can rewrite the product as a normalized number as follows:
x1 ∗ x2 = 1110101 ∗ 2^−9 = 1.110101 ∗ 2^−3.

Now go back to the previous example in which x and y each have 23 bits of significand. We can
write each of them as an integer times a power of 2:

x = 1.101010000000000000000101 ∗ 2^−3 = 1101010000000000000000101 ∗ 2^−26
y = 1.001001000010101010101010 ∗ 2^2 = 1001001000010101010101010 ∗ 2^−21

We then multiply two 24 bit integers (24, not 23, since there is the leading 1 that needs to be
included), and add the exponents. Finally, we normalize.
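A matching Python sketch (again my own illustration, with the value of a pair (s, e) being s ∗ 2^e):

```python
def fp_mul(a, b, sig_bits=24):
    """Multiply (s1, e1) * (s2, e2), where the value is s * 2**e."""
    (s1, e1), (s2, e2) = a, b
    s, e = s1 * s2, e1 + e2            # integer multiply; add exponents
    while s >= 1 << sig_bits:          # normalize, truncating low bits
        s >>= 1
        e += 1
    return s, e

# the 3 bit significand example above, keeping 7 product bits:
# x1 = 1101two * 2**-10 and x2 = 1001two * 2**1
```

This matches the worked example: 1101two ∗ 1001two = 1110101two and the exponents add to −9.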

What circuitry do we need to perform a multiplication? We need circuitry for carrying out (unsigned) integer multiplications. We also need circuitry for adding exponents and for adding constants (such as 23) to exponents.
So you should be able to appreciate that floating point multiplication is more complicated than integer multiplication, but not much more complicated. Note also that, for floating point division, the long division algorithm doesn't stop when the remainder is less than the divisor. Instead, it keeps going to get the negative powers of the base (2 or 10).
What about floating point division? If we want to compute x/y, we can perform the same trick as above and reduce the problem to the division of two integers, times an exponent. So division of two floating point numbers requires slightly more circuitry than integer division (and about the same amount of time).

Finite State Machines


Let's get a bit more general here. The past few lectures we have been talking about combinational and sequential circuits. For sequential circuits, you should now have the idea that there may be multiple registers and each register has a value at each clock cycle. The registers change their values from one clock cycle to the next. There is an initial state (time = 0), which has certain values on the various wires of the circuit. And then the clock starts ticking and the values change in a deterministic (not random) way.


Such a set of circuits is an example of a finite state machine. For the circuit, each possible setting of the values in all the registers defines a state. There may be other values (called “outputs”) defined as well which are not necessarily represented by registers. For example, two register outputs might feed into a combinational circuit which has outputs. There may also be values (“inputs”) that are sitting on wires but perhaps not written into the registers that they connect to (because a WriteEnable signal might be false). So at each time, we have inputs, outputs, and a state.
In COMP 330, you will study finite state machines without talking about circuits. You’ll just
talk about states, inputs, and outputs, and you’ll talk about how the machine goes from state to
state, given an input, and how it produces outputs at each time that depend on the state at that
time (and perhaps on the input at that time).
In the lecture, I gave an example of a turnstile, like what you have in the Metro. There are
only two things you can do at a turnstile, assuming that your OPUS pass has tickets in it. You can
either put your pass down, or you can push against the turnstile. Verify for yourself that the table
below captures the transitions from one state to the next.

input        currentState    output         nextState

0 = coin     0 = locked      0 = ~turn      0 = locked
1 = push     1 = unlocked    1 = turn       1 = unlocked
--------     ------------    ---------      ------------
   0              0              0               1
   0              1              0               1
   1              0              0               0
   1              1              1               0
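As a software sketch (my own, not from the notes; all names are hypothetical), the standard turnstile transitions look like this: a coin leaves the turnstile unlocked, and a push turns the arm only if it was unlocked, locking it again either way.

```python
LOCKED, UNLOCKED = 0, 1
COIN, PUSH = 0, 1

def turnstile(inp, state):
    """Returns (output, next_state); output 1 means the arm turns."""
    if inp == COIN:
        return 0, UNLOCKED           # a coin/pass always unlocks
    if state == UNLOCKED:            # push while unlocked: turn, relock
        return 1, LOCKED
    return 0, LOCKED                 # push while locked: nothing happens
```

At each time step the machine consumes one input, produces one output, and moves to its next state, which is exactly the abstraction described above.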

There is a lot to say about finite state machines, but I will leave it for COMP 330. The
abstraction of inputs, outputs, and state transitions doesn’t help us much in this course and so I’m
going to stop writing now. But hopefully you get a whiff of the idea, and are asking yourself where
this direction leads. Take COMP 330 and you’ll find out.



COMP 273 8 - MIPS instructions 1 Feb. 3, 2016

High level language programs versus Machine code


You all have experience programming in a high level language. You know that when you write
a program in a language such as C or Java, the program is nothing more than a string of ASCII
characters that you type on a keyboard and that are stored in a file. If you are unfamiliar with
ASCII, then you should check out [Link]. Note that ASCII is a 7 bit code. The 8th
bit of each byte is ignored, although there is something called extended ASCII which uses all 8 bits.
[ASIDE: Another character code which is much more elaborate is UNICODE. I like to think of it as a 16 bit code [Link], but in fact it is more complicated than this ([Link]).]
You also know that in order to run a program that is stored in a file, you need to translate this
program into executable instructions (machine code, made of 0’s and 1’s) that specify what your
particular computer is supposed to do. Your particular computer has a processor (made by AMD,
Intel, etc) and your program must be translated into instructions for this particular machine.
When you write your program in a high level language like Java, you typically don’t want to
consider which machine you are going to run this program on. This flexibility allows your program
to be run on many different machines. Once you decide what machine you will use, though, you need
to translate the program into a language for that machine. This translation is done by a compiler.
A compiler is a program that takes a high level language program (say, in C) and translates it
into machine code.
In Java, the situation is a bit more subtle. Compiling a .java file gives you a .class file which
is machine code. However, this machine code isn’t understandable by your computer’s processor.
Rather it is code that runs on a virtual (abstract) computer called a Java Virtual Machine. This
JVM code must again be translated – or more precisely, interpreted [1] – into machine code for your
particular processor.

Assembly language
When computers were first invented in the 1940s, programmers had to program in machine code –
they had to set 0’s and 1’s by flipping physical switches. For most people, it is not much fun (and
certainly not efficient) to program with 0’s and 1’s, and so a slightly higher level of code, called
assembly language was invented. Assembly language is just human readable machine code. The
MIPS language we will use in the next few weeks is an example.
An assembly language program is an ASCII file, and is not executable. It needs to be translated
into its machine code equivalent. This translation is relatively easy (compared with translating
from a high level language like C to machine code). The translation from assembly language into
machine code is done by a program called an assembler. We are not going to learn in this course
[1] I neglected to mention this in the lecture but I'll mention it here for your interest. The program which interprets
the JVM code is called an interpreter. The interpreter simulates the JVM. Note that translating and interpreting
are not the same thing. Translation happens before the program runs. Interpreting happens while the program
is running. This is the same notion of translation versus interpretation that is used in natural languages: people
who work as translators take written documents in one language and create written documents in another language.
People who work as interpreters take spoken language in real time and convert it into another language. The two
jobs are very different. When politicians travel to foreign countries, they need an interpreter, not a translator.

last updated: 22nd Feb, 2016. Lecture notes © Michael Langer



how exactly this is done. [2] Rather, we will just learn an assembly language and get experience
programming in it, and understand what this corresponds to at the level of the actual processor.

MIPS
In the next few weeks, we will look at a particular machine or computer processor: the MIPS R2000
(from here on, just known as MIPS). We will get to know the MIPS "instruction set" (the set of
MIPS instructions) and you will get some experience programming in the
MIPS assembly language.
Before we can introduce MIPS instructions, we need to know that these instructions can refer
to two types of memory. The first is the set of 32 MIPS registers that I described in lecture 6
(page 5). The second is a much larger memory that holds all the data and instructions of a user’s
programs, as well as the instructions and data of the kernel or operating system. If you wish to
write programs for an operating system for the given computer, then you need to make use of the
special instructions and data and memory locations of the kernel. We won’t do that in this course.
We’ll only write user programs.
If you are new to programming, then it may surprise you to learn that both instructions and
data sit in memory. This may be counterintuitive since you might think that instructions and data
are very different things. As we will see, though, both instructions and data ultimately need to
be encoded as 0’s and 1’s, and so there is no reason why they cannot both sit in memory. Your
intuition is correct, though, in that they are kept in separate places in memory, respecting the fact
that they are different types of things.
In MIPS, we write Memory (with a capital M) to refer to a set of 2^32 bytes (or 2^30 words, i.e.
a word is 4 bytes) that can be addressed by a MIPS program. These 2^32 bytes do not include the
32 MIPS registers. A MIPS Memory address should not be confused with an address on a physical
memory chip or a location on a disk. As we will see later in the course, MIPS Memory addresses
need to be translated into physical addresses. [3]
You should have only a simple notion of how a MIPS computer works. This is illustrated in the
figure below. You should understand that instructions and data are stored in Memory. You should
also understand that operations such as addition and subtraction are implemented by selecting
two numbers that are stored in registers, running them through the ALU, and then writing the
result back into a new register. The figure also shows another register which is called the program
counter. It holds the address of the instruction that is currently being executed. This instruction
is in Memory and so the address of the instruction is 32 bits.
In this lecture, we will see some examples of instructions for carrying out such arithmetic op-
erations. We will also see instructions for moving words to and from Memory and the registers.
Finally, we will see how a program goes from one instruction to the next, and how branches to other
instructions can be achieved.
[2] If you want to learn about this topic, then take the Compiler Design course COMP 520.
[3] An analogy here is postal codes versus (latitude, longitude). Postal codes are used by the postal system to sort
and direct mail. (Latitude, longitude) coordinates specify where the mail goes in a physical sense. If this is confusing
to you, then ask yourself: do you believe that postal code A0A0A0 is at the tip of Newfoundland and Z9Z9Z9
is at the western tip of Vancouver Island?


[Figure: MIPS Memory of 2^32 bytes = 2^30 words, holding kernel instructions and data, user
data, and user instructions; the 32 registers and the ALU; and the 32 bit (1 word) program counter.]

Arithmetic instructions
Consider the C instruction:
c = a + b;
In the MIPS assembly language, this instruction might be written:

add $16, $17, $18 # register 16 is assigned the sum of registers 17 and 18

The $ symbol marks a register, e.g. $17 is register 17. The # symbol marks a comment. When this
symbol appears in a MIPS program, any characters to the right of this symbol are ignored.

Memory transfer instructions


In MIPS, we cannot perform arithmetic or logical operations on numbers that are stored in Memory.
The numbers must first be brought from Memory into registers and the operation performed from
there. To load a word (4 bytes) from Memory to a particular register, we use the instruction lw, or
“load word”. The lw instruction specifies the register that we are going to write to. It also specifies
the address in Memory of the word we want. The Memory address is the sum of a base address
plus an offset. The base address is a 32 bit number which is currently stored in a register, and this
register must be specified. The offset also must be specified. Here is an example:

lw $16, 40($17) # Bring a word from memory into register 16.

In this example, the address of the word in Memory is the value in register 17, plus 40. Allowing
for an offset (e.g. 40) gives the programmer more flexibility. If it happens that the address of the
word we want is already in $17, then the offset would just be 0.

What about the opposite operation, namely taking a word that is in a register and putting it
into Memory? For this, MIPS has sw, which stands for “store word”.

sw $16, 40($17) # Put the word in $16 into Memory


Again, the address of the word in Memory is the value in register 17, plus 40.
In the lecture, I described a few simple examples that combine arithmetic instructions like add
and Memory transfer instructions. Say you are trying to perform x = y + z, and x and y values
happen to be in registers but the value of z is not in a register, then you need to load z from
Memory into a register before you can compute the sum. Similarly, if y and z are in registers but
x is in Memory, then you need to compute the sum y + z and store it in some other register, and
then copy (store) the value in that register to x, which is in Memory.
In the slides, I also sketched out a data path for the above lw and sw instructions. We will go
into more detail about data paths in the weeks ahead.

Branching instructions
What is the order of instructions executed in a MIPS program? As I mentioned earlier, the program
counter (PC) register holds the address in Memory of the currently executed instruction. The
default is that this counter is incremented after each instruction.
Sometimes, however, it is useful to branch to other instructions. You are familiar with this idea
e.g. if-then-else statements, for loops or while loops, goto statements. Branching instructions
allow the program to execute an instruction other than the one that immediately follows it. For
example, consider the following snippet of C code.

if (a != b)
f = g + h;

Here you want to execute the sum if a is not equal to b, and so you don’t want to execute the sum
if a is equal to b. In MIPS assembly language, such an instruction could be implemented using beq,
which stands for “branch if equal”, for example:

beq $17, $18, Exit1 # if a is equal to b, goto Exit1


add $19, $20, $21 # f = g + h
Exit1: # Next instruction (following if statement).

The beq instruction is an example of a conditional branch. A condition needs to be satisfied in


order for the branch to be taken, that is, for the program to execute an instruction other than the
one that directly follows. Let’s look at a slightly more complicated C instruction.

if (a != b)
f = g + h;
else g = f + h;

This one is more tricky. If the condition (a != b) is met, then f is assigned a new value and the
program must jump over the else statement. For this, we use a jump instruction j.

beq $17, $18, Else # if a is equal to b, goto Else


add $19, $20, $21 # f = g + h
j Exit # goto Exit
Else: add $20, $19, $21 # g = f + h
Exit:


MIPS instruction formats: machine code


Let’s now look at how these instructions are coded as 0’s and 1’s. Each MIPS instruction is one
word (1 word = 4 bytes = 32 bits). The upper six bits (bits 26 to 31) of the word always contain an
"op code" that specifies what operation the instruction performs. There are 2^6 = 64 op codes.
There are more than 64 operations, though, as we'll see in a moment.
The remaining bits of each instruction are organized into three formats, called R, I, or J.

R-format instruction (R stands for register)


An example of this instruction is add. As we saw above, such instructions use three registers. To
specify these registers, we need 5 bits each. This leaves 32-15 = 17 bits, 6 of which we saw are for
the op code. In general, the R format instruction is as follows:

field      op   rs   rt   rd   shamt   funct
numbits     6    5    5    5       5       6

Here is what the different fields mean:

• "op" stands for "opcode". R-format instructions always have opcode 000000.

• ”rs”,”rt”, “rd” are registers.


The rs field refers to a “source” register. This is one of the two registers we read from
(RegRead1). The rd field refers to a “destination” register. This is always the register that
is written to (RegWrite). The rt register can serve either as a source register or a destination
register. [4] In the case of an R format instruction, it is a source register (RegRead2).
Notice that the ordering of these three register numbers within the 32 bit machine code of
the add instruction is not the same as the ordering defined by instruction syntax which you
use when programming. (The reason why should be clear in a few weeks.)

• “shamt” stands for “shift amount” (It is not used in add. We will discuss it later.)

• “funct” stands for function. It is used to distinguish different R format instructions. For
example, the subtraction operation sub has the same opcode as add, namely 000000, but it
has a different funct code.
Note that there are 64 possible R-format instructions. Why? R-format instructions always have
opcode 000000, and there are 2^6 = 64 possible funct fields. The instructions like add, sub, and the
bitwise operations and, or, nor, etc. are all R format and have different funct codes.

[4] It is unclear why the letter "t" is used.


I-format instruction: (I stands for “immediate”)


I format instructions use 6 bits for the opcode, 5 bits for each of two registers (rs and rt); and 16
bits for an immediate value. For lw, sw, beq, the immediate value is a signed offset.

field      op   rs   rt   immediate
numbits     6    5    5          16

For the lw instruction, a word is transferred from Memory to a register. The address of the
word in Memory is determined by the source register rs (which contains a 32 bit base address) plus
the 16 bit offset. The offset is a signed number (from −2^15 to 2^15 − 1). The rt register acts as
the destination register.
For the sw instruction, a word is transferred from a register to Memory. The address of the word
in Memory is determined by the source register rs (which contains a 32 bit base address) plus the 16
bit signed offset. The register rt contains the 32 bit value to be stored at this address.
For the beq instruction, the two registers rs and rt hold 32 bit values that are compared.
Depending on the result of the comparison (true or false), the next instruction either is the next
instruction in sequence (PC + 4 bytes) or some other instruction (PC + offset). The relative address
of this other instruction is determined by the 16 bit immediate field i.e. by the offset. As we will
see later in the course when we examine data paths, this offset is in words, not bytes. The reason
is that instructions are always one word long, so the offset is always a multiple of 4 bytes, since the
instruction we are branching to is always a multiple of 4 bytes away.

J format (“jump”)
An example of the J format instruction is the jump instruction j that we saw earlier. This instruction
uses 26 bits of address, rather than 16 bits as we saw with the I format instructions. With 26 bits,
we can jump farther than we could with 16 bits. Later when we talk about data paths, I will specify
exactly how this is done.

field op address
numbits 6 26



COMP 273 9 - MIPS instructions 2 Feb. 8, 2016

Register names (save, temporary, zero)


From what I have said up to now, you will have the impression that you are free to use any of the 32
registers ($0, . . . , $31) in any instruction. This is not so, however. Certain registers have restricted
use. I will introduce these as needed over the next few lectures.
For the instructions we have seen up to now, one typically uses only registers $16,..$23, that
is, registers 10*** . These registers are named $s0,..$s7, where “s” stands for “save”.
Another set of registers that you will use commonly is $8,..$15, that is, registers 01*** . These
registers are named $t0,..$t7 where the “t” stands for temporary. As we will see, these registers
typically store values that are needed only temporarily.
As an example, consider the C or Java instruction:

f = g + h − i;

This instruction is implemented by first computing (g + h) and then subtracting i from the result.
The value g + h is temporary in the sense that it is not stored in any program variable. Its value is
thrown away after i is subtracted. We might want to use a $t register for it. Here is an example
of the corresponding MIPS instructions:

add $t0, $s1, $s2 # temporary variable $t0 is assigned the value g+h
sub $s0, $t0, $s3 # f is assigned the value (g+h)-i

Another register that gets a special name is $0. This register is called $zero, and it always
contains the value 0. The hardware prevents you from writing into this register!

More I format instructions


Conditional branches
There is a limited number of I format instructions since an I format instruction must be specified
entirely by the opcode and the op code field has 6 bits. We have already seen lw, sw, beq. There
is also a bne (branch if not equal). For example, consider the C code:

if (a == b)
f = g + h;

The corresponding MIPS instructions might be:

bne $17, $18, Exit1 # if a is not equal to b, goto Exit1


add $19, $20, $21 # f = g + h
Exit1: # Next instruction (following if statement).

What about other conditional branches? You might expect there to also be a blt for “branch
if less than”, bgt for “branch if greater than,” ble for “branch if less than or equal to”, etc. This
would make programming easier. However, having so many of these instructions would use up
precious opcodes. The MIPS designers decided not to have so many instructions and to take an
alternative approach.

last updated: 10th Feb, 2016. Lecture notes © Michael Langer



To execute a statement such as “branch on less than” in MIPS, you need two instructions. First,
you use an R format instruction, slt which stands for “set on less than”. The slt instruction is
then combined with beq or bne. Together, these span all possible inequality comparisons ≤, <, ≥, >.
See (new) Exercises 4. Here is an example for the C code. I will use C variable names s1 and s2
to make it easier to see what’s going on.

if (s2 > s1)


x = y + z

We want to branch if the condition fails, that is, if s2 is not greater than s1. So, we set a temporary
variable to the truth value "s2 > s1", or equivalently "s1 < s2". We branch if that truth value
is 0, i.e. if the condition is false. Here is the MIPS code, assuming C variables s1 and s2
correspond to MIPS registers $s1 and $s2 respectively:

slt $t0, $s1, $s2 # set $t0 to 1 if s1 < s2, i.e. if s2 > s1


beq $t0, $zero, Exit # branch if condition fails, that is, $t0 = 0
add $19, $20, $21 # x = y + z
Exit:

Here is another example. First, the C code:

while (s1 <= s2)


s1 = s1 + s5;

Here we want to exit the loop (branch) if the condition fails, namely if s1 > s2, i.e. if s2 < s1.
Here's the MIPS code:

MyLoop:     slt $t0, $s2, $s1          # set $t0 to 1 if s2 < s1, i.e. if s1 > s2
            bne $t0, $zero, MyLoopExit # exit the loop if the condition fails
            add $s1, $s1, $s5          # s1 = s1 + s5
            j MyLoop                   # back to top of loop
MyLoopExit:

“Pseudoinstructions” for conditional branches


As mentioned above, the instructions slt, beq, and bne can be combined to give several conditional
branches that we would commonly wish to use, namely blt, ble, bgt, bge. It would be annoying if
we had to use the slt instruction to write conditional branches in MIPS. Fortunately, we don’t need
to do this. The MIPS assembly language allows you to use instructions blt, ble, bgt, bge. These
are called pseudoinstructions in that they don’t have a corresponding single 32 bit machine code
representation in the MIPS. Rather, when a MIPS assembler (or a simulator like MARS) translates
the assembly language program that you write into machine code, it substitutes two instructions,
namely a slt instruction and either a beq or bne.
Heads up: this substitution uses register $1 to hold the value of the comparison. You should
therefore avoid using that register when you are programming, because the assembler may destroy
the value you have put in it! Only use $t and $s registers to hold variables.


One final comment: you may be wondering why the MIPS designers didn’t just have separate
machine code instructions blt, ble, bgt, bge. The answer is that such instructions would use up
precious (I format) op codes. The MIPS designers wanted to keep the number of instructions as
small as possible (RISC). The benefit of this design is that the RISC architecture allows simpler
hardware, which ultimately runs much faster. The cost is that a given program tends to require
more of these simpler instructions to have the same expressive power.

Using the immediate field as an integer constant


We have seen R format instructions for adding/subtracting/etc the contents of two registers. But
sometimes we would like one of the two operands to be a constant that is given by the programmer.
For example, how would we code the following C instruction in MIPS?

f = h + (−10);

We use an I-format instruction addi,

addi $s0, $s2, − 10

The "i" in addi is the same "i" as in "I-format"; it stands for "immediate". Recall that I-format
instructions have a 16 bit immediate field. The value −10 is put in the immediate field and treated
as a twos complement number.
Another example of an R format instruction that has an I format version is slti which is used
to compare the value in a register to a constant:

slti $t0, $s0, − 3

In this example, $t0 is given the value 0x00000001 if the value in $s0 is less than -3, and $t0 is
given the value 0x00000000 otherwise. Such an instruction could be used for a conditional branch.

if (i < -3)
do something

Signed vs. unsigned instructions


There are various versions of the two instructions "add" and "set on less than":

• add, addi, addu, addiu

• slt, slti, sltu, sltiu

The i stands for immediate and the u stands for unsigned. For example, addu stands for "add
unsigned". This instruction treats the register arguments as if they contain unsigned integers,
whereas add treats them as signed integers. Note that an adder circuit produces
the same result whether the two arguments are signed or unsigned. But as we saw in Exercises 2,
Question 12, the conditions in which an overflow error occurs depend on whether the arguments
are signed or unsigned, that is, whether the operation specifies that the registers represent signed


or unsigned ints. In this sense, add and addu do not always produce the same result. The overflow
conditions can be tested in the hardware to decide if the program should crash or not. We will talk
about such exception handling issues later in the course.
The I format instruction addiu treats the immediate argument as an unsigned number from 0 to 2^16 − 1.
For example,
addiu $s3, $s2, 50000
will treat the 16-bit representation of the number 50000 as a positive number, even though bit 15
(MSB) of 50000 is 1. (See “sign extending” below.)
Another example of when it is important to distinguish signed vs. unsigned instructions is when
we make a comparison of two values. For example, compare slt vs. sltu. The result can be quite
different, e.g.
sltu $s3, $s2, $s1
can give a different result than
slt $s3, $s2, $s1
in the case that one of the variables has an MSB of 1 and the other has an MSB of 0.

Sign extending
The 16 bit value that is put in the immediate field typically serves as an argument for an ALU
operation (arithmetic or logical), where the other argument is a 32 bit number coming from a
register. As such, the 16 bit number must be extended to a 32 bit number before this operation can
be performed, since the ALU is expecting two 32 bit arguments. Extending a 16 bit number to a
32 bit number in this way is called sign extending.
If the 16 bit number is unsigned then it is obvious how to extend it. Just append 16 0’s to the
upper 16 bits. If the 16 bit number is signed, however, then this method doesn’t work. Instead, we
need to copy the most significant bit (bit 15) to the upper 16 bits (bits 16-31).
Why does this work? If the 16 bit number is positive, the most significant bit (MSB) is 0 and
so we copy 0s; it is trivial to see that this works. If the 16 bit number is negative, the MSB is 1 and
so we copy 1s to the higher order bits. Why does this work? Take the example of the number -27,
which is 1111111111100101 in 16 bit binary. As we can see below, sign extending by copying 1's
does the correct thing, in that it gives us a number which, when added to 27, yields 0. Convince
yourself that this sign extending method works in general.

  1111111111111111 1111111111100101   ← -27 (sign extended)
+ 0000000000000000 0000000000011011   ←  27
  0000000000000000 0000000000000000   ←   0


Manipulating the bits in a register


I format
Suppose we want to put a particular 32 bit pattern into a register, say 0x37b1fa93. How would we
do this? We cannot do this with just one instruction, since each MIPS instruction itself is 32 bits.
Instead we use two instructions. The first is a new I format instruction that loads the upper two
bytes:
lui $s0, 0x37b1 #load upper immediate
This instruction puts the immediate field into the upper two bytes of the register and puts 0’s into
the lower 16 bits of the register. We then fill the lower 16 bits using:

ori $s0, $s0, 0xfa93

If these instructions are put in the opposite order, then we get a different result. The reason is that
lui puts 0’s into the lower 16 bits. Previous values there are lost forever.
Another bit manipulation that we often want is to shift bits to the left or right. For example,
we saw in an earlier lecture that multiplying a number by 2^k shifts it left by k bits. (This assumes
that we don't run over the 32 bit limit.) We shift left with sll, which stands for "shift left logical".
There is also an instruction srl which stands for "shift right logical".

sll $s0, $s1, 7 # shifts left by 7 bits,
                # filling in 7 lowest order bits with 0's.
srl $s0, $s0, 8 # shifts right by 8 bits,
                # filling in 8 highest order bits with 0's.

You might think that this instruction would be I-format. But in fact it is an R-format instruction.
You don’t ever shift more than 31 bits, so you don’t need the 16 bit immediate field to specify the
shift. MIPS designers realized that they didn’t need to waste two of the 64 opcodes on these
instructions. Instead, they specify the instruction with some R format op code plus a particular
setting of the funct field. The shift amount is specified by the “shamt” field, and only two of the
register fields are used.



COMP 273 10 - MIPS instructions 3 Feb. 10, 2016

In the past two lectures, we discussed MIPS operations on integers. Today we will consider
a few data structures that you are familiar with, namely arrays and strings, and discuss how to
implement them in MIPS.

Arrays
Suppose we wish to declare and use an array of N integers. How would we represent this array in
MIPS? If we were programming in C, we might let the array be a[ ]. Let the address of the first
element of the array be kept in one of the registers, say $s0. This address is called the base address
of the array. To access an element of array, we need to specify the address of that element relative
to the base address, i.e. the offset.
For example, consider the C instructions:

a[12] = a[10];

In MIPS, we cannot transfer data directly from one location in Memory to another. Instead, we
need to pass the data through a register. Here might be the corresponding MIPS instructions:

lw $t1, 40($s0) # $t1 = a[10]
sw $t1, 48($s0) # a[12] = $t1
The offsets from the base address are 40 and 48 for a[10] and a[12] respectively. Why? We need
four bytes for each word, so an offset of 10 words is an offset of 40 bytes and an offset of 12 words
is an offset of 48 bytes.
Another example is the C instruction,

m = a[i];

Suppose the integer variable m is stored in register $s1 and integer index i is stored in register
$s2. We multiply the contents of $s2 by 4 to get the offset, and then we add the offset to the base
address, which we'll assume is in $s0 as in the above example. Note that we compute the offset by
multiplying i by 4 using the sll instruction, which is simpler than having to perform a general
multiplication. (We will see how to perform multiplication in an upcoming lecture.)

sll $t0, $s2, 2   # $t0 = 4*i, the offset in bytes
add $t0, $s0, $t0 # $t0 = base address + offset
lw  $s1, 0($t0)   # m = a[i]

Strings
In the C programming language (COMP 206), a variable of type "char" uses one byte
([Link]). Thus, each character in a string is stored in one byte.
Previously, we saw the instructions lw and sw which load a word from Memory or store a word
to Memory, respectively. There are similar instructions for single bytes: lb loads a byte, and sb
stores a byte. These instructions are I-format. Again, we specify the address of the byte with a
base address plus an offset. The offset is a signed number.

last updated: 17th Feb, 2016. Lecture notes © Michael Langer



lb $s2, 3( $s1 )
sb $s2, -2( $s1 )
lb takes one byte from memory and puts it into a register. The upper 24 bits of the register are sign
extended i.e. filled with whatever value is in the most significant bit (MSB) of that byte. There is
also an unsigned version of load byte, namely lbu, which fills the upper 24 bits with 0’s (regardless
of the MSB).
sb copies the lower 8 bits of a register into some byte in memory. This instruction ignores the
upper 24 bits of the word in the register.

Example: How to calculate the length of a character string?


Consider the following C instructions that compute the length of a string. In C, a string is a
sequence of bytes, terminated by the ASCII NULL character which is denoted ’\0’ and which has
byte value 0x00.
char *str;  // Declare a pointer to a string.
            // str is an address (a 32 bit number).
int ct = 0;
str = "I love COMP 273";

while ( *(str + ct) != '\0' ){
    ct++;
}
The variable str is a pointer (or ”reference”) and so its value is an address, which is just an unsigned
integer (32 bits). The fancy instruction above then adds an offset ct to this address. The * operator
dereferences that address, which means that it returns the content that is stored at that address. If
you are taking COMP 206 now, then you will learn about dereferencing soon. (Joseph told me so.)
Here is MIPS code that does roughly the same thing as the while loop part of the above code.
(Let's not deal with the initialization part of the code that comes before the while loop. For now,
just accept that the first instruction la "loads the address" of the string str into register $s0. Also,
assume ct is in register $s1.)

      la   $s0, str            # pseudoinstruction (load address)
      add  $s1, $zero, $zero   # initialize ct, $s1 = 0
loop: add  $t0, $s0, $s1       # address of byte to examine next
      lb   $t1, 0( $t0 )       # load that byte to get *(str + ct)
      beq  $t1, $zero, exit    # branch if *(str + ct) == '\0'
      addi $s1, $s1, 1         # increment ct
      j    loop
exit:
Register $t1 holds the ct-th char (a byte), and this byte is stored in the lower 8 bits of the register.
Note also that I have used $t registers for temporary values that do not get assigned to variables
in the C program, and I have used $s registers for variables. I didn’t need to do this, but I find it
cleaner to do it.


Assembler directives
We have seen many MIPS instructions. We now would like to put them together and write programs.
To define a program, we need certain special instructions that specify where the program begins, etc.
We also need instructions for declaring variables and initializing these variables. Such instructions
are called assembler directives, since they “direct” the assembler on what (data) to put where
(address).
Let’s look at an example. Recall the MIPS code that computes the length of a string. We said
that the string was stored somewhere in Memory and that the address of the string was contained
in some register. But we didn't say how the string got there. Below is a program that
defines a string in MIPS Memory and counts the number of characters in the string.
The program begins with several assembler directives (.data, .asciiz, . . . ) followed by a main
program for computing string length. The .data directive says that the program is about to declare
data that should be put into the data segment of Memory. Next, the label str just
defines an address that you can use by name when you are programming. It allows you to refer to a
data item (in this case, a string). We'll see how this is done later. In particular, the line has an
.asciiz directive which declares a string, "COMP 273". When the program below uses the label
str, the assembler translates this label into a number. This number can be either an offset value
which is stored in the immediate field (16 bits) of an I-format instruction or a 26-bit address in a
J-format instruction.
The .text directive says that the instructions that follow should be put in the instruction
(“text”) segment of memory. The .align 2 directive puts the instruction at an address whose
lower 2 bits are 00. The .globl main directive declares the instructions that follow are your
program. The first line of your program is labelled main.

.data # Tells assembler to put the following


# ABOVE the static data part of Memory.
# (see remark at bottom of this page)
str: .asciiz "COMP 273" # Terminates string with NULL character.
.text # Tells the assembler to put following
# in the instruction segment of Memory.
.align 2 # Word align (2^2)
.globl main # Assembler expects program to have
# a "main" label, where program starts

main: la $s0, str # pseudoinstruction (load address)


# (INSERT HERE THE MIPS CODE ON PREVIOUS PAGE)

The address of the string is loaded into register using the pseudoinstruction la which stands for
load address. MARS translates this instruction into

lui $s0, 4097

Note that (4097)10 = 0x1001, and so the address is 0x10010000, which is the beginning address of
the user data segment. (Actually there is a small "static" part of the data segment below this, which
starts at address 0x10000000. You can see this in MARS.)


Here are some more assembler directives that you may need to use:

• .word assigns 32-bit values

• .byte initializes single bytes

• .space reserves a number of bytes

For example,

y0: .word -14


b: .byte 24, 62, 1
arr: .space 60
y1: .word 17

declares labels y0, b, arr, y1 which stand for addresses that you can use in your program. You
would not use these labels in branch instructions, since these addresses are in the data segment, not
in the text (instruction) segment. Rather, you use them as variables. The variable y0 is initialized
to have value -14. The variable b is an array of 3 bytes. The variable arr is a pointer to the start of
60 consecutive bytes of Memory. This region can be used in a flexible way. For example, it could
be used to hold an array of 15 integers.
Suppose we wanted to swap the values that are referenced by y0 and y1. How would you do
this? If you are programming in C or in Java, you use a temporary variable.

tmp = y0;
y0 = y1;
y1 = tmp;

Suppose y0 and y1 are in registers $s0 and $s1 respectively. You could use the pseudoinstruction
move, or

move $t0, $s0 # same as "add $t0, $s0, $zero"


move $s0, $s1
move $s1, $t0

However, if the variables are in Memory rather than in registers, then you would first need to load
them from Memory into registers, do the swap, and then store them back to Memory. Alternatively,
you could just do it like this.

la $t0, y0 # load addresses of variables to swap


la $t1, y1 # assuming they are labelled y0 and y1

lw $s0, 0($t0) # load contents into registers


lw $s1, 0($t1)
sw $s1, 0($t0) # store in opposite position
sw $s0, 0($t1)


System Calls
We have seen how data can be defined using assembler directives. It is often more useful to use
input/output (I/O) to read data into a program at runtime, and also to write e.g. to a console that
a user can read, or to a file. Here I will describe how to use the console for reading and writing.
There is a MIPS instruction syscall which tells the MIPS kernel/OS that you want to do I/O
or some other operation that requires the operating system, such as exiting the program. You need
to specify a parameter to distinguish between several different system calls. You use the register
$2 for this, also known as $v0. To print to the console, use values 1, 2, 3, 4, which are for the cases
of int, float, double, and string, respectively. To read from the console, use values 5, 6, 7, 8 for int,
float, double, or string, respectively. When you are reading or printing, you need to specify which
register(s) hold the values/addresses you are reading from/to. There is no need to memorize this
stuff. Consult the documentation on MARS.
Here is an example of how to print a string to the console.

la $a0, str # la is a pseudoinstruction ("load address")


li $v0, 4 # code for printing a string is 4
syscall

[MODIFIED Feb 17. The notes I originally posted contained a mistake.]


Here is an example of how to read a string from console. You need to specify that you want to read
a string. You need to specify where the string will be put in Memory. You need to specify a buffer
size for the kernel to use to read the string. (See Assignment 2 Part 4.) Suppose the maximum
buffer size for reading a string is specified by an integer value which is stored in the data segment at
label sizeBuffer as in Assignment 2. Here is the code for reading a string.

li $v0, 8 # code for reading string is 8.


add $a0, $zero, $s1 # suppose $s1 specifies address
# where string will start

la $t0, sizeBuffer # specifies a buffer size for reading string.


lw $a1, 0($t0) # load that buffer size.
syscall

In the slides I then discussed Assignment 2.



COMP 273 11 - functions Feb. 15, 2016

We have seen many of the 32 MIPS registers so far. Here is a complete table of them. In this
lecture, we will meet the last three of them: the stack pointer, the frame pointer (which we won't
use, but it is good to be aware of it), and the return address. We will also discuss the rightmost
column in this table and what is meant by "preserved on call".

Functions (and procedures)


As you have learned in your first programming course, when a sequence of instructions is used many
times, it is better to encapsulate these instructions: give them a name and hide the inner details. In
Java, we call these encapsulated instructions methods. In C, we call them functions and procedures.
A "procedure" is just a "function" that doesn't return a value. In this lecture, we'll look at the
basics of how C-like functions are implemented in MIPS. We won’t look at Java methods because
it would take us off track – I would first need to tell you how classes and their instances (objects)
are represented. If time permits, I may say something about this at the end of the course.
When we write a MIPS function, we label the first instruction with the name that we give the
function. Later we can branch to the function by using this label name. The label just indicates the
address to branch to. Instructions that define a function belong to the instruction (text) segment
of Memory, namely below 0x10000000 – see slides at end of lecture 9. We’ll see an example on p.
3 of this lecture.
To pass arguments to functions and return values from functions, we use the data segment of
Memory, in particular, a region called the stack. You learned about stacks as an abstract data type
in COMP 250. In MIPS, the stack has a specific technical meaning which I will tell you about
today.

last updated: 7th Mar, 2016 1 lecture notes c Michael Langer



Suppose you have a C function:


int myfunction(int j){
int k;
: // other declarations here, and many instructions
return k;
}
Such a function would be called by another function, for example, main. We will refer to the
function that makes the call as the parent and the function that gets called as the child. (The terms
caller and callee are also used, respectively.) In this example, main is the parent, and myfunction is the child.
main(){
int i,j,m;
:
m = myfunction(i);
:
j = myfunction(5*i);
:
}
How might this work in MIPS? main and myfunction are just labels for an instruction, namely
the first instruction in some sequence of instructions which are in the text segment of Memory.
There are a few important questions here: How does MIPS jump from the parent to child and
then later return from child to parent? How do these functions share registers and Memory? First,
let's consider what main needs to do when it calls myfunction:
• it needs to save in Memory any values currently in registers that myfunction might destroy;
• it needs to provide a "return address" so myfunction can later jump back to the location in
main from where it was called;
• it needs to provide arguments, so that myfunction can access them;
• it needs to access the result which will be returned from myfunction;
• it needs to branch to the first instruction of myfunction.
Next consider what myfunction needs to do:
• it needs to access its arguments
• it needs to allocate storage for its local variables
• (optional) if it computes a result, then it needs to put this result in the place that main expects
it to be
• it must not destroy any data (in Memory or in registers) that its parent (e.g. main) was using
at the time of the call and that its parent will need after the call
• it needs to return to the parent.
The same rules should apply to any parent and child function. Let’s deal with these issues one by
one.


Jump to and return from a function


Suppose the first line of the MIPS code for myfunction is labelled myfunction. To branch to
myfunction, you might expect main to use the MIPS instruction:

j myfunction

This is not sufficient, however, since myfunction also needs to know where it is supposed to eventu-
ally return to. Here’s how it is done. When main branches to the instruction labelled myfunction,
it writes down the address where it is supposed to return to. This is done automatically by a
different branching function jal, which stands for “jump and link.”

jal myfunction

jal stores a return address in register $31, which is also called $ra. The return address is written
automatically as part of the jal instruction. That is, $ra is not part of the syntax of the jal instruction.
Rather, jal has the same syntax as the jump instruction j we saw earlier. It is a J format instruction.
So you can jump to 2^26 possible word addresses. Later when we look at the data path, we will see
exactly how this works and it will make more sense!
The return address is the address of the current instruction + 4, where the current instruction
is “jal myfunction”. That is, it is the address of the instruction following the call. (Recall each
instruction is 4 bytes or 1 word).
When myfunction is finished executing, it must branch back to the caller main. The address to
which it jumps is in $ra. It uses the instruction

jr $ra

where jr stands for jump register. This is different from the jump instruction j that we saw
earlier. jr has R-format.
So the basic idea for jumping to and returning from a function is this:

main:         ...
              jal myfunction
              ...               # execution continues here after the return

myfunction:   ...
              ...
              jr $ra

"Passing" arguments to and returning values from functions


The next issue to address is how arguments are passed to a function, and how a value computed by
the function is returned. When writing MIPS code, one uses specific registers to pass arguments


to a function. Of the 32 registers we discussed previously, four are dedicated to holding arguments to
be passed to a function. These are $4, $5, $6, $7, and they are given the names $a0, $a1, $a2,
$a3 respectively. There are also two registers that are used to return the values computed by a
function. These are $2,$3 and they have names $v0,$v1, respectively.
For example, if a function has three arguments, then the arguments should be put in $a0,$a1,$a2.
If it returns just one value, this value should be put in $v0. (It is possible to pass more than four
arguments and to return more than two values, but this requires using registers other than $2-$7.)
[ASIDE: recall from last lecture that syscall uses $v0 to specify the code for the system call e.g.
print an integer, and $a0 and $a1 for various arguments needed for the system call. If a function
uses syscalls, then one must be especially careful to save these register values before the syscall and
then load these register values after the syscall. ]
What are the MIPS instructions in main that would correspond to the C statement

m = myfunction(i);

The argument of myfunction is the variable i and let’s suppose i is in $s0. Suppose the returned
value will be put into $s1, that is, the C variable m will be represented in the MIPS code by $s1.
Here are the instructions in main:

move $a0, $s0 # copy i into argument register


jal myfunction
move $s1, $v0 # copy result to variable m

MIPS register conventions


One situation that we need to avoid is the child function writing over data (in registers or
Memory) of the parent function. The parent will need this data after the program returns to it.
Just as people have conventions that allow us to avoid bumping into each other like walking on the
right side of a hallway or stairwell, MIPS designers invented conventions that MIPS programmers
should follow to avoid erasing data.
Here are the rules for the $s0,...,$s7 registers and $t0,...,$t7 registers:

• The parent assumes that temporary registers $t0,...,$t7 may be written over by the child.
If the parent needs to keep any values that are in $t0,...,$t7 prior to the call, then it should
store these values into Memory prior to the call.
In addition, the parent assumes that the $v0, $v1 registers may be overwritten, so if the parent
wants to keep values in those registers then the parent needs to store them in Memory prior
to the call and (re)load them after the call.

• The parent assumes that the values in the save registers $s0,..,$s7 will still be there after
the return from the call. The parent does not store the values in these registers into Memory
before making the call. Thus, if a child wishes to use any of the $s0,..,$s7 registers, it must
store the previous values of those registers into Memory prior to using these values, and then
load these values back into the register(s) prior to returning to the parent.


In addition, the parent assumes that the $a0,...,$a3 registers will be preserved after the
function call. So if the child wants to use these registers, then it should first store them in
Memory, and later it should load them back prior to returning to the parent.
Let’s look at how this is done.

The Stack
Register values that are stored away by either the parent or child function are put on the stack,
which is a special region of MIPS Memory. The stack grows downwards rather than upwards (which
is strange, since we will sometimes use the phrase "top of the stack"). We refer to the base of
the stack as the address of the top of the stack when the stack is empty. The base of the stack is
at a (fixed) MIPS address, 0x7fffffff, which is the halfway point in MIPS Memory and the last
location in the user part of Memory. (Recall that the kernel begins at 0x80000000.)

0xffffffff
             kernel
0x7fffffff   base of stack
             stack   (grows downward; $sp marks the top of the stack)

             heap    (grows upward)
0x10010000
0x10000000   "static" data (global), $gp = 0x10008000

             MIPS instructions ("text")

             reserved

The address of the “top” of the stack is stored in register $29 which is named $sp, for stack
pointer. This address is a byte address, namely the smallest address of a byte of data on the stack.
To push a word onto the stack, we decrease $sp by 4 and use sw. To pop a word from the stack,
we use lw and increase the $sp by 4. I emphasize that an address always means a byte address,
and so when we use an address in a load/store word instruction, we mean the word that starts at
a particular byte and occupies that byte and the next three bytes after it, which have larger byte
addresses.
ASIDE: In the figure above, note that the stack grows downward toward another
area of user Memory called the "heap" (which is not to be confused with the heap data structure that


you learned about in COMP 250). The heap is used to allocate memory for variables other than
what we are discussing here: for example, last lecture we discussed assembler directives and how
you could define strings and other data. These go on the heap. In the C programming language,
when you malloc and free, you use the heap. In Java, when you construct new objects, you
use the heap. Unlike the stack, the heap is not necessarily a contiguous region of memory, e.g. when
you free space (C) or garbage collect an object that is no longer referenced (Java), you leave a hole
in the heap. The kernel will try to fill that hole later, rather than just moving the heap boundary
upwards toward the stack, since if you just keep moving the boundary up then eventually the heap
and stack will collide and the program will run out of space.

How the caller/parent uses the stack


Suppose that, when the caller (e.g. main) calls a function (e.g. myfunction), there are values
in some of the temporary registers (say $t2, $t5) that the caller will need after myfunction returns.
According to MIPS conventions, the caller should not expect the values in these $t registers to still
be present after the function call. Thus, the caller should store away the values in these temporary
registers before the call, and should load them again after the call. The stack is used to hold these
temporary values:

parent: :
move $a0, $s0 # pass argument
addi $sp, $sp, -8
sw $t2, 4($sp) # store the temporary registers
sw $t5, 0($sp)

# sw $t2, -4($sp) # also correct


# sw $t5, -8($sp)
# addi $sp, $sp, -8

jal child
lw $t2, 4($sp) # (re)load temporary registers
lw $t5, 0($sp)
addi $sp, $sp, 8
:

How the callee/child function uses the stack


Suppose the child (e.g. myfunction) uses registers $s0, $s1, $s2. According to MIPS conventions,
it must store the current contents of these registers onto the stack. Then, when it is finished, it must
load these previous values back in.

child: sw $s0, -4($sp)


sw $s1, -8($sp)
sw $s2, -12($sp)
addi $sp, $sp, -12
: # DO WORK OF FUNCTION HERE


:
lw $s0, 8($sp) # restore the registers from the stack
lw $s1, 4($sp)
lw $s2, 0($sp)
addi $sp, $sp, 12
move $v0, $s2 # write the value to return
jr $ra # return to caller

A child can be a parent


Up to now we have only considered the case that the parent function calls a child, which returns
directly to the parent. It can also happen that the child function is itself a parent: it might call
another function, or it might call itself. For example, consider the call tree shown below. (For
those of you who have taken COMP 250: the sequence of calls in a running program
defines a tree, and the nodes are traversed in "pre-order".) In the example below, first, main calls
function1 which calls function2. Then function2 returns to function1 which then returns to
main. Then, main calls function1 again, which calls function2, but this time function2 calls
itself, recursively, twice. function2 then returns to itself, twice, and then returns to function1.
function1 then calls function2 which calls function3. Then function3 returns to function2
which returns to function1 which returns to main.
main
    function1
        function2
    function1
        function2
            function2
                function2
        function2
            function3

In this example, main, function1, and function2 are all functions that call other functions.
function3, however, does not call any other functions. When a function calls itself, or when
it calls another function, it needs to store any temporary registers ($t), as discussed above. But it
also needs to store on the stack:
• $ra : e.g. When function1 calls function2, the instruction "jal function2" automatically
writes into $ra. Thus, function1 needs to store the value of $ra before it makes this call,
and then reload it after the call. Otherwise it will not be able to return to main.


• any of $a0,$a1,$a2,$a3 that are needed by the caller: e.g. when function1 was called by
main, it was passed certain arguments. These arguments will typically be different from the
arguments that function1 passes to function2.
• $v0,$v1: if function1 has assigned values to these registers prior to calling function2, then
it would need to store these assigned values. (For example, if function1 uses the $v0 register
as part of a syscall).
If a function (say function3) does not call any other functions (including itself, recursively)
then we say that such a function is a leaf function, in that it is a leaf in the call tree. Leaf functions
are relatively simple because there is no need to store $ra or argument registers $a0, .., $a3 or value
registers $v0, $v1 (unless they are used by syscall as part of the function).

Stack frames
Each function uses a contiguous set of bytes on the stack to store register values that are needed by
its parent or that it will need once its child has returned. This set of contiguous bytes in memory
is called a stack frame. Each function has its own stack frame.

0x7fffffff   base of stack

main        ($t0, $t3, $a0, $ra)
function1   ($s0, $s2, $t2, $t3, $a0, $v0, $ra)
function2   ($s0, $s1, $t4, $a0, $a1, $v0, $ra)
function2   ($s0, $s1, $t4, $a0, $a1, $v0, $ra)   <-- $fp (optional) marks the start of this frame

$sp (top of stack)

Stack frames can hold other values as well. If a function declares local variables that are too
big to be kept in registers (for example, many local variables, or an array), then these variables will also be
located in the stack frame.
Sometimes you don’t know in advance how big the stack frame will be, because it may depend
on data that is defined only at runtime. For example, the function might make a system call and
might (based on some condition that may or may not be met) read data from the console, and that
data could be put on the stack. In these cases, you need to move the stack pointer as the stack
frame changes size. In these cases, it may be useful to mark the beginning of the stack frame.
Such a marker is called the frame pointer. There is a special register $fp = $30 for pointing to the
beginning of the stack frame on the top of the stack. I won’t be using it. I am mentioning it for
completeness only.


Finally, note that the first stack frame is from main and that it has a return address. The reason
is that, when the program is over, main returns to the kernel.

Example: sum to n
Here is a simple example of recursion. It is similar to the "factorial" function, except that here we
are computing the sum of the values from 1 to n rather than their product. Of course, you would
not use recursion to compute this sum, since it equals n(n+1)/2. I use this example only
to illustrate how recursion works.
Here is the code in a high level language, followed by MIPS code.

# int sumton( int n) {


# if (n == 0)
# return 0
# else
# return n + sumton( n-1 )
# }

sumton:
beq $a0, $zero, basecase # if n == 0 then branch to base case
addi $sp,$sp,-8 # else , make space for 2 items on the stack
sw $ra, 4($sp) # store return address on stack
sw $a0, 0($sp) # store argument n on stack
# (will need it to calculate returned value)
addi $a0, $a0, -1 # compute argument for next call: n = n-1
jal sumton # jump and link to sumton (recursive)

lw $ra, 4($sp) # load the return address


lw $a0, 0($sp) # load n from the stack
addi $sp, $sp,8 # change the stack pointer

# register $v0 contains result of sumton(n-1)


add $v0, $a0, $v0 # add n to $v0
jr $ra # return to parent

basecase:
addi $v0, $zero, 0 # assign 0 as the value in $v0
jr $ra # return to parent



COMP 273 12 - MIPS co-processors Feb. 17, 2016

In today's lecture we will look at two "co-processors", namely the floating point processor (called
CP1 or the FPU) and the kernel processor (called CP0 or the "system control" processor). We will
look at the MIPS assembly language instructions for these processors.
This is the last lecture about MIPS programming. After this, we will go back to the circuits
and connect the general ideas about circuits to the particular instructions we have seen in MIPS,
mostly CPU instructions but occasionally CP0 too.

Integer multiplication and division in MIPS


[ASIDE: The slides also start out with this mini-topic, but at the beginning of the lecture, I decided
to skip over this topic. I then returned to the topic (slides) a bit later.]
In Assignment 1, you built a simple but slow circuit for multiplying two unsigned integers and,
in lecture 7, I discussed more complicated circuits for how you could perform fast multiplication.
You should appreciate that there are many ways this could be done. Those details can now be
put "under the hood". Let's just look at multiplication from the MIPS programmer's perspective.
In MIPS assembly language, there is a multiplication instruction for signed integers, mult, and
for unsigned integers multu. Since multiplication takes two 32 bit numbers and returns a 64 bit
number, special treatment must be given to the result. The 64 bit product is located in a “product”
register. You access the contents of this register using two separate instructions.

mult $s0, $s1 # Multiply the numbers stored in these registers.


# This yields a 64 bit number, which is stored in two
# 32 bits parts: "hi" and "lo"
mfhi $t0 # loads the upper 32 bits from the product register
mflo $t1 # loads the lower 32 bits from the product register

You can only read from the product register. You cannot manipulate it directly. In MARS, the
product register is shown as two 32 bit registers, HI and LO. If the HI register has all 0’s, then the
product that is computed can be represented by 32 bits (what’s in the LO register). Otherwise, we
have a number that is bigger than the maximum int and it would need to be treated separately.
Details are omitted here.
What about division? To understand division, we need to recall some terminology. If we divide
one positive integer by another, say 78/21, or more generally "dividend/divisor", then we get a
quotient and a remainder, i.e.

dividend = quotient ∗ divisor + remainder

e.g. 78 = 3 ∗ 21 + 15
In MIPS, the divide instruction also uses the HI and LO registers, as follows:

div $s0, $s1 # Hi contains the remainder, Lo contains quotient


mfhi $t0 # remainder moved into $t0
mflo $t1 # quotient moved into $t1

The mult, div, mfhi, mflo are all R format instructions.

last updated: 21st Apr, 2016. Lecture notes © Michael Langer



Floating point in MIPS


As I also mentioned in lecture 7, special circuits and registers are needed for floating point op-
erations. The simple version of MIPS that we are using (called the R2000) was created back in
the mid-1980s. At that time, it was not possible to fit the floating point circuits and registers on
the same physical chip¹ as the chip that contained the CPU (including registers $0-$31, ALU,
integer multiplication and division). Instead, the floating point operations were carried out on a
physically separate chip called the floating point coprocessor or floating point unit (FPU) which in
MIPS is called coprocessor 1. The FPU for MIPS has a special set of 32 registers for floating point
operations, named $f0, $f1, ... $f31.
Recall that double precision floats require two words. In MIPS, double precision numbers require
two registers. These are always consecutive registers, beginning with an even number register ($f0,
$f2, etc). Thus, there is no need to reference both registers in the instruction. For example, a
double precision number referenced by $f2 in fact uses $f2 and $f3.

CPU (central processing unit)            FPU (floating point unit, "coprocessor 1")
  registers $0,...,$31                     registers $f0,...,$f31
  integer arithmetic,                      floating point arithmetic,
  multiplication, division,                multiplication, division,
  logical ops                              int-float conversion

Data transfer between the units:
  CPU register <-> FPU register :  mtc1, mfc1
  CPU register <-> Memory       :  lw, sw
  FPU register <-> Memory       :  lwc1, swc1

Memory (2^32 bytes)

The MIPS instructions for adding and subtracting floating point numbers are of the form:
add.s $f1, $f0, $f1 # single precision add
sub.s $f0, $f0, $f2 # single precision sub
add.d $f2, $f4, $f6 # double precision add
sub.d $f2, $f4, $f6 # double precision sub
Having a separate FPU takes some getting used to. For example, the following instructions have
incorrect syntax and are not allowed.
add.s $s0, $s0, $s1 # NOT ALLOWED (add.s expects FPU registers)
add $f0, $f2, $f2 # NOT ALLOWED (add expects CPU registers)
¹ By "chip", I mean the silicon-based electronics which contains the combinational and sequential circuits.


Multiplication and division are done as follows.

mul.s $f0, $f1, $f2


div.s $f0, $f1, $f2

similarly for double precision, except now we must use even-numbered registers for the arguments,

mul.d $f0, $f0, $f2


div.d $f0, $f0, $f2

There is no Hi and Lo register for floats, as there was for integers.

Data transfer operations


In order to perform the above operations, the floating point registers must be filled, and after the
operations the results need to be put somewhere. There are two ways to move data to/from floating
point registers. The first is to move words to/from the CPU registers. This is done with the “move
to/from coprocessor 1” instruction:

mtc1 $s0, $f0 # note order of operands here


mfc1 $s0, $f0 # "

The second is to load/store a word to/from Memory:

lwc1 $f1, 40( $s1 )


swc1 $f1, 40( $s1 )

Note that the memory address is held in a CPU register, not in a float register!
To load/store double precision, we could use two operations to load/store the two words. It is
easier though to just use a pseudoinstruction:

l.d $f0, -10( $s1 )


s.d $f0, 12( $s1)

There is a corresponding pseudoinstruction for single precision i.e. l.s or s.s.

Type conversion (casting)


If you wish to perform an operation that has two arguments (for example, addition, subtraction,
multiplication, division) and if one of the arguments is an integer and the other is a float, then
one of these arguments needs to be converted (or “cast”) to the type of the other. The reason is
that the operation is either performed on the floating point circuits or on the integer circuits. The
conversion is done on the FPU though, regardless of whether you are converting from integer to
float or float to integer.
The MIPS instruction for a cast is cvt.x.y, where x and y are each filled with one of the symbols s (for
single precision float), d (for double precision float), or w (for integer word); the destination type comes
first, so cvt.s.w converts an integer to a single precision float. I guess MIPS avoids using i for integer
because that letter usually indicates immediate.

Example 1
Let’s say we have an integer, say -8 in $s0, and we want to convert it to a single precision float and
put the result in $f0. First, we move the integer from the CPU to FPU (c1). Then we convert.
addi $s0, $0, -8 # $s0 = 0xfffffff8
mtc1 $s0, $f0 # from $s0 to $f0 -> $f0 = 0xfffffff8
cvt.s.w $f0, $f0 # from w to s -> $f0 = 0xc1000000
As an exercise, recall from the first few lectures of the course why the registers have the coded
values shown on the right.
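As a check on that exercise, here is a small Python sketch (not part of the MIPS toolchain, just an independent check using the standard struct module) that verifies both hex values in the comments:

```python
import struct

# Two's complement encoding of -8 in 32 bits: 0xFFFFFFF8
assert (-8) & 0xFFFFFFFF == 0xFFFFFFF8

# IEEE 754 single precision encoding of -8.0:
# sign 1, exponent 3 + 127 = 130 = 10000010, significand all zeros
bits = struct.unpack('>I', struct.pack('>f', -8.0))[0]
assert bits == 0xC1000000
```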
Similarly, suppose we have a single precision float (in $f0) and we want to convert it to an
integer and put it into $s0. Here is how we could do it.
cvt.w.s $f2, $f0 # we use $f2 as a temp here
mfc1 $s0, $f2

Example 2
int i = 841242345;  // This value in binary needs more than 23
                    // bits, but less than 32 bits. (Specifically,
                    // we cannot represent it exactly as a float.)

int j = (float) i; // Explicitly cast i to float, and then


// implicitly cast it back to int (implicit
// since we store the result as an int j).
Here is the same thing in MIPS.
mtc1 $s0, $f0
cvt.s.w $f1, $f0
cvt.w.s $f1, $f1
mfc1 $s0, $f1
Try the above in MARS and verify that the final value in $s0 differs from the original value of i.
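If you don't have MARS handy, the same round trip can be sketched in Python, emulating single precision with the struct module. The exact rounded value asserted below is a consequence of the 24 bits of significand precision: near 2^30, single precision can only represent multiples of 2^6 = 64.

```python
import struct

def to_f32(x):
    """Round a Python number to the nearest single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

i = 841242345            # needs 30 significant bits
j = int(to_f32(i))       # cast to float32, then back to int

assert j != i            # precision was lost in the float
assert j == 841242368    # rounded to a multiple of 2^6 = 64
```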

Example 3
double x = -0.3;
double y = (float) x;
Here it is in MIPS:
.data
d1: .double -0.3

.text
l.d $f0, d1 # $f0 = 0xbfd3333333333333 <- these two...
cvt.s.d $f2, $f0 # $f2 = 0xbe99999a
cvt.d.s $f4, $f2 # $f4 = 0xbfd3333340000000 <- ... are different
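The three hex values in the comments can be checked in Python with the struct module (an independent check, not part of the MIPS example):

```python
import struct

def bits64(x): return struct.unpack('>Q', struct.pack('>d', x))[0]
def bits32(x): return struct.unpack('>I', struct.pack('>f', x))[0]

x = -0.3
assert bits64(x) == 0xBFD3333333333333          # double precision -0.3

y = struct.unpack('>f', struct.pack('>f', x))[0]  # cast to single...
assert bits32(y) == 0xBE99999A

assert bits64(y) == 0xBFD3333340000000          # ...and back to double
assert y != x                                   # the two doubles differ
```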

Conditional branch for floats


We do a conditional branch based on a floating point comparison in two steps. First, we compare
the two floats. Then, based on the result of the comparison, we branch or don’t branch:

c. .s $f2, $f4

Here the “ ” is filled with a comparison operator such as eq, lt, or le (a greater-than test can be done by swapping the two operands). These compare
operations are R format instructions.
The programmer doesn’t have access to the result of the comparison. It is stored in a special
D flip-flop, which is written to by the comparison instruction, and read from by the following
conditional branch instruction:

bc1t Label1 # if the condition is true, branch to Label1


bc1f Label1 # if the condition if false, branch to Label1

The same one bit register is used when comparisons are made using double precision numbers.

c. .d $f2, $f4

System call
I did not mention it in the slides, but you can also print and read float and double from the
console using syscall. Rather than printing from or reading to an argument register, it uses a
particular coprocessor 1 register, shown in the table below.

                 $v0   from/to (hardwired, i.e. no options)
                 ---   -------
print float       2    $f12
print double      3    $f12
read float        6    $f0
read double       7    $f0

MIPS Coprocessor 0: the System Control Coprocessor


There is another coprocessor, called coprocessor 0. Coprocessor 0 has special registers which are
used only in kernel mode. One of the uses of this coprocessor is “exception handling”. (We will
discuss exceptions in detail later in the course. For now, think of an exception as an event that
requires your program to stop and the operating system to have to do something. A syscall is an
example.) Exceptions are errors that happen at runtime. They cannot be detected by the assembler
i.e. before the program runs.
When you run MARS, you will see four of these registers displayed, namely:

• EPC: ($14) The exception program counter contains a return address in the program. It is
the address in the program that follows the instruction that caused the exception.

• Cause: ($13) contains a code for what kind of exception occurred (e.g. invalid operation,
division by zero, overflow).

• BadVaddr: ($8) holds the address that led to an exception if the user program tried to access
an illegal address e.g. above 0x80000000.
• Status: ($12) says whether the processor is running in kernel mode or user mode, and
whether the processor can be interrupted by various possible interrupts (we will discuss
interrupts in a few weeks).

Example 1: overflow error


Here is an example which produces an overflow error. First, I put 1 into $s0 and then shift it left
by 30 bits, giving 2^30. Then I add this value to itself, which gives 2^31, one bigger than the largest
signed 32 bit integer, i.e. overflow.
addi $s0, $0, 1 # $s0 = 1
sll $s0, $s0, 30 # $s0 = 2^30
add $t0, $s0, $s0 # $t0 = 2^31 -> overflow (signed)
If you step through this code, you’ll find that it crashes and that the c0 registers change when
this happens. The registers encode what caused the crash and they encode the instruction address
where the crash occurred.
Note that if I insert an instruction which subtracts 1, then I don't get overflow since the sum is
2^31 − 2, which is less than the largest signed int.
add $s0, $0, 1 # $s0 = 1
sll $s0, $s0, 30 # $s0 = 2^30
addi $s0, $s0, -1 # $s0 = 2^30 - 1 <--- inserted
add $t0, $s0, $s0 # $t0 = 2^31 - 2
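The signed-overflow rule can be sketched in Python. The helper below is a behavioural model (not the actual adder circuit): it wraps sums to 32 bits and raises an exception exactly when the two operands share a sign that the wrapped result does not.

```python
def add32(a, b):
    """Add two 32-bit signed ints, MIPS add-style: raise on signed overflow."""
    result = (a + b) & 0xFFFFFFFF
    if result >= 1 << 31:                 # reinterpret as signed
        result -= 1 << 32
    # overflow iff the operands agree in sign but the result disagrees
    if (a >= 0) == (b >= 0) and (result >= 0) != (a >= 0):
        raise OverflowError("signed 32-bit overflow")
    return result

x = 1 << 30                               # 2^30
try:
    add32(x, x)                           # 2^31 does not fit: exception
    raised = False
except OverflowError:
    raised = True
assert raised
assert add32(x - 1, x - 1) == (1 << 31) - 2   # 2^31 - 2 is fine
```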

Example 2: illegal address


Here is a different type of exception.
.data
str: .asciiz "hello"
.text
la $s0, str
j str
The problem is that the program is jumping (from the text segment) to the data segment, which
makes no sense. Instructions are not data.

Example 3: illegal address


inst: la $s0, inst
lw $t0, 0($s0)

This is the opposite problem. Now I am trying to load from the text (instruction) segment. Note
that the MARS assembler cannot detect this problem, since the problem only occurs at runtime,
once it is determined that $s0 contains an address in the text segment, not the data segment.

Floating point exceptions?


Here are some examples of similar operations in floating point. These illustrate how certain events
behave differently with floats than with ints, namely some operations that produce exceptions with
ints do not produce exceptions with floating point.

Example 4: Overflow
addi $s0, $0, 1
sll $s0, $s0, 30 # $s0 = 2^30
mtc1 $s0, $f0
cvt.s.w $f0, $f0 # $f0 = 2^30 = 1.00000000000000000000000 x 2^30
mul.s $f0, $f0, $f0 # $f0 = 2^60
mul.s $f0, $f0, $f0 # $f0 = 2^120
mul.s $f0, $f0, $f0 # $f0 = 2^240 -> overflow

Interestingly, no exception occurs here. Instead, if you look at the result that ends up in $f0, namely
0x7f80 0000, you will find that it represents +∞, namely 0 for the sign bit, 11111111 for the
exponent, and then 23 0’s for the significand.

Example 5: Division by zero


float1: .float 13.4
l.s $f0, float1
mtc1 $0, $f1 # Note that no cast is needed since
# 0x00000000 also is 0 in float. Wow!
div.s $f2, $f0, $f1

Again, no exception occurs. The result is again +∞.

Example 6: 0/0
mtc1 $0, $f1
div.s $f2, $f1, $f1 # 0/0

The result is NaN, but no exception occurs.
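The special values in Examples 4–6 can be inspected in Python rather than MIPS; the bit patterns below follow the IEEE 754 single precision layout described above:

```python
import struct

def bits32(x): return struct.unpack('>I', struct.pack('>f', x))[0]

inf = float('inf')
assert bits32(inf) == 0x7F800000    # sign 0, exponent 11111111, significand 0

nan = float('nan')
b = bits32(nan)
assert (b >> 23) & 0xFF == 0xFF     # NaN also has exponent all ones...
assert b & 0x7FFFFF != 0            # ...but a nonzero significand
```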


COMP 273 13 - MIPS datapath and control 1 Feb. 22, 2016

You are familiar with how MIPS programs step from one instruction to the next, and how
branches can occur conditionally or unconditionally. We next examine the machine level repre-
sentation of how MIPS goes from one instruction to the next. We will examine how each MIPS
instruction is “decoded” so that data is passed to the appropriate places (e.g. from register to
memory, from memory to a register, from a register to another register, etc). By the end of this
lecture and the next, you will have a basic understanding of how the combinational and sequential
circuits that you learned at the beginning of this course are related to the MIPS instructions and
programs that we have covered in more recent lectures.

Data paths for MIPS instructions


In this lecture and the next, we will assume that one instruction is executed in each clock cycle.
This is known as the single cycle model. In the lecture following the study break, we will see the
limitations of the single cycle model, and we will discuss how MIPS gets around it, namely by
”pipelining” the instructions.

Instruction fetch
Both data and instructions are in Memory. For an instruction to be executed, it first must be read
out of Memory. MIPS has a special program counter register (PC) that holds the address of the
current instruction being executed. As a MIPS programmer, you are not responsible for “fetching”
the next instruction from memory. This is done automatically. Let’s have a look at how this is done.
Consider the simple case of a non-branching instruction such as add or lw or sw. The address of
the instruction that follows this non-branching instruction is, by default, the address of the current
instruction plus 4 (that is, plus 4 bytes or one word). Because we are assuming that each instruction
takes one clock cycle, at the end of the clock cycle, the PC is updated to PC+4.
The current instruction (add or lw or sw, ...) is fetched by using PC to select a word from the
text part of Memory, namely the word containing the current instruction. Later in the course, we
will see how this is done. For now, it is enough for you to understand that an address is sent to
Memory and the instruction at that address is read out of Memory. To summarize, the contents
of the PC (a 32 bit address) is sent to Memory and the instruction (also 32 bits) starting at that
address is read from Memory.

[Figure: the 32-bit PC addresses the Memory (instructions); the 32-bit instruction read out is split into various parts, depending on its R, I, or J format.]
last updated: 23rd Feb, 2016. Lecture notes © Michael Langer.


Let’s next look at several examples of instructions and consider the “datapaths” and how these are
controlled.

Data path for add


What happens when the 32 bit add instruction is read out of memory? Recall that an add instruction
has R format:
field op rs rt rd shamt funct
numbits 6 5 5 5 5 6

The opcode field and funct fields together encode that the operation is addition (rather than some
other arithmetic or logical operation). The three register fields of the instruction specify which
registers hold the operands and which register should receive the output of the ALU. Thus, the
opcode and funct fields are used to control what data gets put on which lines and when data gets
written. I will say more about how the control signals are determined next lecture. For now, we
will concentrate on the datapaths, as shown in the figure below.

[Figure: datapath for add. The PC addresses the Memory (instructions). The 12 bits of the opcode and funct fields feed the control, which produces the WriteEnable ("RegWrite") and ALU op signals. Three 5-bit fields select the two read registers and the write register; the two 32-bit values read out feed the ALU, and the 32-bit ALU result is written back into the registers.]

Here is what happens during an add instruction:

• new PC value is computed and written (at the end of clock cycle)

• instruction is read (“fetched”) from Memory (details later in the course)

• two ReadReg and the WriteReg are selected (recall lecture 6)

• control signals for ALU operation are determined (to be discussed next lecture)

• RegData values are read from two registers and input to ALU; ALU operation is performed

• result is written into WriteReg (at the end of clock cycle)

The steps above can all be done in a single clock cycle, provided the clock interval is long enough
that all circuits have stabilized (and they would need to be designed to ensure this). For example, the
ALU involves carry values which take time to ripple through.
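Decoding the fetched 32-bit word into its fields is pure bit slicing. Here is a Python sketch of an R-format decoder (a behavioural illustration, not the hardware); the test instruction 0x02328020 is the standard encoding of add $s0, $s1, $s2:

```python
def decode_r(instr):
    """Split a 32-bit R-format instruction word into its six fields."""
    return {
        'op':    (instr >> 26) & 0x3F,
        'rs':    (instr >> 21) & 0x1F,
        'rt':    (instr >> 16) & 0x1F,
        'rd':    (instr >> 11) & 0x1F,
        'shamt': (instr >> 6)  & 0x1F,
        'funct': instr & 0x3F,
    }

# add $s0, $s1, $s2  assembles to 0x02328020
f = decode_r(0x02328020)
assert f['op'] == 0 and f['funct'] == 0x20          # opcode + funct say "add"
assert (f['rs'], f['rt'], f['rd']) == (17, 18, 16)  # $s1, $s2, $s0
```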

Data path for load word (lw)


Recall that this instruction has I format:
field op rs rt immediate
numbits 6 5 5 16

An example is: lw $s0, 24($t3)


Here is the data path (including the PC update!)

[Figure: datapath for lw, including the PC update: an adder computes PC+4 from the constant 4. The 6-bit opcode feeds the control, producing WriteEnable ("RegWrite") and ALUop. Two 5-bit fields select registers; the 32-bit base address read from a register feeds the ALU, along with the 16-bit immediate field sign-extended to 32 bits. The ALU output addresses the Memory (data), and the 32-bit word read out is written into a register.]

For lw, the first three steps are the same as in the add instruction, but the remaining steps are
quite different.

• new PC value is computed (it is written at next clock pulse)

• instruction is read (“fetched”) from Memory

• ReadReg and WriteReg selected and RegWrite is specified

• base address of Memory word to be loaded is read from a register and fed to ALU

• offset (16 bit immediate field) is sign-extended and fed to ALU

• control signals are set up for ALU operation (see next lecture)

• ALU computes sum of base address and sign-extended offset; result is sent to Memory

• word is read from Memory and written into a register (end of clock cycle)

Note that in the figure above I have colored the instruction segment (left) and the data segment
(right) to remind you roughly where these segments are located and the fact that they don’t overlap.

Data path for store word (sw)


The format for sw is the same as lw but the data path for sw is slightly different:

[Figure: datapath for sw. As for lw, the 6-bit opcode feeds the control (producing ALUop and now MemWrite), and the ALU adds the 32-bit base address from a register and the sign-extended 16-bit offset; the ALU output addresses the Memory (data). Here, though, the 32-bit value read from the second register is written into Memory.]

• new PC value is computed (it is written at next clock pulse)

• instruction is read (“fetched”) from Memory

• two ReadReg are selected

• base address for Memory word is read from a register and fed to ALU

• offset (16 bit immediate field) is sign-extended and fed to ALU

• control signals are set up for ALU operation

• ALU computes sum of base address and sign-extended offset; result is sent to Memory

• word is written into Memory (at end of clock cycle)

Data path for bne


Next let’s look at the case that the current instruction is a conditional branch, for example,

bne $s0, $s2, label

which is also I format. This instruction is more interesting because the PC might not proceed to
the next instruction in Memory at the next clock cycle.
To determine whether the branch condition is met, the bne instruction uses the ALU, and
performs a subtraction on the two register arguments. If the result is not zero, then the two
registers do not hold the same values and the program should take the branch. Otherwise, the
program continues to the next default instruction. Specifically, if the result of the ALU subtraction
is not zero, then

PC := PC + 4 + offset

else

PC := PC + 4.

It may seem a bit strange that 4 is added in both cases, but that’s in fact what MIPS does. This
makes sense when you consider the datapath. The offset is relative to the current instruction
address plus 4, rather than to the current instruction address. (See example on slides.)
See the figure below for the datapath. The multiplexor in the upper part of the figure selects
between PC+4 and PC+4+offset. The selector signal depends on the 32 bit output of the ALU,
which has performed a subtraction using the two registers specified by the bne instruction. This
32 bit ALU result is fed into a 32 bit OR gate. The output of the OR gate is labelled “NotZero”
in the figure. This NotZero signal is 1 if and only if the subtraction result is non-zero, that is, if the two
registers do not have the same contents.
Note that the branch is only taken if indeed the instruction is bne. In the figure, the NotZero
signal is ANDed with a bne instruction control signal that indicates it is indeed a bne instruction.
(I have labelled the branch control as “bne instruction” in the figure.) The branch is taken when
both the NotZero signal is ON (1) and the branch control is ON (1). For example, for an add instruction
or any other R-format instruction, the branch signal would be OFF (0).
One other detail is worth noting. Instructions are word aligned in Memory, namely the two lowest
order bits of an instruction address always have the value 0. Thus, in the 16 bit offset field of a
conditional branch, MIPS doesn’t waste these two bits by always setting them to 00. Instead, the
16 bit field holds the offset divided by 4. That is, the offset value is “shifted” two bits to the left
before it is added to PC+4. Careful: no shift register is required for this. Instead, two 0 wires are
inserted directly, and the sign extend circuit extends from 16 to 30 bits, so in total there are 32 bits.
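The branch-target arithmetic can be sketched in Python (a behavioural model of the datapath, not the circuit itself); the addresses in the assertions are made-up examples in the user text segment:

```python
def branch_target(pc, imm16):
    """PC-relative branch: sign-extend the 16-bit field, shift left 2,
    add to PC + 4."""
    imm16 &= 0xFFFF
    if imm16 & 0x8000:                 # sign extend 16 -> 32 bits
        imm16 -= 1 << 16
    return (pc + 4 + (imm16 << 2)) & 0xFFFFFFFF

assert branch_target(0x00400000, 3) == 0x00400010    # forward 3 instructions
assert branch_target(0x00400010, -3) == 0x00400008   # backward 3 instructions
```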
Here are the steps of the bne instruction:

• PC holds the address of the current instruction

• instruction is read (“fetched”) from Memory

• PC+4 value is computed

• value of PC+4 is added to sign-extended/shifted offset

[Figure: datapath for bne. One adder computes PC+4 and a second adds PC+4 to the sign-extended, shifted offset; a multiplexor chooses which of the two values is written into the PC, with its selector driven by the AND of NotZero and the "bne instruction" control. The ALU subtracts the two 32-bit register values selected by the instruction, and its 32-bit result feeds a giant OR gate whose output is NotZero. The 16-bit immediate passes through a circuit that sign-extends and shifts left by 2 bits.]

• ReadReg values are set


• control signals are set up for subtraction in ALU
• RegData values are read from two registers into ALU
• ALU does subtraction and 32 bit result is fed into giant OR gate, whose output we call
NotZero
• NotZero is ANDed with a bne control to select either PC+4 or PC+4+offset, and selected
value is written into PC (at end of clock pulse)

Datapath for j (jump, J format)


Recall that the jump instruction j has the following format. This gives 26 bits to say where we are
jumping to.

field op address
numbits 6 26

You might think that 26 bits specify an offset from PC+4 just like the conditional branches. How-
ever, it is not done this way. Rather, the offset is from the address 0x*0000000, where * is the high
order 4 bits of the PC register, namely bits 28-31. Recall that the MIPS instructions of a user
program go up to but not including 0x10000000. Thus, all user instructions have 0000 as their highest

four bits i.e. 0x0******* in hexadecimal. MIPS doesn’t always concatenate 0000 onto the front
of a jump instruction, since there are kernel programs as well. Instructions in the kernel programs
have addresses that are above 0x80000000. So jump instructions in the kernel do not take 0000 as
their upper four bits. Rather, they take binary 1000 as their upper four bits, i.e. 0x8.
Moreover, because instructions are word aligned, the 26 bit field is used to specify bits 2-27.
Bits 0 and 1 in an instruction address always have the value 00.
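Putting the pieces together, the jump-target computation described above can be sketched in Python, using the PC's upper four bits as in the description here (the addresses in the assertions are made-up examples):

```python
def jump_target(pc, addr26):
    """J-format target: upper 4 bits of PC, then the 26-bit field, then 00
    (a behavioural sketch following the description above, not hardware)."""
    return (pc & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)

# user segment: upper four bits of PC are 0000
assert jump_target(0x00400008, 0x0100003) == 0x0040000C
# kernel segment: upper four bits of PC are 1000
assert jump_target(0x80000020, 0x0000004) == 0x80000010
```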

[Figure: datapath for j. The jump target is formed by concatenating bits 28-31 of the PC with the 26-bit address field [inst 0-25] shifted left by 2 bits; a multiplexor with selector PCSrc chooses between this 32-bit target and PC+4. The 6-bit opcode [inst 26-31] from the Memory (instructions) feeds the control.]


COMP 273 14 - MIPS datapath and control 2 Feb. 24, 2016

Merging datapaths: (add,lw, sw)


The datapaths that we saw last lecture considered each instruction in isolation. I drew only those
elements that were needed for each instruction. Of course, the various wires and inputs are hardware
and thus they are there in every instruction. In this lecture we merge the various datapaths. We
do so using multiplexors to select from different possible inputs for circuit elements, for example,
to select which values are to be written into registers or Memory on each clock cycle. We also need
to say more about control signals, for example, the selector signals for the multiplexors.
The figure on the following page shows how the add, lw, sw datapaths from last lecture are
merged into a single circuit. I have not drawn a single “control unit”, but instead I have just shown
the various control lines (in green) where they are needed. Let’s discuss these control signals one
by one. I’ll begin by discussing the controls of the three multiplexors, shown in blue.

• RegDst: Recall the R and I format instructions have the form:

bits 26-31 21-25 16-20 11-15 6-10 0-5


R-format op rs rt rd shamt funct
numbits 6 5 5 5 5 6

I-format op rs rt immediate
numbits 6 5 5 16

For either R or I format instructions, at least one register is read. For this reason, one of the
register slots in the instruction format is always used for a read. This is the rs (source) slot,
namely bits [21-25]. A 5 bit line is always fed into the ReadReg1 input and specifies which
register we read from.
It is not true, however, that an instruction always writes to one of the 32 registers. For
example, sw does not write to a register. The rt slot, namely bits [16-20], holds a second
register number which add and sw read from, and which lw writes to.
If an R-format instruction uses three registers (add), then two of them are read from and one
of them is written to. The rd register (R format) is the one that is written to.
To summarize, rs always specifies a read register (“s” standing for source), rd always specifies
a write register (“d” standing for destination), and rt is a register that is either read from or
written to, depending on the instruction. (I am not sure that “t” stands for anything.)
If you look at the merged datapath, you see that the three 5-bit register number lines are
routed to the register array and that a multiplexor is used to decide which line goes to the
WriteReg input. The selector signal for this multiplexor is called RegDst (register destination);
a value of 1 means rd is selected (e.g. add) and a value of 0 means rt is selected (e.g. lw).

• ALUSrc: One input of the ALU always comes from a register. So one of the ReadData registers
(namely ReadData1) is hardwired to one of the ALU inputs. The other ALU input comes
from either a register (in the case of an add) or from the signed-extended 16 bit immediate
field (in the case of a lw or sw). Thus a multiplexor is needed to choose from among these
two possible inputs to the second ALU input. The selector for this multiplexor is ALUSrc.

last updated: 13th Mar, 2016. Lecture notes © Michael Langer.


• MemtoReg: The third multiplexor shown in the figure selects where the register write value
comes from. It might come from the ALU (R-format) or it might come from Memory (lw).

Other controls are shown in the merged data path. RegWrite specifies whether or not we are
writing to a register. MemWrite specifies whether or not we are writing to Memory. MemRead
specifies whether or not we are reading from Memory. Note that the RegWrite signal was called
“WriteEnable” in lecture 6 page 5. The MemWrite signal was shown in lecture 6 p. 6,7.

input                      output (of control, i.e. control variables)

               opcode   RegWrite  MemWrite  MemRead  MemToReg  RegDst  ALUsrc
any R-format   000000      1         0         0        0         1       1
lw             100011      1         0         1        1         0       0
sw             101011      0         1         0        X         X       0
bne            000100      0         0         0        0         X       1
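The truth table can be modelled as a simple lookup from opcode to control values. This Python sketch mirrors the table above, with X ("don't care") written as None; it is an illustration of the table, not of the actual combinational circuit:

```python
# Main control as a lookup table, values copied from the table above.
CONTROL = {
    0b000000: dict(RegWrite=1, MemWrite=0, MemRead=0,
                   MemToReg=0, RegDst=1, ALUSrc=1),      # any R-format
    0b100011: dict(RegWrite=1, MemWrite=0, MemRead=1,
                   MemToReg=1, RegDst=0, ALUSrc=0),      # lw
    0b101011: dict(RegWrite=0, MemWrite=1, MemRead=0,
                   MemToReg=None, RegDst=None, ALUSrc=0),  # sw
    0b000100: dict(RegWrite=0, MemWrite=0, MemRead=0,
                   MemToReg=0, RegDst=None, ALUSrc=1),   # bne
}

lw = CONTROL[0b100011]
assert lw['MemRead'] == 1 and lw['RegWrite'] == 1 and lw['MemWrite'] == 0
sw = CONTROL[0b101011]
assert sw['MemWrite'] == 1 and sw['RegWrite'] == 0       # sw writes no register
```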

The ALUOp control variable needs more explanation. This control variable specifies which
operation the ALU should perform. For example, add, sub, slt, and, or all require different
operations to be performed by the ALU. Recall that the operation is determined by the instruction’s
funct field, since all R-format instructions have the same opcode; the funct field will therefore also
be needed as an input to the circuit that computes the control signals.
Recall how the ALU circuit works. It uses full adders, along with AND and OR gates, so that
the values A and B are either summed, ANDed, or ORed. To compute subtraction, the B
variable is inverted, and a Binvert control is used for this. So there are really two controls that
need to be coded here: the Operation control and the Binvert control.

input                     output (ALUOp)

inst   opcode   funct    Binvert  Operation
add    000000   100000      0        10
sub    000000   100010      1        10
and    000000   100100      0        00
or     000000   100101      0        01
slt    000000   101010      1        11
 :        :        :        :         :
lw     100011   xxxxxx      0        10
sw     101011   xxxxxx      0        10
bne    000100   xxxxxx      1        10
 :        :        :        :         :

As you recall from Lecture 4, the ALU circuit has two control inputs. There is a Binvert which
is used for subtraction and that takes the twos complement of the B variable, and there is the
operation (either AND, OR, or +). For the latter, there are three operations to be coded, so we
need two bits. Note that this 2-bit Operation control is different from the 6-bit opcode field,
which is part of the 32 bit instruction.
Let’s suppose that the Operation control for the ALU uses 10 for adding the two inputs. The
add and sub instructions both use 10. (The difference between add and sub comes from the
Binvert control, not the Operation control.) The lw and sw instructions also perform an addition
and so they have the same Operation control. However, the and and or instructions use different
Operation controls: and has Operation control 00, and or has Operation control 01. These 2-bit
values are not meaningful in themselves: they just act as selector signals within the circuit.
There is one more op value that could be used: 11. In fact, this is used for slt, ”set if less
than”. This instruction performs a subtraction (A - B), and if the result of the subtract is negative,
then the result of the instruction is defined to be 1, and otherwise it is 0. An extra circuit is
required for this. (See Exercises 5.)
[ASIDE: In fact, in MIPS, the ALU control circuit that implements the above truth table uses two
combinational circuits: one that handles the case of opcode 000000 and uses the funct field input,
and one that handles the case where the opcode is different from 000000. I am omitting the details
this year. If you look it up e.g. in Patterson and Hennessy’s text, you’ll find that the controls are
labelled differently from what I am giving here. Conceptually though, there is no difference.]
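The behaviour in the ALUOp table (though not the gate-level circuit) can be sketched in Python. Following the description above, Binvert is modelled here as a full two's complement negation of B; note that the caller must set Binvert = 1 for sub and slt, just as in the table:

```python
MASK = 0xFFFFFFFF

def alu(a, b, binvert, op):
    """Behavioural model: Binvert negates B (two's complement), and the
    2-bit Operation selects AND (00), OR (01), add (10), or slt (11)."""
    if binvert:
        b = (-b) & MASK               # two's complement of B
    if op == 0b00:
        return a & b
    if op == 0b01:
        return a | b
    total = (a + b) & MASK
    if op == 0b10:
        return total                  # add (or sub, if Binvert was set)
    return 1 if total >> 31 else 0    # op == 0b11: slt, sign bit of A - B

assert alu(7, 5, 0, 0b10) == 12               # add
assert alu(7, 5, 1, 0b10) == 2                # sub = add with Binvert
assert alu(0b1100, 0b1010, 0, 0b00) == 0b1000 # and
assert alu(3, 9, 1, 0b11) == 1                # slt: 3 < 9
assert alu(9, 3, 1, 0b11) == 0
```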

Merging the datapaths for (bne)


Let’s consider how we can merge in the bne datapath that we saw last class. We need a multiplexor
to select between PC+4 and PC+4+offset. Call this selector signal PCSrc. In general, the value
of PCSrc depends on the instruction (bne vs. add, etc) as well as on the value of other variables
such as NotZero (in the case of bne). There are other possibilities to update the PC, for example,
if there is an unconditional jump instruction (see below). So we need to build a small control unit
that determines PCSrc. This control unit would need the opcode [inst 26-31] and other inputs (for
the case of conditional branches).
Notice that we use a separate adder circuit to compute the branch address for the bne instruction,
rather than the ALU. The reason that we need an adder here is that the ALU is already being used
in this instruction, namely to check whether the two arguments have the same value.

Datapath for jal (jump and link)


Last lecture we looked at the jump instruction j. Let’s look at a related instruction: jump and link
jal. This instruction also has J format:
field op address
numbits 6 26
Like the jump instruction j, the jump and link instruction jumps to an address that is determined
by a concatenation of the upper 4 bits of the PC, the lower 26 bits of the instruction, and the bits
00 which go into the two lowest order address bits (because instructions are word aligned).

The difference between jal and j is that jal also writes PC+4 into register $ra, which is $31.
I have written this 5-bit value as 31 in the figure. I have also added a line which shows how PC+4
gets written into register $31.
Notice in the figure that the two controls WriteDataSrc and MemtoReg are now “generalized”.
All I mean by that is that previously these controls were 1 bit because they had to choose between
two possibilities, but now there are three possibilities and so they need to be two bits. (I could have
changed the names of these controls to something more general, but we won’t need them further,
so there is no need to bother.)

Datapath for jr (jump register, R format)


You may be surprised to discover that the jr instruction has R format. From our arguments at the
beginning of this lecture, you can see why. The instruction jr $r* (typically $ra) is encoded by
inputting the register number (e.g. $ra = 31) into ReadReg1. The address to jump to is read from
this register. Usually this is the return address of a function. This return address is read out from
the register and selected as the value to be written into the PC, at the end of the clock cycle.
Since we are reading from a register, the register number is in the rs field. The rt, rd, and
shamt fields contain junk for this instruction. The funct field does not contain junk though. Since
this is an R-format instruction, the opcode is the same as for the add and other instructions, and
so jr is only distinguished from these other R-format instructions by the funct field.

Exceptions - jumping to the kernel [ADDED March 2]


We have concentrated on single MIPS programs running in the user part of MIPS memory, that
is, below address 0x80000000. We have also discussed how errors can occur when the programs

is running, such as integer division by zero, or a bad address is used for an instruction or for a
load from or store to Memory. When such errors occur, the computer stops executing the program
and jumps to the kernel. The program counter PC must be updated accordingly. In particular,
the program jumps to the first instruction address in the kernel, namely 0x80000000. This is the
first instruction of a kernel program called the exception handler. Here we consider briefly the
mechanism of how this is done.
The PC is updated by selecting from one of many inputs to a multiplexor, namely a multiplexor
whose control is PCSrc. Now we add another possible input address, 0x80000000, as shown in
the figure below.

[Figure: the PC multiplexor, with selector PCSrc, chooses among 32-bit inputs: from a register (jr), from the ALU (beq/bne), from the concatenation (j), and the constant 0x80000000.]

There can be many types of exceptions. For example, a syscall is another type. And there
are other types of exceptions called interrupts which we will discuss later. For now, I just want
to emphasize that the PC can take many different values on the next clock cycle and that the

value the PC takes is determined by a selector PCSrc. This control variable can depend on various
combinations of other control variables which describe the present state of the program.

Summary of Single Cycle implementation


What we have described is a single cycle implementation. The entire instruction (including the
instruction fetch, the update of the PC, the ALU operation, the read or write from memory) is
done in one clock cycle. This is possible because the datapaths for these instructions never
involve more than one write into a register or memory. Careful here: you can write into the PC and
into a register within one instruction, but you are using two different paths within that instruction.
The combinational circuits within these two paths run in parallel.
As mentioned earlier, in a single cycle implementation, the interval between clock pulses must
be long enough to make sure that all the signals stabilize by the end of the clock cycle (e.g. the 32
bit ALU requires all the Carryin signals to ripple through...) If a word is read from Memory, the
value must be available on the output wires from the Memory. Since the clock pulses occur at a
regular rate, we must design for the worst case.
What is the worst case? For the instructions we discussed above, the worst case is the lw
instruction. We need to read from registers, do an addition, read from Memory, and then
write the data item into a register. The add instruction, by contrast, doesn't need to use the Memory at
all, so it is much faster.
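The worst-case reasoning above can be sketched in Python (used here just as pseudocode; the per-component latencies are made-up illustrative numbers, not measurements from any real MIPS datapath):

```python
# Hypothetical latencies (in picoseconds) of the main datapath components.
# These numbers are invented for illustration only.
LATENCY = {"IF": 200, "REG_READ": 100, "ALU": 200, "MEM": 200, "REG_WRITE": 100}

# Which components each instruction's datapath passes through.
PATHS = {
    "lw":  ["IF", "REG_READ", "ALU", "MEM", "REG_WRITE"],
    "sw":  ["IF", "REG_READ", "ALU", "MEM"],
    "add": ["IF", "REG_READ", "ALU", "REG_WRITE"],
    "beq": ["IF", "REG_READ", "ALU"],
}

def instruction_latency(instr):
    """Total combinational delay along one instruction's datapath."""
    return sum(LATENCY[c] for c in PATHS[instr])

# The single cycle clock period must fit the slowest instruction (lw).
clock_period = max(instruction_latency(i) for i in PATHS)
print(clock_period)                # 800 (the lw path)
print(instruction_latency("add"))  # 600
```

With these made-up numbers, every instruction is charged 800 ps even though add needs only 600 ps, which is exactly the inefficiency of designing for the worst case.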
Designing for the worst case is a big problem, in terms of efficiency. And in fact, real MIPS
computers do not use a single cycle implementation. Rather they use multiple cycles. Next lecture,
I will introduce some ideas of how a multiple cycle implementation could work.
The idea of using multiple clock cycles should not be entirely new to you. We have already
discussed how integer multiplication and division require a sequence of shifts and either additions
or subtractions, and how floating point operations require multiple shifts, compares (of exponents),
adds and subtracts, etc. Next lecture, however, I will not discuss integer multiplication and division,
or floating point operations. Instead, I will discuss CPU instructions such as we have been
discussing this lecture and the last one, and how multiple clock cycles are used for these
instructions.

COMP 273 15 - pipelining Mar. 7, 2016

Pipelining
MIPS instructions can be thought of as consisting of up to five stages:

• IF: instruction fetch

• ID: instruction decode and register read

• ALU: ALU execution

• MEM: data memory read or write

• WB: write result back into a register

Some instructions such as lw use all of these stages whereas other instructions such as add use just
some of these stages. In a single cycle implementation, each clock cycle must be long enough for all
stages to complete. This would be inefficient, however. One way to understand this inefficiency
is to note that each instruction is only in one stage at any time, and so the other stages are not
doing anything during that time. The idea of pipelining is to use all stages all the time, to the
extent possible. In a pipeline, instructions move along the datapath, one stage at a time, through
all stages. As in the single cycle model, the PC is updated at every clock cycle. This implies that
the next instruction begins before the previous instruction is finished. At first glance, it is difficult
to believe that this could work. And indeed there are subtle issues to be addressed to make it work.
There are many familiar examples of pipelining in the world: a car wash, an assembly line
in a factory, a cafeteria, etc. The key idea is that you can use resources most efficiently when all
workers are busy. One design challenge is that it is generally not possible to keep everyone busy
at once. But you want to design the system to come as close to that as you can. The other
design challenge is that there are dependencies between workers, so that when one worker doesn’t
do his/her/its job, then this prevents another worker from doing its job. Have a look at a clip
from Charlie Chaplin’s film Modern Times for an amusing example (But MUTE the volume.)
[Link]
Real MIPS processors use a pipeline, and the PC is updated at every clock cycle. Each pipeline
stage is given the same amount of time (one clock cycle) which is long enough that each pipeline
stage can complete in this time. The five pipeline stages are listed above.
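The benefit of keeping all five stages busy can be sketched with a small Python calculation (illustrative only, ignoring hazards):

```python
N_STAGES = 5  # IF, ID, ALU, MEM, WB

def single_cycle_time(n_instructions, stage_time=1):
    # Each instruction takes one long cycle spanning all five stages.
    return n_instructions * N_STAGES * stage_time

def pipelined_time(n_instructions, stage_time=1):
    # The first instruction takes N_STAGES cycles to flow through;
    # after that, one instruction completes per cycle (no hazards).
    return (N_STAGES + (n_instructions - 1)) * stage_time

print(single_cycle_time(100))  # 500 time units
print(pipelined_time(100))     # 104 time units: nearly 5x faster
```

For a long instruction stream the speedup approaches the number of stages, which is why real processors pipeline.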
Recall that in the single cycle model, each of the control variables was given a value that
depended on the instruction. (These control variables were typically either selector signals for
multiplexors, or write enable signals.) The situation is more subtle in a pipeline scheme, since
there are multiple instructions present in the pipeline. New registers are needed between successive
stages of the pipeline, in order to keep track of which control signal is for which instruction. These
registers are referred to by a pair of stages – see below. Each pipeline register contains all the
control information needed by the instruction that has just completed the earlier of its two stages:
the values read out of the register are the controls for that instruction, and new values are written
into the register at the end of each stage (end of clock cycle). These pipeline registers pass the
control signals and other values through the pipeline.

• IF/ID: During a clock cycle, this register contains the instruction (one word) that was fetched
from Memory in the previous clock cycle. There is no control information since the instruction
hasn't been decoded yet. At the end of the clock cycle, the instruction that is being fetched
in the current clock cycle is written into this register.

• ID/ALU: This register contains parts of the fetched instruction (such as register numbers,
the immediate fields, offsets) plus controls that are needed to execute the instruction. These
control signals are available by the end of the ID stage, and these controls are for all remaining
stages of the pipeline.

• ALU/MEM: This register contains the controls that are needed to execute the remaining two
stages (MEM, WB), as well as values that were computed during the ALU stage.

• MEM/WB: This register contains controls needed to execute the WB stage, as well as any
data value that needs to be written back. These data values might have been retrieved from
Memory during the MEM stage, or they may be values that were computed in the ALU stage.

[Figure: the pipelined datapath. The PC (via a multiplexor) addresses the instruction Memory; the fetched instruction passes through the register array (ReadReg1/2, ReadData1/2, WriteReg, WriteData, RegWrite), the ALU, and the data Memory. The pipeline registers IF/ID, ID/ALU, ALU/MEM, and MEM/WB sit between successive stages, and the Control Unit's signals (ALU, MEM, WB) are carried forward through them.]

Many subtle issues arise in pipelining, since several instructions are processed during the same
clock cycle. We are used to thinking of a program as completing one instruction before another
begins, and we judge the correctness of a program with this assumption in mind. But pipelining
does not obey this assumption. The rest of this lecture will address some of the problems that arise
and how to solve them.

last updated: 26th Apr, 2016 – lecture notes © Michael Langer



Data Hazards
Example 1
Take the two instructions:

add $t1, $s5, $s2


sub $s1, $t1, $s3

For a single cycle machine, there is no problem with the fact that the register written to by add is
the same as one of the registers that sub reads from. But a pipelined machine does have problems
here. To see why, let’s visualize the pipeline stages of a sequence of instructions using a diagram as
follows:

clock cycles →
add IF ID ALU MEM WB
sub IF ID ALU MEM WB

Notice that the sub instruction has executed the ALU stage before the add instruction has written
its result back into the register. This will produce the wrong answer since one of the arguments of
the sub instruction is the result of the add. Such a problem is called a data hazard. Data hazards
arise when the lead instruction writes to a register that the trailing instruction is supposed to read
from.
There are a few ways to avoid this problem. The easiest is to insert an instruction between
the add and sub which does nothing i.e. it does not write into any of the 32 registers, nor does it
write into data memory. Such an instruction is called nop, which stands for “no operation”. This
instruction is a real MIPS instruction. The key property of nop is that it doesn’t write to any
register or to memory and so it has no effect. Including nop instructions in a program is called
stalling. The nop instruction itself is sometimes called a bubble.

add IF ID ALU MEM WB


nop IF ID ALU MEM WB
nop IF ID ALU MEM WB
sub IF ID ALU MEM WB

Of course, this is not such a satisfying solution as the inserted nop instructions slow down the
pipeline, which defeats the purpose of pipelining!
An alternative solution is to use the fact that the result of the add instruction is available at the
end of the ALU stage of the add instruction, which is before the ALU stage of the sub instruction.
Specifically, the result of add’s ALU stage is written into the ALU/MEM register, and so the result
is available there. A method called data forwarding can be used to send the result along a dedicated
path from the ALU/MEM register to an ALU input to be used in sub’s ALU stage.
How could data forwarding be implemented? For the example above, consider the condition
that could be used for data forwarding. Suppose the add instruction has finished its ALU stage
and (eventually) would write the result into some rd register. The next instruction, which would
have just finished its ID stage, would use that same register (specified in its rs or rt field) as one
of its ALU inputs. This could be detected by checking the two conditions, which correspond to the
two possible ALU inputs:

ALU/MEM.RegWrite & (ALU/MEM.rd == ID/ALU.rs)
ALU/MEM.RegWrite & (ALU/MEM.rd == ID/ALU.rt)

that is, the ALU/MEM register indicates that the leading instruction will write the ALU result into
a register, and this write register is a source register used by the trailing instruction whose controls
are in the ID/ALU register.
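These two conditions can be sketched as a Python function (the pipeline registers are modelled as dictionaries, and the field names reg_write, rd, rs, rt are assumptions chosen for illustration, not actual signal names from the notes):

```python
def forward_from_alu_mem(alu_mem, id_alu):
    """Return which ALU inputs (1 and/or 2) should take the forwarded value."""
    forwards = []
    # Forward only if the leading instruction will write a register,
    # and never forward register $0, which always reads as zero.
    if alu_mem["reg_write"] and alu_mem["rd"] != 0:
        if alu_mem["rd"] == id_alu["rs"]:
            forwards.append(1)  # first ALU input selects the forwarded line
        if alu_mem["rd"] == id_alu["rt"]:
            forwards.append(2)  # second ALU input selects the forwarded line
    return forwards

# add $t1, $s5, $s2 followed by sub $s1, $t1, $s3: $t1 is register 9 and
# $s3 is register 19, so only the first ALU input needs forwarding.
print(forward_from_alu_mem({"reg_write": True, "rd": 9},
                           {"rs": 9, "rt": 19}))  # [1]
```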
To forward the data to the ALU, we could put a new multiplexor in front of each ALU input
arm. Each of these would either select the usual line for its input or it would select a forwarded line
(in the case that the condition above is met). I have labelled these controls ALUsrc1 and ALUsrc2.

Example 2
Here is a slightly different example.

lw $s1, 24( $s0 )


add $t0, $s1, $s2

clock cycles →
lw IF ID ALU MEM WB
add IF ID ALU MEM WB

The difficulty here is that the $s1 argument for the add instruction has not been properly updated
by the lw instruction until after the WB stage of the lw. You can imagine that this is quite a common
hazard. The reason we load a word into a register is that we want to perform an operation on it!

The easiest (but also least effective) way to correct this hazard is to stall, i.e. insert two nop
instructions. But again this defeats the purpose of pipelining.

lw IF ID ALU MEM WB
nop IF ID ALU MEM WB
nop IF ID ALU MEM WB
add IF ID ALU MEM WB

[ASIDE: Added April 26]

In fact we need three nop instructions here, not two. Why? When lw is in its WB stage, it
copies a value from the MEM/WB register into the register array, but this value is not yet copied
into the ID/ALU register. Rather, any register values that are copied from the register array into
the ID/ALU register are part of a 'later' instruction which is in its ID stage. Note that the WB
and ID stages of two different instructions are using the register array at the same time. This is
fine since one is writing to the register array (WB) and the other is reading from the register array
(ID).
If this is confusing to you, then recall from way back in the course how a flip flop works. It
consists of two D-latches, and while we are writing into the first D-latch, we cannot yet read that
value from the output of the second D-latch. This is how we can read and write to a flip flop at the
same time. What's true for D flip flops is also true for registers, since registers are made out of D
flip flops. ]

A better way to solve the problem is by data forwarding. The new value of $s1 is written into
the MEM/WB register at the end of the lw's MEM stage. So, if we stall the add by just one cycle,
then we can forward the value read from the MEM/WB register directly into the ALU as part of
add's ALU stage.
lw IF ID ALU MEM WB
nop IF ID ALU MEM WB
add IF ID ALU MEM WB

How would we control this? We check if the instruction in MEM/WB is supposed to write a
value from Memory into a register rd at the end of its WB stage (namely, it is lw), and we check if
this is one of the registers that the ID/ALU instruction is supposed to read from in the ALU stage.
(Note that this is two instructions behind, not one.) The conditions to be detected are either of the
following:

MEM/WB.RegWrite & (MEM/WB.rd == ID/ALU.rs)
MEM/WB.RegWrite & (MEM/WB.rd == ID/ALU.rt)

When one of these conditions is met, the value that is read from Memory by lw can indeed be
forwarded. Note that it can be forwarded to an input branch of the ALU. We could add multiplexors
ALUsrc1 and ALUsrc2, to select whether this value is read into either (or both) of the input arms
of the ALU.
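In the same spirit as the previous sketch, the load-use condition can be written as a Python function (the field names mem_to_reg, rd, rs, rt are illustrative assumptions):

```python
def forward_loaded_value(mem_wb, id_alu):
    """True if the word read from Memory by the leading lw must be
    forwarded to the ALU for the trailing instruction."""
    if not mem_wb["mem_to_reg"]:  # leading instruction is not a load
        return False
    # The load's destination is a source register of the trailing instruction.
    return mem_wb["rd"] in (id_alu["rs"], id_alu["rt"])

# lw $s1, 24($s0); nop; add $t0, $s1, $s2 -- $s1 is register 17.
print(forward_loaded_value({"mem_to_reg": True, "rd": 17},
                           {"rs": 17, "rt": 18}))  # True
```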

Note that I have used the same controls ALUsrc1 and ALUsrc2 as in the previous example. Of
course this will not work, since these variables cannot each mean two different things. But I assume
you appreciate the spirit of what we are doing here – isolating only the datapaths that pertain to a
particular instruction or concept. The real circuit with everything in it is enough spaghetti "to feed
an army".

Example 3
Another solution is to take any instructions that come before the lw or after the add – and that are
independent of the lw and add instructions – and reorder the instructions in the program by inserting
them between the lw and add.

[April 13: modified]

before:                          after:
sub $t3, $t2, $s0                lw  $s1, 24( $s0 )
lw  $s1, 24( $s0 )     ----->    sub $t3, $t2, $s0
add $t0, $0, $s1                 or  $s5, $t5, $s0
or  $s5, $t5, $s0                add $t0, $0, $s1

lw IF ID ALU MEM WB
sub IF ID ALU MEM WB
or IF ID ALU MEM WB
add IF ID ALU MEM WB

Control Hazards
In a single cycle implementation, the next instruction is determined by the PCsrc control, which is
determined during the single clock cycle. In a pipeline implementation, however, PCsrc control is
not determined right away. Rather, in the best case, it is determined during the instruction decode.
In a pipeline implementation, PC is "incremented" to PC+4 by default at the end of the instruction
fetch (IF) stage, and thus the next instruction in the program enters the pipeline. However, in
the case of an unconditional branch such as j, jal, jr, the program will not execute the instruction
that directly follows the jump (unless the program explicitly jumps to that instruction, which would
be unlikely). This creates a problem, since the instruction directly following the jump (at PC+4)
will enter the pipeline by default.
There are two steps needed to handle jumps. First, the PC is updated at the end of the jump
instruction's ID stage. That is, once the instruction is decoded, it has been determined that it is
a jump instruction and where the program should jump to, and this jump address must be written
into the PC. I will not discuss this solution further, since it is essentially the same as in the single
cycle implementation.
Second, the instruction in the program that followed the jump and entered the pipeline should
not be executed. This problem is more subtle. Consider an example:

j Label1
Label2: addi $s0, $s1, 0
sub $s2, $t0, $t1
:
Label1: or $s3, $s0, $s1

j IF ID ALU MEM WB
addi IF ID ALU MEM WB
sub IF ID ALU MEM WB

Again, the problem is that we do not want the addi (or sub) instruction to be executed after
the j instruction.¹ How can we solve this problem? Inserting a nop between j and addi will work,
because addi will not be fetched in that case. The nop will enter the pipeline, but the instruction
following the nop will be the one that the program jumps to (after the jump’s ID stage). This nop
could be inserted by the assembler after each j instruction.
Inserting a nop instruction after each j is somewhat unsatisfying, however, since its effect is
just to stall. Is there a better approach? Yes! Rather than inserting a nop after j, one can modify
the opcode of the instruction that has just been fetched: if the opcode field in the IF/ID
register is j, then the new value of the opcode field that is written in the next clock cycle should be
the opcode of nop. In the above example, this would replace addi with nop, which avoids the stall.

¹ Note that the instructions following the jump might be executed at some other time, namely if some other part
of the code jumps to Label2.

Example 4
For a conditional branch instruction, it is not known until the end of the ALU stage whether the
branch is taken and (if so) what the address of the next instruction should be. In particular, it
is not known whether the instruction at address PC+4 (following the beq instruction) should be
executed. For example,

beq $s1, $s4, label


add $s5, $s2, $s0

beq IF ID ALU MEM WB


add IF ID ALU MEM WB

We could stall by inserting a sufficient number of nop instructions between the beq and add instructions
so that, at the completion of the ALU stage of beq, it is known whether the branch condition is
met, and so at the end of that stage the PC can be updated accordingly with the branch address
or with PC+4. That is, the control PCsrc could be computed at the end of the beq's ALU stage.
In this case, if the branch is taken, then add never enters the pipeline.

beq IF ID ALU MEM WB


nop IF ID ALU MEM WB
nop IF ID ALU MEM WB
add IF ID ALU MEM WB

A better solution is to put the add instruction into the pipeline in the default way, and then modify
the add instruction after the ALU stage of the beq instruction if the branch is taken. For example,
the nop opcode could be written instead of the add opcode into the ALU/MEM register, and all
the controls could be changed to those of the nop. Alternatively, note that the only damage that
the add instruction could cause is at its WB stage, where it writes the result back into a register.
To prevent this damage, it would be sufficient to change the RegWrite control to 0, in the case that
the branch is taken.
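The second alternative, clearing the write controls, can be sketched as follows (Python as pseudocode; the control-field names reg_write and mem_write are assumptions for illustration):

```python
def squash_if_branch_taken(branch_taken, controls):
    """Neutralize the instruction that entered the pipeline after a beq:
    with its write enables cleared, it flows through the remaining stages
    like a nop and can do no damage."""
    if branch_taken:
        controls["reg_write"] = 0  # WB stage will not write a register
        controls["mem_write"] = 0  # MEM stage will not store to Memory
    return controls

# The add following a taken beq loses its register write.
print(squash_if_branch_taken(True, {"reg_write": 1, "mem_write": 0}))
# {'reg_write': 0, 'mem_write': 0}
```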

Notice that you certainly don't want to write assembly code which considers all of these issues.
Debugging assembly code with the single cycle model in mind is difficult enough! Rather, you will
want to write your programs in a high level language and have a smart compiler create the
assembly/machine code for you; this compiler will know all the possible hazards and will have rules
for reordering and inserting nops.

COMP 273 16 - physical vs. virtual mem Mar. 9, 2016

Virtual Memory
The model of MIPS Memory that we have been working with is as follows. There are your MIPS
programs, including main and various functions that are called by main and by each other, and
data used by this program, and there are some kernel functions e.g. that may handle exceptions.
This model makes sense from the perspective of one person writing one program. However, as
you know, many programs may be running on a given computer at any time. For example, when
you run your Windows-based laptop, you may have several programs going simultaneously: a web
browser, Word, Excel, etc. You may even have multiple copies of each of these programs going. The
operating system also has many programs running at any time. (If you are running LINUX, and
you type ps -e, you will get a list of a large number of programs that are running at the current
time.)
A program that is running is called a process. When multiple processes are running simultane-
ously, they must share the computer’s memory somehow. Two related issues arise. First, different
processes use the same 32 bit program address space. (An extreme example is that multiple copies
of a program may be running!) Second, at any time, various registers (PC, registers $0-$31, ...,
pipeline registers, etc) all contain values for just one process. We will deal with the second issue in
a few weeks. For now, let’s consider how processes share the same Memory address space.
We first need to distinguish 32 bit program addresses from physical addresses. The 32 bit pro-
gram address space is called virtual memory and the program addresses themselves are called virtual
addresses. These addresses must be translated into physical memory addresses. This translation is
hidden from the MIPS programmer. As we will see, it is done by a combination of hardware
and software, namely the operating system, i.e. the kernel.
Besides allowing multiple processes to share the processor, virtual memory allows independence
between the size of the program address space (2^32 bytes) and the size and layout of the physical
address space. From the point of view of memory hardware, there is nothing magical about 32 bits for
MIPS program addresses. You know that when you buy a computer, you can choose the amount of
RAM (e.g. 2 GB or 8 GB), and you can choose the size of your hard disk (200 GB vs. 1 TB). These
choices are just a matter of how much money you want to spend. They are not crucial limitations
of the processor you use. This distinction between the size of a program address space (2^32 bytes in
MIPS) and the size of physical memory may seem strange at first, but it will make sense gradually in
the coming lectures.
An analogy for this idea of virtual versus physical addresses is telephone numbers. My office
telephone is 514-398-3740. Do you think that the person in the office next to me should have
telephone number 514-398-3741? Of course not. Or do you think that phone numbers 514-399-xxxx
should be close to McGill? No, of course not. You know that telephone numbers are arbitrary,
i.e. there is some arbitrary mapping between phone numbers and physical places. The same is true
for the relation between program addresses and physical addresses.

Physical Memory
MIPS program memory consists of 2^32 bytes. How much is that?

• 2^10 is about 1 KB (kilobyte - thousand)

• 2^20 is about 1 MB (megabyte - million)

• 2^30 is about 1 GB (gigabyte - billion)

• 2^40 is about 1 TB (terabyte - trillion)

• 2^50 is about 1 PB (petabyte - quadrillion)

How does that compare to the size of memories that you read about when buying a personal
computer or tablet, or flash drive, etc.?

Disk memory
Consider some examples, and the technology they use:

• floppy disk: about 1.4 MB (magnetic) – not used any more

• CD: less than 1 GB (optical)

• DVD: e.g. 10 GB (optical)

• hard disk drive (HDD) on a personal computer: e.g. 1 TB (magnetic)

Because disk memories have moving parts, their access time is quite slow. The memory is a
mechanical disk which must spin. Data is read or written when the appropriate part of the disk
physically passes over the read or write head. When we talk about a spinning disk, you should
realize that this is very slow relative to the time scales of the processor's clock. A typical processor
these days has a clock speed of 1 GHz (clock pulses occurring every 10^-9 seconds). If you want
to read a word from memory (as a step in the lw instruction, for example) and the word you are looking
for is on one of the above disks, then you won't be able to do so in one clock cycle. Rather, it will
typically take millions of clock cycles for the disk to spin around until the word is physically aligned
with the read or write head. If the processor has to wait this long every time a lw or sw is executed,
it will be very inefficient. Another solution is needed – and that's what we'll be talking about in
the next several lectures.
Disk memories are typically called “external memory”, even though the hard disk on your
desktop or laptop is inside the case of the computer. To avoid these long memory access times
in these computers, the computer has an internal memory that is much faster than the hard disk.
We’ll discuss these below.

Flash memory
Another kind of external memory is flash memory. The memory cards that you put in your digital
cameras or mobile phones, or the USB drives that you keep in your pocket are made of flash memory.
Tablet computers use flash memory (called a solid state drive, SSD) instead of a hard disk (HDD)
for their bulk storage, and indeed many laptops use SSDs now too. Flash is much faster to
access than disk. Typical flash access times are on the order of 10^-4 ms, although this varies a lot
between technologies and it can be much faster than that.
Hard disk memory and flash serve the same purpose. They are non-volatile memories. When
you turn off the power, they maintain their values. They can therefore be used for storing the
operating system software and all your applications.

last updated: 12th Mar, 2016 – lecture notes © Michael Langer



RAM
The “main memory” inside your computer is called RAM. RAM stands for “Random access mem-
ory”. There is nothing random about it though, in the sense of probability. Rather, it just means
that you can access any byte at any time – with access time independent of the address. Note this
uniform access time property distinguishes RAM from disk memory. (Flash memory also allows
uniform access time. However, flash memory is non-volatile, whereas RAM is volatile.)
Physically, there are two kinds of RAM:¹ these are generally called SRAM (static RAM) and
DRAM (dynamic RAM). SRAM is faster but more expensive than DRAM. These two types of
RAM serve very different roles in the computer, as we will see in the coming lectures. SRAM is
used for the cache, and DRAM is used for "RAM." (The terminology is confusing here, so I'll repeat
myself: when one says "RAM", one means "main memory", and the technology used is DRAM.
When one says "cache", one is referring to SRAM.)
[ASIDE: after the lecture, I was asked by a student how sleeping and hibernation work, with respect
to these different types of memory. Putting your computer to sleep keeps the data and program
state in volatile memory, and the battery is used to provide enough power to maintain the values in
memory. Hibernation is quite different. The program states are stored on disk (non-volatile). You
can unplug your computer and let the battery drain out and the states will be saved. When you
start the computer again, the program states are read off disk and the programs can continue. ]

Memory Hierarchy
You would like to have all your programs running in the fastest memory (SRAM), so that any item
of data or any instruction could be accessed in one clock cycle. (In the datapath lectures, we
were assuming that instruction fetch and data Memory access took only one clock cycle.) However,
because faster memory is more expensive, it is just not feasible to have everything – all instructions
and data – sitting in the fastest memory. The typical laptop or desktop has relatively little SRAM
(say 1 MB), much more DRAM (say a few GB), and much, much more disk space (hundreds of GB).
It is common to imagine a "memory hierarchy" as follows. This figure captures two concepts.
The first is the order of access: you try to access data and instructions by looking at the top level
first; if the data or instructions are not there (in SRAM), you go to the next level (main
memory, DRAM), and if what you are looking for is not there either, then you proceed down to the
next level (HDD or SSD). The second concept is that the sizes of the memories grow with each
level, while the cost per bit decreases.

[Figure: the memory hierarchy, from top to bottom: SRAM cache (MB); DRAM main memory, "RAM" (GB); HDD or SSD external memory (GB - TB).]

¹ The underlying technologies are not our concern in this course. You can just pretend both SRAM and DRAM
are made out of transistors and other electronics and leave it at that. In fact, there are many subtypes of both
SRAM and DRAM and these are changing remarkably fast.


Address translation: virtual to physical (page tables)


In the rest of this lecture, we will begin to relate two notions of memory. The first is the
programmer/software notion: each program assumes a Memory with 2^32 bytes. The second is the
hardware notion: memory must be physically embodied.

[Figure: virtual memory versus physical memory. On the virtual side, each process sees a MIPS Memory of 2^32 bytes (instructions, data, other). On the physical side: SRAM ("cache"), DRAM ("RAM", "main memory"), and HDD or SSD (tablet).]

As discussed earlier today, in order for programs to run on the actual processor using real
memory, program/virtual addresses must be translated into physical addresses. This is done by
partitioning the virtual address space and the physical address space into equal-sized regions, called
pages.

[Figure: the pages of each process's 2^32-byte virtual Memory map to pages in physical memory: SRAM cache (MB, ~1 clock cycle access), DRAM main memory or "RAM" (GB, ~10 clock cycle access), and the hard disk (TB, ~10^6 clock cycle access). Pages are swapped between main memory and the hard disk.]


Let's work with an example in which each page is 2^12 bytes, i.e. 4 KB. The address of a specific
byte within a page is then 12 bits. We say that this 12 bit address is the page offset of
that byte. Thus, the 32 bit MIPS virtual address is split into two pieces. The low order
piece (bits 0-11) is the page offset. The high order piece (bits 12-31) is called the virtual page
number. Each virtual page number of a process corresponds to some physical page number in
physical memory. We'll say this page is either in RAM or on the hard disk.
The computer needs to translate (map) a virtual page number to a physical page number. This
translation or mapping is stored in a page table. Each process has its own page table, since the
mapping from virtual address to physical address is defined separately for each process.
Here's how it works for our 4 KB page example. Take the high order 20 bits (bits 12-31), which
we are calling the virtual page number, and use these as an index (address) into the page table.
Each entry of the page table holds a physical page number. We are calling it a "table", but to make
it concrete you can think of it as an array whose index variable is the virtual page number.
Each page table entry also contains a valid bit. If the valid bit is 1, then the page table entry
holds a physical page number in main memory (RAM), and if the valid bit is 0 then the page table
entry holds a physical page number on the hard disk.
Suppose main memory consists of 1 GB (30 bit addresses). Then there are 2^18 = 2^30/2^12 pages
in main memory, assuming all of main memory (RAM) is paged. The address of a particular byte
in main memory is thus the 18 high order bits for the physical page number, concatenated with
the 12 bit offset from the virtual address. It is the upper 18 bits of the address that would be stored in
the "physical page address" field of the page table.
If the valid bit is 0, then the physical address has a page number that is on the hard disk. There
are more pages on the hard disk than in main memory. For example, if the hard disk has 1 TB
(2^40 bytes) then it has 2^28 = 2^40/2^12 pages.
Each entry in the page table may also have bits that specify other administrative information about
that page and the process, which the kernel may need: e.g. whether the page is read or write
protected (user A vs. user B vs. kernel), how long it has been since the page was in main memory,
when the page was last accessed (assuming valid is 1), etc.

virtual page number          | physical page address           | valid bit | R/W | etc.
(up to 2^20 entries)         | (depends on size of phys. mem.) |           |     |
0                            |                                 |           |     |
1                            |                                 |           |     |
2                            |                                 |           |     |
3                            |                                 |           |     |
...                          |                                 |           |     |

[ASIDE: The page table of each process doesn't need to have 2^20 entries, since a process may only
use a small amount of memory. The kernel typically represents the page table entries using a more
clever data structure^2 to keep the page tables as small as possible.]
Where are the page tables? First, note that page tables exist both in the virtual address
space (program addresses) and in the physical address space. Page tables are constructed and
maintained by kernel programs, and so the kernel programs must be able to address them. Of
^2 For example, hash tables, which you may have learned about in COMP 250.

last updated: 12th Mar, 2016 5 lecture notes © Michael Langer


COMP 273 16 - physical vs. virtual mem Mar. 9, 2016

course, they also must exist physically somewhere, and so we need to be able to talk about their
physical addresses too.
For simplicity, you can think of page tables as not themselves lying in a part of memory that is
paged. (The reason this simplifies things is that if page tables were themselves broken into pages,
then you would need a page table for the page table, .... which you probably don't want to think
about right now.) In fact, page tables are often kept in paged memory, but let's keep it simple and
ignore that.

Page faults and page swaps


When a program tries to access a word (either an instruction or data), but the valid bit of the entry
of that page in the page table is 0, then the page is on the hard disk. Here we say that a page fault
occurs. When a page fault occurs, a kernel program arranges that the desired page gets copied from
the hard disk into main memory. The kernel must choose an available page in main memory, and
if there is no available page then it must first remove some page from main memory in order to
make space. This copying of a page from disk to main memory and from main memory to disk is
called a page swap. We'll discuss page swaps again later in the course.
Here is how page faults are handled from a software point of view. When a user program
is running and a page fault occurs, the program branches to the kernel. It may branch directly to
the address of the page fault handler (a kernel program that handles page faults). Or it may first
branch to a general exception handler, which analyzes what the exception is and then branches
to the particular handler for that exception. The page fault handler then updates the page tables
and manages the copying of data between main memory and the hard disk.



COMP 273 17 - cache 1 (TLB) March 14, 2016

Cache: motivation
Last lecture we discussed the different types of physical memory and the relationship between
program (virtual) memory addresses and physical addresses. We also discussed the memory hierarchy,
and how we can’t have quick access to all physical memory. Instead, we need to use a combination
of small fast expensive memory (SRAM), slower and bigger less expensive memory (DRAM), and
much slower and much bigger and much cheaper memory (disk).
How do these ideas about memory relate back to the datapaths that we learned about earlier? Recall
the basic pipeline scheme:

[Figure: the basic five-stage pipeline — PC and instruction memory, register file (ReadReg1/2,
ReadData1/2, WriteReg, WriteData, RegWrite), ALU, data memory, and control unit, with stages
IF, ID, ALU, MEM, WB.]

There, the "Memory" box was a placeholder for an unspecified mechanism. Then, last lecture we
learned that memory accesses involve two steps: first, translate the 32 bit virtual address to
a physical address (using a page table lookup); second, use the physical address to index into the
physical memory. So we now think of the IF and MEM stages as shown below (two stages each).
This is still an abstract representation, since a "page table" is not a circuit, nor is it clear yet what
we mean by the "instruction memory" and "data memory" boxes. Do we mean RAM or disk, for
example? It will take a few more lectures to unpack this.

[Figure: the pipeline with two-stage memory accesses — PC → page table → instruction memory
in the instruction fetch (IF) stage, and page table → data memory in the memory access (MEM)
stage, with ID, ALU, and WB unchanged.]

If we think of the two stage memory accesses as being from main memory (RAM) or from disk,
then we see immediately that we have a speed problem. Even if the word we want is in main
memory rather than on disk, each main memory (DRAM) access requires many clock cycles (say

last updated: 15th Mar, 2016 1 lecture notes © Michael Langer



10). Thus, every instruction would require many clock cycles. We would need about 10 cycles to
get the translation from the page table, about 10 cycles to fetch the instruction, and
another 10 clock cycles to load or store the word.
The solution to the problem is to use small very fast memory (SRAM) to hold frequently used
page table entries (virtual to physical page number translations), and frequently used instructions
and data. These small and very fast SRAM memories are called caches. For now, we’ll just think
of them as boxes which have an input and an output, where each box is a sequential circuit similar
to the register array (but much bigger in practice). In the figure below, the input to the page table
cache is a virtual address and the output is a physical address. The input to the instruction or data
cache is a physical address and the output is an instruction or data word. (For the data cache, it
is also possible to write into it, e.g. sw. We will get to that next lecture.)

[Figure: the pipeline with caches — PC → page table cache (TLB) → instruction cache in the
instruction fetch (IF) stage, and page table cache (TLB) → data cache in the memory access
(MEM) stage.]

Page table cache ("Translation lookaside buffer" or TLB)


Let's first consider the page table cache, which historically is called the translation lookaside buffer
(or TLB). This cache is as important as the data and instruction caches, since all memory accesses
require that a virtual address is translated into a physical address. Since these translations are done
so commonly, the translations themselves need to be cached.
You can think of the TLB as being split in two parts – one for instruction address translations
and one for data address translations. I will often keep the discussion simple by just referring to
"the" TLB.
How is the TLB organized? Recall that the 32 bit MIPS address is partitioned into a virtual
page number (VPN) and a page offset. Let’s continue with the example from last lecture in which
these two fields are 20 and 12 bits, respectively. Suppose that the TLB itself has 512 (2^9) possible
entries. Then, we partition the 20 bit virtual page number into two components:

• a TLB index (lower 9 bits of VPN). The index specifies which of the 2^9 = 512 entries in the
TLB we are referring to. You can think of it as an address of an entry in the TLB.

• a tag (upper 11 bits of VPN). The tag is used to disambiguate the 2^11 virtual page numbers
that have a given 9 bit index.

For example, suppose we have two (32 bit) virtual addresses:

(tag, 11)       (TLB index, 9)   (page offset, 12)
01010100100     001001011        010101111111
01010100100     001001011        001001001001




Because these two virtual addresses have the same values in their TLB index fields, their VPN’s
would be mapped to the same entry of the TLB. Notice that these addresses happen to have the
same tag as well, since their 20 bit virtual page numbers (VPN) are the same, that is, these addresses
lie on the same page. Only the page offset of the addresses differ.
Now consider a slightly different example:
(tag, 11) (TLB index,9) (page offset, 12)
01000111011 001001011 010101111111
01010100100 001001011 010101111111
Here the TLB index and page offsets are the same, but the tags differ, and so the virtual page
numbers differ as well. Since two different virtual pages must be represented physically by different
physical pages, the physical page numbers must also differ. But since both VPNs map to the same
TLB entry, only one of these two translations can be represented in the TLB at any one time.
More details on how this works will be given below.
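A short sketch can make these field boundaries concrete. The helper function below is hypothetical (in the real TLB the split is just wiring); it uses the 11/9/12 split of this example and checks that the second pair of addresses above map to the same TLB entry even though their tags differ.

```python
def tlb_fields(va):
    """Split a 32-bit virtual address into (tag, TLB index, page offset),
    assuming an 11-bit tag, 9-bit index, and 12-bit page offset."""
    offset = va & 0xFFF           # bits 0-11
    index = (va >> 12) & 0x1FF    # bits 12-20
    tag = va >> 21                # bits 21-31
    return tag, index, offset

# The two addresses from the second example above.
a = int('01000111011' + '001001011' + '010101111111', 2)
b = int('01010100100' + '001001011' + '010101111111', 2)

tag_a, index_a, _ = tlb_fields(a)
tag_b, index_b, _ = tlb_fields(b)
print(index_a == index_b, tag_a == tag_b)   # True False: same entry, different tags
```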
Another issue to be aware of is that the TLB can contain translations from several different
processes. One might think that this should not be allowed and that the TLB should be flushed (all
values set to 0) every time one process pauses (or ends) and another continues (or starts). However,
this harsh flushing policy is unnecessary: there is an easy way to keep track of different processes
within the TLB. To distinguish which process (and hence, which page table) an entry belongs to,
one can add another field to each entry of the TLB, namely a Process ID (PID) field. This is simply
a number that identifies the process. The concept here should be familiar to you. PID’s are like
area codes for telephone numbers. They are used to distinguish possibly the same 7-digit phone
numbers in two different cities.
Finally, each entry of the TLB has a valid bit which says whether the entry corresponds to a
valid translation or not. For example, when a process finishes running, the TLB entries that were
used by that process are no longer considered valid, and so the valid bits for all entries of that
process are set to 0.
Have a look again at the datapath on page 2. What circuit is inside the little boxes that say
"page table cache (TLB)"? The answer is shown below. Consider a 32 bit virtual address which
must be translated into a physical address. This virtual address is partitioned into tag/index/offset
fields. To check whether the TLB contains the translation for this virtual address, the indexed row
of the TLB is retrieved. (Specifically, a decoder is used to identify which row to retrieve.
Recall the RowSelect control mechanism from lecture 6.) The TLB valid bit must be checked and
must have the value 1. The tag field must match the upper bits of the virtual address. The Process
ID field (PID) must match the Process ID of the current process. If all these conditions are met,
then that entry in the TLB can be used for the translation: the physical page number stored at that
entry of the TLB is concatenated with the page offset bits of the virtual address (see figure below)
and the result defines the physical address, i.e. the translated address. Since SRAM is fast, all this
happens within the IF stage, i.e. within the clock cycle.
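The hit conditions just described can be summarized in a sketch. The entry layout and names below are made up for illustration; in hardware the three checks are comparators and the "concatenation" is wiring.

```python
def tlb_lookup(tlb, va, current_pid):
    """Return the 30-bit physical address on a TLB hit, or None on a miss.
    Each TLB entry is a tuple (valid, tag, pid, physical page number)."""
    offset = va & 0xFFF
    index = (va >> 12) & 0x1FF
    tag = va >> 21
    valid, entry_tag, entry_pid, ppn = tlb[index]
    # All three conditions must hold: valid bit set, tag match, PID match.
    if valid and entry_tag == tag and entry_pid == current_pid:
        return (ppn << 12) | offset   # concatenate PPN with the page offset
    return None                       # TLB miss (an exception, in hardware)

tlb = [(0, 0, 0, 0)] * 512
tlb[7] = (1, 5, 42, 0x2ABCD)          # entry for index 7, tag 5, PID 42
va = (5 << 21) | (7 << 12) | 0x034    # an address that should hit
print(hex(tlb_lookup(tlb, va, 42)))   # 0x2abcd034
print(tlb_lookup(tlb, va, 99))        # None: wrong PID
```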
One final point: last lecture we said that pages can be either in RAM or on disk, and we saw
that there are two lengths of physical addresses – one for RAM and one for disk. The TLB does
not store translations for pages on disk, however. If the program is trying to address a word that
only lies on the hard disk, then the TLB will not have the translation for that address; the TLB
only holds translations for addresses that are in main memory (RAM). In order to have access to




that word on disk, a page fault would need to occur and the page would need to be copied from
the hard disk to main memory. We will discuss how this is done later in the course.

[Figure: the TLB circuit. The 32-bit virtual address is split into a tag (bits 31-21), TLB index
(bits 20-12), and page offset (bits 11-0). A 9-to-512 decoder selects one of the 512 TLB entries;
each entry holds a physical page number (18 bits), a tag (11 bits), a PID, and a valid bit.
Comparators check the entry's tag against the address tag and the entry's PID against the PID
register; if both match and the valid bit is 1, the "TLB hit" signal is asserted and the physical
page number is concatenated with the page offset to form the 30-bit physical address.]



COMP 273 18 - cache 2 (data & instructions, hit and miss) Mar. 16, 2016

[ASIDE: The ordering of slides in the actual lecture was a bit confusing, since I put a brief discussion
of the TLB in the middle. That discussion really belonged at the end of last lecture, but there was
no time then, because it was a quiz day and there was a lot of interaction in that class. I have
rearranged the slides to try to avoid this confusion, and the notes below correspond to this new
arrangement. The notes below contain much more information than I gave in the lecture. I will go
over that information a few lectures from now, when I return to page faults and disk accesses.]

TLB miss and TLB refill (and page fault)


Last lecture I discussed the TLB and how virtual addresses are translated to physical addresses. I
only discussed the cases of TLB hits. Here I will briefly discuss TLB misses, namely what happens
if the TLB doesn’t contain a desired virtual-to-physical translation. If a TLB miss occurs, an
exception results. The program jumps to the exception handler which analyzes what the exception
was, and then jumps to a special kernel program which handles TLB misses – namely the TLB miss
handler. This kernel program consults the page table (in main memory) of the current process, to
see if the desired word is in main memory i.e. if that page table entry’s valid bit is 1.
If the page valid bit is 1 then this physical page containing the address we want is in main
memory. (Indeed, the page valid bit could be called instead the ’physical page is in main memory’
bit.) The TLB miss handler retrieves the physical page number from the page table and fills that
entry in the TLB (called a TLB refill), and sets the valid bit for that TLB entry to 1. The kernel
then returns control to the program so that it can continue its execution, namely it can perform
the virtual-to-physical translation that had caused the TLB miss.
If, however, the page valid bit is 0, then the page that the program wants to access is not in
main memory. Rather it is on disk, and so a page fault occurs: the TLB miss handler calls the
page fault handler, and the page fault handler arranges for the desired page to be copied from the
disk to main memory. (We’ll see how in a few lectures from now when we talk about disk accesses.
) The page fault handler then updates the page table appropriately, namely it changes the page
valid bits and physical page numbers. If a page swap occurred (so a page was also transferred from
main memory to disk), then two entries of the page table must be changed, namely the entries
corresponding to the incoming and outgoing pages. The page fault handler then returns to the TLB
miss handler. Now, the requested page is in main memory and the page valid bit in that entry of
the page table is 1. So the TLB miss handler can copy the page table entry into the TLB. The
TLB miss handler then returns control to the original process.
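The control flow of the last three paragraphs can be sketched in pseudo-kernel code. Everything here is a software simulation with made-up data structures (a dict page table, tuple TLB entries, an invented free page number); it is meant only to show the order of the checks and updates, not real kernel code.

```python
def handle_tlb_miss(page_table, tlb, vpn, pid):
    """Simulate the TLB miss handler: consult the page table, trigger a
    page fault if needed, then refill the TLB entry for this VPN."""
    valid, ppn = page_table[vpn]
    if not valid:
        ppn = page_fault_handler(page_table, vpn)   # page is on disk
    index = vpn & 0x1FF       # low 9 bits of the VPN
    tag = vpn >> 9            # upper 11 bits of the VPN
    tlb[index] = (1, tag, pid, ppn)                 # the TLB refill
    return ppn

def page_fault_handler(page_table, vpn):
    """Simulate a page fault: pretend the page was copied from disk into a
    free RAM page, then update the page table (valid bit and page number)."""
    free_ppn = 0x1234         # a made-up free physical page in RAM
    page_table[vpn] = (1, free_ppn)
    return free_ppn
```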

Data and instruction caches


The TLB provides a quick translation from a virtual address to a physical address in main memory.
However, main memory is still a bit too slow to use every time we want an instruction or we want to
access a data word. This is why, to speed up memory accesses, we also use a cache for instructions
and data.

last updated: 20th Mar, 2016 1 lecture notes © Michael Langer



[Figure (repeated from last lecture): the pipeline with the page table cache (TLB) and the
instruction cache in the IF stage, and the page table cache (TLB) and the data cache in the
MEM stage.]

Suppose the cache contains 128 KB (2^17 bytes).^1 [ADDED March 20: Here I am only
counting the bytes for the instructions (or data); I am not counting the extra bits that
are needed for the tag, and the dirty and valid bits. These extra bits can be significant
(see Exercises 6 Q2).] Also, I am not distinguishing the instruction and data caches. Again,
suppose our main memory contains 1 GB (2^30 bytes). How is a cache organized and accessed?

case 1: each cache entry has one word (plus tag, etc)
Since MIPS instructions and data are often one word (4 bytes), it is natural to access 4-tuples of
bytes at a time. Since the number of bytes in the instruction cache (or data cache) is 2^17 and we
have one word per entry, the cache would have 2^15 entries holding one word (2^2 bytes) each.
Let’s now run through the sequence of steps by which an instruction accesses a word in the
cache. Suppose the processor is reading from the cache. In the case of an instruction cache, it is
doing an instruction fetch. (In the case of a data cache, it is executing say a lw instruction.) The
starting address of the word to be loaded is translated from virtual (32 bits) to physical (30 bits).
We saw last class how the TLB does this. Then, the 30 bit physical address is used to try to find
the word in the cache. How?
Assume the cache words are word aligned. That is, the physical addresses of the four bytes
within each word in the cache have LSBs 11, 10, 01, 00. The lowest two bits of the address are
called the byte offset since they specify one of the four bytes within a word.
The next fifteen bits (2-16) are used to index into the 2^15 entries in the cache. We are assuming
a read rather than write, so the entry is then read out of the cache. (Back in lecture 6, we looked
at basic designs for retrieving bits from memory. You do not need to refresh all the details here,
but it would be helpful to refresh the basic idea that you can feed 15 bits into a circuit and read
out a row of data from a 2D array of flipflops.)
The upper 13 bits of the physical address (the tag) are compared to the 13 bit tag field at that
cache entry. If the two are the same, and if the valid bit of that entry is on, then the word sitting
at that entry of the cache is the correct one. If the upper 13 bits don’t match the tag or if the valid
bit is off, however, then the word cannot be loaded from the cache and that cache entry needs to
be refilled from main memory. In this case, we say that a cache miss occurs. This is an exception
^1 Note that this is significantly larger than the TLB size from last lecture. In principle, the TLB and cache sizes
could be similar. There are technical reasons why the TLB is typically smaller, which I am omitting. (It has to
do with the mechanism for indexing. I am only presenting one method for indexing, called "direct mapping", but
there are other schemes that use more complicated circuits. These are called "set associative" and "fully associative"
caches.)




and so a kernel program known as the cache miss handler takes over. (We will return to this later
in the lecture.)
Using this approach, the cache memory would be organized as shown in the circuit below. For
each of the 2^15 word entries in the cache, we store four fields:
• the word, namely a copy of the word that starts at the 30 bit physical (RAM) address
represented by that cache entry
• the upper 13 bits of that 30 bit physical address, called the tag. The idea of the tag is the same
as we saw for the TLB. The tag is needed to distinguish all entries whose physical addresses
have the same bits 2-16.
• a valid bit that specifies whether there is indeed something stored in the cache entry, or
whether the tag and data word are junk bits, for example, left over by a previous process
that has terminated.
• a dirty bit that says whether the byte has been written to since it was brought into memory.
We will discuss this bit later in the lecture.
The case of a write to the data cache e.g. sw uses a similar circuit. But there are some subtleties
with writes. I will discuss these later.
[Figure: direct-mapped cache circuit for case 1. Each entry holds a tag (13 bits), a dirty bit, a
valid bit, and a word (32 bits). The 30-bit physical address is split into a 13-bit tag, a 15-bit
cache index, and a 2-bit byte offset ("00" for word-aligned accesses). The index selects an entry;
a comparator checks the stored tag against the address tag, and together with the valid bit this
determines "cache hit?".]

[ASIDE: Unlike in the TLB, there is no process id field PID in the cache. The reason is that it
is possible for two processes to use the same physical addresses, for example, two processes might
be using the same instructions if they are two copies of the same program running or if they are
sharing a library. Or they might be sharing data.]

case 2: each cache entry has a block of words (plus tag, etc)
Case 1 above took advantage of a certain type of “spatial locality” of memory access, namely that
bytes are usually accessed from memory in four-tuples (words). This is true both for instructions




and data. Another type of spatial locality is that instructions are typically executed in sequence.
(Branches and jumps occur only occasionally). At any time during the execution of a program we
would like the instructions that follow the current one to be in the cache, since such instructions
are likely to be all executed. Thus, whenever an instruction is copied into the cache, it would make
sense to copy the neighboring instructions into the cache as well.
Spatial locality also arises for data word accesses. For example, arrays are stored in consecutive
program addresses. Because pages are relatively large, neighboring program addresses tend to
correspond to neighboring physical addresses. Similarly, words that are nearby on the stack tend
to be accessed at similar times during the process execution.
A simple way to implement this idea of spatial locality is as follows: Rather than making the
lines (rows) in the cache hold one word each, we let them hold (say) four words each. We take four
words that are consecutive in memory (namely consecutive in both virtual memory and in physical
memory). These consecutive words are called a block. If our cache holds 2^17 bytes = 128 KB and
each block holds 4 words, or 4 × 4 = 2^4 bytes, then the cache can hold up to 2^13 blocks, so we
would use 13 bits to index a line in the cache.
To address a word within a block, we need two bits, namely bits 2 and 3 of the physical address.
(Bits 0 and 1 are the “byte offset” within a word.) Bits 2,3 are the block offset which specify one
of four words within the block. Note that each block is 16 bytes (4 × 4) and the blocks are block
aligned – they always start at a physical memory address that has 0000 in the least significant
four bits. This might not always work best, but it simplifies the circuitry.
As shown in the figure below, to access a word from the cache, we read an entire block out of
the cache. Then, we select one of the four words in that block. One could draw a similar circuit for
writing a word (or block) to the cache, although the multiplexor part would need to be changed.
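As a sketch, the address fields for this block organization can be computed as follows. The helper function is hypothetical (in hardware this split is just wiring): byte offset in bits 0-1, block offset in bits 2-3, a 13-bit cache index, and a 13-bit tag.

```python
def block_cache_fields(pa):
    """Split a 30-bit physical address for a cache with 2^13 lines of
    4-word (16-byte) blocks."""
    byte_offset = pa & 0b11            # bits 0-1: byte within a word
    block_offset = (pa >> 2) & 0b11    # bits 2-3: word within the block
    index = (pa >> 4) & 0x1FFF         # bits 4-16: cache line
    tag = pa >> 17                     # bits 17-29
    return tag, index, block_offset, byte_offset

# The 16 bytes of one (aligned) block share a tag and index,
# differing only in the block and byte offsets.
for pa in range(16):
    assert block_cache_fields(pa)[:2] == (0, 0)
print(block_cache_fields((5 << 17) | (7 << 4) | (2 << 2) | 1))   # (5, 7, 2, 1)
```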
[Figure: direct-mapped cache circuit for case 2. Each cache line holds a tag (13 bits), a dirty bit,
a valid bit, and a block of four 32-bit words (word00 through word11). The physical address is
split into a 13-bit tag, a 13-bit cache index, a 2-bit block offset, and a 2-bit byte offset. The index
selects a line, the whole block is read out, a multiplexor uses the block offset to select one of the
four words, and a comparator on the tag (together with the valid bit) determines a hit.]

Hits and misses


The above discussion was about indexing mechanisms, namely how we read from a cache assuming
that the cache has the word we want (a hit). We next consider two other aspects of caches. The
first is what happens when the address we are indexing is not in the cache: a cache miss. The
second aspect is what happens when we write to a cache, either from a register to the cache (in the
case of a store word), or when we copy a block from main memory to the cache in the case of a
miss. Specifically, we are concerned here with consistency between the cache and main memory.




Instruction Cache
The instruction and data caches have a subtle difference: instructions are only fetched (read) from
memory, but data can be read from or written to memory. For the instruction cache, blocks are
copied from main memory to the cache. For the data cache, blocks can be copied either from main
memory to the cache, or vice-versa. There can also be writes from registers to the data cache e.g.
by sw (or swc1) instructions.
We begin with the instruction cache, which is simpler. If we are fetching and the instruction is
in the cache, i.e. a hit, then we have the case discussed earlier in the lecture. If the instruction is
not in the cache, however, we have a miss. This causes an exception – a branch to the exception
handler which then branches to the cache miss handler. This kernel program arranges for the
appropriate block in main memory (the one that contains the instruction) to be brought into the
cache. The valid bit for that cache entry is then set, indicating that the block now in the cache
indeed represents a valid block in main memory. The cache miss handler can then return control
to the process and we try again to fetch the instruction. Of course, this time there is a hit.

Data Cache - write-through policy


The data cache is more complicated. Since we can write to the data cache (sw), it can easily happen
that a cache line does not have the same data as the corresponding block in main memory. There
are two policies for dealing with this issue: “write-through” and “write-back”. The write-through
policy ensures that the cache block is consistent with its corresponding main memory block. We
describe it first.
Consider reading from the data cache (as in lw or lwc1). If the word is in the cache, then we
have a hit and this case was covered earlier. If there is a miss, however, then an exception occurs
and we replace that cache entry (namely, the entire block). The previous entry of the cache is
erased in the process. This is no problem for the write-through policy since this policy ensures that
the cache line (just erased) has the same data as its corresponding block in main memory, and so
the erased data is not lost.
Consider what happens when we write a word from a register to the cache (sw or swc1). First
suppose that there is a hit: the cache has the correct entry. The word is copied from the register
to the cache, and the word is also copied to the appropriate block in main memory, so that main
memory and cache are consistent (i.e. write through).
If the desired block is not in the cache, then an exception occurs – a cache miss. The cache miss
handler arranges that the appropriate block is transferred from main memory to the cache. The
handler then returns control to the program which tries to write again (and succeeds this time – a
cache hit). Here is a summary of the "write through (data cache)" policy:

            hit                           miss
read (lw)   copy word cache → reg         copy block main mem → cache (and set valid bit)
                                          copy word cache → register
write (sw)  copy word reg → cache         copy block main mem → cache (and set valid bit)
            copy word cache → main mem    copy word register → cache
                                          copy word cache → main memory
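The table above can be turned into a small simulation. The code models one word per cache entry and a dict for main memory, all hypothetical; the key property of write-through is that main memory is updated on every write.

```python
def _fields(pa):
    return pa >> 17, (pa >> 2) & 0x7FFF      # (tag, index), as in case 1

def wt_read(cache, mem, pa):
    """Write-through read: refill the entry from main memory on a miss."""
    tag, index = _fields(pa)
    valid, entry_tag, word = cache[index]
    if not (valid and entry_tag == tag):     # miss
        word = mem[pa]                       # copy block main mem -> cache
        cache[index] = (1, tag, word)        # (and set valid bit)
    return word                              # copy word cache -> register

def wt_write(cache, mem, pa, word):
    """Write-through write: update the cache AND main memory."""
    tag, index = _fields(pa)
    valid, entry_tag, _ = cache[index]
    if not (valid and entry_tag == tag):     # miss: refill first
        cache[index] = (1, tag, mem[pa])
    cache[index] = (1, tag, word)            # copy word reg -> cache
    mem[pa] = word                           # copy word cache -> main memory
```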




Data cache: “write back” policy


The second policy avoids copying the updated cache block back to main memory unless it is
absolutely necessary. Instead, by design, each entry in the cache holds the most recent version of a
block in main memory. The processor can write to and read from a block in the cache as many times as it
likes without updating main memory – as long as there are hits. The only special care that must be
taken is when there is a cache miss. In this case, the entry in the cache must be replaced by a new
block. But before this new block can be read into the cache, the old block must be written back
into main memory, so that the more recent version of the block in the cache is not lost (the block
in main memory is the older version). This is how the "write back" scheme delays the writing of
the block to memory until it is really necessary.
To keep track of which lines in the cache are consistent with their corresponding blocks in main
memory, a dirty bit is used – one per cache line. When a block is first brought into the cache, the
dirty bit is set to 0. When a word is written from a register to a cache block (e.g. sw), the dirty
bit is set to 1 to indicate that at least one word in the block no longer corresponds to the word in
main memory.
Later, when a cache miss occurs at the cache line, and when the dirty bit is 1, the data block
at that cache line needs to be written back to main memory, before the new (desired) block can
be brought into the cache. That is, we “write-back”. This policy only writes a block back to main
memory when it is really necessary, i.e. when the block needs to be replaced by another block.
Note that there is just one dirty bit for the whole block. This bit doesn’t indicate which word(s) is
dirty. So the whole block is written back to main memory and a whole new block is brought in.
This “write-back” policy helps performance if there are several writes to individual words in a
block in the cache before that block is replaced by another.
The following table summarizes the steps of the data cache write-back policy.

        hit                        miss
read    copy word: cache → reg     copy block: cache → main mem (only if valid and dirty bits are 1)
                                   copy block: main mem → cache (set dirty bit = 0, valid bit = 1)
                                   copy word: cache → register
write   copy word: reg → cache     copy block: cache → main mem (only if valid and dirty bits are 1)
        (and set dirty bit = 1)    copy block: main mem → cache (and set valid bit = 1)
                                   copy word: reg → cache (and set dirty bit = 1)
Note that the misses take more time than the hits. So this policy only makes sense when the
hits are much more frequent than the misses.
Notice that in the write back scheme, a “write hit” is cheaper and a “read miss” is more
expensive. For large caches, the hit rate is typically over 95 per cent. For this reason, a write back
policy tends to give better performance than the write through policy for large caches.
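The write-back policy can likewise be simulated. One word per entry keeps the sketch short (a real cache writes back whole blocks), and all names are made up; the point is that main memory is only updated when a dirty entry is evicted.

```python
def wb_access(cache, mem, pa, write_word=None):
    """Write-back read (write_word=None) or write of one word.
    Each cache entry is a tuple (valid, tag, dirty, word)."""
    tag, index = pa >> 17, (pa >> 2) & 0x7FFF
    valid, entry_tag, dirty, word = cache[index]
    if not (valid and entry_tag == tag):             # miss
        if valid and dirty:                          # old entry was written to:
            old_pa = (entry_tag << 17) | (index << 2)
            mem[old_pa] = word                       # write it back first
        word = mem[pa]                               # refill from main memory
        cache[index] = (1, tag, 0, word)             # dirty = 0, valid = 1
    if write_word is None:
        return word                                  # read: copy cache -> register
    cache[index] = (1, tag, 1, write_word)           # write: set dirty bit = 1
    return write_word
```

Note that a write hit touches only the cache; the write to main memory is delayed until a miss at the same index evicts the dirty entry.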



COMP 273 19 - system bus and I/O devices Mar. 21, 2016

Up to now in the course, we have concentrated on the CPU and memory. In the final few
lectures, we will briefly consider some of the basic ideas of how the CPU and memory interact
with the I/O devices. The I/O devices we will consider are the keyboard and mouse (inputs), and
monitor and printer (outputs). The hard disk drive is also sometimes considered an I/O device
and is both input and output. We will not consider the issues of computer networks, which are
obviously also part of I/O but which are a more advanced topic.
Our discussion of I/O will be relatively general and untechnical, in contrast to what we have
seen up to now. Also, we will not consider current technologies, which are quite complex and diverse
and would be inappropriate for an introductory course such as this.

(Shared) System bus


The classical way to connect the CPU, main memory, and I/O devices is to use a shared set of lines
called a system bus. A “bus” is just a set of wires for carrying bits which is shared by a number
of devices. The classical system bus connects the major units in the computer: the CPU, the main
memory, the I/O devices. For now, let’s assume there is just one system bus. (Later we will see
that more recent computers have a hierarchy of buses.)
[Figure: the system bus connecting the CPU, main memory, and I/O devices 1-6: hard disk drive,
display, mouse, CD/DVD drive, printer, and keyboard.]

The main advantage of the system bus is that it reduces the number of connections in the
computer. Rather than the CPU (or main memory) having a direct connection with many lines
each to each of the I/O devices, all the components can share the same lines. There are obviously
disadvantages to such a system bus as well. Because the bus is shared, it can only carry one signal
at a time. This tends to reduce performance because one component (CPU, an I/O device, main
memory) may need to wait until the system bus is free before it can put signals on the system bus.
Also, the various components need to communicate with each other to decide who gets to use the
system bus at any given time and who is talking to whom. These communications also take time.
Another disadvantage of the system bus is that it needs a clock speed that is much slower than
the CPU. Components that share the system bus are relatively far away from each other. The clock
speed of the system bus is constrained by the furthest (worst case) connection between components.
That is, any signal that is put on the bus has to travel between any two components that share the
bus in any single clock cycle. If we have one system bus shared by all components, then the clock
speed might have to be 100 MHz (a period of 10 ns), as opposed to a typical CPU clock speed of 1
GHz (a period of 1 ns).

last updated: 21st Mar, 2016. Lecture notes © Michael Langer


COMP 273 19 - system bus and I/O devices Mar. 21, 2016

To be concrete about the limits here, consider that light (or more generally, electromagnetic
waves i.e. the voltages that drive our gates) travel at a speed of 3 × 108 ms−1 . If we have a 3 GHz
processor, then this would mean that the voltages can travel at most 10cm per clock pulse, which
is smaller than the distances between many pairs of components in a desktop workstation. And
this doesn’t include the time needed for the components to read/write from/to the bus and for
circuits to stabilize. Hence, the system bus runs at a slower clock speed, giving more time for the
0/1 signals on the bus to stabilize.
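The arithmetic in the previous paragraph is easy to check. The following is a small Python sketch (not part of the original notes; the function name is invented) that computes how far a signal can travel in one clock period:

```python
# How far an electromagnetic signal can travel during one clock period.

C = 3e8  # propagation speed, m/s (speed of light)

def distance_per_cycle(clock_hz):
    """Maximum distance a signal travels in one clock period, in metres."""
    return C / clock_hz

# A 3 GHz CPU clock: about 0.1 m = 10 cm per pulse.
print(distance_per_cycle(3e9))    # 0.1
# A 100 MHz system bus clock: about 3 m per pulse.
print(distance_per_cycle(100e6))  # 3.0
```

This is why the bus, whose endpoints may be tens of centimetres apart, cannot be clocked anywhere near the CPU's rate.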
There are three components to the system bus. There is a data bus which carries data only.
There is an address bus which carries addresses only. And there is a control bus that carries control
signals only.
How is the system bus used? Let’s take an example: the lw instruction. Suppose that a cache
miss occurs and so a block containing the word must be brought from main memory into the cache.
Here is what we would like to happen.1

• The CPU puts the (physical) address of the new block onto the address bus.

• The CPU sets control signals (ReadMem = 1 and WriteMem = 0) on the control bus.

• The control signal causes a block to be read from main memory, i.e. main memory puts the
requested block onto the data bus.

• The CPU reads the data bus and the block is written into the data cache.

There is more to it than this, however. For example, it would take several system bus clock cycles
for this to happen since each word of the block needs to be sent. Moreover, the above assumes that
the system bus was not being used by any other component. But when you have a shared resource,
you are not guaranteed that it will be available. We will discuss how the system bus is shared next
lecture.
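The four bullet steps above can be sketched as a toy Python "transaction" (the dict-based bus, the addresses, and the block contents are all invented for illustration; real hardware does this with tri-state drivers on shared wires):

```python
# Toy model of the lw cache-refill bus transaction described above.
# 'bus' stands in for the shared address/control/data lines.

bus = {"address": None, "ReadMem": 0, "WriteMem": 0, "data": None}
main_memory = {0x1000: ["w0", "w1", "w2", "w3"]}  # one 4-word block
cache = {}

# Step 1-2: CPU puts the physical block address and control signals on the bus.
bus["address"], bus["ReadMem"], bus["WriteMem"] = 0x1000, 1, 0

# Step 3: main memory sees ReadMem = 1 and drives the block onto the data bus.
if bus["ReadMem"]:
    bus["data"] = main_memory[bus["address"]]

# Step 4: the CPU reads the data bus and fills the cache block.
cache[bus["address"]] = bus["data"]
print(cache)  # {4096: ['w0', 'w1', 'w2', 'w3']}
```

In reality each word of the block would be a separate bus cycle, as noted above.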

I/O controllers
Each of the I/O devices may have its own complicated electronics (and mechanics, in the case of a
printer or scanner). Each I/O device interfaces with an I/O controller which is inside the computer
case. The controller is responsible for reading/writing on the system bus. Each I/O controller might
have its own specialized processor e.g. own clock, its own special registers, its own special ROM
circuits, its own memory, and its own little program counter (PC).
Do not confuse I/O controllers with device drivers. A driver for an I/O device is a specialized
piece of software that knows all about the particular device and its I/O controller. In particular,
the driver controls the reads and writes for the registers and other circuits in the I/O controller.
The drivers are part of the operating system (kernel) and they sit in main memory or on disk.
Note that many I/O devices (printer, keyboard, mouse, monitor, etc) are external to the case of
the computer. We use the term peripheral to describe the hardware that is external to the computer,
that is, outside the case. These devices don’t plug directly into the system bus. Rather, each plugs
into a port. Each port is typically part of the I/O controller that sits on the motherboard.
1 We ignore the possible write-back, e.g. if the cache entry is dirty.


[Figure: the CPU and main memory attach directly to the system bus; the hard disk drive, DVD/CD drive, printer, keyboard, mouse, and display attach through their I/O controllers.]

Let’s next consider a few examples of how various components can share the system bus. Recall
the tri-state buffer circuit element from lecture 6. This was a mechanism for allowing many devices
to put signals on a single wire. We will also need to use tri-state buffers for the system bus. The
reason is that many devices are capable of putting signals onto the bus, but only one of them may
write to the bus at any time; the others must disconnect their outputs.

Example (input device): Keyboard


A keyboard is an input device that enters one byte of data at a time. Suppose there is a maximum
of 128 (2^7) I/O devices recognized by the CPU and that this keyboard happens to be I/O device
number 17. Suppose further that there is a control line on the system bus, called ReadIO, that is
set to 1 by the CPU when it tries to read from one of the I/O devices. Then, the keyboard would
put its byte of data on the bus if the ReadIO control signal is 1 AND the lower 7 bits of the address
on the address bus match the keyboard's I/O device number AND one of the keys was pressed.
To check that the address is correct, the keyboard controller compares a particular set of lines
on the address bus to the value in an IOdeviceID register (see figure). Since the keyboard is
device number 17 (say), the binary number 0010001 is stored there. See the sketch on
the next page.
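The keyboard's decision logic can be sketched in Python (the function and variable names are invented; the hardware would implement this with an equality comparator and an AND gate feeding the tri-state buffer's enable input):

```python
# Sketch of the keyboard controller's bus-drive condition described above.

IO_DEVICE_ID = 0b0010001  # device number 17, stored in the IOdeviceID register

def drives_data_bus(read_io, address_bus, key_pressed):
    """Return True when the keyboard should put its byte on the data bus."""
    address_match = (address_bus & 0b1111111) == IO_DEVICE_ID  # lower 7 bits
    return read_io == 1 and address_match and key_pressed

print(drives_data_bus(1, 0b0010001, True))  # True: selected and key pressed
print(drives_data_bus(1, 0b0000011, True))  # False: address names device 3
```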

Example (input device): mouse


Another example of an input device is a mouse. A mouse marks the (x,y) position of a cursor.
Dragging the mouse causes the (x,y) values to change. Moving to the right and left causes the x
value to increase and decrease, respectively. Moving forward and backward causes the y value to
increase and decrease, respectively. The mouse also has buttons and a scroll option.
The mouse itself contains electronics which convert the mechanical action into “packets” of data
which are sent to the mouse I/O controller. The controller (inside the computer case) decodes these
packets and writes the results into various registers. The I/O controller then communicates the
data with the system bus. I will sketch out more details on how this is done in coming lectures.


[Figure: the keyboard controller reads wires A0,...,A6 off the address bus and compares them with its IOdeviceID register (0010001). When ReadIO is set on the control bus and a key has been pressed, the keyboard's byte (ASCII) is driven onto the data bus.]

Output devices
We will look briefly at two types of output devices: printers and displays (monitors). The first
distinction to be aware of in thinking about the information sent to an output device is whether
the data describes how to make an image (called vector graphics) or whether the data describes a
rectangular grid of pixels, in particular, their intensities, i.e. RGB values (called raster graphics).

Example : printer
Many printers are built to understand instructions in the (Adobe) PostScript™ language. An
application such as Word or Adobe Reader or a web browser issues a print command. The PostScript
file to be printed is sent from main memory to the I/O (printer) controller. Then individual
PostScript instructions are sent from the controller to the printer. PostScript printers contain
software and hardware which together interpret the instructions and convert them into a rasterized
image (i.e. a rectangular grid of pixels). See the slides for some very basic examples of PostScript
instructions.
A printer is an output device, and so it does not put anything on the data bus. Hence, it does
not need a tri-state buffer. To check whether the data on the data bus is meant for the printer, a
similar mechanism could be used as in the keyboard example, i.e. read from the address bus and
control bus. The file containing the printer instructions (PostScript) would be put on the data bus.
The printer determines whether it should read from the data bus by looking at the address on
the address bus. Continuing with our earlier example, if the lower 7 bits match the address of this
printer, then the data is meant for this printer and is written to the printer. Otherwise, the printer
does not take data from the data bus.


Example: monitor
Our next example is the monitor. The monitor screen is a rectangular array of points called pixels
(picture elements). For each pixel (x,y), three numbers are stored which represent the R, G,
and B intensities of the point. Typically, one byte is used for each of these intensities. That is,
intensity codes go from 0 to 255 (bigger means brighter).
An example image size is 1024 x 768 pixels. If we use 3 bytes per pixel – one byte for each
of R, G, B – then we need about 2.4 MB of memory just for the image that is on your monitor at any
instant.
A very long time ago, in the 1980s, when the first wave of personal computers occurred, image
pixel values were stored in a part of main memory called the frame buffer (see figure below). That
is, your screen showed a large array from main memory. However, this was infeasible for all but
the simplest images. Most monitors present a new image every 1/60 of a second. (A minimum of
30 frames per second is required, otherwise the eye detects flicker and the video looks jittery.) If
the image size is about 1000 × 1000 and even if the screen were refreshed only every 1/30 of a second,
then this would still require on the order of 100 MB per second (considering RGB) to be put on the
system bus. This should make you concerned. The system bus would have to spend a large amount
of its time just transferring pixels, which is clearly unacceptable.
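The two back-of-the-envelope numbers above are easy to verify:

```python
# Checking the frame-buffer and bus-bandwidth arithmetic from the text.

bytes_per_pixel = 3  # one byte each for R, G, B

# Frame buffer for a 1024 x 768 display:
frame_buffer_bytes = 1024 * 768 * bytes_per_pixel
print(frame_buffer_bytes)   # 2359296 bytes, roughly 2.4 MB

# Bus traffic for a 1000 x 1000 image refreshed 30 times per second:
bytes_per_second = 1000 * 1000 * bytes_per_pixel * 30
print(bytes_per_second)     # 90000000 bytes/s, on the order of 100 MB/s
```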

[Figure: a dedicated video bus carries pixels from the frame buffer (a region of main memory) to the display, bypassing the system bus. The CPU, hard disk drive, DVD/CD drive, printer, and mouse remain on the system bus.]

One solution to the system bus traffic jam problem is to have a dedicated (read only) bus to
carry bytes from the frame buffer to the display controller. This video bus completely freed up the
system bus. However, even if we don’t use the system bus to transfer pixels to the video controller,
the pixel values still need to be computed for each frame and sent to the frame buffer. A long time
ago, the CPU was indeed responsible for making the images. However, the graphics capabilities
were weak since the CPU had so many other things to do. And it still had to use the system bus
to transmit these values to the framebuffer, which clogged up the system bus.


GPU (graphics processing unit)


The solution that evolved since the 1980s is to use a second processor called a graphics processing unit
(GPU). This is a specialized processor for making images, e.g. drawing polygons and many other
operations. (Take COMP 557 Fundamentals of Computer Graphics, if you wish to learn
more.) Rather than the CPU performing drawing operations itself and computing the RGB intensity
values of each pixel, these specialized computations are performed by the GPU. Only instructions
such as drawline(x1,x2,y1,y2,R,G,B) or drawTriangle(x1,x2,x3,y1,y2,y3,R,G,B) are sent to
the GPU over the system bus. Such instructions are similar in concept (but not in detail) to the
PostScript instructions mentioned earlier which were sent to the printer.
The GPU together with its private RAM and the framebuffer, are known as the graphics card
or video card. (See dashed line in figure below.) The GPU executes the instructions it is given by
drawing into the frame buffer, which is part of the graphics/video card.

[Figure: the graphics (video) card, shown as a dashed box, contains the GPU, its private RAM, the frame buffer, and the video controller that drives the display. It attaches to the system bus alongside the CPU, main memory, and I/O controllers 1,...,n.]

So, problem solved? Not entirely. Relieving the CPU of so much computational burden just
passes the problem on. The more operations we require the GPU to do, the more complicated the
GPU design must be. It needs its own registers and controls, its own floating point unit(s), its
own instructions. Modern graphics cards (video cards) contain thousands of processors and large
amounts of memory. They can perform general purpose massively parallel computation as well (if
you know how to program them!). But there is a price: the cards are expensive.

That was then... this is now...


The material I have covered in this lecture is appropriate for an introductory course, but it does
not even attempt to explain the current technologies. If you wish to read on your own about how
things work these days, check out the terms below:
• buses: PCI bus (1990s), PCI Express (since 2003)
• graphics cards: recent cards have thousands of processors on them, and (drop your jaw) can
even be programmed (see CUDA) so that your laptop is as powerful as recent supercomputers.
This is called GPGPU, for general purpose GPUs.



COMP 273 20 - memory mapped I/O, polling, DMA Mar. 23, 2016

This lecture we will look at several methods for sending data between an I/O device and either
main memory or the CPU. (Recall that we are considering the hard disk to be an I/O device.)

Programmed I/O: isolated vs. memory-mapped


When we discussed physical addresses of memory, we considered main memory and the hard disk.
The problem of indexing physical addresses is more general than that, though. Although I/O
controllers for the keyboard, mouse, printer, monitor are not memory devices per se, they do have
registers and local memories that need to be addressed too. We are not going to get into details
of particular I/O controllers and their registers, etc. Instead, we’ll keep the discussion at a general
(conceptual) level.
From the perspective of assembly language programming, there are several general methods for
addressing an I/O device (and the registers and memory within the I/O device).
When programming MIPS in MARS, you used syscall to do I/O. syscall causes your program to
branch to the kernel. The appropriate exception handler is then run, based on the code number
of the syscall. (You don't get to see the exception handler when you use the MARS simulator.)
MIPS syscall is an example of isolated I/O, where one has special instructions for I/O operations.
A second and more subtle method by which an assembly language program can address an
I/O device is called memory mapped I/O (MMIO). With memory-mapped I/O, the addresses of
the registers or memory in each I/O device are in a dedicated region of the kernel’s virtual address
space. This allows the same instructions to be used for I/O as are used for reading from and writing
to memory. (Real MIPS processors use MMIO, and use lw and sw to read and write, respectively,
as we will see soon.) The advantage of memory-mapped I/O is that it keeps the set of instructions
small. This is, as you know, one of the design goals of MIPS i.e. reduced instruction set computer
(RISC).
MARS does allow some tools for programming the kernel, so let’s briefly consider memory
mapped I/O in MARS, which is a very simple version of MMIO in MIPS. There is only one input
device (keyboard) and one output device (display). Both the input and output device each have
two registers. The addresses are in the kernel’s address space.

0xffff0000 input control register (only the two LSB’s are used)
0xffff0004 input data register (holds one byte)
0xffff0008 output control register (only the two LSB’s are used)
0xffff000c output data register (holds one byte)

The LSB of each control register indicates whether the corresponding data register is “ready”. In
the input case, ready means that a character has been typed at the keyboard and is sitting in the
low order byte of the data register. In the output case, ready means that the output data register
is free to receive a byte from the CPU. e.g. in the case of a (non-buffered) printer, the device might
not yet have printed the previous byte that was sitting in the register and hence might not be ready.
[Here in the notes I say the registers hold only one byte, whereas in the lecture and in the instructions
below I imply that they hold a word each. It doesn't really matter.]

last updated: 10th Apr, 2016. Lecture notes © Michael Langer



Let’s look at an example. Suppose your program should read bytes from the input device and
load each byte into register $s2. Here’s how the kernel might try to do it.

lui $t0, 0xffff
lw  $s2, 4($t0)    # load from address 0xffff0004

From the kernel programmer’s perspective, this is quite simple. However, from the underlying
hardware perspective, it is more complicated. The MIPS hardware must recognize that the address
is in the memory mapped I/O region and handle it properly. In particular, when the kernel program
executes the above lw instruction, the hardware must detect that the desired address does not
correspond to an item in the data cache but rather is some word in the I/O device’s memory. This
is similar to a cache miss, in that the process must stall and the kernel must use the system
bus to get something. But now it must get the desired word from the I/O device (input) instead of
a block from main memory (cache refill).
The CPU puts an address on the address lines of the system bus, e.g. the address of the I/O device,1
and sets certain control lines on the system bus to indicate that the CPU wants to read from that
device. The I/O device controller reads the bus (always) and sees the address and control and then
puts the requested data item onto the data bus.
I emphasize: this is the same general mechanism that would be used for a main memory access
in the case of a cache miss. Now it is an I/O device that provides the data to the CPU.
Similarly, for an output, the sw instruction would be used and the address where the word
should be stored would be within a reserved (memory mapped) region of the kernel’s address space.
The CPU would put that address on the bus (after translating this virtual address into a physical
address that indicates the I/O device number ) and set a write control signal on the control bus.
Again, the mechanism is similar to storing a word to main memory. The main difference is that
the addresses on the address bus indicates that the CPU is communicating with the I/O controller,
not with main memory.
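The address-decoding idea above can be sketched in Python. The MARS register addresses are the ones listed earlier; the dispatch function and the sample memory contents are invented for illustration (real hardware does this routing with comparators, not software):

```python
# Sketch: a lw is routed either to an I/O controller or to the memory
# system, depending on whether the address is in the MMIO region.

MMIO_BASE = 0xFFFF0000

io_registers = {            # the MARS memory-mapped registers listed above
    0xFFFF0000: 0b01,       # input control (ready bit set)
    0xFFFF0004: ord("A"),   # input data: one byte from the keyboard
    0xFFFF0008: 0b01,       # output control
    0xFFFF000C: 0,          # output data
}
main_memory = {0x10010000: 42}   # one ordinary data word

def load_word(addr):
    """Dispatch a lw to the I/O controller or to main memory/cache."""
    if addr >= MMIO_BASE:
        return io_registers[addr]   # bus transaction with the I/O controller
    return main_memory[addr]        # ordinary memory access (cache, RAM)

print(load_word(0xFFFF0004))  # 65: the byte in the input data register
print(load_word(0x10010000))  # 42: a normal data word
```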

Polling
One issue that arises with the above example is that it only makes sense to read from the input
device when the input device is ready, e.g. when there is a byte of data to be read. (This issue is
independent of whether one uses isolated I/O or memory mapped I/O.)
To solve this problem, one can use a method called polling. Before the CPU reads from the
input device, it checks the status of the input device to see if it is "ready," meaning that the CPU
checks the "ready" bit in the input control register 0xffff0000, which is bit 0 (the least significant
bit). In particular, the CPU needs to wait until this bit has the value 1. This can be implemented
in MIPS with a small polling loop.

      lui  $t0, 0xffff
Wait: lw   $t1, 0($t0)        # load from the input control register
      andi $t1, $t1, 0x0001   # reset (clear) all bits except LSB
      beq  $t1, $zero, Wait   # if not yet ready, then loop back
      lw   $s2, 4($t0)        # input device is ready, so read
1 and/or some register number or some offset within the local memory of that device


A similar polling loop is used for the output device:

      lui  $t0, 0xffff
Wait: lw   $t1, 8($t0)        # load the output control register
      andi $t1, $t1, 0x0001   # reset all bits except LSB
      beq  $t1, $zero, Wait   # if not ready, then loop back
      sw   $s2, 12($t0)       # output device is ready, so write

Obviously polling is not an efficient solution. The CPU would waste many cycles looping and
waiting for the ready bit to turn on. Imagine the CPU clock ticking along at 2 GHz while waiting for a
human to respond to some prompt and press <ENTER>. People are slow; the delay could easily
be several billion clock pulses.
This inefficiency problem is solved to some extent by limiting each process to some finite stretch
of time. There are various ways to implement it. The simplest would just be to use a finite for-loop,
instead of an infinite loop. The number of times you go through the loop might depend on various
factors, such as the number of other processes running and the importance of having the I/O done
as soon as the device is available.
I emphasize that, while this polling example uses memory-mapped I/O, it could have used
isolated I/O instead; i.e., polling isn't tied to either method.
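The bounded-polling idea (a finite loop instead of an infinite one) can be sketched in Python; the toy device dict and function names are invented, and a real CPU would of course run the MIPS loop shown above:

```python
# Sketch of bounded polling: spin on the device's ready bit, but give up
# after a fixed number of tries so the CPU can run another process.

def poll_and_read(device, max_tries=1000):
    """Return the device's data byte, or None if it never became ready."""
    for _ in range(max_tries):
        if device["control"] & 0x1:   # test the LSB "ready" bit
            return device["data"]     # ready: do the read
    return None                       # give up; let another process run

keyboard = {"control": 0b01, "data": ord("x")}
print(poll_and_read(keyboard))                   # 120 (ASCII 'x')
print(poll_and_read({"control": 0, "data": 0}))  # None: never became ready
```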

Example: buffered input with a keyboard


[Added April 10: While I am presenting this example of buffered input in the context of polling,
there is nothing in this example that restricts it to polling per se. Rather, the example illustrates
what buffering is, and why it is useful when there is a bus that is being shared. Buffering can be
used for other I/O schemes as well, e.g. DMA and interrupts. ]
Suppose a computer has a single console display window for the user to enter instructions. The
user types on a keyboard and the typed characters are echoed in the console (so that the user can
see if he/she made typing mistakes).
The user notices that sometimes there is a large time delay between when the user enters
characters and when the characters are displayed in the console, but that the console eventually
catches up. What is happening here?
First, assume that the keyboard controller has a large buffer – a circular array2 – where it
stores the values entered by the user. The controller has two registers, front and back, which hold
the number of characters that have been entered by the user and the number of characters that
have been read by the CPU, respectively. Each time a character is entered or read, the corresponding
register is incremented. Certain situations have special status, which could be detected (see slides
for details):
• The character buffer is "full" when the difference of these two indices is N, and in this case
keystrokes are ignored.

• The keyboard controller is "ready" to provide a character to the CPU when the difference of
the two indices is greater than 0.
2 In case you didn't cover it in COMP 250, a circular array is an array of size N where the index i can be any
positive integer, and the index is computed with i mod N.


In each case there might be a circuit which tests for these conditions.
With this buffer in mind, we can see what is causing the delay in echoing to the console. Suppose
the user is typing into the keyboard, but the system bus is being used by some I/O device. The
buffer starts to fill up, and the CPU only reads from the buffer when the bus is available for it to do
so. The CPU grabs a short time slice and quickly reads a character and writes it (to the
console, i.e. output), repeating until it empties the buffer, gives up the bus, or pauses
the process and continues with another process.
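Here is a small Python sketch of the controller's circular buffer, following the description above: front counts characters entered, back counts characters read, both only ever increase, and indexing uses mod N. (The buffer size and function names are invented; the controller would implement the two tests with small comparator circuits.)

```python
# Sketch of the keyboard controller's circular buffer.

N = 8                  # buffer capacity (tiny, for illustration)
buf = [None] * N
front = 0              # number of characters entered by the user
back = 0               # number of characters read by the CPU

def is_full():
    return front - back == N   # further keystrokes are ignored

def is_ready():
    return front - back > 0    # a character is waiting for the CPU

def keystroke(ch):
    global front
    if not is_full():
        buf[front % N] = ch    # circular indexing
        front += 1

def cpu_read():
    global back
    ch = buf[back % N]
    back += 1
    return ch

for ch in "hello":
    keystroke(ch)
print(is_ready(), cpu_read(), cpu_read())  # True h e
```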

Direct memory access (DMA)


Another mechanism for I/O is direct memory access (DMA). This is a specialized method which
involves communication between memory and an I/O device. The idea is for an I/O controller (say
the hard disk or the printer controller) to have its own specialized set of circuits, registers, and local
memory which can communicate with main memory via the system bus. Such an I/O controller
is called a DMA controller. The advantage is that it can take over much of the work of the CPU
in getting the data onto and off of the system bus and to/from main memory. The CPU would
first tell the DMA controller what it should do, and then the CPU can continue executing other
processes while the DMA controller uses the bus.
What does the CPU need to tell a DMA controller in order for the DMA controller to take over
this work? Take the example of a page fault. The kernel program (page fault handler) needs to
modify the page table, and it also needs to bring a page from the hard disk to main memory (and
maybe swap a page out too). How is this done?
Let’s take the case that a page is swapped out, which is what I sketched in class. The kernel
needs to specify the following parameters to the DMA controller of the hard disk:
• a physical page address in main memory to be swapped out
• a physical page address on the hard disk (where the page goes)
• control signals (e.g. copy one page from main memory to hard disk)
The DMA controller then initiates the memory transfer. It could do so by executing a sequence of
instructions, something like this:

initialize baseReg to the address where the page will be put in the local
    memory of the controller (the "disk cache")

for offset = 0 to pagesize - 1 {
    put address AddrMainMem on the address bus, and put control signals on the
        control bus so main memory writes a word onto the data bus
    read the word from the data bus and put it into local memory at
        address (baseReg + offset)
    AddrMainMem = AddrMainMem + 4    // 1 word
}
Then, when the entire page has been read, the page can be written to the hard disk itself. For this,
the controller uses the value AddrHardDisk, which would be the address where the page goes on the
hard disk. A similar sequence of instructions as above could be used for this.
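The controller's loop can be rendered as a Python sketch (a tiny 4-word "page", toy word-addressed memories, and invented names stand in for real RAM and the disk cache):

```python
# Sketch of the DMA controller's page copy, word by word, from main
# memory into its local memory (the "disk cache").

PAGE_SIZE = 4 * 4   # a tiny 4-word page, in bytes, for illustration

main_memory = {0x2000 + 4 * i: f"word{i}" for i in range(4)}
disk_cache = {}     # the DMA controller's local memory

def dma_copy_page(addr_main_mem, base_reg):
    """Copy one page from main memory to the disk cache, one word per cycle."""
    for offset in range(0, PAGE_SIZE, 4):
        bus_address = addr_main_mem + offset        # address bus
        word = main_memory[bus_address]             # memory drives the data bus
        disk_cache[base_reg + offset] = word        # controller reads it off
    # afterwards, the page in the disk cache is written to the disk itself

dma_copy_page(0x2000, base_reg=0)
print(disk_cache)  # {0: 'word0', 4: 'word1', 8: 'word2', 12: 'word3'}
```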


DMA: Bus request and bus granted


The next major issue to be discussed is how the DMA controller and the CPU coordinate who gets
to use the bus at a given time. When the CPU sends its instructions to the DMA controller (as
in the above example), the DMA controller might not be ready to execute the instructions right
away. For example, suppose a page fault occurs and the page fault handler requests a page to be
transferred from the hard disk DMA controller to main memory, i.e. it communicates this request
on the system bus. There is no need for the CPU to disconnect itself immediately from the system
bus after making the request since the HD controller first needs to get the page off the hard disk
before it can use the bus. If the hard disk controller is just waiting for the hard disk to spin
around, then there is no reason why it needs to have the system bus.
How then can the hard disk controller eventually take over the system bus once it is ready to
use it, namely after it has read the page from the disk and put it in its local memory? (Assume
there is only one I/O device (the hard disk) and hence only one DMA device. We’ll return to the
case of multiple DMA devices next lecture.) The hard disk/DMA controller needs to communicate
that it is ready to use the bus, and it needs to do so without using the system bus!
A DMA controller and the CPU communicate via two special control lines: BR (bus request)
which goes from the DMA controller to the CPU, and BG (bus granted) which goes from CPU to
DMA. If the CPU is using the bus, it sets BG to 0. Now, when the DMA controller is ready to
use the system bus, the DMA controller sets BR to 1. If the CPU doesn’t need the system bus, it
temporarily disconnects itself from the system bus (electronically speaking, it shuts off its tri-state
buffers and stops writing on the system bus) and it sets BG to 1. If the CPU is using the system
bus, it finishes using it before disconnecting from system bus and setting BG to 1.
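The handshake just described can be sketched in Python, with function calls standing in for wires (all names are invented; the real BR and BG are dedicated point-to-point lines, as discussed below):

```python
# Sketch of the BR/BG handshake: the DMA controller raises BR; the CPU
# finishes any transaction, disconnects from the bus (tri-state buffers
# off), and raises BG.

lines = {"BR": 0, "BG": 0}
cpu_driving_bus = True

def dma_request_bus():
    lines["BR"] = 1              # controller: "I am ready to use the bus"

def cpu_respond():
    global cpu_driving_bus
    if lines["BR"]:
        # finish any transaction in progress, then disconnect and grant
        cpu_driving_bus = False  # tri-state buffers shut off
        lines["BG"] = 1          # "bus granted"

dma_request_bus()
cpu_respond()
print(lines, cpu_driving_bus)  # {'BR': 1, 'BG': 1} False
```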

[Figure: the CPU, main memory, and I/O devices (hard disk drive, floppy drive, keyboard, printer, mouse) on the system bus. Two pairs of dedicated bus-request/bus-granted lines (BR1/BG1 and BR4/BG4) run point-to-point between the CPU and the two DMA controllers.]


I emphasize that BR and BG signals cannot be sent on the system bus. These signals are used
to decide who uses the system bus, and so they must always be available. The CPU must always be
able to communicate the value of BG to the DMA controller, and the DMA controller must always be
able to communicate the value of BR to the CPU. Separate direct lines ("point-to-point") between
the CPU and the DMA controller are needed to carry these signals.
[ASIDE: The above figure shows two DMA devices, say the printer and the hard drive. I have
labelled two sets of BR and BG signals. The figure also shows that the I/O controllers have their
own registers and little RAMs. These registers have physical addresses (I/O device number and
register number or memory address number within that device). In a memory mapped system,
these registers and memory would have corresponding virtual addresses which would need to be
translated.]

Hard disk
A few final comments about how to improve hard disk performance....
Recall that the hard disk controller has a local memory (DRAM), sometimes called the disk
cache. When a page is read from disk, it is first put into this disk cache, and then transferred to
main memory in a subsequent stage. Similarly, when a page is brought from main memory to the
hard disk controller, it is put in the disk cache and then later moved to the disk itself (as was
described earlier in the lecture).
To improve performance when reading a page from the hard disk, neighboring physical pages
could be read as well. This is easy to do because once the hard disk has spun around to the position
of the page, the controller can easily read off neighboring pages too. Reading neighboring pages on
disk would be useful when adjacent virtual pages are stored on the hard disk as adjacent physical
pages. For example, a file might consist of multiple pages – it would be better to store them side
by side so that they can be accessed at the same time.
You might have heard of disk fragmentation. For example, suppose you are storing a file on your
hard disk. The file is an application which consists of many pages of source code. When you first
put the file on the hard disk, it would be nice to keep all the pages in order and at consecutive page
addresses on the disk for the reason I mentioned above. However, this might not be possible since
the disk might not have a big empty region available. The disk might contain lots of free space, but
it might consist of many holes, e.g. where previous files had been deleted. You can imagine that
after years of use, with many file adds and deletes and many page swaps, the number of large empty
gaps decreases and the number of small gaps increases. This is fragmentation. The disk doesn't
work as well when it is fragmented, because the trick mentioned in the previous paragraph is not
as effective.
Finally, another way to improve performance of the disk arises when there are several read/write
requests for data on different parts of the disk. Rather than performing the read/write in the order
of request, which might take multiple spins of the disk, the hard disk controller could sort the list of
pages so that it could access them in one spin. For example, suppose requests were made for pages
3, 7, 5, 4 at a time when the head is at the position for page 2. Rather than doing 3, 7 and then spinning
all the way around for 5 and then all the way around again for 4, it could serve the requests in order
3, 4, 5, 7 in a single spin of the disk.
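This reordering can be sketched in a few lines of Python (the function name is invented; real controllers use more elaborate scheduling, but the idea is the same):

```python
# Sketch: order pending page requests so that all are serviced in a
# single pass of the disk, starting from the current head position.

def one_spin_order(head, requests):
    """Order requests for one pass from 'head', wrapping around once."""
    ahead = sorted(r for r in requests if r >= head)   # reachable this spin
    behind = sorted(r for r in requests if r < head)   # picked up after wrap
    return ahead + behind

print(one_spin_order(2, [3, 7, 5, 4]))  # [3, 4, 5, 7]
```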



COMP 273 21 - I/O, interrupts, exceptions April 3, 2016

In this lecture, I will elaborate on some of the details of the past few weeks, and attempt to pull
some threads together.

System Bus and Memory (cache, RAM, HDD)


We first return to a topic we discussed in lectures 16 to 18. There we examined the mechanism
of a TLB access and cache access and considered what happens if there is a TLB miss or cache
miss. We also examined what happens when main memory does not contain a desired word (page
fault). These events cause exceptions, and hence a jump to the kernel where they are handled by
exception handlers. Let’s quickly review these events, adding in a few details of how the system
bus is managed and used.
When there is a TLB miss, the TLB miss handler loads the appropriate entry from the appro-
priate page table, and examines this entry. There are two cases: either the page valid bit in that
entry is 1 or the page valid bit is 0. The value of the bit indicates whether the page is in main
memory or not.
If the page valid bit is 1, then the page is in main memory. The TLB miss handler copies the
translation from the page table to the TLB, sets the TLB entry’s valid bit, and then control is given back
to the user program. (This program again accesses the TLB as it tried to do before, and now the
virtual → physical translation is present: a TLB hit!)
From the discussion of the system bus last lecture, we now understand that, for the TLB miss
handler to get the translation from the page table, the CPU needs to access the system bus. The
page table entry is retrieved from the page table in main memory, and brought over to the CPU where
it can be analyzed. (Details of how that is done are left unspecified here.)
What if the page valid bit was 0? In that case, a page fault occurs. That is, the page is not in
main memory and it needs to be copied from the hard disk to main memory (very slow). The TLB
miss handler jumps to the page fault handler, which arranges that the desired page is brought
into main memory.¹
After the discussion last lecture, we now understand that the page fault handler tells the hard
disk controller (a DMA device) which page should be moved, and to where. The page fault handler
then pauses the process and saves the process state, and the kernel switches to some other process.
The disk controller meanwhile starts getting the page off the hard disk. Once it has done so, it will
need to use the system bus to transfer the page to RAM. (Recall how DMA works.) After it has
finished moving the page to RAM, it sends an interrupt request (see below) to the CPU and tells
the CPU that it is done.
The CPU will eventually allow the process to continue. The page fault handler will need to
update the page table to indicate that the new page is indeed there. It will then jump back to the
TLB miss handler. Now, the requested page is in main memory and the page valid bit is on, so the
TLB miss handler can copy the page table entry into the TLB and return control to the original
process. Again, note that these main memory accesses require the system bus.
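The control flow just described can be summarized in a short sketch. This is simulation-style Python, not kernel code; the names (`swap_in`, the dictionary fields) are invented here for illustration.

```python
def handle_tlb_miss(vpn, page_table, tlb, swap_in):
    """On a TLB miss: fetch the page table entry (a main memory access
    over the system bus), handle a page fault if the page valid bit is 0,
    then copy the translation into the TLB so the retry will hit."""
    entry = page_table[vpn]
    if not entry["valid"]:               # page fault: page is only on disk
        entry["frame"] = swap_in(vpn)    # DMA copies the page into RAM
        entry["valid"] = True            # page table now marks it present
    tlb[vpn] = entry["frame"]            # translation cached in the TLB
    return tlb[vpn]

page_table = {7: {"valid": False, "frame": None}}
tlb = {}
frame = handle_tlb_miss(7, page_table, tlb, swap_in=lambda vpn: 42)
print(frame, page_table[7]["valid"], tlb)  # 42 True {7: 42}
```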
[ASIDE: I did not fill in all these details in the lecture slides, but rather gave only a tree diagram
summarizing the steps. I am not expecting you to memorize all the details above. Just understand
the steps and the order. ]

¹ We ignore the step of possibly swapping a page out of main memory.

last updated: 1st Apr, 2016. lecture notes © Michael Langer



Last lecture we looked at polling and DMA as a way for the CPU and I/O to coordinate their
actions. Polling is simple, but it is inefficient since the CPU typically asks many times if the device
is ready before the device is indeed ready. Today we’ll look at another way for devices to share the
system bus, called interrupts.

Interrupts
With interrupts, an I/O device gains control of the bus by making an interrupt request. Interrupts
are similar to DMA bus requests in some ways, but there are important differences. With a DMA
bus request, the DMA device asks the CPU to disconnect itself from the system bus so that the
DMA device can use it. The purpose of an interrupt is different. When an I/O device makes
an interrupt request, it is asking the CPU to take specific action, namely to run a specific kernel
program: an interrupt handler. Think of DMA as saying to the CPU, “Can you get off the bus so
that I can use it?,” whereas an interrupt says to the CPU “Can you stop what you are doing and
instead do something for me?”
Interrupts can occur from both input and output devices. A typical example is a mouse click or
drag, or a keyboard press. Output interrupts can also occur: e.g. when a printer runs out of paper,
it tells the CPU so that the CPU can send a message to the user, e.g. via the console.
There are several questions about interrupts that we need to examine:
• how does an I/O device make an interrupt request?
• how are interrupt requests from multiple I/O devices coordinated?
• what happens from the CPU perspective when an I/O device makes an interrupt request?
I’ll deal with these questions one by one.
The mechanism by which an I/O device makes an interrupt request is similar to what we saw
in DMA with bus request and bus granted. The I/O device makes an interrupt request using a
control signal commonly called IRQ. The I/O device sets IRQ to 1. If the CPU does not ignore
the interrupt request (in certain situations, the CPU does ignore interrupt requests), then the
CPU sets a control signal IACK to 1, where IACK stands for interrupt acknowledge. The CPU also
stops writing on the system bus, by setting its tristate gates to off. The I/O device then observes
that IACK is 1, which means that it can use the system bus.
Often there is more than one I/O device, and so more than one type of interrupt request can
occur. One could have a separate IRQ and IACK line for each I/O device. This requires a
large number of dedicated lines and places on the CPU the burden of administering all these lines.
(This is not such a problem in practice, and many modern processors do use point-to-point connections
between components.) Another method is to have the I/O devices all share the IRQ line to the CPU.
They could all feed a line into one big OR gate. If any I/O device requests an interrupt, then
the output of the OR gate would be 1. How then would the CPU decide whether to allow the
interrupt?
One way is for the CPU to use the system bus to ask each I/O device, one by one, whether it
requested the interrupt. It could address each I/O device and ask “did
you request the interrupt?” Each I/O device would then have one system bus cycle to answer yes.
This is a form of polling (although there is no infinite loop here as there was when I discussed
polling last lecture).


Daisy chaining
A more sophisticated method is to have the I/O devices coordinate who gets to interrupt the CPU
at any time. Suppose the I/O devices have a priority ordering, such that a lower priority device
cannot interrupt the CPU when the CPU is servicing the interrupt request from a higher priority
device. This can be implemented with a classical method known as daisy chaining. As described
above, the IRQ lines from each device meet at a single OR gate, and the output of this OR gate is
sent to the CPU as a single IRQ line. The I/O devices are then physically ordered by priority. Each
I/O device would have an IACKin and IACKout line. The IACKin line of the highest priority I/O
device would be the IACK line from the CPU. The IACKout line of one I/O device is the IACKin
line of the next lower priority I/O device. There would be no IACKout line from the lowest priority
device.
[Figure: daisy chaining. The CPU’s IACK line is the IACKin of I/O device 1 (highest priority); each device’s IACKout (IACK2, IACK3, IACK4) is the IACKin of the next lower priority device (I/O 2, 3, 4). The IRQ lines of all four devices are ORed into the single IRQ line to the CPU. The CPU, main memory, and the I/O devices are all connected to the system bus.]

Here is how daisy chaining works. At any time, any I/O device can interrupt the CPU by setting
its IRQ line to 1. The CPU can acknowledge that it has received an interrupt request or it can
ignore it. To acknowledge the interrupt request, it sets IACK to 1. The IACK signal gets sent to
the highest priority I/O device.
Let’s start by supposing that all IRQ and IACK signals are 0. Now suppose that an interrupt
request is made (IRQ = 1) by some device, and the CPU responds by setting the IACK line to 1. If
the highest priority device had requested the interrupt, then it leaves its IACKout at 0. Otherwise,
it sets IACKout to 1 i.e. passing the CPU’s IACK=1 signal down to the second highest priority
device. That device does the same thing, and so on. Whenever IACKin switches from 0 to 1, the
device either leaves its IACKout = 0 (if this device requested an interrupt) or it sets IACKout = 1
(if it did not request an interrupt).
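The propagation rule above can be sketched in a few lines. This is a conceptual sketch (the real mechanism is combinational logic in each device, not software): a device that requested an interrupt keeps its IACKout at 0, otherwise it passes its IACKin along.

```python
def daisy_chain(iack_cpu, irq):
    """irq lists each device's request bit, highest priority first.
    Returns, per device, whether it is granted the acknowledge."""
    iack_in = iack_cpu
    granted = []
    for requested in irq:
        granted.append(iack_in and requested)  # this device wins the IACK
        iack_in = iack_in and not requested    # IACKout passed to the next
    return granted

# Devices 2 and 4 request; only device 2 (higher priority) is granted.
print(daisy_chain(True, [False, True, False, True]))
# [False, True, False, False]
```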
Let me explain the last sentence in more detail. I said that the I/O device has to see IACKin
switch from 0 to 1. That is, it has to see that the CPU previously was not acknowledging an


interrupt, but now is acknowledging an interrupt. This condition (observing the transition from 0
to 1) is used to prevent two I/O devices from simultaneously getting write access to the system bus.
What if the CPU is servicing the interrupt request from a lower priority device, and a higher
priority device wishes to interrupt the CPU? With daisy chaining, the IRQ from the higher priority
device cannot be distinguished from that of the lower priority device since both feed into the same
OR gate. Instead, the higher priority device changes its own IACKout signal to 0. This 0 value
gets passed on to the lower priority device. The lower priority device doesn’t know why its IACKin
was just set to 0. It could be because a higher priority device wants to make an interrupt request,
or it could be because the CPU has turned off its IACK signal. Whichever of these occurs doesn’t
matter. The lower priority device just needs to know that its interrupt time is now over. It finishes
up as quickly as it can, and then sets its IRQ to 0. Once it is done, the CPU sets its IACK to 0.
Once the higher priority device sees that the CPU has set IACK to 0, it can then make its interrupt
request.
How does the CPU know which device made the interrupt request? When the CPU sets IACK
to 1, it also frees up the system bus. (It “tristates itself.”) When the I/O device that made the
interrupt request observes the IACK 0-to-1 transition, this device then identifies itself by writing
its address (or some other identifier) on the system bus. The CPU reads in this address and takes
the appropriate action. For example, if the device has a low priority, then the CPU may decide to
ignore the interrupt and immediately set IACK to 0 again.
Think of the sequence as follows. “Knock knock” (IRQ 0-to-1). “Who’s there and what do you
want?” (IACK 0-to-1), and the CPU tri-states itself from the system bus so that it can listen to the answer. “I/O
device number 6 and I want you to blablabla” (written by the I/O device onto the system bus). If the CPU
then sets IACK 1-to-0, this means that it won’t service this request. In this case, the I/O device
has to try again later. If the CPU doesn’t set IACK 1-to-0, then it may send back a message on
the system bus. (The I/O device needs to tri-state after making its request, so that it can listen to
the CPU’s response.)
Daisy chaining is just one way of coordinating interrupt requests. But it is not the only way, and
often the CPU does the work of figuring out which interrupt request to handle based on priorities.
Let’s now put daisy chaining aside and think about interrupts from the CPU perspective.

Interrupt and exception handlers (kernel)


When a user program is running and an interrupt request occurs and the CPU has enabled inter-
rupts, the current process branches to the exception handler, which in MIPS is located at 0x80000080
i.e. in the kernel. The kernel then examines the Cause and Status registers to see what caused
the exception. It can determine that an interrupt request occurred from the Cause register. It then
examines the interrupt enable bits to see if it should accept this interrupt and, if so, it branches to
the appropriate exception handler. (Note that an interrupt is just another kind of exception.)
Because there is a jump to another part of code, interrupts are similar to function calls. With
functions, the caller needs to store certain registers on a stack so that when the callee returns, these
registers have their correct values. Similarly, when an interrupt (more generally, an exception)
occurs, the kernel needs to save values in registers ($0-$31, $f0-$f31, EPC, PC, Status, etc) that
are being used by that process.
Note that interrupts can be nested. This is reminiscent of recursive functions. However, the
situation is slightly different here since interrupt requests can occur at any time, including while the


interrupt handler is in the middle of saving the registers from the previous interrupt. For example,
when saving the state information for a process, it would be bad for the kernel to be interrupted
and to jump to yet another interrupt handler. So, for certain kernel routines, the processor can put
itself into a non-interruptible state. Another example is that a lower priority I/O device should not
be allowed to interrupt the interrupt handler of a higher priority device, whereas a higher priority
device should be allowed to interrupt a lower priority device’s interrupt handler.
In summary, here’s the basic idea of what the kernel (interrupt handler) needs to do:

1. Disable all interrupts. (Don’t acknowledge any interrupts.)

2. Save the process state (registers, etc).

3. Enable higher priority interrupts.

4. Service the interrupt.

5. Restore process state (restore values in registers)

6. Return from interrupt, and reset previous interrupt enable level
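The six steps can be laid out as a skeleton. This is only an ordering sketch in Python (a real handler is kernel assembly); the callback names and level labels are invented for illustration.

```python
def interrupt_entry(save_state, service, restore_state,
                    set_level, previous_level):
    """Order of operations for the kernel's interrupt entry (sketch)."""
    set_level("none")              # 1. disable all interrupts
    saved = save_state()           # 2. save registers, EPC, Status, ...
    set_level("higher-only")       # 3. enable higher priority interrupts
    service()                      # 4. service the interrupt
    restore_state(saved)           # 5. restore the process state
    set_level(previous_level)      # 6. reset the previous enable level

log = []
interrupt_entry(
    save_state=lambda: (log.append("save"), "regs")[1],
    service=lambda: log.append("service"),
    restore_state=lambda s: log.append("restore " + s),
    set_level=lambda lvl: log.append("level=" + lvl),
    previous_level="all")
print(log)
```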

Interrupts in MIPS
Let’s look at a few details about how MIPS treats interrupts, and how the Cause and Status registers
are used.
The general interrupt enable bit is the LSB of the Status register. To turn this bit on, we use:

mfc0 $k0, $12        # the Status register is $12 in coprocessor 0
ori  $k0, $k0, 0x01  # turn on the LSB, which is the IE bit
mtc0 $k0, $12        # write the result back to the Status register

To disable this interrupt, we could turn off this bit.

mfc0 $k0, $12
lui  $1, 0xffff
ori  $1, $1, 0xfffe  # sets a mask that is all 1’s except the LSB
and  $k0, $k0, $1    # turns off the LSB
mtc0 $k0, $12

Note that this code uses the register $k0, which is register 26 in the main register array. User
programs are not allowed to use this register; only kernel programs can use it. (An error will
occur if a user instruction uses this register.)
Finally, there are particular bits in the Status register which indicate whether interrupts of a
particular priority are enabled, and there are corresponding bits in the Cause register which turn on
when an interrupt request of particular priority has been made. The kernel would need to compare
bits in these two registers to decide if a particular interrupt request should be serviced. (Details
omitted.)



COMP 273 22 - asynchronous buses April 5, 2016

All the I/O examples we have discussed use the system bus to send data between the CPU, main
memory, and I/O controllers. The system bus runs at a slower clock speed than the CPU because
of greater distances involved between the various components, but there is still a clock governing
the reading and writing to the bus. The sender and receiver of any bits written to the system bus
use the same system bus clock, e.g. writes are performed at the end of the clock pulse. For this
reason, the system bus is called synchronous.
Synchronous buses don’t work well when the distances between the sender and receiver are too
large, however, such as when data is being sent from an I/O controller to a peripheral device. The
clock signals themselves need to travel, and so the assumption of events happening “at a given
time” becomes problematic. When the distances between sender and receiver are too large (say 1
meter), the precise timing of the clock cannot be relied on in the same way as before. When the
bus itself is not controlled by a clock, one says the bus is asynchronous. Timing events are still
important though. Today we’ll look at two ways to achieve timing on such buses.

Handshaking
How can data be sent on a bus without the sender and receiver synchronizing on the same clock?
Let’s take the case of an I/O controller (e.g. a printer controller, i.e. inside the computer case) that
communicates data to an I/O peripheral device (i.e. the printer itself, attached to a cable several
metres long). The asynchronous transfer can be implemented with a method called handshaking.
In general, there are two types of handshaking, depending on whether the data source (sender)
or destination (receiver) initiates the transfer. For the printer example, the source/sender is the I/O
controller and the destination/receiver is the printer.
In source-initiated handshaking, the sender (a) puts data on the data line and shortly afterwards
(b) sets a control signal called data request, and holds both signals. The receiver reads the control
signal, then reads the data, and then (c) responds with a data acknowledge control signal. The
sender receives the acknowledgement and (d) stops writing the data, and resets the data request
line to 0. The receiver then (e) sets the data acknowledge to 0.

[Timing diagram: three signals over time: data send request (sender to receiver), data (set by sender), and data acknowledge (receiver to sender), with the events (a) through (e) marked.]

The hexagonal symbol in the middle row indicates the interval during which the data is on the
line. The symbol is used here to indicate that it doesn’t matter whether the data is a 1 or a 0 (but
it must be one of them during that interval). Also note that this data line can consist of multiple
parallel lines; it doesn’t have to be just a single line as indicated in the figure.
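The five events (a) through (e) can be traced in a small simulation. This is a sketch only, with the bus modelled as a dictionary of signal values; no shared clock is involved, which is the whole point.

```python
def source_initiated_transfer(data):
    """Trace of source-initiated handshaking, events (a)-(e)."""
    bus = {"data": None, "request": 0, "ack": 0}
    bus["data"] = data          # (a) sender drives the data line
    bus["request"] = 1          # (b) ... then raises data request
    latched = bus["data"]       # (c) receiver sees the request, reads the
    bus["ack"] = 1              #     data, and raises data acknowledge
    bus["data"] = None          # (d) sender sees the ack: stops driving
    bus["request"] = 0          #     the data and drops the request
    bus["ack"] = 0              # (e) receiver drops the acknowledge
    return latched, bus

latched, bus = source_initiated_transfer(0b1011)
print(bin(latched), bus)  # 0b1011 {'data': None, 'request': 0, 'ack': 0}
```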
Alternatively, the receiver (destination) can initiate a data transfer. It does so by (a) setting its
data request control signal. The sender reads this signal, then (b) puts the data on the data line,

last updated: 9th Apr, 2016. lecture notes © Michael Langer



[Timing diagram: data request (receiver to sender), data (set by sender), and data ready (sender to receiver), with the events (a) through (e) marked.]

and shortly afterwards – to make sure the data signal can be read – it (c) sets a data ready signal.
The receiver reads the data ready signal and then reads the data, then (d) the receiver resets the
data request signal (off). The sender then sees that the request has been turned off so it (e) turns
the data ready signal off and stops writing the data.
I emphasize that no clock is involved here i.e. it is asynchronous. One side waits until the other
side is ready. The sender and receiver might each have their own clocks, but there is no shared clock
used for the bus.

Example: parallel port (printer)


An example of the above scheme was used by the standard “parallel port” (called Centronics), which
is a D-shaped port that was used to connect a printer to a computer. See lecture slides. It has 25
holes (1-13 and 14-25) into which 25 wires can fit. For example, the source ready (#1) and data
(#2-#9) pins are output only, i.e. from the controller to the printer. The ACK (#10) and Busy
(#11) pins are input only, i.e. from the printer to the controller. Another input pin is the “Paper
Out” pin (#12), indicating there is no paper left!
Centronics used a source-initiated handshaking method. Suppose the printer controller tries to
write data to the printer. It turns on wire #1 for 100 ns (a relatively long time) and then puts
a byte of data on wires #2-#9. The printer observes the source ready signal on wire #1. If the
printer is ready, it reads the data (from wires #2-#9). After completing the read, the printer sets
the ACK wire (#10) for 400 ns (a long time). Once the controller sees the ACK wire is set, it stops
sending the data and sets wire #1 to 0. If, however, the printer is not yet ready for the data, i.e.
if it cannot keep up with successive data ready signals, it sends a BUSY signal (wire #11), and the
printer controller waits for some time before trying again.

Asynchronous shared bus


Although I described this asynchronous scheme for the case of an I/O controller communicating
with a single I/O device, it is possible in principle to use the same scheme for a bus where several
devices could be sharing lines.
Consider the sender-initiated case. In the figure (see lecture slides), I have indicated when the sender uses
part of the bus in red and when the receiver uses part of the bus in blue. First, the sender puts the
data, the address (i.e. the targeted receiver), and the command (control) on the appropriate lines. It then
waits a short time and turns on the data ready signal. The devices read these signals. The receiver,
in particular, sees from the address bus that it is the target for this signal. Once it has read the

last updated: 9th Apr, 2016 2 lecture notes c Michael Langer


COMP 273 22 - asynchronous buses April 5, 2016

information, it turns on a ’receiver acknowledge’ (ACK) signal (some other line). This signals that
it has read the information off the bus. The sender sees this, then stops writing on the bus (and
the bus then has unspecified values). The receiver eventually drops its ACK signal to 0.

In the receiver-initiated case, the receiver puts the address (i.e. the targeted source) and the command
(control) on the appropriate lines. Then, the receiver waits and gives the signal a chance to stabilize
over the whole line. The devices decode the signals. The source/sender, reading the address line,
sees that it is the target. So it puts the requested data on the bus and shortly after it sets the ’data
ready’ signal. The receiver sees the ready signal and reads the data. When the receiver is done, it
resets the data request line and stops writing on the address and control lines (and so the values
on these lines are unspecified). When the sender sees that the ’data request’ signal has been turned
off, the sender stops putting the data on the line, and it turns off the data ready signal.

Notice that when the bus is shared, there needs to be some scheme to prevent multiple devices from
writing on the bus at the same time. This could be achieved using a scheme similar to what we saw with
the (clocked) system bus, e.g. dedicated IRQ/IACK lines with one party (the CPU in that case)
controlling who is allowed to write.


Serial bus
For all of the buses we have discussed up to now, we have thought of the signals on the bus as being
constant over each line. That is, each wire of the bus held one value, and a device only read from the
bus once that value had stabilized over the whole bus. A serial bus is completely different. With a
serial bus, you should think of a sequence of signal pulses travelling down a wire. To read from the
serial bus, you need to time the reads with the pulses as they pass. A clock is still necessary since
the pulses must all be the same duration, and this duration needs to be agreed upon in advance.
The “speed” (or frequency) of a serial bus is measured in bits per second (baud). This is the
number of bits or pulses that passes any given point on the bus per second. If data is sent at a
certain speed (e.g. 56,000 bps in the case of a very old modem) then each bit has a duration of
1/56,000 seconds, which is about 18,000 ns. If you have a particular physical cable, then you could
think of each bit as occupying a certain spatial extent on the cable. Or, if you measure the signal
at a particular point on the bus, then a particular value can be measured at that point for this
duration.
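The bit-duration arithmetic is just the reciprocal of the bit rate, converted to nanoseconds:

```python
def bit_duration_ns(bits_per_second):
    """Duration of one bit on a serial line, in nanoseconds."""
    return 1e9 / bits_per_second

# 56,000 bps: each bit lasts roughly 18 microseconds.
print(round(bit_duration_ns(56_000)))  # 17857
```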
For a 56 Kbps modem, the speed is much less (by a factor of several thousand!) than the clock
speed of your system bus. So, you should think of the clock speed of the system bus as being a
very fine sampling relative to the bits on the serial bus. (And the CPU clock speed is another factor
of ten faster than the system bus clock speed.)
How does the receiver read from a serial bus? Assume for the moment that the receiver and
sender have agreed on:

• the duration T of one bit

• number of bits to be transmitted (say 8, i.e. one byte)

The sender puts a 1 signal on the wire for the normal state (non-transmitting). Then the sender
switches to 0 to indicate that a sequence of bits is to follow. This initial 0 has the same duration T
as each of the bits that follow, and is called the start bit. The start bit indicates that a byte is about
to be sent, but otherwise contains no information. The receiver waits .5 T and checks to make sure
that the signal is still 0. This is a check that the switch to 0 wasn’t just electronic noise.
The receiver then waits 1 T and samples the signal, and repeatedly samples every T for as
many bits as are expected (8). Note that by waiting .5 T and regularly sampling every T after that,
it samples from the middle of each pulse, i.e. avoiding the transitions at the boundaries of the
pulse. When it has all 8 bits of data, it reads a “stop bit” (value 1). At least one stop bit is there
to ensure some spacing between bytes. In particular, the receiver can only read in the next byte
once it notices a transition from 1 to 0, that is, when it notices the next start bit. To guarantee
this happens, the line must be in the 1 state between bytes.
Note that the sender and receiver each have their own clock, typically of different speeds, and the
pulse duration of these clocks is much shorter than T.
For example, if the byte to be sent is 10011101, then the sequence looks as follows. (Here
the wave of pulses is travelling to the left. In the slides, I showed it travelling to the right.) For this
example, the bits are sent from MSB to LSB (though one could send them in the other order too). The
receiver reads the start bit first, followed by the eight bits of the sequence, followed by the stop bit.


[Figure: the serial frame for the byte 10011101: a start bit (0), the data bits 1 0 0 1 1 1 0 1, and a stop bit (1).]
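The receiver’s sampling rule can be sketched with the line modelled as a function of time (t = 0 at the 1-to-0 edge of the start bit, T = 1 time unit). A real UART does this in hardware; the Python here just shows the timing arithmetic.

```python
def receive_byte(line, T):
    """Check the start bit at 0.5T (noise rejection), sample the 8 data
    bits at 1.5T, 2.5T, ..., 8.5T (mid-pulse), then expect the stop bit."""
    if line(0.5 * T) != 0:
        raise ValueError("start bit vanished: treat the edge as noise")
    bits = [line((1.5 + k) * T) for k in range(8)]
    if line(9.5 * T) != 1:
        raise ValueError("missing stop bit")
    return bits

# Frame for 10011101 sent MSB first: start bit, eight data bits, stop bit.
frame = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
line = lambda t: frame[int(t)] if t < len(frame) else 1
print(receive_byte(line, T=1))  # [1, 0, 0, 1, 1, 1, 0, 1]
```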

How do messages get started? Messages often have headers and text. To indicate boundaries, the
sender sends a special ASCII value:

• 0x01, called SOH (start of header)

• 0x02, called STX (start of text)

• 0x03, called ETX (end of text)

• 0x04 called EOT (end of transmission)

In some systems there is an extra mechanism for checking whether errors have occurred in the
transmission. Errors can occur because lines are physical and sometimes have noise. One simple way
to check for errors is for the sender to append an extra bit to the character – called a parity bit –
such that the total number of 1 bits in the code is even. This is called even parity. If the character
has an odd number of 1 bits, then the parity bit is set to 1, so that the total number of 1’s is even.
If the ASCII character has an even number of 1 bits, then the parity bit is set to 0. Note it could
happen that there are two errors (two of the ASCII bits are incorrect), and in this case, the errors
would not be detectable.
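Even parity, and its blindness to double errors, can be checked in a few lines:

```python
def add_even_parity(bits):
    """Append a parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(bits):
    """True iff the received word has an even number of 1s."""
    return sum(bits) % 2 == 0

word = add_even_parity([1, 0, 0, 1, 1, 0, 1])        # four 1s -> parity 0
flipped_one = [1 - word[0]] + word[1:]               # single error: caught
flipped_two = [1 - word[0], 1 - word[1]] + word[2:]  # double error: missed
print(parity_ok(word), parity_ok(flipped_one), parity_ok(flipped_two))
# True False True
```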

UART
What sort of hardware is needed by the sender and receiver? The sender needs to convert bytes
to bit sequences. The receiver needs to convert bit sequences into bytes. Thus, the sender needs
a parallel-in serial-out shift register. This is just what it sounds like. For example, consider an I/O
controller for a printer which uses a serial line. It takes one byte in parallel off the system bus, and
then writes the bits one by one on a serial line. It could write either the least significant bit first (in
the case of shift right) or the most significant bit first (in the case of shift left). Similarly, the printer
(receiver) needs a serial-in parallel-out shift register. It reads the bits one by one into a register,
and then outputs the bytes as a unit into another register which is used for processing the byte.
The UART (Universal Asynchronous Receiver/Transmitter) is an I/O controller that converts
serial to parallel (input) or parallel to serial (output). A UART is more than just the shift registers
mentioned above, though. It has several registers so that the CPU or I/O device can communicate
with it. It also needs special circuitry to remove the start and stop bits and to check for parity.
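The two shift registers can be sketched as a serialize/deserialize pair. This is a behavioral sketch of what the hardware does, not UART register programming:

```python
def serialize(byte, msb_first=True):
    """Parallel-in serial-out: emit the byte one bit at a time."""
    order = range(7, -1, -1) if msb_first else range(8)
    return [(byte >> i) & 1 for i in order]

def deserialize(bits, msb_first=True):
    """Serial-in parallel-out: shift incoming bits into a byte."""
    byte = 0
    for b in bits:
        # shift left for MSB-first arrival, shift right for LSB-first
        byte = ((byte << 1) | b) if msb_first else ((byte >> 1) | (b << 7))
    return byte

bits = serialize(0b10011101)            # the byte from the example above
print(bits, hex(deserialize(bits)))     # [1, 0, 0, 1, 1, 1, 0, 1] 0x9d
```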

USB (Universal serial bus)


As you know, computers that are sold these days do not have dedicated ports for the printer, mouse,
keyboard, scanner, etc. Instead, the devices typically use a standard bus called USB.


Many USB devices can be connected to a computer. Physically the connections define a tree
structure. The root of the tree is the USB controller, inside the computer, which is also called the
USB host. You may have two ports (slots where you plug the USB connector) which are children
of the root. You can connect USB hubs which serve as internal tree nodes. And you can connect
hubs to hubs. In total, you can connect up to 127 devices to a single USB host.
[Figure: USB tree. The computer (USB host) is the root; hubs are the internal nodes, connecting to further hubs and to devices 1 through 6 at the leaves.]

Although the above describes a tree-like structure, in fact each of the USB devices sees everything
that is put on the bus, i.e. the hubs (internal nodes) do not block the signals being sent. The host
communicates by polling. It sends messages to the devices, and they respond only when prompted.
“Device 15, do you have anything to say?” (Response: “no”.) “Device 24, do you have anything to
say?” (Response: “no”.) “Device 15, do you have anything to say?” (Response: “yes, bla bla bla”.)
etc.
The USB standard specifies various formats for packets of data that are used to communicate
between the host and a USB device. Each packet consists of fields including a synch, which is used
to synchronize the devices to some extent (details omitted), a packet ID which says what type of
packet it is (’ready’), an address (which device is this packet for?), data, and error correction information.
Details (many!) are omitted here.
Many USB drivers come with typical operating systems (Windows 7, Linux Ubuntu, Mac OS
X, ...). If a driver doesn’t come preinstalled then it can be downloaded (sometimes automatically).
When you connect a USB device (such as a new mouse, or printer) to your computer, the USB
controller (host) interrogates the device to find out what it is. It then sends this information to the
CPU, which checks if the OS has the driver on disk. If it does, it loads the driver. This automatic
system is known as “plug and play”.¹

Other I/O ports


In class, I briefly discussed a few other I/O ports.

• VGA (a 15-pin parallel connector – what I use to connect my laptop to the projector), invented
in the 1980s; the pins carry R,G,B data values, plus various control lines (end of row, end of
frame)
¹ When the system was first being introduced, about a decade ago, and didn’t work as well as it does now, it was
known as “plug and pray”.


• DisplayPort, miniDP

• HDMI (high-definition multimedia: video and audio)

I also briefly discussed how an optical mouse works. This material is for your interest only. It is
not on the final exam and so I do not elaborate here.
