Michael Langer Comp 273 2016
Welcome to COMP 273. Let’s get started! You all know that computers represent numbers using
0’s and 1’s. But how exactly this works is probably a mystery for most of you. We start out
with a basic and fundamental example, namely how to represent integers. We’ll start with positive
integers, then negative integers, and then we’ll turn to non-integers, i.e. real numbers.
When m is already represented as a binary number (i.e. “base 2”), the quotient and remainder
are trivial to obtain. The remainder is the rightmost bit – called the least significant bit (LSB). The
quotient is the number with the LSB chopped off. To convince yourself of this, note that writing a
positive integer as an n bit binary number means that you write it as a sum of powers of 2,
(b_{n-1} b_{n-2} ... b_2 b_1 b_0)_{two}  \equiv  \sum_{i=0}^{n-1} b_i 2^i  =  \sum_{i=1}^{n-1} b_i 2^i + b_0  =  2 \times \sum_{i=0}^{n-2} b_{i+1} 2^i + b_0
For example, repeatedly dividing 241 by 2 gives the quotients 120, 60, 30, 15, 7, 3, 1, 0 and
the remainders 1, 0, 0, 0, 1, 1, 1, 1, in that order. Reading the remainders from last to first
gives the bits from left to right. Thus,
241ten = 11110001two .
You need to be able to do this for yourself. So practice it!
If you are not fully clear on why the algorithm is correct, consider what happens when you run
the algorithm on a number that you already have in binary representation. (The quotient and
remainder of a division by 2 do not depend on how you have represented the number, i.e. whether
it is represented in decimal or binary.)
(quotient) (remainder)
11110001
1111000 1
111100 0
11110 0
1111 0
111 1
11 1
1 1
0 1
0 0
0 0
: :
The remainders are simply the bits used in the binary representation of the number!
Final note: the representation has an infinite number of 0’s on the left, which can be truncated.
I mention this because in the heat of an exam you might get confused about which way to order
the 0’s and 1’s. Remember that you get infinitely many 0’s when you continue to apply the
trick; this tells you that the remainder bits go from right to left.
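The repeated-division trick can be sketched in Java (the language used in the course examples); the class and method names here are my own:

```java
// Sketch: convert a positive integer to its binary representation by
// repeatedly dividing by 2 and collecting the remainders from right to left.
public class ToBinary {
    static String toBinary(int m) {
        if (m == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (m > 0) {
            sb.insert(0, m % 2);  // remainder = least significant bit
            m = m / 2;            // quotient = number with the LSB chopped off
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.println(toBinary(241)); // prints 11110001
    }
}
```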
It’s the same algorithm that you use with decimal, except that you are only allowed 0’s and 1’s.
Whenever the sum in a column is 2 or more, you carry to the next column:
2^i + 2^i = 2^{i+1}
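The grade school addition algorithm on bit strings can be sketched as follows (my own names; both strings are assumed to have the same length):

```java
// Sketch: grade-school binary addition, carrying whenever a column
// sums to 2 or more (since 2^i + 2^i = 2^(i+1)).
public class BinaryAdd {
    static String add(String a, String b) {
        StringBuilder sb = new StringBuilder();
        int carry = 0;
        for (int i = a.length() - 1; i >= 0; i--) {
            int sum = (a.charAt(i) - '0') + (b.charAt(i) - '0') + carry;
            sb.insert(0, sum % 2);   // bit written in this column
            carry = sum / 2;         // carry into the next column
        }
        if (carry == 1) sb.insert(0, 1);
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.println(add("1011", "0110")); // 11 + 6 = 17, prints 10001
    }
}
```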
We would like to apply the grade school algorithm to do subtraction, multiplication and long division
in binary. However, there are some subtleties in doing so. For example, when you subtract a big
positive number from a small positive number, you end up with a negative number, and I have not
yet told you how to represent negative numbers. That’s what I’ll start with next lecture.
Consider six-digit decimal numbers, as on a car odometer. Adding 1 to 999999 gives 000000:
  999999
+ 000001
  000000
Notice that we are ignoring a seventh digit in the sum (a carry over). We do so because we are
restricting ourselves to six digits.
Take another six digit number 328769. Let’s look for the number that, when added to 328769,
gives us 000000. By inspection, we can determine this number to be 671231.
328769
+ 671231
000000
Now let’s do the same in binary, using eight bits. Take the number 26, which is 00011010 in
binary. What should the representation of −26 be?
  00011010
+ ????????
  00000000
To find −26, we need to find the number that, when added to 00011010 gives us 00000000. We use
the following trick. If we “invert” each of the bits of 26 and add it to 26, we get
00011010 ← 26
+ 11100101 ← inverted bits
11111111
This is not quite right, but it’s close. We just need to add 1 more. (Remember the odometer).
11111111
+ 00000001
00000000
Note we have thrown away the ninth bit because we are restricting ourselves to eight bits only. Thus,
[Footnote 1] Ok, ok, odometers were used back in the 20th century, before you were born..
00011010 ← 26
11100101 ← inverted bits
+ 00000001 ← +1
00000000
Alternatively, we add 1 to the inverted bit representation and this must give us −26:
  00011010 ← 26
+ 11100110 ← this is −26 (base ten)
  00000000
Thus, to represent the negative of a binary number, we invert each of the bits and then add 1. This
is called the twos complement representation.
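The invert-and-add-1 rule is easy to sketch in Java on eight bits (the method name is my own):

```java
// Sketch: negate an 8-bit value by inverting the bits and adding 1,
// keeping only the low 8 bits (the "thrown away" ninth bit).
public class TwosComplement {
    static int negate8(int x) {
        return (~x + 1) & 0xFF;  // invert, add 1, truncate to 8 bits
    }
    public static void main(String[] args) {
        int neg26 = negate8(0b00011010);                    // 26
        System.out.println(Integer.toBinaryString(neg26));  // prints 11100110
        System.out.println((26 + neg26) & 0xFF);            // prints 0 (odometer wraps)
    }
}
```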
Special cases
One special case to check is that the twos complement of 00000000 is indeed 00000000:
  00000000
  11111111 ← invert bits
+ 00000001 ← add 1
  00000000
where the ninth bit has again been thrown away. Another special case is 10000000, which is
128 when interpreted as an unsigned number:
  10000000
  01111111 ← invert bits
+ 00000001 ← add 1
  10000000
Note that 128 has the same representation as -128. Of course, we can’t have both: we have to
decide on one. Either 10000000 represents 128 or it represents -128. How does that work?
The table below shows how the same 8-bit codes are interpreted as unsigned versus signed
(twos complement) numbers:

code        unsigned   signed
00000011       3          3
00000010       2          2
00000001       1          1
00000000       0          0
11111111     255         -1
11111110     254         -2
  ...         ...        ...
01111110     126        126
01111111     127        127
10000000     128       -128
10000001     129       -127
10000010     130       -126
Suppose that in a Java program you assign the literal 4000000000 to an int variable. This gives
a compiler error: ”The literal of type int is out of range.” The compiler knows that
4,000,000,000 is greater than 2^31 − 1. Now try forcing the value into an int anyway, for
example by writing the literal as a long and casting it down to an int.
This prints out -294967296. To understand why these particular digits are printed, you would need
to convert 4000000000 to binary, which I don’t recommend because it is tedious. The point is that
it is a negative number! This can easily happen if you are not careful, and obviously it can lead to
problems.
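A minimal version of this experiment (assuming Java, as in the other course examples):

```java
// Sketch: the literal 4000000000 does not fit in an int, so it must be
// written as a long; casting it down to int keeps only the low 32 bits,
// which happen to encode a negative number in twos complement.
public class Overflow {
    public static void main(String[] args) {
        // int x = 4000000000;     // compile error: literal out of range
        int x = (int) 4000000000L; // bits above bit 31 are discarded
        System.out.println(x);     // prints -294967296
    }
}
```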
Floating point
Let’s next talk about binary representations of fractional numbers, that is, numbers that lie between
the integers. Take a decimal number such as 22.63. We write this as:
The “.” is called the decimal point. We can use a similar representation for fractional binary
numbers. For example,
We write ≈ because Pwe have rounded down, namely by chopping off any trailing bits. These trailing
bits are of the form −∞ i
i=−9 bi 2 . Note that we have negative powers of 2 here.
What if we don’t chop off the trailing bits and instead we generate bits ad infinitum. What
happens? Interestingly, at some point you will generate a sequence of bits that repeats itself over
and over. To see why, first consider a simple example. We write 0.05 in binary.
.05
= (0).1 × 2^{-1}
= (00).2 × 2^{-2}
= (000).4 × 2^{-3}
= (0000).8 × 2^{-4}
= (00001).6 × 2^{-5}
= (000011).2 × 2^{-6}
= (0000110).4 × 2^{-7}
= (00001100).8 × 2^{-8}
= (000011001).6 × 2^{-9}
= (0000110011).2 × 2^{-10}
= etc.
Note this pattern: 0011 will be repeated over and over as the part to the right of the binary point
cycles back to “.2”. This gives:
(.05)ten = .0000110011001100 · · ·
where the group 0011 repeats ad infinitum.
In a more general case, we have N digits to the right of the binary point. If you run the algorithm
I have described, then you will find that the number to the right of the binary/decimal point will
eventually repeat itself. Why? Because there are only 10^N possible numbers of N digits. And once
the algorithm hits an N digit number it has hit before, this defines a loop that will repeat over and
over again. If the algorithm takes k steps to repeat, then k bits are generated. These k bits will
similarly repeat ad infinitum. Of course, the repeating bits might be all 0’s, as in the case of 0.375
which was in the lecture slides. (Do it for yourself if you are studying from these notes.)
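The bit-generating algorithm above can be sketched exactly in Java by keeping the fraction as a numerator and denominator, which avoids any float rounding (the names are my own):

```java
// Sketch: generate the first n bits of the binary expansion of the
// fraction num/den by repeated doubling, using exact integer arithmetic.
public class FractionBits {
    static String bits(int num, int den, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            num *= 2;                                 // shift one binary place
            if (num >= den) { sb.append('1'); num -= den; }
            else            { sb.append('0'); }
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        // 0.05 = 1/20: the group 0011 repeats after a short prefix
        System.out.println(bits(1, 20, 12));  // prints 000011001100
        // 0.375 = 3/8 terminates: the repeating bits are all 0's
        System.out.println(bits(3, 8, 6));    // prints 011000
    }
}
```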
We commonly write hexadecimal numbers as 0x____, where the blanks are filled with characters
from 0,...,9,a,b,c,d,e,f. For example, 0x2fa3.
Sometimes hexadecimal numbers are written with capital letters. In that case, a large X is used as
in the example 0X2FA3.
If the number of bits is not a multiple of 4, then you group starting at the rightmost bit (the
least significant bit). For example, if we have the six-bit string 101011, then we represent it as 0x2b.
Note that looking at this hexadecimal representation, we don’t know if it represents 6 bits or 7 or
8, that is, 101011 or 0101011 or 00101011. But usually this is clear from the context.
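You can check such groupings with the standard Java library (a quick sketch, not part of the original notes):

```java
// Sketch: parse a bit string and print it in hexadecimal; the library
// groups from the least significant bit, just as described above.
public class BitsToHex {
    public static void main(String[] args) {
        int v = Integer.parseInt("101011", 2);        // 43
        System.out.println("0x" + Integer.toHexString(v)); // prints 0x2b
    }
}
```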
Fixed Point
We are mostly discussing floating point numbers today, but I want to mention briefly that some
computers represent non-integer numbers using fixed point. Fixed point means we have a constant
number of bits (or digits) to the left and right of the binary (or decimal) point.
For example, we might have eight digits to the left of the decimal point and two digits to the
right. An example is 23953223.49. You are familiar with representations having two digits to the
right: that is how our currency works (but the number of digits to the left is not fixed with currency).
An example in binary is 10.1101, where we have two and four bits to the left and right, respectively.
Note that I used the term ”digits” for base 10, and ”bits” for base 2.
We can represent negative numbers in fixed point using two’s complement. Take the number
6.5 in decimal which is 110.1 in binary. Suppose (arbitrarily) that we represent it with four bits to
the left and four to the right of the binary point. To represent -6.5 in binary, we invert the bits and
add what we need to get back to 0.
0110.1000
1001.0111 ← invert bits
+ 0000.0001 ← add .0001
0000.0000
We get the representation of -6.5 by adding the 2nd and 3rd rows of the sum:
1001.0111 + 0000.0001 = 1001.1000.
Scientific Notation
Let’s return to floating point and consider a number representation called “scientific notation,”
which most of you learned about in high school. You know that you can represent very large
decimal numbers or very small decimal numbers by writing a few significant digits times a power
of ten, e.g. 6.02 × 10^23. In binary, a normalized number is written in the form
1.____ × 2^E
where the blank ____ is called the significand (or mantissa) and E is the exponent. Note that it
is impossible to encode the number 0 in normalized form.
The number of bits in the significand is called the number of significant bits. Note that we don’t
count the leading 1 bit when counting the number of significant bits, since it is always 1. (In
decimal, we do have to specify the most significant digit since it can be anything from 1 to 9, so
it is included in the count.)
Single precision
Single precision represents a floating point number using 32 bits (4 bytes, that is, one byte is 8
bits). We number the bits from right to left. Bit 31 is the leftmost and bit 0 is the rightmost.
• bit 31 (1 bit) is used for the sign
• bits 23-30 (8 bits) are used for the exponent
• bits 0-22 (23 bits) are used for the significand
The two exponent codes (00000000, 11111111) are reserved for special cases.
• The exponent 00000000 is used to represent a set of numbers lying strictly in the tiny interval
(−2^{−126}, 2^{−126}), which includes the number 0. These numbers are called denormalized numbers.
Notice that, to represent 0, the exponent bits and significand bits are all 0’s.
• The exponent 11111111 is used to represent either infinity (plus or minus, depending on the
sign bit), or the case that there is no number stored (called “not a number”, or NaN). This
case could arise if your program has declared a floating point variable but you haven’t assigned
it a value yet.
These special cases are determined by the significand. If the significand is all 0’s (23 of them),
then the interpretation is ±∞, depending on the sign bit. If the significant is not all zeros,
then the interpretation is NaN.
There are 254 remaining 8-bit exponent codes, namely from 00000001 to 11111110, which are
used for normalized numbers. These codes represent exponents ranging from -126 through 127.
To convert from this 8-bit exponent code to the exponent value that it represents, you interpret
the 8-bit code as an unsigned number (from 1 to 254) and then subtract the value 127 (called the
“bias”) which is the code for the exponent value of 0.
How should you think about the discretization of the real line that is defined this way? Think
of consecutive powers of 2, namely [2^e, 2^{e+1}]. That interval is of size 2^e. Partition that interval into
2^23 equal sized pieces, each of size 2^{e−23}. Note that the step size between “samples” is constant
within each interval [2^e, 2^{e+1}], but the step size doubles in each consecutive power-of-2 interval.
An Example
How is 8.75 coded up in the IEEE single precision standard? First you show that
8.75 = (100011)_2 × 2^{−2}
using the techniques discussed earlier. Then, you shift the binary point to normalize the number,
(100011)_2 × 2^{−2} = (1.00011)_2 × 2^3
The significand is 00011000000000000000000 where I have underlined the part that came from the
above equation. Note that we have dropped the leftmost 1 bit, because we are using normalized
numbers.
The exponent code is determined by adding 127 to the exponent value, i.e. 3 + 127 = 130, and
writing 130 as an unsigned number, (10000010)_2. The sign bit is 0 since we have a positive number.
Thus, 8.75 is encoded with the 32 bits
0 10000010 00011000000000000000000.
We can write it in hexadecimal as 0x410c0000.
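You can verify this encoding in Java: the standard library method Float.floatToIntBits returns the raw IEEE single precision bit pattern of a float.

```java
// Sketch: print the raw single precision bit pattern of 8.75 in hex.
public class EncodeCheck {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(8.75f);
        System.out.println(Integer.toHexString(bits)); // prints 410c0000
    }
}
```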
Suppose we want to add two single precision floating point numbers, x and y, e.g.
x = 1.101010000000000000000101 × 2^{−3}
y = 1.001001000010101010101010 × 2^{2}
where x has a smaller exponent than y. We cannot just add the significands bit by bit, because
then we would be adding terms with different powers of 2. Instead, we need to rewrite one of the
two numbers such that the exponents agree. The usual way to do this is to rewrite the smaller of
the two numbers, that is, rewrite the number with the smaller exponent. So, for the above example,
we rewrite x so that it has the same exponent as y:
x = 1.101010000000000000000101 × 2^{−3}
  = 1.101010000000000000000101 × 2^{−5} × 2^{2}
  = 0.00001101010000000000000000101 × 2^{2}
Notice that x is no longer normalized, that is, it does not have a 1 to the left of the binary point.
Now that x and y have the same exponent, we can add them.
  1.001001000010101010101010 × 2^{2}
+ 0.00001101010000000000000000101 × 2^{2}
= 1.00110001011010101010101000101 × 2^{2}
The sum is a normalized number. But the significand is more than 23 bits. If we want to represent
it as single precision float, then we must truncate (or round off, or round up) the extra bits.
A related point, which I discussed last lecture, is that most numbers that have a finite number
of digits (base 10) require an infinite number of bits to represent exactly. This means that when
you try to represent these numbers as floating point in a computer, you will have to approximate
them. This can lead to surprising results. Take the example 0.05, i.e. 1/20.0, that we saw last
class, and consider the following code:
float x = 0;
for (int ct = 0; ct < 20; ct++) {
    x += 1.0 / 20;
    System.out.println(x);
}

The output begins:
0.05
0.1
0.15
0.2
0.25
0.3
0.35000002
0.40000004
0.45000005
0.50000006
etc
None of these numbers is represented exactly in the computer! The computer is just printing the
best approximation it can with eight digits to the right of the decimal point. For example, 0.1
means 0.10000000, and it doesn’t bother writing the seven 0’s on the right because people don’t
generally want to see those 0’s. But that’s just convention.
Why eight digits? The answer is presumably that 2^23 ≈ 10^8, and so the number that is being
approximated with those 23 bits of significand can be approximated just as well with eight
digits.
Double precision
The 23 bits of significand might seem like a lot at first glance. But 2^23 is only about 8 million. When
we try to represent integers larger than this value, we find that we have problems. We don’t have
enough precision and we have to skip some. (See Exercises 1 Q10c.)
In order to get more precision, we need more bits. The IEEE double precision representation
uses 64 bits instead of 32: 1 bit is used for the sign, 11 bits for the exponent, and the remaining
52 bits for the significand (1 + 11 + 52 = 64). A similar idea is used for the exponent bias, except
that now the bias is 1023 (2^10 − 1) rather than 127 (2^7 − 1). Conceptually, though, there is
nothing new here beyond single precision.
In the slides, I went through the same examples as I did for single precision. I wrote out the
exponent codes for double precision, and I encoded 8.75 as a double. I suggest you work them out
for yourself, and verify with the slides. I also gave a similar Java example for double.
double x = 0;
for (int ct = 0; ct < 10; ct++) {
    x += 1.0 / 10;
    System.out.println(x);
}
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
Note that it prints out up to about 16 digits. The reason for 16 is that 2^52 ≈ 10^16, and so with
52 bits we can represent about the same number of numbers as we can with 16 digits. Of course,
we are still approximating.
Note that the 3rd line has 17 digits to the right of the decimal point. Also, it seems strange
that the third line is slightly greater than .1 * 3 whereas the 7th-9th lines are slightly less than .1 *
7-9 respectively. It is not obvious why this behavior occurs. (Time permitting, I will figure it out
later – or if one of you wishes to figure it out and let me know, please do.)
In lectures 1 and 2, we looked at representations of numbers. For the case of integers, we saw
that we could perform addition of two numbers using a binary representation and using the same
algorithm that you used in grade school. I also argued that if you represent negative numbers using
twos complement, then you can do addition with negative numbers as well. In lecture 4, we will see
how to implement the addition algorithm using circuits. Before we can do so, we need to review
some basic logic and use it to build up circuits.
Truth tables
Let A and B be two binary valued variables, that is, A, B each can take values in {0, 1}. If we
associate the value 0 with FALSE and the value 1 with TRUE, then we can make logical expressions
using such binary valued variables. We make such expressions using the binary operators AND,
OR, and the unary operator NOT. Let A · B denote AND(A, B). Let A + B denote OR(A, B). Let
Ā denote NOT(A). The AND and OR operators are defined by the following truth table.
A B A·B A+B
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 1
A  Ā
0  1
1  0
Since a truth table of two input variables has four rows, it follows that there are 2^4 = 16 possible
output functions (that is, possible columns in the truth table). The operators AND and OR are
just two of these 16 possible output functions. Three other output functions that are commonly
used are NAND (not and), NOR (not or), and XOR (exclusive or). These are defined as follows:

A B  NAND  NOR  XOR
0 0    1    1    0
0 1    1    0    1
1 0    1    0    1
1 1    0    0    0
To justify why the symbols “+” and “·” are used to represent OR and AND operators, I would
have to spend some time talking about boolean algebra. This might be of interest to some of you,
but it won’t take us in the direction we want to go in this course. Instead, I will give you a brief
introduction to the topic and then focus on how to use the logical expressions to build up digital
circuits in a computer.
associative law: (A + B) + C = A + (B + C) and (A · B) · C = A · (B · C)
distributive law: A · (B + C) = (A · B) + (A · C) and (A · B) + C = (A + C) · (B + C)
Example
On the last page I wrote truth tables for basic binary and unary operators. We can also write truth
tables for expressions that are built out of these operators. Here is an example:
Y = A · B · C · (A · B + A · C)
The last column Ȳ is not necessary here, but I’ll discuss it on the next page.
Don’t cares
If we have m input variables and n output variables, then the truth table has 2m rows. However,
sometimes we don’t need to write all the rows because certain combinations of the inputs give the
same output. For example, take
Y = A · B̄ · C + Ā
The second expression Ā is really the sum of four product expressions, which have all combinations
of B and C. Alternatively, the second expression can be thought of as saying “A is false, and I
don’t care what B and C are”. We denote such “don’t care’s” in the truth table with x symbols.
A B C Y
0 x x 1
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 0
Logic Gates
Computers solve for expressions ”Y = ...” using electronic circuits, called logic gates, which are
composed of transistors, resistors, and other elements. Transistors are basic circuit elements
typically made from silicon. (To understand how they work, you would need to spend a
few semesters in the physics department.) Transistors were invented in the 1940s at Bell Labs, and
the inventors won the Nobel Prize for it in 1956.
We will not discuss how transistors work. Rather we will work at the gate level. A gate
implements one of the binary operators defined above, or the unary operation NOT. Examples of
the gates we will use are shown below. These are standard symbols and you will need to memorize
them.
[Figure: the standard symbols for the AND, NAND, NOT, OR, NOR, and XOR gates.]
The inputs and outputs to gates are wires which have one of two voltages (often called low and
high, or 0 and 1). Often we will put more than two inputs into an AND or OR gate. This is allowed
because of the associative law of Boolean algebra. (The underlying physical circuit would have to
be changed, but we don’t concern ourselves with this lower level.) Also note that the commutative
law of Boolean algebra says that the ordering of the wires into a gate doesn’t matter.
Example 1
Consider the circuit for
Y = A·B + A·C
Whatever binary values are given to the input variables, the output Y will correspond to the truth
value of the logical expression.
[Figure: the circuit for Y = A·B + A·C, with inputs A, B, C and output Y.]
The black dot in this figure indicates that input wire A branches into two input wires.
Example 2
We can write the XOR gate using a sum-of-products expression,
Y = A ⊕ B = A · B̄ + Ā · B
and the circuit is shown below. (We could have also used a product-of-sums.)
Example 3
Consider a circuit,
Y = S̄ · A + S · B
This circuit is called a multiplexor or selector. What does this circuit do? If S is true, then Y is
given the value B, whereas if S is false then Y is given the value A. Such a circuit implements the
familiar statement:
if S
then Y := B
else Y := A
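The selector equation can be sketched directly with bitwise operators on 0/1 values (the names below are my own):

```java
// Sketch: gate-level multiplexor Y = S'·A + S·B, written with bitwise
// NOT/AND/OR on single-bit values.
public class Mux {
    static int mux(int s, int a, int b) {
        return ((~s & a) | (s & b)) & 1;  // if S then B else A
    }
    public static void main(String[] args) {
        System.out.println(mux(0, 1, 0)); // S=0 selects A: prints 1
        System.out.println(mux(1, 1, 0)); // S=1 selects B: prints 0
    }
}
```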
A combinational circuit can also be thought of as a read only memory (ROM). Consider a truth
table with two input variables and three output variables:

A1 A0   Y2 Y1 Y0
0  0    0  1  1
0  1    0  0  1
1  0    0  0  0
1  1    1  0  0
and here is the corresponding memory. (Note that the address is not stored in the memory!)
input output
(address) (contents of memory)
00 011
01 001
10 000
11 100
What is new about this concept? You now have to think of the input and output variables in
a different way than you did before. Before, you thought of the input variables as having TRUE
or FALSE values, and you did not associate any particular significance to their order. Thinking
about the circuit as a ROM is quite different. You no longer think of each of the input variables
as TRUE or FALSE. Rather, you think of the input variables as defining a binary number, namely
an address. The order of the input variables now has a significance, since it defines an ordering on
the address space. Similarly, the output variables are a string of bits whose order may or may not
matter.
Arithmetic Circuits
Last class we discussed gates and circuits and how they could be used to compute the values
of logical expressions, formed by NOT, AND, OR and other operators. Such circuits are called
combinational logic circuits or combinational digital circuits. We will look at several more useful
examples today.
How would we implement addition using these gates? Recall the algorithm from Lecture 1 for
adding two n-bit numbers, A and B :
Cn−1 Cn−2 . . . C2 C1 C0
An−1 An−2 . . . A2 A1 A0
+ Bn−1 Bn−2 . . . B2 B1 B0
———————————
Sn−1 Sn−2 . . . S2 S1 S0
The Ci are the carry bits, and Si are the sum bits. Note that C0 = 0 because there is no way to
carry in. (It will be clear later why we are defining this bit.)
How are we interpreting these bits? We are doing ”odometer arithmetic” here, so it doesn’t
matter whether we are thinking of the numbers as signed (twos complement) or unsigned. This does
matter when we interpret the results, but it doesn’t matter from the standpoint of the mechanisms
for computing the sum.
The rightmost bit (or least significant bit) S0 of the sum is simple to determine using logical
expressions and combinational circuits, as is the carry out bit C1.
A0 B0 S0 C1
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Thus,
S0 = A0 ⊕ B0
C1 = A0 · B0
The circuit that implements these is called a half adder. It has two inputs and two outputs. You
should be able to draw this circuit yourself. More generally, the values of the k th sum and carry
out bits are given as follows:
Ck Ak Bk Sk Ck+1
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
which we can represent using sum-of-products circuits as defined by these expressions:
Sk = C̄k · Āk · Bk + C̄k · Ak · B̄k + Ck · Āk · B̄k + Ck · Ak · Bk
Ck+1 = C̄k · Ak · Bk + Ck · Āk · Bk + Ck · Ak · B̄k + Ck · Ak · Bk
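The full adder truth table can be checked in software. The sketch below uses the standard simplification of the sum-of-products: the sum bit is the XOR of the three inputs, and the carry out is the majority function (names are my own):

```java
// Sketch: one full adder stage, returning {sum, carryOut}.
public class FullAdder {
    static int[] fullAdd(int a, int b, int cIn) {
        int s = a ^ b ^ cIn;                        // parity = sum-of-products Sk
        int cOut = (a & b) | (a & cIn) | (b & cIn); // majority = Ck+1
        return new int[]{s, cOut};
    }
    public static void main(String[] args) {
        for (int c = 0; c <= 1; c++)
            for (int a = 0; a <= 1; a++)
                for (int b = 0; b <= 1; b++) {
                    int[] r = fullAdd(a, b, c);
                    System.out.println(c + " " + a + " " + b
                        + " -> S=" + r[0] + " C=" + r[1]);
                }
    }
}
```

Running the loops reproduces the eight rows of the truth table above.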
The box with the ’+’ below contains the combinational circuitry for implementing the truth table
above. Note that we have used a full adder for the least significant bit (LSB) and set C0 to 0.
This circuit is called a ripple adder. To compute Sn−1 you need to allow the carries to ripple
through from bits 0 up to bit n − 1. This suggests that for large n (say 32 bits for A and B each),
you would have to allow a long delay. Later this lecture, we will discuss how this delay can be
avoided. For now, we turn to other useful combinational circuits.
[Figure: a ripple adder. Each stage k is a full adder (a '+' box) with inputs Ak, Bk, and carry-in
Ck, producing the sum bit Sk and the carry-out Ck+1, which feeds the next stage; C0 is set to 0.]
Encoders
An encoder is a combinational logical circuit whose outputs specify which of a set of possible inputs
or groups of inputs has occured. The term encoder is used somewhat loosely. (There is no generally
agreed upon formal definition of an encoder that I am aware of.) Let’s look at a few examples.
A common example is a set of m input wires, such that exactly one of them has the value 1.
For example, consider a keyboard with 5 buttons and suppose that the mechanics of these buttons
are such that one and only one of these buttons is pressed down. (You may have seen something
like this on a children’s toy.) Such an encoder might be constructed using the truth table:
A B C D E Y1 Y2 Y3
0 0 0 0 1 0 0 1
0 0 0 1 0 0 1 0
0 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
1 0 0 0 0 1 0 1
Notice that if none of the buttons is pressed, or if multiple buttons are pressed (because you push
hard and break the mechanism), then the above truth table does not specify the value of the
output variables Yi. But this is not a problem, as long as the design is such that only one button
can be pressed at any time.
Suppose we allow for multiple buttons to be pressed and we want to explicitly indicate in the
truth table what the output of the circuit will be for each possible input. One way to do it is shown
below. Consider the following truth table and recall that X means “don’t care”. If no buttons are
pressed, the output is 000. If two or more buttons are pressed, then the output is the same as if
only the rightmost of these buttons was pressed.
A B C D E Y1 Y2 Y3 interpretation
0 0 0 0 0 0 0 0 none of the buttons is pressed
X X X X 1 0 0 1 E is pressed (and maybe A,B,C,D)
X X X 1 0 0 1 0 D and not E (but maybe A,B,C)
X X 1 0 0 0 1 1 C and neither E nor D (but maybe A,B)
X 1 0 0 0 1 0 0 etc
1 0 0 0 0 1 0 1 etc
Y1 = (A · B̄ · C̄ · D̄ · Ē) + (B · C̄ · D̄ · Ē)
Y2 = (D · Ē) + (C · D̄ · Ē)
Y3 = E + (C · D̄ · Ē) + (A · B̄ · C̄ · D̄ · Ē)
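The behavior of this priority encoder can be sketched in Java (my own names; the rightmost pressed button, E, has highest priority, matching the truth table above):

```java
// Sketch: 5-button priority encoder; returns the 3-bit output Y1 Y2 Y3
// as a single value, with 0b000 meaning "no button pressed".
public class PriorityEncoder {
    static int encode(int a, int b, int c, int d, int e) {
        if (e == 1) return 0b001;  // E wins regardless of the others
        if (d == 1) return 0b010;  // D and not E
        if (c == 1) return 0b011;  // C and neither E nor D
        if (b == 1) return 0b100;
        if (a == 1) return 0b101;
        return 0b000;              // none pressed
    }
    public static void main(String[] args) {
        // A and C pressed together: same output as C alone
        System.out.println(encode(1, 0, 1, 0, 0)); // prints 3 (binary 011)
    }
}
```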
A second example of an encoder that you should be familiar with is an output device (an LCD)
that displays a single digit from {0, 1, ..., 8, 9}. Each digit is defined by “turning on” a subset of the
line segments L1 , · · · , L7 . For example, when all seven of the line segments are on, the output digit
represents “8”, whereas when all lines except L2 are turned on, the output digit represents “0”. In
Exercises 2, you will write out the truth table for such an encoder.
[Figure: a seven-segment display. The horizontal segments are L1 (top), L2 (middle), and L3
(bottom); the vertical segments are L4 and L6 (upper left and right) and L5 and L7 (lower left
and right).]
Decoders
A decoder is the opposite of an encoder. It takes a code (a binary number) as input, and turns on
one of several output wires (or, more generally, a particular subset of the output wires). For example,
here is a truth table for a 1-to-2 decoder:
A Y1 Y2
0 1 0
1 0 1
You should be able to draw the circuit for these two examples.
[Figure: left, the circuit for the 1-to-2 decoder with input A and outputs Y1, Y2; right, a
two-input multiplexor (MUX) with data inputs A and B, select input S, and output Y.]
Suppose instead we had four different inputs (A, B, C, D) and we wanted to select from them.
Then we would use a “two bit multiplexor” circuit as shown below on the left. The MUX symbol
on the right will be used whenever we want to hide the underlying circuitry.
Sometimes the inputs A, B, C, D are not single bits, but may have many bits, e.g. we might be
selecting one of four integers. In this case we might use the same notation shown in the two bit
multiplexor figure, and if we wanted to explicitly show that the inputs had a certain number (say
16) of bits, we would label the wires as such, by putting a little slash in each wire and writing 16.
We will see examples later in the course, and the notation will become natural to you.
[Figure: a four-input multiplexor. Data inputs A, B, C, D enter ports 3, 2, 1, 0; a two-bit select
input S, passed through a decoder, chooses which input drives the output Y.]
Fast adder
Let’s look at a few examples of how multiplexors can be useful. Recall the ripple adder discussed at
the beginning of the lecture. We can reduce the long delays for the carries to propagate, using the
following trick. Assume n is an even number. Divide the bits into low order bits (0, 1, . . . , n/2 − 1)
and high order bits (n/2, . . . , n − 1). Build one circuit to compute the sum for the low bits. Build
two circuits to compute the sum for the high order bits. Set the carry in bit for the LSB in the
lower order bit circuit to 0. Set the carry in bits for the LSB in the two high order circuits to 0 and
1, respectively. In the figure below, the “+” box is a 16 bit adder, rather than a one bit adder. (I
drew a slightly different looking figure in the lectures.)
The main idea here is that we don’t need to wait for the worst case, in which the carries are
passed through the entire circuit (e.g. adding A = 011111 · · · 111 and B = 0000 · · · 0001). Instead,
once the carries pass through the first n/2 full adders and the circuits stabilize, we know which of
the upper order sums we want. We just select it. Thus we have almost cut the computation time
in half. I write “almost” because we also have to do the selection, which takes time.
Note that we could repeat this idea, and speed up the computation of the three n/2 bit additions
by decomposing each of them into three additions of n/4 bits, and so on. There is a tradeoff here
between time (fast) and space (having to replicate the upper order bits to handle the two cases of
the carry out from the low order bits).
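The carry-select idea can be sketched in software. For clarity this sketch uses int arithmetic for the 8-bit halves rather than gate-level adders; the names are my own:

```java
// Sketch: 16-bit carry-select addition. Compute the high half twice
// (carry-in 0 and carry-in 1) and let the low half's carry-out choose.
public class CarrySelect {
    static int add16(int a, int b) {
        int aLo = a & 0xFF, bLo = b & 0xFF;
        int aHi = (a >> 8) & 0xFF, bHi = (b >> 8) & 0xFF;
        int lo = aLo + bLo;                // low adder, carry-in 0
        int carry = lo >> 8;               // carry out of bit 7
        int hi0 = (aHi + bHi) & 0xFF;      // high adder assuming carry-in 0
        int hi1 = (aHi + bHi + 1) & 0xFF;  // high adder assuming carry-in 1
        int hi = (carry == 1) ? hi1 : hi0; // the MUX selection
        return (hi << 8) | (lo & 0xFF);
    }
    public static void main(String[] args) {
        System.out.println(add16(0x00FF, 0x0001)); // prints 256 (0x0100)
    }
}
```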
[Figure: a fast adder for 16 bits. One adder computes S7,...,S0 with carry-in 0. Two adders
compute A15,...,A8 plus B15,...,B8, one with carry-in 0 and one with carry-in 1; the carry out of
the low adder drives a MUX that selects S15,...,S8.]
Subtraction
Suppose we wanted to build a circuit for performing subtraction.
An−1 An−2 . . . A2 A1 A0
– Bn−1 Bn−2 . . . B2 B1 B0
———————————-
To perform A − B, we compute A + (−B). This requires us to negate B and then send the result
through the adder.
To negate a number that is represented in twos complement, we invert each of the bits and then
add 1 (lecture 1). To do this with a combinational circuit is easy. We can use a circuit similar to
the adder, except that the B bits need to be inverted to B̄. To add one, we can set the carry in
bit C0 to 1. See below on the left. Binvert is the variable that specifies whether we are doing
addition (0) or subtraction (1). Note that we can use this variable both to select B versus B̄, and
for C0.
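The Binvert trick can be sketched on 8-bit values as follows (the names are my own; the same control bit selects the inverted B and supplies the carry-in):

```java
// Sketch: adder/subtractor computing (A + B) or (A + ~B + 1) mod 2^8.
public class Subtract {
    static int addSub8(int a, int b, int binvert) {
        int bIn = (binvert == 1) ? (~b & 0xFF) : (b & 0xFF); // MUX: B or its inverse
        return (a + bIn + binvert) & 0xFF;                   // binvert doubles as C0
    }
    public static void main(String[] args) {
        System.out.println(addSub8(30, 26, 1)); // 30 - 26: prints 4
        System.out.println(addSub8(30, 26, 0)); // 30 + 26: prints 56
    }
}
```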
[Figure: left, an adder/subtractor. For each bit i, a MUX controlled by Binvert selects between
Bi and its inverse before the full adder, and Binvert also supplies the carry-in C0. Right, a
one-bit ALU slice, in which an Operation input selects among the add, AND, and OR results.]
Sequential Circuits
All of the circuits that I have discussed up to now are combinational digital circuits. For these
circuits, each output is a logical combination of the inputs. We have seen that these circuits can
do arithmetic and other operations. But these circuits are not powerful enough to build a general
purpose computer.
A key element that is missing is memory. Consider two different ways of remembering something.
Suppose I tell you a telephone number. One way for you to remember it would be to write it down,
say on paper. Then later you could just look at the paper. Another way to remember the number
which is good for short term, is just to repeat the number over and over to yourself.
You probably have some intuition about how a computer could write down a number. It could
alter the state of some magnetic material, for example. What is less clear to you is how a computer
could remember a number by repeating it to itself. Indeed this is the mechanism that is often used.
We will now turn to this technique.
[Figure: an RS latch: two cross-coupled NOR gates with inputs R and S and outputs Q and Q̄.]
On the left below is a truth table for a NOR gate, as a reminder. On the right are the values of Q
and Q̄ for various input combinations. It may seem strange that we label the two outputs as such,
since if R = S = 1, then Q = 0 and Q̄ = 0, which is impossible if Q̄ is to be the complement of Q;
indeed, in the slides I used the symbol Q0 instead of Q̄ when introducing the RS latch. The reason
it is ok to give the outputs the labels Q and Q̄ is that eventually we will not allow R = S = 1.
The way RS latches are used is that at most one of R and S has value 1. When exactly one of
them has value 1, it determines Q and Q̄. When both have value 0, then Q and Q̄ keep the values
that they had when one of R and S last had the value 1. See slides for illustration.
A B  NOR(A,B)        R S   Q  Q̄
0 0     1            0 0   hold     ← remember
0 1     0            0 1   1  0     ← “set”
1 0     0            1 0   0  1     ← “reset”
1 1     0            1 1   0  0     ← not allowed
It is standard to use letters R and S and to call them “reset” and “set”, respectively. The input
R = 1 and S = 0 “resets” the output Q which traditionally means “gives it value 0”. The input
R = 0 and S = 1 “sets” the output Q which means “gives it the value 1”. If you then put R = 0 and
S = 0, you hold the value of Q i.e. you remember the value. Thus, an RS latch acts as a one-bit
memory, holding (locking, latching shut) the value Q as long as R = 0, S = 0. The feedback
connections in the circuit are the circuit's way of “repeating a number back to itself” in order to remember it, as I described earlier.
The clock
All the circuits we’ve seen up to now can change their output values at any time their inputs change.
You can imagine this has undesirable properties. If the outputs of one circuit are fed into other
circuits, the exact timing and ordering of operations will be important. For example, consider the
adder/subtractor we saw earlier. The sequence of carries takes some time, and we don’t want the
inputs to the adder to change (say, to add two new numbers) before all the carries have propagated,
the first sum is finished, and the result has been stored somewhere.
To prevent this problem, we need a mechanism of synchronizing the inputs and outputs of the
circuits. Synchronization is achieved by a clock. A clock in a computer is a signal (a value on a
wire) that alternates between 0 and 1 at a regular rate. When you say that a processor has a clock
speed of 3 GHz, you mean that the clock cycles between 0 and 1 at a rate of 3 billion times per
second (GHz means “gigahertz”, where “giga” means billion).
To understand how a clock signal is generated, you would need to take a course in electronics
and learn about crystal oscillators. For COMP 273, we will simply assume such a clock signal is
available and that it regularly oscillates between 0 and 1 as shown below. The ON interval has the
same duration as the OFF interval.
[Figure: the clock signal as a square wave over time, alternating between 0 and 1 with equal ON and OFF intervals.]
clocked RS latch
We can combine the clock and the RS latch to give us more control over when we write into the RS
latch, namely we only write when the clock is 1. The circuit below is called a clocked RS latch.
[Figure: clocked RS latch. R and S are each ANDed with the clock C before entering the cross-coupled NOR gates, giving outputs Q and Q̄.]
When C = 0, the circuit holds its value regardless of any changes in the R or S inputs. When
C = 1, the circuit behaves as the RS latch shown above. Note that this does not yet avoid the
R = S = 1 situation which I mentioned above.
Here is an example of how the output Q can vary as the R and S inputs vary. Note that
transitions in Q can occur at the transition of C from 0 to 1 (in the case that S or R already has
the value 1 at that time), or during the time that C=1, in the case that S or R becomes 1 while
C = 1. One tricky case occurs during the second to last clock pulse. During that pulse, there are
two brief events to note: first S goes to 1 temporarily, and then R goes to 1 temporarily. Note what
happens to Q. When S goes to 1, Q goes to 1, and when S drops back to 0, Q stays on (memory).
Q falls to 0 a short time later, namely when R goes to 1, since then Q is reset.
[Figure: D latch. A single data input D drives S, and an inverter drives R with D̄; both are gated by the clock C as in the clocked RS latch. The box symbol on the right has inputs D and C and output Q.]
This D latch circuit is extremely important, so let’s repeat the idea. When C = 0, the data input
D does not get written to Q, and the previously written value is held. A write only occurs when C = 1,
and in that case the current value of D is written and can be read. Also notice that, by design, it
is impossible for the R and S inputs to both be 1 here, so our earlier concern about these values
being simultaneously 1 now goes away.
The D latch provides some control over when we can write, but it is still not good enough. When the
computer’s clock signal C is 1, data passes freely through the circuits. Let’s consider an example
of why this is problematic. Suppose we wish to implement an instruction such as
x := x + y;
We could implement it by putting the values of x and y into circuits made out of an array of
sequential circuits. We will discuss the correct way of doing this over the next few lectures. For
now, let’s look at an incorrect way of doing it, namely by representing x and y as binary numbers
and storing the bits of each number in an array of D latches. Say we have somehow managed to
store y in the A array and x in the B array. (Each of the little boxes in the figure below is supposed
to be a D latch. When we say “store a value” we mean that the Q outputs have that value.) We
feed these two arrays into our ALU (arithmetic logic unit) and feed the answer back into the B
array, in order to execute the above instruction. What happens?
When C = 1, the sum bits Si computed by the adder will write their values back into the Bi
units. But as long as C = 1, these newly written values will go right through into the adder again.
They will be summed once more with y, and will loop back and be written into the B array again,
etc. Moreover, the carry bits take time to propagate, so we cannot even think of this as a clean
succession of additions. What a mess! When C becomes 0 again, and writes are no longer allowed,
the values that are in the B units are not what we want.
D flip-flop
We need a mechanism that prevents data from being read from a D latch at the same time as it is
being written to a D latch. The mechanism that achieves this is called a D flip-flop. The two basic
types of D flip-flop are shown below. In each case, we have a pair of D latches with the Q output
value of the first being the D input of the second, and with the clock C being inverted for one D
latch but not the other.
[Figure: two flip-flop designs, each a pair of D latches in series (the Q of the first feeding the D of the second). Rising edge triggered: the clock C is inverted at the first latch. Falling edge triggered: the clock C is inverted at the second latch.]
Here I discuss just the falling edge one. (The argument of what happens for the rising edge one
is very similar.) When C goes from 0 to 1, the first D latch is written to, but the second D latch
holds its previous value because the inverted clock input to the second D latch goes from 1 to 0.
When C drops from 1 to 0, the value that had just been written to the first D latch now is written
to the second D latch whose clock input rises from 0 to 1 and this data value appears in the output
Q. Note that when C drops from 1 to 0, the value that had just been written into the first D latch
is held even though the input to the first D latch might have changed. The reason the previous
value is held is that C is 0. We say this flip-flop is falling edge triggered because the Q value changes
when the clock falls from 1 to 0.
For this mechanism to guarantee the synchronization we desire, the interval of a clock pulse
has to be long enough that the combinational circuits (such as adders, multiplexors, etc) between
flip-flops have finished computing their output values. Consider again the example above for x :=
x + y. Each of the bit memory units in fact would be a D flip flop rather than a D latch. We
would now have synchronization. During a clock pulse (C=1), the adder circuit would compute x
+ y and at the end of the clock pulse, when C falls from 1 to 0, the result would be written into x.
Below is an example of a timing diagram for D and the value stored in a flip-flop. I have included
both the rising edge case and falling edge case.
Registers
To perform an addition, you can imagine that the numbers/bits are read from a set of flip-flops
(sequential circuit), go through an adder circuit (combinational circuit) and are written into another
set of flip-flops, or even possibly the same flip flops. I gave a sketch of this above, and we’ll see how
this works in the MIPS computer in coming lectures.
A set of flip-flops that holds the values of some variable is called a register. The key property
of a register is that its flip-flops are written to as a unit. There is no way to selectively write to
the individual flip-flops in a register. On any clock cycle, you either write to every flip-flop in the
register, or you write to none of the flip-flops in the register.
The sketch below shows a four bit register. Each of the square boxes represents one D flip-flop.
Here I have written FF in each box to emphasize it’s a flip-flop, not a D latch. (In the slides, I
didn’t write “FF”, and in future I won’t write “FF”.) Each flip-flop has a clock input C and data
input D and a Q output.
[Figure: four-bit register. Four D flip-flops (FF) share a common clock, with inputs D3, D2, D1, D0 and outputs Q3, Q2, Q1, Q0.]
Shift registers
A shift register is a special register that can shift bits either to the left or right. We have seen that
shifting bits of an unsigned number to the left multiplies that number by 2, and shifting bits to the
right divides the number by 2. You can imagine it might be useful to have specialized circuits for
computing such shifts.
The figure below shows a four bit shift right register. When we shift right, we need to specify
the D input for the leftmost flip-flop. In the figure below, the leftmost bit is filled with the value of
variable Q4 . You can design the circuit to fill it with whatever you want (0, 1, or even wraparound
from Q0 , etc) and select one of these with a multiplexor.
[Figure: four-bit shift-right register. On each clock edge the bits move right, Q4 → Q3 → Q2 → Q1 → Q0, where Q4 is the fill value for the leftmost flip-flop.]
Note that for “right” and “left” to be meaningful, we need to use the common convention that
the least significant bit (that is, Q0 ) is on the right.
Examples
Show a timing diagram of the contents of the above register. Assume the initial values in the register
are
(Q3 , Q2 , Q1 , Q0 ) = (1, 0, 0, 1)
and assume the flip-flops in the register are falling-edge triggered. Also assume that, over the clock
pulses shown, the value of Q4 is 0. (In general, Q4 is not always 0, but it makes the timing diagram
easier to understand if we think of it as 0. Essentially we are just following the initial Q3 value of
1 as this 1 value moves through Q2 and Q1 to Q0.)
[Timing diagram: waveforms for Q3, Q2, Q1, Q0 over four clock pulses, showing the initial 1 in Q3 shifting right one position per falling clock edge until it leaves Q0.]
Let’s look at a more general shift register that can either shift left or right, and that can also
be cleared (all bits set to zero) and that can also be written to as a unit (write to all bits in one
clock cycle). Notice that we have tilted each flip-flop on its side in order to simplify the figure. Also
notice that we need a 2 bit selector to specify which of the four operations you want to perform.
The same selector is applied to each multiplexor. I have only shown bits 7, 8, 9 but you can imagine
extending the figure to other bits.
[Figure: bit-slices 9, 8, 7 of the left/right shift register. Each flip-flop's D input comes from a 4-to-1 multiplexor whose inputs implement shift left (Q of the bit to the right), shift right (Q of the bit to the left), clear (0), and load (Di); a shared 2-bit selector drives all multiplexors. Inputs D9, D8, D7; outputs Q9, Q8, Q7.]
Today I will finish off our discussion of registers, and then move on to discuss larger memories.
T flip-flop (toggle)
The circuit below on the left shows a D flip-flop whose data input D comes from the complement
Q̄ of the stored value Q. At every clock cycle, the Q value changes, namely it takes the value Q̄.
If the D flip-flop is falling edge triggered, then the value changes when the clock goes from 1 to 0,
which is the case shown on the right below. If the D flip-flop is rising edge triggered, then the
value of Q changes when the clock rises from 0 to 1.
[Figure: T flip-flop (a D flip-flop whose D input is Q̄), with timing diagram: over successive clock cycles Q = 0 1 0 1 0 1 0 1 0 and Q̄ = 1 0 1 0 1 0 1 0 1.]
Such a flip-flop is called a T flip-flop where T is for toggle. Note that the T flip flop changes its
value at half the rate of its clock input.
[Figure: three T flip-flops in series with outputs Q0, Q1, Q2. The clock drives the C input of the Q0 flip-flop; Q0 drives the C input of the Q1 flip-flop, and Q1 drives the C input of the Q2 flip-flop. Each D input is the flip-flop's own Q̄.]
The timing diagram for this circuit is shown below. Suppose the flip flops are falling edge
triggered. The values of Q2 Q1 Q0 represent binary numbers and, for each successive falling edge
of the clock, the triplet encodes a binary number. By inspection we can see that the sequence of
numbers is 000, 001, 010, 011, 100, 101, 110, 111 which is just the numbers from 0 to 7. Thus we
have a counter circuit!
[Timing diagram (falling-edge flip-flops): Q0 toggles on every falling clock edge (0 1 0 1 0 1 0 1 0), Q1 toggles on every falling edge of Q0, and Q2 on every falling edge of Q1, so (Q2 Q1 Q0) counts 000, 001, 010, . . . , 111.]
What would happen if we were to use rising edge triggered flip-flops instead? Qi+1 would toggle
its value whenever Qi rises from 0 to 1. You can verify for yourself that this produces a counter
which counts down (and which wraps around from 000 to 111). We call this a timer since it decreases
in value. This is the circuit that is used for the timer on your microwave oven or wristwatch.
[ASIDE: in the lecture, someone asked why there is an initial jump from 000 to 111. See if you
can answer that question by thinking about what happens on the first rising edge of the clock C.]
[Timing diagram for the rising-edge version: Q0 still toggles on every clock cycle, but Q1 and Q2 now toggle on rising edges of the previous output, so the count steps downward, wrapping from 000 to 111.]
Register array
One common way in which we use registers is to hold the values of variables in a program. Suppose
we wish to add two variables and store the result in another variable e.g. x := y+z, or x := x+z.
Suppose the value of each of the variables is stored in a register. To perform the addition, we must
read the values from two registers, feed the two values into an adder (ALU), and then write the
result into a register. All of this is done in one clock cycle.
A few lectures from now, we will use the MIPS processor to illustrate how such things are done.
In MIPS, the registers that are used in your programs are grouped into an array. There are 32
registers in the array – each register stores 32 bits. There is no significance to the fact that the
number of registers is the same as the number of bits in each register.
We next discuss how to read from and write to these 32 registers. We will discuss the various
elements in the following diagram.
[Figure: the register array, containing 32 registers of 32 bits each. A 5-bit WriteReg address is fed through a 5-to-32 decoder, and together with the WriteEnable control it selects which register the 32-bit WriteData is written to. Two 32-to-1 multiplexors, with 5-bit select inputs ReadReg1 and ReadReg2, produce the 32-bit outputs ReadData1 and ReadData2.]
Selecting one of the 32 registers needs 5 bits (again, 2^5 = 32). Thus a 5-bit coded number must be
fed into each of two multiplexors. In the figure, these are called ReadReg1 and ReadReg2. The 32 bit
values that are read out are labelled ReadData1 and ReadData2.
[Figure: the register array drawn as a single box with inputs WriteEnable, ReadReg1, ReadReg2, WriteReg, WriteData, and outputs ReadData1, ReadData2.]
Finally, it is natural to classify the inputs and outputs into three types:
1. addresses: ReadReg1 (5 bits), ReadReg2 (5 bits), WriteReg (5 bits)
2. data: WriteData (32 bits), ReadData1 (32 bits), ReadData2 (32 bits)
3. control: WriteEnable (1 bit)
[Figure: a two-dimensional array of one-bit memory cells. Each cell's clock input is the AND of the clock, the MemWrite control, and a RowSelect line; WriteData lines run down the columns into the cells' D inputs, and the cells' Q outputs in each column attach to a shared ReadData line.]
This design seems very nice (at first glance) because we need only a small number of lines for
each row and for each column. This design would allow us to scale up to much larger memories.
There is a problem with this design, however. For any column, all of the Q outputs of the cells in
that column are attached to the read data line. This won’t work. We can only allow one flip-flop
to put its Q value on any given line. If we tried to put a multiplexor in there, that would be going
back to the previous solution which didn’t work either because there are too many wires needed.
[Figure: tri-state gate. When the control input is 1, the output equals the input; when the control input is 0, the output is disconnected from the line and drives no value onto it.]
The figure below is similar to the one on the previous page, except that now a tri-state gate is
used. Previously, to write to a flip-flop in a single row, that row’s RowSelect control must be 1
and the MemWrite (and clock) control must be 1. Now, we also use the RowSelect
control to open the tristate gate and allow the flip flop shown to put its signal on the ReadData
line. Since only a single row is selected, the tristate gates ensure that only one cell in each column
is putting a value on the ReadData line.
The figure below is a slightly different design in which the read and write data lines in each
column are combined into a single line. Now we can either write into a flip-flop or read from a
flip-flop, but not both at once. Both the MemRead control and the RowSelect control are used to control
the tristate buffer. As before, the MemWrite control determines (along with clock and RowSelect)
whether the value on the data line is written into the flip-flop.
[Figure: a memory cell with a single shared data line per column. The cell's clock input is the AND of clock, MemWrite, and the row select line; a tri-state buffer controlled by MemRead AND row select connects the cell's Q output to the data line.]
RAM
In the above discussion, we considered an N ×N array of D flip-flops and we were reading or writing
all N flip-flops in some row. This was motivated by the register application, where we group a set
of bits into a unit and operate in the same way on all bits in that unit. We next consider a different
situation in which we allow ourselves either to read from or write to a single flip-flop in the N × N
array.
To select a column, we need another control line which we call ColumnSelect. This control is
fed into the same clock input for the flip-flop along with the other controls which must be 1 in order
for a write to occur. Only one of the N columns will have its control signal with value 1. The rest
will have value 0.
In the lecture, I discussed an example of the sort of RAM you could buy and put into your
computer. The example showed 8 chips of RAM which together held 2 GB. Since each byte is 8 bits, it
follows that there are 2^31 bits on each chip. Let’s think about this as a 2^15 × 2^16 array of one bit
memories.
[ASIDE: there are many different electronic technologies used for RAM. We won’t go into the details.
It’s fine if you just assume they are the flip-flops we have been discussing – but be aware that there
are many ways of getting transistor-based circuits to store one bit.]
The figure below shows where the RowSelect and ColumnSelect controls come from. Keep
in mind that these signals are 2^16 and 2^15 lines, respectively. Exactly one of the RowSelect and one of the
ColumnSelect lines has the value 1 at any time. In the figure below, the one out of 2^31 bits we are
indexing (or addressing) is coded as a binary number A30 A29 · · · A1 A0. The 16 bits A30 A29 · · · A15
code the row number and the 15 bits A14 · · · A0 code the column number.
Integer multiplication
Suppose we have two unsigned integers, A and B, and we wish to compute their product. Let A be
the multiplicand and B the multiplier:
An−1 . . . A1 A0 multiplicand
× Bn−1 . . . B1 B0 multiplier
P2n−1 ... Pn−1 . . . P1 P0 product
If the multiplier and multiplicand each have n bits, then the product has at most 2n bits, since
(2^n − 1) × (2^n − 1) = 2^(2n) − 2^(n+1) + 1 < 2^(2n),
which you can verify for yourself by multiplying out the left side. For example, if we are multiplying
two 32 bit unsigned integers, then we may need as many as 64 bits to represent the product.
Recall your grade school algorithm for multiplying two numbers. We can use the same algorithm
if the numbers are written in binary. The reason is that shifting a binary number one bit to the
left is the same as multiplying by 2, just as decimal multiplication by ten shifts digits left by one.
In the case of binary, each bit Bi of the multiplier that is 1 contributes 2^i times the
multiplicand, and so we add the correspondingly shifted multiplicands.
        1001101
×       0010111
---------------
        1001101
       10011010
      100110100
     0000000000
    10011010000
   000000000000
+ 0000000000000
---------------
    11011101011   ← product
In the grade school algorithm, we compute all the left shifts in advance, building up several rows
of numbers which correspond to the digits of the multiplier. Then, this set of numbers is added
together to yield the product (as above).
Let’s consider an alternative algorithm that avoids writing out all the rows with the shifted
multiplicands. Instead, the algorithm shifts the multiplicand left (i.e. multiplies it by 2) at each step
i. When bit Bi of the multiplier is 1, the shifted multiplicand is added to an accumulated total.
When bit Bi is 0, nothing is added to the total.
What hardware could we use to multiply two n = 32 bit unsigned numbers?
• shift registers: the multiplicand is shifted left. If we are multiplying two 32 bit numbers then
we use a 64 bit shift left register for the multiplicand. (We shift the multiplicand to the left
at most 32 times.)
• a 32 bit register for the multiplier. We use a shift right register, and read the least significant
bit (LSB, i.e. bit 0) to decide whether to add the shifted multiplicand on each step or not.
• a 64 bit adder
• a 64 bit register which accumulates the product
• a counter to keep track of how many times we’ve shifted.
[Figure: multiplication hardware. A 64-bit shift-left register holds the multiplicand A and a 32-bit shift-right register holds the multiplier B, each with a shift/load control. The LSB of the B register decides whether the adder (a combinational circuit) adds the shifted multiplicand into the 64-bit product register.]
We must provide a control signal to the shift registers telling them whether to shift or not at
each clock cycle. On the first clock cycle, we would load the multiplier and multiplicand registers,
rather than shifting the values in these registers. (Where would we load them from? We would load
them from some other registers that are not drawn in the circuit.) On subsequent clock cycles, we
would perform the shift.
Similarly, we must provide a signal to the product register telling it whether we should clear the
product register (only on the first clock cycle), or whether we should write in the value coming out
of the ALU (only on subsequent clock cycles, and only when the LSB of the B register is 1).
Finally, notice that all the instructions within the “for loop” can be performed in parallel within
a clock cycle.
Integer division
As an example, dividing 78 by 21 gives quotient 3 and remainder 15:
78 = 3 × 21 + 15
A few notes: First, the first line of this algorithm is just the usual first step in long division of
placing the divisor to the left of the dividend. When you carry out long division by hand in base
10, you don’t think of this as multiplying by 10^n where n is the highest power of the dividend. But
this is essentially what you are doing. Second, the comparison “divisor ≤ remainder” can be done
by computing the difference (remainder − divisor) and checking whether the result is negative.
In a two’s complement representation, this amounts to checking the MSB (most significant bit) of
the difference: if it is 1, then divisor > remainder.
Floating point addition
Suppose we wish to add two floats:
x = 1.101010000000000000000101 × 2^−3
y = 1.001001000010101010101010 × 2^2
where x has a smaller exponent than y. We cannot just add the significands bit by bit, because
then we would be adding terms with different powers of 2. Instead, we need to rewrite one of the
two numbers such that the exponents agree. e.g. We rewrite x so it has the same exponent as y:
x = 1.101010000000000000000101 × 2^−3
  = 1.101010000000000000000101 × 2^−5 × 2^2
  = 0.00001101010000000000000000101 × 2^2
Now that x and y have the same exponent, we can add them.
  1.001001000010101010101010 × 2^2
+ 0.00001101010000000000000000101 × 2^2
= 1.00110001011010101010101000101 × 2^2
The sum is a normalized number. But the significand is more than 23 bits. If we want to represent
it as single precision float, then we must truncate (or round off, or round up) the extra bits.
It can also happen that the sum of two floats is not normalized. If, instead of y, we had
z = 1.11111000000000000000000 × 2^2
then check for yourself that x + z would not be normalized. So, to represent the result as single
precision float, we would have to normalize again by shifting (right) the significand and truncating,
and changing the exponent (by adding the size of the shift).
You should appreciate what you need to do for floating point addition. You need to compare
exponents, shift right the significand of the number (x or y) with the smaller exponent, use a big
adder, normalize, deal with the case that the result is not normalized, and round off. Details of
the circuit are omitted, but you should appreciate how many steps may be involved.
Floating point multiplication
Consider a small example:
x1 = 1.101 × 2^−7 = 1101 × 2^−10
x2 = 1.001 × 2^4 = 1001 × 2^1
Since (1101)2 × (1001)2 = 13 × 9 = 117 = (1110101)2 and 2^−10 × 2^1 = 2^−9, we can rewrite the product as a
normalized number as follows:
x1 × x2 = 1110101 × 2^−9 = 1.110101 × 2^−3
Now go back to the previous example in which x and y each have 23 bits of significand. We can
write each of them as an integer times a power of 2:
x = 1.101010000000000000000101 × 2^−3
  = 1101010000000000000000101 × 2^−26
y = 1.001001000010101010101010 × 2^2
  = 1001001000010101010101010 × 2^−21
We then multiply two 24 bit integers (24, not 23, since there is the leading 1 that needs to be
included), and add the exponents. Finally, we normalize.
What circuitry do we need to perform a multiplication? We need circuitry for carrying out
(unsigned) integer multiplications. We also need circuitry for adding exponents and for adding
constants (such as 23) to exponents.
So you should be able to appreciate that floating point multiplication is more complicated than
integer multiplication, but not so much more complicated. Note also that, for floating point division,
the long division algorithm doesn’t stop when the remainder is less than the divisor. Instead, it
keeps going to get the negative powers of the base (2 or 10).
What about floating point division? If we want to compute x/y, we can perform the same trick
as above and reduce the problem to the division of two integers, times an exponent. So division of
two floating point numbers requires slightly more circuitry than integer division (and takes about
the same amount of time).
Such a set of circuits is an example of a finite state machine. For the circuit, each possible
setting of the values in all the registers defines a state. There may be other values (called “outputs”)
defined as well which are not necessarily represented by registers. For example, two register outputs
might feed into a combinational circuit which has outputs. There may also be values (called “inputs”)
that are sitting on wires but perhaps not written into the registers that they connect to (because a
WriteEnable signal might be false). So at each time, we have inputs, outputs, and a state.
In COMP 330, you will study finite state machines without talking about circuits. You’ll just
talk about states, inputs, and outputs, and you’ll talk about how the machine goes from state to
state, given an input, and how it produces outputs at each time that depend on the state at that
time (and perhaps on the input at that time).
In the lecture, I gave an example of a turnstile, like what you have in the Metro. There are
only two things you can do at a turnstile, assuming that your OPUS pass has tickets in it. You can
either put your pass down, or you can push against the turnstile. Verify for yourself that the table
below captures the transitions from one state to the next.

current state    input    next state
locked           pass     unlocked
locked           push     locked
unlocked         pass     unlocked
unlocked         push     locked
There is a lot to say about finite state machines, but I will leave it for COMP 330. The
abstraction of inputs, outputs, and state transitions doesn’t help us much in this course and so I’m
going to stop writing now. But hopefully you get a whiff of the idea, and are asking yourself where
this direction leads. Take COMP 330 and you’ll find out.
Assembly language
When computers were first invented in the 1940s, programmers had to program in machine code –
they had to set 0’s and 1’s by flipping physical switches. For most people, it is not much fun (and
certainly not efficient) to program with 0’s and 1’s, and so a slightly higher level of code, called
assembly language was invented. Assembly language is just human readable machine code. The
MIPS language we will use in the next few weeks is an example.
An assembly language program is an ASCII file, and is not executable. It needs to be translated
into its machine code equivalent. This translation is relatively easy (compared with translating
from a high level language like C to machine code). The translation from assembly language into
machine code is done by a program called an assembler. We are not going to learn in this course
1
I neglected to mention this in the lecture but I’ll mention it here for your interest. The program which interprets
the JVM code is called an interpreter. The interpreter simulates the JVM. Note that translating and interpreting
are not the same thing. Translation happens before the program runs. Interpreting happens while the program
is running. This is the same notion of translation versus interpretation that is used in natural languages: people
who work as translators take written documents in one language and create written documents in another language.
People who work as interpreters take spoken language in real time and convert it into another language. The two
jobs are very different. When politicians travel to foreign countries, they need an interpreter, not a translator.
how exactly this is done.2 Rather, we will just learn an assembly language and get experience
programming in it, and understand what this corresponds to at the level of the actual processor.
MIPS
In the next few weeks, we will look at a particular machine or computer processor: the MIPS R2000
(from here on, just known as MIPS). We will spend the next few weeks getting to know the MIPS
“instruction set” (set of MIPS instructions) and you will get some experience programming in the
MIPS assembly language.
Before we can introduce MIPS instructions, we need to know that these instructions can refer
to two types of memory. The first is the set of 32 MIPS registers that I described in lecture 6
(page 5). The second is a much larger memory that holds all the data and instructions of a user’s
programs, as well as the instructions and data of the kernel or operating system. If you wish to
write programs for an operating system for the given computer, then you need to make use of the
special instructions and data and memory locations of the kernel. We won’t do that in this course.
We’ll only write user programs.
If you are new to programming, then it may surprise you to learn that both instructions and
data sit in memory. This may be counterintuitive since you might think that instructions and data
are very different things. As we will see, though, both instructions and data ultimately need to
be encoded as 0’s and 1’s, and so there is no reason why they cannot both sit in memory. Your
intuition is correct, though, in that they are kept in separate places in memory, respecting the fact
that they are different types of things.
In MIPS, we write Memory (with a capital M) to refer to a set of 2^32 bytes (or 2^30 words, i.e.
a word is 4 bytes) that can be addressed by a MIPS program. These 2^32 bytes do not include the
32 MIPS registers. These 2^32 bytes should not be confused with an address on a physical memory
chip or a location on a disk. As we will see later in the course, MIPS Memory addresses need to be
translated into physical addresses.3
You should have only a simple notion of how a MIPS computer works. This is illustrated in the
figure below. You should understand that instructions and data are stored in Memory. You should
also understand that operations such as addition and subtraction are implemented by selecting
two numbers that are stored in registers, running them through the ALU, and then writing the
result back into a new register. The figure below also shows another register which is called the
program counter. It holds the address of the instruction that is currently being executed. This instruction
is in Memory and so the address of the instruction is 32 bits.
In this lecture, we will see some examples of instructions for carrying out such arithmetic
operations. We will also see instructions for moving words to and from Memory and the registers.
Finally, we will see how a program goes from one instruction to the next, and how branches to other
instructions can be achieved.
2
If you want to learn about this topic, then take the Compiler Design course COMP 520.
3
An analogy here is postal codes versus (latitude,longitude). Postal codes are used by the postal system to sort
and direct mail. (Latitude,longitude) coordinates specify where the mail goes in a physical sense. If this is confusing
to you, then ask yourself: do you believe that the postal code A0A0A0 is at the tip of Newfoundland and Z9Z9Z9
is at the western tip of Vancouver Island?
[Figure: the MIPS model. Memory (2^30 words = 2^32 bytes, with each word 32 bits) holds kernel instructions and data, user data, and user instructions. The 32 registers feed the ALU, and the program counter holds the 32-bit address (1 word) of the current instruction.]
Arithmetic instructions
Consider the C instruction:
c = a + b;
In the MIPS assembly language, this instruction might be written:
add $16, $17, $18 # register 16 is assigned the sum of registers 17 and 18
The $ symbol marks a register, e.g. $17 is register 17. The # symbol marks a comment. When this
symbol appears in a MIPS program, any characters to the right of this symbol are ignored.
To copy a word from Memory into a register, MIPS has lw, which stands for “load word”, for
example: lw $16, 40($17). In this example, the address of the word in Memory is the value in
register 17, plus 40. Allowing for an offset (e.g. 40) gives the programmer more flexibility. If it
happens that the address of the word we want is already in $17, then the offset would just be 0.
What about the opposite operation, namely taking a word that is in a register and putting it
into Memory? For this, MIPS has sw, which stands for “store word”.
Again, for example sw $16, 40($17): the address of the word in Memory is the value in register 17, plus 40.
In the lecture, I described a few simple examples that combine arithmetic instructions like add
and Memory transfer instructions. Say you are trying to perform x = y + z, and x and y values
happen to be in registers but the value of z is not in a register, then you need to load z from
Memory into a register before you can compute the sum. Similarly, if y and z are in registers but
x is in Memory, then you need to compute the sum y + z and store it in some other register, and
then copy (store) the value in that register to x, which is in Memory.
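A minimal sketch of this second case, assuming y is in $s1, z is in $s2, and x lives in Memory at the address held in $s3 (all register choices are illustrative, not from the lecture):

```mips
add $t0, $s1, $s2    # compute y + z in some other register
sw  $t0, 0($s3)      # store the sum to x, which is in Memory
```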
In the slides, I also sketched out a data path for the above lw and sw instructions. We will go
into more detail about data paths in the weeks ahead.
Branching instructions
What is the order of instructions executed in a MIPS program? As I mentioned earlier, the program
counter (PC) register holds the address in Memory of the currently executing instruction. The
default is that this counter is incremented (by 4, i.e. one word) after each instruction.
Sometimes, however, it is useful to branch to other instructions. You are familiar with this idea
e.g. if-then-else statements, for loops or while loops, goto statements. Branching instructions
allow the program to execute an instruction other than the one that immediately follows it. For
example, consider the following snippet of C code.
if (a != b)
f = g + h;
Here you want to execute the sum if a is not equal to b, and so you don’t want to execute the sum
if a is equal to b. In MIPS assembly language, such an instruction could be implemented using beq,
which stands for “branch if equal”, for example:
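A sketch of how this could be written, assuming a, b, g, h, f sit in registers $s0–$s4 respectively (an assumption for illustration; the notes don't fix the registers):

```mips
      beq $s0, $s1, Exit   # if (a == b), branch past the assignment
      add $s4, $s2, $s3    # f = g + h
Exit: ...
```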
if (a != b)
f = g + h;
else g = f + h;
This one is trickier. If the condition (a != b) is met, then f is assigned a new value and the
program must jump over the else statement. For this, we use a jump instruction j.
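Under the same illustrative register assignments as before (a, b, g, h, f in $s0–$s4), the if-else could be sketched as:

```mips
      beq $s0, $s1, Else   # if (a == b), go to the else branch
      add $s4, $s2, $s3    # f = g + h
      j   Exit             # jump over the else statement
Else: add $s2, $s4, $s3    # g = f + h
Exit: ...
```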
• “op” stands for “opcode”. R-format instructions always have opcode 000000.
• “shamt” stands for “shift amount” (It is not used in add. We will discuss it later.)
• “funct” stands for function. It is used to distinguish different R format instructions. For
example, the subtraction operation sub has the same opcode as add, namely 000000, but it
has a different funct code.
Note that there are 64 possible R-format instructions. Why? R-format instructions always have
opcode 000000, and there are 2^6 = 64 possible funct fields. Instructions like add, sub, and the
bitwise operations and, or, nor, etc. are all R format and have different funct codes.
4. It is unclear why the letter “t” is used.
field     op   rs   rt   immediate
numbits    6    5    5          16
For the lw instruction, a word is transferred from Memory to a register. The address of the
word in Memory is determined by the source register rs, which contains a 32 bit base address, plus
the 16 bit offset. The offset is a signed number (from −2^15 to 2^15 − 1). The rt register acts as the
destination register.
For the sw instruction, a word is transferred from a register to Memory. The address of the word
in Memory is determined by the source register rs (which contains a 32 bit base address) plus the 16
bit signed offset. The register rt contains the 32 bit value to be stored at this address.
For the beq instruction, the two registers rs and rt hold 32 bit values that are compared.
Depending on the result of the comparison (true or false), the next instruction executed is either the
next instruction in sequence (PC + 4 bytes) or some other instruction (PC + offset). The relative address
of this other instruction is determined by the 16 bit immediate field i.e. by the offset. As we will
see later in the course when we examine data paths, this offset is in words, not bytes. The reason
is that instructions are always one word long, so the offset is always a multiple of 4 bytes, since the
instruction we are branching to is always a multiple of 4 bytes away.
J format (“jump”)
An example of the J format instruction is the jump instruction j that we saw earlier. This instruction
uses 26 bits of address, rather than 16 bits as we saw with the I format instructions. With 26 bits,
we can jump farther than we could with 16 bits. Later when we talk about data paths, I will specify
exactly how this is done.
field     op   address
numbits    6        26
f = g + h − i;
This instruction is implemented by first computing (g + h) and then subtracting i from the result.
The value g + h is temporary in the sense that it is not stored as part of the instruction. Its value is
thrown away after i is subtracted. We might want to use a $t register for it. Here is an example
of the corresponding MIPS instructions:
add $t0, $s1, $s2 # temporary variable $t0 is assigned the value g+h
sub $s0, $t0, $s3 # f is assigned the value (g+h)-i
Another register that gets a special name is $0. This register is called $zero, and it always
contains the value 0. The hardware prevents you from writing into this register!
if (a == b)
f = g + h;
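For this version of the test, the natural instruction is presumably bne ("branch if not equal"), the counterpart of beq; a sketch with the same illustrative registers:

```mips
      bne $s0, $s1, Exit   # if (a != b), skip the assignment
      add $s4, $s2, $s3    # f = g + h
Exit: ...
```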
What about other conditional branches? You might expect there to also be a blt for “branch
if less than”, bgt for “branch if greater than,” ble for “branch if less than or equal to”, etc. This
would make programming easier. However, having so many of these instructions would use up
precious opcodes. The MIPS designers decided not to have so many instructions and to take an
alternative approach.
To execute a statement such as “branch on less than” in MIPS, you need two instructions. First,
you use an R format instruction, slt which stands for “set on less than”. The slt instruction is
then combined with beq or bne. Together, these span all possible inequality comparisons ≤, <, ≥, >.
See (new) Exercises 4. Here is an example for the C code. I will use C variable names s1 and s2
to make it easier to see what’s going on.
We want to branch if the condition fails, that is, if s2 is not greater than s1. So, we set a temporary
variable to the truth value “s2 > s1”, or equivalently “s1 < s2”. We want to branch if the condition
is false, that is, if the truth value is 0. Here is the MIPS code, assuming C variables s1 and s2
correspond to MIPS registers $s1 and $s2 respectively:
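A sketch of the code described (branch when "s2 > s1" is false; the temporary register $t0 is illustrative):

```mips
      slt $t0, $s1, $s2     # $t0 = 1 if s1 < s2, i.e. if s2 > s1, else 0
      beq $t0, $zero, Exit  # branch if the truth value is 0 (condition fails)
      ...                   # code executed when s2 > s1
Exit: ...
```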
Here we want to exit the loop (branch) if the condition fails, namely, if s1 > s2, i.e. if s2 < s1 .
Here’s the MIPS code:
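A sketch of such a loop (the loop body stands for whatever the program repeats while s1 <= s2):

```mips
Loop: slt $t0, $s2, $s1     # $t0 = 1 if s2 < s1, i.e. if s1 > s2
      bne $t0, $zero, Exit  # exit the loop if the condition fails
      ...                   # loop body
      j   Loop
Exit: ...
```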
One final comment: you may be wondering why the MIPS designers didn’t just have separate
machine code instructions blt, ble, bgt, bge. The answer is that such instructions would use up
precious (I format) op codes. The MIPS designers wanted to keep the number of instructions as
small as possible (RISC). The benefit of this design is that the RISC architecture allows simpler
hardware, which ultimately runs much faster. The cost is that a given program tends to require
more of these simpler instructions to have the same expressive power.
f = h + (−10);
The “i” in addi is the same “I” as in “I-format”; it stands for “immediate”. Recall that I-format
instructions have a 16 bit immediate field. The value −10 is put in the immediate field and treated
as a twos complement number.
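Assuming f and h are in $s0 and $s1 (illustrative registers), the statement f = h + (−10) becomes:

```mips
addi $s0, $s1, -10    # f = h + (-10); the -10 sits in the 16 bit immediate field
```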
Another example of an R format instruction that has an I format version is slti which is used
to compare the value in a register to a constant:
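The instruction being described in the next sentence is presumably:

```mips
slti $t0, $s0, -3    # $t0 = 1 if the value in $s0 is less than -3, else 0
```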
In this example, $t0 is given the value 0x00000001 if the value in $s0 is less than -3, and $t0 is
given the value 0x00000000 otherwise. Such an instruction could be used for a conditional branch.
if (i < -3)
do something
The i stands for immediate and the u stands for unsigned. For example, addu stands for “add
unsigned”. This instruction treats the register arguments as if they contain unsigned integers,
whereas add treats them as signed integers. Note that an adder circuit produces
the same result whether the two arguments are signed or unsigned. But as we saw in Exercises 2,
Question 12, the conditions in which an overflow error occurs will depend on whether the arguments
are signed or unsigned, that is, whether the operation specifies that the registers represent signed
or unsigned ints. In this sense, add and addu do not always produce the same result. The overflow
conditions can be tested in the hardware to decide if the program should crash or not. We will talk
about such exception handling issues later in the course.
The I-format instruction addiu treats the immediate argument as an unsigned number from 0 to 2^16 − 1.
For example,
addiu $s3, $s2, 50000
will treat the 16-bit representation of the number 50000 as a positive number, even though bit 15
(MSB) of 50000 is 1. (See “sign extending” below.)
Another example of when it is important to distinguish signed vs. unsigned instructions is when
we make a comparison of two values. For example, compare slt vs. sltu. The result can be quite
different, e.g.
sltu $s3, $s2, $s1
can give a different result than
slt $s3, $s2, $s1
in the case that one of the variables has an MSB of 1 and the other has an MSB of 0.
Sign extending
The 16 bit value that is put in the immediate field typically serves as an argument for an ALU
operation (arithmetic or logical), where the other argument is a 32 bit number coming from a
register. As such, the 16 bit number must be extended to a 32 bit number before this operation can
be performed, since the ALU is expecting two 32 bit arguments. Extending a 16 bit number to a
32 bit number in this way is called sign extending.
If the 16 bit number is unsigned then it is obvious how to extend it. Just append 16 0’s to the
upper 16 bits. If the 16 bit number is signed, however, then this method doesn’t work. Instead, we
need to copy the most significant bit (bit 15) to the upper 16 bits (bits 16-31).
Why does this work? If the 16 bit number is positive, the most significant bit (MSB) is 0 and
so we copy 0’s; it is trivial to see that this works. If the 16 bit number is negative, the MSB is 1 and
so we copy 1’s to the higher order bits. Why does this work? Take the example of the number -27,
which is 1111111111100101 (0xffe5) in 16 bit binary. Sign extending by copying 1’s gives 0xffffffe5,
and this does the correct thing: adding 27 (0x0000001b) to it yields 0 in 32 bit arithmetic, with the
carry out of the top bit discarded. Convince yourself that this sign extending method works in general.
If these instructions are put in the opposite order, then we get a different result. The reason is that
lui puts 0’s into the lower 16 bits. Previous values there are lost forever.
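The pair of instructions referred to here is presumably the standard lui/ori idiom for building a 32 bit constant (the constant 0x12345678 and register $s0 are illustrative):

```mips
lui $s0, 0x1234         # upper 16 bits of $s0 = 0x1234, lower 16 bits = 0
ori $s0, $s0, 0x5678    # OR in the lower 16 bits: $s0 = 0x12345678
```

In the reverse order, the ori would set the lower bits first and the lui would then overwrite them with 0’s.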
Another bit manipulation that we often want is to shift bits to the left or right. For example,
we saw in an earlier lecture that multiplying a number by 2^k shifts it left by k bits. (This assumes
that we don’t run over the 32 bit limit.) We shift left with sll which stands for “shift left logical”.
There is also an instruction srl which stands for “shift right logical”.
You might think that this instruction would be I-format. But in fact it is an R-format instruction.
You don’t ever shift more than 31 bits, so you don’t need the 16 bit immediate field to specify the
shift. MIPS designers realized that they didn’t need to waste two of the 64 opcodes on these
instructions. Instead, they specify the instruction with some R format op code plus a particular
setting of the funct field. The shift amount is specified by the “shamt” field, and only two of the
register fields are used.
In the past two lectures, we discussed MIPS operations on integers. Today we will consider
a few data structures that you are familiar with, namely arrays and strings, and discuss how to
implement them in MIPS.
Arrays
Suppose we wish to declare and use an array of N integers. How would we represent this array in
MIPS? If we were programming in C, we might let the array be a[ ]. Let the address of the first
element of the array be kept in one of the registers, say $s0. This address is called the base address
of the array. To access an element of the array, we need to specify the address of that element relative
to the base address, i.e. the offset.
For example, consider the C instructions:
a[12] = a[10];
In MIPS, we cannot transfer data directly from one location in Memory to another. Instead, we
need to pass the data through a register. Here might be the corresponding MIPS instructions:
lw $t1, 40($s0)
sw $t1, 48($s0)
The offsets from the base address are 40 and 48 for a[10] and a[12] respectively. Why? We need
four bytes for each word, so an offset of 10 words is an offset of 40 bytes and an offset of 12 words
is an offset of 48 bytes.
Another example is the C instruction,
m = a[i];
Suppose the integer variable m is stored in register $s1 and integer index i is stored in register
$s2. We multiply the contents of $s2 by 4 to get the offset, and then we add the offset to the base
address, which we’ll assume is in $s0 as in the above example. Note that we compute the offset by
multiplying i by 4 using the sll instruction, which is simpler than having to perform a general
multiplication. (We will see how to perform multiplication in an upcoming lecture.)
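Putting this together, m = a[i] could be sketched as follows (registers as assumed above, with $t0 as an illustrative temporary):

```mips
sll $t0, $s2, 2      # offset = i * 4
add $t0, $t0, $s0    # address of a[i] = base address + offset
lw  $s1, 0($t0)      # m = a[i]
```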
Strings
In the C programming language (COMP 206), a variable of type “char” uses one byte (http:
//[Link]). Thus, each character in a string is stored in one byte.
Previously, we saw the instructions lw and sw which load a word from Memory or store a word
to Memory, respectively. There are similar instructions for single bytes. lb loads a byte, and sb
stores a byte. These instructions are I-format. Again, we specify the address of the byte with a
base address plus an offset. The offset is a signed number.
lb $s2, 3( $s1 )
sb $s2, -2( $s1 )
lb takes one byte from memory and puts it into a register. The upper 24 bits of the register are sign
extended i.e. filled with whatever value is in the most significant bit (MSB) of that byte. There is
also an unsigned version of load byte, namely lbu, which fills the upper 24 bits with 0’s (regardless
of the MSB).
sb copies the lower 8 bits of a register into some byte in memory. This instruction ignores the
upper 24 bits of the word in the register.
Assembler directives
We have seen many MIPS instructions. We now would like to put them together and write programs.
To define a program, we need certain special instructions that specify where the program begins, etc.
We also need instructions for declaring variables and initializing these variables. Such instructions
are called assembler directives, since they “direct” the assembler on what (data) to put where
(address).
Let’s look at an example. Recall the MIPS code that computes the length of a string. We said
that the string was stored somewhere in Memory and that the address of the string was contained
in some register. But we didn’t say how the string got there. Below is a program that
defines a string in MIPS Memory and counts the number of characters in the string.
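A sketch of such a program (a reconstruction, not the exact code from the lecture; register choices are illustrative):

```mips
        .data
str:    .asciiz "COMP 273"       # the string, terminated by a 0 byte

        .text
        .align 2
        .globl main
main:   la   $s0, str            # $s0 = address of the string
        add  $t0, $zero, $zero   # $t0 = character count, initially 0
loop:   lb   $t1, 0($s0)         # load the next byte of the string
        beq  $t1, $zero, done    # a 0 byte marks the end of the string
        addi $t0, $t0, 1         # count this character
        addi $s0, $s0, 1         # advance to the next byte
        j    loop
done:   ...                      # $t0 now holds the string length
```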
The program begins with several assembler directives (.data, .asciiz, . . . ) followed by a main
program for computing string length. The .data directive says that the program is about to declare
data that should be put into the data segment of Memory. Next, the label str just
defines an address that you can use by name when you are programming. It allows you to refer to a
data item (in this case, a string). We’ll see how this is done later. In particular, the line has an
.asciiz directive which declares a string, “COMP 273”. When the program below uses the label
str, the assembler translates this label into a number. This number can be either an offset value
which is stored in the immediate field (16 bits) of an I-format instruction, or a 26 bit address in a
J format instruction.
The .text directive says that the instructions that follow should be put in the instruction
(“text”) segment of memory. The .align 2 directive puts the instruction at an address whose
lower 2 bits are 00. The .globl main directive declares that the instructions that follow are your
program. The first line of your program is labelled main.
The address of the string is loaded into a register using the pseudoinstruction la, which stands for
“load address”. MARS translates this instruction into
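Presumably the expansion is along these lines (MARS uses register $1, the assembler temporary $at, for pseudoinstruction expansions; the destination register $s0 is illustrative):

```mips
lui $1, 4097      # $1 = 0x10010000 (4097 = 0x1001 goes in the upper 16 bits)
ori $s0, $1, 0    # $s0 = 0x10010000, the address labelled str
```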
Note that (4097)ten = 0x1001, and so the address is 0x10010000, which is the beginning address of
the user data segment. (Actually there is a small “static” part of the data segment below this, which
starts at address 0x10000000. You can see this in MARS.)
Here are some more assembler directives that you may need to use:
For example,
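A sketch of declarations matching the description below (the initial value of y1 is illustrative; the others are as described):

```mips
        .data
y0:     .word  -14        # a word initialized to -14
b:      .byte  0, 0, 0    # an array of 3 bytes
arr:    .space 60         # reserves 60 consecutive bytes of Memory
y1:     .word  8          # another word (initial value illustrative)
```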
declares labels y0, b, arr, y1 which stand for addresses that you can use in your program. You
would not use these labels in branch instructions, since these addresses are in the data segment, not
in the text (instruction) segment. Rather, you use them as variables. The variable y0 is initialized
to have value -14. The variable b is an array of 3 bytes. The variable arr is a pointer to the start of
60 consecutive bytes of Memory. This region can be used in a flexible way. For example, it could
be used to hold an array of 15 integers.
Suppose we wanted to swap the values that are referenced by y0 and y1. How would you do
this? If you are programming in C or in Java, you use a temporary variable.
tmp = y0;
y0 = y1;
y1 = tmp;
Suppose y0 and y1 are in registers $s0 and $s1 respectively. You could use the pseudoinstruction
move, or
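A sketch of the register version, assuming y0 and y1 are in $s0 and $s1 (move is a pseudoinstruction; or-ing with $zero does the same job):

```mips
move $t0, $s0    # tmp = y0
move $s0, $s1    # y0 = y1
move $s1, $t0    # y1 = tmp
```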
However, if the variables are in Memory rather than in registers, then you would first need to load
them into registers, do the swap, and then store them back into Memory. Alternatively, you could
just do it like this.
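A sketch of the Memory version, using the labels y0 and y1 from the data segment (lw/sw with a label is a MARS pseudoinstruction): load both values, then store them back exchanged, with no register-to-register moves needed.

```mips
lw $t0, y0    # load both variables into registers
lw $t1, y1
sw $t0, y1    # store them back, exchanged
sw $t1, y0
```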
System Calls
We have seen how data can be defined using assembler directives. It is often more useful to use
input/output (I/O) to read data into a program at runtime, and also to write e.g. to a console that
a user can read, or to a file. Here I will describe how to use the console for reading and writing.
There is a MIPS instruction syscall which tells the MIPS kernel/OS that you want to do I/O
or some other operation that requires the operating system, such as exiting the program. You need
to specify a parameter to distinguish between several different system calls. You use the register
$2 for this, also known as $v0. To print to the console, use values 1, 2, 3, 4 which are for the case
of int, float, double, or string, respectively. To read from the console, use values 5,6,7,8 for int,
float, double, or string, respectively. When you are reading or printing, you need to specify which
register(s) hold the values/addresses you are reading from/to. There is no need to memorize this
stuff. Consult the documentation on MARS.
Here is an example of how to print a string to the console.
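A sketch (the label and message are illustrative; li and la are pseudoinstructions):

```mips
        .data
msg:    .asciiz "Hello!"

        .text
main:   li  $v0, 4     # syscall code 4: print string
        la  $a0, msg   # $a0 = address of the string to print
        syscall
```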
We have seen many of the 32 MIPS registers so far. Here is a complete table of them. In this
lecture, we will meet the last three of them: the stack pointer, the frame pointer (which we won’t
use, but it is good to be aware of it), and the return address. We will also discuss the rightmost
column in this table and what is meant by “preserved on call”.
j myfunction
This is not sufficient, however, since myfunction also needs to know where it is supposed to
eventually return to. Here’s how it is done. When main branches to the instruction labelled myfunction,
it writes down the address where it is supposed to return to. This is done automatically by a
different branching function jal, which stands for “jump and link.”
jal myfunction
jal stores a return address in register $31, which is also called $ra. The return address is written
automatically as part of jal instruction. That is, $ra is not part of the syntax of the jal instruction.
Rather, jal has the same syntax as the jump instruction j we saw earlier. It is a J format instruction.
So you can jump to 2^26 possible word addresses. Later when we look at the data path, we will see
exactly how this works and it will make more sense!
The return address is the address of the current instruction + 4, where the current instruction
is “jal myfunction”. That is, it is the address of the instruction following the call. (Recall each
instruction is 4 bytes or 1 word).
When myfunction is finished executing, it must branch back to the caller main. The address to
which it jumps is in $ra. It uses the instruction
jr $ra
where jr stands for jump register. This is different from the jump instruction j that we saw
earlier. jr has R-format.
So the basic idea for jumping to and returning from a function is this:

myfunction:
        ...
        jr  $ra             # return to the caller

        ...
main:
        ...
        jal myfunction
        ...                 # execution continues here after the return
Next, consider how arguments are passed to a function. Of the 32 registers we discussed previously,
four are dedicated to hold arguments to be passed to a function. These are $4, $5, $6, $7, and they
are given the names $a0, $a1,
$a3 respectively. There are also two registers that are used to return the values computed by a
function. These are $2,$3 and they have names $v0,$v1, respectively.
For example, if a function has three arguments, then the arguments should be put in $a0,$a1,$a2.
If it returns just one value, this value should be put in $v0. (It is possible to pass more than four
arguments and to return more than two values, but this requires using registers other than $2 − $7.)
[ASIDE: recall from last lecture that syscall uses $v0 to specify the code for the system call e.g.
print an integer, and $a0 and $a1 for various arguments needed for the system call. If a function
uses syscalls, then one must be especially careful to save these register values before the syscall and
then reload them after the syscall. ]
What are the MIPS instructions in main that would correspond to the C statement
m = myfunction(i);
The argument of myfunction is the variable i and let’s suppose i is in $s0. Suppose the returned
value will be put into $s1, that is, the C variable m will be represented in the MIPS code by $s1.
Here are the instructions in main:
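A sketch, under the register assumptions just stated:

```mips
move $a0, $s0     # argument i is passed in $a0
jal  myfunction   # call; jal writes the return address into $ra
move $s1, $v0     # m = the value returned in $v0
```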
• The parent assumes that temporary registers $t0,...,$t7 may be written over by the child.
If the parent needs to keep any values that are in $t0,...,$t7 prior to the call, then it should
store these values into Memory prior to the call.
In addition, the parent assumes that the $v0, $v1 registers can be overwritten, so if the parent
wants to keep values in those registers then the parent needs to store them in Memory prior
to the call and (re)load them after the call.
• The parent assumes that the values in the saved registers $s0,..,$s7 will still be there after
the return from the call. The parent does not store the values in these registers into Memory
before making the call. Thus, if a child wishes to use any of the $s0,..,$s7 registers, it must
store the previous values of those registers into Memory prior to using the registers, and then
load these values back into the register(s) prior to returning to the parent.
In addition, the parent assumes that the $a0,...,$a3 registers will be preserved across the
function call. So if the child wants to use these registers, then it should first store them in
Memory and later it should load them back prior to returning to the parent.
Let’s look at how this is done.
The Stack
Register values that are stored away by either the parent or child function are put on the stack,
which is a special region of MIPS Memory. The stack grows downwards rather than upwards (which
is strange since we will sometimes use the phrase “top of the stack”). We refer to the base of
the stack as the address of the top of the stack when the stack is empty. The base of the stack is
at a (fixed) MIPS address 0x7fffffff, which is the halfway point in MIPS Memory, and the last
location in the user part of Memory. (Recall that the kernel begins at 0x80000000.)
[Figure: MIPS Memory map. From high to low addresses: the kernel at the top (up to 0xffffffff); below it the stack, growing down toward the heap; the heap above the user data segment at 0x10010000; “static” (global) data at 0x10000000, with $gp = 0x10008000; and a reserved region at the bottom.]
The address of the “top” of the stack is stored in register $29 which is named $sp, for stack
pointer. This address is a byte address, namely the smallest address of a byte of data on the stack.
To push a word onto the stack, we decrease $sp by 4 and use sw. To pop a word from the stack,
we use lw and increase the $sp by 4. I emphasize that an address always means a byte address,
and so when we use an address in a load/store word instruction, we mean the word that starts at
a particular byte and occupies that byte and the next three bytes after it, which have larger byte
addresses.
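The push and pop operations just described can be sketched as follows ($t0 is an illustrative register):

```mips
# push the word in $t0
addi $sp, $sp, -4    # grow the stack down by one word
sw   $t0, 0($sp)     # store at the new top of the stack

# pop a word into $t0
lw   $t0, 0($sp)     # load from the top of the stack
addi $sp, $sp, 4     # shrink the stack by one word
```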
ASIDE: In the figure on the previous page, note that the stack grows downward toward another
area of user Memory called the “heap” (which is not to be confused with the heap data structure that
you learned about in COMP 250). The heap is used to allocate memory for variables other than
what we are discussing here: for example, last lecture we discussed assembler directives and how
you could define strings and other data. These go on the heap. In the C programming language,
when you “malloc” and “free”, you use the heap. In Java, when you construct “new” objects, you
use the heap. Unlike the stack, the heap is not necessarily a contiguous region of memory, e.g. when
you free space (C) or garbage collect an object that is no longer referenced (Java), you leave a hole
in the heap. The kernel will try to fill that hole later, rather than just moving the heap boundary
upwards towards the stack, since if you just keep moving the boundary up then eventually the heap
and stack will collide and the program will run out of space.
parent:
        ...
        move $a0, $s0        # pass argument
        addi $sp, $sp, -8    # make room for two words on the stack
        sw   $t2, 4($sp)     # store the temporary registers
        sw   $t5, 0($sp)
        jal  child
        lw   $t2, 4($sp)     # (re)load temporary registers
        lw   $t5, 0($sp)
        addi $sp, $sp, 8     # pop the two words off the stack
        ...
        lw   $s0, 8($sp)     # restore the registers from the stack
        lw   $s2, 4($sp)
        lw   $s3, 0($sp)
        addi $sp, $sp, 12
        move $v0, $s2        # write the value to return
        jr   $ra             # return to caller
[Figure: a call tree rooted at main, with edges among main, function1, function2, and function3; function3 is a leaf.]
In this example, main, function1, and function2 are all functions that call other functions.
function3, however, does not call any other functions. When a function calls itself, or when
it calls another function, it needs to store any temporary registers ($t), as discussed above. But it
also needs to store on the stack:
• $ra : e.g. When function1 calls function2, the instruction “jal function2” automatically
writes into $ra. Thus, function1 needs to store the value of $ra before it makes this call,
and then reload it after the call. Otherwise it will not be able to return to main.
• any of $a0,$a1,$a2,$a3 that are needed by the caller: e.g. when function1 was called by
main, it was passed certain arguments. These arguments will typically be different from the
arguments that function1 passes to function2.
• $v0,$v1: if function1 has assigned values to these registers prior to calling function2, then
it would need to store these assigned values. (For example, if function1 uses the $v0 register
as part of a syscall).
If a function (say function3) does not call any other functions (including itself, recursively)
then we say that such a function is a leaf function, in that it is a leaf in the call tree. Leaf functions
are relatively simple because there is no need to store $ra or argument registers $a0, .., $a3 or value
registers $v0, $v1 (unless they are used by syscall as part of the function).
Stack frames
Each function uses a contiguous set of bytes on the stack to store register values that are needed by
its parent or that it will need once its child has returned. This set of contiguous bytes in memory
is called a stack frame. Each function has its own stack frame.
[Figure: the stack frame of function2, holding saved register values ($s0, $s1, $t4, $a0, $a1, $v0, $ra), with an optional frame pointer $fp marking the start of the frame.]
Stack frames can hold other values as well. If a function declares local variables that are too
big to be kept in registers (for example, many local variables, or an array), then these variables will
also be located in the stack frame.
Sometimes you don’t know in advance how big the stack frame will be, because it may depend
on data that is defined only at runtime. For example, the function might make a system call and
might (based on some condition that may or may not be met) read data from the console, and that
data could be put on the stack. In these cases, you need to move the stack pointer as the stack
frame changes size. In these cases, it may be useful to mark the beginning of the stack frame.
Such a marker is called the frame pointer. There is a special register $fp = $30 for pointing to the
beginning of the stack frame on the top of the stack. I won’t be using it. I am mentioning it for
completeness only.
Finally, note that the first stack frame is from main and that it has a return address. The reason
is that when the program is over, main will return to the kernel.
Example: sum to n
Here is a simple example of recursion. It is similar to the “factorial” function except that here we
are computing the sum of the values from 1 to n rather than their product. Of course, you would
not use recursion to compute the sum of the first n numbers since in fact this sum is n(n+1)/2. I
use this example only to illustrate how recursion works.
Here is the code in a high level language, followed by MIPS code.
sumton:
beq $a0, $zero, basecase # if n == 0 then branch to base case
addi $sp,$sp,-8 # else , make space for 2 items on the stack
sw $ra, 4($sp) # store return address on stack
sw $a0, 0($sp) # store argument n on stack
# (will need it to calculate returned value)
addi $a0, $a0, -1 # compute argument for next call: n = n-1
jal sumton # jump and link to sumton (recursive)
lw $a0, 0($sp) # restore the argument n from the stack
lw $ra, 4($sp) # restore the return address
addi $sp,$sp,8 # pop the 2 items off the stack
add $v0, $v0, $a0 # returned value is n + sumton(n-1)
jr $ra # return to parent
basecase:
addi $v0, $zero, 0 # assign 0 as the value in $v0
jr $ra # return to parent
In today’s lecture we will look at two “co-processors”, namely the floating point processor (called
CP1 or the FPU) and the kernel processor (called CP0 or the “system control” processor). We will
look at the MIPS assembly language instructions for these processors.
This is the last lecture about MIPS programming. After this, we will go back to the circuits
and connect the general ideas about circuits to the particular instructions we have seen in MIPS,
mostly CPU instructions but occasionally CP0 too.
You can only read from the product register. You cannot manipulate it directly. In MARS, the
product register is shown as two 32 bit registers, HI and LO. If the HI register has all 0’s, then the
product that is computed can be represented by 32 bits (what’s in the LO register). Otherwise, we
have a number that is bigger than the maximum int and it would need to be treated separately.
Details are omitted here.
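As a sketch of how the product register is used (register choices illustrative; mflo and mfhi stand for "move from LO/HI"):

```mips
mult $s0, $s1    # HI:LO = 64 bit product of $s0 and $s1
mflo $t0         # move the low 32 bits of the product into $t0
mfhi $t1         # move the high 32 bits into $t1
```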
What about division? To understand division, we need to recall some terminology. If we divide
one positive integer by another, say 78/21, or more generally “dividend/divisor”, then we get a
quotient and a remainder, i.e.

dividend = quotient × divisor + remainder, e.g. 78 = 3 × 21 + 15.
In MIPS, the divide instruction also uses the HI and LO registers, as follows:
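A sketch (register choices illustrative):

```mips
div  $s0, $s1    # LO = quotient of $s0 / $s1, HI = remainder
mflo $t0         # $t0 = the quotient
mfhi $t1         # $t1 = the remainder
```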
[Figure: the CPU’s integer registers $0,..,$31 with integer arithmetic (division, multiplication, logical ops) alongside coprocessor 1’s floating point registers $f0,..,$f31 with floating point arithmetic (division, multiplication, int–float convert); mtc1 and mfc1 move words between the two register files, and lw/sw and lwc1/swc1 move words between each register file and Memory (2^32 bytes).]
The MIPS instructions for adding and subtracting floating point numbers are of the form:
add.s $f1, $f0, $f1 # single precision add
sub.s $f0, $f0, $f2 # single precision sub
add.d $f2, $f4, $f6 # double precision add
sub.d $f2, $f4, $f6 # double precision sub
Having a separate FPU takes some getting used to. For example, the following instructions have
incorrect syntax and are not allowed.
add.s $s0, $s0, $s1 # NOT ALLOWED (add.s expects FPU registers)
add $f0, $f2, $f2 # NOT ALLOWED (add expects CPU registers)
1. By “chip”, I mean the silicon-based electronics which contains the combinational and sequential circuits.
similarly for double precision, except now we must use an even-numbered register for the argument.
Note that the memory address is held in a CPU register, not in a float register!
To load/store double precision, we could use two operations to load/store the two words. It is
easier though to just use a pseudoinstruction:
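A sketch, using labels like those in the example further below (l.d and s.d expand to two single-word transfers; the label d2 is illustrative):

```mips
l.d $f2, d1    # load the two words of a double into the pair $f2:$f3
s.d $f2, d2    # store the double at label d2
```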
Example 1
Let’s say we have an integer, say -8 in $s0, and we want to convert it to a single precision float and
put the result in $f0. First, we move the integer from the CPU to FPU (c1). Then we convert.
addi $s0, $0, -8 # $s0 = 0xfffffff8
mtc1 $s0, $f0 # from $s0 to $f0 -> $f0 = 0xfffffff8
cvt.s.w $f0, $f0 # from w to s -> $f0 = 0xc1000000
As an exercise, recall from the first few lectures of the course why the registers have the coded
values shown on the right.
Similarly, suppose we have a single precision float (in $f0) and we want to convert it to an
integer and put it into $s0. Here is how we could do it.
cvt.w.s $f2, $f0 # we use $f2 as a temp here
mfc1 $s0, $f2
Example 2
int i = 841242345 // This value in binary would be more than 23
// bits, but less than 32 bits. (Specifically
// we cannot write it exactly as a float.)
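A quick Python check (again of IEEE 754, not of MIPS) confirms that this particular value does not survive a round trip through single precision, because only 23 significand bits are kept:

```python
import struct

i = 841242345                          # needs 30 bits in binary
as_float = struct.unpack('<f', struct.pack('<f', i))[0]
print(int(as_float) == i)              # False: the float is only an approximation
print(int(as_float) % 64)              # 0: the value was rounded to a multiple of 2^(29-23)
```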
Example 3
double x = -0.3;
double y = (float) x;
Here it is in MIPS:
.data
d1 : .double -0.3
.text
l.d $f0, d1 # $f0 = 0xbfd3333333333333 <- these two...
cvt.s.d $f2, $f0 # $f2 = 0xbe99999a
cvt.d.s $f4, $f2 # $f4 = 0xbfd3333340000000 <- ... are different
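The three bit patterns in the comments can be reproduced in Python with the struct module. This checks the IEEE-754 conversions that cvt.s.d and cvt.d.s perform, not the MIPS instructions themselves:

```python
import struct

def f64_bits(x):
    """IEEE-754 double precision bit pattern of x."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def f32_bits(x):
    """IEEE-754 single precision bit pattern of x (rounded to float32 first)."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

x = -0.3
print(hex(f64_bits(x)))                            # 0xbfd3333333333333
y = struct.unpack('<f', struct.pack('<f', x))[0]   # round to single precision (cvt.s.d)
print(hex(f32_bits(y)))                            # 0xbe99999a
print(hex(f64_bits(y)))                            # 0xbfd3333340000000 (cvt.d.s)
print(x == y)                                      # False: precision was lost
```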
c.__.s $f2, $f4
Here the "__" is a comparison operator such as: eq, lt, le. These compare
operations are R format instructions.
The programmer doesn't have access to the result of the comparison. It is stored in a special
one-bit flag (a D flip-flop), which is written to by the comparison instruction, and read from by the
following conditional branch instruction (bc1t or bc1f).
The same one bit register is used when comparisons are made using double precision numbers.
c.__.d $f2, $f4
System call
I did not mention it in the slides, but you can also print and read float and double from the
console, using syscall. Rather than printing from or reading to an argument register, it uses a
particular coprocessor 1 register, shown in the table below.
• EPC: ($14) The exception program counter contains a return address in the program. It is
the address in the program that follows the instruction that caused the exception.
• Cause: ($13) contains a code for what kind of exception occurred (invalid operation, division
by zero, overflow).
• BadVaddr: ($8) holds the address that led to an exception if the user program tried to access
an illegal address e.g. above 0x80000000.
• Status: ($12): says whether the processor is running in kernel mode or user mode. Says
whether the processor can be interrupted by various possible interrupts (We will discuss
interrupts in a few weeks).
This is the opposite problem. Now I am trying to load from the text (instruction) segment. Note
that the MARS assembler cannot detect this problem, since the problem only occurs at runtime,
once it is determined that $s0 contains an address in the text segment, not the data segment.
Example 4: Overflow
addi $s0, $0, 1
sll $s0, $s0, 30 # $s0 = 2^30
mtc1 $s0, $f0
cvt.s.w $f0, $f0 # $f0 = 2^30 = 1.00000000000000000000000 x 2^30
mul.s $f0, $f0, $f0 # $f0 = 2^60
mul.s $f0, $f0, $f0 # $f0 = 2^120
mul.s $f0, $f0, $f0 # $f0 = 2^240 -> overflow
Interestingly, no exception occurs here. Instead, if you look at the result that ends up in $f0, namely
0x7f80 0000, you will find that it represents +∞, namely 0 for the sign bit, 11111111 for the
exponent, and then 23 0’s for the significand.
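We can confirm in Python that 2^240 exceeds the single precision range while 2^120 does not, and that the bit pattern 0x7f800000 is indeed +∞. This is a check of IEEE 754, not of the MIPS instructions:

```python
import struct

FLT_MAX = (2 - 2**-23) * 2.0**127        # largest finite single precision value
print(2.0**120 <= FLT_MAX)               # True: 2^120 is representable
print(2.0**240 <= FLT_MAX)               # False: 2^240 overflows single precision

inf_bits = struct.unpack('<I', struct.pack('<f', float('inf')))[0]
print(hex(inf_bits))                     # 0x7f800000: sign 0, exponent 11111111, significand 0
```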
Example 6: 0/0
mtc1 $0, $f1
div.s $f2, $f1, $f1 # 0/0
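In IEEE 754, 0/0 quietly produces NaN ("not a number") rather than raising an exception. The Python sketch below decodes the NaN bit pattern (exponent all 1s, nonzero significand); note that Python itself raises ZeroDivisionError for 0.0/0.0, so we inspect math.nan directly:

```python
import math
import struct

bits = struct.unpack('<I', struct.pack('<f', math.nan))[0]
exponent = (bits >> 23) & 0xFF
significand = bits & 0x7FFFFF
print(exponent == 0xFF and significand != 0)   # True: this is how NaN is encoded
```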
You are familiar with how MIPS programs step from one instruction to the next, and how
branches can occur conditionally or unconditionally. We next examine the machine level repre-
sentation of how MIPS goes from one instruction to the next. We will examine how each MIPS
instruction is “decoded” so that data is passed to the appropriate places (e.g. from register to
memory, from memory to a register, from a register to another register, etc). By the end of this
lecture and the next, you will have a basic understanding of how the combinational and sequential
circuits that you learned at the beginning of this course are related to the MIPS instructions and
programs that we have covered in more recent lectures.
Instruction fetch
Both data and instructions are in Memory. For an instruction to be executed, it first must be read
out of Memory. MIPS has a special program counter register (PC) that holds the address of the
current instruction being executed. As a MIPS programmer, you are not responsible for “fetching”
the next instruction from memory. This is done automatically. Let's have a look at how this is done.
Consider the simple case of a non-branching instruction such as add or lw or sw. The address of
the instruction that follows this non-branching instruction is, by default, the address of the current
instruction plus 4 (that is, plus 4 bytes or one word). Because we are assuming that each instruction
takes one clock cycle, at the end of the clock cycle the PC is updated to PC+4.
The current instruction (add or lw or sw, ...) is fetched by using PC to select a word from the
text part of Memory, namely the word containing the current instruction. Later in the course, we
will see how this is done. For now, it is enough for you to understand that an address is sent to
Memory and the instruction at that address is read out of Memory. To summarize, the contents
of the PC (a 32 bit address) is sent to Memory and the instruction (also 32 bits) starting at that
address is read from Memory.
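As a toy model (Python standing in for hardware; the addresses and instruction strings are made-up examples), the fetch step amounts to indexing Memory with the PC and then adding 4:

```python
# Instruction memory, indexed by byte address; each instruction occupies 4 bytes.
text_segment = {
    0x00400000: "add $t2, $t0, $t1",
    0x00400004: "lw  $t3, 0($t2)",
    0x00400008: "sw  $t3, 4($t2)",
}

pc = 0x00400000
for _ in range(3):
    instruction = text_segment[pc]    # send PC to Memory; read the instruction out
    print(hex(pc), instruction)
    pc = pc + 4                       # default update at the end of the clock cycle
```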
[Figure: the 32-bit PC is sent to Memory (instructions), and the 32-bit instruction read out is
routed to various parts of the datapath, depending on whether it is R, I, or J format.]
Let’s next look at several examples of instructions and consider the “datapaths” and how these are
controlled.
The opcode field and funct fields together encode that the operation is addition (rather than some
other arithmetic or logical operation). The three register fields of the instruction specify which
registers hold the operands and which register should receive the output of the ALU. Thus, the op
code and funct fields are used to control what data gets put on what lines and when data gets
written. I will say more about how the control signals are determined next lecture. For now, we
will concentrate on the datapaths, as shown in the figure below.
[Figure: datapath for add. The PC selects an instruction from Memory (instructions). The 12 bits
of the opcode and funct fields go to the control unit, which produces the WriteEnable ("RegWrite")
and ALU op signals. Three 5-bit fields select registers; two 32-bit values are read out and fed to
the ALU, and the 32-bit ALU result is written back into a register.]
• new PC value is computed and written (at the end of clock cycle)
• control signals for ALU operation are determined (to be discussed next lecture)
• ReadData values are read from two registers and input to ALU; ALU operation is performed
• ALU result is written into the destination register (at the end of clock cycle)
The steps above can all be done in a single clock cycle, provided the clock interval is long enough
that all circuits have stabilized (and they would need to be designed to ensure this). For example,
the ALU involves carry values which take time to ripple through.
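The field decoding itself is simple bit slicing. Here is a sketch in Python, with one hand-assembled example word (add $t2, $t0, $t1, which encodes as 0x01095020 under the R-format field layout):

```python
def decode_r_format(word):
    """Slice a 32-bit R-format instruction word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6)  & 0x1F,
        "funct":  word & 0x3F,
    }

# add $t2, $t0, $t1 -> op=0, rs=8 ($t0), rt=9 ($t1), rd=10 ($t2), shamt=0, funct=0x20
fields = decode_r_format(0x01095020)
print(fields)
```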
[Figure: datapath for lw. The 6-bit opcode goes to the control unit, which produces the
WriteEnable ("RegWrite") and ALUop signals. A 5-bit field selects the base register, whose 32-bit
value is one ALU input; the 16-bit offset is sign extended to 32 bits and forms the other ALU
input. The ALU result is sent as an address to Memory (data), and the 32-bit word read out is
written into the register selected by the other 5-bit field.]
For lw, the first three steps are the same as in the add instruction, but the remaining steps are
quite different.
• base address of Memory word to be loaded is read from a register and fed to ALU
• control signals are set up for ALU operation (see next lecture)
• ALU computes sum of base address and sign-extended offset; result is sent to Memory
• word is read from Memory and written into a register (end of clock cycle)
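In Python pseudo-hardware (the register and Memory contents are made-up values), the lw steps reduce to a sign-extended add followed by a Memory read:

```python
def sign_extend16(x):
    """Sign-extend a 16-bit offset field to a full integer value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

registers = {8: 0x10010008}             # $t0 holds a base address (made-up value)
data_memory = {0x10010004: 42}          # one word of data (made-up value)

# lw $t1, -4($t0): the ALU adds base and sign-extended offset; Memory is then read.
address = registers[8] + sign_extend16(-4)
registers[9] = data_memory[address]     # written into $t1 at the end of the clock cycle
print(hex(address), registers[9])       # 0x10010004 42
```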
Note that in the figure above I have colored the instruction segment (left) and the data segment
(right) to remind you roughly where these segments are located and the fact that they don’t overlap.
[Figure: datapath for sw. As for lw, the 6-bit opcode drives the control unit (ALUop, MemWrite),
the base register value and the sign-extended 16-bit offset are added in the ALU, and the 32-bit
sum is sent as an address to Memory (data). The 32-bit value read from the second register is
written into Memory at that address.]
• base address for Memory word is read from a register and fed to ALU
• ALU computes sum of base address and sign-extended offset; result is sent to Memory
• the value read from the second register is written into Memory at that address (end of clock cycle)
Next consider the conditional branch instruction bne, which is also I format. This instruction is
more interesting because the PC might not proceed to the next instruction in Memory at the next
clock cycle.
To determine whether the branch condition is met, the bne instruction uses the ALU, and
performs a subtraction on the two register arguments. If the result is not zero, then the two
registers do not hold the same values and the program should take the branch. Otherwise, the
program continues to the next default instruction. Specifically, if the result of the ALU subtraction
is not zero, then
P C := P C + 4 + offset
else
P C := P C + 4 .
It may seem a bit strange that 4 is added in both cases, but that's in fact what MIPS does. This
makes sense when you consider the datapath. The offset is relative to the current instruction
address plus 4, rather than to the current instruction address. (See example on slides.)
See the figure below for the datapath. The multiplexor in the upper part of the figure selects
between PC+4 and PC+4+offset. The selector signal depends on the 32 bit output of the ALU,
which has performed a subtraction using the two registers specified by the bne instruction. This
32 bit ALU result is fed into a 32 bit OR gate. The output of the OR gate is labelled "NotZero"
in the figure. This NotZero signal is 1 if and only if the subtraction result is non-zero, that is, if
the two registers do not have the same contents.
Note that the branch is only taken if indeed the instruction is bne. In the figure, the NotZero
signal is ANDed with a bne instruction control signal that indicates it is indeed a bne instruction.
(I have labelled the branch control as “bne instruction” in the figure.) The branch is taken when
both the NotZero signal is ON and the branch control is ON (1). e.g. For an add instruction or any
other R-format instruction, the branch signal would be OFF (0).
One other detail is worth noting. Instructions are word aligned in Memory, namely the two lowest
order bits of an instruction address always have the value 0. Thus, in the 16 bit offset field of a
conditional branch, MIPS doesn't waste these two bits by always setting them to 00. Instead, the
true offset is 4× the value represented in the 16 bit field. That is, the offset value is
"shifted" by two bits to the left before it is added to PC+4. Careful: no shift register is required
for this. Instead, two 0 wires are inserted directly, and the sign extend circuit extends from 16 to
30 bits, so in total there are 32 bits.
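The branch-target arithmetic can be checked with a few lines of Python (the PC and offset values are arbitrary examples, not taken from any particular program):

```python
def sign_extend16(x):
    """Sign-extend a 16-bit field to a full integer value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def branch_target(pc, offset16):
    """New PC when the branch is taken: PC + 4 + (sign-extended offset << 2)."""
    return pc + 4 + (sign_extend16(offset16) << 2)

print(hex(branch_target(0x00400008, 3)))       # 0x400018: forward by 3 instructions
print(hex(branch_target(0x00400008, 0xFFFD)))  # 0x400000: backward by 3 (offset -3)
```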
Here are the steps of the bne instruction:
[Figure: datapath for bne. A multiplexor selects the new PC from PC+4 (input 0) or
PC+4+offset (input 1), where the 16-bit offset is sign extended and shifted left by 2 bits to form
a 32-bit value. The multiplexor's selector is the AND of the NotZero signal and the "bne
instruction" control signal.]
J-format op address
numbits 6 26
You might think that 26 bits specify an offset from PC+4 just like the conditional branches. How-
ever, it is not done this way. Rather, the offset is from the address 0x*0000000, where * is the high
order 4 bits of the PC register, namely bits 28-31. Recall that the MIPS instructions of a user
program go up to but not including 0x10000000. Thus, all user instructions have 0000 as their highest
four bits i.e. 0x0******* in hexadecimal. MIPS doesn't always concatenate 0000 onto the front
of a jump instruction, since there are kernel programs as well. Instructions in the kernel programs
have addresses that are above 0x80000000. So jump instructions in the kernel do not take 0000 as
their upper four bits. Rather, they take (1000)2 as their upper four bits, i.e. 0x8.
Moreover, because instructions are word aligned, the 26 bit field is used to specify bits 2-27.
Bits 0 and 1 in an instruction address always have the value 00.
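The jump-address computation can be sketched in Python (the PC values and 26-bit fields below are arbitrary examples):

```python
def jump_target(pc, addr26):
    """New PC for j: the high 4 bits of PC, concatenated with the 26-bit field shifted left 2."""
    return (pc & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

print(hex(jump_target(0x00400008, 0x0100002)))  # user mode: upper bits 0000 -> 0x400008
print(hex(jump_target(0x80000100, 0x0000004)))  # kernel mode: upper bits 1000 -> 0x80000010
```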
[Figure: datapath for j. The upper 4 bits of the PC (bits 28-31) are concatenated with the 26-bit
address field of the instruction (bits 0-25) shifted left by 2 bits, giving a 32-bit jump address. A
multiplexor with selector PCSrc chooses between this address and PC+4. The 6-bit opcode (bits
26-31) goes to the control unit.]
I-format op rs rt immediate
numbits 6 5 5 16
For either R or I format instructions, at least one register is read. For this reason, one of the
register slots in the instruction format is always used for a read. This is the rs (source) slot,
namely bits [21-25]. A 5 bit line is always fed into the ReadReg1 input and specifies which
register we read from.
It is not true, however, that an instruction always writes to one of the 32 registers. For
example, sw does not write to a register. The rt slot, namely bits [16-20], holds a second
register number which add and sw read from, and which lw writes to.
If an R-format instruction uses three registers (add), then two of them are read from and one
of them is written to. The rd register (R format) is the one that is written to.
To summarize, rs always specifies a read register ("s" standing for source), rd always specifies
a write register ("d" standing for destination), and rt is a register that is either read from or
written to, depending on the instruction. (I am not sure that "t" stands for anything.)
If you look at the merged datapath, you see that the three 5-bit register number lines are
routed to the register array and that a multiplexor is used to decide which line goes to the
WriteReg input. The selector signal for this multiplexor is called RegDst (register destination);
the value is 1 means rd is selected (e.g. add) and the value is 0 means rt is selected (e.g. lw).
• ALUSrc: One input of the ALU always comes from a register. So one of the ReadData registers
(namely ReadData1) is hardwired to one of the ALU inputs. The other ALU input comes
from either a register (in the case of an add) or from the signed-extended 16 bit immediate
field (in the case of a lw or sw). Thus a multiplexor is needed to choose from among these
two possible inputs to the second ALU input. The selector for this multiplexor is ALUSrc.
• MemtoReg: The third multiplexor shown in the figure selects where the register write value
comes from. It might come from the ALU (R-format) or it might come from Memory (lw).
Other controls are shown in the merged datapath. RegWrite specifies whether or not we are
writing to a register. MemWrite specifies whether or not we are writing to Memory. MemRead
specifies whether or not we are reading from Memory. Note that the RegWrite signal was called
"WriteEnable" in lecture 6 page 5. The MemWrite signal was shown in lecture 6 p. 6,7.
The ALUOp control variable needs more explanation. This control variable specifies which
operation the ALU should perform. For example, add, sub, slt, and, or all require different
operations to be performed by the ALU. Recall that the operation is determined by the instruction's
funct field, since all R-format instructions have the same opcode, so the funct field will also
be needed as an input to the circuit that computes the control signals.
Recall how the ALU circuit works. It uses full adders, along with AND and OR gates, so that
the inputs A and B are either summed, ANDed, or ORed. To compute subtraction, the B input
is inverted, and a Binvert control is used for this. So, there are really two controls that
need to be coded here: the Operation control and the Binvert control.
As you recall from Lecture 4, the ALU circuit has two control inputs. There is a Binvert which
is used for subtraction and that takes the two's complement of the B input, and there is the
operation (either AND, OR, or +). For the latter, there are three operations to be coded, so we
need two bits. Note that this 2-bit Operation control is different from the 6-bit "opcode" field
which is part of the 32 bit instruction.
Let's suppose that the Operation control for the ALU uses 10 for adding the two inputs. The
add and sub instructions both use 10. (The difference between add and sub comes from the
Binvert control, not the Operation control.) The lw and sw instructions also perform an addition
and so they have the same Operation control. However, the and and or instructions use different
Operation controls: and has Operation control 00, and or has Operation control 01. These 2 bit
values are not meaningful in themselves: they act just as selector signals within the circuit.
There is one more Operation value that could be used: 11. In fact, this is used for slt, "set if less
than". This instruction performs a subtraction (A - B), and if the result of the subtraction is negative,
then the result of the instruction is defined to be 1, and otherwise it is 0. An extra circuit is
required for this. (See Exercises 5.)
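The two-control scheme can be sketched as a toy ALU in Python. This is a simplification, not the real circuit: the sign-bit test for slt ignores overflow cases (which the extra circuit of Exercises 5 handles):

```python
def alu(a, b, binvert, operation):
    """Toy 32-bit ALU: Operation 00 = AND, 01 = OR, 10 = add, 11 = slt. Binvert makes
    the adder compute A - B by inverting B and setting the carry-in (two's complement)."""
    mask = 0xFFFFFFFF
    if binvert:
        b = (~b) & mask
    total = (a + b + (1 if binvert else 0)) & mask
    if operation == 0b00:
        return a & b
    if operation == 0b01:
        return a | b
    if operation == 0b10:
        return total                   # add (or subtract, when Binvert is on)
    return (total >> 31) & 1           # slt: 1 iff A - B is negative (sign bit)

print(alu(7, 5, 1, 0b10))              # 2  (7 - 5)
print(alu(3, 5, 1, 0b11))              # 1  (3 < 5)
```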
[ASIDE: In fact, in MIPS, the ALU control circuit that implements the above truth table uses two
combinational circuits: one that handles the case of opcode 000000 and uses the funct field input,
and one that handles the case where the opcode is different from 000000. I am omitting the details
this year. If you look it up e.g. in Patterson and Hennessy's text, you'll find that the controls are
labelled differently from what I am giving here. Conceptually though, there is no difference.]
The difference between jal and j is that jal also writes PC+4 into register $ra, which is $31.
I have written this 5 bit value as 31 in the figure. I have also added a line which shows how PC+4
gets written into register $31.
Notice in the figure that the two controls WriteDataSrc and MemtoReg are now "generalized".
All I mean by that is that previously these controls were 1 bit because they had to choose between
two possibilities, but now there are three possibilities and so they need to be two bits. (I could have
changed the names of these controls to something more general, but we won't need them further so
there is no need to bother.)
Various errors can occur while a program is running, such as integer division by zero, or a bad
address used for an instruction or for a load from or store to Memory. When such errors occur,
the computer stops executing the program
and jumps to the kernel. The program counter PC must be updated accordingly. In particular,
the program jumps to the first instruction address in the kernel, namely 0x80000000. This is the
first instruction of a kernel program called the exception handler. Here we consider briefly the
mechanism of how this is done.
The PC is updated by selecting from one of many inputs to a multiplexor, namely a multiplexor
whose control is PCSrc. Now we add another possible address (0x80000000) that can go there,
as shown in the figure below.
[Figure: the multiplexor that feeds the PC, with selector PCSrc, gains another 32-bit input: the
exception handler address 0x80000000.]
There can be many types of exceptions. For example, a syscall is another type. And there
are other types of exceptions called interrupts which we will discuss later. For now, I just want
to emphasize that the PC can take many different values on the next clock cycle and that the
value the PC takes is determined by a selector PCSrc. This control variable can depend on various
combinations of other control variables which describe the present state of the program.
Pipelining
MIPS instructions can be thought of as consisting of up to five stages: instruction fetch (IF),
instruction decode (ID), ALU execute (ALU), memory access (MEM), and write back (WB).
Some instructions such as lw use all of these stages whereas other instructions such as add use just
some of these stages. In a single cycle implementation, each clock cycle must be long enough for all
stages to complete. This would be inefficient, however. One way to understanding this inefficiency
is to note that each instruction is only in one stage at any time, and so the other stages are not
doing anything during that time. The idea of pipelining is to use all stages all the time, to the
extent possible. In a pipeline, instructions move along the datapath, one stage at a time, through
all stages. As in the single cycle model, the PC is updated at every clock cycle. This implies that
the next instruction begins before the previous instruction is finished. At first glance, it is difficult
to believe that this could work. And indeed there are subtle issues to be addressed to make it work.
There are many familiar examples of pipelining in the world: a car wash, an assembly line
in a factory, a cafeteria, etc. The key idea is that you can use resources most efficiently when all
workers are busy. One design challenge is that it is generally not possible to keep everyone busy
at once. But you want to design the system to come as close to that as you can. The other
design challenge is that there are dependencies between workers, so that when one worker doesn’t
do his/her/its job, then this prevents another worker from doing its job. Have a look at a clip
from Charlie Chaplin’s film Modern Times for an amusing example (But MUTE the volume.)
[Link]
Real MIPS processors use a pipeline, and the PC is updated at every clock cycle. Each pipeline
stage is given the same amount of time (one clock cycle) which is long enough that each pipeline
stage can complete in this time. The five pipeline stages are listed above.
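A few lines of Python can generate the staggered stage diagrams used below. This is purely an illustration of the timing (an ideal pipeline with no stalls), not a model of the hardware:

```python
STAGES = ["IF", "ID", "ALU", "MEM", "WB"]

def pipeline_rows(instructions):
    """One text row per instruction, showing its stage in each clock cycle."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for start, name in enumerate(instructions):
        cells = ["   "] * n_cycles          # instruction i enters stage s at cycle i + s
        for offset, stage in enumerate(STAGES):
            cells[start + offset] = f"{stage:<3}"
        rows.append(f"{name:<5}" + " ".join(cells))
    return rows

for row in pipeline_rows(["add", "sub", "lw"]):
    print(row)
```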
Recall that in the single cycle model, each of the control variables was given a value that
depended on the instruction. (These control variables were typically either selector signals for
multiplexors, or write enable signals.) The situation is more subtle in a pipeline scheme, since
there are multiple instructions present in the pipeline. New registers are needed between successive
stages of the pipeline, in order to keep track of which control signal is for which instruction. These
registers are referred to by a pair of stages – see below. The pipeline registers contain all control
information that is needed by that instruction: the values read out of a pipeline register are the
control values for one instruction, and values are written into the register at the end of the stage
(end of clock cycle). These pipeline registers pass the control signals and other values through the
pipeline.
• IF/ID: During a clock cycle, this register contains the instruction (one word) that was fetched
from Memory in the previous clock cycle. There is no control information since the instruction
hasn't been decoded yet. At the end of the clock cycle, the instruction that is being fetched
in the current clock cycle is written into this register.
• ID/ALU: This register contains parts of the fetched instruction (such as register numbers,
the immediate fields, offsets) plus controls that are needed to execute the instruction. These
control signals are available by the end of the ID stage, and these controls are for all remaining
stages of the pipeline.
• ALU/MEM: This register contains the controls that are needed to execute the remaining two
stages (MEM, WB), as well as values that were computed during the ALU stage.
• MEM/WB: This register contains controls needed to execute the WB stage, as well as any
data value that needs to be written back. These data values might have been retrieved from
Memory during the MEM stage, or they may be values that were computed in the ALU stage.
[Figure: pipelined datapath. The PC and Memory (instructions) feed the register array (with
inputs/outputs RegWrite, ReadReg1/ReadData1, ReadReg2/ReadData2, WriteReg, WriteData),
the ALU, and Memory (data). The Control Unit produces the control signals, and the pipeline
registers carry the WB controls through three stages, the MEM controls through two, and the
ALU controls through one.]
Many subtle issues arise in pipelining, since several instructions are processed during the same
clock cycle. We are used to thinking of a program as completing one instruction before another
begins, and we judge the correctness of a program with this assumption in mind. But pipelining
does not obey this assumption. The rest of this lecture will address some of the problems that arise
and how to solve them.
Data Hazards
Example 1
Take the two instructions:
For a single cycle machine, there is no problem with the fact that the register written to by add is
the same as one of the registers that sub reads from. But a pipelined machine does have problems
here. To see why, let’s visualize the pipeline stages of a sequence of instructions using a diagram as
follows:
clock cycles →
add   IF   ID   ALU  MEM  WB
sub        IF   ID   ALU  MEM  WB
Notice that the sub instruction has executed the ALU stage before the add instruction has written
its result back into the register. This will produce the wrong answer since one of the arguments of
the sub instruction is the result of the add. Such a problem is called a data hazard. Data hazards
arise when the lead instruction writes to a register that the trailing instruction is supposed to read
from.
There are a few ways to avoid this problem. The easiest is to insert an instruction between
the add and sub which does nothing i.e. it does not write into any of the 32 registers, nor does it
write into data memory. Such an instruction is called nop, which stands for “no operation”. This
instruction is a real MIPS instruction. The key property of nop is that it doesn’t write to any
register or to memory and so it has no effect. Including nop instructions in a program is called
stalling. The nop instruction itself is sometimes called a bubble.
Of course, this is not such a satisfying solution as the inserted nop instructions slow down the
pipeline, which defeats the purpose of pipelining!
An alternative solution is to use the fact that the result of the add instruction is available at the
end of the ALU stage of the add instruction, which is before the ALU stage of the sub instruction.
Specifically, the result of add’s ALU stage is written into the ALU/MEM register, and so the result
is available there. A method called data forwarding can be used to send the result along a dedicated
path from the ALU/MEM register to an ALU input to be used in sub’s ALU stage.
How could data forwarding be implemented? For the example above, consider the condition
that could be used for data forwarding. Suppose the add instruction has finished its ALU stage
and (eventually) would write the result into some rd register. The next instruction, which would
have just finished its ID stage, would use that same register (specified in its rs or rt field) as one
its ALU inputs. This could be detected by checking the two conditions, which correspond to the
two possible ALU inputs:
ALU/MEM.RegWrite & (ALU/MEM.WriteReg == ID/ALU.ReadReg1)
ALU/MEM.RegWrite & (ALU/MEM.WriteReg == ID/ALU.ReadReg2)
that is, the ALU/MEM register indicates that the leading instruction will write the ALU result into
a register, and this write register is a source register used by the trailing instruction whose controls
are in ID/ALU register.
To forward the data to the ALU, we could put a new multiplexor in front of each ALU input
arm. Each of these would either select the usual line for its input or it would select a forwarded line
(in the case that the condition above is met). I have labelled these controls ALUsrc1 and ALUsrc2.
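The condition can be sketched in Python (the dictionary keys mirror the figure's field labels, and the register numbers are made-up examples; this models the decision, not the control hardware):

```python
def forward_from_alu_mem(alu_mem, id_alu):
    """Decide, per ALU input, whether to forward the ALU/MEM result (ALUsrc1, ALUsrc2)."""
    if not alu_mem["RegWrite"]:
        return False, False
    return (alu_mem["WriteReg"] == id_alu["ReadReg1"],
            alu_mem["WriteReg"] == id_alu["ReadReg2"])

# add $s0, $s1, $s2 followed by sub $s3, $s0, $s4: forward to the first ALU input only.
alu_mem = {"RegWrite": True, "WriteReg": 16}    # $s0 is register 16
id_alu = {"ReadReg1": 16, "ReadReg2": 20}       # reads $s0 and $s4
print(forward_from_alu_mem(alu_mem, id_alu))    # (True, False)
```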
Example 2
Here is a slightly different example.
clock cycles →
lw    IF   ID   ALU  MEM  WB
add        IF   ID   ALU  MEM  WB
The difficulty here is that $s1 argument for the add instruction has not been properly updated
by the lw instruction until after the WB of the lw. You can imagine that this is quite a common
hazard. The reason we load a word into a register is that we want to perform an operation on it!
The easiest (but also least effective) way to correct this hazard is to stall, i.e. insert two nop
instructions, but again this defeats the purpose of pipelining.
lw    IF   ID   ALU  MEM  WB
nop        IF   ID   ALU  MEM  WB
nop             IF   ID   ALU  MEM  WB
add                  IF   ID   ALU  MEM  WB
In fact we need three nop instructions here, not two. Why? When lw is in its WB stage, it
copies a value from the MEM/WB register into the register array, but this value is not yet copied
into the ID/ALU register. Rather, any register values that are copied from the register array into
the ID/ALU register are part of a ‘later’ instruction which is in its ID stage. Note that both WB
and ID stages of two different instructions are using the register array at the same time. This is
fine since one is writing to the register array (WB) and the other is reading from the register array
(ID).
[ASIDE: If this is confusing to you, then recall from way back in the course how a flip flop works. It
consists of two D-latches, and when we are writing into the first D-latch, we cannot read that value
from the output of the second D-latch yet. This is how we can read and write to a flip flop at the
same time. What's true for D flip flops is also true for registers, since registers are made out of D
flip flops.]
A better way to solve the problem is by data forwarding. The new value of $s1 is written into
the MEM/WB register at the end of the lw MEM stage. So, if we stall the add by just one cycle,
then we could forward the value read from the MEM/WB register directly into the ALU as part of
add's ALU stage.
lw    IF   ID   ALU  MEM  WB
nop        IF   ID   ALU  MEM  WB
add             IF   ID   ALU  MEM  WB
How would we control this? We check if the instruction in MEM/WB is supposed to write a
value from Memory into a register rd at the end of its WB stage (namely it is lw), and we check if
this is one of the registers that the ID/ALU instruction is supposed to read from in the ALU stage.
(Note that this is two instructions behind, not one.) The conditions to be detected are either of the
following:
MEM/WB.MemtoReg & (MEM/WB.WriteReg == ID/ALU.ReadReg1)
MEM/WB.MemtoReg & (MEM/WB.WriteReg == ID/ALU.ReadReg2)
When one of these conditions is met, the value that is read from Memory by lw can indeed be
forwarded. Note that it can be forwarded to an input branch of the ALU. We could add multiplexors
ALUSrc1 and ALUSrc2, to select whether this value is read to either (or both) of the input arms
to the ALU.
Note that I have used the same controls ALUsrc1 and ALUsrc2 as in the previous example. Of
course this will not work since these variables cannot each mean two different things. But I assume
you appreciate the spirit of what we are doing here – isolating only datapaths that pertain to a
particular instruction or concept. The real circuit with everything in it is enough spaghetti "to feed
an army".
Example 3
Another solution is to take instructions that come before the lw or after the add – and that are
independent of the lw and add instructions – and reorder the instructions in the program by inserting
these instructions between the lw and add.
lw    IF   ID   ALU  MEM  WB
sub        IF   ID   ALU  MEM  WB
or              IF   ID   ALU  MEM  WB
add                  IF   ID   ALU  MEM  WB
Control Hazards
In a single cycle implementation, the next instruction is determined by the PCsrc control, which is
determined during the single clock cycle. In a pipeline implementation, however, PCsrc control is
not determined right away. Rather, in the best case, it is determined during the instruction decode.
In a pipeline implementation, PC is ”incremented” to PC+4 by default at the end of the instruc-
tion fetch (IF) stage, and thus the next instruction in the program enters the pipeline. However, in
the case of an unconditional branch such as j, jal, jr, the program will not execute the instruction
that directly follows the jump (unless the program explicitly jumps to that instruction, which would
be unlikely). This creates a problem, since the instruction directly following the jump (at PC+4)
will enter the pipeline by default.
There are two steps needed to handle jumps. First, the PC is updated at the end of the jump
instruction's ID stage. That is, once the instruction is decoded, it has been determined that it is
a jump instruction and it has been determined where the program should jump to, and this jump
address must be written into the PC. I will not discuss this solution, since it is essentially the same
as in the single cycle implementation.
Second, the instruction in the program that followed the jump and entered the pipeline should
not be executed. This problem is more subtle. Consider an example:
j Label1
Label2: addi $s0, $s1, 0
sub $s2, $t0, $t1
:
Label1: or $s3, $s0, $s1
j     IF   ID   ALU  MEM  WB
addi       IF   ID   ALU  MEM  WB
sub             IF   ID   ALU  MEM  WB
Again, the problem is that we do not want the addi (or sub) instruction to be executed after
the j instruction.1 How can we solve this problem? Inserting a nop between j and addi will work,
because addi will not be fetched in that case. The nop will enter the pipeline, but the instruction
following the nop will be the one that the program jumps to (after the jump’s ID stage). This nop
could be inserted by the assembler after each j instruction.
Inserting a nop instruction after each j is somewhat unsatisfying, however, since its effect is
just to stall. Is there a better approach? Yes! Rather than inserting a nop after j, one can modify
1
Note that the instructions following the jump might be executed at some other time, namely if some other part
of the code jumps to Label2.
the opcode of the instruction that has just been fetched, namely if the opcode field in the IF/ID
register is j, then the new value of the opcode field that is written in the next clock cycle should be
the opcode of nop. In the above example, this would replace addi with nop which avoids the stall.
Example 4
For a conditional branch instruction, it is not known until the end of the ALU stage whether the
branch is taken and (if so) what the address of the next instruction should be. In particular, it
is not known whether the instruction at address PC+4 (following the beq instruction) should be
executed. For example,
We could stall by inserting a sufficient number of nop instructions between the beq and add instructions
so that, at the completion of the ALU stage of beq, it is known whether the branch condition is
met and so at the end of that stage, the PC can be updated accordingly with the branch address
or with PC+4. That is, the control PCsrc could be computed at the end of the beq’s ALU stage.
In this case, if the branch is taken, then add never enters the pipeline.
A better solution is to put the add instruction into the pipeline in the default way, and then modify
the add instruction after the ALU stage of the beq instruction if the branch is taken. For example,
the nop opcode could be written instead of the add opcode into the ALU/MEM register, and all
the controls could be changed to those of the nop. Alternatively, note that the only damage that
the add instruction could cause is at its WB stage, where it writes the result back into a register.
To prevent this damage, it would be sufficient to change the RegWrite control to 0, in the case that
the branch is taken.
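The RegWrite-clearing trick can be sketched with a toy model of the control signals. The dictionary representation below is my own assumption for illustration, not the actual control unit:

```python
# Sketch: squash the instruction after a taken beq by clearing its
# RegWrite control, so its WB stage writes nothing back to a register.
def controls_after_branch(branch_taken, controls):
    """Return the controls of the instruction following beq; if the
    branch is taken, RegWrite = 0 prevents any damage at WB."""
    if branch_taken:
        squashed = dict(controls)   # copy, leave the original untouched
        squashed["RegWrite"] = 0
        return squashed
    return controls

add_controls = {"RegWrite": 1, "ALUop": "add"}
taken = controls_after_branch(True, add_controls)
not_taken = controls_after_branch(False, add_controls)
```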
Notice that you certainly don’t want to write assembly code which considers all of these issues.
Debugging assembly code with the single cycle model in mind is difficult enough! Rather, you will
want to write your programs in a high level language and have a smart compiler create the
assembly/machine code for you, and this compiler will know all the possible hazards and will have
rules for reordering and inserting nops.
Virtual Memory
The model of MIPS Memory that we have been working with is as follows. There are your MIPS
programs, including main and various functions that are called by main and by each other, and
data used by this program, and there are some kernel functions, e.g. ones that may handle exceptions.
This model makes sense from the perspective of one person writing one program. However, as
you know, many programs may be running on a given computer at any time. For example, when
you run your Windows-based laptop, you may have several programs going simultaneously: a web
browser, Word, Excel, etc. You may even have multiple copies of each of these programs going. The
operating system also has many programs running at any time. (If you are running LINUX, and
you type ps -e, you will get a list of a large number of programs that are running at the current
time.)
A program that is running is called a process. When multiple processes are running simultane-
ously, they must share the computer’s memory somehow. Two related issues arise. First, different
processes use the same 32 bit program address space. (An extreme example is that multiple copies
of a program may be running!) Second, at any time, various registers (PC, registers $0-$31, ...,
pipeline registers, etc) all contain values for just one process. We will deal with the second issue in
a few weeks. For now, let’s consider how processes share the same Memory address space.
We first need to distinguish 32 bit program addresses from physical addresses. The 32 bit pro-
gram address space is called virtual memory and the program addresses themselves are called virtual
addresses. These addresses must be translated into physical memory addresses. This translation is
hidden from the MIPS programmer. As we will see, it is done by a combination of hardware
and software, namely the operating system i.e. kernel.
Besides allowing multiple processes to share the processor, virtual memory allows independence
between the size of the program address space (2^32 bytes) and the size and layout of physical address
space. From the point of view of memory hardware, there is nothing magical about 32 bits for
MIPS program addresses. You know that when you buy a computer, you can choose the amount of
RAM (e.g. 2 GB or 8 GB), and you can choose the size of your hard disk (200 GB vs. 1 TB). These
choices are just a matter of how much money you want to spend. They are not crucial limitations
of the processor you use. This distinction between the size of a program address space (2^32 in
MIPS) and the size of physical memory may seem strange at first, but it will make sense gradually in
the coming lectures.
An analogy for this idea of virtual versus physical addresses is telephone numbers. My office
telephone is 514-398-3740. Do you think that the person in the office next to me should have
telephone number 514-398-3741 ? Of course not. Or do you think that phone numbers 514-399-
xxxx should be close to McGill? No, of course not. You know that telephone numbers are arbitrary,
i.e. there is some arbitrary mapping between phone numbers and physical places. The same is true
for the relation between program addresses and physical addresses.
Physical Memory
MIPS program memory consists of 2^32 bytes. How much is that?
How does that compare to the size of memories that you read about when buying a personal
computer or tablet, or flash drive, etc. ?
Disk memory
Consider some examples, and the technology they use:
• floppy disk holds about 1.4 MB (not used any more), magnetic
Because disk memories have moving parts, their access time is quite slow. The memory is a
mechanical disk which must spin. Data is read or written when the appropriate part of the disk
physically passes over the read or write head. When we talk about a spinning disk, you should
realize that this is very slow relative to the time scales of the processor’s clock. A typical processor
these days has a clock speed of 1 GHz (clock pulses occurring every 10^-9 seconds). If you want to read
a word from memory (as a step in the lw instruction, for example) and the word you are looking
for is on one of the above disks, then you won’t be able to do so in one clock cycle. Rather it will
typically take millions of clock cycles for the disk to spin around until the word is physically aligned
with the read or write head. If the processor has to wait this long every time a lw or sw is executed,
it will be very inefficient. Another solution is needed – and that’s what we’ll be talking about in
the next several lectures.
Disk memories are typically called “external memory”, even though the hard disk on your
desktop or laptop is inside the case of the computer. To avoid these long memory access times
in these computers, the computer has an internal memory that is much faster than the hard disk.
We’ll discuss these below.
Flash memory
Another kind of external memory is flash memory. The memory cards that you put in your digital
cameras or mobile phones, or the USB drives that you keep in your pocket are made of flash memory.
Tablet computers use flash memory (called a solid state drive SSD) instead of a hard disk (HDD)
for their bulk storage and indeed many laptops are using SSD’s now too. Flash is much faster to
access than disk. Typical flash access times are 10^-4 ms, although this will vary a lot between
technologies and it can be much faster than that.
Hard disk memory and flash serve the same purpose. They are non-volatile memories. When
you turn off the power, they maintain their values. They can therefore be used for storing the
operating system software and all your applications.
RAM
The “main memory” inside your computer is called RAM. RAM stands for “Random access mem-
ory”. There is nothing random about it though, in the sense of probability. Rather, it just means
that you can access any byte at any time – with access time independent of the address. Note this
uniform access time property distinguishes RAM from disk memory. (Flash memory also allows
uniform access time. However, flash memory is non-volatile, whereas RAM is volatile.)
Physically, there are two kinds of RAM:1 these are generally called SRAM (static RAM) and
DRAM (dynamic RAM). SRAM is faster but more expensive than DRAM. These two types of
RAM serve very different roles in the computer, as we will see in the coming lectures. SRAM is
used for the cache, and DRAM is used for “RAM.” (The terminology is confusing here, so I’ll repeat
myself. When one says “RAM”, one means “main memory” and the technology used is DRAM.
When one says “cache”, one is referring to SRAM.)
[ASIDE: after the lecture, I was asked by a student how sleeping and hibernation work, with respect
to these different types of memory. Putting your computer to sleep keeps the data and program
state in volatile memory, and the battery is used to provide enough power to maintain the values in
memory. Hibernation is quite different. The program states are stored on disk (non-volatile). You
can unplug your computer and let the battery drain out and the states will be saved. When you
start the computer again, the program states are read off disk and the programs can continue. ]
Memory Hierarchy
You would like to have all your programs running in the fastest memory (SRAM), so that any item
of data or any instructions could be accessed in one clock cycle, e.g. in the data path lectures, we
were assuming that instruction fetch and (data) Memory access took only one clock cycle. However,
because faster memory is more expensive, it is just not feasible to have all instructions
and data sitting in the fastest memory. The typical laptop or desktop has relatively little SRAM
(say 1 MB), much more DRAM (say a few GB), and much much more disk space (hundreds of GB).
It is common to imagine a “memory hierarchy” as follows. This figure captures two concepts:
the first is the order of access: you try to access data and instructions by looking at the top level
first, and then if the data or instructions are not there (in SRAM), you go to the next level (main
memory, DRAM) and if what you are looking for is not there either, then you proceed down to the
next level (HDD or SSD). The second concept is that the sizes of the memories grow with each
level, and the cost per byte decreases.
[Figure: the memory hierarchy. At the top is the cache (SRAM, MB); below it is main memory
(DRAM, GB); at the bottom is external memory (HDD, SSD; GB - TB).]
1 The underlying technologies are not our concern in this course. You can just pretend both SRAM and DRAM
are made out of transistors and other electronics and leave it at that. In fact, there are many subtypes of both
SRAM and DRAM and these are changing remarkably fast.
[Figure: the MIPS programmer's model of memory. Processes see a single MIPS Memory
(2^32 bytes) holding instructions, data, and other regions.]
As discussed earlier today, in order for programs to run on the actual processor using real
memory, program/virtual addresses must be translated into physical addresses. This is done by
partitioning virtual address space and the physical address space into equal sized regions, called
pages.
[Figure: pages are swapped between the MIPS virtual Memory (2^32 bytes: instructions, data,
other) and the levels of the physical memory hierarchy, whose sizes are MB (1 clock cycle access),
GB (10 clock cycle access), and TB (10^6 clock cycle access).]
Let’s work with an example in which each page is 2^12 bytes i.e. 4 KB. The address of a specific
byte within a page would then be 12 bits. We say that this 12 bit address is the page offset of
that byte. Thus, the 32 bit MIPS virtual address would be split into two pieces. The lower order
piece (bits 0-11) is the page offset. The high order piece (bits 12-31) is called the virtual page
number. Each virtual page number of a process corresponds to some physical page number in
physical memory. We’ll say this page is either in RAM or on the hard disk.
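As a quick illustration, the split can be computed with bit operations. This is just arithmetic on a made-up example address, assuming the 4 KB pages above; nothing here is MIPS-specific:

```python
# Split a 32 bit virtual address into a 20 bit virtual page number
# and a 12 bit page offset, assuming 4 KB (2^12 byte) pages.
PAGE_BITS = 12

def split_virtual(addr):
    vpn = addr >> PAGE_BITS                  # upper 20 bits
    offset = addr & ((1 << PAGE_BITS) - 1)   # lower 12 bits
    return vpn, offset

vpn, off = split_virtual(0x00401234)
# 0x00401234 lies on virtual page 0x00401, at offset 0x234 within the page
```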
The computer needs to translate (map) a virtual page number to a physical page number. This
translation or mapping is stored in a page table. Each process has its own page table, since the
mapping from virtual address to physical address is defined separately for each process.
Here’s how it works for our 4 KB page example. Take the high order 20 bits (bits 12-31) which
we are calling the virtual page number, and use these as an index (address) into the page table.
Each entry of the page table holds a physical page number. We are calling it a ”table”, but to make
it concrete you can think of it as an array, whose index variable is the virtual page number.
Each page table entry also contains a valid bit. If the valid bit is 1, then the page table entry
holds a physical page number in main memory (RAM) and if the valid bit is 0 then the page table
entry holds a physical page number on the hard disk.
Suppose main memory consists of 1 GB (30 bit addresses). Then there are 2^18 = 2^30 / 2^12 pages
in main memory, assuming all of main memory (RAM) is paged. The address of a particular byte
in main memory would thus be 18 high order bits for the physical page number, concatenated with
the 12 bit offset from the virtual address. It is the upper 18 bits of the address that would be stored in
the “physical page address” field of the page table.
If the valid bit is 0, then the physical address has a page number that is on the hard disk. There
are more pages on the hard disk than in main memory. For example, if the hard disk has 1 TB
(2^40 bytes) then it has 2^28 = 2^40 / 2^12 pages.
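The whole translation can be sketched in a few lines of Python. The dict-based page table and the particular VPN-to-PPN mapping below are toy assumptions, but the bit arithmetic matches the 4 KB page / 1 GB RAM example above:

```python
# Translate a 32 bit virtual address into a 30 bit physical address,
# assuming 4 KB pages and 18 bit physical page numbers (1 GB of RAM).
# The page table is just a Python dict mapping VPN -> PPN (a toy stand-in).
PAGE_BITS = 12

def translate(page_table, vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = page_table[vpn]                  # 18 bit physical page number
    return (ppn << PAGE_BITS) | offset     # concatenate PPN with page offset

pt = {0x00401: 0x2A7}                      # hypothetical single mapping
paddr = translate(pt, 0x00401234)
# physical address = (0x2A7 << 12) | 0x234 = 0x2A7234
```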
Each entry in the page table may have bits that specify other administrative information about
that page and the process, which the kernel may need: e.g. whether the page is read or write
protected (user A vs. user B vs. kernel), how long it has been since that page was brought into
main memory (assuming valid is 1), when the page was last accessed, etc.
[Figure: a page table, indexed by virtual page number (up to 2^20 entries, numbered 0, 1, 2, 3, ...).
Each entry holds a physical page address (whose width depends on the size of physical memory),
a valid bit, R/W protection bits, etc.]
[ASIDE: The page table of each process doesn’t need to have 2^20 entries, since a process may only
use a small amount of memory. The kernel typically represents the page table entries using a more
clever data structure2 to keep the page tables as small as possible.]
Where are the page tables? First, note that page tables exist both in the virtual address
space (program addresses) as well as in physical address space. Page tables are constructed and
maintained by kernel programs, and so the kernel programs must be able to address them. Of
course, they also must exist physically somewhere, and so we need to be able to talk about physical
addresses too.

2 for example, hash tables, which you may have learned about in COMP 250
For simplicity, you can think of page tables as not themselves lying in a part of Memory that is
paged. (The reason this simplifies things, is that if page tables were themselves broken into pages,
then you would need to think about a page table for the page table, .... which you probably don’t
want to think about right now.) In fact, often page tables are kept in paged memory, but let’s keep
it simple and ignore that.
Cache: motivation
Last lecture we discussed the different types of physical memory and the relationship between pro-
gram (virtual) memory addresses and physical addresses. We also discussed the memory hierarchy,
and how we can’t have quick access to all physical memory. Instead, we need to use a combination
of small fast expensive memory (SRAM), slower and bigger less expensive memory (DRAM), and
much slower and much bigger and much cheaper memory (disk).
How do these ideas about memory relate back to the datapaths that we learned about earlier? Recall
the basic pipeline scheme:
[Figure: the basic five-stage pipeline datapath. The PC addresses the instruction Memory (IF);
the register file is read via ReadReg1/ReadReg2 giving ReadData1/ReadData2, with WriteReg and
WriteData inputs, and the Control Unit decodes the instruction (ID); the ALU, fed through a MUX,
computes (ALU); the data Memory is accessed (MEM); and the result is written back under the
RegWrite control (WB).]
There, the ”Memory” box was a placeholder for an unspecified mechanism. Then, last lecture we
learned that the memory accesses involve two steps: first, translate the 32 bit virtual address to
a physical address (using a page table lookup); second, use the physical address to index into the
physical memory. So we now think of the IF and MEM stages as shown below (two stages each).
This is still an abstract representation, since a "page table" is not a circuit, nor is it clear yet what
we mean by the "instruction memory" and "data memory" boxes. Do we mean RAM or disk, for
example? It will take a few more lectures to unpack this.
If we think of the two stage memory accesses as being from main memory (RAM) or from disk,
then we see immediately that we have a speed problem. Even if the word we want is in main
memory rather than on disk, each main memory (DRAM) access requires many clock cycles (say
10). Thus, every instruction would require many clock cycles. We would need about 10 cycles to
get the translation from the page table, another 10 cycles to fetch the instruction, and
another 10 clock cycles to load or store the word.
The solution to the problem is to use small very fast memory (SRAM) to hold frequently used
page table entries (virtual to physical page number translations), and frequently used instructions
and data. These small and very fast SRAM memories are called caches. For now, we’ll just think
of them as boxes which have an input and an output, where each box is a sequential circuit similar
to the register array (but much bigger in practice). In the figure below, the input to the page table
cache is a virtual address and the output is a physical address. The input to the instruction or data
cache is a physical address and the output is an instruction or data word. (For the data cache, it
is also possible to write into it (e.g. sw). We will get to that next lecture. )
[Figure: the IF and MEM stages each become two stages. The PC feeds a page table cache (TLB)
followed by the instruction cache; the computed address feeds a page table cache (TLB) followed
by the data cache.]
• a TLB index (lower 9 bits of VPN). The index specifies which out of 2^9 = 512 entries in the
TLB that we are referring to. You can think of it as an address of an entry in the TLB.
• a tag (upper 11 bits of VPN). The tag is used to disambiguate the 2^11 virtual page numbers
that have a given 9 bit index.
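The tag/index split of the VPN can be illustrated directly. The two 20 bit VPNs below are made-up values chosen to share an index while having different tags:

```python
# Split a 20 bit virtual page number into an 11 bit tag and a 9 bit
# TLB index, for a 512-entry TLB.
INDEX_BITS = 9

def split_vpn(vpn):
    tag = vpn >> INDEX_BITS
    index = vpn & ((1 << INDEX_BITS) - 1)
    return tag, index

# two VPNs with the same index map to the same TLB entry even though
# their tags (and hence their virtual page numbers) differ
t1, i1 = split_vpn(0b01000111011_001001011)
t2, i2 = split_vpn(0b01010100100_001001011)
```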
Because these two virtual addresses have the same values in their TLB index fields, their VPN’s
would be mapped to the same entry of the TLB. Notice that these addresses happen to have the
same tag as well, since their 20 bit virtual page numbers (VPN) are the same, that is, these addresses
lie on the same page. Only the page offset of the addresses differ.
Now consider a slightly different example:
(tag, 11) (TLB index,9) (page offset, 12)
01000111011 001001011 010101111111
01010100100 001001011 010101111111
Here the TLB index and page offsets are the same, but the tags differ, and so the virtual page
numbers differ as well. Since two different virtual pages
must be represented physically by different physical pages, the physical page numbers must also
differ. But since both VPNs map to the same TLB entry, it follows that only one of these two
translations can be represented in the TLB at any one time. More details on how this works will
be given below.
Another issue to be aware of is that the TLB can contain translations from several different
processes. One might think that this should not be allowed and that the TLB should be flushed (all
values set to 0) every time one process pauses (or ends) and another continues (or starts). However,
this harsh flushing policy is unnecessary: there is an easy way to keep track of different processes
within the TLB. To distinguish which process (and hence, which page table) an entry belongs to,
one can add another field to each entry of the TLB, namely a Process ID (PID) field. This is simply
a number that identifies the process. The concept here should be familiar to you. PID’s are like
area codes for telephone numbers. They are used to distinguish possibly the same 7-digit phone
numbers in two different cities.
Finally, each entry of the TLB has a valid bit which says whether the entry corresponds to a
valid translation or not. For example, when a process finishes running, the TLB entries that were
used by that process are no longer considered valid, and so the valid bits for all entries of that
process are set to 0.
Have a look again at the datapath on page 2. What circuit is inside the little boxes that say
“page table cache (TLB)”? The answer is shown below. Consider a 32 bit virtual address which
must be translated into a physical address. This virtual address is partitioned into tag/index/offset
fields. To ensure that the TLB contains the translation for this virtual address, the indexed row
from the TLB is retrieved. (Specifically, a decoder is used to identify which row to retrieve from.
Recall the RowSelect control mechanism from lecture 6.) The TLB valid bit must be checked and
must have the value 1. The tag field must match the upper bits of the virtual address. The Process
ID field (PID) must match the Process ID of the current process. If all these conditions are met,
then that entry in the TLB can be used for the translation, so the physical page number stored at that
entry of the TLB is concatenated with the page offset bits of that virtual address (see figure below)
and the result defines the physical address i.e. the translated address. Since SRAM is fast, all this
happens within the IF stage i.e. within the clock cycle.
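The lookup conditions can be summarized in a short sketch. The tuple representation of TLB entries and the particular installed mapping are toy assumptions for illustration:

```python
# Sketch of a TLB lookup: check the valid bit, the tag, and the PID
# before using the stored physical page number. Entries are tuples
# (valid, tag, pid, ppn) -- a toy representation, not a circuit.
INDEX_BITS, PAGE_BITS = 9, 12

def tlb_lookup(tlb, vaddr, current_pid):
    vpn = vaddr >> PAGE_BITS
    tag, index = vpn >> INDEX_BITS, vpn & ((1 << INDEX_BITS) - 1)
    valid, entry_tag, entry_pid, ppn = tlb[index]
    if valid and entry_tag == tag and entry_pid == current_pid:
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        return (ppn << PAGE_BITS) | offset   # physical address (hit)
    return None                              # TLB miss

tlb = [(0, 0, 0, 0)] * 512
vaddr = 0x00401234
vpn = vaddr >> PAGE_BITS
tlb[vpn & 0x1FF] = (1, vpn >> INDEX_BITS, 7, 0x2A7)   # entry for PID 7
```

A lookup with the right PID hits and returns the translated address; with the wrong PID it misses, even though the tag matches.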
One final point: last lecture we said that pages can be either in RAM or on disk, and we saw
that there are two lengths of physical addresses – one for RAM and one for disk. The TLB does
not store translations for pages on disk, however. If the program is trying to address a word that
only lies on the hard disk, then the TLB will not have the translation for that address; the TLB
only holds translations for addresses that are in main memory (RAM). In order to have access to
that word on disk, a page fault would need to occur and the page would need to be copied from
the hard disk to main memory. We will discuss how this is done later in the course.
[Figure: the TLB circuit. The 32 bit virtual address is split into a tag (bits 31-21), a TLB index
(bits 20-12), and a page offset (bits 11-0). The 9 bit index drives a 9-to-512 decoder that selects
one of the 512 rows of the TLB (SRAM). Each row holds a physical page number (18 bits), a tag
(11 bits), a PID (11 bits), and a valid bit (1 bit). Comparators check the stored tag against the
address tag, and the stored PID against the PID register.]
[ASIDE: The ordering of slides in actual lecture was a bit confusing, since I put a brief discussion
of the TLB in the middle. That discussion really belonged at the end of last lecture, but there was
no time then, because it was a quiz day and there was a lot of interaction in that class. I have rearranged
the slides to try to avoid this confusion, and the notes below correspond to this new arrangement.
The notes below contain much more information than I gave in the lecture. I will go over that
information a few lectures from now, when I return to page faults and disk accesses.]
[Figure (repeated from above): the IF and MEM stages each use a page table cache (TLB)
followed by the instruction cache or data cache.]
Suppose the cache contains 128 KB (2^17 bytes).1 [ADDED March 20: Here I am only
counting the bytes for the instructions (or data); I am not counting the extra bits that
are needed for the tag, and the dirty and valid bits. These extra bits can be significant
(see Exercises 6 Q2).] Also, I am not distinguishing instruction and data cache. Also, again
suppose our main memory contains 1 GB (2^30 bytes). How is a cache organized and accessed?
case 1: each cache entry has one word (plus tag, etc)
Since MIPS instructions and data are often one word (4 bytes), it is natural to access 4-tuples of
bytes at a time. Since the number of bytes in the instruction cache (or data cache) is 2^17 and we
have one word per entry, the cache would have 2^15 entries holding one word (2^2 bytes) each.
Let’s now run through the sequence of steps by which an instruction accesses a word in the
cache. Suppose the processor is reading from the cache. In the case of an instruction cache, it is
doing an instruction fetch. (In the case of a data cache, it is executing say a lw instruction.) The
starting address of the word to be loaded is translated from virtual (32 bits) to physical (30 bits).
We saw last class how the TLB does this. Then, the 30 bit physical address is used to try to find
the word in the cache. How?
Assume the cache words are word aligned. That is, the physical addresses of the four bytes
within each word in the cache have LSBs 11, 10, 01, 00. The lowest two bits of the address are
called the byte offset since they specify one of the four bytes within a word.
The next fifteen bits (2-16) are used to index into the 2^15 entries in the cache. We are assuming
a read rather than write, so the entry is then read out of the cache. (Back in lecture 6, we looked
at basic designs for retrieving bits from memory. You do not need to refresh all the details here,
but it would be helpful to refresh the basic idea that you can feed 15 bits into a circuit and read
out a row of data from a 2D array of flipflops.)
The upper 13 bits of the physical address (the tag) are compared to the 13 bit tag field at that
cache entry. If the two are the same, and if the valid bit of that entry is on, then the word sitting
at that entry of the cache is the correct one. If the upper 13 bits don’t match the tag or if the valid
bit is off, however, then the word cannot be loaded from the cache and that cache entry needs to
be refilled from main memory. In this case, we say that a cache miss occurs. This is an exception,
and so a kernel program known as the cache miss handler takes over. (We will return to this later
in the lecture.)

1 Note that this is significantly larger than the TLB size from last lecture. In principle, the TLB and cache sizes
could be similar. There are technical reasons why the TLB is typically smaller, which I am omitting. (It has to
do with the mechanism for indexing. I am only presenting one method for indexing, called "direct mapping", but
there are other schemes that use more complicated circuits. These are called "set associative" and "fully associative"
caches.)
Using this approach, the cache memory would be organized as shown in the circuit below. For
each of the 2^15 word entries in the cache, we store four fields:
• the word, namely a copy of the word that starts at the 30 bit physical (RAM) address repre-
sented by that cache entry
• the upper 13 bits of that 30 bit physical address, called the tag. The idea of the tag is the same
as we saw for the TLB. The tag is needed to distinguish all entries whose physical addresses
have the same bits 2-16.
• a valid bit that specifies whether there is indeed something stored in the cache entry, or
whether the tag and word of data are junk bits, for example, left over by a previous process
that has terminated.
• a dirty bit that says whether the word has been written to since it was brought into the
cache. We will discuss this bit later in the lecture.
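The lookup steps can be collected into a minimal sketch, assuming the 13/15/2 bit split described above. The tuple representation of entries is a toy convention (the dirty bit is omitted for brevity):

```python
# Direct-mapped cache lookup, one word per entry (case 1): the 30 bit
# physical address splits into 13 bit tag | 15 bit index | 2 bit byte offset.
# Entries are (valid, tag, word) tuples.
INDEX_BITS, OFFSET_BITS = 15, 2

def cache_read(cache, paddr):
    index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    valid, entry_tag, word = cache[index]
    if valid and entry_tag == tag:
        return word          # cache hit
    return None              # cache miss: the entry must be refilled

cache = [(0, 0, 0)] * (1 << INDEX_BITS)
paddr = 0x1234ABCC           # made-up word-aligned physical address
idx = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
cache[idx] = (1, paddr >> 17, 0xDEADBEEF)   # install the word
```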
The case of a write to the data cache e.g. sw uses a similar circuit. But there are some subtleties
with writes. I will discuss these later.
[Figure: the cache circuit for case 1. The 30 bit physical address splits into a 13 bit tag, a 15 bit
cache index, and a 2 bit byte offset. Each of the 2^15 entries holds a tag (13 bits), a dirty bit, a
valid bit, and a word (32 bits). A comparator checks the stored tag against the address tag, and
together with the valid bit this determines "cache hit?".]
[ASIDE: Unlike in the TLB, there is no process id field PID in the cache. The reason is that it
is possible for two processes to use the same physical addresses, for example, two processes might
be using the same instructions if they are two copies of the same program running or if they are
sharing a library. Or they might be sharing data.]
case 2: each cache entry has a block of words (plus tag, etc)
Case 1 above took advantage of a certain type of “spatial locality” of memory access, namely that
bytes are usually accessed from memory in four-tuples (words). This is true both for instructions
and data. Another type of spatial locality is that instructions are typically executed in sequence.
(Branches and jumps occur only occasionally). At any time during the execution of a program we
would like the instructions that follow the current one to be in the cache, since such instructions
are likely to be all executed. Thus, whenever an instruction is copied into the cache, it would make
sense to copy the neighboring instructions into the cache as well.
Spatial locality also arises for data word accesses. For example, arrays are stored in consecutive
program addresses. Because pages are relatively large, neighboring program addresses tend to
correspond to neighboring physical addresses. Similarly, words that are nearby on the stack tend
to be accessed at similar times during the process execution.
A simple way to implement this idea of spatial locality is as follows: Rather than making the
lines (rows) in the cache hold one word each, we let them hold (say) four words each. We take four
words that are consecutive in memory (namely consecutive in both virtual memory and in physical
memory). These consecutive words are called a block. If our cache holds 2^17 bytes = 128 KB and each
block holds 4 words or 4 × 4 = 2^4 bytes, then the cache can hold up to 2^13 blocks, so we would use
13 bits to index a line in the cache.
To address a word within a block, we need two bits, namely bits 2 and 3 of the physical address.
(Bits 0 and 1 are the “byte offset” within a word.) Bits 2,3 are the block offset which specify one
of four words within the block. Note that each block is 16 bytes (4 × 4) and the blocks are block
aligned: they always start at a physical memory address that has 0000 in the least significant
four bits. This might not always work best, but it simplifies the circuitry.
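The four-field address decomposition can be checked with a small sketch (the example address is made up):

```python
# Decompose a 30 bit physical address for a direct-mapped cache with
# four-word blocks: 13 bit tag | 13 bit index | 2 bit block offset | 2 bit byte offset.
def split_block_address(paddr):
    byte_off = paddr & 0x3
    block_off = (paddr >> 2) & 0x3
    index = (paddr >> 4) & ((1 << 13) - 1)
    tag = paddr >> 17
    return tag, index, block_off, byte_off

tag, index, blk, byt = split_block_address(0x2A7234)
# the four consecutive words 0x2A7230 .. 0x2A723C lie in one block:
# they share a tag and index, and differ only in their block offsets
```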
As shown in the figure below, to access a word from the cache, we read an entire block out of
the cache. Then, we select one of the four words in that block. One could draw a similar circuit for
writing a word (or block) to the cache, although the multiplexor part would need to be changed.
[Figure: the cache circuit for case 2. The physical address splits into a 13 bit tag, a 13 bit cache
index, a 2 bit block offset, and a 2 bit byte offset. Each entry holds a tag (13 bits), a dirty bit, a
valid bit, and four 32 bit words (word11, word10, word01, word00). The block offset drives a MUX
that selects one of the four words, and a comparator checks the stored tag against the address tag.]
Instruction Cache
The instruction and data caches have a subtle difference: instructions are only fetched (read) from
memory, but data can be read from or written to memory. For the instruction cache, blocks are
copied from main memory to the cache. For the data cache, blocks can be copied either from main
memory to the cache, or vice-versa. There can also be writes from registers to the data cache e.g.
by sw (or swc1) instructions.
We begin with the instruction cache, which is simpler. If we are fetching and the instruction is
in the cache, i.e. a hit, then we have the case discussed earlier in the lecture. If the instruction is
not in the cache, however, we have a miss. This causes an exception – a branch to the exception
handler which then branches to the cache miss handler. This kernel program arranges for the
appropriate block in main memory (the one that contains the instruction) to be brought into the
cache. The valid bit for that cache entry is then set, indicating that the block now in the cache
indeed represents a valid block in main memory. The cache miss handler can then return control
to the process and we try again to fetch the instruction. Of course, this time there is a hit.
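The miss-then-retry sequence can be sketched as a toy simulation. The dictionary "main memory" keyed by (tag, index) is a simplification I am assuming for illustration, not how the hardware works:

```python
# Sketch of the instruction-cache miss sequence: on a miss the handler
# copies the block in from main memory, sets the valid bit, and the
# fetch is retried -- this time it hits.
def fetch(cache, memory, index, tag):
    valid, entry_tag, block = cache[index]
    if valid and entry_tag == tag:
        return block, "hit"
    # cache miss handler: bring the block in from main memory
    cache[index] = (1, tag, memory[(tag, index)])
    return fetch(cache, memory, index, tag)   # retry the fetch

memory = {(5, 9): ["instr0", "instr1", "instr2", "instr3"]}
cache = {9: (0, 0, None)}                     # entry 9 starts invalid
block, result = fetch(cache, memory, 9, 5)    # miss, refill, then hit
```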
Up to now in the course, we have concentrated on the CPU and memory. In the final few
lectures, we will briefly consider some of the basic ideas of how the CPU and memory interact
with the I/O devices. The I/O devices we will consider are the keyboard and mouse (inputs), and
monitor and printer (outputs). The hard disk drive is also sometimes considered an I/O device
and is both input and output. We will not consider the issues of computer networks, which are
obviously also part of I/O but which are a more advanced topic.
Our discussion of I/O will be relatively general and untechnical, in contrast to what we have
seen up to now. Also, we will not consider current technologies, which are quite complex and diverse
and would be inappropriate for an introductory course such as this.
system bus
The main advantage of the system bus is that it reduces the number of connections in the
computer. Rather than the CPU (or main memory) having a direct connection with many lines
each to each of the I/O devices, all the components can share the same lines. There are obviously
disadvantages to such a system bus as well. Because the bus is shared, it can only carry one signal
at a time. This tends to reduce performance because one component (CPU, an I/O device, main
memory) may need to wait until the system bus is free before it can put signals on the system bus.
Also, the various components need to communicate with each other to decide who gets to use the
system bus at any given time and who is talking to whom. These communications also take time.
Another disadvantage of the system bus is that it needs a clock speed that is much slower than
that of the CPU. Components that share the system bus are relatively far away from each other. The clock
speed of the system bus is constrained by the furthest (worst case) connection between components.
That is, any signal that is put on the bus has to travel between any two components that share the
bus in any single clock cycle. If we have one system bus shared by all components, then the clock
speed might have to be 100 MHz (a period of 10 ns), as opposed to a typical CPU clock speed of 1
GHz (a period of 1 ns).
To be concrete about the limits here, consider that light (or more generally, electromagnetic
waves, i.e. the voltages that drive our gates) travels at a speed of 3 × 10^8 m/s. If we have a 3 GHz
processor, then the voltages can travel at most 10 cm per clock pulse, which
is smaller than the distances between many pairs of components in a desktop workstation. And
this doesn’t include the time needed for the components to read/write from/to the bus and for
circuits to stabilize. Hence, the system bus runs at a slower clock speed, giving more time for the
0/1 signals on the bus to stabilize.
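The distance-per-clock-pulse arithmetic above can be sketched as follows. This is only a back-of-envelope calculation: it uses the vacuum speed of light, whereas real signals on copper traces propagate somewhat slower.

```python
# How far can a signal travel in one clock period?
C = 3e8  # propagation speed in m/s (vacuum speed of light; an upper bound)

def distance_per_cycle(clock_hz):
    """Maximum distance a signal can travel in one clock period, in metres."""
    return C / clock_hz

# A 3 GHz CPU clock allows at most 10 cm of travel per pulse...
print(distance_per_cycle(3e9))    # 0.1 m, i.e. 10 cm
# ...while a 100 MHz system bus clock allows 3 m per pulse.
print(distance_per_cycle(100e6))  # 3.0 m
```

This is why the bus, whose components are metres apart, must be clocked more slowly than the CPU.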
There are three components to the system bus. There is a data bus which carries data only.
There is an address bus which carries addresses only. And there is a control bus that carries control
signals only.
How is the system bus used? Let’s take an example: the lw instruction. Suppose that a cache
miss occurs and so a block containing the word must be brought from main memory into the cache.
Here is what we would like to happen.1
• The CPU puts the (physical) address of the new block onto the address bus.
• The CPU sets control signals (ReadMem = 1 and WriteMem = 0) on the control bus.
• The control signal causes a block to be read from main memory, i.e. main memory puts the
requested block onto the data bus.
• The CPU reads the data bus and the block is written into the data cache.
There is more to it than this, however. For example, it would take several system bus clock cycles
for this to happen since each word of the block needs to be sent. Moreover, the above assumes that
the system bus was not being used by any other component. But when you have a shared resource,
you are not guaranteed that it will be available. We will discuss how the system bus is shared next
lecture.
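The four bullet steps above can be sketched as a toy simulation. Everything here, including the class and signal names, is invented for illustration; real hardware does these steps with wires and latches, not method calls.

```python
# Toy model of the bus transaction for a cache miss on lw.
class SystemBus:
    def __init__(self):
        self.address = None   # address bus
        self.control = {}     # control bus
        self.data = None      # data bus

class MainMemory:
    def __init__(self, blocks):
        self.blocks = blocks  # maps block address -> list of words

    def respond(self, bus):
        # Memory watches the control bus; on a read it drives the data bus.
        if bus.control.get("ReadMem") == 1:
            bus.data = self.blocks[bus.address]

def fetch_block(cache, bus, memory, block_addr):
    # 1. CPU puts the (physical) address of the new block on the address bus.
    bus.address = block_addr
    # 2. CPU sets the control signals on the control bus.
    bus.control = {"ReadMem": 1, "WriteMem": 0}
    # 3. Main memory sees the control signal and puts the block on the data bus.
    memory.respond(bus)
    # 4. CPU reads the data bus and the block is written into the cache.
    cache[block_addr] = bus.data
    return bus.data

memory = MainMemory({0x1000: [11, 22, 33, 44]})
cache = {}
print(fetch_block(cache, SystemBus(), memory, 0x1000))  # [11, 22, 33, 44]
```

Note that this sketch transfers the whole block in one step, whereas (as the notes say) a real bus would need several bus clock cycles, one per word.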
I/O controllers
Each of the I/O devices may have its own complicated electronics (and mechanics, in the case of a
printer or scanner). Each I/O device interfaces with an I/O controller which is inside the computer
case. The controller is responsible for reading/writing on the system bus. Each I/O controller might
have its own specialized processor, e.g. its own clock, its own special registers, its own special ROM
circuits, its own memory, and its own little program counter (PC).
Do not confuse I/O controllers with device drivers. A driver for an I/O device is a specialized
piece of software that knows all about the particular device and its I/O controller. In particular,
the driver controls the reads and writes for the registers and other circuits in the I/O controller.
The drivers are part of the operating system (kernel) and they sit in main memory or on disk.
Note that many I/O devices (printer, keyboard, mouse, monitor, etc) are external to the case of
the computer. We use the term peripheral to describe the hardware that is external to the computer,
that is, outside the case. These devices don’t plug directly into the system bus. Rather, each plugs
into a port. Each port is typically part of the I/O controller that sits on the motherboard.
[Footnote 1: We ignore the possible write-back, e.g. if the cache entry is dirty.]
[FIGURE: The system bus connects the CPU and main memory to the I/O controllers for the hard disk drive, DVD/CD drive, printer, keyboard, mouse, and display.]
Let’s next consider a few examples of how various components can share the system bus. Recall
the tri-state buffer circuit element from lecture 6. This was a mechanism for allowing many devices
to put signals on a single wire. We will also need to use tri-state buffers for the system bus. The
reason is that we have many devices that can put signals onto the bus, but only one of them can
write to the bus at any time.
[FIGURE: Keyboard controller example, showing an I/O device ID (0010001), a “key pressed” signal, and a “Read I/O” control signal.]
Output devices
We will look briefly at two types of output devices: printers and displays (monitors). The first
distinction to be aware of in thinking about the information sent to an output device is whether
the data describes how to make an image (called vector graphics) or whether the data describes a
square grid of pixels, in particular, their intensities i.e. RGB values (called raster graphics).
Example: printer
Many printers are built to understand instructions in the (Adobe) PostScript language. An
application such as Word or Adobe Reader or a web browser issues a print command. The PostScript
file to be printed is sent from main memory to the I/O (printer) controller. Then individual
PostScript instructions are sent from the controller to the printer. PostScript printers contain
software and hardware which together interpret the instructions, and convert them into a rasterized
image (i.e. a square grid of pixels). See the slides for some very basic examples of PostScript
instructions.
A printer is an output device, and so it does not put anything on the data bus. Hence, it does
not need a tri-state buffer. To check whether the data on the data bus is meant for the printer, a
similar mechanism could be used as in the keyboard example, i.e. read from the address bus and
control bus. The file containing the printer instructions (PostScript) would be put on the data bus.
The printer determines whether it should read from the data bus by looking at the address on
the address bus. Continuing with our earlier example, if the lower 7 bits match the address of this
printer, then the data is meant for this printer and is written to the printer. Otherwise, the printer
does not take data from the data bus.
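The address-matching test described above can be sketched as a one-line comparison. The 7-bit ID width and the particular ID value are assumptions carried over from the running keyboard example; a real controller would do this with a comparator circuit, not software.

```python
# Sketch: an I/O controller decides whether a bus address is meant for it
# by comparing the low 7 bits of the address against its own device ID.
def address_matches(bus_address, device_id, id_bits=7):
    mask = (1 << id_bits) - 1            # 0b1111111 for 7 bits
    return (bus_address & mask) == device_id

PRINTER_ID = 0b0010001                   # hypothetical 7-bit device ID

print(address_matches(0xFF91, PRINTER_ID))  # 0xFF91 & 0x7F = 0b0010001 -> True
print(address_matches(0xFF90, PRINTER_ID))  # low bits differ -> False
```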
Example: monitor
Our next example is the monitor. The monitor screen is a rectangular array of points called pixels
(picture elements). For each pixel (x, y), there are three numbers stored which represent the R, G,
and B intensities of the point. Typically, one byte is used for each of these intensities. That is,
intensity codes go from 0 to 255 (bigger means brighter).
An example image size is 1024 × 768 pixels. If we use 3 bytes per pixel – one byte for each
of R, G, B – then we need about 2.4 MB of memory just for the image that is on your monitor at any
instant.
A very long time ago, in the 1980s, when the first wave of personal computers occurred, image
pixel values were stored in a part of main memory called the frame buffer (see figure below). That
is, your screen shows a large array from main memory. However, this was infeasible for all but
the simplest images. Most monitors present a new image every 1/60 of a second. (A minimum of
30 frames per second is required, otherwise the eye detects flicker and the video looks jittery.) If
the image size is about 1000 × 1000 and even if the screen were refreshed only every 1/30 of a second,
then this would still require about 100 MB per second (considering RGB) to be put on the system bus.
This should make you concerned. The system bus would have to spend a large amount of its time just
transferring pixels, which is clearly unacceptable.
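The two numbers above come from simple multiplications, which can be checked directly:

```python
# Frame buffer size and refresh bandwidth, as in the examples above.
def framebuffer_bytes(width, height, bytes_per_pixel=3):
    # 3 bytes per pixel: one byte each for R, G, B.
    return width * height * bytes_per_pixel

size = framebuffer_bytes(1024, 768)              # 2,359,296 bytes, ~2.4 MB
per_second = framebuffer_bytes(1000, 1000) * 30  # 90,000,000 bytes/s, ~100 MB/s
print(size, per_second)
```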
[FIGURE: A dedicated video bus carries pixels from the frame buffer to the display controller, alongside the system bus, which connects the CPU, main memory, and the I/O controllers (hard disk drive, DVD/CD drive, printer, mouse).]
One solution to the system bus traffic jam problem is to have a dedicated (read only) bus to
carry bytes from the frame buffer to the display controller. This video bus completely freed up the
system bus. However, even if we don’t use the system bus to transfer pixels to the video controller,
the pixel values still need to be computed for each frame and sent to the frame buffer. A long time
ago, the CPU was indeed responsible for making the images. However, the graphics capabilities
were weak since the CPU had so many other things to do. And it still had to use the system bus
to transmit these values to the framebuffer, which clogged up the system bus.
[FIGURE: The CPU and main memory sit on the system bus; a GPU with its own frame buffer and video controller drives the display.]
So, problem solved? Not entirely. Relieving the CPU of so much computational burden just
passes the problem on. The more operations we require the GPU to do, the more complicated the
GPU design must be. It needs its own registers and controls, its own floating point unit(s), its
own instructions. Modern graphics cards (video cards) contain thousands of processors and large
amounts of memory. They can perform general purpose massively parallel computation as well (if
you know how to program them!). But there is a price: the cards are expensive.
This lecture we will look at several methods for sending data between an I/O device and either
main memory or the CPU. (Recall that we are considering the hard disk to be an I/O device.)
0xffff0000 input control register (only the two LSB’s are used)
0xffff0004 input data register (holds one byte)
0xffff0008 output control register (only the two LSB’s are used)
0xffff000c output data register (holds one byte)
The LSB of each control register indicates whether the corresponding data register is “ready”. In
the input case, ready means that a character has been typed at the keyboard and is sitting in the
low order byte of the data register. In the output case, ready means that the output data register
is free to receive a byte from the CPU, e.g. in the case of a (non-buffered) printer, the device might
not yet have printed the previous byte that was sitting in the register and hence might not be ready.
[In these notes I say the registers hold only one byte, whereas in the lecture and in the instructions
below I imply that they hold a word each. It doesn’t really matter.]
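The register map above can be modelled with a small sketch. The addresses follow the memory-mapped I/O convention quoted in the notes; the class itself and its method names are invented for illustration.

```python
# Toy model of the memory-mapped keyboard registers described above.
IN_CONTROL, IN_DATA = 0xffff0000, 0xffff0004

class KeyboardDevice:
    def __init__(self):
        self.regs = {IN_CONTROL: 0, IN_DATA: 0}

    def key_pressed(self, ch):
        # A keystroke deposits the byte in the data register and sets the
        # "ready" bit, which is the LSB of the control register.
        self.regs[IN_DATA] = ord(ch)
        self.regs[IN_CONTROL] |= 1

    def load_word(self, addr):
        # What a CPU lw from this address would see. Reading the data
        # register clears the ready bit until the next keystroke.
        value = self.regs[addr]
        if addr == IN_DATA:
            self.regs[IN_CONTROL] &= ~1
        return value

kbd = KeyboardDevice()
kbd.key_pressed('A')
print(kbd.load_word(IN_CONTROL) & 1)  # 1: a character is waiting
print(chr(kbd.load_word(IN_DATA)))   # 'A'
print(kbd.load_word(IN_CONTROL) & 1)  # 0: not ready until the next key
```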
Let’s look at an example. Suppose your program should read bytes from the input device and
load each byte into register $s2. Here’s how the kernel might try to do it:
    lui $t0, 0xffff      # $t0 = 0xffff0000, the base of the memory mapped I/O region
    lw  $s2, 4($t0)      # load from the input data register (0xffff0004)
From the kernel programmer’s perspective, this is quite simple. However, from the underlying
hardware perspective, it is more complicated. The MIPS hardware must recognize that the address
is in the memory mapped I/O region and handle it properly. In particular, when the kernel program
executes the above lw instruction, the hardware must detect that the desired address does not
correspond to an item in the data cache but rather is some word in the I/O device’s memory. This
result is similar to a cache miss, in that the process must stop and the kernel must use the system
bus to get something. But now it must get the desired word from the I/O device (input) instead of
a block from main memory (cache refill).
The CPU puts an address on the address lines of the system bus, e.g. the address of the I/O device,
and sets certain control lines on the system bus to indicate that the CPU wants to read from that
device. The I/O device controller reads the bus (always) and sees the address and control signals
and then puts the requested data item onto the data bus.
I emphasize: this is the same general mechanism that would be used for a main memory access
in the case of a cache miss. Now it is an I/O device that provides the data to the CPU.
Similarly, for an output, the sw instruction would be used and the address where the word
should be stored would be within a reserved (memory mapped) region of the kernel’s address space.
The CPU would put that address on the bus (after translating this virtual address into a physical
address that indicates the I/O device number ) and set a write control signal on the control bus.
Again, the mechanism is similar to storing a word to main memory. The main difference is that
the address on the address bus indicates that the CPU is communicating with the I/O controller,
not with main memory.
Polling
One issue that arises with the above example is that it only makes sense to read from the input
device when the input device is ready, e.g. when there is a byte of data to be read. (This issue is
independent of whether one uses isolated I/O or memory mapped I/O.)
To solve this problem, one can use a method called polling. Before the CPU reads from the
input device, it checks the status of the input device to see if it is “ready,” meaning that the CPU
checks the “ready” bit in the input control register 0xffff0000, which is bit 0 (the least significant
bit). In particular, the CPU needs to wait until this bit has the value 1. This can be implemented
in MIPS with a small polling loop.
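The polling idea can be sketched in Python rather than MIPS: spin on the ready bit of the control register, then read the data register. The fake device below is invented for illustration; a real kernel would do the two reads with lw instructions on the memory-mapped addresses.

```python
# A polling loop: busy-wait until the ready bit (bit 0) of the control
# register is 1, then read the data register.
def poll_and_read(read_control, read_data):
    while read_control() & 1 == 0:
        pass                      # keep asking: "are you ready yet?"
    return read_data()

# Fake input device: becomes ready on the third status check.
status = iter([0, 0, 1])
byte = poll_and_read(lambda: next(status), lambda: 0x41)
print(byte)  # 65
```

The inefficiency the notes mention is visible here: the CPU wastes cycles in the while loop until the device happens to become ready.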
• The keyboard controller is “ready” to provide a character to the CPU when the difference of
the two indices is greater than 0.
[Footnote 2: In case you didn’t cover it in COMP 250, a circular array is an array of size N where
the index i can be any nonnegative integer, and the position is computed as i mod N.]
In each case there might be a circuit which tests for these conditions.
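The circular keyboard buffer implied by the bullet above can be sketched as follows. The class is a minimal illustration, assuming a write index advanced by the controller on each keystroke and a read index advanced by the CPU, with “ready” meaning the two indices differ.

```python
# Circular buffer of size n, indexed mod n as in the footnote above.
class CircularBuffer:
    def __init__(self, n):
        self.buf = [0] * n
        self.n = n
        self.read_i = 0    # advanced by the CPU
        self.write_i = 0   # advanced by the keyboard controller

    def put(self, b):
        # Controller side: a key was pressed.
        self.buf[self.write_i % self.n] = b
        self.write_i += 1

    def ready(self):
        # "Ready" when the difference of the two indices is greater than 0.
        return self.write_i - self.read_i > 0

    def get(self):
        # CPU side: read one character.
        b = self.buf[self.read_i % self.n]
        self.read_i += 1
        return b

cb = CircularBuffer(8)
for ch in "hi":
    cb.put(ord(ch))
print(cb.ready())                    # True
print(chr(cb.get()), chr(cb.get()))  # h i
print(cb.ready())                    # False
```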
With this buffer in mind, we can see what is causing the delay in echoing to the console. Suppose
the user is typing into the keyboard, but the system bus is being used by some I/O device. The
buffer starts to fill up, and the CPU only reads from the buffer when the bus is available for it to do
so. The CPU grabs a short time slice, and quickly reads a character and writes a character (to the
console i.e. output) and does this until it empties the buffer or until it gives up the bus or pauses
the process and continues with another process.
[FIGURE: DMA configuration. The CPU, main memory, and I/O controllers (hard disk drive, floppy drive, keyboard, printer, mouse) share the system bus; dedicated BR/BG lines run between the CPU and the DMA controllers.]
I emphasize that BR and BG signals cannot be sent on the system bus. These signals are used
to decide who uses the system bus, and so they must always be available. The CPU must always be
able to communicate the value of BG to the DMA controller, and the DMA controller must always be
able to communicate the value of BR to the CPU. Separate direct lines (“point-to-point”) between
the CPU and the DMA controller are needed to carry these signals.
[ASIDE: The above figure shows two DMA devices, say the printer and the hard drive. I have
labelled two sets of BR and BG signals. The figure also shows that the I/O controllers have their
own registers and little RAMs. These registers have physical addresses (I/O device number and
register number or memory address number within that device). In a memory mapped system,
these registers and memory would have corresponding virtual addresses which would need to be
translated.]
Hard disk
A few final comments about how to improve hard disk performance....
Recall that the hard disk controller has a local memory (DRAM), sometimes called the disk
cache. When a page is read from disk, it is first put into this disk cache, and then transferred to
main memory as a subsequent stage. Similarly, when a page is brought from main memory to the
hard disk controller, it is put in the disk cache and then later moved to the disk itself (as was
described earlier in the lecture).
To improve performance when reading a page from the hard disk, neighboring physical pages
could be read as well. This is easy to do because once the hard disk has spun around to the position
of the page, the controller can easily read off neighboring pages too. Reading neighboring pages on
disk would be useful when adjacent virtual pages are stored on the hard disk as adjacent physical
pages. For example, a file might consist of multiple pages – it would be better to store them side
by side so that they can be accessed at the same time.
You might have heard of disk fragmentation. For example, suppose you are storing a file on your
hard disk. The file is an application which consists of many pages of source code. When you first
put the file on the hard disk, it would be nice to keep all the pages in order and at consecutive page
addresses on the disk for the reason I mentioned above. However, this might not be possible since
the disk might not have a big empty region available. The disk might contain lots of free space, but
it might consist of many holes, e.g. where previous files had been deleted. You can imagine that
after years of use, with many file adds and deletes and many page swaps, the number of large empty
gaps decreases and the number of small gaps increases. This is fragmentation. The disk doesn’t
work as well when it is fragmented, because the trick mentioned in the previous paragraph is not
as effective.
Finally, another way to improve performance of the disk arises when there are several read/write
requests for data on different parts of the disk. Rather than performing the read/write in the order
of request, which might take multiple spins of the disk, the hard disk controller could sort the list of
pages so that it could access them in one spin. For example, suppose requests were made for page
3, 7, 5, 4 at a time when the head is at the position for page 2. Rather than doing 3, 7 and then spinning
all the way around for 5 and then all the way around again for 4, it could serve the requests in order
3, 4, 5, 7 in a single spin of the disk.
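The reordering trick from that example can be sketched as a sort of the pending requests relative to the head position. This is a simplified model: it serves everything ahead of the head first, then the rest on the next pass, and ignores real rotational timing.

```python
# Reorder page requests so that pages ahead of the head are picked up
# in one spin, instead of serving them in arrival order.
def one_spin_order(head_pos, requests):
    ahead = sorted(p for p in requests if p >= head_pos)
    behind = sorted(p for p in requests if p < head_pos)
    return ahead + behind   # pages behind the head wait for the next pass

# Head at page 2, requests arrived in order 3, 7, 5, 4:
print(one_spin_order(2, [3, 7, 5, 4]))  # [3, 4, 5, 7]
```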
In this lecture, I will elaborate on some of the details of the past few weeks, and attempt to pull
some threads together.
[Footnote 1: We ignore the step in which a page is also swapped out of main memory.]
Last lecture we looked at polling and DMA as a way for the CPU and I/O to coordinate their
actions. Polling is simple, but it is inefficient since the CPU typically asks many times if the device
is ready before the device is indeed ready. Today we’ll look at another way for devices to share the
system bus, called interrupts.
Interrupts
With interrupts, an I/O device gains control of the bus by making an interrupt request. Interrupts
are similar to DMA bus requests in some ways, but there are important differences. With a DMA
bus request, the DMA device asks the CPU to disconnect itself from the system bus so that the
DMA device can use it. The purpose of an interrupt is different. When an I/O device makes
an interrupt request, it is asking the CPU to take specific action, namely to run a specific kernel
program: an interrupt handler. Think of DMA as saying to the CPU, “Can you get off the bus so
that I can use it?,” whereas an interrupt says to the CPU “Can you stop what you are doing and
instead do something for me?”
Interrupts can occur from both input and output devices. A typical example is a mouse click or
drag or a keyboard press. Output interrupts can also occur, e.g. when a printer runs out of paper,
it tells the CPU so that the CPU can send a message to the user, e.g. via the console.
There are several questions about interrupts that we need to examine:
• how does an I/O device make an interrupt request?
• how are interrupt requests from multiple I/O devices coordinated?
• what happens from the CPU perspective when an I/O device makes an interrupt request?
I’ll deal with these questions one by one.
The mechanism by which an I/O device makes an interrupt request is similar to what we saw
in DMA with bus request and bus granted. The I/O device makes an interrupt request using a
control signal commonly called IRQ. The I/O device sets IRQ to 1. If the CPU does not ignore
the interrupt request (under certain situations, the CPU does ignore interrupt requests), then the
CPU sets a control signal IACK to 1, where IACK stands for interrupt acknowledge. The CPU also
stops writing on the system bus, by setting its tristate gates to off. The I/O device then observes
that IACK is 1, which means that it can use the system bus.
Often there is more than one I/O device, and so there is more than one type of interrupt request
that can occur. One could have a separate IRQ and IACK line for each I/O device. This requires a
large number of dedicated lines and places the burden on the CPU of administering all these lines.
This is not such a problem, in fact, and many modern processors do use point-to-point connections
between components. Another method is to have the I/O devices all share the IRQ line to the CPU.
They could all feed a line into one big OR gate. If any I/O device requests an interrupt, then
the output of the OR gate would be 1. How then would the CPU decide whether to allow the
interrupt?
One way is for the CPU to ask each I/O device, one by one, whether it requested the interrupt,
using the system bus. It could address each I/O device and ask “did you request the interrupt?”
Each I/O device would then have one system bus cycle to answer yes. This is a form of polling
(although there is no infinite loop here, as there was when I discussed polling last lecture).
Daisy chaining
A more sophisticated method is to have the I/O devices coordinate who gets to interrupt the CPU
at any time. Suppose the I/O devices have a priority ordering, such that a lower priority device
cannot interrupt the CPU when the CPU is servicing the interrupt request from a higher priority
device. This can be implemented with a classical method known as daisy chaining. As described
above, the IRQ lines from each device meet at a single OR gate, and the output of this OR gate is
sent to the CPU as a single IRQ line. The I/O devices are then physically ordered by priority. Each
I/O device would have an IACKin and IACKout line. The IACKin line of the highest priority I/O
device would be the IACK line from the CPU. The IACKout line of one I/O device is the IACKin
line of the next lower priority I/O device. There would be no IACKout line from the lowest priority
device.
[FIGURE: Daisy chaining. The IRQ lines from I/O devices 1–4 feed one OR gate, whose output is the single IRQ line to the CPU. The CPU’s IACK line is the IACKin of the highest priority device (I/O 1); each device’s IACKout is the IACKin of the next lower priority device. Main memory also sits on the system bus.]
Here is how daisy chaining works. At any time, any I/O device can interrupt the CPU by setting
its IRQ line to 1. The CPU can acknowledge that it has received an interrupt request or it can
ignore it. To acknowledge the interrupt request, it sets IACK to 1. The IACK signal gets sent to
the highest priority I/O device.
Let’s start by supposing that all IRQ and IACK signals are 0. Now suppose that an interrupt
request is made (IRQ = 1) by some device, and the CPU responds by setting the IACK line to 1. If
the highest priority device had requested the interrupt, then it leaves its IACKout at 0. Otherwise,
it sets IACKout to 1 i.e. passing the CPU’s IACK=1 signal down to the second highest priority
device. That device does the same thing, and so on. Whenever IACKin switches from 0 to 1, the
device either leaves its IACKout = 0 (if this device requested an interrupt) or it sets IACKout = 1
(if it did not request an interrupt).
Let me explain the last sentence in more detail. I said that the I/O device has to see IACKin
switch from 0 to 1. That is, it has to see that the CPU previously was not acknowledging an
interrupt, but now is acknowledging an interrupt. This condition (observing the transition from 0
to 1) is used to prevent two I/O devices from simultaneously getting write access to the system bus.
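The ripple of the acknowledge signal down the chain can be sketched as a small simulation. The function and its return shape are invented for illustration; real daisy chaining happens in combinational logic, not a sequential loop.

```python
# Daisy chaining: IACK ripples from the highest priority device downward,
# and the first device with a pending request absorbs it (IACKout = 0).
def propagate_iack(requests):
    """requests: list of booleans, highest priority first.
    Returns (index of granted device or None, IACKout of each device)."""
    iack_in = 1          # the CPU's IACK, fed to the highest priority device
    granted = None
    iack_out = []
    for i, wants_bus in enumerate(requests):
        if iack_in and wants_bus and granted is None:
            granted = i              # this device takes the bus...
            iack_out.append(0)       # ...and leaves its IACKout at 0
            iack_in = 0
        else:
            iack_out.append(iack_in) # pass the acknowledge down the chain
            iack_in = iack_out[-1]
    return granted, iack_out

# Devices 0 (highest priority) to 3; only device 2 requested an interrupt:
print(propagate_iack([False, False, True, False]))  # (2, [1, 1, 0, 0])
```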
What if the CPU is servicing the interrupt request from a lower priority device, and a higher
priority device wishes to interrupt the CPU. With daisy chaining, the IRQ from the higher priority
device cannot be distinguished from that of the lower priority device since both feed into the same
OR gate. Instead, the higher priority device changes its own IACKout signal to 0. This 0 value
gets passed on to the lower priority device. The lower priority device doesn’t know why its IACKin
was just set to 0. It could be because a higher priority device wants to make an interrupt request,
or it could be because the CPU has turned off its IACK signal. Whichever of these occurs doesn’t
matter. The lower priority device just needs to know that its interrupt time is now over. It finishes
up as quickly as it can, and then sets its IRQ to 0. Once it is done, the CPU sets its IACK to 0.
Once the higher priority device sees that the CPU has set IACK to 0, it can then make its interrupt
request.
How does the CPU know which device made the interrupt request? When the CPU sets IACK
to 1, it also frees up the system bus. (It “tristates itself.”) When the I/O device that made the
interrupt request observes the IACK 0-to-1 transition, this device then identifies itself by writing
its address (or some other identifier) on the system bus. The CPU reads in this address and takes
the appropriate action. For example, if the device has a low priority, then the CPU may decide to
ignore the interrupt and immediately set IACK to 0 again.
Think of the sequence as follows. “Knock knock” (IRQ 0-to-1). “Who’s there and what do you
want?” (IACK 0-to-1), and the CPU tristates itself from the system bus so that it can listen to
the answer. “I/O device number 6, and I want you to blablabla” (written by the I/O device onto
the system bus). If the CPU then sets IACK 1-to-0, this means that it won’t service this request.
In this case, the I/O device has to try again later. If the CPU doesn’t set IACK 1-to-0, then it
may send back a message on the system bus. (The I/O device needs to tri-state after making its
request, so that it can listen to the CPU’s response.)
Daisy chaining is just one way of coordinating interrupt requests. But it is not the only way, and
often the CPU does the work of figuring out which interrupt request to handle based on priorities.
Let’s now put daisy chaining aside and think about interrupts from the CPU perspective.
One issue is that a new interrupt might arrive while the interrupt handler is in the middle of
saving the registers from the previous interrupt. For example, when saving the state information
for a process, it would be bad for the kernel to be interrupted and to jump to yet another
interrupt handler. So, for certain kernel routines, the processor can put
itself into a non-interruptable state. Another example is that a lower priority I/O device should not
be allowed to interrupt the interrupt handler of a higher priority device, whereas a higher priority
device should be allowed to interrupt a lower priority device’s interrupt handler.
In summary, here’s the basic idea of what the kernel (interrupt handler) needs to do:
Interrupts in MIPS
Let’s look at a few details about how MIPS treats interrupts, and how the Cause and Status registers
are used.
The general interrupt enable bit is the LSB of the Status register. To turn this bit on, we use:
    mfc0 $k0, $12       # copy the Status register (coprocessor 0 register 12) into $k0
    ori  $k0, $k0, 1    # set the LSB (the interrupt enable bit)
    mtc0 $k0, $12       # copy the result back into the Status register
Note that this code uses the register $k0, which is register 26 in the main register array. User
programs are not allowed to use this register. Only kernel programs can use it. (An error will
occur if a user instruction uses this register.)
Finally, there are particular bits in the Status register which indicate whether interrupts of a
particular priority are enabled, and there are corresponding bits in the Cause register which turn on
when an interrupt request of a particular priority has been made. The kernel would need to compare
bits in these two registers to decide if a particular interrupt request should be serviced. (Details
omitted.)
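The comparison of those bits can be sketched as follows, assuming the standard MIPS coprocessor 0 layout: bit 0 of Status is the global interrupt enable bit, Status bits 8–15 are the per-level interrupt mask, and Cause bits 8–15 are the pending-interrupt bits.

```python
# Which interrupt requests may be serviced, given Status and Cause?
def serviceable_interrupts(status, cause):
    if status & 1 == 0:            # global interrupt enable bit is off
        return 0
    enabled = (status >> 8) & 0xff # Status bits 8-15: interrupt mask
    pending = (cause >> 8) & 0xff  # Cause bits 8-15: pending interrupts
    return enabled & pending       # a set bit = a request that may be serviced

# Levels 0 and 1 enabled, global enable on (Status = 0x0301);
# a level-1 interrupt is pending (Cause = 0x0200):
print(hex(serviceable_interrupts(0x0301, 0x0200)))  # 0x2
```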
All the I/O examples we have discussed use the system bus to send data between the CPU, main
memory, and I/O controllers. The system bus runs at a slower clock speed than the CPU because
of greater distances involved between the various components, but there is still a clock governing
the reading and writing to the bus. The sender and receiver of any bits written to the system bus
use the same system bus clock, e.g. writes are performed at the end of the clock pulse. For this
reason, the system bus is called synchronous.
Synchronous buses don’t work well when the distances between the sender and receiver are too
large, however, such as when data is being sent from an I/O controller to a peripheral device. The
clock signals themselves need to travel, and so the assumption of events happening “at a given
time” becomes problematic. When the distances between sender and receiver are too large (say 1
meter), the precise timing of the clock cannot be relied on in the same way as before. When the
bus itself is not controlled by a clock, one says the bus is asynchronous. Timing events are still
important though. Today we’ll look at two ways to achieve timing on such buses.
Handshaking
How can data be sent on a bus without the sender and receiver synchronizing on the same clock?
Let’s take the case of an I/O controller (e.g. a printer controller, i.e. inside the computer case) that
communicates data to an I/O peripheral device (i.e. the printer itself, attached to a cable several
metres long). The asynchronous transfer can be implemented with a method called handshaking.
In general, there are two types of handshaking, depending on whether the data source (sender)
or destination (receiver) initiates the transfer. For the printer example, the source/sender is the I/O
controller and the destination/receiver is the printer.
In source initiated handshaking, the sender (a) puts data on the data lines and shortly afterwards
(b) sets a control signal called data request, and holds both signals. The receiver reads the control
signal, then reads the data, and then (c) responds with a data acknowledge control signal. The
sender receives the acknowledgement and (d) stops writing the data, and resets the data request
line to 0. The receiver then (e) sets the data acknowledge back to 0.
[Timing diagram: signal levels at events (a)–(e) of source initiated handshaking.]
The hexagonal symbol in the middle row indicates the time interval during which the data is put on
the line. This symbol is used here to indicate that it doesn’t matter whether the data is a 1 or a 0 (but
it has to be one of them during that interval). Also note that this data line can consist of multiple
parallel lines. It doesn’t have to be just a single line as indicated in the figure.
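The five source-initiated events (a)–(e) can be listed as a sequence of line states. This is purely illustrative, and the line names are invented: real handshaking is concurrent hardware, with each side waiting on the other, not sequential code.

```python
# The source initiated handshake as a sequence of (data valid, data request,
# data acknowledge) line states, one tuple per event (a)-(e).
def source_initiated_handshake():
    yield ("a", dict(data=1, req=0, ack=0))  # sender drives the data lines
    yield ("b", dict(data=1, req=1, ack=0))  # ...then raises data request
    yield ("c", dict(data=1, req=1, ack=1))  # receiver reads data, raises ack
    yield ("d", dict(data=0, req=0, ack=1))  # sender drops data and request
    yield ("e", dict(data=0, req=0, ack=0))  # receiver drops the acknowledge

for step, lines in source_initiated_handshake():
    print(step, lines)
```

The key property is the ordering: the request never rises before the data is valid, and the data is held until the acknowledge arrives.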
Alternatively, the receiver (destination) can initiate a data transfer. It does so by (a) setting its
data request control signal. The sender reads this signal, then (b) puts the data on the data line,
[Timing diagram: signal levels at events (a)–(e) of destination initiated handshaking.]
and shortly afterwards – to make sure the data signal can be read – it (c) sets a data ready signal.
The receiver reads the data ready signal and then reads the data, then (d) the receiver resets the
data request signal (off). The sender then sees that the request has been turned off so it (e) turns
the data ready signal off and stops writing the data.
I emphasize that no clock is involved here i.e. it is asynchronous. One side waits until the other
side is ready. The sender and receiver might each have their own clocks, but there is no shared clock
used for the bus.
information, it turns on a ’receiver acknowledge’ (ACK) signal (some other line). This signals that
it has read the information off the bus. The sender sees this, then stops writing on the bus (and
the bus then has unspecified values). The receiver eventually drops its ACK signal to 0.
In the receiver initiated case, the receiver puts the address (i.e. the targeted source) and the command
(control) on the appropriate lines. Then, the receiver waits and gives the signal a chance to stabilize
over the whole line. The devices decode the signals. The source/sender, reading the address lines,
sees that it is the target. So it puts the requested data on the bus and shortly after it sets the ’data
ready’ signal. The receiver sees the ready signal and reads the data. When the receiver is done, it
resets the data request line and stops writing on the address and control lines (and so the values
on these lines are unspecified). When the sender sees that the ’data request’ signal has been turned
off, the sender stops putting the data on the line, and it turns off the data ready signal.
Notice that when the bus is shared, there needs to be some scheme to prevent multiple devices
from writing on the bus at the same time. This could be achieved using a scheme similar to what we
saw with the (clocked) system bus, e.g. dedicated IRQ/IACK lines with one party (the CPU in that
case) controlling who is allowed to write.
Serial bus
For all of the buses we have discussed up to now, we have thought of the signals on the bus as being
constant over each line. That is, each wire of the bus held one value, and a device only read from the
bus once that value had stabilized over the whole bus. A serial bus is completely different. With a
serial bus, you should think of a sequence of signal pulses travelling down a wire. To read from the
serial bus, you need to time the reads with the pulses as they pass. A clock is still necessary, since
the pulses must all be the same duration and this duration needs to be agreed upon in advance.
The “speed” (or frequency) of a serial bus is measured in bits per second (baud). This is the
number of bits or pulses that passes any given point on the bus per second. If data is sent at a
certain speed (e.g. 56,000 bps in the case of a very old modem) then each bit has a duration of
1/56,000 seconds, which is about 18,000 ns. If you have a particular physical cable, then you could
think of each bit as occupying a certain spatial extent on the cable. Or, if you measure the signal
at a particular point on the bus, then a particular value can be measured at that point for this
duration.
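The bit-duration arithmetic is worth doing once yourself; here it is as a short computation (the 56,000 bps figure is the modem example from the text):

```python
# Duration of one bit at a given serial speed.
baud = 56_000                     # bits per second (old modem example)
bit_time_s = 1 / baud             # duration of one bit, in seconds
bit_time_ns = bit_time_s * 1e9    # the same duration in nanoseconds
# 1/56,000 s is roughly 17,900 ns, i.e. on the order of 18,000 ns per bit.
```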
For a 56 Kbps modem, the speed is much less (by a factor of several thousand!) than the clock
speed of your system bus. So you should think of the system bus clock as sampling
very finely relative to the bits on the serial bus. (And the CPU clock speed is another factor
of ten faster than the system bus clock speed.)
How does the receiver read from a serial bus? Assume for the moment that the receiver and
sender have agreed in advance on the bit duration T and on the number of bits in each transmission.
The sender puts a 1 signal on the wire for the normal state (non-transmitting). Then the sender
switches to 0 to indicate that a sequence of bits is to follow. This initial 0 has the same duration T
as each of the bits that follow, and is called the start bit. The start bit indicates that a byte is about
to be sent, but otherwise contains no information. The receiver waits 0.5 T and checks to make sure
that the signal is still 0. This is a check that the switch to 0 wasn't just electronic noise.
The receiver then waits one T and samples the signal, and repeatedly samples every T for as
many bits as are expected (8). Note that by waiting 0.5 T and then sampling regularly every
T, it samples from the middle of each pulse, i.e. avoiding the transitions at the boundaries of the
pulses. When it has all 8 bits of data, it reads a “stop bit” (value 1). At least one stop bit is there
to ensure some spacing between bytes. In particular, the receiver can only read in the next byte
once it notices a transition from 1 to 0, that is, when it notices the next start bit. To guarantee
this happens, the line must be in the 1 state.
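The sampling rule above can be sketched in code. Here the line is modeled as a function of time t, measured in units of the bit duration T, and the receiver re-checks the start bit at 0.5 T and then samples every T after that, landing mid-pulse each time. (The function names are made up for illustration.)

```python
def frame(bits):
    """Build the line signal for one frame: start bit (0), data bits, stop bit (1)."""
    levels = [0] + bits + [1]
    def line(t):
        i = int(t)                 # which bit period t falls in
        return levels[i] if 0 <= i < len(levels) else 1  # idle line is 1
    return line

def receive(line):
    """Sample one frame from the line; return the 8 data bits."""
    assert line(0.5) == 0          # re-check at 0.5 T: a real start bit, not noise
    data = [line(0.5 + k) for k in range(1, 9)]   # sample mid-bit, every T
    assert line(0.5 + 9) == 1      # the stop bit must be 1
    return data
```

Because every sample lands at an offset of 0.5 from a bit boundary, small disagreements between the sender's and receiver's clocks do not push the samples onto the transitions.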
Note that the sender and receiver each have their own clock, typically of different speeds, and the
pulse duration of these clocks is much shorter than T.
For example, if the byte to be sent is 10011101, then the sequence looks as follows. (Here
the wave of pulses is travelling to the left. In the slides, I showed it travelling to the right.) For this
example, the bits are sent from MSB to LSB (though one could send in the other order too). The
receiver reads the start bit first, followed by the eight bits of the sequence, followed by the stop bit.
[Figure: the frame on the line — a start bit (0), then the data bits 1 0 0 1 1 1 0 1, then a stop bit (1).]
How do messages get started? Messages often have headers and text. To indicate boundaries, the
sender sends a special ASCII value.
In some systems there is an extra mechanism for checking whether errors have occurred in the trans-
mission. Errors can occur because lines are physical and sometimes have noise. One simple way
to check for errors is for the sender to append an extra bit to the character – called a parity bit –
such that the total number of 1 bits in the code is even. This is called even parity. If the character
has an odd number of 1 bits, then the parity bit is set to 1, so that the total number of 1's is even.
If the ASCII character has an even number of 1 bits, then the parity bit is set to 0. Note it could
happen that there are two errors (two of the ASCII bits are incorrect), and in this case, the errors
would not be detectable.
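Even parity is simple enough to write down directly. This sketch computes the parity bit to append and the check the receiver performs (function names are my own, for illustration):

```python
def even_parity_bit(bits):
    """Parity bit to append so the total number of 1s is even."""
    return sum(bits) % 2          # 1 if the data has an odd number of 1s, else 0

def check_even_parity(code):
    """True if the received code (data bits + parity bit) has an even number of 1s."""
    return sum(code) % 2 == 0
```

Note that a single flipped bit changes the parity and is caught, while two flipped bits restore it, which is exactly the two-error blind spot mentioned above.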
UART
What sort of hardware is needed by the sender and receiver? The sender needs to convert bytes
to bit sequences. The receiver needs to convert bit sequences into bytes. Thus, the sender needs
a parallel-in serial-out shift register. This is just what it sounds like. For example, consider an I/O
controller for a printer which uses a serial line. It takes one byte in parallel off the system bus, and
then writes the bits one by one on a serial line. It could write either the least significant bit first (in
the case of shift right) or the most significant bit first (in the case of shift left). Similarly, the printer
(receiver) needs a serial-in parallel-out shift register. It reads the bits one by one into a register,
and then outputs the byte as a unit into another register which is used for processing the byte.
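Here is a sketch of the two shift registers just described, using the shift-right order (least significant bit first). The class names are illustrative, not from any real hardware library; each method call stands in for one clock tick of the register.

```python
class ParallelInSerialOut:
    """Sender side: load a whole byte, then shift bits out one at a time."""
    def load(self, byte):
        self.reg = byte
    def shift_out(self):
        bit = self.reg & 1         # the LSB leaves first (shift right)
        self.reg >>= 1
        return bit

class SerialInParallelOut:
    """Receiver side: shift bits in one at a time, then read the whole byte."""
    def __init__(self):
        self.reg = 0
    def shift_in(self, bit):
        self.reg = (self.reg >> 1) | (bit << 7)  # each new bit enters at the MSB end
    def read(self):
        return self.reg
```

Because the sender emits the LSB first and the receiver pushes each arriving bit in at the top, after eight shifts every bit has been pushed down to its original position and the byte comes out intact.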
The UART (Universal Asynchronous Receiver/Transmitter) is an I/O controller that converts
serial to parallel (input) or parallel to serial (output). A UART is more than just the shift registers
mentioned above, though. It has several registers so that the CPU or I/O device can communicate
with it. It also needs special circuitry to remove the start and stop bits and to check for parity.
Many USB devices can be connected to a computer. Physically the connections define a tree
structure. The root of the tree is the USB controller, inside the computer, which is also called the
USB host. You may have two ports (slots where you plug the USB connector) which are children
of the root. You can connect USB hubs which serve as internal tree nodes. And you can connect
hubs to hubs. In total, you can connect up to 127 devices to a single USB host.
[Figure: a USB tree — the computer (USB host) at the root, hubs as internal nodes, and devices as leaves.]
Although the above describes a tree-like structure, in fact each of the USB devices sees everything
that is put on the bus, i.e. the hubs (internal nodes) do not block the signals being sent. The host
communicates by polling. It sends messages to the devices, and they respond only when prompted.
“Device 15, do you have anything to say?” (Response: “no”.) “Device 24, do you have anything to
say?” (Response: “no”.) “Device 15, do you have anything to say?” (Response: “yes, bla bla bla”.)
Etc.
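The polling pattern above can be sketched as a simple loop: the host asks each device in turn, and a device speaks only when asked. (The device addresses and messages here are made up for illustration.)

```python
def poll_round(devices):
    """One polling round over {address: pending-message list}; returns responses."""
    responses = []
    for addr, pending in devices.items():
        # a device answers only when polled; "no" means nothing to say
        msg = pending.pop(0) if pending else "no"
        responses.append((addr, msg))
    return responses
```

The host repeats such rounds forever, which is why a USB device never interrupts the host on its own: any data it has simply waits until the next time it is polled.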
The USB standard specifies various formats for the packets of data that are used to communicate
between the host and a USB device. Each packet consists of fields including a sync field which is used
to synchronize the devices to some extent (details omitted), a packet ID which says what type of
packet it is (e.g. 'ready'), an address (which device is this packet for?), data, and error correction information.
Details (many!) are omitted here.
Many USB drivers come with typical operating systems (Windows 7, Linux Ubuntu, Mac OS
X, ...). If a driver doesn't come preinstalled then it can be downloaded (sometimes automatically).
When you connect a USB device (such as a new mouse or printer) to your computer, the USB
controller (host) interrogates the device to find out what it is. It then sends this information to the
CPU, which checks if the OS has the driver on disk. If it does, it loads the driver. This automatic
system is known as “plug and play”.1
• VGA (a 15-pin parallel connector – what I use to connect my laptop to the projector), invented
in the 1980s; the pins carry R, G, B data values, plus various control lines (end of row, end of
frame)
1 When the system was first being introduced about a decade ago and didn't work as well as it does now, it was
known as “plug and pray”.
• DisplayPort, miniDP
I also briefly discussed how an optical mouse works. This material is for your interest only. It is
not on the final exam and so I do not elaborate here.