Introduction /
Performance / Machine
Basics
CS270 Unit 1
Max Luttrell, Spring 2017
"Welcome to the machine"
- Pink Floyd, 1975
levels of
computer usage
example user typical task
higher level "can you use your phone
grandma
to check traffic for me?"
"use a smartphone to
dad
check traffic"
"write a smartphone app in
application programmer java which provides traffic
predictions"
"write MIPS assembly
lower level cs270 student language code to sort an
array of integers"
levels of
program code
• Application software
• written in a high level language (e.g.
C/C++, java, swift, etc.)
• Systems software
• compiler - translate C++ code into
target processor's machine code
• operating system
• handle input/output (I/O)
• manage memory and storage
• schedule tasks
• share resources
• Hardware
• processor, memory, I/O controllers
compiler: translates high-level
language into assembly language
assembler: translates assembly
language into binary machine language
why use a high level
language?
• closer to natural language / algebra
• can be tailored to its typical usage, e.g. fortran for scientific
computation, cobol for business applications
• improves programmer productivity
• fewer lines = fewer opportunities for error, easier to debug
• closer to natural language than assembly, easier to understand/
debug
• platform independent
• the same C program can be compiled for different target
architectures, e.g. MIPS, ARM, Intel x86
performance
Which airplane has the best performance?
Boeing 777 Boeing 777
Boeing 747 Boeing 747
BAC/Sud BAC/Sud
Concorde Concorde
Douglas Douglas DC-
DC-8-50 8-50
0 100 200 300 400 500 0 2000 4000 6000 8000 10000
Passenger Capacity Cruising Range (miles)
Boeing 777 Boeing 777
Boeing 747 Boeing 747
BAC/Sud BAC/Sud
Concorde Concorde
Douglas Douglas DC-
DC-8-50 8-50
0 500 1000 1500 0 100000 200000 300000 400000
Cruising Speed (mph) Passengers x mph
what is high performance?
which is best
scenario
performance
run same program on two whichever one gets the
different desktop program done first (best
computers execution time)
datacenter with multiple whichever one completes
computers running jobs the most jobs in a given
submitted by many users day (best throughput)
relative performance
• define performance = 1 / execution time
• X is n times faster than Y
performancex / performancey =
execution timey / execution timex = n
• Example: execution time to run a program
• 10s on computer A, 15s on computer B
• execution timeB / execution timeA = 15s / 10s = 1.5
• Thus, A is 1.5 times faster than B
measuring execution time
• total response time
• more than just computation, also includes I/O,
OS time, idle time
• CPU time for a given job
• CPU time plus OS time
• discounts I/O time and idle time
CPU clocks
• operation of digital hardware is governed by a constant rate-clock
• Clock period = duration of one clock cycle
A. e.g. 0.1 s
-12
B. e.g. 250ps = 0.25ns = 250*10 s
• Clock frequency = clock cycles per second = 1 / clock period
A. e.g. 10 Hz
9
B. e.g. 4.0GHz = 4000MHz = 4.0*10 Hz
CPU time
• CPU Time = CPU Clock Cycles * Clock Cycle Period =
CPU Clock Cycles / Clock Rate
• Performance improved by:
• reduce number of clock cycles needed
• increase clock rate
• HW designer must often trade off clock rate against
cycle count
CPU time example
• Computer A: 2GHz clock, 10s CPU time for our program
• Designing computer B
• Aiming for 6s CPU time for our program
• We can do a faster clock for computer B, but this will cause
us to need 1.2x the clock cycles we needed on computer A
• How fast must computer B's clock be?
Clock CyclesB 1.2 × Clock CyclesA
Clock RateB = =
CPU TimeB 6s
Clock CyclesA = CPU Time A × Clock Rate A
= 10s × 2GHz = 20 × 10 9
1.2 × 20 × 10 9 24 × 10 9
Clock RateB = = = 4GHz
6s 6s
instruction count / CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
Instruction Count × CPI
=
Clock Rate
• instruction count
• determined by program, Instruction Set Architecture (ISA), and compiler
• average cycles per instruction (CPI)
• determined by CPU hardware
• different instructions can have different CPI
• average CPI affected by instruction mix
CPI example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Both computers have the same Instruction Set Architecture (ISA)
• For a given program (with a fixed instruction count I), which
computer is faster?
CPU Time = Instruction Count × CPI × Cycle Time
A A A
= I × 2.0 × 250ps = I × 500ps
CPU Time = Instruction Count × CPI × Cycle Time
B B B
= I × 1.2 × 500ps = I × 600ps
CPU Time
B = I × 600ps = 1.2
CPU Time I × 500ps
A
another CPI example
• we have two compiled code sequences using
instructions in classes A, B, C, each with different CPI's
class A B C
CPI for class 1 2 3
class A B C
IC in code sequence 1 2 1 2
IC in code sequence 2 4 1 1
Sequence 1: IC = 5 Sequence 2: IC = 6
Clock Cycles
Clock Cycles
= 2×1 + 1×2 + 2×3
= 4×1 + 1×2 + 1×3
= 10 =9
Avg. CPI = 10/5 = 2.0 Avg. CPI = 9/6 = 1.5
performance optimization
• common case
• optimize what's used most commonly during
program execution
• typically, this is a small fraction of overall code
Amdahl's law
Taffected
Timproved = + Tunaffected
improvemen t factor
• Given: a program runs in 100 seconds on our computer. 80
seconds of this time is spent doing multiplication. 20 seconds is
spent doing other things.
• Assume we can get our hardware designer to speed up
multiplication on our computer.
• Question: what improvement factor for multiplication do we need if
we want our program to run:
A. two times faster?
B. five times faster?
performance optimization
optimization type example
running two errands: I run
parallel one errand, you run the
other errand
laundry: move first load to
pipeline dryer, immediately start
second load in washer
travel: research trip
speculative execution itinerary before you're sure
you will go
cache: cache memory
use several memory levels
faster than main memory
number representation
• decimal - base 10 - humans
• each digit represents a power of 10
• e.g. 1979decimal = 1*1000 + 9*100 + 7*10 + 1*9
• binary - base 2 - computers
• each digit represents a power of 2
• e.g. 1011binary = 1*8 + 0*4 + 1*2 + 1*1 = 11decimal
number representation
position values examples
base
(b) in in
exponential decimal value base b decimal position values decimal
10 103 102 10 1 1000 100 10 1 1234 1*1000+2*100+3*10+4*1 1234
2 23 22 2 1 8 4 2 1 1011 1*8 + 0*4 + 1*2 + 1*1 11
4 43 42 4 1 64 16 4 1 3203 3*64 + 2*16 + 0*4 + 3*1 227
8 83 82 8 1 512 64 8 1 2715 2*512+ 7*64 + 1*8 + 5*1 1485
16 163 162 16 1 4096 256 16 1 2AC8 2*4096+10*256+12*16+8*1 10952
base 16: hexadecimal (aka hex) uses digits 0-9, then A-F
base conversion
• e.g. convert octal 437 to hexadecimal
• idea: convert octal to decimal, then decimal to
hex
• 437octal = 4*64 + 3*8 + 7 = 287decimal
• 287decimal contains 256 (16*16). 287-256 = 31
• 31 contains 16. 31-16 = 15
• answer: 11Fhex, commonly written 0x11F
base conversion
• an easier way to convert base for powers of 2 is
to express in binary first. for our example:
octal 4 3 7
binary 1 0 0 0 1 1 1 1 1
hex 1 1 F
binary 1 0 0 0 1 1 1 1 1
signed integers
• signed integer can be negative
• unsigned integer can not be negative
• sign/magnitude: use the first bit to indicate sign. 0-
pos; 1-neg. then use the rest of the bits for magnitude.
• example: we have 4 bit numbers. 6 is 0110. -6 is 1110.
two's complement
• instead of sign-magnitude, modern computers use two's
complement
• idea:
• first bit still indicates sign, 0-pos and 1-neg
• to change sign, flip every bit and then add 1 (two’s
complement operation)
• example: convert 8 bit number for positive 23 to negative -23
00010111 11101000 11101001
23 flip +1 -23
interpreting binary
• example: the following binary contains an
unsigned 4-bit int, then a signed 8-bit int, then a
signed, 2’s complement 16-bit int. Convert each
to decimal
• 1001001010011111111111101001
value two's complement decimal total
1001 n/a 8+1 9
00101001 n/a 32+8+1 41
1111111111101001 0000000000010111 16+4+2+1 -23
sign extension
• converting shorter ints to longer ints
• for unsigned, just pad with zeros
• ex: 4-bit 5 is 0101. 8-bit 5 is 00000101.
• for signed two's complement, just pad with
most-significant-bit (MSB)
• ex: 4 bit -7 is 1001. 8-bit -7 is 11111001
adding/subtracting two's
complement
• addition: do bitwise addition
• e.g. 4 bit values
• 3decimal = 0011. 2decimal = 0010. 3+2 = 5 = 0101
• subtraction: take two's complement of number
being subtracted, and then do addition
• 3decimal = 0011. 2decimal = 0010. 3 - 2 = 3+(-2) =
0011 + 1110 = 0001 = 1decimal
overflow
• overflow: not enough bits to store the result
• e.g. if we try signed two’s complement 4-bit 7+3
= 0111+0011 = 1010 (note sign is wrong!)
• can detect overflow if two operands are like-
signed, and the sign of result is different
interpreting data
• what's this?
• 0001001010011111
• could be anything!
• 16 bit unsigned int
• 16 bit signed int
• float
• an instruction for a computer to execute
where is data stored?
memory type
slower,
cheaper, backup
(e.g. cloud, tape backup)
larger
external storage
(e.g. hard drive, flash drive)
random access memory
(RAM)
faster, more
expensive, registers
smaller
typically, data must be in a register to run an instruction on it
running a program
• a program's instructions are typically stored
consecutively in RAM
• typical machine uses two special registers to
execute them
Instruction Register
holds current instruction
(IR)
Program Counter holds address of next
(PC) instruction to execute
fetch / decode / execute
cycle
• Prior to running program, load PC
with address of its first instruction.
IR PC
Then repeatedly:
• Fetch
• load IR with next instruction
• increment PC
• Decode
• determine what instruction
does and where operands are
• Execute
• perform the instruction
types of instructions -
arithmetic
• arithmetic instruction includes:
• opcode operation to perform, e.g. add,
multiply
• operands - what data to perform operation on
• where to place the result
types of instructions -
branch
• branch instruction includes:
• under what conditions to branch
• address of next instruction to execute if we
are to branch
types of instructions -
halt
• halt: stop the machine
• halt instruction: normal termination
• or, error condition encountered
Instruction Set Architecture
(ISA)
• ISA: set of instructions and machine capabilities
• interface between hardware and lowest-level software
type description example
postfix calculator,
stack-based uses stack for data e.g. 3 4 + to
calculate 3+4
implicitly uses one
accumulator-
accumulator (or a select Simple Machine
based
few) for data
many general purpose
general- registers; info about all MIPS, ARM, Intel
register operands contained in x86
each instruction
general-register
architecture types
• CISC vs. RISC
• Reduced Instruction Set Computer -
instructions are simple. Most modern
architectures are RISC
• Complex Instruction Set Computer - more
complex instructions available. Intel x86 is
CISC but with some RISC added
addressing modes
• where are the operands?
mode description example
instruction requires both add two registers,
register source operands and store result in
destination in registers another register
one operand is in a add a register to a
mixed / register, the other operand constant, store
constant is a constant contained in result in register
instruction
MIPS mult
instruction places
one or more operands is
implicit results of multiply
implicit in the instruction
in lo and hi
registers
memory interface
• we need to be able to read from
memory and write to memory
MBR MAR
• memory interface: array of bytes
with two special registers
• Memory Address Register
(MAR) - address of next
memory reference
• Memory Buffer Register
RAM
(MBR) - storage location for
data being read from or
written to memory
memory cycle
• how to write to memory:
MBR MAR • access the MAR
• place contents of MBR at
memory[MAR]
• how to read from memory:
RAM • access the MAR
• retrieve contents of
memory[MAR] and put in
MBR
register transfer language/
notation (RTL/RTN)
• all internal operations in a CPU can be described as
sequence of operations on, and transfers between registers,
and memory cycles.
• the RTL (or RTN) contains:
• operations
• operands
• ability to specify specific bits in registers
• programming constructs for CPU's logic capabilities
•
Exercise 1A
Make sure you can connect to our course Canvas page and view homework
#1.
• http://ccsf.instructure.com
• Make sure you can connect between our hills linux server and the Batmale
413 linux workstations, using the syllabus as a guide and for password /
login information.
• If you're on a linux workstation, demonstrate an ssh connection to hills.
Replace “uname” with your username.
ssh [email protected]
• If you're on your own laptop, demonstrate a terminal window logged into
hills, and from that terminal window, ssh to a linux workstation. The IP
range you can use is anything between 147.144.23.31-147.144.23.58.
ssh [email protected]
• For Mac laptops, ssh is builtin to the terminal window. For Windows
laptops, you will need to download a terminal client like Putty or
SSHSecureShellClient.exe. See the computer systems guide on my web
page, http://fog.ccsf.edu/~mluttrel
Exercise 1B
• Linux basics and simple machine intro
http://fog.ccsf.edu/~mluttrel/cs270/exercises/ex1b.html