0% found this document useful (0 votes)
40 views67 pages

02 Computer Arithmetic

Chapter 2 of 'Foundations of Computer Science' covers Computer Arithmetic, focusing on numeral systems, signed and unsigned numbers, fixed-point, and floating-point representations. It explains various numeral systems, including polyadic systems and binary, octal, decimal, and hexadecimal representations, along with methods for number conversion. Additionally, it discusses the encoding of negative numbers using sign-magnitude, one’s complement, and two’s complement representations.

Uploaded by

lucky25hada
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
40 views67 pages

02 Computer Arithmetic

Chapter 2 of 'Foundations of Computer Science' covers Computer Arithmetic, focusing on numeral systems, signed and unsigned numbers, fixed-point, and floating-point representations. It explains various numeral systems, including polyadic systems and binary, octal, decimal, and hexadecimal representations, along with methods for number conversion. Additionally, it discusses the encoding of negative numbers using sign-magnitude, one’s complement, and two’s complement representations.

Uploaded by

lucky25hada
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Foundations of

Computer Science
Chapter 2 Computer Arithmetic

Prof. Dr.-Ing. Richard Membarth

October 16th 2024


Contents

2. Computer Arithmetic

2.1 Numeral Systems

2.2 Signed and Unsigned Numbers

2.3 Fixed-Point Numbers

2.4 Floating-Point Numbers

2 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic | Winter Term 2024/2025
Foundations of
Computer Science
Chapter 2.1 Computer Arithmetic
Numeral Systems
Prof. Dr.-Ing. Richard Membarth

October 16th 2024


Numeral Systems

artificial system to quantify amount of elements in a set


consists of
set of symbols, a.k.a. digits
rules to combine digits

set of elements representation of size

7 decimal system
VII Roman system
Maya system

25
||||| || line (unary) system

4 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Classification

numeral systems

polyadic system
sign-value notation hybrid systems
positional system

Roman system unary system Maya system decimal system binary system
... ...

2
LXXX ||||| || 42 101

5 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Polyadic System

Polyadic System

The polyadic system is a numeral system in which the contribution of a digit to the value of a number depends on its position. A
polyadic system to the base b is a system in which a number is formed as a sum of powers of b.
n
X
A natural number N can be represented as a sum of powers: N = ai · bi
i =0

with b as base of the number system: b ∈ N, b ≥ 2


and set of b digits (symbols!) ∈ {0, 1, . . . , b − 1}
any number can be represented uniquely in the base-b numeral system
representation of a natural number N
N = (an an−1 · · · a1 a0 )b

6 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Polyadic System

system base symbols used by humans used in computers


decimal 10 0, 1, . . . , 9 yes no
binary 2 0, 1 no yes
octal 8 0, 1, . . . , 7 no no
hexadecimal 16 0, 1, . . . , 9, A, B, . . . , F no no

decimal system: b = 10(10) = (10)10 = 1010 = 10ten


example: 1999
binary system: b = 210
example: 00102 = 0b0010
octal system: b = 810
example: 1000768
hexadecimal system: b = 1610
example: FF 0AC516 = 0xFF 0AC5

7 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Binary System

binary system
decimal system octal system binary system
b=2 102 101 100 82 81 80 24 23 22 21 20
b ∈ {0, 1} 0 0 0
polyadic representation 1 1 1
2 2 1 0
overflow to next position if value range is exhausted
3 3 1 1
can be efficiently implemented in hardware
4 4 1 0 0
(represented and processed) 5 5 1 0 1
basic numeral system for number processing in 6 6 1 1 0
7 7 1 1 1
digital systems
8 1 0 1 0 0 0
9 1 1 1 0 0 1
1 0 1 2 1 0 1 0
... ... ... ... ...
1 9 2 3 1 0 0 1 1
2 0 2 4 1 0 1 0 0
8 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Polyadic System

examples
decimal system: 376110
376110 = 3761 = 3 · 103 + 7 · 102 + 6 · 101 + 1 · 100
octal system: 37618
37618 = 3 · 83 + 7 · 82 + 6 · 81 + 1 · 80 = 2033
| {z }
number shown as decimal
binary system: 1001012
1001012 = 1 · 25 + 0 · 24 + 0 · 23 + 1 · 22 + 0 · 21 + 1 · 20 = |{z}
37
number shown as decimal
hexadecimal system 3E116
3E116 = 3 · 162 + 14 · 161 + 1 · 160 = |{z}
993
number shown as decimal

9 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Horner’s Scheme

n
X
b-adic representation defines a polynomial: N = ai · bi
i =0
polynomial Horner’s scheme
p(b) = a0 + a1 · b + a2 · b2 + · · · + an · bn p(b) = (· · · (an · b + an−1 ) · b + · · · ) · b + a0
multiplications, additions, and only multiplications and additions, no
exponentiations exponentiations
expensive to compute more efficient computation
problems with accuracy (for floating-point can be used to convert between different
numbers) positional numeral systems
1 p = 0 1 p = 0
2 for i in range(0, n): 2 for i in range(n-1, -1, -1):
3 e = 1 3 p = p * b + a[i]
4 for k in range(1, i):
5 e = e * b
6 p = p + a[i] * e

10 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Horner’s Scheme

number conversion using the Horner’s scheme


convert from a base b1 to b2 ((((1 · 2 + 1) · 2 + 0) · 2 + 1) · 2 + 0) · 2 + 1
computation happens in b2
(((3 · 2 + 0) · 2 + 1) · 2 + 0) · 2 + 1
example: 1101012 → 5310
((6 · 2 + 1) · 2 + 0) · 2 + 1
polynomial
(13 · 2 + 0) · 2 + 1
p110101 (b) = 1 · b0 + 0 · b1 + 1 · b2 + 0 · b3 + 1 · b4 + 1 · b5
26 · 2 + 1
Horner’s scheme
53
p110101 (b) = ((((1 · b + 1)· b + 0)· b + 1)· b + 0)· b + 1

11 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Division with Remainder Method

number conversion using successive division with remainder


Euclidean division 157/2 = 78 remainder 1
convert from a base b1 to b2
78/2 = 39 remainder 0
computation happens in b1
39/2 = 19 remainder 1
example: 15710 → 1001 11012
19/2 = 9 remainder 1
it is often easier to convert first to the decimal system and then
9/2 = 4 remainder 1
to the desired system
4/2 = 2 remainder 0

2/2 = 1 remainder 0

1/2 = 0 remainder 1

12 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Number Conversion

7258 to base 3 using the Horner’s scheme


78 = 213 ; 28 = 23 ; 58 = 123 ; 108 = 810 = 223
7258 = (213 · 223 + 23 ) · 223 + 123
213 · 223 = 20023
20023 + 23 = 20113
20113 · 223 = 1220123
1220123 + 123 = 1221013
889 to base 5 using division with remainder
889 /59 = 179 R 09
179 /59 = 39 R 19
39 /59 = 09 R 39
889 = 3105

13 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Conversion by Combination

if the base of a numeral system is the power of the base of another numeral system, numbers can be easily converted
combining several digits into a single digit
converting 1011 0101 00012 to hexadecimal
( 1011 0101 0001 )2
⇓ ⇓ ⇓
( B 5 1 )16
converting 5218 to binary
( 5 2 1 )8
⇓ ⇓ ⇓
( 101 010 001 )2

14 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Numeral Systems
Conversion of Digits

it is much easier to convert between numeral systems when considering each digits separately
this is not possible when converting from the decimal to the binary system
remedy: BCD codes
each decimal digit is represented by a 4-bit binary digit
4910 = 0100 1001BCD 8421

15 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Numeral Systems | Winter Term 2024/2025
Foundations of
Computer Science
Chapter 2.2 Computer Arithmetic
Signed and Unsigned Numbers
Prof. Dr.-Ing. Richard Membarth

October 17th 2024


Signed and Unsigned Numbers
Negative Numbers

negative numbers are prefixed by minus sign: 2510 , −2510 , −10102 , +10102 , · · ·
how should we encode the numbers / sign?
different types of signed binary number representations
sign-magnitude
one’s complement
two’s complement
elementary arithmetic supported in all systems: +, −, ∗, /
technical realization: costs (area) and performance (throughput) are important

17 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Negative Numbers

sign-magnitude
010101000011
simple approach 0110 5 4 3 0010
use sign bit with magnitude 6 2
0111 0001
7 1
left-most bit (most significant bit) indicates sign 4 bit
1000 -0 0 0000
0: positive number -7
-1
1001 1111
1: negative number -2 -6
1010 -3 -4 -5 1110
signed magnitude
101111001101
↑ | {z } 2 = 1210
0 0001100 | {z } 2 = −1210
1 0001100

| magnitude | magnitude
sign bit sign bit
zero represented twice
000000002 (+0)
100000002 (−0)

18 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Negative Numbers

one’s complement
010101000011
obtained by inverting all bits 0110 5 4 3 0010
4310 = 001010112 → −4310 = 110101002 6 2
0111 0001
7 1
range of signed numbers: −(2N −1 − 1) · · · 2N −1 − 1 4 bit
1000 -7 0 0000
8-bit range: −127 · · · + 127
-6 -0
1001 1111
zero represented twice -5 -1
1010 -4 -3 -2 1110
000000002 (+0)
101111001101
111111112 (−0)

19 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Negative Numbers

two’s complement
010101000011
obtained by inverting all bits and adding one 0110 5 4 3 0010
4310 = 001010112 → −4310 = 110101012 6 2
0111 0001
7 1
range of signed numbers: −(2N −1 ) · · · 2N −1 − 1 4 bit
1000 -8 0 0000
8-bit range: −128 · · · + 127
-7 -1
1001 1111
zero represented once -6 -2
1010 -5 -4 -3 1110
000000002
101111001101
converting to decimal
−27 +26 +24 +22 +20 = −128+64+16+4+1 = −43

20 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Logical Operators

the execution unit of a processor supports different operations on binary numbers


arithmetic operations: +, −, ∗, /
logical operations: NOT, AND, OR, XOR
shift operations: shift left, shift right
logical operations work bitwise
bits are updated in parallel and independent of each other
no polyadic system, no carry-over

operation symbol C syntax meaning

NOT a ¬ ~a invert all bits of a (one’s complement)


a AND b ∧ a & b logical AND operation on each pair of corresponding bits
a OR b ∨ a | b logical OR operation on each pair of corresponding bits
a XOR b ⊕ a ^ b logical XOR operation on each pair of corresponding bits

21 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Logical Operators

a ¬a a b a∧b a b a∨b a b a⊕b

0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 1 1 0 1 1
1 0 0 1 0 1 1 0 1
1 1 1 1 1 1 1 1 0

a = 01010101 a = 01010101 a = 01010101 a = 01010101

b = 11110000 b = 11110000 b = 11110000

¬a = 10101010 a ∧ b = 01010000 a ∨ b = 11110101 a ⊕ b = 10100101

22 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Logical Operators

the logical operators !, &&, and || in C are wordwise logic operators


they combine n-ary binary numbers and return a n-digit zero or one
the bitwise logical operators ~, &, |, and ^ operate on all bits simultaneously
they return n single digit zeros or ones

(5 & (-9)) | 2 -> 7 (5 && (-9)) || 2 -> 1

5 00000101 5 00000101

−9 11110111 −9 11110111

5&(−9) 00000101 5&&(−9) 00000001

2 00000010 2 00000010

(5&(−9))|2 00000111 (5&&(−9))|2 00000001

23 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Arithmetic

addition addition can be realized as known from school


the binary system requires only a few rules
0 0 0 0 0 1 0 0 0 1 0 1 (6910 )
1) 0 + 0 = 0 + 0 0 0 0 0 0 0 0 1 1 0 0 (1210 )
2) 0 + 1 = 1 1 1 carry

3) 1 + 0 = 1 0 0 0 0 0 1 0 1 0 0 0 1 (8110 )

4) 1 + 1 = 0 (with a carry-over of 1)
two’s complement: carry beyond the msb is
ignored

24 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Arithmetic

subtraction subtraction can be realized as known from school


the binary system requires only a few rules
0 0 0 0 0 1 0 0 0 1 0 1 (6910 )
1) 0 − 0 = 0 − 0 0 0 0 0 0 0 0 1 1 0 0 (1210 )
2) 0 − 1 = 1 (with a borrow of 1) 1 1 borrow

3) 1 − 0 = 1 0 0 0 0 0 0 1 1 1 0 0 1 (5710 )

4) 1 − 1 = 0
two’s complement: subtraction via addition

25 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Arithmetic

subtraction: using two’s complement


overflow when taking the two’s complement of 0 is 0 0 0 0 0 1 0 0 0 1 0 1 (6910 )
ignored + 1 1 1 1 1 1 1 1 0 1 0 0 (−1210 )
1 1 1 1 1 1 1 carry
how can we get the value of a negative number
0 0 0 0 0 0 1 1 1 0 0 1 (5710 )
using two’s complement?
(x + ¬x ) + 1 = 0
¬x + 1 = −x
0 0 0 0 0 0 0 0 1 1 0 0 (1210 )
1111 1100 01112 →
+ 1 1 1 1 1 0 1 1 1 0 1 1 (−6910 )
0000 0011 10012 → −5710 1 1 1 carry

1 1 1 1 1 1 0 0 0 1 1 1 (−5710 )

26 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Arithmetic

multiplication
can be realized by shifting and summing up
multiplying n-bit number with m-bit number results in n+m-bit number

1 1 0 1 (1310 )
× 1 0 1 (510 )
1 1 0 1
+ 0 0 0 0
1 1 0 1
+ 1 1 0 1
1 0 0 0 0 0 1 (6510 )

27 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Arithmetic

division
using Euclidean’s algorithm 1 1 0 1 / 1 0 1 = 10 (1310 /510 = 210 )
− 1 0 1 0
1 # c = n / d, remainder r
2 r = n 0 1 1
3 c = 0 − 0 0 0
4 while r >= d:
5 r = r - d 1 1 remainder (310 )
6 c = c + 1

28 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Adder

circuit for binary addition A3 B3 A2 B2 A1 B1 A0 B0


nested for multiple bits
full adder required for more than 2 bits
+ + + +

S3 C3 S2 C2 S1 C1 S0

29 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Status Register

the adder sets several status registers for the next instruction or software
N negative (sign) flag: indicates that the result is negative
Z zero flag: indicates that the result is zero
C carry flag: allows to carry a binary digit from a less significant word to the LSB of a more significant word
V overflow flag: indicates that the signed result of an operation is too large to fit in the register width
the carry flag can be used to indicate overflow for unsigned addition
the overflow flag is set if the sign of the summands are equal and different from the sign of the result

0 1 1 1 1 1 1 1 (12710 ) 1 0 0 1 1 0 0 1 ( − 10310 )
+ 0 0 0 0 0 0 0 1 (110 ) + 1 0 0 0 1 1 1 1 ( − 11310 )
1 1 1 1 1 1 1 carry 1 1 1 1 1 1 carry

1 0 0 0 0 0 0 0 (−12810 ) 0 0 1 0 1 0 0 0 (4010 )
N: 1 Z : 0 C: 0 V : 1 N: 0 Z : 0 C: 1 V : 1

8-bit signed integer (two’s complement) 8-bit signed integer (two’s complement)
30 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Status Register

1 1 1 1 1 1 1 1 ( − 110 ) 1 1 1 1 1 1 1 1 (25510 )
+ 1 1 1 1 1 1 1 1 ( − 110 ) + 1 1 1 1 1 1 1 1 (25510 )
1 1 1 1 1 1 1 1 carry 1 1 1 1 1 1 1 1 carry

1 1 1 1 1 1 1 0 (−210 ) 1 1 1 1 1 1 1 0 (25410 )
N: 1 Z : 0 C: 1 V : 0 N: 1 Z : 0 C: 1 V : 0

8-bit signed integer (two’s complement) 8-bit unsigned integer

31 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Data Types

what does 100001002 represent? 1 char c; 1 int8_t c;


2 short s; 2 int16_t s;
8410 ? −410 ? z ? 3 int i; 3 int32_t i;
4 unsigned int u; 4 uint32_t u;
we need to specify the data type!
languages like Python do not expose data types finite arithmetic: overflow and underflow
C/C++ allows to specify data types

C type stdint type bits sign range


char uint8_t 8 unsigned 0 . . . 255
signed char int8_t 8 signed -128 . . . 127
unsigned short uint16_t 16 unsigned 0 . . . 65,535
short int16_t 16 signed -32,768 . . . 32,767
unsigned int uint32_t 32 unsigned 0 . . . 4,294,967,295
int int32_t 32 signed -2,147,483,648 . . . 2,147,483,647
unsigned long long uint64_t 64 unsigned 0 . . . 18,446,744,073,709,551,615
long long int64_t 64 signed -9,223,372,036,854,775,808 . . . 9,223,372,036,854,775,807

32 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Hex Dumps

hex dump or hex editor allow to inspect and edit raw data
first column: displacement (offset) in file or memory in hexadecimal format
second column: raw data in hexadecimal format, 16 bytes per row
third column (optional): ASCII character translation

1 00000000: 4761 6c6c 6961 2065 7374 206f 6d6e 6973 Gallia est omnis
2 00000010: 2064 6976 6973 6120 696e 2070 6172 7465 divisa in parte
3 00000020: 7320 7472 6573 2c20 7175 6172 756d 2075 s tres, quarum u
4 00000030: 6e61 6d20 696e 636f 6c75 6e74 2042 656c nam incolunt Bel
5 00000040: 6761 652c 2061 6c69 616d 2041 7175 6974 gae, aliam Aquit
6 00000050: 616e 692c 2074 6572 7469 616d 2071 7569 ani, tertiam qui
7 00000060: 2069 7073 6f72 756d 206c 696e 6775 6120 ipsorum lingua
8 00000070: 4365 6c74 6165 2c20 6e6f 7374 7261 2047 Celtae, nostra G
9 00000080: 616c 6c69 2061 7070 656c 6c61 6e74 7572 alli appellantur
10 00000090: 2e20 4869 206f 6d6e 6573 206c 696e 6775 . Hi omnes lingu
11 000000a0: 612c 2069 6e73 7469 7475 7469 732c 206c a, institutis, l
12 000000b0: 6567 6962 7573 2069 6e74 6572 2073 6520 egibus inter se
13 000000c0: 6469 6666 6572 756e 742e 2047 616c 6c6f differunt. Gallo
14 ...
33 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Data Types

type conversions might require an extension of the operand length


the sign of signed operands needs to be preserved
unsigned operands are extended by leading zeros
signed operands copy the sign bit to the leading positions

1 unsigned char c = 0x80; 1 char c = 0x7F; 1 unsigned char c = 0x80;


2 unsigned char d = 0x80; 2 char d = 0x80; 2 char d = 0x80;
3 printf("%08X", c + d); 3 printf("%08X", c + d); 3 printf("%08X", c + d);

1 c: 00000080 1 c: 0000007F 1 c: 00000080


2 d: 00000080 2 d: FFFFFF80 2 d: FFFFFF80
3 -> 00000100 3 -> FFFFFFFF 3 -> 00000000

34 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Data Types

1 #include <stdio.h> overflow / underflow


2
3 void main() { addition / subtraction can overflow
4 int x = 2;
multiplication can overflow
5 for (int i = 0; i < 5; i++) {
6 x = x * x; precision loss
7 printf("%d, ", x);
division can lose precision (rounding, truncation)
8 }
9 printf("\n");
10 }

35 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
GPS Week Number Rollover

GPS broadcasts also a time, starting from January 6th, 1980


10-bit week number
19-bit seconds-into-week
week number counter range: 0–1023
integer overflow causes the counter to roll over every 1024 weeks or ~19.6 years
1st rollover August 21st to 22nd, 1999
2nd rollover April 6th to 7th, 2019
3rd rollover November 20th to 21st, 2038
...
many devices rely on GPS time, resulting in problems
Honeywell’s flight management and navigation software
New York City Wireless Network (NYCWiN)
GPS navigation devices, mobile phones (Apple), car time (Honda, Acura), . . .

36 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Y2K22

dates that are stored in the format yymmddHHMM


31.12.2021, 23:59 → 2112312359 → 01111101 11100111 01010100 00100111
01.01.2022, 00:00 → 2201010000 → 10000011 00110000 10111111 01010000
problem when using 32-bit signed integer to represent dates
Microsoft Exchange was affected by this
malware-scanning component used 32-bit signed integer to check for latest update
became known as Y2K22 bug

37 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Saturation

normal behavior
overflow / underflow silently tolerated → often incorrect result
cause traps for exception handling → expensive
often it is enough to continue the calculating with the highest (overflow) or lowest (underflow) value
example
z = (a + b)/2
a = 7, b = 9 → z = (7 + 9)/2 = 16/2 = 8
calculation with 4-bit unsigned integer

correct overflow saturated


a 0111 0111 0111
b 1001 1001 1001
a+b 10000 0000 1111
(a + b) / 2 1000 0000 0111

38 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Signed and Unsigned Numbers
Saturation

Ariane 5 (V88) (1996)


overflow during converting a 64-bit floating-point value to 16-bit integer
maximum thrust, then self-destruction of the rocket (39s after launch)
loss: 290 mio Euro
root cause: wrong specification (mistakenly copied over by Ariane 4)
[Link]

39 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Signed and Unsigned Numbers | Winter Term 2024/2025
Foundations of
Computer Science
Chapter 2.3 Computer Arithmetic
Fixed-Point Numbers
Prof. Dr.-Ing. Richard Membarth

October 24th 2024


Fixed-Point Numbers
Numeral Systems

Recap: Polyadic System


n
X
A natural number N can be represented as a sum of powers: N = ai · bi
i =0

can we also represent non-integer values?


n
X
also use negative values for i: R = ai · b i
i =−m
representation of rational number R
R = (an an−1 an−2 · · · a1 a0 . a−1 a−2 · · · a−m )b
| {z } | {z }
integer part of R fractional part of R
example
n = 2; m = 5; R = 011.110012
011.110012 = 1 · 21 + 1 · 20 + 1 · 2−1 + 1 · 2−2 + 1 · 2−5 = 2 + 1 + 0.5 + 0.25 + 0.03125 = 3.7812510

41 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Q Number Format

specifies the parameters of a binary fixed-point number


n number of bits before binary point
m number of bits after binary point
Qn.m signed fixed-point number stored using two’s complement
UQn.m unsigned fixed-point number
Arm Q-format documentation
[Link]
example
011.110012 = 3.7812510 uses the Q number format UQ3.5 or Q3.5
11.110012 = 3.7812510 uses the Q number format UQ3.5

42 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Q Number Format

unsigned fixed-point number signed fixed-point number using two’s complement


example: 4-bit in UQ2.2 format example: 4-bit in Q2.2 format

010101000011 010101000011
0110 1.25 1.00 0.75 0010 0110 1.25 1.00 0.75 0010
1.50 0.50 1.50 0.50
0111 0001 0111 0001
1.75 0.25 1.75 0.25
4 bit 4 bit
1000 2.00 0.00 0000 1000 -2.00 0.00 0000
UQ2.2 Q2.2
2.25 3.75 -1.75 -0.25
1001 1111 1001 1111
2.50 3.50 -1.50 -0.50

1010 2.75 3.00 3.25 1110 1010 -1.25 -1.00 -0.75


1110
101111001101 101111001101

fixed-point representation, the point is at a fixed position


+ hardware units for integer numbers can be used
− only fixed and limited value range

43 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Horner’s Scheme

number conversion using the Horner’s scheme


convert integer and fractional parts separately Rf = ((1/2 + 0)/2 + 1)/2
fractional part
= (0.5/2 + 1)/2
p(b) = (· · · ((a−m /b + a−m+1 ) /b + · · · ) /b + a−1 ) /b
= 1.25/2
integer part
= 0.625
p(b) = (· · · (an · b + an−1 ) · b + · · · ) · b + a0
Ri = (1 · 2 + 0) · 2 + 1
example: 101.1012 → 5.62510
=2·2+1
polynomial
p.101 (b) = 1 · b−3 + 0 · b−2 + 1 · b−1 =5

p101. (b) = 1 · b2 + 0 · b1 + 1 · b0 R = Ri .Rf = 5.625

Horner’s scheme
p.101 (b) = ((1/b + 0)/b + 1)/b
p101. (b) = (1 · b + 0) · b + 1

44 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Division with Remainder Method

convert integer and fractional parts separately


fractional part: successive multiplication with remainder
integer part: successive division with remainder
example: 12.687510 → 1100.10112
integer part → successive division fractional part → successive multiplication

12/2 = 6 remainder 0 0.6875 · 2 = 1.375

6/2 = 3 remainder 0 0.375 · 2 = 0.75

3/2 = 1 remainder 1 0.75 · 2 = 1.5

1/2 = 0 remainder 1 0.5 · 2 = 1.0

45 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Division with Remainder Method

successive division / multiplication with remainder


convert 0.110 to binary
fractional part → successive multiplication

0.1 · 2 = 0.2

0.2 · 2 = 0.4

0.4 · 2 = 0.8

0.8 · 2 = 1.6

0.6 · 2 = 1.2

0.2 · 2 = 0.4

0.110 = 0.000112
a finite number of digits in the source system can mean an infinite number of digits in the target system
→ only an approximate representation in the target system is then possible!

46 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Fixed-Point Numbers
Q Number Format

Self Check

The binary representation of a fixed-point number is given as 1011001. Specify the decimal representation of this number if
the number uses the UQ3.4 format
the number uses the Q3.4 format

47 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Fixed-Point Numbers | Winter Term 2024/2025
Foundations of
Computer Science
Chapter 2.4 Computer Arithmetic
Floating-Point Numbers
Prof. Dr.-Ing. Richard Membarth

October 30th 2024


Floating-Point Numbers

integer and fixed-point numbers do not represent very small or very large number well
1 exaFLOP/s: 1 · 1018 floating-point operations per second (FLOP/s)
electron mass: 9.1093837015(28) × 10−31 kg
scientific notation: 6.0210 × 1023 11101.0112 × 2−1001
n = significand × baseexponent
1.2345 = 12345 × 10−4
base: 10, significand: 12345, exponent: −4
representation of floating-point numbers is defined by the IEEE-754 norm
(−1)s × (1 + f ) × 2e
s: sign, m: mantissa, e: exponent

float s e: 8 bit m: 23 bit

double s e: 11 bit m: 52 bit

49 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Representation

general form
±1.sssss2 × 2eeee or (−1)s × (1 + f ) × 2e
sign (s): uses sign-magnitude representation, 0 for positive, 1 for negative
mantissa (f): uses binary number representation for fractional part
1 + f is called significand
normalized if exactly “1” is in front of the point
1.1011011: normalized mantissa (1 ≤ 1 + f < 2)
otherwise denormalized
0.0011011: denormalized mantissa
111.1110111: denormalized mantissa
IEEE-754 standard requires normalized mantissa
⇒ the msb of the mantissa is always 1 and is *not* stored (hidden bit)
exponent (e): uses biased binary number representation for exponent

50 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Exponent

floating-point numbers ≥ 1 have a positive exponent


floating-point numbers < 1 have a negative exponent
negative exponents are not represented as two’s complement
instead, a fixed offset 2n−1 − 1 (bias) that corresponds to half of the value range is added
float: 8-bit exponent: 2−126 · · · 2127 → excess-127
double: 11-bit exponent 2−1022 · · · 21023 → excess-1023
float e e+127 double e e+1023
0.0 0 0.0 0
-126 1 -1022 1
... ... ... ...
-1 126 -1 1022
0 127 0 1023
1 128 1 1024
... ... ... ...
127 254 1023 2046
∞ 255 ∞ 2047
51 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Value Range, Overflow, and Underflow

float s e: 8 bit m: 23 bit

double s e: 11 bit m: 52 bit

smallest representable non-negative number is approximately 2.0 × 10−38 (float) and 2.0 × 10−308 (double)
largest representable number is approximately 2.0 × 1038 (float) and 2.0 × 10308 (double)
what if the number is outside of that range?

1 . 0 1 1 0 ×2127 1 . 0 0 1 1 ×2−124
+ 1 . 1 1 0 1 ×2127 − 1 . 0 0 0 1 ×2−124
1 1 1 carry carry

1 1 . 0 0 1 1 ×2127 0 . 0 0 1 0 ×2−124

→ 1 . 1 0 0 1 ×2128 overflow! → 1 . 0 0 0 0 ×2−127 underflow!

exponent can be −126 · · · 127 for float


52 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Conversion

we can convert a rational number to IEEE-754 as follows

1. convert integer and fractional parts to binary system


2. normalize mantissa by shifting, adapt exponent
shift m by one position to the left → decrement e by one
shift m by one position to the right → increment e by one
3. combine s, e + offset, and m to build the IEEE-754 representation

34.510 = 100010.12
100010.12 × 20 = 1.0001012 × 25
s: 0 m: 1.0001012 e: 510 + 12710 = 13210 = 100001002

fp32 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

53 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Conversion

fp32 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

we can convert a IEEE-754 number to a rational decimal number as follows


(−1)s × (1 + fraction) × 2exponent −bias

1|10000100|01010000000000000000000

n = (−1)1 × (20 + 2−2 + 2−4 ) × 2132−127 = −1 × (20 + 2−2 + 2−4 ) × 25

= −(25 + 23 + 21 ) = −(32 + 8 + 2) = −42

0|00011011|00100111100111011100110

n = (−1)0 × (20 + 2−3 + 2−6 + 2−7 + 2−8 + 2−9 + 2−12 + 2−13 + 2−14 + 2−16 + 2−17 + 2−18 + 2−21 + 2−22 ) × 227−127

= 1 × (20 + 2−3 + 2−6 + 2−7 + 2−8 + 2−9 + 2−12 + 2−13 + 2−14 + 2−16 + 2−17 + 2−18 + 2−21 + 2−22 ) × 2−100
= (2−100 + 2−97 + 2−94 + 2−93 + 2−92 + 2−91 + 2−88 + 2−87 + 2−86 + 2−84 + 2−83 + 2−82 + 2−79 + 2−78 )
= · · · = 9.1093837015 × 10−31
54 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Special Numbers

−2128 −2−126 2−126 2128


−∞ 0 +∞

the value range of floating-point numbers does not include 0, ±∞, and NaN
for those, special encodings are reserved
0: all bits for e and f set to 0
0x00000000 = 0
0x80000000 = 0
∞: all bits for e set to 1 and for f to 0, the sign bit indicates +∞ or −∞
0x7F 800000 = +∞
0xFF 800000 = −∞
NaN: all bits for e and f set to 1
0x7FFFFFFF = NaN
0xFFFFFFFF = NaN

55 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Value Range

7-bit floating-point number


1 bit sign, 3 bits exponent, 3 bits mantissa
numbers non-uniformly distributed between [−15; +15]
dynamic range, not uniformly spaced
difference between two consecutive representable numbers varies with exponent

−∞ 0 +∞

7-bit integer number


two’s complement
numbers uniformly distributed between [−63; +64]

−∞
−63 0 +∞
64

56 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Addition

adding two floating-point numbers with the same sign


1. adapt the smaller exponent to the larger one
shift the mantissa correspondingly (denormalizes the mantissa)
2. add the mantissas
3. normalize and round the result if required
−87.8f = 1010111.11002 × 20 = 1.01011111002 × 26
−43.5f = 101011.12 × 20 = 1.0101112 × 25 = 0.10101112 × 26
1 . 0 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 ×26
+ 0 . 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ×26
1 1 1 1 1 1 1 1 carry

1 0 . 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 ×26
1 . 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 ×27 normalize
1 . 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 ×27 round

fp32 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1
57 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Addition

adding two floating-point numbers with different sign

1. adapt the smaller exponent to the larger one


2. the result has the sign of the operand with the larger absolute value
3. subtraction of the number with the smaller absolute value from the number with the larger absolute value: the two’s
complement of the number with the smaller absolute value will be added the number with the larger absolute value
4. normalize and round the result if required

5.610 − 5.562510 = 101.10012 × 20 − 101.10012 × 20 = 1.0110012 × 22 − 1.0110012 × 22

1 . 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 ×22 1 . 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 ×22
+ 0 . 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ×22
− 1 . 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ×22 1 1 1 1 1 1 1 0 carry

0 . 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 ×22
1 . 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 ×2−5 normalize
1 . 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 ×2−5 round

fp32 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0
58 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Rounding

1.1000110 × 26 + 1.0111010 ×22


0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 GR S
=1.1000110 × 26 + 0.00010111010×26
the hardware uses an extended mantissa
G R S
guard-bit
1 . 1 0 0 0 1 1 0 0 0 0 ×26
round-bit
+ 0 . 0 0 0 1 0 1 1 1 0 1 ×26
sticky-bit 1 1 1 1 carry

rounding considers the extended mantissa 1 . 1 0 1 0 0 0 1 1 0 1 ×26


round towards +∞
round to nearest
round towards −∞ input tie round result
round towards 0 m|000 same m
m|001 down m
round to nearest
m|... down m
m|100 ..0|100 down m
m|100 ..1|100 up m+1
m|... up m+1
m|111 up m+1

59 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Multiplication, Division

multiplication
multiply mantissas
add exponents
division
divide mantissas
subtract exponents

60 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Recap: Boolean Algebra

commutativity
x ∧y =y ∧x
x ∨y =y ∨x
associativity
(x ∧ y ) ∧ z = x ∧ (y ∧ z )
(x ∨ y ) ∨ z = x ∨ (y ∨ z )
distributivity
x ∧ (y ∨ z ) = x ∧ y ∨ x ∧ z
x ∨ (y ∧ z ) = x ∨ y ∧ x ∨ z
De Morgan’s law
x ∧y =x ∨y
x ∨y =x ∧y

61 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Associativity

x = −1.100000 × 2100
y = +1.100000 × 2100
z = +1.000000
what is x + (y + z) ?
y + z : 1.1000000 × 2100 + 0.0000000 × 2100 = 1.1000000 × 2100 → x + (y + z ) = 0.0
what is (x + y) + z ?
x + y = 0.0 → (x + y ) + z = 1.0
x + (y + z ) ̸= (x + y ) + z → floating-point addition is *not* associative!
floating-point addition is *not* distributive: x · (y + z ) ̸= (x · y ) + (x · z )
floating-point addition is commutative: x + y = y + x and x × y = y × x
what does that mean when we add x1 + x2 + · · · + xn in parallel?

62 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Endianness

IEEE-754 defines the representation of 16-bit, 32-bit, and 64-bit floating-point numbers, but not the order in which the
4 or 8 bytes of a number are stored
the endianness of a processor defines the order it stores the bytes of a word (byte sex)
little-endian: the least significant bit is stored at the lowest memory address
big-endian: the most significant bit is stored at the lowest memory address
some processors support big- and little-endian (e.g., ARM, MIPS)
memory addresses

0x12345678 78 56 34 12 little-endian (e.g., Intel, RISC-V)

0x12345678 12 34 56 78 big-endian (e.g., Motorola, Sun)

0x12345678 34 12 78 56 mixed-endian (e.g., National Semiconductor)

fp32 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0
3 D 1 9 9 9 8 016
63 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
Patriot Failure at Dhahran

US Patriot system failed to intercept an incoming Iraqi Scud


missile at an army base in Saudi Arabia, killing 28 soldiers (Feb.
25th, 1991)
the time was stored as 24-bit floating-point
the time was increased every 100ms by 1/10s
1/10 can’t be exactly represented as binary number
error accumulated to 0.34s after 100h standby
as a consequence, the target position was off by ~570m
a software update was available to address the issue
1 void main() {
[Link] 2 float sec = 0.0, sec_inc = 0.1;
3 for (int i = 0; i < 10000; i++)
4 sec = sec + sec_inc;
5 printf("time: %f\n", sec);
6 }

64 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
IEEE-754 Floating-Point Formats

IEEE-754 floating-point formats


64-bit double-precision floating-point (double, fp64)
32-bit single-precision floating-point (float, fp32)
16-bit half-precision floating-point (half, fp16)

double s e: 11 bit m: 52 bit

float s e: 8 bit m: 23 bit

half s e: 5 bit m: 10 bit

65 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
New Floating-Point Formats

new floating-point representations for Deep Learning


bf16 (brain floating-point, bfloat16): Google TPU, Intel AVX-512, NVIDIA Tensor Cores, AMD Matrix Cores, . . .
tf32 (tensor float, tfloat32): NVIDIA Tensor Cores (Ampere), AMD Matrix Cores (MI300)
fp8 (E5M2, E4M3): NVIDIA Tensor Cores (Hopper), AMD Matrix Cores (MI300)
fp4 (E1M2, E2M1): NVIDIA Tensor Cores (Blackwell), AMD Matrix Cores (MI355)

fp4 s e:1 m:2


fp4 s e:2 m:1
fp8 s e: 4 bit m: 3 bit
fp8 s e: 5 bit m: 2 bit
fp16 s e: 5 bit m: 10 bit
fp32 s e: 8 bit m: 23 bit
bf16 s e: 8 bit m: 7 bit
tf32 s e: 8 bit m: 10 bit
66 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025
Floating-Point Numbers
FP4 Values

Self Check

We consider the values that can be represented using FP4 (E2M1) with 1 bit for the sign, 2 bits for the exponent, and 1 bit for the
mantissa. List the values that can be represented if
encodings are reserved for 0, ∞, and NaN
encodings are only reserved for 0

67 THI | Prof. Dr.-Ing. Richard Membarth | Foundations of Computer Science | Computer Arithmetic: Floating-Point Numbers | Winter Term 2024/2025

You might also like