Introduction to Computing
(CS 1109/1110)
http://jatinga.iitg.ac.in/~asahu/cs1109/
Floating point
A. Sahu
Dept of Comp. Sc. & Engg.
Indian Institute of Technology Guwahati
Outline
• Floating Point
• Floating Point Density
• Operations
• Type Casting
Floating Point Numbers
• Need for floating point numbers
• Number representation : IEEE 754
• Floating point range
• Floating point density
–Accuracy
• Arithmetic and logical operations on FP
• Conversions and type casting in C
Need to go beyond integers
• integer 7
• rational 5/8
• real √3
• complex 2-3i
(Each set contains the previous one: integer ⊂ rational ⊂ real ⊂ complex)
Extremely large and small values:
distance pluto - sun = 5.9 x 10^12 m
mass of electron = 9.1 x 10^-28 g
Representing fractions
• Integer pairs (for rational numbers)
  5 8 = 5/8
• Strings with explicit decimal point
  - 2 4 7 . 0 9
• Implicit point at a fixed position
  010011010110001011
• Floating point
  fraction x base^power
Numbers with binary point
101.11 = 1x2^2 + 0x2^1 + 1x2^0 + 1x2^-1 + 1x2^-2
= 4 + 0 + 1 + 0.5 + 0.25 = 5.75 (base 10)
0.6 = 0.10011001100110011001.....
.6 x 2 = 1 + .2
.2 x 2 = 0 + .4
.4 x 2 = 0 + .8
.8 x 2 = 1 + .6
Numeric Data Type
• char, short, int, long int
– char : 8 bit number (1 byte = 1B)
– short: 16 bit number (2B)
– int : 32 bit number (4B)
– long int : 64 bit number (8B)
• float, double, long double
– float : 32 bit number (4B)
– double : 64 bit number (8B)
– long double : 128 bit number (16B)
• Sizes shown are typical for a 64-bit Linux system; the C standard fixes only minimum sizes
Numeric Data Type
[Figure: bit layouts of unsigned char, char, unsigned short, short, unsigned int and int]
Number in integer and log scale
• unsigned char C;
– 8 bit, 256 numbers
– 0, 1, 2, 3, 4, …, 255 //Unit spaced
• int A;
– 32 bit numbers
– Non-negative side: 0, 1, 2, …, 2^31-1
– Negative side: -1, -2, -3, …, -2^31 //Unit spaced
• float f;
– 32 bit
– Log scale
Number in integer and log scale
• Log Scale
– 10^-5 10^-4 10^-3 10^-2 10^-1 10^0 10^1 10^2 10^3 10^4 10^5 10^6
• Number of integers :
– Between 10^0 and 10^1 = 9
– Between 10^1 and 10^2 = 90
– Between 10^2 and 10^3 = 900
– Between 10^3 and 10^4 = 9000
– Between 10^4 and 10^5 = 90000
– Between 10^5 and 10^6 = 900000
Example Log Scale Numbers
• Example scientific format is d.dd x 10^n
• For a specific exponent value : only 999 numbers
Range (scientific)         N     Range (decimal)
0.01x10^0 - 9.99x10^0      999   0.01 to 9.99
0.01x10^1 - 9.99x10^1      999   0.1 to 99.9
0.01x10^2 - 9.99x10^2      999   1 to 999
0.01x10^3 - 9.99x10^3      999   10 to 9990
0.01x10^4 - 9.99x10^4      999   100 to 99900
0.01x10^5 - 9.99x10^5      999   1000 to 999000
0.01x10^6 - 9.99x10^6      999   10000 to 9990000
Consecutive two numbers when the power is 5
• Ex1 Difference : 1000
– 0.01x10^5 = 1000
– 0.02x10^5 = 2000
• Ex2 Difference : 1000
– 5.55x10^5 = 555000
– 5.54x10^5 = 554000
• Ex3 Difference : 1000
– 9.99x10^5 = 999000
– 9.98x10^5 = 998000
• A change of one in the last digit is the minimum difference, here 1000
C Float: with precision printing
#include <stdio.h>
int main(void) {
    float A = 5.84e5;
    A = A + 20;
    printf("A=%e \n", A);   // A=5.840200e+05
    printf("A=%.2e \n", A); // A=5.84e+05 (two digits after the point)
    return 0;
}
Numeric Data Type
• char, short, int, long int
– Each has a signed and an unsigned version
– char (8 bit)
• char : -128 to 127 (two's complement, so there is a single zero, not +0 and -0)
• unsigned char: 0 to 255
– int : -2^31 to 2^31-1
– unsigned int : 0 to 2^32-1
• float, double, long double
– For fractional, real number data
– All these types are signed and are stored in a different format
Numeric Data Type
[Figure: bit layout. float: sign bit, exponent, mantissa. double: sign bit, exponent, mantissa in two parts (Mantissa-1, Mantissa-2).]
FP numbers with base = 10
(-1)^S x F x 10^E
S = Sign
F = Fraction (fixed point number)
usually called Mantissa or Significand
E = Exponent (positive or negative integer)
Examples: 5.9x10^12, -2.6x10^3, 9.1x10^-28
Only one non-zero digit to the left of the point
FP with two signs: how to handle
• Two signs: one for the number, the other for the exponent
± d.dd x 10^(± dd)
• Remove confusion:
– Only one sign, for the number
– Sign of the exponent is managed by biasing
– An 8-bit field has 256 codes; exponents in [-127 to 127] are stored as non-negative codes
– With bias 127, stored value = exponent + 127:
[0, 1, 2, 3, 4, …, 127] is stored as [127, 128, …, 254]
[-1, -2, …, -127] is stored as [126, 125, …, 0]
IEEE 754 standard
Single precision numbers (32 bits)
S (1 bit)   E (8 bits)      F (23 bits)
0           1011 0101       1101 0110 1011 0001 0110 110
Double precision numbers (64 bits)
S (1 bit)   E (11 bits)     F (20+32 bits)
0           1011 0101 111   1101 0110 1011 0001 0110
                            1011 0001 0110 1100 1011 0101 1101 0110
Representing F in IEEE 754
Single precision numbers
1.1101 0110 1011 0001 0110 110 (F: 23 bits after the point)
Double precision numbers
1.1101 0110 1011 0001 0110 1011 0001 0110 1100 1011 0101 1101 0110 (F: 20+32 bits after the point)
Only one non-zero digit is to the left of the point; in binary it is always 1, so this bit need not be stored.
Value Range for F
Single precision numbers
1 ≤ F ≤ 2 - 2^-23, or 1 ≤ F < 2
Double precision numbers
1 ≤ F ≤ 2 - 2^-52, or 1 ≤ F < 2
These are “normalized”.
Representing E in IEEE 754
Single precision numbers
E (8 bits): 10110101, bias 127
Double precision numbers
E (11 bits): 10110101110, bias 1023
FP: how to store -0.75 in FP
Given a numeric value, how does FP store it as bits in memory?
• V = -0.75 = -(0.11) in base 2
• Scientific : -1.1 x 2^-1
• With bias : E' = -1+127 = 126, S = 1 for negative
• Mantissa: drop the implicit leading 1, so 1.1 => .1
Single precision numbers
S (1 bit)   E' (8 bits)   F (23 bits)
1           0111 1110     1000 0000 0000 0000 0000 000
FP : What value is stored?
Given the bits stored in memory in FP format, what is the numeric value?
• E = E' - 127, V = (-1)^S x 1.M x 2^(E'-127)
• Here E' = 0010 1000 = 40, so V = 1.1101… x 2^(40-127) = 1.1101… x 2^-87
Single precision numbers
S (1 bit)   E' (8 bits)   F (23 bits)
0           0010 1000     1101 0110 1011 0001 0110 110
Value Range for E
Single precision numbers
-126 ≤ E ≤ 127
(all 0’s and all 1’s have special meanings)
Double precision numbers
-1022 ≤ E ≤ 1023
(all 0’s and all 1’s have special meanings)
Floating point demo applet on the web
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• Google “Float applet” to get the above link
Overflow and underflow
Largest positive/negative number (SP) =
±(2 - 2^-23) x 2^127 ≅ ±3.4 x 10^38
Smallest normalized positive/negative number (SP) =
±1 x 2^-126 ≅ ±1.2 x 10^-38
Largest positive/negative number (DP) =
±(2 - 2^-52) x 2^1023 ≅ ±1.8 x 10^308
Smallest normalized positive/negative number (DP) =
±1 x 2^-1022 ≅ ±2.2 x 10^-308
Density of int vs float
int : 32 bit, a single field
float : 32 bit, split into sign, exponent and mantissa
• Number of values that can be represented
– In both cases (float, int) : at most 2^32 bit patterns
• Range
– int : -2^31 to 2^31-1
– float : largest ±(2 - 2^-23) x 2^127, smallest normalized ±1 x 2^-126
• About 50% of float values are small (magnitude less than 1)
Density of Floating Points
• 256 persons in a room of capacity 256 (range)
– 8 bit integer : density 256/256 = 1
• 256 persons in a room of capacity 200000 (range)
– 1st row should be filled with 128 persons
– 50% of the numbers have a negative power, so -1 < N < +1
• Density of floating point numbers is
– Dense towards 0
– Sparse towards ±∞
[Number line: -∞ … -2 -1 0 +1 +2 … +∞]
Expressible Numbers(int and float)
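Expressible integers
– Range: -2^31 to 2^31-1
– Negative overflow below -2^31, positive overflow above 2^31-1
Expressible floats (normalized, single precision)
– Negative range: -(1-2^-24)x2^128 to -1x2^-126
– Positive range: 1x2^-126 to (1-2^-24)x2^128
– Negative underflow between -1x2^-126 and 0, positive underflow between 0 and 1x2^-126
– Negative overflow below -(1-2^-24)x2^128, positive overflow above (1-2^-24)x2^128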
Distribution of Values
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
• Notice how the distribution gets denser toward zero.
[Figure: number line from -15 to 15 showing denormalized, normalized and infinity values]
Distribution of Values
(close-up view)
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
[Figure: close-up number line from -1 to 1 showing denormalized, normalized and infinity values]
Density of 32 bit float SP
• Fraction/mantissa is 23 bits
• Number of different values that can be stored for a particular value of the exponent
– For exp=1: 2^23 = 8x1024x1024 ≈ 8x10^6
– Between 1 and 2 we can store 8x10^6 numbers
• Similarly
– for exp=2, between 2 and 4, 8x10^6 numbers can be stored
– for exp=3, between 4 and 8, 8x10^6 numbers can be stored
– for exp=4, between 8 and 16, 8x10^6 numbers can be stored
Density of 32 bit float SP
• Similarly
– for exp=23, between 2^22 and 2^23, 8x10^6 numbers can be stored
– for exp=24, between 2^23 and 2^24, 8x10^6 numbers can be stored: OK, every integer in this interval is still representable
– for exp=25, between 2^24 and 2^25, 8x10^6 numbers can be stored
• The interval holds 2^25 - 2^24 = 2^24 ≈ 16x10^6 integers, more than 8x10^6: BAD, some integers are skipped
– …
– for exp=127, between 2^126 and 2^127, 8x10^6 numbers can be stored: WORST
Density of 32 bit float SP
• 2^23 = 8x1024x1024 ≈ 8x10^6
[Number line: 0 1 2 4 8 16; each interval from one power of two to the next holds 2^23 floats]
Thanks