MODULE I
Entropy, Sources and
Source Coding
INFORMATION
Information: any message that brings a specification, i.e., resolves a certain degree of uncertainty.
Two correspondents are involved: one generating the information (the source S) and another receiving it (the destination D, or the user U). The information can be transmitted over a distance or stored (memorized) for later reading.
The physical medium (including the associated equipment) used for remote transmission of the information from S to D is called the transmission channel C.
In storage systems the channel is replaced by the storage medium, e.g. DVDs, printouts, etc.
INFORMATION
Information source produces an output
that is random in nature.
If the source output had no
randomness (i.e., the output
were known exactly), why
should we transmit it at all?
DISCRETE SOURCE
A discrete information source generates ‘m’
distinct symbols (messages).
A discrete source is called memoryless (a discrete
memoryless source, DMS) if the emission of a
symbol does not depend on the previously
transmitted symbols.
AMOUNT OF INFORMATION
“The Sun rises in the east.”
Known to all; a certain event. No new information.
“I will get at least one mail today.”
More likely to happen, but still some amount of uncertainty.
“It will snow in Munnar next January.”
No idea; highly uncertain.
“It will snow in Ernakulam tomorrow.”
Will not happen; certain not to occur, so no new information.
SO INFERENCE
A high-probability event carries less information than a
low-probability event.
A less probable event means a higher degree of uncertainty
and therefore a greater amount of information (and vice versa).
So information is quantified based on probability.
INFORMATION
Consider a discrete random variable X with
possible outcomes xi, i = 1, 2, …, n. The self-information
of the event X = xi is
I(xi) = log2(1 / p(xi)) = −log2 p(xi)
Units: bits (log base 2), nats (base e), Hartleys (base 10).
WHY LOG OF PROBABILITY?
Self-information is a monotonically decreasing function of
probability.
If a particular event has a 100% probability of
occurring, then its self-information is −log(1) = 0:
its occurrence is "perfectly non-surprising" and
yields no information.
If a particular event has a 0% probability of
occurring, then its self-information is −log(0) = ∞:
its occurrence is "infinitely surprising."
TO CALCULATE LOG2 FROM ANOTHER LOG BASE
log2(x) = log10(x) / log10(2) = ln(x) / ln(2)
ENTROPY - AVERAGE INFORMATION
H(X) = Σi p(xi) log2(1 / p(xi)) = −Σi p(xi) log2 p(xi)
where X represents the alphabet of possible output
letters from the source, and H(X) represents the average
information per source letter.
Unit: bits, nats, or Hartleys, based on the log base.
PROBLEM 1
Let us define a random variable X as follows:
X = red with probability 1/2,
    green with probability 1/4,
    blue with probability 1/8,
    yellow with probability 1/8.
Find the entropy of X.
PROBLEM 2
Suppose we have a horse race with eight horses
taking part. Assume that the probabilities of
winning for the eight horses are (1/2, 1/4, 1/8,
1/16, 1/64, 1/64, 1/64, 1/64). Calculate the entropy
of the horse race.
Ans = 2 bits
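As a quick check, here is a minimal Python sketch (not part of the slides) that applies H(X) = −Σ pi log2 pi to Problems 1 and 2:

```python
# Minimal sketch: entropy of a discrete distribution, used to check Problems 1 and 2.
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Problem 1: P(X) = (1/2, 1/4, 1/8, 1/8)
print(entropy([1/2, 1/4, 1/8, 1/8]))              # 1.75 bits

# Problem 2: eight horses with the given winning probabilities
print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64]*4))  # 2.0 bits
```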
PROPERTIES OF ENTROPY
H(X) ≥0
The entropy is a continuous function with respect to each
variable pi because it is a sum of continuous functions.
Additivity: self-information is additive, and so is H(X),
which represents the mean value of the self-information.
Entropy is maximum when the symbols have equal
probabilities; this maximum value is also called the
decision quantity D(X): Hmax(X) = D(X) = log2 m.
Symmetry: H(X) is unchanged if the events xi are
reordered.
The maximum value of the entropy is obtained for
equiprobable messages.
Change of base: Hb(X) = logb(a) · Ha(X)
MINIMUM AVERAGE NUMBER OF BITS FOR
MESSAGES WITH EQUAL PROBABILITY
Consider a discrete binary source that emits 0 or 1 in
a sequence of statistically independent symbols.
The output is "0" with probability p, or "1" with
probability 1 − p.
The entropy of this binary source is
H(X) = −Σi pi log2 pi = −p log2(p) − (1 − p) log2(1 − p) = H(p) (by definition, the binary entropy function)
MESSAGES WITH EQUAL PROBABILITY FOR A
BINARY SOURCE
If p = 0.5 in the above case, then 1 − p = 0.5 and
H(X) = 1 bit.
The entropy H(p) is a concave function of p.
When p = 0 or p = 1 there is no uncertainty; the output is not really a random variable.
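A small sketch (assumed, not from the slides) of the binary entropy function H(p), showing that it peaks at p = 0.5:

```python
# Binary entropy function H(p) = -p*log2(p) - (1-p)*log2(1-p).
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):        # no uncertainty at the endpoints
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.25, 0.5, 0.9):
    print(p, round(binary_entropy(p), 4))
# H(p) is maximum at p = 0.5, where H(0.5) = 1 bit.
```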
RATE OF INFORMATION
The information rate (rate of information) R is
expressed in average bits of information per
second.
Information rate: R = r · H(X)
H(X): average information per symbol (bits/symbol)
r: rate at which symbols are generated (symbols/second)
PROBLEM 3
A discrete source emits 5 symbols, one every 1 ms,
with the probabilities 1/2, 1/4, 1/8, 1/16, 1/16.
Determine the source entropy and the information
rate.
Ans: H(X) = 1.875 bits/symbol, R = 1875 bits/s
PROBLEM 4
An analog signal band-limited to 10 kHz is
quantized into 8 levels in a PCM system, with
probabilities 1/4, 1/5, 1/5, 1/10, 1/10, 1/20, 1/20,
1/20 respectively. Find the entropy and the rate
of information.
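A hedged sketch verifying Problems 3 and 4 with R = r · H(X); for Problem 4 it assumes the signal is sampled at the Nyquist rate of 20,000 samples/s, which the slide does not state explicitly:

```python
# Sketch checking Problems 3 and 4 using R = r * H(X).
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Problem 3: one symbol every 1 ms -> r = 1000 symbols/s
H3 = entropy([1/2, 1/4, 1/8, 1/16, 1/16])
print(H3, 1000 * H3)                     # 1.875 bits/symbol, 1875 bits/s

# Problem 4: 10 kHz signal, Nyquist sampling assumed -> r = 20000 samples/s
H4 = entropy([1/4, 1/5, 1/5, 1/10, 1/10, 1/20, 1/20, 1/20])
print(round(H4, 3), round(20000 * H4))   # ~2.741 bits/sample, ~54.8 kbit/s
```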
REMEMBER
More uncertainty means more information, i.e., higher entropy.
SOURCE ENTROPY
The minimum average number of binary digits
needed to specify a source output (message) uniquely
is called “SOURCE ENTROPY”
JOINT ENTROPY
Joint entropy measures how much entropy is contained in a
joint system of two random variables.
If the random variables are X and Y, the joint
entropy is written H(X, Y).
It is measured in bits, nats, or Hartleys depending on
the base of the logarithm.
JOINT ENTROPY
The joint entropy of a pair of discrete random
variables (X, Y) with a joint distribution
p(x, y) is given by
H(X, Y) = −Σx Σy p(x, y) log2 p(x, y)
PROPERTIES OF JOINT ENTROPY
Nonnegative, i.e., H(X,Y)≥0
Greater than or equal to each individual entropy:
H(X, Y) ≥ max{H(X), H(Y)}
Less than or equal to the sum of individual
entropies
H(X,Y) ≤ H(X)+ H(Y)
REMEMBER CONDITIONAL PROBABILITY
probability of one event occurring with some
relationship to one or more other events.
Example:
Event A is that it is raining outside, and it has a
0.3 (30%) chance of raining today.
Event B is that you will need to go outside, and
that has a probability of 0.5 (50%).
A conditional probability looks at these two
events in relationship with one another, e.g.
the probability that you will need to go outside
given that it is raining, P(B|A).
CONDITIONAL ENTROPY
Conditional entropy is the amount of information
in one random variable when we already know
the other.
The conditional entropy quantifies the amount
of information needed to describe the outcome of
a random variable Y given that the other random
variable X is known.
It is represented as H(Y|X): how much entropy the
random variable Y has remaining if we have already
learned the value of a second random variable X.
CONDITIONAL ENTROPY
The average conditional self-information is called the
conditional entropy.
If X and Y are discrete random variables and p(x, y) and
p(x|y) are the values of their joint and conditional
probability distributions, then
H(X|Y) = −Σx Σy p(x, y) log2 p(x|y)
H(X|Y) is the information (or uncertainty) remaining in
X after Y is observed: the entropy of X conditional on Y.
OBSERVATIONS
(i) I(X; Y) ≥ 0, hence H(X) ≥ H(X|Y).
(ii) I(X; Y) = 0, i.e. H(X) = H(X|Y), holds if and
only if X and Y are statistically independent.
(iii) Since H(X) ≥ H(X|Y), the observation of Y does
not increase the entropy (uncertainty) about X.
FOR A CHANNEL
The amount of uncertainty remaining about the
channel input after observing the channel output is
the conditional entropy H(X|Y), also called the
equivocation of the channel.
RELATIONSHIP BETWEEN CONDITIONAL
ENTROPY AND JOINT ENTROPY-CHAIN RULE
H(X, Y) = H(X ) +H(Y|X) = H(Y ) + H(X|Y)
MUTUAL INFORMATION
The mutual information (MI) of two random
variables is a measure of the mutual dependence
between the two variables.
Units: bits, nats, or Hartleys (based on the log base).
It measures how much one random variable tells us
about another.
It can be thought of as the reduction in uncertainty
about one random variable given knowledge of
another.
High mutual information: large reduction in uncertainty.
Low mutual information: small reduction.
Zero mutual information between two random
variables means the variables are independent.
MUTUAL INFORMATION AND ENTROPY
RELATIONSHIP BETWEEN H AND I(X;Y)
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
PROPERTIES OF MUTUAL INFORMATION
Mutual information of a channel is symmetric:
I(X; Y) = I(Y; X)
Mutual information is non-negative:
I(X; Y) ≥ 0
Mutual information can be expressed in terms of the
entropy of the channel output: I(X; Y) = H(Y) − H(Y|X).
Mutual information is additive for independent
variables:
I(X, W; Y, Z) = I(X; Y) + I(W; Z)
If X and Y are independent, I(X; Y) = 0.
DISCRETE MEMORYLESS SOURCE
A source that emits data at successive intervals,
independently of previously emitted values.
The source is discrete because it is considered not
over a continuous time interval, but at discrete time
instants.
The source is memoryless because each emission is
fresh, without dependence on the previous values.
DISCRETE MEMORYLESS CHANNEL
A statistical model with input X and output Y.
The output is a noise-corrupted version of the input.
X and Y are random variables.
DISCRETE MEMORYLESS CHANNEL
CHANNEL MATRIX
A channel is completely specified by the complete
set of transition probabilities.
From the figure in the previous slide, the channel is
often specified by the matrix of transition
probabilities P(Y|X), whose entry in row i and column j
is p(yj | xi).
CHANNEL MATRIX
The matrix [P(Y|X)] is called the channel
matrix. Since each input to the channel results in
some output, each row of the channel matrix
must sum to unity.
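As an illustration (not taken from the slides), the following sketch builds the channel matrix of a binary symmetric channel with an assumed crossover probability p and checks that each row sums to unity:

```python
# Assumed example: transition matrix P(Y|X) of a binary symmetric channel.
import numpy as np

p = 0.1                              # assumed crossover probability
P = np.array([[1 - p, p],            # row for input X = 0
              [p, 1 - p]])           # row for input X = 1
print(P.sum(axis=1))                 # each row of the channel matrix sums to 1.0
```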
PROBLEM
Find the joint entropy, the marginal entropies H(X) and H(Y), the conditional
entropies H(X|Y) and H(Y|X), and the mutual information I(X;Y).
PROBLEM
Given the joint probabilities of two random variables X and Y as
Y/X X=X1 X=X2 X=X3
Y=Y1 0.2 0.1 0.1
Y=Y2 0.3 0.2 0.1
Find the joint entropy, conditional entropies,
marginal entropies and mutual information.
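One way to check the hand calculation is the sketch below (assumed, not from the slides); it computes all requested quantities from the joint table using the chain-rule relations:

```python
# Joint, marginal and conditional entropies and mutual information
# from the joint probability table above.
import numpy as np

P = np.array([[0.2, 0.1, 0.1],     # row Y = Y1, columns X = X1, X2, X3
              [0.3, 0.2, 0.1]])    # row Y = Y2

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_XY = H(P)                        # joint entropy, ~2.446 bits
H_X = H(P.sum(axis=0))             # marginal entropy of X, ~1.485 bits
H_Y = H(P.sum(axis=1))             # marginal entropy of Y, ~0.971 bits
H_X_given_Y = H_XY - H_Y           # chain rule: H(X|Y) = H(X,Y) - H(Y), ~1.475 bits
H_Y_given_X = H_XY - H_X           # H(Y|X) = H(X,Y) - H(X), ~0.961 bits
I_XY = H_X + H_Y - H_XY            # mutual information, ~0.010 bits
print(H_XY, H_X, H_Y, H_X_given_Y, H_Y_given_X, I_XY)
```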
PROBLEM (HOMEWORK)
SOURCE CODING
SOURCE CODING
Source coding is the concept behind data compression.
It is a mapping from (a sequence of) symbols from an
information source to a sequence of alphabet
symbols (usually bits).
The source symbols can be exactly recovered from
the binary bits (lossless source coding) or
recovered within some distortion (lossy source
coding).
The aim of source coding is to represent
information as accurately as possible using
as few bits as possible.
Redundancy in the source output needs to be removed.
SOURCE CODING
Variable-length codes are used to reduce the average
number of code symbols per message toward the
minimum set by the information content of the message.
Example of a source code: someone using HTML
code to create a screen.
In telegraphy we use Morse code, in which the
letters are denoted by marks and spaces. The letter E,
which is used most often, is denoted by “.”, whereas the
letter Q, which is rarely used, is denoted by “--.-”.
Source coding is used to remove redundancy.
SOURCE CODING
In a file, source coding gives data reduction.
In transmission, source coding reduces the number of transmitted bits.
Source coding increases the entropy per transmitted symbol.
It gives an effective representation of the data.
A DEMO
LOSSLESS COMPRESSION OF DATA
Assumptions made:
1. the channel is perfectly noiseless i.e. the
receiver sees exactly the encoder’s output
2. we always require the output of the decoder to
exactly match the original sequence X.
3. X is generated according to a fixed
probabilistic model, p(X)
So encoding is a mapping of each source symbol into
a finite sequence of code symbols, called a
codeword.
LENGTH OF A CODE
The average codeword length L of the source
encoder is
L = Σk pk · lk   (summed over the n source symbols, where lk is the length of the codeword assigned to symbol k)
Efficiency of a code: η = H(S) / L
Usually expressed in % (multiplied by 100).
Redundancy: R = 1 − η
PROBLEMS
1. A scanner converts a black and white document,
line-by-line into binary data for transmission.
The scanner produces source data comprising
symbols representing runs of up to six similar
image pixel elements with the probabilities as
shown below:
No. of consecutive pixels:   1     2     3     4     5     6
Probability of occurrence:   0.2   0.4   0.15  0.1   0.06  0.09
PROBLEM CONTINUED
Determine the average length of a run (in pixels)
and the corresponding effective information rate
for this source when the scanner is traversing
1000 pixel/s.
Answers
H(S) ≈ 2.29 bits/symbol
Average run length ≈ 2.69 pixels
At a 1000 pixel/s scan rate we generate 1000/2.69 ≈
372 symbols/s. Thus the source information rate is
2.29 × 372 ≈ 852 bit/s.
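A brief sketch (assumed, not from the slides) that reproduces these answers:

```python
# Scanner problem: entropy, average run length and information rate.
from math import log2

run_lengths = [1, 2, 3, 4, 5, 6]
probs = [0.2, 0.4, 0.15, 0.1, 0.06, 0.09]

H = -sum(p * log2(p) for p in probs)                      # ~2.29 bits/symbol
avg_run = sum(n * p for n, p in zip(run_lengths, probs))  # ~2.69 pixels/symbol
symbol_rate = 1000 / avg_run                              # ~372 symbols/s at 1000 pixel/s
print(H, avg_run, H * symbol_rate)                        # information rate ~852 bit/s
```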
SOURCE CODE
Can be of fixed length or variable length
Two fundamental requirements of a source code:
1. The code words produced by the encoder should
be in binary form.
2. The source code is uniquely decodable, so that
the original source sequence can be
reconstructed perfectly from the encoded binary
sequence.
A source code may also reduce fluctuations in the
information rate from the source and avoid symbol
'surges' which could overload the channel when the
message sequence contains many high-probability
symbols.
VARIABLE LENGTH SOURCE CODING
When the source symbols are not equally
probable, a more efficient encoding method is to
use variable-length code words.
Eg: Morse code
Probabilities may be used to assign the codes, i.e., a
more frequently occurring symbol uses fewer bits.
This is entropy coding.
PREFIX CODE
To be a prefix code, the entire set of possible
encoded values ("codewords") must not
contain any value that starts with
any other value in the set.
A prefix (prefix-free) code is also called an instantaneous code.
E.g. {1, 22, 333} is a prefix code.
But {1, 12, 33} is not a prefix code, because one of
the codewords (12) starts with another codeword
(1).
A prefix code is a code in which no codeword is a
prefix of another codeword; that is, no codeword
can be derived from another by appending more
bits to a shorter codeword.
INSTANTANEOUS CODE
Instantaneous (prefix-free) codes: a code is called
instantaneous if each symbol can be decoded as
soon as the corresponding codeword is
completed.
It is not necessary to see bits of later symbols in
order to decode the current symbol.
The code 0, 01, 011, 111 for the symbols s1, s2,
s3, s4, respectively is not instantaneous
(although it is uniquely decodable).
PREFIX CODE- BINARY TREE
REPRESENTATION
A prefix code is most easily represented by a binary tree
in which the external nodes (leaves) are labeled with the
single characters that are combined to form the
message.
The encoding for a character is determined by
following the path down from the root of the tree
to the external node that holds that character:
A ‘0’ bit identifies a left branch in the path, and a
‘1’ bit identifies a right branch.
PREFIX CODE- BINARY TREE
REPRESENTATION EXAMPLE
Character Encoding
a 0
b 111
c 1011
d 1010
r 110
! 100
Each character is encoded with a
different number of bits.
(In the accompanying tree figure, a square box denotes an
external node and a black circle an internal node.)
PREFIX CODE- DECODING
Start at the root of the tree.
Repeat until you reach an external (leaf) node:
  read one message bit;
  take the left branch in the tree if the bit is 0, the right branch if it is 1.
Print the character in that external node.
The whole process is repeated, starting over at
the root, until all of the bits in the compressed
message are exhausted.
EXAMPLE
Decode the following code word based on
given code tree
0111110010110101001111100100
Partitioned at codeword boundaries:
|0|111|110|0|1011|0|1010|0|111|110|0|100
a b r a c a d a b r a !
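A minimal decoder sketch (assumed, not from the slides) for this prefix code, using a lookup table instead of walking the tree; it works precisely because no codeword is a prefix of another:

```python
# Decode a bit string with the prefix code a=0, b=111, c=1011, d=1010, r=110, !=100.
code = {'0': 'a', '111': 'b', '1011': 'c', '1010': 'd', '110': 'r', '100': '!'}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in code:            # prefix property: the first match is the symbol
            out.append(code[buf])
            buf = ''
    return ''.join(out)

print(decode('0111110010110101001111100100'))   # abracadabra!
```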
PREFIX CODE- PROPERTY
To reduce the number of bits in a message, the
encoding scheme always uses short encodings for
frequently used characters and long encodings
for infrequent ones.
A second fundamental property of prefix codes is
that messages can be formed by simply stringing
together the code bits from left to right.
Fixed-length codes are prefix codes.
UNIQUELY DECODABLE CODE
A code is a prefix code (or prefix-free code) if it has
the prefix property, which requires that no
codeword is a proper prefix of any other
codeword.
A uniquely decodable code provides a one-to-one
mapping from source symbols to codewords.
TRY YOURSELF
The four symbols A, B, C, D are encoded using
the following sets of codewords. In
each case state whether the code is (i) uniquely
decodable, (ii) instantaneous.
a) {1, 01, 000, 001}
b) {0, 10, 000, 100}
c) {0, 01, 011, 0111}
d) {01, 01, 110, 100}
e) {01, 10, 0010, 0111}
ANSWERS
Note: instantaneous (prefix-free) implies uniquely decodable.
A) {1, 01, 000, 001} is an instantaneous code (and therefore
uniquely decodable).
B) No.
C) {0, 01, 011, 0111} is uniquely decodable but not an
instantaneous code.
D) {01, 01, 110, 100} – no (the codeword 01 is repeated, so the code is not even non-singular).
E) Not an instantaneous code, since 01 is a prefix of 0111.
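The sketch below (assumed) checks only the instantaneous (prefix-free) property plus non-singularity for each set; it does not run a full unique-decodability test such as Sardinas-Patterson:

```python
# Prefix-property check for the code sets above.
def is_prefix_free(codewords):
    # Reject duplicate codewords (singular code), then check the prefix property.
    if len(set(codewords)) != len(codewords):
        return False
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

sets = {'a': ['1', '01', '000', '001'],
        'b': ['0', '10', '000', '100'],
        'c': ['0', '01', '011', '0111'],
        'd': ['01', '01', '110', '100'],
        'e': ['01', '10', '0010', '0111']}
for name, cw in sets.items():
    print(name, is_prefix_free(cw))   # only set (a) is instantaneous
```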
VENN DIAGRAM REPRESENTATION
HUFFMAN CODING
Invented by David A. Huffman as part of a
class assignment.
Idea: assign variable-length codes to input
characters, where the lengths of the assigned codes are
based on the frequencies of the corresponding
characters.
It is a prefix code.
It is a greedy algorithm: an approach that solves a
problem by selecting the best option available at
each moment.
HUFFMAN CODING
The Huffman code is also a compact code and satisfies
the following properties:
1. It has the shortest mean length among binary
instantaneous codes.
2. Its tree is optimal.
3. Its code tree is compact.
STEPS IN HUFFMAN CODING
(i) Arrange the source symbols in decreasing
order of their probabilities.
(ii) Take the lowermost two symbols and tie them
together as shown in Fig.
STEPS IN HUFFMAN CODING
Add the probabilities of the two symbols and
write it on a combined node.
(iii) Treat this sum of probabilities as a new
probability associated with a new symbol. Again
pick the two smallest probabilities, tie them
together to form a new probability. Each time we
perform the combination of two symbols we
reduce the total number of symbols by one.
Whenever we tie together two probabilities, label the two branches, e.g. 0 for one branch and 1 for the other.
STEPS IN HUFFMAN CODING
(iv) Continue the procedure until only one
probability is left (and it should be 1).
This completes the construction of the Huffman
tree.
(v) To find the codeword for a symbol, trace the route
from the final node (probability 1) back to that symbol.
While tracing back the route, read out the labels
on the branches. This sequence of labels is the codeword for the
symbol.
EXAMPLE
Given the probabilities of the message symbols a1 to
a6 as 0.1, 0.4, 0.06, 0.1, 0.04, 0.3, find the binary
Huffman codes.
Step 1: Arrange the symbols in decreasing order of
probabilities.
STEP 1
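A compact Huffman sketch (an assumed implementation, not the slides' worked construction) for this example; tie-breaking may yield different codewords than the hand construction, but the mean length (2.2 bits) and efficiency are the same:

```python
# Huffman coding of a1..a6 with the given probabilities, using a min-heap.
import heapq
from math import log2

def huffman(probs):
    # Heap items: (probability, unique counter, {symbol: partial codeword}).
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}   # prepend branch labels
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {'a1': 0.1, 'a2': 0.4, 'a3': 0.06, 'a4': 0.1, 'a5': 0.04, 'a6': 0.3}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())      # mean length ~2.2 bits
H = -sum(p * log2(p) for p in probs.values())            # ~2.14 bits
print(code, L, round(H / L, 3))                          # efficiency ~0.974
```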
NON SINGULAR CODES
A code is non-singular if each source symbol is
mapped to a different non-empty bit string,
i.e. the mapping from source symbols to bit
strings is injective. (one to one mapping)
SHANNON’S SOURCE CODING THEOREM
Shannon’s first theorem or main theorem
Statement: Given a discrete memoryless source
of entropy H(S), the average codeword length L for any
distortionless (lossless) source-encoding scheme is
bounded as
L ≥ H(S)
It is the mathematical tool for assessing data compression.
SOURCE CODING THEOREM
Also called the noiseless coding theorem; it establishes
the condition for error-free coding to be possible.
SHANNON’S SOURCE CODING THEOREM
KRAFT INEQUALITY
The Kraft inequality limits the lengths of the codewords
in prefix codes.
Due to Kraft (1949).
If one takes the (negative) exponential of the length of
each valid codeword, D^(−li), the resulting set of values
must look like a probability mass function:
it must have total measure less than or equal to
one.
The inequality can tell us whether the lengths of a prefix
code can be shortened, but it cannot itself make
any change to the lengths.
KRAFT INEQUALITY
It relates to how to construct a symbol code that has
the good properties above with minimum
expected code length.
We shall assume that the encoding is not
necessarily binary, but a general D-ary code, i.e., it
uses the D symbols {0, 1, ..., D−1} instead of the 2 symbols
{0, 1}.
KRAFT (McMILLAN) INEQUALITY
SOME PROPERTIES
If Kraft's inequality holds with strict inequality,
the code has some redundancy.
If Kraft's inequality holds with equality, the code
in question is a complete code.
If Kraft's inequality does not hold, the code is
not uniquely decodable.
For every uniquely decodable code, there exists a
prefix code with the same length distribution
STATEMENT
Kraft's inequality states that, given a list of
positive integers (n1, n2, …, nr), there exists a prefix
code with a set of codewords (σ1, σ2, …, σr) where
the length of each codeword |σi| = ni, if and
only if
Σi D^(−ni) ≤ 1
with D = 2 for binary codes.
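A small sketch (assumed) evaluating the Kraft sum for a few hypothetical sets of codeword lengths:

```python
# Kraft sum for binary (D = 2) codes: a value <= 1 means a prefix code
# with those codeword lengths exists.
def kraft_sum(lengths, D=2):
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0    -> complete code (e.g. 0, 10, 110, 111)
print(kraft_sum([1, 2, 3, 4]))   # 0.9375 -> satisfied with slack (some redundancy)
print(kraft_sum([1, 1, 2]))      # 1.25   -> no uniquely decodable code exists
```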
PROOF
Refer notes.
PROOF (ALTERNATE METHOD)
The Kraft inequality can show that a given code
is not a prefix code (when the sum exceeds 1), but
satisfying the inequality does not by itself prove that a
given code is a prefix code.
VARIABLE LENGTH CODING
Consider an example which can be coded in three
forms
VARIABLE LENGTH OF CODE
BOUNDS ON CODE LENGTH
There is an overhead of at most 1 bit per symbol, due to the fact
that log(1/pi) is not always an integer.
We can reduce the overhead per symbol by spreading it out
over many symbols.
Let us consider a system in which we send a
sequence of n symbols from X.
The symbols are assumed to be drawn i.i.d.
according to p(x).
BOUNDS ON CODE LENGTH
By using large block lengths we can achieve an
expected codelength per symbol arbitrarily close to
the entropy.
We can use the same argument for a sequence of
symbols from a stochastic process that is not
necessarily i.i.d.
In this case, we still have the bound
H(X1, X2, …, Xn) ≤ E[l(X1, X2, …, Xn)] < H(X1, X2, …, Xn) + 1
so the expected codeword length per symbol is within 1/n of the per-symbol entropy.
SHANNON’S SOURCE CODING THEOREM
Shannon’s first theorem
(or Shannon’s noiseless coding theorem)
establishes the statistical limits to possible data
compression for data whose source is
an independent identically-distributed random
variable, and the operational meaning of
the Shannon entropy.
BASED ON BOUNDS -SHANNON’S SOURCE
CODING THEOREM
MODULE I
Module 1 – Entropy, Sources and Source
Coding
Entropy, Properties of Entropy, Joint and
Conditional Entropy, Mutual Information,
Properties of Mutual Information.
Discrete memoryless sources, Source code,
Average length of source code, Bounds on average
length, Uniquely decodable and prefix-free source
codes. Kraft Inequality (with proof), Huffman
code. Shannon’s source coding theorem (both
achievability and converse) and operational
meaning of entropy.