ch1. Information Coding Theory
1.1 Structure
any other part of the melody. In fact, pieces of music often have a complicated,
multi-layered structure, which is not obvious to the casual listener.
In this book, we are going to be concerned with things that have structure. The kinds
of structure we will be concerned with will be like the structure of pieces of music.
They will not be fixed and obvious.
We can extend a probability distribution $P$ from $S$ to the set of all subsets of $S$, which we denote by $\mathcal{P}(S)$, by setting $P(E) = \sum_{s \in E} P(s)$ for any $E \in \mathcal{P}(S)$. Note that $P(\emptyset) = 0$.
An event whose probability is $0$ is impossible and an event whose probability is $1$ is certain to occur.
If $E$ and $F$ are events and $E \cap F = \emptyset$, then $P(E \cup F) = P(E) + P(F)$.
DEFINITION 1.3 Expected Value If $S = \{s_1, s_2, \ldots, s_N\}$ is a sample space with probability distribution $P$, and $F$ is a function from the sample space to a vector space $V$, the expected value of $F$ is
$$E[F] = \sum_{i=1}^{N} P(s_i) F(s_i).$$
NOTE We will often have equations that involve summation over the elements of a finite set. In the equations above, the set has been $S = \{s_1, s_2, \ldots, s_N\}$ and the summation has been denoted by $\sum_{i=1}^{N}$. In other places in the text we will denote such summations simply by $\sum_{s \in S}$.
Defining the surprise as the negative logarithm of the probability not only gives us the
appropriate limiting values as the probability tends to $0$ or $1$, it also makes surprise
additive. If several independent events occur in succession, the total surprise they
generate is the sum of their individual surprises.
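To make the additivity concrete, here is a minimal Python sketch (the probabilities used are arbitrary illustrative values, not taken from the text):

```python
import math

def surprise(p: float) -> float:
    """Surprise, in bits, of an outcome that has probability p."""
    return -math.log2(p)

# Two independent events with (arbitrary) probabilities 0.5 and 0.25.
p, q = 0.5, 0.25

# The probability that both occur is the product p * q, so the surprise
# of the joint occurrence is the sum of the individual surprises.
print(surprise(p), surprise(q), surprise(p * q))  # 1.0, 2.0, 3.0
assert math.isclose(surprise(p * q), surprise(p) + surprise(q))
```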
DEFINITION 1.5 Entropy We can restrict the surprise to the sample space and consider it to be a function from the sample space to the real numbers. The expected value of the surprise,
$$H(P) = -\sum_{i=1}^{N} P(s_i) \log P(s_i),$$
is the entropy of the probability distribution.
The concept of entropy was introduced into thermodynamics in the nineteenth cen-
tury. It was considered to be a measure of the extent to which a system was disor-
dered. The tendency of systems to become more disordered over time is described by
the Second Law of Thermodynamics, which states that the entropy of a system can-
not spontaneously decrease. In the 1940’s, Shannon [6] introduced the concept into
communications theory and founded the subject of information theory. It was then
realised that entropy is a property of any stochastic system and the concept is now
used widely in many fields. Today, information theory (as described in books such
as [1], [2], [3]) is still principally concerned with communications systems, but there
are widespread applications in statistics, information processing and computing (see
[2], [4], [5]).
Let us consider some examples of probability distributions and see how the entropy is related to predictability. First, let us note the form of the function
$$f(p) = -p \log(p),$$
where $0 \le p \le 1$ and $\log$ denotes the logarithm to base 2. (The actual base does not matter, but we shall be using base 2 throughout the rest of this book, so we may as well start here.) The graph of this function is shown in Figure 1.1.
Note that $-p \log(p)$ approaches $0$ as $p$ tends to $0$ and also as $p$ tends to $1$. This means that outcomes that are almost certain to occur and outcomes that are unlikely to occur both contribute little to the entropy. Outcomes whose probability is neither close to $0$ nor close to $1$ make a comparatively large contribution to the entropy.
EXAMPLE 1.1
Let $S = \{s_1, s_2\}$ with $P(s_1) = 0.5$ and $P(s_2) = 0.5$. The entropy is
$$-0.5 \log(0.5) - 0.5 \log(0.5) = 1 \text{ bit}.$$
In this case, $s_1$ and $s_2$ are equally likely to occur and the situation is as unpredictable as it can be.
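The entropy of any finite distribution can be computed directly from the definition; the following Python sketch is our own illustration, and the first value reproduces Example 1.1:

```python
import math

def entropy(probs):
    """Entropy, in bits, of a probability distribution given as a sequence of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely outcomes (Example 1.1)
print(entropy([1.0, 0.0]))   # 0.0 bits: an entirely predictable situation
print(entropy([0.9, 0.1]))   # about 0.47 bits: a fairly predictable situation
```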
FIGURE 1.1
The graph of $-p \log(p)$.
EXAMPLE 1.2
Let $S = \{s_1, s_2\}$, with $P(s_1)$ much greater than $P(s_2)$. In this case, the situation is more predictable, with $s_1$ more than thirty times more likely to occur than $s_2$. The entropy is close to zero.
EXAMPLE 1.3
Let $S = \{s_1, s_2\}$, with $P(s_1) = 1$ and $P(s_2) = 0$. Using the convention that $0 \log(0) = 0$, the entropy is $0$. The situation is entirely predictable, as $s_1$ always occurs.
EXAMPLE 1.4
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) = 1/N$ for each $i$. The entropy is $\log(N)$ and the situation is as unpredictable as it can be.
EXAMPLE 1.5
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_1)$ close to $1$ and $P(s_i)$ very small for $i \ge 2$.
The entropy is low and the situation is fairly predictable, as $s_1$ will occur far more frequently than any other outcome.
EXAMPLE 1.6
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_1)$ and $P(s_2)$ close to $0.5$ and $P(s_i)$ very small for $i \ge 3$. The entropy is close to $1$ bit and the situation is about as predictable as in Example 1.1 above, with outcomes $s_1$ and $s_2$ equally likely to occur and the others very unlikely to occur.
PROPOSITION 1.1
If $H_e$ is the entropy of a probability distribution measured using natural logarithms, and $H_2$ is the entropy of the same probability distribution measured using logarithms to the base 2, then these quantities are related as follows. Since
$$\log_2(x) = \frac{\ln(x)}{\ln(2)} \qquad (1.2)$$
for any $x > 0$, we have
$$H_2 = -\sum_{i=1}^{N} P(s_i) \log_2 P(s_i) = -\frac{1}{\ln(2)} \sum_{i=1}^{N} P(s_i) \ln P(s_i). \qquad (1.3)$$
It follows that
$$H_2 = \frac{H_e}{\ln(2)}. \qquad (1.4)$$
THEOREM 1.1
If we have a sample space with $N$ elements, the maximum value of the entropy function is $\log(N)$.
PROOF We wish to maximise
$$H = -\sum_{i=1}^{N} p_i \ln(p_i) \qquad (1.5)$$
subject to the constraint
$$\sum_{i=1}^{N} p_i = 1. \qquad (1.6)$$
We introduce the Lagrange multiplier $\lambda$, and put
$$F = -\sum_{i=1}^{N} p_i \ln(p_i) + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right). \qquad (1.7)$$
The conditions for a maximum are
$$\frac{\partial F}{\partial p_i} = 0 \qquad (1.8)$$
for $i = 1, 2, \ldots, N$ and
$$\frac{\partial F}{\partial \lambda} = 0. \qquad (1.9)$$
Differentiating with respect to $p_i$ gives
$$\frac{\partial F}{\partial p_i} = -\ln(p_i) - 1 + \lambda = 0, \qquad (1.10)$$
so
$$p_i = e^{\lambda - 1} \qquad (1.11)$$
for each $i$. The remaining condition gives
$$\sum_{i=1}^{N} e^{\lambda - 1} = N e^{\lambda - 1} = 1, \qquad (1.12)$$
which can be solved for $\lambda$, or can be used directly to give
$$p_i = \frac{1}{N} \qquad (1.13)$$
for all $i$. Using these values for the $p_i$, we get
$$H = -\sum_{i=1}^{N} \frac{1}{N} \ln\left(\frac{1}{N}\right) = \ln(N), \qquad (1.14)$$
and by Proposition 1.1 the same argument gives $\log(N)$ when logarithms to the base 2 are used.
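As a quick numerical check of Theorem 1.1 (an illustrative sketch, not part of the proof), no randomly generated distribution on $N$ outcomes exceeds the entropy $\log(N)$ of the uniform distribution:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 4
print(entropy([1.0 / N] * N), math.log2(N))  # both equal 2.0 bits for N = 4

for _ in range(10_000):
    weights = [random.random() for _ in range(N)]
    probs = [w / sum(weights) for w in weights]
    assert entropy(probs) <= math.log2(N) + 1e-12
```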
LEMMA 1.1
If $p_1, p_2, \ldots, p_N$ and $q_1, q_2, \ldots, q_N$ are all non-negative numbers that satisfy the conditions
$$\sum_{i=1}^{N} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{N} q_i = 1,$$
then
$$-\sum_{i=1}^{N} p_i \log(p_i) \le -\sum_{i=1}^{N} p_i \log(q_i),$$
with equality if and only if $p_i = q_i$ for all $i$.
PROOF We prove the result for the natural logarithm; the result for any other base follows immediately from the identity
$$\log_b(x) = \frac{\ln(x)}{\ln(b)}. \qquad (1.16)$$
Using the inequality $\ln(x) \le x - 1$, with equality if and only if $x = 1$, we have
$$\sum_{i=1}^{N} p_i \ln\left(\frac{q_i}{p_i}\right) \le \sum_{i=1}^{N} p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_{i=1}^{N} q_i - \sum_{i=1}^{N} p_i = 0,$$
so
$$-\sum_{i=1}^{N} p_i \ln(p_i) \le -\sum_{i=1}^{N} p_i \ln(q_i), \qquad (1.21)$$
with equality if and only if $q_i = p_i$ for all $i$.
Note that putting $q_i = 1/N$ for all $i$ in this inequality gives us an alternative proof that the maximum value of the entropy function is $\log(N)$.
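Lemma 1.1 can also be checked numerically. In the sketch below (our own illustration), cross_entropy(p, p) is just the entropy of p, and the inequality holds for every randomly generated pair of distributions:

```python
import math
import random

def cross_entropy(p, q):
    """-sum_i p_i log2 q_i; equals the entropy of p when q == p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n):
    weights = [random.random() for _ in range(n)]
    return [w / sum(weights) for w in weights]

for _ in range(10_000):
    p, q = random_distribution(5), random_distribution(5)
    assert cross_entropy(p, p) <= cross_entropy(p, q) + 1e-9
```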
There are many situations in which it is useful to consider sample spaces that are the Cartesian product of two or more sets. If $X = \{x_1, x_2, \ldots, x_M\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$, a joint probability distribution function on $X \times Y$ is a function $P$ satisfying
$$P(x_i, y_j) \ge 0 \qquad (1.22)$$
for $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, N$, and
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) = 1. \qquad (1.23)$$
If $P$ is the joint probability distribution function on $X \times Y$, the definition of entropy becomes
$$H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j). \qquad (1.24)$$
If we want to emphasise the spaces $X$ and $Y$, we will denote the entropy of the joint probability distribution on $X \times Y$ by $H(X \times Y)$ or simply by $H(X, Y)$. This is known as the joint entropy of $X$ and $Y$.
Given a joint probability distribution $P$ on $X \times Y$, the marginal distribution on $X$ is given by $P_X(x_i) = \sum_{j=1}^{N} P(x_i, y_j)$, and the marginal distribution on $Y$ by $P_Y(y_j) = \sum_{i=1}^{M} P(x_i, y_j)$.
If there are probability distributions $P_X$ and $P_Y$ on $X$ and $Y$, respectively, and these are independent, the joint probability distribution on $X \times Y$ is given by $P(x_i, y_j) = P_X(x_i) P_Y(y_j)$.
There is a simple relationship between the entropy of the joint probability distribution function and that of the marginal distribution functions.
THEOREM 1.2
If $P$ is a joint probability distribution function on $X \times Y$, and $P_X$ and $P_Y$ are the marginal distributions on $X$ and $Y$, respectively, then
$$H(X, Y) \le H(X) + H(Y),$$
with equality if and only if the marginal distributions are independent.
PROOF
$$H(X) = -\sum_{i=1}^{M} P_X(x_i) \log P_X(x_i) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P_X(x_i), \qquad (1.29)$$
and similarly
$$H(Y) = -\sum_{j=1}^{N} \sum_{i=1}^{M} P(x_i, y_j) \log P_Y(y_j). \qquad (1.30)$$
So
$$H(X) + H(Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log\bigl(P_X(x_i) P_Y(y_j)\bigr). \qquad (1.31)$$
Also,
$$H(X, Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j). \qquad (1.32)$$
Since
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) = 1 \qquad (1.33)$$
and
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P_X(x_i) P_Y(y_j) = 1, \qquad (1.34)$$
we can use the inequality of Lemma 1.1 to conclude that
$$H(X, Y) \le H(X) + H(Y), \qquad (1.35)$$
with equality if and only if $P(x_i, y_j) = P_X(x_i) P_Y(y_j)$ for all $i$ and $j$, that is, if the two marginal distributions are independent.
If $A$ and $B$ are events in a sample space $S$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is $P(A \mid B) = P(A \cap B)/P(B)$. It is obvious that
$$P(A \cap B) = P(A \mid B) P(B). \qquad (1.37)$$
Almost as obvious is one form of Bayes' Theorem:
THEOREM 1.3
If $S$ is a sample space with a probability distribution function $P$, and $A$ and $B$ are events in $S$, then
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \qquad (1.38)$$
EXAMPLE 1.7
We have two jars, A and B. Jar A contains 8 green balls and 2 red balls. Jar B contains 3 green balls and 7 red balls. One jar is selected at random and a ball is drawn from it.
We have probabilities as follows. The set of jars forms one sample space, $J = \{A, B\}$, with
$$P(A) = 0.5, \quad P(B) = 0.5,$$
as one jar is as likely to be chosen as the other.
The set of colours forms another sample space, $C = \{green, red\}$. The probability of drawing a green ball is
$$P(green) = \frac{11}{20} = 0.55,$$
as 11 of the 20 balls in the jars are green. Similarly,
$$P(red) = \frac{9}{20} = 0.45.$$
We have a joint probability distribution over the colours of the balls and the jars, with the probability of selecting Jar A and drawing a green ball being given by
$$P(A, green) = 0.5 \times \frac{8}{10} = 0.4.$$
We have the conditional probabilities: given that Jar A was selected, the probability of drawing a green ball is
$$P(green \mid A) = \frac{8}{10} = 0.8,$$
and the probability of drawing a red ball is
$$P(red \mid A) = \frac{2}{10} = 0.2.$$
The conditional probability of $y_j$ given $x_i$ is
$$P(y_j \mid x_i) = \frac{P(x_i, y_j)}{P_X(x_i)}. \qquad (1.40)$$
The entropy of this conditional distribution is
$$H(Y \mid x_i) = -\sum_{j=1}^{N} P(y_j \mid x_i) \log P(y_j \mid x_i), \qquad (1.41)$$
and the conditional entropy of $Y$ given $X$ is its expected value,
$$H(Y \mid X) = -\sum_{i=1}^{M} P_X(x_i) \sum_{j=1}^{N} P(y_j \mid x_i) \log P(y_j \mid x_i). \qquad (1.42)$$
Since $P(x_i, y_j) = P(y_j \mid x_i) P_X(x_i)$, we can re-write this as
$$H(Y \mid X) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(y_j \mid x_i). \qquad (1.43)$$
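The probabilities of Example 1.7 and the conditional entropy defined above can be computed directly; the sketch below is our own working of that example, with variable names of our choosing:

```python
import math

# Joint distribution P(jar, colour) for Example 1.7:
# a jar is chosen with probability 0.5, then a ball is drawn from it.
joint = {
    ("A", "green"): 0.5 * 8 / 10, ("A", "red"): 0.5 * 2 / 10,
    ("B", "green"): 0.5 * 3 / 10, ("B", "red"): 0.5 * 7 / 10,
}

p_colour = {c: sum(p for (j, col), p in joint.items() if col == c) for c in ("green", "red")}
p_jar = {j: sum(p for (jar, c), p in joint.items() if jar == j) for j in ("A", "B")}
print(p_colour)  # {'green': 0.55, 'red': 0.45}

# Conditional entropy H(colour | jar) = -sum P(jar, colour) log2 P(colour | jar).
h_colour_given_jar = -sum(p * math.log2(p / p_jar[j]) for (j, c), p in joint.items())
print(h_colour_given_jar)  # about 0.80 bits
```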
THEOREM 1.4
$$H(X, Y) = H(X) + H(Y \mid X).$$
PROOF
$$H(X) + H(Y \mid X) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P_X(x_i) - \sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(y_j \mid x_i) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j) = H(X, Y). \qquad (1.44)$$
THEOREM 1.5
$$H(Y \mid X) \le H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
PROOF By Theorem 1.4, $H(Y \mid X) = H(X, Y) - H(X)$, and by Theorem 1.2, $H(X, Y) \le H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are independent; the result follows.
This result is obviously symmetric in $X$ and $Y$; so we also have
$$H(X \mid Y) \le H(X),$$
with equality if and only if $X$ and $Y$ are independent. We can sum up this result by saying that conditioning reduces entropy or conditioning reduces uncertainty.
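Theorems 1.2, 1.4 and 1.5 can be verified numerically on randomly generated joint distributions; the following sketch is illustrative only:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

for _ in range(1000):
    # Random joint distribution P(x, y) on a 3 x 4 grid.
    weights = [[random.random() for _ in range(4)] for _ in range(3)]
    total = sum(sum(row) for row in weights)
    P = [[w / total for w in row] for row in weights]

    p_x = [sum(row) for row in P]
    p_y = [sum(P[i][j] for i in range(3)) for j in range(4)]
    h_xy = entropy([P[i][j] for i in range(3) for j in range(4)])

    # H(Y | X) from the definition (1.43).
    h_y_given_x = -sum(P[i][j] * math.log2(P[i][j] / p_x[i])
                       for i in range(3) for j in range(4))

    assert math.isclose(entropy(p_x) + h_y_given_x, h_xy)       # Theorem 1.4
    assert h_y_given_x <= entropy(p_y) + 1e-9                   # Theorem 1.5
    assert h_xy <= entropy(p_x) + entropy(p_y) + 1e-9           # Theorem 1.2
```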
signals. The first term is used by statisticians, the second by mathematicians and the
third by engineers. This may reflect differences in the way these people approach the
subject: statisticians are primarily interested in describing such sequences in terms
of probability theory, mathematicians are interested in the behaviour of such series
and the ways in which they may be generated and engineers are interested in ways of
using such sequences and processing them to extract useful information from them.
A device or situation that produces such a sequence is called an information source.
The elements of the sequence are usually drawn from a finite set, which may be
referred to as the alphabet. The source can be considered to be emitting an element
of the alphabet at each instant of a sequence of instants in time. The elements of the
alphabet are referred to as symbols.
EXAMPLE 1.8
Tossing a coin repeatedly and recording the outcomes as heads (H) or tails (T) gives us a random sequence whose alphabet is $\{H, T\}$.
EXAMPLE 1.9
Throwing a die repeatedly and recording the number of spots on the uppermost face gives us a random sequence whose alphabet is $\{1, 2, 3, 4, 5, 6\}$.
EXAMPLE 1.10
Computers and telecommunications equipment generate sequences of bits which are random sequences whose alphabet is $\{0, 1\}$.
EXAMPLE 1.11
A text in the English language is a random sequence whose alphabet is the set con-
sisting of the letters of the alphabet, the digits and the punctuation marks. While we
normally consider text to be meaningful rather than random, it is only possible to
predict which letter will come next in the sequence in probabilistic terms, in general.
The last example above illustrates the point that a random sequence may not appear
to be random at first sight. The difference between the earlier examples and the final
example is that in the English language there are correlations between each letter in
the sequence and those that precede it. In contrast, there are no such correlations in
the cases of tossing a coin or throwing a die repeatedly. We will consider both kinds
of information sources below.
An obvious question that is raised by the term “information source” is: What is the
“information” that the source produces? A second question, perhaps less obvious,
is: How can we measure the information produced by an information source?
An information source generates a sequence of symbols which has a certain degree
of unpredictability. The more unpredictable the sequence is, the more information
is conveyed by each symbol. The information source may impose structure on the
sequence of symbols. This structure will increase the predictability of the sequence
and reduce the information carried by each symbol.
The random behaviour of the sequence may be described by probability distribu-
tions over the alphabet. If the elements of the sequence are uncorrelated, a simple
probability distribution over the alphabet may suffice. In other cases, conditional
probability distributions may be required.
We have already seen that entropy is a measure of predictability. For an information
source, the information content of the sequence that it generates is measured by the
entropy per symbol. We can compute this if we make assumptions about the kinds
of structure that the information source imposes upon its output sequences.
To describe an information source completely, we need to specify both the alphabet $A$ and the probability distribution $P$ that governs the generation of sequences. The entropy of the information source with alphabet $A$ and probability distribution $P$ will be denoted by $H(P)$ in the following sections, even though it is actually the entropy of $P$. Later on, we will wish to concentrate on the alphabet and will use $H(A)$ to denote the entropy of the information source, on the assumption that the alphabet $A$ will have a probability distribution $P$ associated with it.
EXAMPLE 1.12
Tossing a fair coin gives us an example of a stationary memoryless information source.
At any instant, the probability distribution is given by $P(H) = 0.5$, $P(T) = 0.5$.
This probability distribution has an entropy of 1 bit; so the information content is 1
bit/symbol.
EXAMPLE 1.13
As an example of a non-stationary memoryless information source, suppose we have a
fair coin and a die with painted on four faces and painted on two faces. Tossing the
coin and throwing the die in alternation will create a memoryless information source
with alphabet . Every time the coin is tossed, the probability distribution of
the outcomes is ,
, and every time the die is thrown, the
probability distribution is
,
.
The probability distribution of the outcomes of tossing the coin has an entropy of 1
bit. The probability distribution of the outcomes of throwing the die has an entropy
of 0.918 bits. The information content of the sequence is the average entropy per
symbol, which is 0.959 bits/symbol.
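The following sketch (ours) reproduces the two entropies quoted in Example 1.13:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_coin = entropy([0.5, 0.5])       # 1.0 bit for the fair coin
h_die = entropy([2 / 3, 1 / 3])    # about 0.918 bits for the painted die
print(h_coin, h_die, (h_coin + h_die) / 2)   # average: about 0.959 bits/symbol
```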
Markov sources and n-gram models are descriptions of a class of information sources
with memory.
To generate a sequence, a state is selected on the basis of the initial probability distri-
bution. A transition from this state to another state (or to the same state) is selected
on the basis of the transition probabilities, and the label of this transition is output.
This process is repeated to generate the sequence of output symbols.
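The generation procedure just described is easy to simulate. The sketch below uses a small made-up source; its states, labels and probabilities are our own and are not those of Example 1.14:

```python
import random

# transitions[state] is a list of (next_state, label, probability) triples.
transitions = {
    "s1": [("s2", "1", 0.7), ("s3", "0", 0.3)],
    "s2": [("s1", "0", 0.5), ("s3", "1", 0.5)],
    "s3": [("s1", "1", 1.0)],
}
initial = {"s1": 0.5, "s2": 0.3, "s3": 0.2}

def generate(length):
    state = random.choices(list(initial), weights=list(initial.values()))[0]
    output = []
    for _ in range(length):
        options = transitions[state]
        state, label, _ = random.choices(options, weights=[p for _, _, p in options])[0]
        output.append(label)
    return "".join(output)

print(generate(20))
```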
It is convenient to represent Markov models diagrammatically in the form of a graph,
with the states represented by vertices and the transitions by edges, as in the follow-
ing example.
FIGURE 1.2
Diagrammatic representation of a Markov source.
EXAMPLE 1.14
Consider a Markov source with alphabet $\{0, 1\}$ and set of states $\{\sigma_1, \sigma_2, \sigma_3\}$.
Suppose there are four transitions, whose labels and probabilities are shown in Figure 1.2.
EXAMPLE 1.15
The following probabilities give us a 3-gram model on the language $\{0, 1\}$:
To describe the relationship between n-gram models and Markov sources, we need
to look at special cases of Markov sources.
When we have an $n$th-order Markov model, the transition probabilities are usually given in terms of the probabilities of single symbols being emitted when the source is in a given state. For example, in a second-order Markov model on $\{0, 1\}$, the transition probability from the state $00$ to the state $01$, which would be represented by $P(01 \mid 00)$, would be represented instead by the probability of emission of $1$ when in the state $00$, that is $P(1 \mid 00)$. Obviously, some transitions are impossible. For example, it is not possible to go from the state $00$ to the state $11$, as the state following $00$ must have $0$ as its first symbol.
We can construct an $(n-1)$th-order Markov model from an $n$-gram model and an $(n-1)$-gram model. The $n$-gram model gives us the probabilities of strings of length $n$, such as $P(s_1 s_2 \cdots s_n)$, while the $(n-1)$-gram model gives us the probabilities of the states, which are strings of length $n-1$, such as $P(s_1 s_2 \cdots s_{n-1})$. To find the emission probability of $s_n$ from this state, we set
$$P(s_n \mid s_1 s_2 \cdots s_{n-1}) = \frac{P(s_1 s_2 \cdots s_n)}{P(s_1 s_2 \cdots s_{n-1})}. \qquad (1.45)$$
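Equation (1.45) translates directly into code. In the sketch below the 3-gram probabilities are placeholder values invented for illustration, and the 2-gram probabilities are taken to be their prefix sums; only the construction itself follows the text:

```python
# Hypothetical 3-gram probabilities on the alphabet {0, 1} (illustrative only).
p3 = {"000": 0.30, "001": 0.10, "010": 0.06, "011": 0.04,
      "100": 0.10, "101": 0.05, "110": 0.20, "111": 0.15}

# For consistency, take the 2-gram probabilities to be the prefix marginals of p3.
p2 = {s: sum(p for t, p in p3.items() if t.startswith(s)) for s in ("00", "01", "10", "11")}

# Emission probabilities of the second-order Markov source, via equation (1.45):
# P(c | ab) = P(abc) / P(ab).
emission = {(ab, c): p3[ab + c] / p2[ab] for ab in p2 for c in "01"}
for (state, symbol), prob in sorted(emission.items()):
    print(f"P({symbol}|{state}) = {prob:.3f}")
```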
EXAMPLE 1.16
In the previous Example 1.15 we had a 3-gram model on the language $\{0, 1\}$, given by the probabilities of the eight 3-grams.
FIGURE 1.3
Diagrammatic representation of a Markov source equivalent to a 3-gram model. The source has states $00$, $01$, $10$ and $11$, with emission probabilities $P(0 \mid 00) = 0.8$, $P(1 \mid 00) = 0.2$, $P(0 \mid 01) = 0.5$, $P(1 \mid 01) = 0.5$, $P(0 \mid 10) = 0.8$, $P(1 \mid 10) = 0.2$, $P(0 \mid 11) = 0.6$ and $P(1 \mid 11) = 0.4$.
If a 2-gram model for the same source is given by the probabilities of the four 2-grams $00$, $01$, $10$ and $11$, then we can construct a second-order Markov source as follows: for each state, the emission probabilities are obtained from equation (1.45), and the resulting source is the one shown in Figure 1.3.
In general, a Markov source with $N$ states can be described by its transition matrix
$$\Pi = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1N} \\ P_{21} & P_{22} & \cdots & P_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N1} & P_{N2} & \cdots & P_{NN} \end{bmatrix}, \qquad (1.47)$$
where $P_{ij}$ is the probability of a transition from the $j$th state to the $i$th state. Because they all represent probability distributions, each of the columns of $\Pi$ must add up to $1$; that is, the $P_{ij}$ must add up to $1$ for each $j$.
EXAMPLE 1.17
Consider a source with transition matrix $\Pi$ and an initial probability distribution $W(0)$ over its states. Then the distribution after one step is $W(1) = \Pi W(0)$. Similarly, $W(2) = \Pi W(1)$, and so on.
Suppose instead that $W$ is a probability distribution over the states satisfying
$$\Pi W = W,$$
so that applying $\Pi$ leaves it unchanged. For a three-state source, one of the resulting equations gives a relation between $W_1$ and $W_2$, and the other two give relations involving $W_1$, $W_2$ and $W_3$. Combining these with the condition
$$W_1 + W_2 + W_3 = 1,$$
we get $W_2$, which gives us $W_1$ and $W_3$. So the stationary distribution is the distribution $W$ obtained in this way.
In the examples above, the source has a unique stationary distribution. This is not always the case.
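Finding a stationary distribution amounts to solving $\Pi W = W$ together with the normalisation condition. A minimal sketch, using a made-up three-state transition matrix rather than the one in the example:

```python
import numpy as np

# Hypothetical 3-state transition matrix (columns sum to 1); illustrative only.
Pi = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.6, 0.3],
    [0.2, 0.2, 0.4],
])

# The stationary distribution W satisfies Pi @ W = W and sum(W) = 1.
# Replace one row of (Pi - I) with the normalisation condition and solve.
A = Pi - np.eye(3)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
W = np.linalg.solve(A, b)
print(W, Pi @ W)  # Pi @ W reproduces W
```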
EXAMPLE 1.19
Consider the source with four states and probability transition matrix
The diagrammatic representation of this source is shown in Figure 1.4.
FIGURE 1.4
A source with two stationary distributions.
Some Markov sources have the property that every sequence generated by the source
has the same statistical properties. That is, the various frequencies of occurrence of
symbols, pairs of symbols, and so on, obtained from any sequence generated by the
source will, as the length of the sequence increases, approach some definite limit
which is independent of the particular sequence. Sources that have this property are
called ergodic sources.
The source of Example 1.19 is not an ergodic source. The sequences generated by
that source fall into two classes, one of which is generated by sequences of states
that end in the first state, the other of which is generated by sequences that end in the
fourth state. The fact that there are two distinct stationary distributions shows that
the source is not ergodic.
There are various ways of defining the entropy of an information source. The fol-
lowing is a simple approach which applies to a restricted class of Markov sources.
If we denote the probability distribution on the set of transitions from the $i$th state by $P_i$, then the entropy of the $i$th state is given by
$$H_i = -\sum_{j} P_i(j) \log P_i(j).$$
DEFINITION 1.19 Unifilar Markov Source A unifilar Markov source is one with
the property that the labels on the transitions from any given state are all distinct.
We need this property in order to be able to define the entropy of a Markov source.
We assume that the source has a stationary distribution.
The entropy of a unifilar Markov source $S$, whose stationary distribution assigns probability $W_i$ to the $i$th state and whose $i$th state has entropy $H_i$, is
$$H(S) = \sum_{i} W_i H_i. \qquad (1.52)$$
It can be shown that this definition is consistent with more general definitions of the
entropy of an information source.
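As an illustration, the sketch below applies this definition to the source of Figure 1.3. The stationary distribution and the entropy it prints are our own calculation from the emission probabilities shown in that figure, not values quoted from the text:

```python
import math
import numpy as np

states = ["00", "01", "10", "11"]
# Emission probabilities P(symbol | state) from Figure 1.3.
emission = {
    "00": {"0": 0.8, "1": 0.2}, "01": {"0": 0.5, "1": 0.5},
    "10": {"0": 0.8, "1": 0.2}, "11": {"0": 0.6, "1": 0.4},
}

def next_state(state, symbol):
    return state[1] + symbol  # second-order source: drop the oldest symbol

# Build the transition matrix (columns indexed by the current state).
Pi = np.zeros((4, 4))
for j, s in enumerate(states):
    for symbol, p in emission[s].items():
        Pi[states.index(next_state(s, symbol)), j] += p

# Stationary distribution: solve Pi @ W = W with sum(W) = 1.
A = Pi - np.eye(4)
A[-1, :] = 1.0
W = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))

state_entropy = [-sum(p * math.log2(p) for p in emission[s].values()) for s in states]
H = sum(w * h for w, h in zip(W, state_entropy))   # equation (1.52)
print(dict(zip(states, W.round(4))), round(H, 4))
```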
EXAMPLE 1.20
For the Markov source of Example 1.14, there are three states, $\sigma_1$, $\sigma_2$ and $\sigma_3$. The probability distribution on the set of transitions from each state is read off from Figure 1.2, and its entropy is computed directly from the definition, using the usual convention that $0 \log(0) = 0$.
EXAMPLE 1.21
For the source of Example 1.16, the states are $00$, $01$, $10$ and $11$. The probability distribution on the set of transitions from each state is given by the emission probabilities shown in Figure 1.3, and its entropy is computed as in the previous example. The entropy of the source is then the sum of the state entropies weighted by the probabilities of the states in the stationary distribution.
The following results apply to ergodic Markov sources and are stated without proof. In a sense, they justify the use of the conditional probabilities of emission of symbols instead of transition probabilities between states in $n$th-order Markov models.
THEOREM 1.6
Given any $\delta > 0$ and any $\epsilon > 0$, we can find a positive integer $N_0$ such that all sequences of length $N \ge N_0$ fall into two classes: a set of sequences whose total probability is less than $\epsilon$; and the remainder, for which the following inequality holds:
$$\left| \frac{\log(1/p)}{N} - H \right| < \delta, \qquad (1.53)$$
where $p$ is the probability of the sequence and $H$ is the entropy of the source.
THEOREM 1.7
Let $S$ be a Markov source with alphabet $A$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^N$, let $P(s)$ be the probability of the sequence $s$ being emitted by the source. Define
$$G_N = -\frac{1}{N} \sum_{s \in A^N} P(s) \log P(s), \qquad (1.54)$$
which is the entropy per symbol of the sequences of $N$ symbols. Then $G_N$ is a monotonic decreasing function of $N$ and
$$\lim_{N \to \infty} G_N = H. \qquad (1.55)$$
THEOREM 1.8
Let $S$ be a Markov source with alphabet $A$, and entropy $H$. Let $A^{N-1}$ denote the set of all sequences of symbols from $A$ of length $N-1$. For $s \in A^{N-1}$ and $a \in A$, let $P(s, a)$ be the probability of the source emitting the sequence $s$ followed by the symbol $a$, and let $P(a \mid s)$ be the conditional probability of the symbol $a$ being emitted immediately after the sequence $s$. Define
$$F_N = -\sum_{s \in A^{N-1}} \sum_{a \in A} P(s, a) \log P(a \mid s), \qquad (1.56)$$
which is the conditional entropy of the next symbol when the $(N-1)$ preceding symbols are known. Then $F_N$ is a monotonic decreasing function of $N$ and
$$\lim_{N \to \infty} F_N = H. \qquad (1.57)$$
THEOREM 1.9
If $G_N$ and $F_N$ are defined as in the previous theorems, then
$$F_N = N G_N - (N-1) G_{N-1}, \qquad (1.58)$$
$$G_N = \frac{1}{N} \sum_{n=1}^{N} F_n, \qquad (1.59)$$
and
$$F_N \le G_N. \qquad (1.60)$$
These results show that a series of approximations to the entropy of a source can be obtained by considering only the statistical behaviour of sequences of symbols of increasing length. The sequence of estimates $F_N$ is a better approximation than the sequence $G_N$. If the dependencies in a source extend over no more than $m$ symbols, so that the conditional probability of the next symbol knowing the preceding $m$ symbols is the same as the conditional probability of the next symbol knowing the preceding $N > m$ symbols, then $F_{m+1} = H$.
For an $n$th-order Markov source, the probability of emission of the symbol $a_j$ is
$$P(a_j) = \sum_{s} P(s) P(a_j \mid s),$$
where $s$ represents a sequence of $n$ symbols from the alphabet of the source (a state of the source), $P(s)$ is the probability of this sequence in the stationary distribution of the Markov source and the summation over $s$ indicates that all such sequences are included in the summation. The adjoint source of this Markov source, denoted $\bar{S}$, is the memoryless source that emits these symbols with the same probabilities.
EXAMPLE 1.22
For the 3-gram model of Example 1.15, we have the transition probabilities shown in Figure 1.3. From these we get the stationary distribution over the states $00$, $01$, $10$ and $11$, and from the stationary distribution we get the probabilities $P(0)$ and $P(1)$ for the adjoint source.
Although the probabilities of emission of single symbols are the same for both the Markov source and its adjoint source, the probabilities of emission of sequences of symbols may not be the same. For example, the probability of emission of a given sequence of symbols by the Markov source is computed from the transition probabilities, while for the adjoint source it is the product of the single-symbol probabilities (by the assumption of independence), and the two values are in general different.
Going from a Markov source to its adjoint reduces the number of constraints on the
output sequence and hence increases the entropy. This is formalised by the following
theorem.
THEOREM 1.10
If $\bar{S}$ is the adjoint of the Markov source $S$, their entropies are related by
$$H(\bar{S}) \ge H(S). \qquad (1.62)$$
PROOF The entropy of the Markov source is
$$H(S) = -\sum_{i} W_i \sum_{j} P(a_j \mid \sigma_i) \log P(a_j \mid \sigma_i), \qquad (1.63)$$
where the summation is over all states and $W_i$ is the probability of the state $\sigma_i$ in the stationary distribution of the source.
The entropy of the adjoint is
$$H(\bar{S}) = -\sum_{j} P(a_j) \log P(a_j),$$
where $P(a_j) = \sum_{i} W_i P(a_j \mid \sigma_i)$; applying the inequality of Lemma 1.1 to each state's emission distribution and summing over the states gives $H(S) \le H(\bar{S})$.
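The inequality of Theorem 1.10 can be checked for the source of Figure 1.3; as before, the numbers are our own calculation from the figure's emission probabilities:

```python
import math
import numpy as np

states = ["00", "01", "10", "11"]
emission = {"00": {"0": 0.8, "1": 0.2}, "01": {"0": 0.5, "1": 0.5},
            "10": {"0": 0.8, "1": 0.2}, "11": {"0": 0.6, "1": 0.4}}

# Transition matrix and stationary distribution, as in the earlier sketch.
Pi = np.zeros((4, 4))
for j, s in enumerate(states):
    for symbol, p in emission[s].items():
        Pi[states.index(s[1] + symbol), j] += p
A = Pi - np.eye(4)
A[-1, :] = 1.0
W = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))

# Markov source entropy: weighted sum of state entropies.
H_markov = sum(w * -sum(p * math.log2(p) for p in emission[s].values())
               for w, s in zip(W, states))

# Adjoint source: emits 0 and 1 with their stationary probabilities.
p0 = sum(w * emission[s]["0"] for w, s in zip(W, states))
H_adjoint = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))

print(round(H_markov, 3), round(H_adjoint, 3))  # the adjoint entropy is larger
assert H_adjoint >= H_markov
```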
EXAMPLE 1.23
Consider a memoryless source $S$ with alphabet $A = \{0, 1\}$ and emission probabilities $P(0)$ and $P(1)$. The $n$th extension of $S$, denoted $S^n$, is the source whose symbols are the sequences of $n$ symbols of $S$, each emitted with probability equal to the product of the probabilities of its individual symbols.
THEOREM 1.11
If $S^n$ is the $n$th extension of the stationary memoryless source $S$, their entropies are related by
$$H(S^n) = nH(S). \qquad (1.67)$$
PROOF The entropy of $S^n$ is
$$H(S^n) = -\sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} P(a_{i_1} a_{i_2} \cdots a_{i_n}) \log P(a_{i_1} a_{i_2} \cdots a_{i_n}), \qquad (1.70)$$
where $P(a_{i_1} a_{i_2} \cdots a_{i_n}) = P(a_{i_1}) P(a_{i_2}) \cdots P(a_{i_n})$, since the source is memoryless.
Breaking $P(a_{i_1} a_{i_2} \cdots a_{i_n})$ into the product of probabilities, and rearranging, we get
$$H(S^n) = -\sum_{k=1}^{n} \left[ \sum_{i_k} P(a_{i_k}) \log P(a_{i_k}) \right] \prod_{m \ne k} \left[ \sum_{i_m} P(a_{i_m}) \right]. \qquad (1.72)$$
Since
$$\sum_{i_m} P(a_{i_m}) = 1 \qquad (1.73)$$
for each $m$, we are left with
$$H(S^n) = -n \sum_{i} P(a_i) \log P(a_i) = nH(S). \qquad (1.74)$$
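Theorem 1.11 is easy to confirm numerically for a small memoryless source; the probabilities below are arbitrary illustrative values:

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

P = {"a": 0.7, "b": 0.2, "c": 0.1}   # an arbitrary memoryless source
n = 3

# The n-th extension emits blocks of n symbols with product probabilities.
block_probs = [math.prod(P[s] for s in block) for block in product(P, repeat=n)]
print(entropy(block_probs), n * entropy(P.values()))  # equal: H(S^n) = n H(S)
```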
EXAMPLE 1.24
Let be the first-order Markov source with alphabet and transition probabil-
ities
EXAMPLE 1.25
Consider the second order Markov source with alphabet and transition proba-
bilities
EXAMPLE 1.26
Consider the fourth order Markov source with alphabet and transition proba-
bilities
We can use these probabilities to compute the transition probabilities of the second extension: for example, the probability of emitting a pair of symbols from a given state is the product of the transition probability for the first symbol and the transition probability for the second symbol from the state reached after the first emission.
THEOREM 1.12
If $S^n$ is the $n$th extension of the Markov source $S$, their entropies are related by
$$H(S^n) = nH(S). \qquad (1.77)$$
The proof is similar to the proof of the corresponding result for memoryless sources.
(Exercise 17 deals with a special case.)
Note that if , then
(1.78)
(1.79)
EXAMPLE 1.27
Suppose the sample space is the set of natural numbers $\{1, 2, 3, \ldots\}$, and the probability distribution is given by
$$P(n) = 2^{-n}. \qquad (1.80)$$
The surprise of the outcome $n$ is $n$ bits, so the entropy of this distribution is $\sum_{n=1}^{\infty} n \, 2^{-n} = 2$ bits.
If the sample space is a continuum, in particular, the real line, the summations become integrals. Instead of the probability distribution function, we use the probability density function $f$, which has the property that
$$P([a, b]) = \int_{a}^{b} f(x)\,dx, \qquad (1.82)$$
where $[a, b]$ is a closed interval and $f$ is the probability density function.
The mean and variance of the probability density function are defined by
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (1.83)$$
and
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx. \qquad (1.84)$$
The obvious generalisation of the definition of entropy for a probability density function defined on the real line is
$$H(f) = -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx, \qquad (1.85)$$
provided this integral exists.
provided this integral exists. This definition was proposed by Shannon, but has been
the subject of debate because it is not invariant with respect to change of scale or
change of co-ordinates in general. It is sometimes known as the differential entropy.
If we accept this definition of the entropy of a continuous distribution, it is easy to
compute the entropy of a Gaussian distribution.
THEOREM 1.13
The entropy of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is
$$H = \ln\left(\sigma\sqrt{2\pi e}\right)$$
in natural units.
PROOF The Gaussian probability density function with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.$$
By definition,
$$\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \sigma^2; \qquad (1.89)$$
this will be used below.
We now calculate the entropy:
$$H = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx = \int_{-\infty}^{\infty} f(x) \left[ \ln\left(\sigma\sqrt{2\pi}\right) + \frac{(x-\mu)^2}{2\sigma^2} \right] dx = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2} \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2} = \ln\left(\sigma\sqrt{2\pi e}\right). \qquad (1.90)$$
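The closed form just derived can be checked by numerical integration; this is an independent sketch, not part of the text's argument:

```python
import math

mu, sigma = 0.0, 2.0   # arbitrary illustrative parameters

def f(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Approximate -integral of f(x) ln f(x) dx by a Riemann sum over a wide interval.
dx = 0.001
h_numeric = -sum(f(x) * math.log(f(x)) * dx
                 for x in (mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))))

h_closed_form = math.log(sigma * math.sqrt(2 * math.pi * math.e))
print(h_numeric, h_closed_form)   # both approximately 2.112 nats for sigma = 2
```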
If the probability density function is defined over the whole real line, it is not possi-
ble to find a specific probability density function whose entropy is greater than the
entropy of all other probability density functions defined on the real line. However,
if we restrict consideration to probability density functions with a given variance, it
can be shown that the Gaussian distribution has the maximum entropy of all these
distributions. (Exercises 20 and 21 outline the proof of this result.)
We have used $H(f)$ to denote the entropy of the probability distribution whose probability density function is $f$. If $X$ is a random variable whose probability density function is $f$, we will denote its entropy by either $H(f)$ or $H(X)$.
1.20 Exercises
1. Let $S = \{s_1, s_2, s_3\}$ be a sample space with a given probability distribution $P$. Let $F$ be a function defined on $S$ by given values $F(s_1)$, $F(s_2)$, $F(s_3)$. What is the expected value of $F$?
2. Let $S$ be a sample space with a given probability distribution $P$. Let $F$ be a given function from $S$ to $\mathbb{R}$. What is the expected value of $F$?
3. Suppose that a fair die is tossed. What is the expected number of spots on the
uppermost face of the die when it comes to rest? Will this number of spots
ever be seen when the die is tossed?
4. Let $S = \{s_1, s_2, s_3, s_4\}$ be a sample space with a given probability distribution $P$. There are sixteen possible events that can be formed from the elements of $S$. Compute the probability and surprise of these events.
5. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a sample space for some $N$. Compute the entropy of each of the following probability distributions on $S$:
Write down the transition matrix for this source. Is it possible for this source to
generate an output sequence that includes the subsequence ? Is it possible
for this source to generate an output sequence that includes the subsequence
?
10. A 2-gram model on the language is given by the probabilities
Construct a (first-order) Markov source from these models. Draw a diagram
to represent the source and write down its transition matrix.
with all other probabilities . All the 3-gram probabilities are also , except
for
Compute for .
13. Find the stationary distribution of the Markov source whose transition matrix
is
*14. Prove that a Markov source with two states always has a stationary distribution,
provided that none of the transition probabilities are .
15. Find the stationary distribution of the Markov source whose transition matrix
is
The next two exercises prove the result that the Gaussian distribution has the
maximum entropy of all distributions with a given variance.
20. Use the identity $\ln(x) \le x - 1$ to show that if $f$ and $g$ are probability density functions defined on the real line then
$$-\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \le -\int_{-\infty}^{\infty} f(x) \ln g(x)\,dx.$$
Conclude that of all the probability density functions on the real line with variance $\sigma^2$, the Gaussian has the greatest entropy.
22. Compute the entropy of the probability density function of a uniform distribution on a closed interval whose variance is $\sigma^2$ and confirm that it is less than the entropy of a Gaussian distribution with variance $\sigma^2$. (Use the results of Exercise 19.)
The next four exercises are concerned with the Kullback-Leibler Divergence.
*23. Let $P$ and $Q$ be probability distributions on the sample space $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) > 0$ and $Q(s_i) > 0$ for $1 \le i \le N$. The Kullback-Leibler Divergence of $P$ and $Q$ is defined by
$$D(P \| Q) = \sum_{i=1}^{N} P(s_i) \log\left(\frac{P(s_i)}{Q(s_i)}\right).$$
*26. If the sample space is the real line, it is easier to define the Kullback-Leibler Divergence in terms of probability density functions. If $f$ and $g$ are probability density functions on $\mathbb{R}$, with $f(x) > 0$ and $g(x) > 0$ for all $x \in \mathbb{R}$, then we define
$$D(f \| g) = \int_{-\infty}^{\infty} f(x) \log\left(\frac{f(x)}{g(x)}\right) dx.$$
Compute $D(f \| g)$ when $f$ is the probability density function of a Gaussian distribution with mean $\mu_1$ and variance $\sigma^2$ and $g$ is the probability density function of a Gaussian distribution with mean $\mu_2$ and variance $\sigma^2$.
The next two exercises deal with the topic of Maximum Entropy Estimation.
*27. We have shown in Section 1.6 that the entropy of the uniform distribution on a sample space with $N$ elements is $\log(N)$ and this is the maximum value of the entropy for any distribution defined on that sample space. In many situations, we need to find a probability distribution that satisfies certain constraints and has the maximum entropy of all the distributions that satisfy those constraints. One type of constraint that is common is to require that the mean of the distribution should have a certain value. This can be done using Lagrange multipliers. Find the probability distribution on $\{1, 2, 3, 4\}$ that has maximum entropy subject to the condition that the mean of the distribution takes a specified value. To do this you have to find $p_1$, $p_2$, $p_3$ and $p_4$ that maximise the entropy subject to the normalisation and mean constraints.
*28. Find the probability distribution on $\{1, 2, 3, 4\}$ that has maximum entropy subject to the conditions that the mean of the distribution and the second moment of the distribution take specified values. In this case the constraints are the normalisation condition together with the conditions on the first and second moments.
The next two exercises deal with the topic of Mutual Information.
*29. If $P$ is a joint probability distribution on $X \times Y$, the Mutual Information of $X$ and $Y$, denoted $I(X; Y)$, is the Kullback-Leibler Divergence between $P$ and the product of the marginals $P_X P_Y$, given by
$$I(X; Y) = \sum_{i} \sum_{j} P(x_i, y_j) \log\left(\frac{P(x_i, y_j)}{P_X(x_i) P_Y(y_j)}\right).$$
1.21 References
[1] R. Ash, Information Theory, John Wiley & Sons, New York, 1965.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley &
Sons, New York, 1991.
[3] D. S. Jones, Elementary Information Theory, Oxford University Press, Oxford,
1979.
[4] H. S. Leff and A. F. Rex, Eds., Maxwell’s Demon: Entropy, Information, Com-
puting, Adam Hilger, Bristol, 1990.
[5] R. D. Rosenkrantz, Ed., E. T. Jaynes: Papers on Probability, Statistics and Sta-
tistical Physics, D. Reidel, Dordrecht, 1983.
[6] C. E. Shannon and W. Weaver, The Mathematical Theory Of Communication,
The University of Illinois Press, Urbana, IL, 1949.