ch1. Information Coding Theory
1.1 Structure
any other part of the melody. In fact, pieces of music often have a complicated,
multi-layered structure, which is not obvious to the casual listener.
In this book, we are going to be concerned with things that have structure. The kinds
of structure we will be concerned with will be like the structure of pieces of music.
They will not be fixed and obvious.
We can extend a probability distribution $P$ from $S$ to the set of all subsets of $S$, which we denote by $\mathcal{P}(S)$, by setting $P(E) = \sum_{s \in E} P(s)$ for any $E \in \mathcal{P}(S)$. Note that $P(\emptyset) = 0$.
An event whose probability is $0$ is impossible and an event whose probability is $1$ is certain to occur.
If $E$ and $F$ are events and $E \cap F = \emptyset$, then $P(E \cup F) = P(E) + P(F)$.
DEFINITION 1.3 Expected Value If $S = \{s_1, s_2, \ldots, s_N\}$ is a sample space with probability distribution $P$, and $F$ is a function from the sample space to a vector space $V$, the expected value of $F$ is
$$E[F] = \sum_{i=1}^{N} P(s_i) F(s_i).$$
NOTE We will often have equations that involve summation over the elements of a finite set. In the equations above, the set has been $S = \{s_1, s_2, \ldots, s_N\}$ and the summation has been denoted by $\sum_{i=1}^{N}$. In other places in the text we will denote such summations simply by $\sum_{s \in S}$.
Defining the surprise as the negative logarithm of the probability not only gives us the
appropriate limiting values as the probability tends to $0$ or $1$, it also makes surprise
additive. If several independent events occur in succession, the total surprise they
generate is the sum of their individual surprises.
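To make the additivity concrete, here is a minimal Python sketch (the probabilities used are arbitrary illustrative values, not taken from the text):

```python
import math

def surprise(p: float) -> float:
    """Surprise, in bits, of an outcome that has probability p."""
    return -math.log2(p)

# Two independent events with (arbitrary) probabilities 0.5 and 0.25.
p, q = 0.5, 0.25

# The probability that both occur is the product p * q, so the surprise
# of the joint occurrence is the sum of the individual surprises.
print(surprise(p), surprise(q), surprise(p * q))  # 1.0, 2.0, 3.0
assert math.isclose(surprise(p * q), surprise(p) + surprise(q))
```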
DEFINITION 1.5 Entropy We can restrict the surprise to the sample space and consider it to be a function from the sample space to the real numbers. The expected value of the surprise,
$$H(P) = -\sum_{i=1}^{N} P(s_i) \log P(s_i),$$
is the entropy of the probability distribution.
The concept of entropy was introduced into thermodynamics in the nineteenth cen-
tury. It was considered to be a measure of the extent to which a system was disor-
dered. The tendency of systems to become more disordered over time is described by
the Second Law of Thermodynamics, which states that the entropy of a system can-
not spontaneously decrease. In the 1940’s, Shannon [6] introduced the concept into
communications theory and founded the subject of information theory. It was then
realised that entropy is a property of any stochastic system and the concept is now
used widely in many fields. Today, information theory (as described in books such
as [1], [2], [3]) is still principally concerned with communications systems, but there
are widespread applications in statistics, information processing and computing (see
[2], [4], [5]).
Let us consider some examples of probability distributions and see how the entropy is related to predictability. First, let us note the form of the function
$$f(p) = -p \log(p),$$
where $0 \le p \le 1$ and $\log$ denotes the logarithm to base 2. (The actual base does not matter, but we shall be using base 2 throughout the rest of this book, so we may as well start here.) The graph of this function is shown in Figure 1.1.
Note that $-p \log(p)$ approaches $0$ as $p$ tends to $0$ and also as $p$ tends to $1$. This means that outcomes that are almost certain to occur and outcomes that are unlikely to occur both contribute little to the entropy. Outcomes whose probability is neither close to $0$ nor close to $1$ make a comparatively large contribution to the entropy.
EXAMPLE 1.1
Let $S = \{s_1, s_2\}$ with $P(s_1) = 0.5$ and $P(s_2) = 0.5$. The entropy is
$$-0.5 \log(0.5) - 0.5 \log(0.5) = 1 \text{ bit}.$$
In this case, $s_1$ and $s_2$ are equally likely to occur and the situation is as unpredictable as it can be.
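The entropy of any finite distribution can be computed directly from the definition; the following Python sketch is our own illustration, and the first value reproduces Example 1.1:

```python
import math

def entropy(probs):
    """Entropy, in bits, of a probability distribution given as a sequence of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely outcomes (Example 1.1)
print(entropy([1.0, 0.0]))   # 0.0 bits: an entirely predictable situation
print(entropy([0.9, 0.1]))   # about 0.47 bits: a fairly predictable situation
```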
FIGURE 1.1
The graph of $-p \log(p)$.
EXAMPLE 1.2
Let $S = \{s_1, s_2\}$, with $P(s_1)$ much greater than $P(s_2)$. In this case, the situation is more predictable, with $s_1$ more than thirty times more likely to occur than $s_2$. The entropy is close to zero.
EXAMPLE 1.3
Let $S = \{s_1, s_2\}$, with $P(s_1) = 1$ and $P(s_2) = 0$. Using the convention that $0 \log(0) = 0$, the entropy is $0$. The situation is entirely predictable, as $s_1$ always occurs.
EXAMPLE 1.4
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) = 1/N$ for each $i$. The entropy is $\log(N)$ and the situation is as unpredictable as it can be.
EXAMPLE 1.5
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_1)$ close to $1$ and $P(s_i)$ very small for $i \ge 2$.
The entropy is low and the situation is fairly predictable, as $s_1$ will occur far more frequently than any other outcome.
EXAMPLE 1.6
Let $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_1)$ and $P(s_2)$ close to $0.5$ and $P(s_i)$ very small for $i \ge 3$. The entropy is close to $1$ bit and the situation is about as predictable as in Example 1.1 above, with outcomes $s_1$ and $s_2$ equally likely to occur and the others very unlikely to occur.
PROPOSITION 1.1
If $H_e$ is the entropy of a probability distribution measured using natural logarithms, and $H_2$ is the entropy of the same probability distribution measured using logarithms to the base 2, then these quantities are related as follows. Since
$$\log_2(x) = \frac{\ln(x)}{\ln(2)} \qquad (1.2)$$
for any $x > 0$, we have
$$H_2 = -\sum_{i=1}^{N} P(s_i) \log_2 P(s_i) = -\frac{1}{\ln(2)} \sum_{i=1}^{N} P(s_i) \ln P(s_i). \qquad (1.3)$$
It follows that
$$H_2 = \frac{H_e}{\ln(2)}. \qquad (1.4)$$
THEOREM 1.1
If we have a sample space with $N$ elements, the maximum value of the entropy function is $\log(N)$.
PROOF We wish to maximise
$$H = -\sum_{i=1}^{N} p_i \ln(p_i) \qquad (1.5)$$
subject to the constraint
$$\sum_{i=1}^{N} p_i = 1. \qquad (1.6)$$
We introduce the Lagrange multiplier $\lambda$, and put
$$F = -\sum_{i=1}^{N} p_i \ln(p_i) + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right). \qquad (1.7)$$
The conditions for a maximum are
$$\frac{\partial F}{\partial p_i} = 0 \qquad (1.8)$$
for $i = 1, 2, \ldots, N$ and
$$\frac{\partial F}{\partial \lambda} = 0. \qquad (1.9)$$
Differentiating with respect to $p_i$ gives
$$\frac{\partial F}{\partial p_i} = -\ln(p_i) - 1 + \lambda = 0, \qquad (1.10)$$
so
$$p_i = e^{\lambda - 1} \qquad (1.11)$$
for each $i$. The remaining condition gives
$$\sum_{i=1}^{N} e^{\lambda - 1} = N e^{\lambda - 1} = 1, \qquad (1.12)$$
which can be solved for $\lambda$, or can be used directly to give
$$p_i = \frac{1}{N} \qquad (1.13)$$
for all $i$. Using these values for the $p_i$, we get
$$H = -\sum_{i=1}^{N} \frac{1}{N} \ln\left(\frac{1}{N}\right) = \ln(N), \qquad (1.14)$$
and by Proposition 1.1 the same argument gives $\log(N)$ when logarithms to the base 2 are used.
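As a quick numerical check of Theorem 1.1 (an illustrative sketch, not part of the proof), no randomly generated distribution on $N$ outcomes exceeds the entropy $\log(N)$ of the uniform distribution:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 4
print(entropy([1.0 / N] * N), math.log2(N))  # both equal 2.0 bits for N = 4

for _ in range(10_000):
    weights = [random.random() for _ in range(N)]
    probs = [w / sum(weights) for w in weights]
    assert entropy(probs) <= math.log2(N) + 1e-12
```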
LEMMA 1.1
If $p_1, p_2, \ldots, p_N$ and $q_1, q_2, \ldots, q_N$ are all non-negative numbers that satisfy the conditions
$$\sum_{i=1}^{N} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{N} q_i = 1,$$
then
$$-\sum_{i=1}^{N} p_i \log(p_i) \le -\sum_{i=1}^{N} p_i \log(q_i),$$
with equality if and only if $p_i = q_i$ for all $i$.
PROOF We prove the result for the natural logarithm; the result for any other base follows immediately from the identity
$$\log_b(x) = \frac{\ln(x)}{\ln(b)}. \qquad (1.16)$$
Using the inequality $\ln(x) \le x - 1$, with equality if and only if $x = 1$, we have
$$\sum_{i=1}^{N} p_i \ln\left(\frac{q_i}{p_i}\right) \le \sum_{i=1}^{N} p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_{i=1}^{N} q_i - \sum_{i=1}^{N} p_i = 0,$$
so
$$-\sum_{i=1}^{N} p_i \ln(p_i) \le -\sum_{i=1}^{N} p_i \ln(q_i), \qquad (1.21)$$
with equality if and only if $q_i = p_i$ for all $i$.
Note that putting $q_i = 1/N$ for all $i$ in this inequality gives us an alternative proof that the maximum value of the entropy function is $\log(N)$.
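Lemma 1.1 can also be checked numerically. In the sketch below (our own illustration), cross_entropy(p, p) is just the entropy of p, and the inequality holds for every randomly generated pair of distributions:

```python
import math
import random

def cross_entropy(p, q):
    """-sum_i p_i log2 q_i; equals the entropy of p when q == p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n):
    weights = [random.random() for _ in range(n)]
    return [w / sum(weights) for w in weights]

for _ in range(10_000):
    p, q = random_distribution(5), random_distribution(5)
    assert cross_entropy(p, p) <= cross_entropy(p, q) + 1e-9
```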
There are many situations in which it is useful to consider sample spaces that are the Cartesian product of two or more sets. If $X = \{x_1, x_2, \ldots, x_M\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$, a joint probability distribution function on $X \times Y$ is a function $P$ satisfying
$$P(x_i, y_j) \ge 0 \qquad (1.22)$$
for $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, N$, and
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) = 1. \qquad (1.23)$$
If $P$ is the joint probability distribution function on $X \times Y$, the definition of entropy becomes
$$H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j). \qquad (1.24)$$
If we want to emphasise the spaces $X$ and $Y$, we will denote the entropy of the joint probability distribution on $X \times Y$ by $H(X \times Y)$ or simply by $H(X, Y)$. This is known as the joint entropy of $X$ and $Y$.
Given a joint probability distribution $P$ on $X \times Y$, the marginal distribution on $X$ is given by $P_X(x_i) = \sum_{j=1}^{N} P(x_i, y_j)$, and the marginal distribution on $Y$ by $P_Y(y_j) = \sum_{i=1}^{M} P(x_i, y_j)$.
If there are probability distributions $P_X$ and $P_Y$ on $X$ and $Y$, respectively, and these are independent, the joint probability distribution on $X \times Y$ is given by $P(x_i, y_j) = P_X(x_i) P_Y(y_j)$.
There is a simple relationship between the entropy of the joint probability distribution function and that of the marginal distribution functions.
THEOREM 1.2
If $P$ is a joint probability distribution function on $X \times Y$, and $P_X$ and $P_Y$ are the marginal distributions on $X$ and $Y$, respectively, then
$$H(X, Y) \le H(X) + H(Y),$$
with equality if and only if the marginal distributions are independent.
PROOF
$$H(X) = -\sum_{i=1}^{M} P_X(x_i) \log P_X(x_i) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P_X(x_i), \qquad (1.29)$$
and similarly
$$H(Y) = -\sum_{j=1}^{N} \sum_{i=1}^{M} P(x_i, y_j) \log P_Y(y_j). \qquad (1.30)$$
So
$$H(X) + H(Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log\bigl(P_X(x_i) P_Y(y_j)\bigr). \qquad (1.31)$$
Also,
$$H(X, Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j). \qquad (1.32)$$
Since
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) = 1 \qquad (1.33)$$
and
$$\sum_{i=1}^{M} \sum_{j=1}^{N} P_X(x_i) P_Y(y_j) = 1, \qquad (1.34)$$
we can use the inequality of Lemma 1.1 to conclude that
$$H(X, Y) \le H(X) + H(Y), \qquad (1.35)$$
with equality if and only if $P(x_i, y_j) = P_X(x_i) P_Y(y_j)$ for all $i$ and $j$, that is, if the two marginal distributions are independent.
If $A$ and $B$ are events in a sample space $S$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is $P(A \mid B) = P(A \cap B)/P(B)$. It is obvious that
$$P(A \cap B) = P(A \mid B) P(B). \qquad (1.37)$$
Almost as obvious is one form of Bayes' Theorem:
THEOREM 1.3
If $S$ is a sample space with a probability distribution function $P$, and $A$ and $B$ are events in $S$, then
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \qquad (1.38)$$
EXAMPLE 1.7
We have two jars, A and B. Jar A contains 8 green balls and 2 red balls. Jar B contains 3 green balls and 7 red balls. One jar is selected at random and a ball is drawn from it.
We have probabilities as follows. The set of jars forms one sample space, $J = \{A, B\}$, with
$$P(A) = 0.5, \quad P(B) = 0.5,$$
as one jar is as likely to be chosen as the other.
The set of colours forms another sample space, $C = \{green, red\}$. The probability of drawing a green ball is
$$P(green) = \frac{11}{20} = 0.55,$$
as 11 of the 20 balls in the jars are green. Similarly,
$$P(red) = \frac{9}{20} = 0.45.$$
We have a joint probability distribution over the colours of the balls and the jars, with the probability of selecting Jar A and drawing a green ball being given by
$$P(A, green) = 0.5 \times \frac{8}{10} = 0.4.$$
We have the conditional probabilities: given that Jar A was selected, the probability of drawing a green ball is
$$P(green \mid A) = \frac{8}{10} = 0.8,$$
and the probability of drawing a red ball is
$$P(red \mid A) = \frac{2}{10} = 0.2.$$
The conditional probability of $y_j$ given $x_i$ is
$$P(y_j \mid x_i) = \frac{P(x_i, y_j)}{P_X(x_i)}. \qquad (1.40)$$
The entropy of this conditional distribution is
$$H(Y \mid x_i) = -\sum_{j=1}^{N} P(y_j \mid x_i) \log P(y_j \mid x_i), \qquad (1.41)$$
and the conditional entropy of $Y$ given $X$ is its expected value,
$$H(Y \mid X) = -\sum_{i=1}^{M} P_X(x_i) \sum_{j=1}^{N} P(y_j \mid x_i) \log P(y_j \mid x_i). \qquad (1.42)$$
Since $P(x_i, y_j) = P(y_j \mid x_i) P_X(x_i)$, we can re-write this as
$$H(Y \mid X) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(y_j \mid x_i). \qquad (1.43)$$
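The probabilities of Example 1.7 and the conditional entropy defined above can be computed directly; the sketch below is our own working of that example, with variable names of our choosing:

```python
import math

# Joint distribution P(jar, colour) for Example 1.7:
# a jar is chosen with probability 0.5, then a ball is drawn from it.
joint = {
    ("A", "green"): 0.5 * 8 / 10, ("A", "red"): 0.5 * 2 / 10,
    ("B", "green"): 0.5 * 3 / 10, ("B", "red"): 0.5 * 7 / 10,
}

p_colour = {c: sum(p for (j, col), p in joint.items() if col == c) for c in ("green", "red")}
p_jar = {j: sum(p for (jar, c), p in joint.items() if jar == j) for j in ("A", "B")}
print(p_colour)  # {'green': 0.55, 'red': 0.45}

# Conditional entropy H(colour | jar) = -sum P(jar, colour) log2 P(colour | jar).
h_colour_given_jar = -sum(p * math.log2(p / p_jar[j]) for (j, c), p in joint.items())
print(h_colour_given_jar)  # about 0.80 bits
```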
THEOREM 1.4
$$H(X, Y) = H(X) + H(Y \mid X).$$
PROOF
$$H(X) + H(Y \mid X) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P_X(x_i) - \sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(y_j \mid x_i) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(x_i, y_j) \log P(x_i, y_j) = H(X, Y). \qquad (1.44)$$
THEOREM 1.5
$$H(Y \mid X) \le H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
PROOF By Theorem 1.4, $H(Y \mid X) = H(X, Y) - H(X)$, and by Theorem 1.2, $H(X, Y) \le H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are independent; the result follows.
This result is obviously symmetric in $X$ and $Y$; so we also have
$$H(X \mid Y) \le H(X),$$
with equality if and only if $X$ and $Y$ are independent. We can sum up this result by saying that conditioning reduces entropy or conditioning reduces uncertainty.
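Theorems 1.2, 1.4 and 1.5 can be verified numerically on randomly generated joint distributions; the following sketch is illustrative only:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

for _ in range(1000):
    # Random joint distribution P(x, y) on a 3 x 4 grid.
    weights = [[random.random() for _ in range(4)] for _ in range(3)]
    total = sum(sum(row) for row in weights)
    P = [[w / total for w in row] for row in weights]

    p_x = [sum(row) for row in P]
    p_y = [sum(P[i][j] for i in range(3)) for j in range(4)]
    h_xy = entropy([P[i][j] for i in range(3) for j in range(4)])

    # H(Y | X) from the definition (1.43).
    h_y_given_x = -sum(P[i][j] * math.log2(P[i][j] / p_x[i])
                       for i in range(3) for j in range(4))

    assert math.isclose(entropy(p_x) + h_y_given_x, h_xy)       # Theorem 1.4
    assert h_y_given_x <= entropy(p_y) + 1e-9                   # Theorem 1.5
    assert h_xy <= entropy(p_x) + entropy(p_y) + 1e-9           # Theorem 1.2
```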
signals. The first term is used by statisticians, the second by mathematicians and the
third by engineers. This may reflect differences in the way these people approach the
subject: statisticians are primarily interested in describing such sequences in terms
of probability theory, mathematicians are interested in the behaviour of such series
and the ways in which they may be generated and engineers are interested in ways of
using such sequences and processing them to extract useful information from them.
A device or situation that produces such a sequence is called an information source.
The elements of the sequence are usually drawn from a finite set, which may be
referred to as the alphabet. The source can be considered to be emitting an element
of the alphabet at each instant of a sequence of instants in time. The elements of the
alphabet are referred to as symbols.
EXAMPLE 1.8
Tossing a coin repeatedly and recording the outcomes as heads (H) or tails (T) gives us a random sequence whose alphabet is $\{H, T\}$.
EXAMPLE 1.9
Throwing a die repeatedly and recording the number of spots on the uppermost face gives us a random sequence whose alphabet is $\{1, 2, 3, 4, 5, 6\}$.
EXAMPLE 1.10
Computers and telecommunications equipment generate sequences of bits which are random sequences whose alphabet is $\{0, 1\}$.
EXAMPLE 1.11
A text in the English language is a random sequence whose alphabet is the set con-
sisting of the letters of the alphabet, the digits and the punctuation marks. While we
normally consider text to be meaningful rather than random, it is only possible to
predict which letter will come next in the sequence in probabilistic terms, in general.
The last example above illustrates the point that a random sequence may not appear
to be random at first sight. The difference between the earlier examples and the final
example is that in the English language there are correlations between each letter in
the sequence and those that precede it. In contrast, there are no such correlations in
the cases of tossing a coin or throwing a die repeatedly. We will consider both kinds
of information sources below.
An obvious question that is raised by the term “information source” is: What is the
“information” that the source produces? A second question, perhaps less obvious,
is: How can we measure the information produced by an information source?
An information source generates a sequence of symbols which has a certain degree
of unpredictability. The more unpredictable the sequence is, the more information
is conveyed by each symbol. The information source may impose structure on the
sequence of symbols. This structure will increase the predictability of the sequence
and reduce the information carried by each symbol.
The random behaviour of the sequence may be described by probability distribu-
tions over the alphabet. If the elements of the sequence are uncorrelated, a simple
probability distribution over the alphabet may suffice. In other cases, conditional
probability distributions may be required.
We have already seen that entropy is a measure of predictability. For an information
source, the information content of the sequence that it generates is measured by the
entropy per symbol. We can compute this if we make assumptions about the kinds
of structure that the information source imposes upon its output sequences.
To describe an information source completely, we need to specify both the alphabet $A$ and the probability distribution $P$ that governs the generation of sequences. The entropy of the information source with alphabet $A$ and probability distribution $P$ will be denoted by $H(P)$ in the following sections, even though it is actually the entropy of $P$. Later on, we will wish to concentrate on the alphabet and will use $H(A)$ to denote the entropy of the information source, on the assumption that the alphabet $A$ will have a probability distribution $P$ associated with it.
EXAMPLE 1.12
Tossing a fair coin gives us an example of a stationary memoryless information source.
At any instant, the probability distribution is given by $P(H) = 0.5$, $P(T) = 0.5$.
This probability distribution has an entropy of 1 bit; so the information content is 1
bit/symbol.
EXAMPLE 1.13
As an example of a non-stationary memoryless information source, suppose we have a
fair coin and a die with painted on four faces and painted on two faces. Tossing the
coin and throwing the die in alternation will create a memoryless information source
with alphabet . Every time the coin is tossed, the probability distribution of
the outcomes is ,
, and every time the die is thrown, the
probability distribution is
,
.
The probability distribution of the outcomes of tossing the coin has an entropy of 1
bit. The probability distribution of the outcomes of throwing the die has an entropy
of 0.918 bits. The information content of the sequence is the average entropy per
symbol, which is 0.959 bits/symbol.
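The following sketch (ours) reproduces the two entropies quoted in Example 1.13:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_coin = entropy([0.5, 0.5])       # 1.0 bit for the fair coin
h_die = entropy([2 / 3, 1 / 3])    # about 0.918 bits for the painted die
print(h_coin, h_die, (h_coin + h_die) / 2)   # average: about 0.959 bits/symbol
```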
Markov sources and n-gram models are descriptions of a class of information sources
with memory.
To generate a sequence, a state is selected on the basis of the initial probability distri-
bution. A transition from this state to another state (or to the same state) is selected
on the basis of the transition probabilities, and the label of this transition is output.
This process is repeated to generate the sequence of output symbols.
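The generation procedure just described is easy to simulate. The sketch below uses a small made-up source; its states, labels and probabilities are our own and are not those of Example 1.14:

```python
import random

# transitions[state] is a list of (next_state, label, probability) triples.
transitions = {
    "s1": [("s2", "1", 0.7), ("s3", "0", 0.3)],
    "s2": [("s1", "0", 0.5), ("s3", "1", 0.5)],
    "s3": [("s1", "1", 1.0)],
}
initial = {"s1": 0.5, "s2": 0.3, "s3": 0.2}

def generate(length):
    state = random.choices(list(initial), weights=list(initial.values()))[0]
    output = []
    for _ in range(length):
        options = transitions[state]
        state, label, _ = random.choices(options, weights=[p for _, _, p in options])[0]
        output.append(label)
    return "".join(output)

print(generate(20))
```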
It is convenient to represent Markov models diagrammatically in the form of a graph,
with the states represented by vertices and the transitions by edges, as in the follow-
ing example.
FIGURE 1.2
Diagrammatic representation of a Markov source.
EXAMPLE 1.14
Consider a Markov source with alphabet $\{0, 1\}$ and set of states $\{\sigma_1, \sigma_2, \sigma_3\}$.
Suppose there are four transitions, whose labels and probabilities are shown in Figure 1.2.
EXAMPLE 1.15
The following probabilities give us a 3-gram model on the language $\{0, 1\}$:
To describe the relationship between n-gram models and Markov sources, we need
to look at special cases of Markov sources.
When we have an $n$th-order Markov model, the transition probabilities are usually given in terms of the probabilities of single symbols being emitted when the source is in a given state. For example, in a second-order Markov model on $\{0, 1\}$, the transition probability from the state $00$ to the state $01$, which would be represented by $P(01 \mid 00)$, would be represented instead by the probability of emission of $1$ when in the state $00$, that is $P(1 \mid 00)$. Obviously, some transitions are impossible. For example, it is not possible to go from the state $00$ to the state $11$, as the state following $00$ must have $0$ as its first symbol.
We can construct an $(n-1)$th-order Markov model from an $n$-gram model and an $(n-1)$-gram model. The $n$-gram model gives us the probabilities of strings of length $n$, such as $P(s_1 s_2 \cdots s_n)$, while the $(n-1)$-gram model gives us the probabilities of the states, which are strings of length $n-1$, such as $P(s_1 s_2 \cdots s_{n-1})$. To find the emission probability of $s_n$ from this state, we set
$$P(s_n \mid s_1 s_2 \cdots s_{n-1}) = \frac{P(s_1 s_2 \cdots s_n)}{P(s_1 s_2 \cdots s_{n-1})}. \qquad (1.45)$$
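Equation (1.45) translates directly into code. In the sketch below the 3-gram probabilities are placeholder values invented for illustration, and the 2-gram probabilities are taken to be their prefix sums; only the construction itself follows the text:

```python
# Hypothetical 3-gram probabilities on the alphabet {0, 1} (illustrative only).
p3 = {"000": 0.30, "001": 0.10, "010": 0.06, "011": 0.04,
      "100": 0.10, "101": 0.05, "110": 0.20, "111": 0.15}

# For consistency, take the 2-gram probabilities to be the prefix marginals of p3.
p2 = {s: sum(p for t, p in p3.items() if t.startswith(s)) for s in ("00", "01", "10", "11")}

# Emission probabilities of the second-order Markov source, via equation (1.45):
# P(c | ab) = P(abc) / P(ab).
emission = {(ab, c): p3[ab + c] / p2[ab] for ab in p2 for c in "01"}
for (state, symbol), prob in sorted(emission.items()):
    print(f"P({symbol}|{state}) = {prob:.3f}")
```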
EXAMPLE 1.16
In the previous Example 1.15 we had a 3-gram model on the language $\{0, 1\}$, given by the probabilities of the eight 3-grams.
FIGURE 1.3
Diagrammatic representation of a Markov source equivalent to a 3-gram model. The source has states $00$, $01$, $10$ and $11$, with emission probabilities $P(0 \mid 00) = 0.8$, $P(1 \mid 00) = 0.2$, $P(0 \mid 01) = 0.5$, $P(1 \mid 01) = 0.5$, $P(0 \mid 10) = 0.8$, $P(1 \mid 10) = 0.2$, $P(0 \mid 11) = 0.6$ and $P(1 \mid 11) = 0.4$.
If a 2-gram model for the same source is given by the probabilities of the four 2-grams $00$, $01$, $10$ and $11$, then we can construct a second-order Markov source as follows: for each state, the emission probabilities are obtained from equation (1.45), and the resulting source is the one shown in Figure 1.3.
In general, a Markov source with $N$ states can be described by its transition matrix
$$\Pi = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1N} \\ P_{21} & P_{22} & \cdots & P_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N1} & P_{N2} & \cdots & P_{NN} \end{bmatrix}, \qquad (1.47)$$
where $P_{ij}$ is the probability of a transition from the $j$th state to the $i$th state. Because they all represent probability distributions, each of the columns of $\Pi$ must add up to $1$; that is, the $P_{ij}$ must add up to $1$ for each $j$.
EXAMPLE 1.17
Consider a source with transition matrix $\Pi$ and an initial probability distribution $W(0)$ over its states. Then the distribution after one step is $W(1) = \Pi W(0)$. Similarly, $W(2) = \Pi W(1)$, and so on.
Suppose instead that $W$ is a probability distribution over the states satisfying
$$\Pi W = W,$$
so that applying $\Pi$ leaves it unchanged. For a three-state source, one of the resulting equations gives a relation between $W_1$ and $W_2$, and the other two give relations involving $W_1$, $W_2$ and $W_3$. Combining these with the condition
$$W_1 + W_2 + W_3 = 1,$$
we get $W_2$, which gives us $W_1$ and $W_3$. So the stationary distribution is the distribution $W$ obtained in this way.
In the examples above, the source has a unique stationary distribution. This is not always the case.
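Finding a stationary distribution amounts to solving $\Pi W = W$ together with the normalisation condition. A minimal sketch, using a made-up three-state transition matrix rather than the one in the example:

```python
import numpy as np

# Hypothetical 3-state transition matrix (columns sum to 1); illustrative only.
Pi = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.6, 0.3],
    [0.2, 0.2, 0.4],
])

# The stationary distribution W satisfies Pi @ W = W and sum(W) = 1.
# Replace one row of (Pi - I) with the normalisation condition and solve.
A = Pi - np.eye(3)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
W = np.linalg.solve(A, b)
print(W, Pi @ W)  # Pi @ W reproduces W
```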
EXAMPLE 1.19
Consider the source with four states and probability transition matrix
The diagrammatic representation of this source is shown in Figure 1.4.
FIGURE 1.4
A source with two stationary distributions.
Some Markov sources have the property that every sequence generated by the source
has the same statistical properties. That is, the various frequencies of occurrence of
symbols, pairs of symbols, and so on, obtained from any sequence generated by the
source will, as the length of the sequence increases, approach some definite limit
which is independent of the particular sequence. Sources that have this property are
called ergodic sources.
The source of Example 1.19 is not an ergodic source. The sequences generated by
that source fall into two classes, one of which is generated by sequences of states
that end in the first state, the other of which is generated by sequences that end in the
fourth state. The fact that there are two distinct stationary distributions shows that
the source is not ergodic.
There are various ways of defining the entropy of an information source. The fol-
lowing is a simple approach which applies to a restricted class of Markov sources.
If we denote the probability distribution on the set of transitions from the $i$th state by $P_i$, then the entropy of the $i$th state is given by
$$H_i = -\sum_{j} P_i(j) \log P_i(j).$$
DEFINITION 1.19 Unifilar Markov Source A unifilar Markov source is one with
the property that the labels on the transitions from any given state are all distinct.
We need this property in order to be able to define the entropy of a Markov source.
We assume that the source has a stationary distribution.
The entropy of a unifilar Markov source $S$, whose stationary distribution assigns probability $W_i$ to the $i$th state and whose $i$th state has entropy $H_i$, is
$$H(S) = \sum_{i} W_i H_i. \qquad (1.52)$$
It can be shown that this definition is consistent with more general definitions of the
entropy of an information source.
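As an illustration, the sketch below applies this definition to the source of Figure 1.3. The stationary distribution and the entropy it prints are our own calculation from the emission probabilities shown in that figure, not values quoted from the text:

```python
import math
import numpy as np

states = ["00", "01", "10", "11"]
# Emission probabilities P(symbol | state) from Figure 1.3.
emission = {
    "00": {"0": 0.8, "1": 0.2}, "01": {"0": 0.5, "1": 0.5},
    "10": {"0": 0.8, "1": 0.2}, "11": {"0": 0.6, "1": 0.4},
}

def next_state(state, symbol):
    return state[1] + symbol  # second-order source: drop the oldest symbol

# Build the transition matrix (columns indexed by the current state).
Pi = np.zeros((4, 4))
for j, s in enumerate(states):
    for symbol, p in emission[s].items():
        Pi[states.index(next_state(s, symbol)), j] += p

# Stationary distribution: solve Pi @ W = W with sum(W) = 1.
A = Pi - np.eye(4)
A[-1, :] = 1.0
W = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))

state_entropy = [-sum(p * math.log2(p) for p in emission[s].values()) for s in states]
H = sum(w * h for w, h in zip(W, state_entropy))   # equation (1.52)
print(dict(zip(states, W.round(4))), round(H, 4))
```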
EXAMPLE 1.20
For the Markov source of Example 1.14, there are three states, $\sigma_1$, $\sigma_2$ and $\sigma_3$. The probability distribution on the set of transitions from each state is read off from Figure 1.2, and its entropy is computed directly from the definition, using the usual convention that $0 \log(0) = 0$.
EXAMPLE 1.21
For the source of Example 1.16, the states are $00$, $01$, $10$ and $11$. The probability distribution on the set of transitions from each state is given by the emission probabilities shown in Figure 1.3, and its entropy is computed as in the previous example. The entropy of the source is then the sum of the state entropies weighted by the probabilities of the states in the stationary distribution.
The following results apply to ergodic Markov sources and are stated without proof. In a sense, they justify the use of the conditional probabilities of emission of symbols instead of transition probabilities between states in $n$th-order Markov models.
THEOREM 1.6
Given any $\delta > 0$ and any $\epsilon > 0$, we can find a positive integer $N_0$ such that all sequences of length $N \ge N_0$ fall into two classes: a set of sequences whose total probability is less than $\epsilon$; and the remainder, for which the following inequality holds:
$$\left| \frac{\log(1/p)}{N} - H \right| < \delta, \qquad (1.53)$$
where $p$ is the probability of the sequence and $H$ is the entropy of the source.
THEOREM 1.7
Let $S$ be a Markov source with alphabet $A$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^N$, let $P(s)$ be the probability of the sequence $s$ being emitted by the source. Define
$$G_N = -\frac{1}{N} \sum_{s \in A^N} P(s) \log P(s), \qquad (1.54)$$
which is the entropy per symbol of the sequences of $N$ symbols. Then $G_N$ is a monotonic decreasing function of $N$ and
$$\lim_{N \to \infty} G_N = H. \qquad (1.55)$$
THEOREM 1.8
Let $S$ be a Markov source with alphabet $A$, and entropy $H$. Let $A^{N-1}$ denote the set of all sequences of symbols from $A$ of length $N-1$. For $s \in A^{N-1}$ and $a \in A$, let $P(s, a)$ be the probability of the source emitting the sequence $s$ followed by the symbol $a$, and let $P(a \mid s)$ be the conditional probability of the symbol $a$ being emitted immediately after the sequence $s$. Define
$$F_N = -\sum_{s \in A^{N-1}} \sum_{a \in A} P(s, a) \log P(a \mid s), \qquad (1.56)$$
which is the conditional entropy of the next symbol when the $(N-1)$ preceding symbols are known. Then $F_N$ is a monotonic decreasing function of $N$ and
$$\lim_{N \to \infty} F_N = H. \qquad (1.57)$$
THEOREM 1.9
If $G_N$ and $F_N$ are defined as in the previous theorems, then
$$F_N = N G_N - (N-1) G_{N-1}, \qquad (1.58)$$
$$G_N = \frac{1}{N} \sum_{n=1}^{N} F_n, \qquad (1.59)$$
and
$$F_N \le G_N. \qquad (1.60)$$
These results show that a series of approximations to the entropy of a source can be obtained by considering only the statistical behaviour of sequences of symbols of increasing length. The sequence of estimates $F_N$ is a better approximation than the sequence $G_N$. If the dependencies in a source extend over no more than $m$ symbols, so that the conditional probability of the next symbol knowing the preceding $m$ symbols is the same as the conditional probability of the next symbol knowing the preceding $N > m$ symbols, then $F_{m+1} = H$.
For an $n$th-order Markov source, the probability of emission of the symbol $a_j$ is
$$P(a_j) = \sum_{s} P(s) P(a_j \mid s),$$
where $s$ represents a sequence of $n$ symbols from the alphabet of the source (a state of the source), $P(s)$ is the probability of this sequence in the stationary distribution of the Markov source and the summation over $s$ indicates that all such sequences are included in the summation. The adjoint source of this Markov source, denoted $\bar{S}$, is the memoryless source that emits these symbols with the same probabilities.
EXAMPLE 1.22
For the 3-gram model of Example 1.15, we have the transition probabilities shown in Figure 1.3. From these we get the stationary distribution over the states $00$, $01$, $10$ and $11$, and from the stationary distribution we get the probabilities $P(0)$ and $P(1)$ for the adjoint source.
Although the probabilities of emission of single symbols are the same for both the Markov source and its adjoint source, the probabilities of emission of sequences of symbols may not be the same. For example, the probability of emission of a given sequence of symbols by the Markov source is computed from the transition probabilities, while for the adjoint source it is the product of the single-symbol probabilities (by the assumption of independence), and the two values are in general different.
Going from a Markov source to its adjoint reduces the number of constraints on the
output sequence and hence increases the entropy. This is formalised by the following
theorem.
THEOREM 1.10
If $\bar{S}$ is the adjoint of the Markov source $S$, their entropies are related by
$$H(\bar{S}) \ge H(S). \qquad (1.62)$$
PROOF The entropy of the Markov source is
$$H(S) = -\sum_{i} W_i \sum_{j} P(a_j \mid \sigma_i) \log P(a_j \mid \sigma_i), \qquad (1.63)$$
where the summation is over all states and $W_i$ is the probability of the state $\sigma_i$ in the stationary distribution of the source.
The entropy of the adjoint is
$$H(\bar{S}) = -\sum_{j} P(a_j) \log P(a_j),$$
where $P(a_j) = \sum_{i} W_i P(a_j \mid \sigma_i)$; applying the inequality of Lemma 1.1 to each state's emission distribution and summing over the states gives $H(S) \le H(\bar{S})$.
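The inequality of Theorem 1.10 can be checked for the source of Figure 1.3; as before, the numbers are our own calculation from the figure's emission probabilities:

```python
import math
import numpy as np

states = ["00", "01", "10", "11"]
emission = {"00": {"0": 0.8, "1": 0.2}, "01": {"0": 0.5, "1": 0.5},
            "10": {"0": 0.8, "1": 0.2}, "11": {"0": 0.6, "1": 0.4}}

# Transition matrix and stationary distribution, as in the earlier sketch.
Pi = np.zeros((4, 4))
for j, s in enumerate(states):
    for symbol, p in emission[s].items():
        Pi[states.index(s[1] + symbol), j] += p
A = Pi - np.eye(4)
A[-1, :] = 1.0
W = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))

# Markov source entropy: weighted sum of state entropies.
H_markov = sum(w * -sum(p * math.log2(p) for p in emission[s].values())
               for w, s in zip(W, states))

# Adjoint source: emits 0 and 1 with their stationary probabilities.
p0 = sum(w * emission[s]["0"] for w, s in zip(W, states))
H_adjoint = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))

print(round(H_markov, 3), round(H_adjoint, 3))  # the adjoint entropy is larger
assert H_adjoint >= H_markov
```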
EXAMPLE 1.23
Consider a memoryless source $S$ with alphabet $A = \{0, 1\}$ and emission probabilities $P(0)$ and $P(1)$. The $n$th extension of $S$, denoted $S^n$, is the source whose symbols are the sequences of $n$ symbols of $S$, each emitted with probability equal to the product of the probabilities of its individual symbols.
THEOREM 1.11
If $S^n$ is the $n$th extension of the stationary memoryless source $S$, their entropies are related by
$$H(S^n) = nH(S). \qquad (1.67)$$
PROOF The entropy of $S^n$ is
$$H(S^n) = -\sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} P(a_{i_1} a_{i_2} \cdots a_{i_n}) \log P(a_{i_1} a_{i_2} \cdots a_{i_n}), \qquad (1.70)$$
where $P(a_{i_1} a_{i_2} \cdots a_{i_n}) = P(a_{i_1}) P(a_{i_2}) \cdots P(a_{i_n})$, since the source is memoryless.
Breaking $P(a_{i_1} a_{i_2} \cdots a_{i_n})$ into the product of probabilities, and rearranging, we get
$$H(S^n) = -\sum_{k=1}^{n} \left[ \sum_{i_k} P(a_{i_k}) \log P(a_{i_k}) \right] \prod_{m \ne k} \left[ \sum_{i_m} P(a_{i_m}) \right]. \qquad (1.72)$$
Since
$$\sum_{i_m} P(a_{i_m}) = 1 \qquad (1.73)$$
for each $m$, we are left with
$$H(S^n) = -n \sum_{i} P(a_i) \log P(a_i) = nH(S). \qquad (1.74)$$
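Theorem 1.11 is easy to confirm numerically for a small memoryless source; the probabilities below are arbitrary illustrative values:

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

P = {"a": 0.7, "b": 0.2, "c": 0.1}   # an arbitrary memoryless source
n = 3

# The n-th extension emits blocks of n symbols with product probabilities.
block_probs = [math.prod(P[s] for s in block) for block in product(P, repeat=n)]
print(entropy(block_probs), n * entropy(P.values()))  # equal: H(S^n) = n H(S)
```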
EXAMPLE 1.24
Let be the first-order Markov source with alphabet and transition probabil-
ities
EXAMPLE 1.25
Consider the second order Markov source with alphabet and transition proba-
bilities
EXAMPLE 1.26
Consider the fourth order Markov source with alphabet and transition proba-
bilities
We can use these probabilities to compute the transition probabilities of the second extension: for example, the probability of emitting a pair of symbols from a given state is the product of the transition probability for the first symbol and the transition probability for the second symbol from the state reached after the first emission.
THEOREM 1.12
If $S^n$ is the $n$th extension of the Markov source $S$, their entropies are related by
$$H(S^n) = nH(S). \qquad (1.77)$$
The proof is similar to the proof of the corresponding result for memoryless sources.
(Exercise 17 deals with a special case.)
Note that if , then
(1.78)
(1.79)
EXAMPLE 1.27
Suppose the sample space is the set of natural numbers $\{1, 2, 3, \ldots\}$, and the probability distribution is given by
$$P(n) = 2^{-n}. \qquad (1.80)$$
The surprise of the outcome $n$ is $n$ bits, so the entropy of this distribution is $\sum_{n=1}^{\infty} n \, 2^{-n} = 2$ bits.
If the sample space is a continuum, in particular, the real line, the summations become integrals. Instead of the probability distribution function, we use the probability density function $f$, which has the property that
$$P([a, b]) = \int_{a}^{b} f(x)\,dx, \qquad (1.82)$$
where $[a, b]$ is a closed interval and $f$ is the probability density function.
The mean and variance of the probability density function are defined by
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (1.83)$$
and
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx. \qquad (1.84)$$
The obvious generalisation of the definition of entropy for a probability density function defined on the real line is
$$H(f) = -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx, \qquad (1.85)$$
provided this integral exists.
provided this integral exists. This definition was proposed by Shannon, but has been
the subject of debate because it is not invariant with respect to change of scale or
change of co-ordinates in general. It is sometimes known as the differential entropy.
If we accept this definition of the entropy of a continuous distribution, it is easy to
compute the entropy of a Gaussian distribution.
THEOREM 1.13
The entropy of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is
$$H = \ln\left(\sigma\sqrt{2\pi e}\right)$$
in natural units.
PROOF The Gaussian probability density function with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}.$$
By definition,
$$\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \sigma^2; \qquad (1.89)$$
this will be used below.
We now calculate the entropy:
$$H = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx = \int_{-\infty}^{\infty} f(x) \left[ \ln\left(\sigma\sqrt{2\pi}\right) + \frac{(x-\mu)^2}{2\sigma^2} \right] dx = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2} \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2} = \ln\left(\sigma\sqrt{2\pi e}\right). \qquad (1.90)$$
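The closed form just derived can be checked by numerical integration; this is an independent sketch, not part of the text's argument:

```python
import math

mu, sigma = 0.0, 2.0   # arbitrary illustrative parameters

def f(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Approximate -integral of f(x) ln f(x) dx by a Riemann sum over a wide interval.
dx = 0.001
h_numeric = -sum(f(x) * math.log(f(x)) * dx
                 for x in (mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))))

h_closed_form = math.log(sigma * math.sqrt(2 * math.pi * math.e))
print(h_numeric, h_closed_form)   # both approximately 2.112 nats for sigma = 2
```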
If the probability density function is defined over the whole real line, it is not possi-
ble to find a specific probability density function whose entropy is greater than the
entropy of all other probability density functions defined on the real line. However,
if we restrict consideration to probability density functions with a given variance, it
can be shown that the Gaussian distribution has the maximum entropy of all these
distributions. (Exercises 20 and 21 outline the proof of this result.)
We have used $H(f)$ to denote the entropy of the probability distribution whose probability density function is $f$. If $X$ is a random variable whose probability density function is $f$, we will denote its entropy by either $H(f)$ or $H(X)$.
1.20 Exercises
1. Let $S = \{s_1, s_2, s_3\}$ be a sample space with a given probability distribution $P$. Let $F$ be a function defined on $S$ by given values $F(s_1)$, $F(s_2)$, $F(s_3)$. What is the expected value of $F$?
2. Let $S$ be a sample space with a given probability distribution $P$. Let $F$ be a given function from $S$ to $\mathbb{R}$. What is the expected value of $F$?
3. Suppose that a fair die is tossed. What is the expected number of spots on the
uppermost face of the die when it comes to rest? Will this number of spots
ever be seen when the die is tossed?
4. Let $S = \{s_1, s_2, s_3, s_4\}$ be a sample space with a given probability distribution $P$. There are sixteen possible events that can be formed from the elements of $S$. Compute the probability and surprise of these events.
5. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a sample space for some $N$. Compute the entropy of each of the following probability distributions on $S$:
Write down the transition matrix for this source. Is it possible for this source to
generate an output sequence that includes the subsequence ? Is it possible
for this source to generate an output sequence that includes the subsequence
?
10. A 2-gram model on the language is given by the probabilities
Construct a (first-order) Markov source from these models. Draw a diagram
to represent the source and write down its transition matrix.
with all other probabilities . All the 3-gram probabilities are also , except
for
Compute for .
13. Find the stationary distribution of the Markov source whose transition matrix
is
*14. Prove that a Markov source with two states always has a stationary distribution,
provided that none of the transition probabilities are .
15. Find the stationary distribution of the Markov source whose transition matrix
is
The next two exercises prove the result that the Gaussian distribution has the
maximum entropy of all distributions with a given variance.
20. Use the identity $\ln(x) \le x - 1$ to show that if $f$ and $g$ are probability density functions defined on the real line then
$$-\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \le -\int_{-\infty}^{\infty} f(x) \ln g(x)\,dx.$$
Conclude that of all the probability density functions on the real line with variance $\sigma^2$, the Gaussian has the greatest entropy.
22. Compute the entropy of the probability density function of a uniform distribution on a closed interval whose variance is $\sigma^2$ and confirm that it is less than the entropy of a Gaussian distribution with variance $\sigma^2$. (Use the results of Exercise 19.)
The next four exercises are concerned with the Kullback-Leibler Divergence.
*23. Let $P$ and $Q$ be probability distributions on the sample space $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) > 0$ and $Q(s_i) > 0$ for $1 \le i \le N$. The Kullback-Leibler Divergence of $P$ and $Q$ is defined by
$$D(P \| Q) = \sum_{i=1}^{N} P(s_i) \log\left(\frac{P(s_i)}{Q(s_i)}\right).$$
*26. If the sample space is the real line, it is easier to define the Kullback-Leibler Divergence in terms of probability density functions. If $f$ and $g$ are probability density functions on $\mathbb{R}$, with $f(x) > 0$ and $g(x) > 0$ for all $x \in \mathbb{R}$, then we define
$$D(f \| g) = \int_{-\infty}^{\infty} f(x) \log\left(\frac{f(x)}{g(x)}\right) dx.$$
Compute $D(f \| g)$ when $f$ is the probability density function of a Gaussian distribution with mean $\mu_1$ and variance $\sigma^2$ and $g$ is the probability density function of a Gaussian distribution with mean $\mu_2$ and variance $\sigma^2$.
The next two exercises deal with the topic of Maximum Entropy Estimation.
*27. We have shown in Section 1.6 that the entropy of the uniform distribution on a sample space with $N$ elements is $\log(N)$ and this is the maximum value of the entropy for any distribution defined on that sample space. In many situations, we need to find a probability distribution that satisfies certain constraints and has the maximum entropy of all the distributions that satisfy those constraints. One type of constraint that is common is to require that the mean of the distribution should have a certain value. This can be done using Lagrange multipliers. Find the probability distribution on $\{1, 2, 3, 4\}$ that has maximum entropy subject to the condition that the mean of the distribution takes a specified value. To do this you have to find $p_1$, $p_2$, $p_3$ and $p_4$ that maximise the entropy subject to the normalisation and mean constraints.
*28. Find the probability distribution on $\{1, 2, 3, 4\}$ that has maximum entropy subject to the conditions that the mean of the distribution and the second moment of the distribution take specified values. In this case the constraints are the normalisation condition together with the conditions on the first and second moments.
The next two exercises deal with the topic of Mutual Information.
*29. If $P$ is a joint probability distribution on $X \times Y$, the Mutual Information of $X$ and $Y$, denoted $I(X; Y)$, is the Kullback-Leibler Divergence between $P$ and the product of the marginals $P_X P_Y$, given by
$$I(X; Y) = \sum_{i} \sum_{j} P(x_i, y_j) \log\left(\frac{P(x_i, y_j)}{P_X(x_i) P_Y(y_j)}\right).$$
1.21 References
[1] R. Ash, Information Theory, John Wiley & Sons, New York, 1965.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley &
Sons, New York, 1991.
[3] D. S. Jones, Elementary Information Theory, Oxford University Press, Oxford,
1979.
[4] H. S. Leff and A. F. Rex, Eds., Maxwell’s Demon: Entropy, Information, Com-
puting, Adam Hilger, Bristol, 1990.
[5] R. D. Rosenkrantz, Ed., E. T. Jaynes: Papers on Probability, Statistics and Sta-
tistical Physics, D. Reidel, Dordrecht, 1983.
[6] C. E. Shannon and W. Weaver, The Mathematical Theory Of Communication,
The University of Illinois Press, Urbana, IL, 1949.