Information Channels
Information does not only have to be used or stored, it also has to be transmitted. In
communication systems a transmitter converts, or encodes, a message to a form suit-
able for transmission through a communications medium, be it a fibre optic channel,
satellite link or radio signal through space. The receiver then detects the transmitted
signal and decodes it back to the original message. The encoding and decoding oper-
ations will be discussed in the following chapters. Although we will be dealing with
digital data, physical transmission and storage have to deal with analog signals and
media. Thus digital data has to be modulated in some way prior to transmission and
storage, and detected by the receiver to reproduce the original digital sequence. Dis-
cussion of modern analog and digital communication systems is beyond the scope
of this book and interested readers should refer to the standard textbooks in the area,
for example [7].
As we will see, two forms of coding will be necessary. Source coding is required to
efficiently store and transmit the information generated by a general-purpose source
(e.g., ASCII text, images, audio, etc.) on the storage or transmission medium in use
(e.g., computer disk or digital data networks, both requiring conversion to binary
data). Channel coding is required to mitigate the effects of noise that may corrupt
data during storage and transmission through physical systems. What is noise? Noise
can be defined as any unwanted signal or effect in addition to the desired signal. As
such there can be many causes of noise: interference from other signals, thermal and
spurious effects generated by the electronic device being used to store or transmit the
signal, environmental disturbances, etc.
In Chapter 1 we were concerned with defining the information content of a source.
In this chapter we will be concerned with the equally important problem of defining
or measuring the information carrying capacity of a channel. Intuitively a channel
can carry no more information than is being pushed through the channel itself, that
is, the entropy of the source or message being transmitted through the channel. But
as we will see the presence of noise reduces the information carrying capacity of
the channel and leads us to define another important quantity (alongside entropy)
called mutual information. As we view the channel from an information theoretic
point of view, it is not surprising that the concept of mutual information has uses beyond the
information carrying capacity of a channel (as does entropy beyond measuring the
information content of a source).
The following assumptions will apply in our modelling of an information channel:
Stationary: The statistical nature of the channel and noise do not change with time.

Memoryless: The behaviour of the channel and the effect of the noise at time t will
not depend on the behaviour of the channel or the effect of noise at any previous time.
NOTE The input alphabet represents the symbols transmitted into the channel and
the output alphabet represents the symbols received from the channel. The definition
of the channel implies that the input and output symbols may be different. In real-
ity, one would expect that the received symbols are the same as those transmitted.
However the effect of noise may introduce “new” symbols and thus we use different
input and output alphabets to cater for such cases. For more general applications the
channel models the matching of the input symbols to prescribed output symbols or
classes which are usually different. In statistical applications the input and output
symbols arise from two random variables and the channel models the joint relationship
between the variables.
where $P$ is the channel matrix and for notational convenience we may sometimes
rewrite this as:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1} & p_{r2} & \cdots & p_{rs} \end{bmatrix} \qquad (2.2)$$

where we have defined $p_{ij} = P(b_j|a_i)$.

[Figure 2.1 shows the $r$ input symbols $a_1, a_2, \ldots, a_r$ on the left, the $s$ output symbols $b_1, b_2, \ldots, b_s$ on the right, and the transition probabilities $P(b_j|a_i)$ labelling the connecting arrows.]

FIGURE 2.1
Graphical representation of an information channel.
Each row of $P$ contains the probabilities of all possible outputs from the same
input to the channel.

Each column of $P$ contains the probabilities of all possible inputs to a particular
output from the channel.

If we transmit the symbol $a_i$ we must receive an output symbol with probability 1, that is:

$$\sum_{j=1}^{s} P(b_j|a_i) = 1 \quad \text{for } i = 1, 2, \ldots, r \qquad (2.3)$$

that is, the probability terms in each row must sum to 1.
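The row-sum property of Equation 2.3 is easy to check numerically. The following Python sketch is illustrative only and not part of the original text; the 2-input, 3-output matrix is a hypothetical example.

```python
# Sketch: representing a channel matrix P = [P(b_j|a_i)] as a list of rows
# and checking the row-sum property of Equation 2.3.

def check_channel_matrix(P, tol=1e-9):
    """Return True if every row of P sums to 1 (Equation 2.3)."""
    return all(abs(sum(row) - 1.0) < tol for row in P)

P = [
    [0.9, 0.05, 0.05],   # P(b_1|a_1), P(b_2|a_1), P(b_3|a_1)  (hypothetical values)
    [0.1, 0.80, 0.10],   # P(b_1|a_2), P(b_2|a_2), P(b_3|a_2)
]

print(check_channel_matrix(P))   # True
```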
EXAMPLE 2.1
Consider a binary source and channel with input alphabet $A = \{0, 1\}$ and output alphabet
$B = \{0, 1\}$.
Noiseless: If the channel is noiseless there will be no error in transmission, the channel
matrix is given by

$$P = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}$$

and the channel maps input 0 to output 0 and input 1 to output 1, each with probability 1.0.
Noisy: Say the channel is noisy and introduces a bit inversion 1% of the time, then
the channel matrix is given by

$$P = \begin{bmatrix} 0.99 & 0.01 \\ 0.01 & 0.99 \end{bmatrix}$$

and the channel maps each input to the correct output with probability 0.99 and to the inverted output with probability 0.01.
In digital communication systems the input to the channel will be the binary digits
0,1 and this set will be the input alphabet and, ideally, also be the output alphabet.
Furthermore, the effect of noise will not depend on the transmission pattern, that is,
the channel is assumed memoryless. There are two possible scenarios for the effect of
noise.
Ideally if there is no noise a transmitted 0 is detected by the receiver as a 0, and a
transmitted 1 is detected by the receiver as a 1. However in the presence of noise the
receiver may produce a different result.
The most common effect of noise is to force the detector to detect the wrong bit (bit
inversion), that is, a 0 is detected as a 1, and a 1 is detected as a 0. In this case the
information channel that arises is called a binary symmetric channel or BSC where $q$ is the probability of error (also called the
bit error probability, bit error rate (BER), or “crossover” probability) and the output
alphabet is also the set of binary digits $\{0, 1\}$. The parameter $q$ fully defines the be-
haviour of the channel. The BSC is an important channel for digital communication
systems as noise present in physical transmission media (fibre optic cable, copper
wire, etc.) typically causes bit inversion errors in the receiver.
A BSC has channel matrix

$$P = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$

where $p = 1 - q$, and is depicted in Figure 2.2.
[Figure 2.2 shows the BSC transition diagram: $0 \to 0$ and $1 \to 1$ with probability $p$, and $0 \to 1$ and $1 \to 0$ with probability $q$.]

FIGURE 2.2
Binary symmetric channel.
Another effect that noise (or more usually, loss of signal) may have is to prevent
the receiver from deciding whether the symbol was a 0 or a 1. In this case the
output alphabet includes an additional symbol, ?, called the “erasure” symbol that
denotes a bit that was not able to be detected. Thus for binary input $A = \{0, 1\}$, the output
alphabet consists of the three symbols $B = \{0, ?, 1\}$. This information channel is called
a binary erasure channel or BEC where $q$
is the probability of error (also called the “erasure” probability). Strictly speaking
a BEC does not model the effect of bit inversion; thus a transmitted bit is either
received correctly (with probability $p = 1 - q$) or is received as an “erasure” (with
probability $q$). A BEC is becoming an increasingly important model for wireless
mobile and satellite communication channels, which suffer mainly from dropouts
and loss of signal leading to the receiver failing to detect any signal.
A BEC has channel matrix

$$P = \begin{bmatrix} p & q & 0 \\ 0 & q & p \end{bmatrix}$$

(with the outputs ordered as 0, ?, 1) and is depicted in Figure 2.3.
[Figure 2.3 shows the BEC transition diagram: $0 \to 0$ and $1 \to 1$ with probability $p$, and $0 \to ?$ and $1 \to ?$ with probability $q$.]

FIGURE 2.3
Binary erasure channel.
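As a small illustration (a sketch, not part of the original text), the BSC and BEC channel matrices can be generated directly from the error/erasure probability $q$; the function names below are illustrative only.

```python
# Sketch: channel matrices for the BSC and BEC as functions of the
# error/erasure probability q (p = 1 - q). Rows are indexed by the input
# symbols 0 and 1; BEC columns are ordered as outputs (0, ?, 1).

def bsc_matrix(q):
    p = 1.0 - q
    return [[p, q],
            [q, p]]

def bec_matrix(q):
    p = 1.0 - q
    return [[p, q, 0.0],
            [0.0, q, p]]

print(bsc_matrix(0.01))  # [[0.99, 0.01], [0.01, 0.99]]
print(bec_matrix(0.1))   # [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
```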
The probabilities $P(b_j|a_i)$ are termed the forward probabilities where forward in-
dicates that the direction of channel use is with input symbol $a_i$ being transmitted
and output symbol $b_j$ being received (i.e., $a_i$ then $b_j$, or $b_j$ given $a_i$). We can simi-
larly define the backward probabilities $P(a_i|b_j)$ as indicating the channel is running
backwards: output symbol $b_j$ occurs first followed by input symbol $a_i$ (i.e., $b_j$ then
$a_i$, or $a_i$ given $b_j$). The backward probabilities can be calculated by application of
Bayes’ Theorem as follows:

$$P(a_i|b_j) = \frac{P(b_j|a_i)P(a_i)}{P(b_j)} = \frac{P(b_j|a_i)P(a_i)}{\sum_{k} P(b_j|a_k)P(a_k)} \qquad (2.5)$$
EXAMPLE 2.2
Consider the binary information channel fully specified by:

$$P(A=0) = \frac{2}{3}, \quad P(A=1) = \frac{1}{3}, \qquad P = \begin{bmatrix} 3/4 & 1/4 \\ 1/8 & 7/8 \end{bmatrix} \qquad (2.6)$$

which is usually represented diagrammatically as:

[Channel diagram: input 0 (probability 2/3) and input 1 (probability 1/3); transition probabilities $P(0|0) = 3/4$, $P(1|0) = 1/4$, $P(0|1) = 1/8$, $P(1|1) = 7/8$.]

The output probabilities follow as $P(B=0) = \frac{3}{4}\cdot\frac{2}{3} + \frac{1}{8}\cdot\frac{1}{3} = \frac{13}{24}$ and $P(B=1) = \frac{1}{4}\cdot\frac{2}{3} + \frac{7}{8}\cdot\frac{1}{3} = \frac{11}{24}$, and the backward probabilities are given by Equation 2.5 as:

$$P(A=0|B=0) = \frac{12}{13}, \quad P(A=1|B=0) = \frac{1}{13}, \quad P(A=0|B=1) = \frac{4}{11}, \quad P(A=1|B=1) = \frac{7}{11} \qquad (2.7)$$
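The backward probabilities of Equation 2.7 can be reproduced with a short Python sketch (illustrative only, not part of the original text), applying the Bayes' theorem form of Equation 2.5 directly:

```python
# Sketch: backward probabilities P(a_i|b_j) via Bayes' theorem (Equation 2.5)
# for the channel of Example 2.2.

P_A = [2/3, 1/3]                    # input probabilities P(a_i)
P_BgA = [[3/4, 1/4],                # forward probabilities P(b_j|a_i)
         [1/8, 7/8]]

# output probabilities P(b_j) = sum_i P(b_j|a_i) P(a_i)
P_B = [sum(P_BgA[i][j] * P_A[i] for i in range(2)) for j in range(2)]

# backward probabilities P(a_i|b_j) = P(b_j|a_i) P(a_i) / P(b_j)
P_AgB = [[P_BgA[i][j] * P_A[i] / P_B[j] for j in range(2)] for i in range(2)]

print(P_B)     # [0.5416..., 0.4583...]  i.e. 13/24 and 11/24
print(P_AgB)   # [[0.923..., 0.363...], [0.0769..., 0.636...]] i.e. 12/13, 4/11, 1/13, 7/11
```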
We define the a priori entropy of $A$:

$$H(A) = -\sum_{i} P(a_i)\log_2 P(a_i)$$

as the average uncertainty we have about the input before the channel output is ob-
served, and the a posteriori entropy of $A$ given $b_j$:

$$H(A|b_j) = -\sum_{i} P(a_i|b_j)\log_2 P(a_i|b_j) \qquad (2.8)$$

as the average uncertainty we have about the input after the channel output $b_j$ is
observed.
How does our average uncertainty about the input change after observing the output
of the channel? Intuitively, we expect our uncertainty to be reduced as the channel
output provides us with knowledge and knowledge reduces uncertainty. However, as
we will see in the following example the output can sometimes increase our uncer-
tainty (i.e., be more of a hindrance than a help!).
EXAMPLE 2.3
Consider the binary information channel from Example 2.2, with input probabilities $P(A=0) = 2/3$ and $P(A=1) = 1/3$ and transition probabilities $P(0|0) = 3/4$, $P(1|0) = 1/4$, $P(0|1) = 1/8$ and $P(1|1) = 7/8$.
What is our uncertainty of the input that is transmitted through the channel before we
observe an output from the channel? This is provided by the entropy based on the
given a priori input probabilities, $P(0) = 2/3$ and $P(1) = 1/3$, yielding the a
priori entropy of A, $H(A) \approx 0.918$ bits.
What is our uncertainty of the input that is transmitted through the channel after we
observe an output from the channel? Say we observe an output of 0, then the
a posteriori input probabilities, $P(A=0|B=0) = 12/13$ and $P(A=1|B=0) = 1/13$,
yield an a posteriori entropy of A, $H(A|B=0) \approx 0.391$ bits. Thus we reduce our
uncertainty once we observe an output of 0. But what if we observe an output of
1? Then the a posteriori input probabilities become $P(A=0|B=1) = 4/11$ and
$P(A=1|B=1) = 7/11$ and the a posteriori entropy of A, $H(A|B=1) \approx 0.946$ bits. Our
uncertainty is in fact increased! The reason is that our high expectation of an input
of 0, from the a priori probability $P(0) = 2/3$, is negated by receiving an output
of 1. Thus $(4/11, 7/11)$ is closer to equi-probable than $(2/3, 1/3)$ and
this increases the a posteriori entropy.
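The entropy values used in this example can be checked numerically; the following Python sketch is illustrative only and not part of the original text:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    return -sum(p * log2(p) for p in probs if p > 0)

# a priori input probabilities (Example 2.2)
print(entropy([2/3, 1/3]))     # H(A)      ~ 0.918 bits

# a posteriori probabilities after observing each output (Example 2.3)
print(entropy([12/13, 1/13]))  # H(A|b=0)  ~ 0.391 bits (uncertainty reduced)
print(entropy([4/11, 7/11]))   # H(A|b=1)  ~ 0.946 bits (uncertainty increased)
```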
Notwithstanding the fact that $H(A|B=1) > H(A)$ for this particular output,
we can show that if we average across all possible outputs the channel will
indeed reduce our uncertainty, that is, $H(A|B) \le H(A)$, where:

$H(A)$ is the average uncertainty (or surprise) of the input to the channel before observing
the channel output;

$H(A|B) = \sum_j P(b_j)H(A|b_j)$ is the average uncertainty (or equivocation) of the input to the channel after ob-
serving the channel output.
DEFINITION 2.2 Mutual Information For input alphabet $A$ and output alpha-
bet $B$ the term

$$I(A;B) = H(A) - H(A|B) \qquad (2.9)$$

is the mutual information between $A$ and $B$.
An alternative expression to Equation 2.9 for the mutual information can be derived
as follows:

$$I(A;B) = H(A) - H(A|B) = -\sum_{i} P(a_i)\log_2 P(a_i) + \sum_{j}\sum_{i} P(a_i, b_j)\log_2 P(a_i|b_j)$$

$$= \sum_{i}\sum_{j} P(a_i, b_j)\log_2 \frac{P(a_i|b_j)}{P(a_i)} \qquad (2.10)$$

Using Equation 2.5 the mutual information can be expressed more compactly:

RESULT 2.1 Alternative expressions for Mutual Information

$$I(A;B) = \sum_{i}\sum_{j} P(a_i, b_j)\log_2 \frac{P(b_j|a_i)}{P(b_j)} = \sum_{i}\sum_{j} P(a_i, b_j)\log_2 \frac{P(a_i, b_j)}{P(a_i)P(b_j)} \qquad (2.11)$$
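Both forms of the mutual information can be verified numerically for the channel of Example 2.2. The following Python sketch is illustrative only; the variable names are not taken from the text:

```python
from math import log2

P_A = [2/3, 1/3]
P_BgA = [[3/4, 1/4], [1/8, 7/8]]
r, s = 2, 2

P_B = [sum(P_BgA[i][j] * P_A[i] for i in range(r)) for j in range(s)]
P_AgB = [[P_BgA[i][j] * P_A[i] / P_B[j] for j in range(s)] for i in range(r)]

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# I(A;B) = H(A) - H(A|B), with H(A|B) = sum_j P(b_j) H(A|b_j)
H_AgB = sum(P_B[j] * H([P_AgB[i][j] for i in range(r)]) for j in range(s))
I1 = H(P_A) - H_AgB

# I(A;B) = sum_i sum_j P(a_i, b_j) log2( P(a_i, b_j) / (P(a_i) P(b_j)) )  (Result 2.1)
I2 = sum(P_BgA[i][j] * P_A[i] * log2((P_BgA[i][j] * P_A[i]) / (P_A[i] * P_B[j]))
         for i in range(r) for j in range(s) if P_BgA[i][j] > 0)

print(I1, I2)   # both ~0.273 bits
```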
The mutual information has been defined in the context of measuring the information
carrying capacity of communication channels. However the concept of mutual infor-
mation has had far-reaching effects on solving difficult estimation and data analysis
problems in biomedical applications [8], image processing and signal processing. In
these applications the key in using mutual information is that it provides a measure
of the degree of dependence between two random variables or distributions (it is zero if and only if the two are independent).
In image processing [12] and speech recognition [9] the use of the maximum mutual
information or MMI between the observed data and available models has yielded
powerful strategies for training the models based on the data in a discriminative
fashion. In signal processing for communication systems the idea of minimising the
mutual information between the vector components for separating mutually interfer-
ing signals [3] has led to the creation of a new area of research for signal separation
based on the idea of independent component analysis or ICA.
From Example 2.3 we saw that for specific values of the output alphabet, $b_j$, either
$H(A|b_j) < H(A)$ or $H(A|b_j) > H(A)$, but when we averaged over the output
alphabet then $H(A|B) < H(A)$. This implies that $I(A;B) > 0$ for this case. But
what can we say about $I(A;B)$ for other cases? We restate the following result from
Chapter 1:
RESULT 2.2
The mutual information is a non-negative quantity:

$$I(A;B) \ge 0 \qquad (2.14)$$

and it is symmetric in $A$ and $B$:

$$I(A;B) = H(A) - H(A|B) \qquad (2.15)$$

or

$$I(A;B) = H(B) - H(B|A) = I(B;A) \qquad (2.16)$$
NOTE The information the channel provides about upon observing is the
same as the information the channel provides about upon noting was sent.
RESULT 2.3
From the fact that $I(A;B) \ge 0$ it can be stated that, in general and on
average, uncertainty is decreased when we know something, that is:

$$H(A|B) \le H(A) \qquad (2.17)$$
Other expressions and relationships between the entropies, $H(A)$ and $H(B)$, the equiv-
ocation, $H(A|B)$, the joint entropy, $H(A,B)$, and the mutual information,
$I(A;B)$, can be derived. All these relations can be summarised by the
Venn diagram of Figure 2.4. From Figure 2.4 additional expressions involving the joint entropy can be derived:
FIGURE 2.4
Relationship between all entropy and mutual information expressions.
$$H(A,B) = H(A) + H(B) - I(A;B) \qquad (2.18)$$
EXAMPLE 2.4
From Example 2.3 we calculated:

$$H(A) \approx 0.918 \text{ bits}, \quad H(A|B=0) \approx 0.391 \text{ bits}, \quad H(A|B=1) \approx 0.946 \text{ bits}$$

and from Example 2.2 we can calculate:

$$H(A|B) = \tfrac{13}{24}H(A|B=0) + \tfrac{11}{24}H(A|B=1) \approx 0.645 \text{ bits}, \qquad H(B) = -\tfrac{13}{24}\log_2\tfrac{13}{24} - \tfrac{11}{24}\log_2\tfrac{11}{24} \approx 0.995 \text{ bits}$$

The quantities $H(A)$, $H(B)$ and $H(A|B)$ completely determine the Venn diagram
and the remaining quantities can be derived by:

$$I(A;B) = H(A) - H(A|B) \approx 0.273 \text{ bits}$$
$$H(B|A) = H(B) - I(A;B) \approx 0.722 \text{ bits}$$
$$H(A,B) = H(A) + H(B) - I(A;B) \approx 1.640 \text{ bits}$$
For a BSC with bit error probability $q$ (and $p = 1 - q$), let the input probabilities be
$P(A=0) = w$ and $P(A=1) = 1 - w$. Then the output probabilities are given by:

$$P(B=0) = wp + (1-w)q, \qquad P(B=1) = wq + (1-w)p$$

A sketch of the BSC showing both input and output probabilities is given in Figure
2.5.

FIGURE 2.5
Mutual information of a BSC.

Since each row of the BSC channel matrix contains the entries $p$ and $q$, the mutual information reduces to:

$$I(A;B) = H(B) - H(B|A) = H(B) + p\log_2 p + q\log_2 q \qquad (2.19)$$
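As an illustrative sketch (not part of the original text), the BSC mutual information can be computed as a function of the input probability $w$ and error probability $q$ using the $H(B) - H(q)$ form; the function names are illustrative only:

```python
from math import log2

def H2(x):
    """Binary entropy function in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def bsc_mutual_information(w, q):
    """I(A;B) = H(B) - H(q) for a BSC with P(A=0) = w and error probability q."""
    p = 1.0 - q
    pb0 = w * p + (1 - w) * q      # P(B=0)
    return H2(pb0) - H2(q)

print(bsc_mutual_information(0.5, 0.1))   # ~0.531 bits (the maximum over w)
print(bsc_mutual_information(0.8, 0.1))   # smaller, since the inputs are skewed
```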
EXAMPLE 2.5
What is the mutual information of a BSC for the following cases?
A sketch of the BEC showing both input and output probabilities is given in Figure
2.6.
The expression for $I(A;B)$ is:

$$I(A;B) = H(A) - H(A|B) = (1 - q)H(A) = pH(A) \qquad (2.21)$$

FIGURE 2.6
Mutual information of a BEC.
We have already discussed the BSC and BEC structures as important channel models
for describing modern digital communication systems. We have also loosely referred
to channels as being noise-free, noisy and ambiguous. We now formally define noise-
less channels as those channels that are not subject to the effects of noise (i.e., no
uncertainty of the input that was transmitted given the output that was received). We
also define a dual class of channels called deterministic channels for which we can
determine the output that will be received given the input that was transmitted.
A noiseless channel will have either the same or possibly more output symbols than
input symbols and is such that there is no noise, ambiguity, or uncertainty of which
input caused the output.
EXAMPLE 2.6
The following channel with 6 outputs and 3 inputs is noiseless because we know, with
certainty 1, which input symbol, $a_i$, was transmitted given knowledge of
the received output symbol, $b_j$:

$$P = \begin{bmatrix} 1/3 & 2/3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/4 & 1/2 & 1/4 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
and we note that there is only one non-zero element in each column.
What can we say about the mutual information through a noiseless channel? Let
$b_j$ be the received output. Then, from the definition of a noiseless channel, we know
which input, say $a_k$, was transmitted and we know this with certainty 1. That is,
$P(a_k|b_j) = 1$ for this $k$ and hence $P(a_i|b_j) = 0$ for all other $i$. The equivocation
becomes:

$$H(A|B) = -\sum_{j} P(b_j)\sum_{i} P(a_i|b_j)\log_2 P(a_i|b_j) = 0$$

since each term $P(a_i|b_j)\log_2 P(a_i|b_j)$ is zero: either $P(a_i|b_j) = 0$ or $P(a_i|b_j) = 1$ (and $\log_2 1 = 0$), and hence $H(A|B) = 0$.
The mutual information is then given by the following result.
RESULT 2.4
The mutual information for noiseless channels is given by:

$$I(A;B) = H(A) - H(A|B) = H(A) \qquad (2.23)$$
That is, the amount of information provided by the channel is the same as the
information sent through the channel.
A deterministic channel will have either the same or possibly more input symbols
than output symbols and is such that we can determine which output symbol will be
received when a particular input symbol is transmitted.
EXAMPLE 2.7
The following channel with 3 outputs and 6 inputs is deterministic because we know,
with certainty 1, which output symbol, $b_j$, will be received given knowledge
of the transmitted input symbol, $a_i$: each of the six rows of the channel matrix contains a single entry of 1 (with all other entries 0), so each input maps to exactly one output.
RESULT 2.5
The mutual information for deterministic channels is given by:

$$I(A;B) = H(B) - H(B|A) = H(B) \qquad (2.24)$$

That is, the amount of information provided by the channel is the same as the
information produced by the channel output.
One example of where channels are connected in cascade is in modern data communication systems where
data can be sent through different physical transmission media links (e.g., copper
wire, optical fibre) and wireless media links (e.g., satellite, radio) from transmitter
to receiver. Each of these links can be modelled as an independent channel and
the complete transmission path as a cascade of such channels. What happens to
the information when passed through a cascade of channels as compared to a single
channel only?
Intuitively we would expect additive loss of information arising from the cumulative
effects of uncertainty (or equivocation) from each channel in the cascade. Only for
the case of a noiseless channel would we expect no additional loss of information in
passing data through that channel. We verify these and other results when, without
loss of generality, we derive the mutual information of a cascade of two channels and
compare this with the mutual information through the first channel.
Consider a pair of channels in cascade as shown in Figure 2.7. In Figure 2.7 the
output B of the first channel (channel AB) forms the input to the second channel (channel BC).

FIGURE 2.7
Cascade of two channels.
The defining property of a cascade is that the output of channel BC depends only on its input B and not on the original input A, that is, $P(c_k|b_j, a_i) = P(c_k|b_j)$. Similarly we can also state that the following will also be true for a cascade of chan-
nels:

$$P(a_i|b_j, c_k) = P(a_i|b_j) \qquad (2.26)$$

RESULT 2.6
For the cascade of channel AB with channel BC it is true that:

$$I(A;C) \le I(A;B) \qquad (2.29)$$
That is, channels tend to leak information and the amount of information out of a
cascade can be no greater (and is usually less) than the information from the first
channel.
CLAIM 2.1
If channel BC is noiseless then $I(A;C) = I(A;B)$.
The converse of Claim 2.1, however, is not true since, surprisingly, there may exist
particular channel combinations and input distributions which give rise to
$I(A;C) = I(A;B)$ even if channel BC is not noiseless. The following example illustrates this
point.
EXAMPLE 2.8
Consider the cascade of channel AB:

[Channel AB diagram, with transition probabilities 1/2, 1/2 and 1/3, 1/3, 1/3.]

with channel BC:

[Channel BC diagram, with transition probabilities 3/4, 1/4, 1, 1/4 and 3/4.]

For this combination $I(A;C) = I(A;B)$ even though channel BC is not noiseless, by virtue of the fact that the channel matrix for the cascaded channel AC
is identical to the channel matrix for channel AB!
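The behaviour of cascaded channels is easy to explore numerically: the matrix of the cascaded channel AC is the product of the individual channel matrices, and the inequality $I(A;C) \le I(A;B)$ of Result 2.6 can be checked directly. The sketch below uses hypothetical channel matrices (not those of Example 2.8) and is illustrative only:

```python
import numpy as np

def mutual_information(p_a, P):
    """I(A;B) in bits for input distribution p_a and channel matrix P."""
    p_a = np.asarray(p_a, dtype=float)
    P = np.asarray(P, dtype=float)
    joint = p_a[:, None] * P                      # P(a_i, b_j)
    p_b = joint.sum(axis=0)                       # P(b_j)
    mask = joint > 0
    return float((joint[mask] *
                  np.log2(joint[mask] / np.outer(p_a, p_b)[mask])).sum())

p_a = [0.5, 0.5]                                  # hypothetical input distribution
P_ab = np.array([[0.9, 0.1], [0.2, 0.8]])         # hypothetical channel AB
P_bc = np.array([[0.95, 0.05], [0.1, 0.9]])       # hypothetical channel BC
P_ac = P_ab @ P_bc                                # cascaded channel AC

print(mutual_information(p_a, P_ab))              # I(A;B)
print(mutual_information(p_a, P_ac))              # I(A;C) <= I(A;B)
```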
[Figure 2.8 shows a channel system with a single input A (an r-symbol alphabet) and two outputs B (an s-symbol alphabet) and C (a t-symbol alphabet).]

FIGURE 2.8
Channel system with two outputs.
Output B can be construed as the output of
the first channel (if there are two physical channels) or the first transmission (if there
is one physical channel) and output C can be construed as the output of the second
channel or second (repeat) transmission.
To develop an expression for the mutual information we extend our notion of a priori
and a posteriori probabilities and entropies as discussed in Section 2.3 to include the
contribution of a second output as follows:

$$H(A|b_j, c_k) = -\sum_{i} P(a_i|b_j, c_k)\log_2 P(a_i|b_j, c_k) \qquad (2.32)$$

$$H(A|B,C) = \sum_{j}\sum_{k} P(b_j, c_k)H(A|b_j, c_k) \qquad (2.33)$$

and the amount of information channels B and C provide about A, that is the
mutual information of A and BC, is given by:

$$I(A;BC) = H(A) - H(A|B,C) \qquad (2.34)$$

$$= \sum_{i}\sum_{j}\sum_{k} P(a_i, b_j, c_k)\log_2\frac{P(a_i|b_j, c_k)}{P(a_i)} \qquad (2.35)$$
RESULT 2.7
For a channel with input A and dual outputs B and C it is true that:

$$I(A;BC) \ge I(A;B)$$

That is, dual use of a channel provides more information than the single use of a
channel.
EXAMPLE 2.9
Consider a BSC where the input symbol, $a$, is transmitted through the channel twice.
Let $b$ represent the output from the original transmission and let $c$ be the output
from the repeat transmission. Since the same channel is used:

$$P(c|a) = P(b|a) \quad \text{for corresponding output symbols}$$

where we note that $P(c|b, a) = P(c|a)$ since the repeat output does not depend on the
original output. We list the contribution of each term in the expression of Equation
2.38 as shown by Table 2.1.
The Type column assigns a class label to each term. The type X terms represent the
case of no error in both the original and repeat outputs and these positively reinforce
the information provided by the dual use of the channel. Collecting the type X terms
(assuming equiprobable inputs, $P(0) = P(1) = \tfrac{1}{2}$) we get:

$$2 \times \tfrac{1}{2}p^2\log_2\frac{2p^2}{p^2 + q^2} = p^2\log_2\frac{2p^2}{p^2 + q^2}$$
The type Y terms represent the case of complete error in both the original and repeat
outputs and these negatively reinforce the information provided by the dual use of the
channel. Collecting the type Y terms we get:

$$2 \times \tfrac{1}{2}q^2\log_2\frac{2q^2}{p^2 + q^2} = q^2\log_2\frac{2q^2}{p^2 + q^2}$$
The type Z terms, however, indicate contradictory original and repeat outputs which
effectively negate any information that the dual use of the channel may provide.
Collecting the type Z terms we see that these make no contribution to the mutual
information:

$$4 \times \tfrac{1}{2}pq\log_2\frac{2pq}{pq + pq} = 2pq\log_2 1 = 0$$
By plotting $I(A;B)$ and $I(A;BC)$ for different values of $q$ we see from Figure
2.9 that $I(A;BC) \ge I(A;B)$. It is interesting to note that equality holds only at $q = 0$, $q = \tfrac{1}{2}$ and $q = 1$.

[Figure 2.9 plots $I(A;B)$ and $I(A;BC)$ (in bits) against the error probability $q$ over the range 0 to 1.]

FIGURE 2.9
Comparison of $I(A;B)$ with $I(A;BC)$.
$$C = \max_{P(A)} I(A;B) \qquad (2.40)$$

that is, the maximum mutual information over all possible input probability assign-
ments, $P(A)$.
1. $C \ge 0$ since $I(A;B) \ge 0$;

2. $C \le \log_2 r$ and $C \le \log_2 s$ since $I(A;B) \le H(A) \le \log_2 r$ when considering the expression
$I(A;B) = H(A) - H(A|B)$, and $I(A;B) \le H(B) \le \log_2 s$ when considering the
expression $I(A;B) = H(B) - H(B|A)$.

For the BSC with input probabilities $P(0) = w$ and $P(1) = 1 - w$:

1. $H(B|A) = -(p\log_2 p + q\log_2 q)$ does not depend on $w$;

2. $H(B) \le 1$ bit, with equality when $w = \tfrac{1}{2}$.

Since $I(A;B) = 1 + p\log_2 p + q\log_2 q$ at $w = \tfrac{1}{2}$ represents the maximum possible mutual information we can then state:

$$C = 1 + p\log_2 p + q\log_2 q \qquad (2.42)$$
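The closed-form BSC capacity can be checked against a brute-force maximisation of $I(A;B)$ over the input probability $w$. The following sketch is illustrative only and assumes the $C = 1 - H(q)$ form:

```python
from math import log2

def H2(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def bsc_I(w, q):
    p = 1.0 - q
    return H2(w * p + (1 - w) * q) - H2(q)

q = 0.1
closed_form = 1.0 - H2(q)                         # C = 1 + p*log2(p) + q*log2(q)
brute_force = max(bsc_I(w / 1000, q) for w in range(1001))

print(closed_form, brute_force)                   # both ~0.531 bits, achieved at w = 1/2
```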
How does the channel capacity vary for different error probabilities, $q$? From Figure
2.11 the following observations can be made:

[Figure 2.10 plots $I(A;B)$ (in bits) against the input probability $w$ for a typical BSC.]

FIGURE 2.10
Variation of $I(A;B)$ with different source assignments for a typical BSC.
When $q = 0$ or $q = 1$, that is, no error or 100% bit inversion, the BSC channel will
provide its maximum capacity of 1 bit.

When $q = \tfrac{1}{2}$ the BSC channel is totally ambiguous or useless and has a
capacity of 0 bits.
For the BEC the channel capacity is given by:

$$C = \max_{P(A)} I(A;B) = \max_{P(A)} p\,H(A) = p \max_{P(A)} H(A) \qquad (2.43)$$

$$C = p = 1 - q \qquad (2.44)$$

since we know that for the binary input alphabet, $A = \{0, 1\}$, that $\max H(A) = 1$ bit
occurs when $P(0) = P(1) = \tfrac{1}{2}$. The following observations
can be made:
[Figure 2.11 plots the channel capacity $C$ (in bits) of a BSC against the error probability $q$ over the range 0 to 1.]

FIGURE 2.11
Variation of channel capacity of a BSC with $q$.
when $q = 0$, that is, no erasures, the BEC channel will provide its maximum
capacity of 1 bit;

when $q = 1$, the BEC channel will be producing only erasures and have a
capacity of 0 bits.
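For comparison (an illustrative sketch, not part of the original text), the BSC and BEC capacities can be evaluated side by side for the same value of $q$; at a given $q$ an erasure costs less capacity than a bit inversion:

```python
from math import log2

def H2(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def bsc_capacity(q):
    return 1.0 - H2(q)          # C = 1 + p*log2(p) + q*log2(q)

def bec_capacity(q):
    return 1.0 - q              # C = p = 1 - q

for q in (0.0, 0.1, 0.5, 1.0):
    print(q, bsc_capacity(q), bec_capacity(q))
# at q = 0.1: BSC ~0.531 bits, BEC 0.9 bits -- an erasure is less damaging than an inversion
```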
The calculation of the channel capacity, , is generally quite involved since it rep-
resents a problem in constrained optimisation of a non-linear function. However
there is a special class of channels where the derivation of the channel capacity can
be stated explicitly. Both symmetric and weakly symmetric channels are examples
of this special class. We now provide a formal definition of symmetric and weakly
symmetric channels.
For a weakly symmetric channel every row of the channel matrix contains the same set of probabilities, so the mutual information becomes:

$$I(A;B) = H(B) - H(B|A) = H(B) - H(\text{row of } P) \qquad (2.45)$$

since $H(B|A) = \sum_i P(a_i)H(\text{row } i) = H(\text{row of } P)$ does not depend on the input
distribution. Since $H(B)$ is bounded above, then the upper bound of
$I(A;B) = \log_2 s - H(\text{row of } P)$ would be the channel capacity if it can be achieved by an appro-
priate input distribution. Let $P(a_i) = \frac{1}{r}$; then
$P(b_j) = \frac{1}{r}\sum_i P(b_j|a_i)$. Since the channel is weakly symmetric then the column
sums, $\sum_i P(b_j|a_i)$, are all equal, say to $c$, and we have that $P(b_j) = \frac{c}{r}$, that is, the output prob-
abilities are equal. Since we know that maximum entropy is achieved with equal
symbol probabilities then it follows that $H(B) = \log_2 s$ when $P(a_i) = \frac{1}{r}$. We
have established that:
THEOREM 2.1
For a symmetric or weakly symmetric channel, the channel capacity can be
stated explicitly as:

$$C = \log_2 s - H(\text{row of } P) \qquad (2.46)$$

where $H(\text{row of } P) = -\sum_{j=1}^{s} p_{ij}\log_2 p_{ij}$ can be calculated for any row $i$.
Channel capacity is achieved when the inputs are uniformly distributed, that is,
$P(a_i) = \frac{1}{r}$ for $i = 1, 2, \ldots, r$.
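Theorem 2.1 reduces to a one-line computation. The sketch below is illustrative only and assumes the supplied matrix really is symmetric or weakly symmetric:

```python
from math import log2

def weakly_symmetric_capacity(P):
    """C = log2(s) - H(row) for a (weakly) symmetric channel matrix P (Theorem 2.1)."""
    s = len(P[0])                                            # number of output symbols
    row_entropy = -sum(p * log2(p) for p in P[0] if p > 0)   # the same for every row
    return log2(s) - row_entropy

# BSC with q = 0.1 (symmetric): C = 1 - H(0.1) ~ 0.531 bits
print(weakly_symmetric_capacity([[0.9, 0.1], [0.1, 0.9]]))

# a hypothetical weakly symmetric channel: rows are permutations, column sums are equal
print(weakly_symmetric_capacity([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]))
```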
EXAMPLE 2.10
Consider the BSC channel matrix:

$$P = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$

The BSC is a symmetric channel and we can use Theorem 2.1 to explicitly derive the
channel capacity as follows:

$$C = \log_2 2 - H(p, q) = 1 + p\log_2 p + q\log_2 q$$

achieved when $P(0) = P(1) = \tfrac{1}{2}$, which is the same expression that was derived in Section 2.7.1.
Now consider a second channel matrix. Obviously the channel is not symmetric, but since the second row is a permutation of
the first row and since the column sums are equal then the channel is weakly symmetric
and the channel capacity can be explicitly stated using Theorem 2.1, with capacity achieved
when $P(0) = P(1) = \tfrac{1}{2}$.
We extend our analysis of information channels to the case of continuous valued in-
put and output alphabets and to the most important class of continuous channel, the
Gaussian channel. In digital communication systems noise analysis at the most basic
level requires consideration of continuous valued random variables rather than dis-
crete quantities. Thus the Gaussian channel represents the most fundamental form
of all types of communication channel systems and is used to provide meaning-
ful insights and theoretical results on the information carrying capacity of channels.
The BSC and BEC models, on the other hand, can be considered as high-level de-
scriptions of the practical implementations and operations observed in most digital
communication systems.
When considering the continuous case our discrete-valued symbols and discrete
probability assignments are replaced by continuous-valued random variables with
associated probability density functions.
(2.49)
[Figure 2.12 shows the Gaussian channel: the input $X$ and the noise $N$ are added to produce the output $Y$.]

FIGURE 2.12
The Gaussian channel.

$$Y = X + N \qquad (2.50)$$

That is, the channel output is perturbed by additive white Gaussian noise (AWGN).
1. Assume the AWGN has a power spectral density of $N_0/2$. Then the noise
variance, or average power, is band-limited to the channel bandwidth $W$ and given by
$\sigma^2 = \int_{-W}^{W}\frac{N_0}{2}\,df = N_0 W$.

2. To faithfully reproduce any signals transmitted through the channel the signals
must be transmitted at a rate not exceeding the Nyquist rate of $2W$ samples
per second.
This band-limited, power-limited Gaussian channel just described is not only of the-
oretical importance in the field of information theory but of practical importance to
communication engineers since it provides a fundamental model for many modern
communication channels, including wireless radio, satellite and fibre optic links.
In Section 2.7 we defined the channel capacity as the maximum of the mutual in-
formation over all possible input distributions. Of importance to communication en-
gineers is the channel capacity of a band-limited, power-limited Gaussian channel.
This is given by the following maximisation problem:

$$C = \max_{f_X(x)\,:\,E[X^2] \le P} I(X;Y) \qquad (2.52)$$
CLAIM 2.2
If $Y = X + N$ and $N$ is uncorrelated with $X$ then:

$$h(Y|X) = h(N) \qquad (2.53)$$

and hence

$$I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(N) \qquad (2.54)$$
From Chapter 1 we stated that for a random variable with variance $\sigma^2$ the differential entropy
attains the maximum value of $\frac{1}{2}\log_2(2\pi e\sigma^2)$ when the random variable is Gaussian; so for the Gaussian random variable, $N$,
we know that:

$$h(N) = \frac{1}{2}\log_2(2\pi e\sigma^2) \qquad (2.55)$$

Since $Y = X + N$ with $E[X^2] \le P$ and $E[N^2] = \sigma^2$, the output power satisfies $E[Y^2] \le P + \sigma^2$ and hence $h(Y) \le \frac{1}{2}\log_2\left(2\pi e(P + \sigma^2)\right)$, with equality when $X$, and hence $Y$, is Gaussian. The capacity per channel use is then:

$$C = \frac{1}{2}\log_2\left(2\pi e(P + \sigma^2)\right) - \frac{1}{2}\log_2(2\pi e\sigma^2) = \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right) \qquad (2.57)$$
Since the channel is also band-limited to $W$ Hz then there can be no more than
$2W$ symbols transmitted per second. This provides the final form of the
information capacity theorem:

$$C = W\log_2\left(1 + \frac{P}{N_0 W}\right) \text{ bits per second} \qquad (2.59)$$

where $P$ is the average transmitted power and $P/(N_0 W)$ is the signal-to-noise ratio
or SNR.
Equation 2.59 provides the theoretical capacity or upper bound on the bits per second
that can be transmitted for error-free transmission through a channel for a given
transmitted power, $P$, and channel bandwidth, $W$, in the presence of AWGN
with power spectral density, $N_0/2$. Thus the information capacity theorem defines a
fundamental limit that confronts communication engineers on the rate for error-free
transmission through a power-limited, band-limited Gaussian channel.
EXAMPLE 2.11
What is the minimum signal-to-noise ratio that is needed to support a 56k modem?
A 56k modem requires a channel capacity of 56,000 bits per second. Assuming a
standard telephone channel bandwidth of $W$ Hz, from Equation 2.59 the required signal-to-noise ratio must satisfy:

$$56000 = W\log_2(1 + \text{SNR})$$

or:

$$\text{SNR} = 2^{56000/W} - 1$$
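A short sketch of the calculation (illustrative only, not part of the original text); the telephone bandwidth values below are assumptions, since the specific value used in the example is not reproduced here:

```python
from math import log10

def required_snr(capacity_bps, bandwidth_hz):
    """Invert Equation 2.59: C = W log2(1 + SNR)  =>  SNR = 2**(C/W) - 1."""
    return 2.0 ** (capacity_bps / bandwidth_hz) - 1.0

C = 56_000                         # 56k modem capacity in bits per second
for W in (3000.0, 3400.0):         # assumed telephone bandwidths in Hz (illustrative)
    snr = required_snr(C, W)
    print(W, snr, 10 * log10(snr))   # required SNR and its value in dB
```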
lossiness in the source coding where the code alphabet and permitted code
words do not allow exact representation of the source symbols and the decod-
ing is subject to errors,
insufficient redundancy in the channel code such that the rate of information
is greater than the channel capacity,
the representation is not perfect and there are unavoidable errors or distortions in the
representation of the source symbol by the representation symbol.
Rate distortion theory, first developed by Shannon [11], deals with the minimum mu-
tual information (equivalent to the information or code rate) that the channel must
possess, for the given source symbol probability distributions, to ensure that the av-
erage distortion is guaranteed not to exceed a specified threshold $D$. To derive this
value we first define what we mean by an information or code rate (this is for the
general case; in Chapter 5 we define the code rate for the specific case that applies
with channel codes).
DEFINITION 2.8 Code Rate (General Case) Assume one of $M$ possible source
messages is represented as a code word of length $n$. Let $H$ be the average
number of bits transmitted with each source message; then the code rate is defined
as:

$$R = \frac{H}{n} \qquad (2.60)$$

For the case of $M$ equally likely source messages, $H = \log_2 M$. For the general
case, $H$ is the entropy of the source message.
Consider all possible conditional probability assignments $P(b_j|a_i)$ for the given
source and representation alphabets. An assignment is deemed to be $D$-admissible if
and only if the average distortion, $\bar{D}$, is less than or equal to some acceptable or spec-
ified threshold, $D$. The set of all $D$-admissible conditional probability assignments
is denoted by:

$$P_D = \left\{ P(b_j|a_i) : \bar{D} \le D \right\} \qquad (2.62)$$

where the representation alphabet probabilities are given by $P(b_j) = \sum_i P(b_j|a_i)P(a_i)$.
DEFINITION 2.10 Rate Distortion Function For given $D$ and fixed source
probability distribution $P(a_i)$ the minimum information rate we
require for the given average distortion is given by the rate distortion function:

$$R(D) = \min_{P(b_j|a_i)\in P_D} I(A;B) \qquad (2.64)$$

which is derived by finding the $D$-admissible conditional probability assignment
that minimises the mutual information subject to the constraints:

$$P(b_j|a_i) \ge 0 \quad \text{and} \quad \sum_j P(b_j|a_i) = 1 \text{ for all } i \qquad (2.65)$$
Intuitively, we expect that tolerating a larger distortion permits the use of lower in-
formation rates (as is the case of achieving low bit rate transmission by lossy com-
pression) and conversely that by increasing the information rate lower distortion is
possible.
The calculation of from Equation 2.64 involves a minimisation of a non-linear
function over an unknown but constrained set of probabilities, which is a very diffi-
cult problem analytically.
EXAMPLE 2.12
Consider a binary source with symbols $A = \{0, 1\}$ that are detected at the
receiver as the ternary representation alphabet with symbols $B = \{0, ?, 1\}$.
This representation covers the following three cases: a bit is received correctly, a bit is received as an erasure (?), or a bit is received inverted.

[Figure 2.13 shows the corresponding information channel: each input (0 or 1) maps to the three outputs 0, ? and 1 with conditional probabilities $P(b_j|a_i)$.]

FIGURE 2.13
Graphical representation of information channel.
[Figure 2.14 shows the single-letter distortion measure: $d(0,0) = d(1,1) = 0$, $d(0,?) = d(1,?) = 1$ and $d(0,1) = d(1,0) = 3$.]

FIGURE 2.14
Graphical representation of distortion measure.
Consider the derivation of $R(D)$, that is, the rate distortion function for a specified distortion threshold $D$.
The conditional probability assignment of Equation 2.66 is $D$-admissible and provides
an information rate of $I(A;B)$. Is there another probability assignment
with an average distortion no greater than $D$ and a lower information rate? Intuitively one expects the
minimum information rate to occur at the maximum allowable distortion of $D$.
To investigate this we consider the derivation of the corresponding BSC and BEC
equivalents of Figure 2.13 such that the average distortion equals $D$. The following conditional probability
assignments:
both yield the same average distortion of $D$. For the BSC equivalent the mutual
information is calculated to be smaller than that for the BEC equivalent.
Thus the BSC equivalent is the channel with the lowest information
rate for the same level of average distortion. To find $R(D)$ we need to consider
all $D$-admissible conditional probability assignments and find the one providing the
minimum value for the mutual information.
EXAMPLE 2.13
An important example of rate distortion analysis is for the case of analog to digital con-
version when the analog source signal has to be represented as a digital source signal.
An important component of this conversion is the quantisation of the continuous-
valued analog signal to a discrete-valued digital sample.
Consider source symbols generated from a discrete-time, memoryless Gaussian source
with zero mean and variance $\sigma^2$. Let $a$ denote the value of the source symbol or sam-
ple generated by this source. Although the source symbol is discrete in time it is
continuous-valued and a discrete-valued (quantised) representation of $a$ is needed for
storage and transmission through digital media. Let $b$ be a symbol from the discrete-
valued representation alphabet that is used to represent the continuous-valued $a$.
For example $B$ can be the set of non-negative integers and $b$ is the value
of $a$ rounded to the nearest integer. It should be noted that, strictly speaking, a finite
representation is needed since digital data are stored in a finite number of bits. This is
usually achieved in practice by assuming a limited dynamic range for $a$ and limiting
the representation alphabet to a finite set of non-negative integers. Thus we refer to $b$
as the quantised version of $a$ and the mapping of $a$ to $b$ as the quantisation operation.
The most intuitive distortion measure between $a$ and $b$ is a measure of the error in
representing $a$ by $b$ and the most widely used choice is the squared error distortion:

$$d(a, b) = (a - b)^2 \qquad (2.67)$$

It can be shown with the appropriate derivation (see [4, 5]) that the rate distortion
function for the quantisation of a Gaussian source with variance $\sigma^2$ with a squared
error distortion is given by:

$$R(D) = \begin{cases} \frac{1}{2}\log_2\frac{\sigma^2}{D} & 0 \le D \le \sigma^2 \\ 0 & D > \sigma^2 \end{cases} \qquad (2.68)$$
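Equation 2.68 can be evaluated directly, and its inverse form $D(R) = \sigma^2 2^{-2R}$ gives the distortion achievable at a given rate. The following sketch is illustrative only and not part of the original text:

```python
from math import log2

def gaussian_rate_distortion(D, sigma2):
    """R(D) of Equation 2.68 for a Gaussian source with variance sigma2."""
    return 0.5 * log2(sigma2 / D) if 0 < D <= sigma2 else 0.0

def gaussian_distortion_rate(R, sigma2):
    """Inverse form: the smallest achievable distortion at rate R bits/sample."""
    return sigma2 * 2.0 ** (-2.0 * R)

sigma2 = 1.0
print(gaussian_rate_distortion(0.25, sigma2))   # 1.0 bit/sample
print(gaussian_distortion_rate(3.0, sigma2))    # ~0.0156, i.e. about 6 dB of reduction per bit
```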
2.10.1 Properties of R(D)
By considering the minimum and maximum permissible values for $R(D)$ and the
corresponding distortion threshold, $D$, a more intuitive understanding of the be-
haviour of the rate distortion function, $R(D)$, is possible. Obviously the minimum
value of $R(D)$ is 0, implying that there is no minimum rate of information, but what
does this mean and what does the corresponding distortion threshold imply? Intu-
itively the maximum value of $R(D)$ should be $H(A)$ and this should happen when
$D = 0$ (perfect reconstruction), but is this really the case? We answer these questions
by stating and proving the following result.
RESULT 2.9
The rate distortion function is a monotonically decreasing function of $D$, limited in
range by:

$$0 \le R(D) \le H(A) \qquad (2.69)$$

where the upper bound occurs for the minimum
possible value of the distortion, $D_{min}$, and there is a
maximum permissible value, $D_{max}$, such that $R(D) = 0$ for $D \ge D_{max}$.
[Figure 2.15 sketches a typical rate distortion function: $R(D)$ decreases monotonically from its maximum at $D = D_{min}$ to 0 at $D = D_{max}$.]

FIGURE 2.15
Sketch of typical function $R(D)$.
The minimum possible distortion, $D_{min}$, is obtained by the conditional probability assignment that
minimises:

$$\bar{D} = \sum_{i}\sum_{j} P(a_i)P(b_j|a_i)d(a_i, b_j) \qquad (2.70)$$

The minimum occurs by considering, for each $a_i$, the value of $b_j$ that minimises
$d(a_i, b_j)$ and setting $P(b_j|a_i) = 1$ for these values, with all other conditional prob-
abilities set to zero. For each $a_i$ define $d_{min}(a_i) = \min_j d(a_i, b_j)$ as the minimum
distortion for $a_i$ and $b_{j(i)}$, where $j(i) = \arg\min_j d(a_i, b_j)$, as the representation
symbol that yields the minimum distortion. We set $P(b_{j(i)}|a_i) = 1$ with all other
conditional probabilities for $a_i$ being set to zero. Hence:

$$D_{min} = \sum_{i} P(a_i)d_{min}(a_i)$$
When $R(D) = 0$ the representation $B$ is independent of the source $A$ and the average distortion becomes $\bar{D} = \sum_j P(b_j)\sum_i P(a_i)d(a_i, b_j)$ (Equation 2.72).
Then since we are looking for $D_{max}$ such that $R(D) = 0$ for $D \ge D_{max}$, then
$D_{max}$ is given by the minimum value of Equation 2.72 over all possible probability
assignments for $P(b_j)$. Define $\bar{d}(b_j) = \sum_i P(a_i)d(a_i, b_j)$. The mini-
mum of Equation 2.72 occurs when $P(b_j) = 1$ for that $b_j$ such that the expression
$\bar{d}(b_j)$ is smallest, and $P(b_j) = 0$ for all other $b_j$. This
gives:

$$D_{max} = \min_{j}\sum_{i} P(a_i)d(a_i, b_j) \qquad (2.73)$$

Equation 2.73 implies that if we are happy to tolerate an average distortion that is as
much as $D_{max}$ then there is a choice of representation that is independent of $A$
such that $R(D) = 0$.
EXAMPLE 2.14
Consider the channel from Example 2.12. Evaluating $\sum_i P(a_i)d(a_i, b_j)$ for each of the three representation symbols and taking the minimum (Equation 2.73) gives:

$$D_{max} = 1$$

and this occurs with the following conditional probability assign-
ment:

$$P(?|0) = P(?|1) = 1 \quad \text{(all other conditional probabilities zero)}$$

Interestingly this represents the condition where, no matter what input is transmitted
through the channel, the output will always be ? with an average cost of 1. Thus
if we can tolerate an average distortion of 1 or more we might as well not bother!
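The $D_{max}$ calculation of Equation 2.73 can be reproduced with a short sketch (illustrative only, not part of the original text). The distortion values below are those read off Figure 2.14 (0 for correct reception, 1 for an erasure, 3 for an inversion) and equiprobable inputs are assumed:

```python
# Sketch: D_max = min_j sum_i P(a_i) d(a_i, b_j)   (Equation 2.73)

P_A = [0.5, 0.5]                 # assumed equiprobable inputs
# distortion d(a_i, b_j) with outputs ordered as (0, ?, 1), as read from Figure 2.14
d = [[0.0, 1.0, 3.0],            # input 0
     [3.0, 1.0, 0.0]]            # input 1

column_costs = [sum(P_A[i] * d[i][j] for i in range(2)) for j in range(3)]
print(column_costs)              # [1.5, 1.0, 1.5]
print(min(column_costs))         # D_max = 1.0, achieved by always outputting '?'
```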
EXAMPLE 2.15
Consider the rate distortion function given by Equation 2.68 in Example 2.13 for the
quantisation of a Gaussian source with squared error distortion. Since the source $a$
is continuous-valued and $b$ is discrete-valued then for $D = 0$ the rate
distortion function must be infinite since no amount of information will ever be able
to reconstruct $a$ from $b$ with no errors, and the quantisation process, no matter what
discrete-valued representation is used, will always involve a loss of information. That
is, from Equation 2.68:

$$\lim_{D\to 0} R(D) = \lim_{D\to 0}\frac{1}{2}\log_2\frac{\sigma^2}{D} = \infty$$

It should be noted that this does not contradict Result 2.9 since that result implicitly
assumed that both $A$ and $B$ were discrete-valued.
Similarly, for $D \ge \sigma^2$ Equation 2.68 gives $R(D) = 0$, that is, $D_{max} = \sigma^2$.
This result can be confirmed intuitively by observing that if no information is provided
(i.e., the receiver or decoder does not have access to $b$) the best estimate for $a$ is its
mean value, from which the average squared error will, of course, be the variance, $\sigma^2$.
2.11 Exercises
1. A binary channel correctly transmits a 0 (as a 0) twice as many times as trans-
mitting it incorrectly (as a 1) and correctly transmits a 1 (as a 1) three times
more often than transmitting it incorrectly (as a 0). The input to the channel
can be assumed equiprobable.
(a) Sketch the communication channel that exists between Agent 101 and
her source.
(b) Before hearing the answer to the question what is Agent 101’s average
uncertainty about the answer?
(c) Agent 101 interprets the answer over the phone as no. What is her av-
erage uncertainty about the answer? Is she more uncertain or less
uncertain about the answer given her interpretation of what she heard?
Explain!
(d) Agent 101 now interprets the answer over the phone as yes. What is her
average uncertainty about the answer? Is she more uncertain or less
uncertain about the answer given her interpretation of what she heard?
Explain and compare this with the previous case.
[Channel diagram for this exercise: input probabilities $P(0) = 4/7$ and $P(1) = 3/7$; transition probabilities $P(0|0) = 3/5$ and $P(1|1) = 2/3$.]
(a) Prove that if a BSC is shorted (i.e., all the outputs are grounded to 0) then
the channel provides no information about the input.
(b) Consider the statement: “Surely if you know the information of the
source, $H(A)$, and the information provided by the channel, $I(A;B)$,
you will know the information of the output, $H(B)$?” Prove whether
this statement is true or not. If not, under what conditions, if any, would
it be true?
this statement is true or not. If not, under what conditions, if any, would
it be true?
(c) Conceptually state (in plain English) what
means.
9. Sketch a sample channel with inputs and outputs, find the expression for the
mutual information, the channel capacity and the input probability distribution
to achieve capacity, for the following cases:
[Channel diagram: an errors-and-erasures channel with inputs 0 and 1 and outputs 0, ? and 1; $P(0|0) = P(1|1) = p$, $P(?|0) = P(?|1) = q$ and $P(1|0) = P(0|1) = 1 - p - q$.]
Find all the values of $p$ and $q$ for which the above channel is:
The ternary input to the channel system has the following statistics:
, , and . The output of the channel is and
the output of channel is .
[Channel diagram: binary input A transmitted twice through an erasure channel with correct-reception probability $p$ and erasure probability $q$; output B is from the original transmission and output C is from the repeat transmission.]
(a)
(b)
(c)
15. Agent 01 contacts two of his sources by email for some straight “yes” and
“no” answers. The first source he contacts is known to be unreliable and to
give wrong answers about 30% of the time. Hence Agent 01 contacts his
second source to ask for the same information. Unfortunately, the second
source insists on using a non-standard email encoding and Agent 01 finds that
only 60% of the answers are intelligible.
(a) What is the average uncertainty Agent 01 has about the input given the
answers he receives from his first source? Hence what is the mutual
information of the first source?
(b) What is the average uncertainty Agent 01 has about the input given the
answers he receives from both the first and second source? Hence what
is the mutual information from both sources?
*16. In order to improve utilisation of a BSC, special input and output electron-
ics are designed so that the input to the BSC is sent twice and the output is
interpreted as follows:
Derive the expression for the mutual information through this augmented BSC
assuming equiprobable inputs and compare this with the mutual information
of a standard BSC. Either analytically or numerically show whether this aug-
mented BSC is superior to the standard BSC. What price is paid for this “im-
proved” performance?
17. The mutual information between two random variables, $X$ and $Y$, is defined
by Equation 2.47. Explain how $I(X;Y)$ provides a measure of the amount of
independence between $X$ and $Y$.
*18. A digital transmission channel consists of a terrestrial fibre-optic link with a
measured cross-over (bit error) probability of 0.1 followed by a satellite link
with an erasure probability of 0.2. No prior statistics regarding the source
of information being transmitted through the channel are available. What is
the average amount of information that can be resolved by the channel? In
order to improve the reliability of the channel it is proposed that the same data
being transmitted through the channel also be transmitted through a cheaper
and less reliable terrestrial/marine copper channel with cross-over probability
of 0.3. What is the average amount of information that can now be transmitted
through the combined channel system? Compare with your previous answer.
What is the cost of this improvement?
19. Consider the following errors-and-erasure channel:
[Channel diagram: an errors-and-erasures channel with inputs 0 and 1 and outputs 0, ? and 1; $P(0|0) = P(1|1) = p$, $P(?|0) = P(?|1) = q$ and $P(1|0) = P(0|1) = 1 - p - q$.]
Under what conditions will the channel be weakly symmetric? What is the
expression for the channel capacity when the channel is weakly symmetric?
[Channel diagram: binary-input errors-and-erasures channel with outputs 0, ? and 1; $P(0|0) = 0.7$, $P(?|0) = 0.2$, $P(1|0) = 0.1$, $P(0|1) = 0.1$, $P(?|1) = 0.3$ and $P(1|1) = 0.6$.]
[Channel diagram: channel from A to B in which input symbol 1 maps to output symbols 1 and 2 with probability 1/2 each.]
(a)
(b) ,
[Channel diagram: an errors-and-erasures channel with inputs 0 and 1 and outputs 0, ? and 1; $P(0|0) = P(1|1) = p$, $P(?|0) = P(?|1) = q$ and $P(1|0) = P(0|1) = 1 - p - q$.]
with input $A = \{a_1, a_2\}$ and output $B = \{b_1, b_2, b_3\}$. Define the
following single-letter distortion measure:
Assuming equiprobable inputs derive the expression for the average distortion,
$\bar{D}$. State the constrained minimisation problem that you need to solve to derive
the rate-distortion function, $R(D)$. Can you solve it?
28. For Qu. 27 what is $D_{max}$ and what are the values of $p$ and $q$ that yield $D_{max}$?
2.12 References
[1] S. Arimoto, An algorithm for calculating the capacity of an arbitrary discrete
memoryless channel, IEEE Trans. Inform. Theory, IT-18, 14-20, 1972.
[2] R. Blahut, Computation of channel capacity and rate distortion functions, IEEE
Trans. Inform. Theory, IT-18, 460-473, 1972.
[3] J.-F. Cardoso, Blind signal separation: statistical principles, Proceedings of the
IEEE, 86(10), 2009-2025, 1998.
[4] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley &
Sons, New York, 1991.
[5] R.G. Gallager, Information Theory and Reliable Communication, John Wiley
& Sons, New York, 1968.
[6] R.V.L. Hartley, Transmission of information, Bell System Tech. J., 7, 535-563,
1928.
[7] S. Haykin, Communication Systems, John Wiley & Sons, New York, 4th ed.,
2001.
[8] J. Jeong, J.C. Gore, and B.S. Peterson, Mutual information analysis of the
EEG in patients with Alzheimer’s disease, Clinical Neurophysiology, 112(5),
827-835, 2001.
[9] Y. Normandin, R. Cardin, and R. De Mori, High-performance connected digit
recognition using maximum mutual information, IEEE Trans. Speech and Au-
dio Processing, 2(2), 299-311, 1994.
[10] C.E. Shannon, A mathematical theory of communication, Bell System Tech. J.,
27, 379-423, 623-656, 1948.
[11] C.E. Shannon, Coding theorems for a discrete source with a fidelity criterion,
IRE Nat. Conv. Record, Part 4, 142-163, 1959.
[12] P. Viola, and W.M. Wells, Alignment by maximization of mutual information,
Int. J. of Comput. Vision, 24(2), 137-154, 1997.