Chapter 2

Information Channels

2.1 What Are Information Channels?

Information does not only have to be used or stored, it also has to be transmitted. In
communication systems a transmitter converts, or encodes, a message to a form suit-
able for transmission through a communications medium, be it a fibre optic channel,
satellite link or radio signal through space. The receiver then detects the transmitted
signal and decodes it back to the original message. The encoding and decoding oper-
ations will be discussed in the following chapters. Although we will be dealing with
digital data, physical transmission and storage have to deal with analog signals and
media. Thus digital data has to be modulated in some way prior to transmission and
storage, and detected by the receiver to reproduce the original digital sequence. Dis-
cussion of modern analog and digital communication systems is beyond the scope
of this book and interested readers should refer to the standard textbooks in the area,
for example [7].
As we will see, two forms of coding will be necessary. Source coding is required to
efficiently store and transmit the information generated by a general-purpose source
(e.g., ASCII text, images, audio, etc.) on the storage or transmission medium in use
(e.g., computer disk or digital data networks, both requiring conversion to binary
data). Channel coding is required to mitigate the effects of noise that may corrupt
data during storage and transmission through physical systems. What is noise? Noise
can be defined as any unwanted signal or effect in addition to the desired signal. As
such there can be many causes of noise: interference from other signals, thermal and
spurious effects generated by the electronic device being used to store or transmit the
signal, environmental disturbances, etc.
In Chapter 1 we were concerned with defining the information content of a source.
In this chapter we will be concerned with the equally important problem of defining
or measuring the information carrying capacity of a channel. Intuitively a channel
can carry no more information than is being pushed through the channel itself, that
is, the entropy of the source or message being transmitted through the channel. But
as we will see the presence of noise reduces the information carrying capacity of
the channel and leads us to define another important quantity (alongside entropy)
called mutual information. As we view the channel from an information theoretic


point of view not surprisingly the concept of mutual information has uses beyond the
information carrying capacity of a channel (as does entropy beyond measuring the
information content of a source).
The following assumptions will apply in our modelling of an information channel:

Stationary The statistical nature of the channel and noise do not change with time

Memoryless The behaviour of the channel and the effect of the noise at time t will
not depend on the behaviour of the channel or the effect of noise at any previ-
ous time

We now formally define a mathematical structure for an information channel.

DEFINITION 2.1 Information Channel An information channel is a triple $(A, B, P)$, where $A$ is the input alphabet, $B$ is the output alphabet and $P$ is the set of channel probabilities. $A = \{a_1, a_2, \ldots, a_r\}$ is a discrete set of $r$ symbols (where $r$ is the size of the input alphabet), and $B = \{b_1, b_2, \ldots, b_s\}$ is a discrete set of $s$ symbols. The transmission behaviour of the channel is described by the probabilities in $P = \{P(b_j \mid a_i) : i = 1, 2, \ldots, r; \; j = 1, 2, \ldots, s\}$, where $P(b_j \mid a_i)$ is the probability that the output symbol $b_j$ will be received if the input symbol $a_i$ is transmitted.

NOTE The input alphabet represents the symbols transmitted into the channel and
the output alphabet represents the symbols received from the channel. The definition
of the channel implies that the input and output symbols may be different. In real-
ity, one would expect that the received symbols are the same as those transmitted.
However the effect of noise may introduce “new” symbols and thus we use different
input and output alphabets to cater for such cases. For more general applications the
channel models the matching of the input symbols to prescribed output symbols or
classes which are usually different. In statistical applications the input and output
symbols arise from two random variables and the channel models the joint relationship
between the variables.

The conditional probabilities that describe an information channel can be represented conveniently using a matrix representation:

$$P = \begin{bmatrix} P(b_1|a_1) & P(b_2|a_1) & \cdots & P(b_s|a_1) \\ P(b_1|a_2) & P(b_2|a_2) & \cdots & P(b_s|a_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(b_1|a_r) & P(b_2|a_r) & \cdots & P(b_s|a_r) \end{bmatrix} \quad (2.1)$$

where $P$ is the channel matrix and for notational convenience we may sometimes rewrite this as:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1} & p_{r2} & \cdots & p_{rs} \end{bmatrix} \quad (2.2)$$

where we have defined $p_{ij} \equiv P(b_j | a_i)$.

A graphical representation of an information channel is given in Figure 2.1.

FIGURE 2.1
Graphical representation of an information channel.

The channel matrix exhibits the following properties and structure:

• Each row of $P$ contains the probabilities of all possible outputs from the same input to the channel

• Each column of $P$ contains the probabilities of all possible inputs to a particular output from the channel

• If we transmit the symbol $a_i$ we must receive an output symbol with probability 1, that is:

$$\sum_{j=1}^{s} P(b_j | a_i) = 1 \quad \text{for } i = 1, 2, \ldots, r \quad (2.3)$$

that is, the probability terms in each row must sum to 1.

EXAMPLE 2.1
Consider a binary source and channel with input alphabet $A = \{0, 1\}$ and output alphabet $B = \{0, 1\}$.

Noiseless: If the channel is noiseless there will be no error in transmission; the channel matrix is given by

$$P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and the channel diagram shows 0 transmitted to 0 and 1 transmitted to 1, each with probability 1.0.

A typical input-output sequence from this channel could be:

input:  0 1 1 0 0 1 0 1 1 0
output: 0 1 1 0 0 1 0 1 1 0

Noisy: Say the channel is noisy and introduces a bit inversion 1% of the time; then the channel matrix is given by

$$P = \begin{bmatrix} 0.99 & 0.01 \\ 0.01 & 0.99 \end{bmatrix}$$

and the channel diagram shows each bit received correctly with probability 0.99 and inverted with probability 0.01.

A typical input-output sequence from this channel could be:

input:  0 1 1 0 0 1 0 1 1 0
output: 0 1 1 0 0 1 1 1 1 0
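The noisy input-output behaviour above is easy to reproduce by simulation. The following minimal Python sketch (the function name and seed are our own choices, not from the text) passes a bit sequence through a channel that inverts each bit with probability 0.01.

```python
import random

def simulate_bsc(bits, q=0.01, seed=0):
    """Pass a bit sequence through a binary symmetric channel that
    inverts each bit independently with probability q."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < q else b for b in bits]

source = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
print("input: ", source)
print("output:", simulate_bsc(source, q=0.01))
```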

2.2 BSC and BEC Channels

In digital communication systems the input to the channel will be the binary digits
0,1 and this set will be the input alphabet and, ideally, also be the output alphabet.
Furthermore, the effect of noise will not depend on the transmission pattern, that is,

the channel is assumed memoryless. Two scenarios for the effect of noise are possible.
Ideally if there is no noise a transmitted 0 is detected by the receiver as a 0, and a
transmitted 1 is detected by the receiver as a 1. However in the presence of noise the
receiver may produce a different result.
The most common effect of noise is to force the detector to detect the wrong bit (bit
inversion), that is, a 0 is detected as a 1, and a 1 is detected as a 0. In this case the
information channel that arises is called a binary symmetric channel or BSC, where $q = P(0|1) = P(1|0)$ is the probability of error (also called the bit error probability, bit error rate (BER), or "crossover" probability) and the output alphabet is also the set of binary digits $\{0, 1\}$. The parameter $q$ fully defines the behaviour of the channel. The BSC is an important channel for digital communication systems as noise present in physical transmission media (fibre optic cable, copper wire, etc.) typically causes bit inversion errors in the receiver.

A BSC has channel matrix

$$P = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$

where $p = 1 - q$, and is depicted in Figure 2.2, which shows each bit received correctly with probability $p$ and inverted with probability $q$.

FIGURE 2.2
Binary symmetric channel.

Another effect that noise (or more usually, loss of signal) may have is to prevent
the receiver from deciding whether the symbol was a 0 or a 1. In this case the
output alphabet includes an additional symbol, ?, called the "erasure" symbol that denotes a bit that was not able to be detected. Thus for binary input $\{0, 1\}$, the output alphabet consists of the three symbols $\{0, ?, 1\}$. This information channel is called a binary erasure channel or BEC, where $q = P(?|0) = P(?|1)$ is the probability of error (also called the "erasure" probability). Strictly speaking a BEC does not model the effect of bit inversion; thus a transmitted bit is either received correctly (with probability $p = 1 - q$) or is received as an "erasure" (with probability $q$). A BEC is becoming an increasingly important model for wireless mobile and satellite communication channels, which suffer mainly from dropouts and loss of signal leading to the receiver failing to detect any signal.

A BEC has channel matrix

$$P = \begin{bmatrix} p & q & 0 \\ 0 & q & p \end{bmatrix}$$

and is depicted in Figure 2.3, which shows each input bit received correctly with probability $p$ or erased to ? with probability $q$.

FIGURE 2.3
Binary erasure channel.

2.3 Mutual Information


To fully specify the behaviour of an information channel it is necessary to specify the characteristics of the input as well as the channel matrix. We will assume that the input characteristics are described by a probability distribution over the input alphabet, with $P(a_i)$ denoting the probability of symbol $a_i$ being input to the channel. Then a channel will be fully specified if the input source probabilities, $\{P(a_1), P(a_2), \ldots, P(a_r)\}$, and channel probabilities, $\{P(b_j | a_i)\}$, are given.
If a channel is fully specified then the output probabilities, $\{P(b_1), P(b_2), \ldots, P(b_s)\}$, can be calculated by:

$$P(b_j) = \sum_{i=1}^{r} P(b_j | a_i) P(a_i) \quad (2.4)$$

The probabilities $P(b_j | a_i)$ are termed the forward probabilities, where forward indicates that the direction of channel use is with input symbol $a_i$ being transmitted and output symbol $b_j$ being received (i.e., $a_i$ then $b_j$, or $b_j$ given $a_i$). We can similarly define the backward probabilities as $P(a_i | b_j)$, indicating the channel is running backwards: output symbol $b_j$ occurs first followed by input symbol $a_i$ (i.e., $b_j$ then $a_i$, or $a_i$ given $b_j$). The backward probabilities can be calculated by application of Bayes' Theorem as follows:

$$P(a_i | b_j) = \frac{P(a_i, b_j)}{P(b_j)} = \frac{P(b_j | a_i) P(a_i)}{\sum_{i=1}^{r} P(b_j | a_i) P(a_i)} \quad (2.5)$$

where $P(a_i, b_j)$ is the joint probability of $a_i$ and $b_j$.

EXAMPLE 2.2
Consider the binary information channel fully specified by:

$$P(a = 0) = \frac{2}{3}, \quad P(a = 1) = \frac{1}{3}, \qquad P = \begin{bmatrix} \frac{3}{4} & \frac{1}{4} \\ \frac{1}{8} & \frac{7}{8} \end{bmatrix} \quad (2.6)$$

which is usually represented diagrammatically as a channel with $P(b=0|a=0) = 3/4$, $P(b=1|a=0) = 1/4$, $P(b=0|a=1) = 1/8$ and $P(b=1|a=1) = 7/8$, and with input probabilities $2/3$ and $1/3$ attached to the inputs 0 and 1.

The output probabilities are calculated as follows:

$$P(b=0) = P(b=0|a=0)P(a=0) + P(b=0|a=1)P(a=1) = \frac{3}{4} \cdot \frac{2}{3} + \frac{1}{8} \cdot \frac{1}{3} = \frac{13}{24}$$

$$P(b=1) = P(b=1|a=0)P(a=0) + P(b=1|a=1)P(a=1) = \frac{1}{4} \cdot \frac{2}{3} + \frac{7}{8} \cdot \frac{1}{3} = \frac{11}{24}$$

and the backward probabilities by:

$$P(a=0|b=0) = \frac{P(b=0|a=0)P(a=0)}{P(b=0)} = \frac{\frac{3}{4} \cdot \frac{2}{3}}{\frac{13}{24}} = \frac{12}{13}, \qquad P(a=1|b=0) = \frac{1}{13}$$

$$P(a=0|b=1) = \frac{P(b=1|a=0)P(a=0)}{P(b=1)} = \frac{\frac{1}{4} \cdot \frac{2}{3}}{\frac{11}{24}} = \frac{4}{11}, \qquad P(a=1|b=1) = \frac{7}{11}$$

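The forward and backward calculations of Equations 2.4 and 2.5 can be checked numerically. The following Python sketch (array names are ours, not from the text) reproduces the numbers of Example 2.2.

```python
import numpy as np

# Input distribution P(a) and channel matrix P(b|a) from Example 2.2
p_a = np.array([2/3, 1/3])
P = np.array([[3/4, 1/4],
              [1/8, 7/8]])

# Equation 2.4: P(b_j) = sum_i P(b_j|a_i) P(a_i)
p_b = p_a @ P
print("P(b):", p_b)                      # [13/24, 11/24]

# Equation 2.5 (Bayes): P(a_i|b_j) = P(b_j|a_i) P(a_i) / P(b_j)
backward = (P * p_a[:, None]) / p_b      # rows indexed by a_i, columns by b_j
print("P(a|b):\n", backward)             # [[12/13, 4/11], [1/13, 7/11]]
```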
Conceptually we can characterise the probabilities as a priori if they provide the


probability assignment before the channel is used (without any knowledge), and as a
posteriori if the probability assignment is provided after the channel is used (given
knowledge of the channel response). Specifically:
$P(b_j)$: a priori probability of output symbol $b_j$ if we do not know which input symbol was sent

$P(b_j | a_i)$: a posteriori probability of output symbol $b_j$ if we know that input symbol $a_i$ was sent

$P(a_i)$: a priori probability of input symbol $a_i$ if we do not know which output symbol was received

$P(a_i | b_j)$: a posteriori probability of input symbol $a_i$ if we know that output symbol $b_j$ was received

We can similarly refer to the a priori entropy of $A$:

$$H(A) = -\sum_{a \in A} P(a) \log P(a) \quad (2.7)$$

as the average uncertainty we have about the input before the channel output is observed, and the a posteriori entropy of $A$ given $b_j$:

$$H(A | b_j) = -\sum_{a \in A} P(a | b_j) \log P(a | b_j) \quad (2.8)$$

as the average uncertainty we have about the input after the channel output $b_j$ is observed.
How does our average uncertainty about the input change after observing the output
of the channel? Intuitively, we expect our uncertainty to be reduced as the channel
output provides us with knowledge and knowledge reduces uncertainty. However, as
we will see in the following example the output can sometimes increase our uncer-
tainty (i.e., be more of a hindrance than a help!).

EXAMPLE 2.3
Consider the binary information channel from Example 2.2, with input probabilities $P(a=0) = 2/3$, $P(a=1) = 1/3$ and channel probabilities $P(b=0|a=0) = 3/4$, $P(b=1|a=0) = 1/4$, $P(b=0|a=1) = 1/8$, $P(b=1|a=1) = 7/8$.

What is our uncertainty of the input that is transmitted through the channel before we observe an output from the channel? This is provided by the entropy based on the given a priori input probabilities, $P(a=0) = 2/3$ and $P(a=1) = 1/3$, yielding the a priori entropy of $A$, $H(A) = 0.92$ bits.
What is our uncertainty of the input that is transmitted through the channel after we observe an output from the channel? Say we observe an output of $b = 0$; then the a posteriori input probabilities, $P(a=0|b=0) = 12/13$ and $P(a=1|b=0) = 1/13$, yield an a posteriori entropy of $A$, $H(A|b=0) = 0.39$ bits. Thus we reduce our uncertainty once we observe an output of $b = 0$. But what if we observe an output of $b = 1$? Then the a posteriori input probabilities become $P(a=0|b=1) = 4/11$ and $P(a=1|b=1) = 7/11$ and the a posteriori entropy of $A$ is $H(A|b=1) = 0.95$ bits. Our uncertainty is in fact increased! The reason is that our high expectation of an input of 0, from the a priori probability $P(a=0) = 2/3$, is negated by receiving an output of 1. Thus $P(a=0|b=1) = 4/11$ is closer to equi-probable than $P(a=0) = 2/3$ and this increases the a posteriori entropy.
Notwithstanding the fact that $H(A|b=1) > H(A)$ even though $H(A|b=0) < H(A)$, we can show that if we average across all possible outputs the channel will indeed reduce our uncertainty, that is:

$$H(A|B) = \sum_{b \in B} P(b) H(A|b) = \frac{13}{24}(0.39) + \frac{11}{24}(0.95) = 0.65 \text{ bits}$$

and thus $H(A|B) < H(A)$.
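The a priori and a posteriori entropies of Example 2.3 can be verified with a short calculation; this is a sketch under the same channel specification, with helper names of our own choosing.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_a = np.array([2/3, 1/3])
P = np.array([[3/4, 1/4],
              [1/8, 7/8]])          # rows: inputs a, columns: outputs b

p_b = p_a @ P                        # Equation 2.4
backward = (P * p_a[:, None]) / p_b  # Equation 2.5, columns indexed by b

print("H(A)     =", entropy(p_a))                    # ~0.918 bits
print("H(A|b=0) =", entropy(backward[:, 0]))         # ~0.391 bits
print("H(A|b=1) =", entropy(backward[:, 1]))         # ~0.946 bits
print("H(A|B)   =", sum(p_b[j] * entropy(backward[:, j]) for j in range(2)))  # ~0.645 bits
```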

The average of the a posteriori entropies of $A$, $H(A|B)$, calculated in Example 2.3 is sometimes referred to as the equivocation of A with respect to B, where equivocation is used to refer to the fact that $H(A|B)$ measures the amount of uncertainty or equivocation we have about the input $A$ when observing the output $B$. Together with the a priori entropy of $A$, $H(A)$, we can now establish a measure of how well a channel transmits information from the input to the output. To derive this quantity consider the following interpretations:

$H(A)$: average uncertainty (or surprise) of the input to the channel before observing the channel output;

$H(A|B)$: average uncertainty (or equivocation) of the input to the channel after observing the channel output;

$H(A) - H(A|B)$: reduction in the average uncertainty of the input to the channel provided or resolved by the channel.

DEFINITION 2.2 Mutual Information For input alphabet $A$ and output alphabet $B$ the term

$$I(A; B) = H(A) - H(A|B) \quad (2.9)$$

is the mutual information between $A$ and $B$.

The mutual information, $I(A; B)$, indicates the information about $A$, $H(A)$, that is provided by the channel minus the degradation from the equivocation or uncertainty, $H(A|B)$. The $H(A|B)$ can be construed as a measure of the "noise" in the channel since the noise directly contributes to the amount of uncertainty we have about the channel input, $A$, given the channel output $B$.
Consider the following cases:

noisefree If $H(A|B) = 0$ this implies $I(A; B) = H(A)$, which means the channel is able to provide all the information there is about the input, i.e., $H(A)$. This is the best the channel will ever be able to do.

noisy If $0 < H(A|B) < H(A)$ then the channel is noisy and the input information, $H(A)$, is reduced by the noise, $H(A|B)$, so that the channel is only able to provide $I(A; B) = H(A) - H(A|B)$ amount of information about the input.

ambiguous If $H(A|B) = H(A)$ the amount of noise totally masks the contribution of the channel and the channel provides $I(A; B) = 0$ information about the input. In other words the channel is useless and is no better than if the channel was not there at all and the outputs were produced independently of the inputs!

An alternative expression to Equation 2.9 for the mutual information can be derived as follows:

$$I(A; B) = H(A) - H(A|B) = -\sum_{a \in A} P(a) \log P(a) + \sum_{a \in A} \sum_{b \in B} P(a, b) \log P(a|b)$$
$$= -\sum_{a \in A} \sum_{b \in B} P(a, b) \log P(a) + \sum_{a \in A} \sum_{b \in B} P(a, b) \log P(a|b) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a|b)}{P(a)} \quad (2.10)$$

Using Equation 2.5 the mutual information can be expressed more compactly:

RESULT 2.1 Alternative expressions for Mutual Information

$$I(A; B) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a|b)}{P(a)} = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a, b)}{P(a) P(b)} \quad (2.11)$$

2.3.1 Importance of Mutual Information

The mutual information has been defined in the context of measuring the information
carrying capacity of communication channels. However the concept of mutual infor-
mation has had far-reaching effects on solving difficult estimation and data analysis
problems in biomedical applications [8], image processing and signal processing. In
these applications the key in using mutual information is that it provides a measure
of the independence between two random variables or distributions.
In image processing [12] and speech recognition [9] the use of the maximum mutual
information or MMI between the observed data and available models has yielded
powerful strategies for training the models based on the data in a discriminative
fashion. In signal processing for communication systems the idea of minimising the
mutual information between the vector components for separating mutually interfer-
ing signals [3] has led to the creation of a new area of research for signal separation
based on the idea of independent component analysis or ICA.

2.3.2 Properties of the Mutual Information

From Example 2.3 we saw that for specific values of the output alphabet, $B$, either $H(A|b_j) < H(A)$ or $H(A|b_j) > H(A)$, but when we averaged over the output alphabet then $H(A|B) \le H(A)$. This implies that $I(A; B) \ge 0$ for this case. But what can we say about $I(A; B)$ for other cases? We restate the following result from Chapter 1:

$$\sum_{i} p_i \log \frac{q_i}{p_i} \le 0 \quad (2.12)$$

for two sources of the same size with symbol probabilities $p_i$ and $q_i$, respectively, and equality only if $p_i = q_i$ for all $i$. Let $p_i \to P(a, b)$ and $q_i \to P(a) P(b)$. Then from Equations 2.11 and 2.12 we get:

$$-I(A; B) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a) P(b)}{P(a, b)} \le 0 \quad (2.13)$$

That is:

RESULT 2.2
The mutual information is a non-negative quantity:

$$I(A; B) \ge 0 \quad (2.14)$$

with $I(A; B) = 0$ if and only if $P(a, b) = P(a) P(b)$, i.e., the input and output alphabets are statistically independent.

The expression for $I(A; B)$ provided by Equation 2.11 is symmetric in the variables $A$ and $B$. Thus by exchanging $A$ and $B$ we get the following result:

$$I(A; B) = I(B; A) \quad (2.15)$$

or

$$H(A) - H(A|B) = H(B) - H(B|A) \quad (2.16)$$

NOTE The information the channel provides about $A$ upon observing $B$ is the same as the information the channel provides about $B$ upon noting $A$ was sent.

The term $H(B|A)$ is sometimes referred to as the equivocation of B with respect to A and measures the uncertainty or equivocation we have about the output $B$ given the input $A$.

RESULT 2.3
From the fact that $I(A; B) = I(B; A) \ge 0$ it can be stated that, in general and on average, uncertainty is decreased when we know something, that is:

$$H(A) \ge H(A|B) \qquad H(B) \ge H(B|A) \quad (2.17)$$

Other expressions and relationships between the entropies, $H(A)$, $H(B)$, the equivocation, $H(A|B)$, $H(B|A)$, the joint entropy, $H(A, B)$, and the mutual information, $I(A; B)$, can be derived. All these relations can be summarised by the Venn diagram of Figure 2.4. From Figure 2.4 additional expressions involving the

FIGURE 2.4
Relationship between all entropy and mutual information expressions.

joint entropy can be derived:

$$H(A, B) = H(A) + H(B) - I(A; B) = H(A) + H(B|A) = H(B) + H(A|B) \quad (2.18)$$

The relation $H(A, B) = H(A) + H(B) - I(A; B)$ can be stated conceptually as:

"The total uncertainty in both $A$ and $B$, $H(A, B)$, is the sum of the uncertainties in $A$ and $B$, $H(A) + H(B)$, minus the information provided by the channel, $I(A; B)$."

whereas the relation $H(A, B) = H(A) + H(B|A)$ becomes:

"The total uncertainty in both $A$ and $B$, $H(A, B)$, is the sum of the uncertainty in $A$ plus the remaining uncertainty in $B$ after we are given $A$."

REMARK 2.1 Given the input probabilities, $P(a_i)$, and channel matrix, $P$, only three quantities need to be calculated directly from the individual probabilities to completely determine the Venn diagram of Figure 2.4 and the remaining three can be derived from the existing entropy expressions.

EXAMPLE 2.4
From Example 2.3 we calculated:

$$H(A) = 0.92 \text{ bits} \qquad H(A|B) = 0.65 \text{ bits}$$

and from Example 2.2 we can calculate:

$$H(B) = -\sum_{b \in B} P(b) \log P(b) = -\frac{13}{24} \log \frac{13}{24} - \frac{11}{24} \log \frac{11}{24} = 0.99 \text{ bits}$$

The quantities $H(A)$, $H(B)$ and $H(A|B)$ completely determine the Venn diagram and the remaining quantities can be derived by:

$$I(A; B) = H(A) - H(A|B) = 0.27 \text{ bits}$$

$$H(B|A) = H(B) - I(A; B) = 0.72 \text{ bits}$$

$$H(A, B) = H(A) + H(B) - I(A; B) = 1.64 \text{ bits}$$
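All six quantities in the Venn diagram of Figure 2.4 follow from the joint distribution $P(a, b)$. A minimal sketch for the channel of Example 2.2 (function and variable names are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_a = np.array([2/3, 1/3])
P = np.array([[3/4, 1/4],
              [1/8, 7/8]])

joint = P * p_a[:, None]               # P(a, b)
p_b = joint.sum(axis=0)                # P(b)

H_A = entropy(p_a)
H_B = entropy(p_b)
H_AB = entropy(joint.ravel())          # joint entropy H(A,B)
I = H_A + H_B - H_AB                   # mutual information
print(f"H(A)={H_A:.3f}  H(B)={H_B:.3f}  H(A,B)={H_AB:.3f}")
print(f"I(A;B)={I:.3f}  H(A|B)={H_AB - H_B:.3f}  H(B|A)={H_AB - H_A:.3f}")
```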

The mutual information of a BSC can be expressed algebraically by defining:

$$P(a = 0) = \omega \qquad P(a = 1) = 1 - \omega = \bar{\omega}$$

and hence:

$$P(b = 0 | a = 0) = P(b = 1 | a = 1) = p \qquad P(b = 1 | a = 0) = P(b = 0 | a = 1) = q$$

Then the output probabilities are given by:

$$P(b = 0) = \omega p + \bar{\omega} q \qquad P(b = 1) = \omega q + \bar{\omega} p$$

A sketch of the BSC showing both input and output probabilities is given in Figure 2.5.

FIGURE 2.5
Mutual information of a BSC.

The expression for $H(B)$ is:

$$H(B) = -(\omega p + \bar{\omega} q) \log(\omega p + \bar{\omega} q) - (\omega q + \bar{\omega} p) \log(\omega q + \bar{\omega} p) \quad (2.19)$$

The simplified expression for $H(B|A)$ is:

$$H(B|A) = -\sum_{a \in A} \sum_{b \in B} P(a) P(b|a) \log P(b|a) = -\sum_{a \in A} P(a) \left( p \log p + q \log q \right) = -(p \log p + q \log q) \quad (2.20)$$

and the mutual information is: $I(A; B) = H(B) - H(B|A)$.

EXAMPLE 2.5
What is the mutual information of a BSC for the following cases?

1. $p = q = 0.5$: The channel operates in an ambiguous manner since errors are as likely as no errors, the output symbols are equally likely, $H(B) = 1$, no matter what is happening at the input (since $P(b=0) = P(b=1) = 0.5$), the equivocation is $H(B|A) = 1$ and the mutual information is $I(A; B) = 0$.

2. $q = 0$: The channel is noisefree, $P(b = 0) = \omega$, $P(b = 1) = \bar{\omega}$, $H(B) = H(A)$, $H(B|A) = 0$ and the mutual information is $I(A; B) = H(B) = H(A)$.

3. $\omega = 0.5$: The source exhibits maximum entropy with $H(A) = 1$, the output entropy also exhibits maximum entropy with $H(B) = 1$ (since $P(b=0) = P(b=1) = 0.5$) and the mutual information is given by $I(A; B) = H(B) - H(B|A) = 1 + p \log p + q \log q$.

4. $\omega = 1$: The source contains no information since $H(A) = 0$, the output entropy, $H(B)$, is the same as the channel uncertainty, $H(B|A)$, and the mutual information is $I(A; B) = 0$ since there is no information to transmit.
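The four cases above can be checked directly from $I(A; B) = H(B) - H(B|A)$. A small sketch, with our own helper names, assuming base-2 logarithms:

```python
import numpy as np

def h(p):
    """Entropy (bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def bsc_mutual_information(w, q):
    """I(A;B) = H(B) - H(B|A) for a BSC with P(a=0)=w and error probability q."""
    p = 1 - q
    p_b = np.array([w * p + (1 - w) * q, w * q + (1 - w) * p])  # output distribution
    return h(p_b) - h([p, q])                                    # H(B|A) = -(p log p + q log q)

print(bsc_mutual_information(0.5, 0.5))   # case 1: 0
print(bsc_mutual_information(0.3, 0.0))   # case 2: equals H(A) = h([0.3, 0.7])
print(bsc_mutual_information(0.5, 0.1))   # case 3: 1 + p log p + q log q ~ 0.531
print(bsc_mutual_information(1.0, 0.1))   # case 4: 0
```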

The mutual information of a BEC can be similarly expressed algebraically. Adopting the same notation used for the BSC, the output probabilities of the BEC are given by:

$$P(b = 0) = \omega p \qquad P(b = 1) = \bar{\omega} p \qquad P(b = ?) = \omega q + \bar{\omega} q = q$$

A sketch of the BEC showing both input and output probabilities is given in Figure 2.6.
The expression for $H(B)$ is:

$$H(B) = -\omega p \log(\omega p) - \bar{\omega} p \log(\bar{\omega} p) - q \log q \quad (2.21)$$

The simplified expression for $H(B|A)$ reduces to the same expression as Equation 2.20:

$$H(B|A) = -(p \log p + q \log q) \quad (2.22)$$

and the mutual information is: $I(A; B) = H(B) - H(B|A)$.

FIGURE 2.6
Mutual information of a BEC.
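The BEC expression can be evaluated in the same way; the cross-check against $p H(A)$ below anticipates the simplification used in Section 2.7.2 (function names are ours).

```python
import numpy as np

def h(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def bec_mutual_information(w, q):
    """I(A;B) = H(B) - H(B|A) for a BEC with P(a=0)=w and erasure probability q."""
    p = 1 - q
    p_b = np.array([w * p, (1 - w) * p, q])   # P(b=0), P(b=1), P(b=?)
    return h(p_b) - h([p, q])

w, q = 0.3, 0.2
print(bec_mutual_information(w, q))           # equals (1-q) * H(A)
print((1 - q) * h([w, 1 - w]))                # cross-check
```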

2.4 Noiseless and Deterministic Channels

We have already discussed the BSC and BEC structures as important channel models
for describing modern digital communication systems. We have also loosely referred
to channels as being noisefree, noisy and ambiguous. We now formally define noise-
less channels as those channels that are not subject to the effects of noise (i.e., no
uncertainty of the input that was transmitted given the output that was received). We
also define a dual class of channels called deterministic channels for which we can
determine the output that will be received given the input that was transmitted.

2.4.1 Noiseless Channels

A noiseless channel will have either the same or possibly more output symbols than
input symbols and is such that there is no noise, ambiguity, or uncertainty of which
input caused the output.

DEFINITION 2.3 Noiseless Channel A channel in which there are at least


as many output symbols as input symbols, but in which each of the output symbols
can be produced by the occurrence only of a particular one of the input symbols
is called a noiseless channel. The channel matrix of a noiseless channel has the
property that there is one, and only one, non-zero element in each column.

EXAMPLE 2.6
The following channel with 6 outputs and 3 inputs is noiseless because we know, with certainty 1, which input symbol, $a_i$, $i = 1, 2, 3$, was transmitted given knowledge of the received output symbol, $b_j$, $j = 1, 2, \ldots, 6$.
The channel diagram has input $a_1$ connected to outputs $b_1$ and $b_2$ with probabilities 1/3 and 2/3, input $a_2$ connected to outputs $b_3$, $b_4$ and $b_5$ with probabilities 1/4, 1/2 and 1/4, and input $a_3$ connected to output $b_6$ with probability 1.

The corresponding channel matrix is given by:

$$P = \begin{bmatrix} \frac{1}{3} & \frac{2}{3} & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

and we note that there is only one non-zero element in each column.

What can we say about the mutual information through a noiseless channel? Let $b_j$ be the received output. Then, from the definition of a noiseless channel, we know which input, say $a_{i(j)}$, was transmitted and we know this with certainty 1. That is, $P(a_i | b_j) = 1$ for $i = i(j)$ and hence $P(a_i | b_j) = 0$ for all other $i$. The equivocation $H(A|B)$ becomes:

$$H(A|B) = -\sum_{b \in B} P(b) \sum_{a \in A} P(a|b) \log P(a|b) = 0$$

since for each $b_j$ the inner summation contains only one non-zero term, $P(a_{i(j)} | b_j) = 1$, and hence

$$\sum_{a \in A} P(a | b_j) \log P(a | b_j) = 1 \log 1 = 0.$$

The mutual information is then given by the following result.

RESULT 2.4
The mutual information for noiseless channels is given by:

$$I(A; B) = H(A) - H(A|B) = H(A) \quad (2.23)$$

That is, the amount of information provided by the channel is the same as the information sent through the channel.

2.4.2 Deterministic Channels

A deterministic channel will have either the same or possibly more input symbols
than output symbols and is such that we can determine which output symbol will be
received when a particular input symbol is transmitted.

DEFINITION 2.4 Deterministic Channel A channel in which there are at least


as many input symbols as output symbols, but in which each of the input symbols
is capable of producing only one of the output symbols is called a deterministic
channel. The channel matrix of a deterministic channel has the property that there
is one, and only one, non-zero element in each row, and since the entries along each
row must sum to 1, that non-zero element is equal to 1.

EXAMPLE 2.7

The following channel with 3 outputs and 6 inputs is deterministic because we know, with certainty 1, which output symbol, $b_j$, $j = 1, 2, 3$, will be received given knowledge of the transmitted input symbol, $a_i$, $i = 1, 2, \ldots, 6$.

The channel diagram has inputs $a_1$ and $a_2$ connected to output $b_1$, inputs $a_3$, $a_4$ and $a_5$ connected to output $b_2$, and input $a_6$ connected to output $b_3$, each with probability 1.

The corresponding channel matrix is given by:

$$P = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and we note that there is only one non-zero element in each row and that the element is 1.

What is the mutual information of a deterministic channel? Let $a_i$ be the transmitted input symbol. Then, from the definition of a deterministic channel, we know that the received output will be, say, $b_{j(i)}$ and we know this with certainty 1. That is, $P(b_j | a_i) = 1$ for $j = j(i)$ and hence $P(b_j | a_i) = 0$ for all other $j$. The equivocation $H(B|A)$ then becomes:

$$H(B|A) = -\sum_{a \in A} P(a) \sum_{b \in B} P(b|a) \log P(b|a) = 0$$

since $\sum_{b \in B} P(b | a_i) \log P(b | a_i) = 1 \log 1 = 0$.
The mutual information is given by the following result.
The mutual information is given by the following result.

RESULT 2.5
The mutual information for deterministic channels is given by:

$$I(A; B) = H(B) - H(B|A) = H(B) \quad (2.24)$$

That is, the amount of information provided by the channel is the same as the information produced by the channel output.

2.5 Cascaded Channels


In most typical cases of information transmission and storage the data are passed
through a cascade of different channels rather than through just the one channel.

One example of where this happens is in modern data communication systems where
data can be sent through different physical transmission media links (e.g., copper
wire, optical fibre) and wireless media links (e.g., satellite, radio) from transmitter
to receiver. Each of these links can be modelled as an independent channel and
the complete transmission path as a cascade of such channels. What happens to
the information when passed through a cascade of channels as compared to a single
channel only?
Intuitively we would expect additive loss of information arising from the cumulative
effects of uncertainty (or equivocation) from each channel in the cascade. Only for
the case of a noiseless channel would we expect no additional loss of information in
passing data through that channel. We verify these and other results when, without
loss of generality, we derive the mutual information of a cascade of two channels and
compare this with the mutual information through the first channel.
Consider a pair of channels in cascade as shown in Figure 2.7, where the input alphabet $A$ ($r$ symbols) feeds channel AB, whose output alphabet $B$ ($s$ symbols) feeds channel BC, producing the output alphabet $C$ ($t$ symbols).

FIGURE 2.7
Cascade of two channels.

In Figure 2.7 the output of channel AB is connected to the input of channel BC. Thus the symbol alphabet $B$ of size $s$ is both the output from channel AB and the input to channel BC. Say the input symbol $a_i$ is transmitted through channel AB and this produces $b_j$ as the output from channel AB. Then $b_j$ forms the input to channel BC which, in turn, produces $c_k$ as the output from channel BC. The output $c_k$ depends solely on $b_j$, not on $a_i$. Thus we can define a cascade of channels as occurring when the following condition holds:

$$P(c_k | b_j, a_i) = P(c_k | b_j) \quad (2.25)$$

Similarly we can also state that the following will also be true for a cascade of channels:

$$P(a_i | b_j, c_k) = P(a_i | b_j) \quad (2.26)$$

The problem of interest is comparing the mutual information through channel AB only, $I(A; B)$, with the mutual information through the cascade of channel AB with channel BC, that is, $I(A; C)$. To this end we first show that $H(A|C) \ge H(A|B)$ as follows:

$$H(A|C) - H(A|B) = -\sum_{a \in A} \sum_{c \in C} P(a, c) \log P(a|c) + \sum_{a \in A} \sum_{b \in B} P(a, b) \log P(a|b)$$
$$= \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} P(a, b, c) \log \frac{P(a|b)}{P(a|c)} \quad (2.27)$$

Equation 2.26 gives $P(a|b) = P(a|b, c)$, noting that $P(a, b, c) = P(a|b, c) P(b, c)$, and given that $\ln x \le x - 1$ with equality when $x = 1$, we can now state:

$$H(A|B) - H(A|C) = \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} P(a|b, c) P(b, c) \log \frac{P(a|c)}{P(a|b, c)}$$
$$\le \log e \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} P(b, c) \left[ P(a|c) - P(a|b, c) \right] = 0 \quad (2.28)$$

with equality when $P(a|c) = P(a|b)$ such that $P(b, c) \ne 0$.
Since $I(A; B) = H(A) - H(A|B)$ and $I(A; C) = H(A) - H(A|C)$ we have the following result.

RESULT 2.6
For the cascade of channel AB with channel BC it is true that:

$$I(A; C) \le I(A; B) \quad (2.29)$$

That is, channels tend to leak information and the amount of information out of a cascade can be no greater (and is usually less) than the information from the first channel.

CLAIM 2.1
If channel BC is noiseless then $I(A; C) = I(A; B)$.

PROOF For noiseless channel BC, if $P(b_j, c_k) \ne 0$ then this implies that $P(b_j | c_k) = 1$. For the cascade of channel AB with BC the condition for $I(A; C) = I(A; B)$ when $P(b_j, c_k) \ne 0$ is $P(a_i | c_k) = P(a_i | b_j)$. From Bayes' Theorem we can show that:

$$P(a_i | c_k) = \sum_{j} P(a_i | b_j, c_k) P(b_j | c_k) \quad (2.30)$$

For a cascade we know that $P(a_i | b_j, c_k) = P(a_i | b_j)$ and since $P(b_j | c_k) = 1$ when $P(b_j, c_k) \ne 0$, and $P(b_j | c_k) = 0$ otherwise, for noiseless channel BC this gives the required result that:

$$P(a_i | c_k) = P(a_i | b_j) \quad (2.31)$$

and hence $I(A; C) = I(A; B)$.

The converse of Claim 2.1, however, is not true since, surprisingly, there may exist particular channel combinations and input distributions which give rise to $I(A; C) = I(A; B)$ even if channel BC is not noiseless. The following example illustrates this point.

EXAMPLE 2.8
Consider the cascade of channel AB (two inputs and three outputs, with transition probabilities 1/2, 1/2 from the first input and 1/3, 1/3, 1/3 from the second) with channel BC (three inputs and three outputs, with transition probabilities 3/4, 1/4; 1; and 1/4, 3/4), which produces the cascaded channel AC.

The corresponding channel matrices for AB and BC are:

$$P_{AB} = \begin{bmatrix} \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix} \qquad P_{BC} = \begin{bmatrix} \frac{3}{4} & 0 & \frac{1}{4} \\ 0 & 1 & 0 \\ \frac{1}{4} & 0 & \frac{3}{4} \end{bmatrix}$$

Obviously channel BC is not noiseless. Nevertheless it is true that:

$$I(A; C) = I(A; B)$$

by virtue of the fact that the channel matrix for the cascaded channel AC:

$$P_{AC} = P_{AB} P_{BC} = \begin{bmatrix} \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}$$

is identical to the channel matrix for channel AB!
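Assuming the channel matrices as reconstructed above, the cascade property is a one-line matrix product, since $P(c_k | a_i) = \sum_j P(b_j | a_i) P(c_k | b_j)$:

```python
import numpy as np

P_AB = np.array([[1/2, 0,   1/2],
                 [1/3, 1/3, 1/3]])
P_BC = np.array([[3/4, 0, 1/4],
                 [0,   1, 0  ],
                 [1/4, 0, 3/4]])

# The cascaded channel matrix is the product of the individual channel matrices
P_AC = P_AB @ P_BC
print(P_AC)                              # identical to P_AB
print(np.allclose(P_AC, P_AB))           # True
```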

2.6 Additivity of Mutual Information


In the previous section it was shown that when information channels are cascaded
there is a tendency for information loss, unless the channels are noiseless. Of par-
ticular interest to communication engineers is the problem of how to reduce infor-
mation loss, especially when confronted with transmission through a noisy channel.
The practical outcome of this is the development of channel codes for reliable trans-
mission (channel codes are discussed in Chapters 5 to 9). From a purely information
theoretic point of view channel coding represents a form of additivity of mutual infor-
mation. Additivity is achieved when we consider the average information provided
by the channel about a single input symbol upon observing a succession of output
symbols. Such a multiplicity of outputs can occur spatially or temporally. Spatial
multiplicity occurs when the same input is transmitted simultaneously through more
than one channel, thereby producing as many outputs as there are channels. Temporal
multiplicity occurs when the same input is repeatedly transmitted through the same
channel, thereby producing as many outputs as repeat transmissions. Both cases
minimise information loss by exploiting redundancy, either extra channel space or
extra time to repeat the same input. We now develop an expression for the mutual
information for the special case of two channel outputs (original plus repeat) and
show that there is a gain of mutual information over using the single original channel
output.
Consider the channel system shown in Figure 2.8, where $A$ is the input alphabet of size $r$, $B$ is the original output alphabet of size $s$ and $C$ is the repeat output alphabet of size $t$. The output $B$ can be construed as the output from the first channel AB (if there are two physical channels) or first transmission (if there is one physical channel) and output $C$ can be construed as the output of the second channel AC or second (repeat) transmission.

FIGURE 2.8
Channel system with two outputs.
To develop an expression for the mutual information we extend our notion of a priori and a posteriori probabilities and entropies as discussed in Section 2.3 to include the contribution of a second output as follows:

$P(a_i)$: a priori probability of input symbol $a_i$ if we do not know which output symbol was received

$P(a_i | b_j)$: a posteriori probability of input symbol $a_i$ if we know that output symbol $b_j$ was received

$P(a_i | b_j, c_k)$: a posteriori probability of input symbol $a_i$ if we know that both output symbols, $b_j$ and $c_k$, were received.

Thus we can define the following a posteriori entropy:

$$H(A | b_j, c_k) = -\sum_{a \in A} P(a | b_j, c_k) \log P(a | b_j, c_k) \quad (2.32)$$

the equivocation of $A$ with respect to $B$ and $C$ by:

$$H(A | B, C) = -\sum_{b \in B} \sum_{c \in C} P(b, c) \sum_{a \in A} P(a | b, c) \log P(a | b, c) \quad (2.33)$$

and the amount of information channels AB and AC provide about $A$, that is the mutual information of $A$ and $(B, C)$, by:

$$I(A; B, C) = H(A) - H(A | B, C) \quad (2.34)$$

What can we say about $I(A; B) = H(A) - H(A|B)$, the amount of information channel AB provides about $A$, in comparison to $I(A; B, C)$, the amount of information both channels AB and AC provide about $A$? Expand Equation 2.34 as follows:

$$I(A; B, C) = H(A) - H(A|B) + H(A|B) - H(A|B, C)$$
$$= I(A; B) + \left[ H(A|B) - H(A|B, C) \right] = I(A; B) + I(A; C|B) \quad (2.35)$$

where $I(A; C|B) = H(A|B) - H(A|B, C)$ is the amount of information channel AC additionally provides about $A$ after using channel AB. It can be shown that:

$$I(A; C|B) = H(A|B) - H(A|B, C) \ge 0 \quad (2.36)$$

Thus we have the following result.

RESULT 2.7
For a channel with input $A$ and dual outputs $B$ and $C$ it is true that:

$$I(A; B, C) \ge I(A; B) \quad (2.37)$$

That is, dual use of a channel provides more information than the single use of a channel.

EXAMPLE 2.9
Consider a BSC where the input symbol, $a$, is transmitted through the channel twice. Let $b$ represent the output from the original transmission and let $c$ be the output from the repeat transmission. Since the same channel is used:

$$P(b|a) = P(c|a)$$

For simplicity we assume that $\omega = P(a = 0) = P(a = 1) = \frac{1}{2}$. From Example 2.5 we have that for $\omega = \frac{1}{2}$:

$$I(A; B) = 1 + p \log p + q \log q$$

What is the expression for $I(A; B, C)$ and how does it compare to $I(A; B)$? A direct extension of Equation 2.11 yields the following expression for $I(A; B, C)$:

$$I(A; B, C) = \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} P(a, b, c) \log \frac{P(b, c | a)}{P(b, c)} \quad (2.38)$$

from which we can state that:

$$P(b, c) = \sum_{a \in A} P(a) P(b, c | a) \qquad P(a, b, c) = P(a) P(b, c | a) = P(a) P(b | a) P(c | a)$$

where we note that $P(b, c | a) = P(b|a) P(c|a)$ since the repeat output does not depend on the original output. We list the contribution of each term in the expression of Equation 2.38 as shown by Table 2.1.

Table 2.1 Individual terms of $I(A; B, C)$ expression

  a   b   c   P(a, b, c)    log [P(b, c | a) / P(b, c)]     Type
  0   0   0   p^2 / 2       log [2p^2 / (p^2 + q^2)]        X
  1   1   1   p^2 / 2       log [2p^2 / (p^2 + q^2)]        X
  0   0   1   pq / 2        log 1 = 0                       Z
  0   1   0   pq / 2        log 1 = 0                       Z
  1   0   1   pq / 2        log 1 = 0                       Z
  1   1   0   pq / 2        log 1 = 0                       Z
  0   1   1   q^2 / 2       log [2q^2 / (p^2 + q^2)]        Y
  1   0   0   q^2 / 2       log [2q^2 / (p^2 + q^2)]        Y

The Type column assigns a class label to each term. The type X terms represent the case of no error in both the original and repeat outputs and these positively reinforce the information provided by the dual use of the channel. Collecting the type X terms we get:

$$2 \cdot \frac{p^2}{2} \log \frac{2p^2}{p^2 + q^2} = p^2 \log \frac{2p^2}{p^2 + q^2}$$

The type Y terms represent the case of complete error in both the original and repeat outputs and these negatively reinforce the information provided by the dual use of the channel. Collecting the type Y terms we get:

$$2 \cdot \frac{q^2}{2} \log \frac{2q^2}{p^2 + q^2} = q^2 \log \frac{2q^2}{p^2 + q^2}$$

The type Z terms, however, indicate contradictory original and repeat outputs which effectively negate any information that the dual use of the channel may provide. Collecting the type Z terms we see that these make no contribution to the mutual information:

$$4 \cdot \frac{pq}{2} \log 1 = 0$$

Putting it all together we get the final expression for $I(A; B, C)$:

$$I(A; B, C) = p^2 \log \frac{2p^2}{p^2 + q^2} + q^2 \log \frac{2q^2}{p^2 + q^2}$$

By plotting $I(A; B)$ and $I(A; B, C)$ for different values of $q$ we see from Figure 2.9 that $I(A; B, C) \ge I(A; B)$.

FIGURE 2.9
Comparison of $I(A; B)$ with $I(A; B, C)$ as a function of the error probability $q$.

It is interesting to note the conditions for equality, $I(A; B, C) = I(A; B)$. This happens when:

• $q = 0$: there is 100% no error, so $I(A; B, C) = I(A; B) = 1$

• $q = 1$: there is 100% bit reversal, so $I(A; B, C) = I(A; B) = 1$

• $q = 0.5$: there is total ambiguity, so $I(A; B, C) = I(A; B) = 0$
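The closed-form expressions for $I(A; B)$ and $I(A; B, C)$ can be compared numerically for any error probability $q$; the following sketch (our own function, with the type Z terms already dropped) reproduces the behaviour plotted in Figure 2.9.

```python
import numpy as np

def h(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def repeat_bsc_information(q):
    """Compare I(A;B) with I(A;B,C) for a BSC with error probability q,
    equiprobable inputs, and an independent repeat transmission."""
    p = 1 - q
    I_AB = 1 - h([p, q])                  # 1 + p log p + q log q
    s = p**2 + q**2
    I_ABC = 0.0                           # collect the type X and type Y terms
    if p > 0:
        I_ABC += p**2 * np.log2(2 * p**2 / s)
    if q > 0:
        I_ABC += q**2 * np.log2(2 * q**2 / s)
    return I_AB, I_ABC

for q in (0.0, 0.1, 0.25, 0.5):
    print(q, repeat_bsc_information(q))
```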

2.7 Channel Capacity: Maximum Mutual Information


Consider an information channel with input alphabet $A$, output alphabet $B$ and channel matrix $P$ with conditional channel probabilities $P(b_j | a_i)$. The mutual information:

$$I(A; B) = H(A) - H(A|B) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a|b)}{P(a)} \quad (2.39)$$

which, if we now assume the logarithm is base 2, indicates the amount of information the channel is able to carry in bits per symbol transmitted. The maximum amount of information carrying capacity for the channel is $H(A)$, the amount of information that is being transmitted through the channel. But this is reduced by $H(A|B)$, which is an indication of the amount of "noise" present in the channel.
The expression for mutual information depends not only on the channel probabilities, $P(b_j | a_i)$, which uniquely identify a channel, but also on how the channel is used, the input or source probability assignment, $P(a_i)$. As such $I(A; B)$ cannot be used to provide a unique and comparative measure of the information carrying capacity of a channel since it depends on how the channel is used. One solution is to ensure that the same probability assignment (or input distribution) is used in calculating the mutual information for different channels. The question then is: which probability assignment should be used? Obviously we can't use an input distribution with $H(A) = 0$ since that means $I(A; B) = 0$ for whatever channel we use! Intuitively an input distribution with maximum information content (i.e., maximum $H(A)$) makes sense. Although this allows us to compare different channels the comparison will not be fair since another input distribution may permit certain channels to exhibit a higher value of mutual information.
The usual solution is to allow the input distribution to vary for each channel and to
determine the input distribution that produces the maximum mutual information for
that channel. That is we attempt to calculate the maximum amount of information a
channel can carry in any single use (or source assignment) of that channel, and we
refer to this measure as the capacity of the channel.

DEFINITION 2.5 Channel Capacity The maximum average mutual information, $I(A; B)$, in any single use of a channel defines the channel capacity. Mathematically, the channel capacity, $C$, is defined as:

$$C = \max_{P(A)} I(A; B) \quad (2.40)$$

that is, the maximum mutual information over all possible input probability assignments, $P(A)$.

The channel capacity has the following properties:

1. $C \ge 0$ since $I(A; B) \ge 0$

2. $C \le \log r$ and $C \le \log s$, since $C = \max I(A; B) \le \max H(A) = \log r$ when considering the expression $I(A; B) = H(A) - H(A|B)$, and $C = \max I(A; B) \le \max H(B) = \log s$ when considering the expression $I(A; B) = H(B) - H(B|A)$

The calculation of $C$ involves maximisation of $I(A; B)$ over $r$ independent variables (the input probabilities, $P(a_1), P(a_2), \ldots, P(a_r)$) subject to the two constraints:

1. $P(a_i) \ge 0$

2. $\sum_{i=1}^{r} P(a_i) = 1$

This constrained maximisation of a non-linear function is not a trivial task. Methods that can be used include:

• standard constrained maximisation techniques like the method of Lagrange multipliers

• gradient search algorithms

• derivation for special cases (e.g., weakly symmetric channels)

• the iterative algorithms developed by Arimoto [1] and Blahut [2]; a sketch of this style of alternating optimisation is given below

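A minimal sketch of the Blahut-Arimoto style of alternating optimisation mentioned in the last item follows; the function name, tolerance and iteration limit are our own choices, not prescribed by the text. For the BSC shown, the iteration converges to the uniform input distribution and the capacity derived in Section 2.7.1.

```python
import numpy as np

def _rows_kl(P, q):
    """D[i] = sum_j P(b_j|a_i) log2( P(b_j|a_i) / q_j ), treating 0 log 0 as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / q), 0.0)
    return terms.sum(axis=1)

def blahut_arimoto(P, tol=1e-9, max_iter=10_000):
    """Estimate the capacity (in bits) of a channel with matrix P[i, j] = P(b_j | a_i)."""
    r, _ = P.shape
    p = np.full(r, 1.0 / r)              # start from a uniform input distribution
    for _ in range(max_iter):
        D = _rows_kl(P, p @ P)           # p @ P is the output distribution P(b)
        new_p = p * np.exp2(D)
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p)) < tol:
            p = new_p
            break
        p = new_p
    return float(p @ _rows_kl(P, p @ P)), p

# BSC with error probability q = 0.1: capacity should be 1 - H(0.1) ~ 0.531 bits
P_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(blahut_arimoto(P_bsc))
```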
2.7.1 Channel Capacity of a BSC

For a BSC, $I(A; B) = H(B) - H(B|A)$ where $H(B)$ is given by Equation 2.19 and $H(B|A)$ is given by Equation 2.20. We want to examine how the mutual information varies with different uses of the same channel. For the same channel the channel probabilities, $p$ and $q$, remain constant. However, for different uses the input probability, $\omega = P(a = 0)$, varies from 0 to 1. From Figure 2.10 the maximum mutual information occurs at $\omega = 0.5$.
For $\omega = 0.5$ the mutual information expression simplifies to:

$$I(A; B) \big|_{\omega = 0.5} = 1 + p \log p + q \log q \quad (2.41)$$

Since this represents the maximum possible mutual information we can then state:

$$C = 1 + p \log p + q \log q \quad (2.42)$$

How does the channel capacity vary for different error probabilities, $q$? From Figure 2.11 the following observations can be made:
FIGURE 2.10
Variation of $I(A; B)$ with different source assignments for a typical BSC.

When $q = 0$ or $q = 1$, that is, no error or 100% bit inversion, the BSC channel will provide its maximum capacity of 1 bit

When $q = 0.5$ the BSC channel is totally ambiguous or useless and has a capacity of 0 bits

2.7.2 Channel Capacity of a BEC

For a BEC, $I(A; B) = H(B) - H(B|A)$ where $H(B)$ is given by Equation 2.21 and $H(B|A)$ is given by Equation 2.22. The expression can be further simplified to:

$$I(A; B) = -\omega p \log(\omega p) - \bar{\omega} p \log(\bar{\omega} p) - q \log q + (p \log p + q \log q) = p \left( -\omega \log \omega - \bar{\omega} \log \bar{\omega} \right) = p H(A) \quad (2.43)$$

from which we can immediately see that:

$$C = \max_{P(A)} I(A; B) = p \max_{P(A)} H(A) = p \quad (2.44)$$

since we know that for the binary input alphabet, $A$, $\max_{P(A)} H(A) = 1$ and occurs when $\omega = \frac{1}{2}$. The following observations can be made:
FIGURE 2.11
Variation of channel capacity of a BSC with $q$.

when $q = 0$, that is, no erasures, the BEC channel will provide its maximum capacity of 1 bit

when $q = 1$, the BEC channel will be producing only erasures and have a capacity of 0 bits
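A short numerical comparison of the two closed-form capacities as functions of the error/erasure probability $q$ (helper names are ours):

```python
import numpy as np

def bsc_capacity(q):
    """C = 1 + p log p + q log q for a binary symmetric channel."""
    p = 1 - q
    return 1 + sum(x * np.log2(x) for x in (p, q) if x > 0)

def bec_capacity(q):
    """C = p = 1 - q for a binary erasure channel."""
    return 1 - q

for q in (0.0, 0.1, 0.5, 1.0):
    print(f"q={q:.1f}  BSC C={bsc_capacity(q):.3f}  BEC C={bec_capacity(q):.3f}")
```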

2.7.3 Channel Capacity of Weakly Symmetric Channels

The calculation of the channel capacity,  , is generally quite involved since it rep-
resents a problem in constrained optimisation of a non-linear function. However
there is a special class of channels where the derivation of the channel capacity can
be stated explicitly. Both symmetric and weakly symmetric channels are examples
of this special class. We now provide a formal definition of symmetric and weakly
symmetric channels.

DEFINITION 2.6 (Weakly) Symmetric Channels A channel is said to be sym-


metric if the rows and columns of the channel matrix are permutations of each other.

A channel is said to be weakly symmetric if the rows of the channel matrix are permutations of each other and the column sums, $\sum_{a \in A} P(b|a)$, are equal.
Obviously a symmetric channel will also have equal column sums since columns are a permutation of each other.

Consider the weakly symmetric channel AB with mutual information:

$$I(A; B) = H(B) - H(B|A) = H(B) + \sum_{a \in A} \sum_{b \in B} P(a) P(b|a) \log P(b|a)$$
$$= H(B) + \sum_{a \in A} P(a) \sum_{b \in B} P(b|a) \log P(b|a) \quad (2.45)$$

where $\sum_{b \in B} P(b | a_i) \log P(b | a_i) = \sum_{j=1}^{s} p_{ij} \log p_{ij}$ and the summation involves terms in the $i$th row of the channel matrix. Since each row is a permutation of, say, the first row, then $\sum_j p_{1j} \log p_{1j} = \sum_j p_{2j} \log p_{2j} = \cdots = \sum_j p_{rj} \log p_{rj} \equiv -H(\text{row of } P)$. This means that

$$I(A; B) = H(B) - H(\text{row of } P) \le \log s - H(\text{row of } P)$$

since $\sum_{a \in A} P(a) = 1$ and $H(B)$ is bounded above by $\log s$. The upper bound $\log s - H(\text{row of } P)$ would be the channel capacity if it can be achieved by an appropriate input distribution. Let $P(a_i) = \frac{1}{r}$; then $P(b_j) = \sum_i P(b_j | a_i) P(a_i) = \frac{1}{r} \sum_i P(b_j | a_i)$. Since the channel is weakly symmetric the column sums, $\sum_i P(b_j | a_i)$, are all equal, and we have that the $P(b_j)$ are equal, that is, the output probabilities are equal. Since we know that maximum entropy is achieved with equal symbol probabilities then it follows that $H(B) = \log s$ when $P(a_i) = \frac{1}{r}$. We have established that:

THEOREM 2.1
For a symmetric or weakly symmetric channel AB, the channel capacity can be stated explicitly as:

$$C = \log s - H(\text{row of } P) \quad (2.46)$$

where $H(\text{row of } P) = -\sum_{j=1}^{s} p_{ij} \log p_{ij}$ can be calculated for any row $i$.

Channel capacity is achieved when the inputs are uniformly distributed, that is, $P(a_i) = \frac{1}{r}$.

EXAMPLE 2.10
Consider the BSC channel matrix:

$$P = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$

The BSC is a symmetric channel and we can use Theorem 2.1 to explicitly derive the channel capacity as follows:

$$C = \log 2 - H(\text{row of } P) = 1 + p \log p + q \log q$$

when $P(a_i) = \frac{1}{2}$, which is the same expression that was derived in Section 2.7.1.

Now consider a channel with the following channel matrix:

$$P = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{2} & \frac{1}{6} \end{bmatrix}$$

Obviously the channel is not symmetric, but since the second row is a permutation of the first row and since the column sums are equal then the channel is weakly symmetric and the channel capacity can be explicitly stated as:

$$C = \log 3 - H\!\left(\tfrac{1}{3}, \tfrac{1}{6}, \tfrac{1}{2}\right) = 1.585 - 1.459 = 0.126 \text{ bits}$$

when $P(a_i) = \frac{1}{2}$.
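Theorem 2.1 reduces the capacity computation for (weakly) symmetric channels to a single row entropy. A sketch using the weakly symmetric matrix above (function names are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weakly_symmetric_capacity(P):
    """C = log2(s) - H(row) for a (weakly) symmetric channel with s output symbols."""
    s = P.shape[1]
    return np.log2(s) - entropy(P[0])   # any row gives the same entropy

P = np.array([[1/3, 1/6, 1/2],
              [1/3, 1/2, 1/6]])
print(weakly_symmetric_capacity(P))     # ~0.126 bits
```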

2.8 Continuous Channels and Gaussian Channels

We extend our analysis of information channels to the case of continuous valued in-
put and output alphabets and to the most important class of continuous channel, the
Gaussian channel. In digital communication systems noise analysis at the most basic
level requires consideration of continuous valued random variables rather than dis-
crete quantities. Thus the Gaussian channel represents the most fundamental form
of all types of communication channel systems and is used to provide meaning-
ful insights and theoretical results on the information carrying capacity of channels.
The BSC and BEC models, on the other hand, can be considered as high-level de-
scriptions of the practical implementations and operations observed in most digital
communication systems.

When considering the continuous case our discrete-valued symbols and discrete probability assignments are replaced by continuous-valued random variables, $X$, with associated probability density functions, $f(x)$.

DEFINITION 2.7 Mutual Information of two random variables Let $X$ and $Y$ be two random variables with joint density $f(x, y)$ and marginal densities $f(x)$ and $f(y)$. Then the mutual information between $X$ and $Y$ is defined as:

$$I(X; Y) = h(Y) - h(Y|X) \quad (2.47)$$

where $h(Y)$ is the differential entropy:

$$h(Y) = -\int f(y) \log f(y) \, dy \quad (2.48)$$

and $h(Y|X)$ is the conditional differential entropy:

$$h(Y|X) = -\int \int f(x, y) \log f(y|x) \, dx \, dy \quad (2.49)$$

The mutual information, $I(X; Y)$, provides a measure of the amount of information that can be carried by the continuous channel. Let us now consider the Gaussian channel shown in Figure 2.12, where $N \sim \mathcal{N}(0, \sigma_N^2)$ is a zero-mean Gaussian random variable with variance $\sigma_N^2$. Let $X_k$ be the value of the input to the channel at time $k$, and let $Y_k$ be the value of the output from the channel at time $k$.

FIGURE 2.12
The Gaussian channel.

The output of the channel is given by:

$$Y_k = X_k + N_k \quad (2.50)$$

That is, the channel output is perturbed by additive white Gaussian noise (AWGN).

We assume that the channel is band-limited to $B$ Hz. Two immediate implications of this are that:

1. Assume the AWGN has a power spectral density of $\frac{N_0}{2}$. Then the noise variance, or average power, is band-limited and given by $\sigma_N^2 = \int_{-B}^{B} \frac{N_0}{2} \, df = N_0 B$.

2. To faithfully reproduce any signals transmitted through the channel the signals must be transmitted at a rate not exceeding the Nyquist rate of $2B$ samples per second.

We further assume that the channel is power-limited to $P$. That is:

$$E[X_k^2] = \sigma_X^2 \le P \quad (2.51)$$

This band-limited, power-limited Gaussian channel just described is not only of theoretical importance in the field of information theory but of practical importance to communication engineers since it provides a fundamental model for many modern communication channels, including wireless radio, satellite and fibre optic links.

2.9 Information Capacity Theorem

In Section 2.7 we defined the channel capacity as the maximum of the mutual information over all possible input distributions. Of importance to communication engineers is the channel capacity of a band-limited, power-limited Gaussian channel. This is given by the following maximisation problem:

$$C = \max_{f(x):\, E[X_k^2] = P} I(X; Y) = \max_{f(x):\, E[X_k^2] = P} \left[ h(Y) - h(Y|X) \right] \quad (2.52)$$

We now provide the details of deriving an important and well-known closed-form expression for $C$ for Gaussian channels. The result is the Information Capacity Theorem which gives the capacity of a Gaussian communication channel in terms of the two main parameters that confront communication engineers when designing such systems: the signal-to-noise ratio and the available bandwidth of the system.

CLAIM 2.2
If $Y = X + N$ and $N$ is uncorrelated with $X$ then:

$$I(X; Y) = h(Y) - h(N) \quad (2.53)$$

PROOF We note from Bayes' Theorem that $f(x, y) = f(y|x) f(x)$. Also, since $Y = X + N$ and since $X$ and $N$ are uncorrelated, we have that $f(y|x) = f_N(y - x)$. Using these in the expression for $h(Y|X)$ gives:

$$h(Y|X) = -\int \int f(x, y) \log f(y|x) \, dx \, dy = -\int f(x) \left[ \int f_N(y - x) \log f_N(y - x) \, dy \right] dx$$
$$= -\int f(x) \left[ \int f_N(n) \log f_N(n) \, dn \right] dx = h(N) \quad (2.54)$$

and thus $I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(N)$.

From Chapter 1 we stated that for a Gaussian random variable the differential entropy attains the maximum value of $\frac{1}{2} \log (2 \pi e \sigma^2)$; so for the Gaussian random variable, $N$, we know that:

$$h(N) = \frac{1}{2} \log (2 \pi e \sigma_N^2) \quad (2.55)$$

For random variable $Y = X + N$ where $X$ and $N$ are uncorrelated we have that:

$$\sigma_Y^2 = \sigma_X^2 + \sigma_N^2 = P + \sigma_N^2 \quad (2.56)$$

with the maximum $h(Y)$ achieved when $Y$ is a Gaussian random variable. Thus:

$$I(X; Y) = h(Y) - h(N) \le \frac{1}{2} \log \left( 2 \pi e (P + \sigma_N^2) \right) - \frac{1}{2} \log (2 \pi e \sigma_N^2) = \frac{1}{2} \log \left( 1 + \frac{P}{\sigma_N^2} \right) \quad (2.57)$$

If $X$ is chosen to be a Gaussian random variable with $\sigma_X^2 = P$ then $Y$ will also be a Gaussian random variable and the maximum mutual information or channel capacity will be achieved:

$$C = \frac{1}{2} \log \left( 1 + \frac{P}{\sigma_N^2} \right) \quad (2.58)$$

Since the channel is also band-limited to $B$ Hz then there can be no more than $2B$ symbols transmitted per second and $\sigma_N^2 = N_0 B$. This provides the final form of the channel capacity, stated as Shannon's most famous Information Capacity Theorem [10], which is also known as the Shannon-Hartley Law in recognition of the early work by Hartley [6].
work by Hartley [6].

RESULT 2.8 Information Capacity Theorem

The information capacity of a continuous channel of bandwidth $B$ Hz, perturbed by AWGN of power spectral density $\frac{N_0}{2}$ and band-limited also to $B$ Hz, is given by:

$$C = B \log_2 \left( 1 + \frac{P}{N_0 B} \right) \text{ bits per second} \quad (2.59)$$

where $P$ is the average transmitted power and $\frac{P}{N_0 B}$ is the signal-to-noise ratio or SNR.

Equation 2.59 provides the theoretical capacity or upper bound on the bits per second that can be transmitted for error-free transmission through a channel for a given transmitted power, $P$, and channel bandwidth, $B$, in the presence of AWGN noise with power spectral density, $\frac{N_0}{2}$. Thus the information capacity theorem defines a fundamental limit that confronts communication engineers on the rate for error-free transmission through a power-limited, band-limited Gaussian channel.

EXAMPLE 2.11

What is the minimum signal-to-noise ratio that is needed to support a 56k modem?

A 56k modem requires a channel capacity of 56,000 bits per second. We assume a telephone bandwidth of $B = 3500$ Hz. From Equation 2.59 we have:

$$56000 = 3500 \log_2 (1 + SNR)$$

or:

$$SNR = 2^{56000/3500} - 1 = 2^{16} - 1 \approx 65500 \approx 48 \text{ dB}$$

Thus a SNR of at least 48 dB is required to support running a 56k modem. In real telephone channels other factors such as crosstalk, co-channel interference, and echoes also need to be taken into account.
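Rearranging Equation 2.59 for the SNR gives a one-line check of this example; the 3500 Hz bandwidth is the assumption used above to obtain the 48 dB figure.

```python
import math

def required_snr_db(capacity_bps, bandwidth_hz):
    """Minimum SNR (in dB) needed to achieve a given capacity over an AWGN channel,
    from C = B log2(1 + SNR)."""
    snr = 2 ** (capacity_bps / bandwidth_hz) - 1
    return 10 * math.log10(snr)

print(required_snr_db(56_000, 3_500))   # ~48 dB
```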

2.10 Rate Distortion Theory

Consider a discrete, memoryless source with alphabet $A = \{a_1, a_2, \ldots, a_r\}$ and associated probabilities $P(a_i)$, $i = 1, 2, \ldots, r$. Source coding is usually required to efficiently transmit or store messages from the source in the appropriate representation for the storage or communication medium. For example, with digital communication channels and storage, source symbols need to be encoded as a binary representation. Furthermore, source symbols transmitted through a communication channel may be output in a different symbol representation. Thus the source symbols will appear as symbols from a representation or code word alphabet $B = \{b_1, b_2, \ldots, b_s\}$. If there are no losses or distortions in the coding or transmission there is perfect representation and $A$ can be recovered fully from its representation $B$. But in the following situations:

• lossiness in the source coding where the code alphabet and permitted code words do not allow exact representation of the source symbols and the decoding is subject to errors,

• insufficient redundancy in the channel code such that the rate of information is greater than the channel capacity,

the representation is not perfect and there are unavoidable errors or distortions in the representation of the source symbol $a_i$ by the representation symbol $b_j$.
Rate distortion theory, first developed by Shannon [11], deals with the minimum mutual information (equivalent to the information or code rate) that the channel must possess, for the given source symbol probability distributions, to ensure that the average distortion is guaranteed not to exceed a specified threshold $D$. To derive this value we first define what we mean by an information or code rate (this is for the general case; in Chapter 5 we define the code rate for the specific case that applies with channel codes).

DEFINITION 2.8 Code Rate (General Case) Assume one of $M$ possible source messages is represented as a code word of length $n$. Let $H(M)$ be the average number of bits transmitted with each source message; then the code rate is defined as:

$$R = \frac{H(M)}{n} \quad (2.60)$$

For the case of equally likely source messages, $H(M) = \log M$. For the general case, $H(M)$ is the entropy of the source message.

DEFINITION 2.9 Distortion Measures The single-letter distortion measure, $d(a_i, b_j)$, is defined as the measure of the cost incurred in representing the source symbol $a_i$ by the representation symbol $b_j$. The average distortion, $\bar{d}$, is defined as the average of $d(a_i, b_j)$ over all possible source symbol and representation symbol combinations:

$$\bar{d} = \sum_{i=1}^{r} \sum_{j=1}^{s} P(a_i) P(b_j | a_i) d(a_i, b_j) \quad (2.61)$$

where the $P(b_j | a_i)$ are the channel or transition probabilities.

Consider all possible conditional probability assignments for $P(b_j | a_i)$ for the given source and representation alphabets. An assignment is deemed to be $D$-admissible if and only if the average distortion, $\bar{d}$, is less than or equal to some acceptable or specified threshold, $D$. The set of all $D$-admissible conditional probability assignments is denoted by:

$$P_D = \left\{ P(b_j | a_i) : \bar{d} \le D \right\} \quad (2.62)$$

For each $D$-admissible conditional probability assignment we have an associated mutual information, or information rate, given by:

$$I(A; B) = \sum_{i=1}^{r} \sum_{j=1}^{s} P(a_i) P(b_j | a_i) \log \frac{P(b_j | a_i)}{P(b_j)} \quad (2.63)$$

where $P(b_j) = \sum_{i=1}^{r} P(a_i) P(b_j | a_i)$ are the representation alphabet probabilities.

DEFINITION 2.10 Rate Distortion Function For given $D$ and fixed source probability distribution $\{P(a_i) : i = 1, 2, \ldots, r\}$ the minimum information rate we require for the given average distortion $D$ is given by the rate distortion function:

$$R(D) = \min_{P(b_j | a_i) \in P_D} I(A; B) \quad (2.64)$$

which is derived by finding the $D$-admissible conditional probability assignment that minimises the mutual information subject to the constraints:

$$P(b_j | a_i) \ge 0 \qquad \sum_{j=1}^{s} P(b_j | a_i) = 1 \quad (2.65)$$

Intuitively, we expect that tolerating a larger distortion permits the use of lower in-
formation rates (as is the case of achieving low bit rate transmission by lossy com-
pression) and conversely that by increasing the information rate lower distortion is
possible.
The calculation of $R(D)$ from Equation 2.64 involves a minimisation of a non-linear function over an unknown but constrained set of probabilities, which is a very difficult problem analytically.

EXAMPLE 2.12
Consider a binary source with symbols $\{a_1 = 0, a_2 = 1\}$ that are detected at the receiver as the ternary representation alphabet with symbols $\{b_1 = 0, b_2 = ?, b_3 = 1\}$. This representation covers the following three cases:

• $a_1 = 0$ is detected as $b_1 = 0$ and $a_2 = 1$ is detected as $b_3 = 1$, that is, there is no error in representation.

• $a_1 = 0$ is detected as $b_2 = ?$ and $a_2 = 1$ is detected as $b_2 = ?$, that is, the receiver fails to detect the symbol and there is an erasure.

• $a_1 = 0$ is detected as $b_3 = 1$ and $a_2 = 1$ is detected as $b_1 = 0$, that is, the receiver detects the symbol incorrectly and there is a bit inversion.

The corresponding channel is shown in Figure 2.13, where $\alpha_{ij} = P(b_j | a_i)$ are the transition probabilities and the source probabilities $P(a_1)$ and $P(a_2)$ are given.

FIGURE 2.13
Graphical representation of information channel.

We define a distortion measure for this channel as follows:

• $d(a_1, b_1) = d(a_2, b_3) = 0$ to indicate there is no distortion (or distortion of 0) when there is no error in the representation.

• $d(a_1, b_2) = d(a_2, b_2) = 1$ to indicate there is a distortion or cost of 1 when there is an erasure.

• $d(a_1, b_3) = d(a_2, b_1) = 3$ to indicate there is a distortion of 3 when there is a bit inversion.
Information Channels 91

FIGURE 2.14
Graphical representation of the distortion measure: branches from inputs 0 and 1 to
representations 0, ? and 1 labelled with the distortions 0 (correct), 1 (erasure) and 3 (bit inversion).

Note that a higher distortion of 3 is attributed to bit inversions, compared to a lower
distortion of 1 when there is an erasure. The set of distortion measures is represented
graphically in Figure 2.14.
Assume now that particular numerical values are assigned to the transition probabilities
P(b_j|a_i) of the channel (Equation 2.66). From these we calculate the representation
alphabet probabilities, P(b_j) = \sum_i P(a_i) P(b_j|a_i), and the mutual information,
I(A;B), from Equation 2.63. The average distortion is then:

    D = \sum_{i=1}^{2} \sum_{j=1}^{3} P(a_i) P(b_j|a_i) d(a_i, b_j)

Consider the derivation of R(D) for this value of D. The conditional probability
assignment of Equation 2.66 is D-admissible and provides an information rate of
I(A;B). Is there another probability assignment with the same or smaller average
distortion but a lower information rate? Intuitively one expects the minimum
information rate to occur at the maximum allowable distortion D. To investigate this
we consider the BSC and BEC equivalents of Figure 2.13 (assignments with,
respectively, bit inversions only and erasures only), with their transition probabilities
chosen so that each yields the same average distortion D. Although both equivalents
have the same average distortion, the BSC equivalent has the smaller mutual
information and is therefore the channel with the lowest information rate for this
level of average distortion. To find R(D) we need to consider all D-admissible
conditional probability assignments and find the one providing the minimum value
for the mutual information.
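The comparison made in this example is easy to reproduce numerically. The sketch below assumes equiprobable inputs and the distortion values 0 (correct), 1 (erasure) and 3 (inversion) defined above; the two transition matrices are illustrative choices of a BSC-like (inversions only) and a BEC-like (erasures only) assignment tuned to give the same average distortion of 0.5, and are assumptions of this sketch rather than the assignments of Equation 2.66:

```python
import numpy as np

def mutual_information(p_a, Q):
    """I(A;B) of Equation 2.63 for source probabilities p_a and transition matrix Q."""
    p_a, Q = np.asarray(p_a, float), np.asarray(Q, float)
    p_b = p_a @ Q                       # representation alphabet probabilities P(b_j)
    I = 0.0
    for i in range(Q.shape[0]):
        for j in range(Q.shape[1]):
            if Q[i, j] > 0:
                I += p_a[i] * Q[i, j] * np.log2(Q[i, j] / p_b[j])
    return I

def average_distortion(p_a, Q, d):
    """D of Equation 2.61."""
    p_a, Q, d = (np.asarray(x, float) for x in (p_a, Q, d))
    return float(np.sum(p_a[:, None] * Q * d))

p_a = [0.5, 0.5]
d = [[0, 1, 3], [3, 1, 0]]                      # distortions for outputs 0, ?, 1

Q_bsc = [[5/6, 0, 1/6], [1/6, 0, 5/6]]          # inversions only: D = 3 * 1/6 = 0.5
Q_bec = [[1/2, 1/2, 0], [0, 1/2, 1/2]]          # erasures only:   D = 1 * 1/2 = 0.5
for name, Q in (("BSC-like", Q_bsc), ("BEC-like", Q_bec)):
    print(name, round(average_distortion(p_a, Q, d), 3),
          round(mutual_information(p_a, Q), 3))
# The BSC-like assignment gives the smaller I(A;B) for the same D, as noted above.
```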

EXAMPLE 2.13

An important example of rate distortion analysis is the case of analog-to-digital con-
version, where the analog source signal has to be represented as a digital source signal.
An important component of this conversion is the quantisation of the continuous-
valued analog signal to a discrete-valued digital sample.

Consider source symbols generated from a discrete-time, memoryless Gaussian source
with zero mean and variance σ². Let x denote the value of the source symbol or sam-
ple generated by this source. Although the source symbol is discrete in time it is
continuous-valued, and a discrete-valued (quantised) representation of x is needed for
storage and transmission through digital media. Let y be a symbol from the discrete-
valued representation alphabet Y that is used to represent the continuous-valued x.
For example, Y can be the set of non-negative integers and y the value
of x rounded to the nearest integer. It should be noted that, strictly speaking, a finite
representation is needed since digital data are stored in a finite number of bits. This is
usually achieved in practice by assuming a limited dynamic range for x and limiting
the representation alphabet Y to a finite set of non-negative integers. Thus we refer to
y as the quantised version of x, and to the mapping of x to y as the quantisation operation.

The most intuitive distortion measure between x and y is a measure of the error in
representing x by y, and the most widely used choice is the squared error distortion:

    d(x, y) = (x - y)^2    (2.67)

It can be shown with the appropriate derivation (see [4, 5]) that the rate distortion
function for the quantisation of a Gaussian source with variance σ², under the squared
error distortion, is given by:

    R(D) = \begin{cases} \frac{1}{2} \log_2 \frac{\sigma^2}{D} & 0 \le D \le \sigma^2 \\ 0 & D > \sigma^2 \end{cases}    (2.68)
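Equation 2.68 is simple to evaluate. The short sketch below is illustrative only; it also highlights the familiar rule of thumb that every extra bit of rate reduces the permissible squared error distortion by a factor of four (about 6 dB):

```python
import numpy as np

def gaussian_rd(D, sigma2):
    """Rate distortion function of Equation 2.68 (bits per source sample)."""
    return 0.5 * np.log2(sigma2 / D) if 0 < D <= sigma2 else 0.0

sigma2 = 1.0
for D in (0.0625, 0.25, 0.5, 1.0, 2.0):
    print(f"D = {D:6.4f}   R(D) = {gaussian_rd(D, sigma2):.2f} bits/sample")
# D = sigma^2/4 needs 1 bit/sample, D = sigma^2/16 needs 2 bits/sample,
# and any D >= sigma^2 needs no information at all.
```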

2.10.1 Properties of R(D)

By considering the minimum and maximum permissible values for R(D) and the
corresponding distortion thresholds, D, a more intuitive understanding of the be-
haviour of the rate distortion function, R(D), is possible. Obviously the minimum
value of R(D) is 0, implying that there is no minimum rate of information, but what
does this mean and what does the corresponding distortion threshold imply? Intu-
itively the maximum value of R(D) should be H(A), and this should happen when
D = 0 (perfect reconstruction), but is this really the case? We answer these questions
by stating and proving the following result.

RESULT 2.9
The rate distortion function is a monotonically decreasing function of D, limited in
range by:

    0 \le R(D) \le H(A)    (2.69)

where the upper bound is attained at the minimum possible value of the distortion,
D = D_min (in the typical case R(D_min) = H(A)), and R(D_max) = 0 indicates that there is a
maximum permissible value D_max such that R(D) = 0 for D \ge D_max.

A typical plot of R(D) as a function of D, for the case when D_min = 0 and R(0) = H(A),
is shown in Figure 2.15.

FIGURE 2.15
Sketch of a typical rate distortion function R(D): the curve falls monotonically from
H(A) at D = 0 to zero at D = D_max.

We can prove that R(D_min) \le H(A) by considering D_min, the minimum permissi-
ble value for the distortion D. Since R(D) is a decreasing function of D, its upper
bound is reached at the smallest possible D. That is, we want to find the conditional
probability assignment that minimises:

    D = \sum_i \sum_j P(a_i) P(b_j|a_i) d(a_i, b_j)    (2.70)

The minimum occurs by considering, for each a_i, the value of b_j that minimises
d(a_i, b_j) and setting P(b_j|a_i) = 1 for these values, with all other conditional prob-
abilities set to zero. For each a_i define d_min(a_i) = \min_j d(a_i, b_j) as the minimum
distortion for a_i, and b_{j(i)}, where j(i) = \arg\min_j d(a_i, b_j), as the representation
symbol that yields this minimum distortion. We set P(b_{j(i)}|a_i) = 1, with all other
conditional probabilities P(b_j|a_i) for j \ne j(i) set to zero. Hence:

    D_min = \sum_i P(a_i) d_min(a_i)    (2.71)

In typical applications b_{j(i)} is the representation alphabet symbol that uniquely
identifies the source symbol a_i, and in such cases the above conditional probabil-
ity assignment implies that H(A|B) = 0 (perfect reconstruction, since there is no
uncertainty) and hence I(A;B) = H(A). Thus we have that R(D_min) = H(A).
Furthermore, we typically assign a distortion of 0 to the pair (a_i, b_{j(i)}), and this
means that D_min = 0; thus R(0) = H(A).
Derivation of the maximum permissible distortion, D_max, relies on the observation that this
condition occurs when I(A;B) = 0. That is, H(A|B) = H(A) and thus P(b_j|a_i) =
P(b_j): the representation symbols are independent of the source symbols and there
is no information conveyed by the channel. The average distortion for a
channel with I(A;B) = 0 is given by:

    D = \sum_j P(b_j) \sum_i P(a_i) d(a_i, b_j)    (2.72)

Then, since we are looking for the smallest D_max such that R(D) = 0 for D \ge D_max,
D_max is given by the minimum value of Equation 2.72 over all possible probability
assignments for the P(b_j). Define d(b_j) = \sum_i P(a_i) d(a_i, b_j). The mini-
mum of Equation 2.72 occurs when P(b_j) = 1 for the b_j for which the expression
d(b_j) = \sum_i P(a_i) d(a_i, b_j) is smallest, and P(b_j) = 0 for all other b_j. This
gives:

    D_max = \min_j \sum_i P(a_i) d(a_i, b_j)    (2.73)

Equation 2.73 implies that if we are happy to tolerate an average distortion that is as
much as D_max then there is a choice of representation b_j, made independently of a_i,
such that I(A;B) = 0.
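Both extreme values translate directly into code: Equation 2.71 picks the best representation symbol separately for each source symbol, while Equation 2.73 picks the single best representation symbol independently of the source. A minimal sketch (the function name and the numbers in the usage line are assumptions of this sketch):

```python
import numpy as np

def d_min_d_max(p_a, d):
    """D_min of Equation 2.71 and D_max of Equation 2.73 for source
    probabilities p_a and single-letter distortion matrix d."""
    p_a = np.asarray(p_a, dtype=float)
    d = np.asarray(d, dtype=float)
    d_min = float(np.sum(p_a * d.min(axis=1)))   # sum_i P(a_i) min_j d(a_i, b_j)
    d_max = float((p_a @ d).min())               # min_j sum_i P(a_i) d(a_i, b_j)
    return d_min, d_max

# For a binary source with Hamming distortion, D_max = 1 - max_i P(a_i):
print(d_min_d_max([0.25, 0.75], [[0, 1], [1, 0]]))   # (0.0, 0.25)
```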

EXAMPLE 2.14
Consider the channel from Example 2.12. Any conditional probability assignment in
which the output does not depend on the input, for example one with P(b_1|a_1) =
P(b_1|a_2) = 1 so that the output is always 0, obviously implies that I(A;B) = 0. The
resulting average distortion is:

    D = P(a_1) d(0, 0) + P(a_2) d(1, 0) = 0.5 \times 0 + 0.5 \times 3 = 1.5

However, this may not be D_max, since there may be other conditional probability
assignments for which I(A;B) = 0 that provide a lower distortion. Indeed, using
Equation 2.73 gives:

    D_max = \min \{ 0.5 \times 0 + 0.5 \times 3,\; 0.5 \times 1 + 0.5 \times 1,\; 0.5 \times 3 + 0.5 \times 0 \} = \min \{1.5, 1, 1.5\} = 1

Thus D_max = 1, and this occurs with the conditional probability assignment
P(b_2|a_1) = P(b_2|a_2) = 1. Interestingly, this represents the condition where, no
matter what input is transmitted through the channel, the output will always be ?,
with an average cost of 1. Thus if we can tolerate an average distortion of 1 or more
we might as well not bother!
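As a quick numerical check of Equation 2.73 for this example (a minimal sketch; the arrays simply restate the probabilities and distortions given above):

```python
import numpy as np

p_a = np.array([0.5, 0.5])
d = np.array([[0, 1, 3],
              [3, 1, 0]])
# Candidate average distortions for each fixed output symbol (Equation 2.73):
print(p_a @ d)    # [1.5  1.   1.5]  -> D_max = 1, attained by always outputting ?
```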

EXAMPLE 2.15

Consider the rate distortion function given by Equation 2.68 in Example 2.13 for the
quantisation of a Gaussian source with squared error distortion. Since the source
sample x is continuous-valued and its representation y is discrete-valued, the rate
distortion function must become infinite as D approaches 0: no amount of information will
ever be able to reconstruct x from y with no error, and the quantisation process, no matter
what discrete-valued representation is used, will always involve a loss of information. That
is, from Equation 2.68:

    \lim_{D \to 0} R(D) = \infty

It should be noted that this does not contradict Result 2.9, since that result implicitly
assumed that both the source and the representation alphabets were discrete-valued
(so that H(A) is finite).

For the case of zero rate, Equation 2.68 gives:

    R(D) = 0 \quad \text{for } D \ge \sigma^2

that is, D_max = σ². This result can be confirmed intuitively by observing that if no
information is provided (i.e., the receiver or decoder does not have access to y), the
best estimate for x is its mean value, for which the average squared error will, of
course, be the variance, σ².
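The zero-rate end point is easy to confirm by simulation: if the decoder has no information at all it can do no better than output the source mean, and the resulting squared error distortion is the source variance. A minimal sketch (the seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=500_000)   # zero-mean Gaussian source samples
x_hat = np.zeros_like(x)                   # zero-rate reconstruction: the source mean
print(np.mean((x - x_hat) ** 2))           # close to sigma**2 = 4.0
```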

2.11 Exercises
1. A binary channel correctly transmits a 0 (as a 0) twice as many times as it trans-
mits it incorrectly (as a 1), and correctly transmits a 1 (as a 1) three times
more often than it transmits it incorrectly (as a 0). The input to the channel
can be assumed equiprobable.

(a) What is the channel matrix? Sketch the channel.
(b) Calculate the output probabilities, P(b_j).
(c) Calculate the backward channel probabilities, P(a_i|b_j).
2. Secret agent 101 communicates with her source of information by phone, un-
fortunately in a foreign language over a noisy connection. Agent 101 asks
questions requiring only yes and no answers from the source. Due to the noise
and language barrier, Agent 101 hears and interprets the answer correctly only
75% of the time, she fails to understand the answer 10% of the time and she
misinterprets the answer 15% of the time. Before asking the question, Agent
101 expects the answer yes 80% of the time.

(a) Sketch the communication channel that exists between Agent 101 and
her source.
(b) Before hearing the answer to the question what is Agent 101’s average
uncertainty about the answer?
(c) Agent 101 interprets the answer over the phone as no. What is her av-
erage uncertainty about the answer? Is she more uncertain or less
uncertain about the answer given her interpretation of what she heard?
Explain!
(d) Agent 101 now interprets the answer over the phone as yes. What is her
average uncertainty about the answer? Is she more uncertain or less
uncertain about the answer given her interpretation of what she heard?
Explain and compare this with the previous case.

3. Calculate the equivocation of A with respect to B, H(A|B), for the com-
munication channel of Qu. 2. Now calculate the mutual information of the
channel as I(A;B) = H(A) - H(A|B). What can you say about the chan-
nel given your answers for Qu. 2(c) and 2(d)? Now verify that I(A;B) =
H(B) - H(B|A) by calculating H(B) and H(B|A).
*4. A friend of yours has just seen your exam results and has telephoned to tell
you whether you have passed or failed. Alas the telephone connection is so
bad that if your friend says “pass” you mistake that for “fail” 3 out of 10 times
and if your friend says “fail” you mistake that for “pass” 1 out of 10 times.
Before talking to your friend you were 60% confident that you had passed the
exam. How confident are you of having passed the exam if you heard your
friend say you have passed?
5. Which is better, “erasure” or “crossover”?

(a) Consider a fibre-optic communication channel with a given crossover probabil-
ity and a wireless mobile channel whose erasure probability has the same
numerical value. Calculate the mutual information, assuming equiprobable inputs,
for both types of communication channel. Which system provides more
information for the same bit error rate?

*(b) Let us examine the problem another way. Consider a fibre-optic com-
munication channel with a given crossover probability and a wireless
mobile channel with a given erasure probability. Assume equiproba-
ble inputs and express the mutual information I(A;B) for the fibre-optic
channel as a function of its crossover probability, and for the wireless mobile
channel as a function of its erasure probability. Calculate the crossover and
erasure probabilities that yield the same mutual information. Which channel
can get away with a higher bit error rate and still provide the same amount
of mutual information?

6. Given    ,     and     find  ,


  and   .
7. Consider the following binary channel:

[Figure: binary channel with input probabilities P(A = 0) = 4/7 and P(A = 1) = 3/7,
and transition probabilities P(B = 0|A = 0) = 3/5 and P(B = 1|A = 1) = 2/3.]

Calculate H(A), H(B), H(A|B), H(B|A), H(A,B) and I(A;B) as painlessly
as possible.

8. Here are some quick questions:

(a) Prove that if a BSC is shorted (i.e., all the outputs are grounded to 0) then
the channel provides no information about the input.
(b) Consider the statement: “Surely if you know the information of the
source, H(A), and the information provided by the channel, I(A;B),
you will know the information of the output, H(B)?” Prove whether
this statement is true or not. If not, under what conditions, if any, would
it be true?
(c) Conceptually state (in plain English) what         
means.

 
9. Sketch a sample channel with inputs and outputs, find the expression for the
mutual information, the channel capacity and the input probability distribution
to achieve capacity, for the following cases:

(a) a noiseless, non-deterministic channel


(b) a deterministic, non-noiseless channel
(c) a noiseless, deterministic channel

*10. It was established that the minimum value of I(A;B) is 0, that is, I(A;B) \ge 0.
What is the maximum value of I(A;B)?
11. Show that the mutual information expression for a BEC with erasure probability q
can be simplified to:

    I(A;B) = (1 - q) H(A)
12. Consider the following errors-and-erasure channel:

[Figure: errors-and-erasures channel with inputs A = {0, 1}, outputs B = {0, ?, 1}
and branch probabilities p, q and 1 - p - q from each input.]

 
Find all the values of p and q for which the above channel is:

(a) totally ambiguous (i.e., I(A;B) = 0),
(b) noiseless,
(c) deterministic.

13. Channel  has channel matrix:



 
 
 ¾  
 
 

and is connected to channel  with matrix:


 
    
  

The ternary input  to the channel system has the following statistics:    
,     , and     . The output of the channel  is  and
the output of channel  is  .

(a) Calculate  ,    and   .


(b) Calculate   ,     and   .
(c) What can you say about channel  ? Explain!

14. The input to a BEC is repeated as shown below:

[Figure: BEC with transition probabilities p and q; output B is obtained from the
original transmission and output C from the repeat transmission.]

Given equiprobable binary inputs derive the expression for I(A; B, C) and
show that I(A; B, C) \ge I(A; B) for:

(a)   

(b)  

(c)  

15. Agent 01 contacts two of his sources by email for some straight “yes” and
“no” answers. The first source he contacts is known to be unreliable and to
give wrong answers about 30% of the time. Hence Agent 01 contacts his
second source to ask for the same information. Unfortunately, the second
source insists on using a non-standard email encoding and Agent 01 finds that
only 60% of the answers are intelligible.

(a) What is the average uncertainty Agent 01 has about the input given the
answers he receives from his first source? Hence what is the mutual
information of the first source?
(b) What is the average uncertainty Agent 01 has about the input given the
answers he receives from both the first and second source? Hence what
is the mutual information from both sources?

*16. In order to improve utilisation of a BSC, special input and output electron-
ics are designed so that the input to the BSC is sent twice and the output is
interpreted as follows:

• if two 0’s are received the channel outputs a single 0
• if two 1’s are received the channel outputs a single 1
• if either 01 or 10 is received the channel outputs a single ?

Derive the expression for the mutual information through this augmented BSC
assuming equiprobable inputs and compare this with the mutual information
of a standard BSC. Either analytically or numerically show whether this aug-
mented BSC is superior to the standard BSC. What price is paid for this “im-
proved” performance?
17. The mutual information between two random variables is defined
by Equation 2.47. Explain how the mutual information provides a measure of the
amount of independence between the two random variables.
*18. A digital transmission channel consists of a terrestrial fibre-optic link with a
measured cross-over (bit error) probability of 0.1 followed by a satellite link
with an erasure probability of 0.2. No prior statistics regarding the source
of information being transmitted through the channel are available. What is
the average amount of information that can be resolved by the channel? In
order to improve the reliability of the channel it is proposed that the same data
being transmitted through the channel also be transmitted through a cheaper
and less reliable terrestrial/marine copper channel with cross-over probability
of 0.3. What is the average amount of information that can now be transmitted
through the combined channel system? Compare with your previous answer.
What is the cost of this improvement?
19. Consider the following errors-and-erasure channel:
[Figure: errors-and-erasures channel with inputs A = {0, 1}, outputs B = {0, ?, 1}
and branch probabilities p, q and 1 - p - q from each input.]

Under what conditions will the channel be weakly symmetric? What is the
expression for the channel capacity when the channel is weakly symmetric?

*20. Consider the following errors-and-erasure channel:

[Figure: errors-and-erasures channel with inputs A = {0, 1} and outputs B = {0, ?, 1};
the branch probabilities from input 0 are 0.7, 0.2 and 0.1, and from input 1 are
0.3, 0.1 and 0.6, as labelled.]

Express the mutual information I(A;B) as a function of the input probability
P(A = 0). Hence derive the channel capacity by finding the value of P(A = 0)
that maximises I(A;B). You can do this numerically, graphically or analytically.

*21. Consider the following channel:


[Figure: channel with inputs A = {0, 1} and outputs B = {0, 1, 2}; input 0 is received
as output 0 with probability 1, while input 1 is received as output 1 or output 2,
each with probability 1/2.]

Let P(a_1 = 0) and P(a_2 = 1) denote the input probabilities; thus
P(a_1 = 0) + P(a_2 = 1) = 1. Calculate the mutual information for the following cases:

(a)     
(b)    ,   


Now derive an algebraic expression for the mutual information as a function
of the input probabilities and, graphically, numerically or otherwise, try to find
the condition for channel capacity. Explain your findings!
22. A proposed monochrome television picture standard consists of   pixels
per frame, each occupying one of 16 grayscale levels with equal probability.
Calculate the minimum bandwidth required to support the transmission of 40
frames per second when the signal-to-noise ratio is 25 dB. Given that for effi-
cient spectrum usage the bandwidth should not exceed 10 MHz what do you
think happened to this standard?
23. You are asked to consider the design of a cable modem utilising a broadband
communications network. One of the requirements is the ability to support
full duplex 10 Mbps data communications over a standard television channel
bandwidth of 5.5 MHz. What is the minimum signal-to-noise ratio that is
required to support this facility?
24. Consider a BSC with channel matrix:

    P = \begin{pmatrix} p & q \\ q & p \end{pmatrix}

(where q = 1 - p is the crossover probability), with input A = {0, 1} and output
B = {0, 1}. Define the following single-letter distortion measure:

    d(0, 0) = d(1, 1) = 0, \qquad d(0, 1) = d(1, 0) = 1

Assuming equiprobable inputs derive the expression for the average distortion,
D. Hence what is the expression for the rate-distortion function R(D) as a
function of D?

25. For Qu. 24, what is D_min and what is D_max?
26. For Qu. 24, what is R(D_max) and what is a value of q that yields D = D_max?
*27. Consider the following errors-and-erasure channel:

[Figure: errors-and-erasures channel with inputs A = {0, 1}, outputs B = {0, ?, 1}
and branch probabilities p, q and 1 - p - q from each input.]

 
with input ½   ¾   and output ½   ¾   ¿ . Define the
following single-letter distortion measure:

     
           



Assuming equiprobable inputs derive the expression for the average distortion,
. State the constrained minimisation problem that you need to solve to derive
the rate-distortion function, 
. Can you solve it?

28. For Qu. 27, what is D_max and what are values of p and q that yield D = D_max?

2.12 References
[1] S. Arimoto, An algorithm for calculating the capacity of an arbitrary discrete
memoryless channel, IEEE Trans. Inform. Theory, IT-18, 14-20, 1972.
[2] R. Blahut, Computation of channel capacity and rate distortion functions, IEEE
Trans. Inform. Theory, IT-18, 460-473, 1972.
[3] J.-F. Cardoso, Blind signal separation: statistical principles, Proceedings of the
IEEE, 86(10), 2009-2025, 1998.

[4] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley &
Sons, New York, 1991.
[5] R.G. Gallager, Information Theory and Reliable Communication, John Wiley
& Sons, New York, 1968.
[6] R.V.L. Hartley, Transmission of information, Bell System Tech. J., 7, 535-563,
1928.
[7] S. Haykin, Communication Systems, John Wiley & Sons, New York, 4th ed.,
2001.
[8] J. Jeong, J.C. Gore, and B.S. Peterson, Mutual information analysis of the
EEG in patients with Alzheimer’s disease, Clinical Neurophysiology, 112(5),
827-835, 2001.
[9] Y. Normandin, R. Cardin, and R. De Mori, High-performance connected digit
recognition using maximum mutual information, IEEE Trans. Speech and Au-
dio Processing, 2(2), 299-311, 1994.
[10] C.E. Shannon, A mathematical theory of communication, Bell System Tech. J.,
27, 379-423, 623-656, 1948.
[11] C.E. Shannon, Coding theorems for a discrete source with a fidelity criterion,
IRE Nat. Conv. Record, Part 4, 142-163, 1959.
[12] P. Viola, and W.M. Wells, Alignment by maximization of mutual information,
Int. J. of Comput. Vision, 24(2), 137-154, 1997.
