Channel Capacity
In this chapter we find the maximum number of distinguishable
signals for n uses of a communication channel.
This number grows exponentially with n, and the exponent is
known as the channel capacity.
1
Channel Capacity
If two different input sequences may give rise
to the same output sequence, the inputs are confusable.
To communicate reliably, we need to be able to reconstruct the input sequences at the output with a negligible probability of error.
By mapping the source into the appropriate “widely spaced”
input sequences to the channel, we can transmit a message
with very low probability of error and reconstruct the source
message at the output.
The maximum rate at which this can be done is called the
capacity of the channel.
2
Channel Capacity
Discrete Channel: We define a discrete channel to be a
system consisting of an input alphabet X and output alphabet
Y and a probability transition matrix p(y|x) that expresses the
probability of observing the output symbol y given that we
send the symbol x.
The channel is said to be memoryless if the probability
distribution of the output depends only on the input at
that time and is conditionally independent of previous
channel inputs or outputs.
3
Channel Capacity
Channel Capacity: We define the “information” channel capacity of a discrete memoryless channel as
C = max_{p(x)} I(X; Y),
where the maximum is taken over all possible input distributions p(x).
The operational definition of channel capacity is the highest rate, in bits per channel use, at which information can be sent with arbitrarily low probability of error.
Shannon’s second theorem establishes that the information channel
capacity is equal to the operational channel capacity. Thus, we drop
the word information in most discussions of channel capacity.
4
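As a concrete illustration of the quantity being maximized, here is a minimal Python sketch that computes I(X; Y) for a given input distribution p(x) and channel transition matrix p(y|x); the function name, the NumPy dependency, and the rows-indexed-by-x / columns-indexed-by-y matrix layout are illustrative assumptions, not anything specified in the text.

import numpy as np

def mutual_information(p_x, P):
    # I(X; Y) in bits for input distribution p_x and channel matrix P[x, y] = p(y|x).
    p_xy = p_x[:, None] * P                      # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)                       # output distribution p(y)
    mask = p_xy > 0
    ratio = p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]
    return float((p_xy[mask] * np.log2(ratio)).sum())

# Sanity check: a noiseless binary channel (identity matrix) with uniform input gives I = 1 bit.
print(mutual_information(np.array([0.5, 0.5]), np.eye(2)))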
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noiseless Binary Channel
The binary input is reproduced exactly at the output, as shown in the figure.
In this case, any transmitted bit is received without error. Hence,
one error-free bit can be transmitted per use of the channel.
We can also calculate the information capacity C = max I(X; Y) = 1
bit, which is achieved by using p(x) = (1/2, 1/2).
5
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Channel with Nonoverlapping Outputs
This channel has two possible outputs corresponding to each of the two inputs, as shown in the figure. The channel appears to be noisy, but really is not.
FIGURE: Noisy channel with nonoverlapping outputs. C = 1 bit.
6
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Typewriter
In this case the channel input is either received unchanged at the
output with probability ½ or is transformed into the next letter with
probability ½, as shown in the figure. If the input has 26 symbols and
we use every alternate input symbol, we can transmit one of 13
symbols without error with each transmission. Hence, the capacity
of this channel is log 13 bits per transmission.
We can also calculate the information capacity C = max I (X; Y) =
max ( H(Y) − H(Y|X) ) = max H(Y) − 1 = log 26 − 1 = log 13,
achieved by using p(x) distributed uniformly over all the inputs.
7
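A short, self-contained numeric check of C = log 13 under the transition rule described above; the construction of the 26×26 transition matrix and the variable names are illustrative choices.

import numpy as np

# Noisy typewriter: input i is received as i or (i + 1) mod 26, each with probability 1/2.
P = np.zeros((26, 26))
for i in range(26):
    P[i, i] = 0.5
    P[i, (i + 1) % 26] = 0.5

p_x = np.full(26, 1 / 26)                              # uniform input distribution
p_y = p_x @ P                                          # output distribution (also uniform)
safe_log = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
H_Y = -(p_y * np.log2(p_y)).sum()                      # H(Y) = log 26
H_Y_given_X = -(p_x * (P * safe_log).sum(axis=1)).sum()  # = 1 bit
print(H_Y - H_Y_given_X, np.log2(13))                  # both ~ 3.7004 bits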
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Typewriter
8
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel
An example of a BSC is shown in the following figure. This is a binary channel in which the input symbols are complemented with probability p.
Figure: Binary symmetric channel. C = 1 − H(p) bits.
9
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel
We can write the mutual information as
I(X; Y) = H(Y) − H(Y|X) = H(Y) − Σ_x p(x) H(Y | X = x) = H(Y) − Σ_x p(x) H(p) = H(Y) − H(p) ≤ 1 − H(p),
10
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel
where the last inequality follows because Y is a binary
random variable. Equality is achieved when the input
distribution is uniform. Hence, the information capacity of a
binary symmetric channel with parameter p is
C = 1 − H(p) bits.
11
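As a numeric check of C = 1 − H(p), this sketch brute-forces the maximization over the input distribution for a BSC with an assumed crossover probability p = 0.1; the helper names and the grid search are illustrative choices.

import numpy as np

def binary_entropy(p):
    # H(p) in bits.
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(a, p):
    # I(X; Y) for a BSC with crossover probability p and Pr(X = 1) = a.
    q = a * (1 - p) + (1 - a) * p                  # Pr(Y = 1)
    return binary_entropy(q) - binary_entropy(p)   # H(Y) - H(Y|X)

p = 0.1
best = max(bsc_mutual_information(a, p) for a in np.linspace(0, 1, 1001))
print(best, 1 - binary_entropy(p))                 # both ~ 0.531 bits, maximum at a = 1/2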
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
The analog of the BSC in which some bits are lost (rather than corrupted) is the binary erasure channel (BEC).
In this channel, a fraction α of the bits are
erased. The receiver knows which bits have
been erased. The binary erasure channel has
two inputs and three outputs, as shown in the figure.
12
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
The capacity of the binary erasure channel is computed as follows:
C = max_{p(x)} I(X; Y) = max_{p(x)} ( H(Y) − H(Y|X) ) = max_{p(x)} H(Y) − H(α).
13
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
We might think that H(Y) is log 3, but we cannot achieve this by any choice of input distribution p(x). Let E be the event {Y = e}. Using the expansion H(Y) = H(Y, E) = H(E) + H(Y | E), and letting Pr(X = 1) = π, we have
H(Y) = H(α) + (1 − α) H(π).
14
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
Hence, the capacity is
C = max_{p(x)} H(Y) − H(α) = max_π ( (1 − α) H(π) + H(α) ) − H(α) = max_π (1 − α) H(π) = 1 − α,
where the capacity is achieved by π = 1/2.
15
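A similar sketch checks C = 1 − α for an assumed erasure probability α = 0.2 by sweeping Pr(X = 1) = π; the output ordering (0, e, 1) is an illustrative convention.

import numpy as np

def bec_mutual_information(pi, alpha):
    # I(X; Y) for a BEC with erasure probability alpha and Pr(X = 1) = pi.
    p_y = np.array([(1 - pi) * (1 - alpha), alpha, pi * (1 - alpha)])  # Pr(Y = 0, e, 1)
    nz = p_y > 0
    H_Y = -(p_y[nz] * np.log2(p_y[nz])).sum()
    H_alpha = -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)  # H(Y|X) = H(alpha)
    return H_Y - H_alpha

alpha = 0.2
best = max(bec_mutual_information(pi, alpha) for pi in np.linspace(0, 1, 1001))
print(best, 1 - alpha)     # both ~ 0.8 bits, maximum at pi = 1/2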
Channel Capacity
Symmetric Channel
Consider a channel with the following transition matrix:
All the rows of the probability transition matrix are
permutations of each other and so are the columns.
Such a channel is said to be symmetric. Another
example of a symmetric channel is one of the form
16
Channel Capacity
Symmetric Channel
Another example of a symmetric channel is one of the form
Y = X + Z (mod c),
where Z has some distribution on the integers {0, 1, 2, . . . , c − 1}, X has the same alphabet as Z, and Z is independent of X.
In both cases we can find the channel capacity as follows. Letting r be a row of the transition matrix, we have
I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(r) ≤ log |Y| − H(r),
17
Channel Capacity
Symmetric Channel
with equality if the output distribution is uniform. But p(x) = 1/|X| achieves a uniform distribution on Y, as seen from
p(y) = Σ_x p(y|x) p(x) = (1/|X|) Σ_x p(y|x) = c / |X| = 1/|Y|,
where c is the sum of the entries in one column of the probability transition matrix.
Thus, the channel with the above transition matrix has the capacity
C = max_{p(x)} I(X; Y) = log |Y| − H(r),
and C is achieved by a uniform distribution on the input.
18
Channel Capacity
Symmetric Channel
Definition A channel is said to be symmetric if the rows of the
channel transition matrix p(y|x) are permutations of each other and
the columns are permutations of each other. A channel is said to be
weakly symmetric if every row of the transition matrix p(·|x) is a
permutation of every other row and all the column sums are equal.
For example, the channel with transition matrix
p(y|x) =
  1/3   1/6   1/2
  1/3   1/2   1/6
is weakly symmetric but not symmetric.
19
Channel Capacity
Symmetric Channel
Theorem: For a weakly symmetric channel,
C = log |Y| − H(r),
where r is any row of the transition matrix, and this is achieved by a uniform distribution on the input alphabet.
20
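A minimal sketch of this formula, assuming the transition matrix is stored with rows indexed by x and columns by y; the example matrix is one weakly symmetric channel chosen for illustration.

import numpy as np

def weakly_symmetric_capacity(P):
    # C = log|Y| - H(r) in bits, where r is any row of the transition matrix P[x, y].
    r = P[0]
    H_r = -(r[r > 0] * np.log2(r[r > 0])).sum()
    return np.log2(P.shape[1]) - H_r

# A weakly symmetric channel: rows are permutations of each other, column sums are all 2/3.
P = np.array([[1/3, 1/6, 1/2],
              [1/3, 1/2, 1/6]])
print(weakly_symmetric_capacity(P))   # log 3 - H(1/3, 1/6, 1/2) ≈ 0.126 bits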
Channel Capacity
Properties of Channel Capacity
4. I(X; Y) is a continuous function of p(x).
5. I(X; Y) is a concave function of p(x) (Theorem 2.7.4).
Since I (X; Y) is a concave function over a closed convex set,
a local maximum is a global maximum.
The maximum can then be found by standard nonlinear
optimization techniques such as gradient search.
21
Channel Capacity
Properties of Channel Capacity
Some of the methods that can be used include the following:
• Constrained maximization using calculus and the Kuhn–Tucker
conditions.
• The Frank–Wolfe gradient search algorithm.
• An iterative algorithm developed by Arimoto [25] and Blahut
[65].
In general, there is no closed-form solution for the capacity. But
for many simple channels it is possible to calculate the capacity
using properties such as symmetry.
22
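Below is a compact sketch of the Blahut–Arimoto iteration mentioned above; the stopping tolerance, iteration cap, and the BSC test case are illustrative assumptions, not part of the text. The iteration maintains lower and upper bounds on C, so the returned value is accurate to within the chosen tolerance.

import numpy as np

def blahut_arimoto(P, tol=1e-9, max_iter=10_000):
    # Estimate the capacity (in bits) of the DMC with transition matrix P[x, y] = p(y|x).
    m = P.shape[0]
    r = np.full(m, 1.0 / m)                 # current guess for the input distribution p(x)
    for _ in range(max_iter):
        q = r @ P                           # induced output distribution q(y)
        # D( p(.|x) || q ) for each input x, skipping zero-probability outputs
        kl = np.where(P > 0, P * np.log2(np.where(P > 0, P / q, 1.0)), 0.0).sum(axis=1)
        c = 2.0 ** kl
        lower, upper = np.log2(r @ c), np.log2(c.max())   # bracket the capacity
        if upper - lower < tol:
            break
        r = r * c / (r @ c)                 # Blahut–Arimoto update of p(x)
    return lower, r

# Test on a BSC with crossover probability 0.1: capacity should be 1 - H(0.1) ≈ 0.531 bits.
P_bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
C, p_opt = blahut_arimoto(P_bsc)
print(C, p_opt)                             # ~0.531, input distribution ~(0.5, 0.5)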
Channel Capacity
Preview of the Channel Coding Theorem
Now we prove Shannon’s second theorem, which gives an
operational meaning to the definition of capacity as the number of
bits we can transmit reliably over the channel.
Why can we transmit C bits of information over the channel?
The basic idea is that for large block lengths, every channel looks
like the noisy typewriter channel (Figure 7.4) and the channel has a
subset of inputs that produce essentially disjoint sequences at the
output.
23
Channel Capacity
Preview of the Channel Coding Theorem
For each (typical) input n-sequence, there are approximately 2^{nH(Y|X)} possible Y sequences, all of them equally likely (Figure 7.7). We should ensure that no two X sequences produce the same Y output sequence. Otherwise, we will not be able to decide which X sequence was sent.
The total number of possible (typical) Y sequences is ≈ 2^{nH(Y)}. This set has to be divided into sets of size 2^{nH(Y|X)} corresponding to the different input X sequences. The total number of disjoint sets is less than or equal to 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}.
Hence, we can send at most ≈ 2^{nI(X;Y)} distinguishable sequences of length n.
24
Channel Capacity
Some Definitions
Suppose a message W, drawn from the index set {1, 2, . . . , M}, results in the signal X^n(W) being sent by the transmitter. The receiver observes a random sequence Y^n ∼ p(y^n | x^n) and guesses the index W by an appropriate decoding rule Ŵ = g(Y^n). The receiver makes an error if Ŵ is not the same as the index W that was transmitted.
25
Channel Capacity
Some Definitions
Discrete Channel: A discrete channel, denoted by (X, p(y|x), Y), consists of two finite sets X and Y and a collection of probability mass functions p(y|x), one for each x ∈ X, such that for every x and y, p(y|x) ≥ 0, and for every x, Σ_y p(y|x) = 1, with the interpretation that X is the input and Y is the output of the channel.
Definition: The nth extension of the discrete memoryless channel (DMC) is the channel (X^n, p(y^n | x^n), Y^n), where
p(y_k | x^k, y^{k−1}) = p(y_k | x_k), k = 1, 2, . . . , n.
26
Channel Capacity
Some Definitions
If the channel is used without feedback [i.e., if the input symbols do not depend on the past output symbols, namely, p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1})], the channel transition function for the nth extension of the discrete memoryless channel reduces to
p(y^n | x^n) = Π_{i=1}^{n} p(y_i | x_i).
27
Channel Capacity
Some Definitions
Definition: An (M, n) code for the channel (X, p(y|x), Y) consists of the following:
1. An index set {1, 2, . . . , M}.
2. An encoding function X^n : {1, 2, . . . , M} → X^n, yielding codewords x^n(1), x^n(2), . . . , x^n(M). The set of codewords is called the codebook.
3. A decoding function
g : Y^n → {1, 2, . . . , M},
which is a deterministic rule that assigns a guess to each
possible received vector.
28
Channel Capacity
Some Definitions
Conditional Probability of Error: Let
λ_i = Pr( g(Y^n) ≠ i | X^n = x^n(i) ) = Σ_{y^n} p(y^n | x^n(i)) I( g(y^n) ≠ i )
be the conditional probability of error given that index i was sent, where I(·) is the indicator function.
Maximal Probability of Error: The maximal probability of error λ^(n) for an (M, n) code is defined as
λ^(n) = max_{i ∈ {1, 2, . . . , M}} λ_i.
29
Channel Capacity
Some Definitions
Average Probability of Error: The (arithmetic) average probability of error P_e^(n) for an (M, n) code is defined as
P_e^(n) = (1/M) Σ_{i=1}^{M} λ_i.
Note that if the index W is chosen according to a uniform distribution over the set {1, 2, . . . , M} and X^n = x^n(W), then
P_e^(n) = Pr( W ≠ g(Y^n) )
(i.e., P_e^(n) is the probability of error). Also, obviously,
P_e^(n) ≤ λ^(n).
30
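To make these definitions concrete, the sketch below evaluates λ_i, the maximal probability of error, and the average probability of error exactly for a toy (M, n) = (2, 3) repetition code over a BSC with an assumed crossover probability of 0.1, decoded by majority vote; the code, channel, and decoder are all illustrative choices.

from itertools import product
from math import prod

p = 0.1                                     # assumed BSC crossover probability
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}     # toy (M, n) = (2, 3) repetition code

def channel_prob(y, x):
    # p(y^n | x^n) for a memoryless BSC.
    return prod(p if yi != xi else 1 - p for yi, xi in zip(y, x))

def decode(y):
    # Majority-vote decoder g(y^n).
    return 2 if sum(y) >= 2 else 1

# lambda_i = sum over y^n of p(y^n | x^n(i)) * 1{ g(y^n) != i }
lam = {i: sum(channel_prob(y, x) for y in product((0, 1), repeat=3) if decode(y) != i)
       for i, x in codebook.items()}
max_err = max(lam.values())                 # maximal probability of error
avg_err = sum(lam.values()) / len(lam)      # average probability of error
print(lam, max_err, avg_err)                # each lambda_i = 3 p^2 (1 - p) + p^3 = 0.028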
Channel Capacity
Some Definitions
Rate: The rate R of an (M, n) code is
R = (log M) / n bits per transmission.
Achievable rate: A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of error λ^(n) tends to 0 as n → ∞.
Later, we write (2^{nR}, n) codes to mean (⌈2^{nR}⌉, n) codes. This will simplify the notation.
31
Channel Capacity
Some Definitions
Capacity: The capacity of a channel is the supremum of all achievable
rates.
Thus, rates less than capacity yield arbitrarily small probability of error
for sufficiently large block lengths.
32
Channel Capacity
Some Definitions
Jointly Typical Sequences: The set A_ε^(n) of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is the set of n-sequences with empirical entropies ε-close to the true entropies:
A_ε^(n) = { (x^n, y^n) ∈ X^n × Y^n :
| −(1/n) log p(x^n) − H(X) | < ε,
| −(1/n) log p(y^n) − H(Y) | < ε,
| −(1/n) log p(x^n, y^n) − H(X, Y) | < ε },
33
Channel Capacity
Some Definitions
where p(x^n, y^n) = Π_{i=1}^{n} p(x_i, y_i).
34
Theorem 7.6.1 (Joint AEP): Let (X^n, Y^n) be sequences of length n drawn i.i.d. according to p(x^n, y^n) = Π_{i=1}^{n} p(x_i, y_i). Then:
1. Pr( (X^n, Y^n) ∈ A_ε^(n) ) → 1 as n → ∞.
2. |A_ε^(n)| ≤ 2^{n(H(X,Y) + ε)}.
3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals as p(x^n, y^n)], then
Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≤ 2^{−n(I(X;Y) − 3ε)};
and, for sufficiently large n,
Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.
35
Proof:
1. By the weak law of large numbers (LLN), −(1/n) log p(X^n) → H(X), −(1/n) log p(Y^n) → H(Y), and −(1/n) log p(X^n, Y^n) → H(X, Y) in probability, so
Pr( (X^n, Y^n) ∈ A_ε^(n) ) → 1 as n → ∞.
36
2. To prove the second part of the theorem, we have
1 ≥ Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n, y^n) ≥ |A_ε^(n)| 2^{−n(H(X,Y) + ε)},
and hence
|A_ε^(n)| ≤ 2^{n(H(X,Y) + ε)}.
37
3. Now if X̃^n and Ỹ^n are independent but have the same marginals as X^n and Y^n, then
Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n) ≤ 2^{n(H(X,Y) + ε)} 2^{−n(H(X) − ε)} 2^{−n(H(Y) − ε)},
38
and hence
Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≤ 2^{−n(I(X;Y) − 3ε)}.
39
By arguments similar to those for the upper bound, we can also show that, for n sufficiently large,
Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.
40
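A small Monte Carlo check of part 1 of the theorem, under an assumed joint pmf on {0, 1} × {0, 1}: the fraction of i.i.d. pairs (X^n, Y^n) that land in A_ε^(n) approaches 1 as n grows. The pmf, ε, block lengths, and trial count are all illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Assumed joint pmf p(x, y) on {0, 1} x {0, 1} (illustrative choice).
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
p_x, p_y, flat = p_xy.sum(axis=1), p_xy.sum(axis=0), p_xy.ravel()

def H(p):
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

H_X, H_Y, H_XY = H(p_x), H(p_y), H(flat)

eps, trials = 0.1, 2000
for n in (50, 200, 800):
    hits = 0
    for _ in range(trials):
        idx = rng.choice(4, size=n, p=flat)      # i.i.d. draws of the pair (X_i, Y_i)
        x, y = idx // 2, idx % 2
        typical = (abs(-np.log2(p_x[x]).mean() - H_X) < eps
                   and abs(-np.log2(p_y[y]).mean() - H_Y) < eps
                   and abs(-np.log2(flat[idx]).mean() - H_XY) < eps)
        hits += typical
    print(n, hits / trials)                      # fraction of jointly typical pairs; tends to 1 as n grows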
Channel Coding Theorem
Theorem 7.7.1
(Channel coding theorem) For a discrete memoryless channel,
all rates below capacity C are achievable. Specifically, for
every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0.
Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have
R ≤ C.
41
Channel Coding Theorem
Theorem 7.7.1
Proof: We prove that rates R < C are achievable.
Fix a distribution p(x). Generate a (2^{nR}, n) code at random according to p(x). Specifically, we generate 2^{nR} codewords independently according to the distribution
p(x^n) = Π_{i=1}^{n} p(x_i).
We generate the 2^{nR} codewords as the rows of a matrix:
C = [ x_1(1)       x_2(1)       . . .   x_n(1)
      x_1(2)       x_2(2)       . . .   x_n(2)
      . . .
      x_1(2^{nR})  x_2(2^{nR})  . . .   x_n(2^{nR}) ].
42
Channel Coding Theorem
Theorem 7.7.1
Each entry in this matrix is generated i.i.d. according to p(x). Thus, the probability that we generate a particular code C is
Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p( x_i(w) ).     (7.62)
43
Channel Coding Theorem
The random code is then used as follows:
1. A random code C is generated as described above according to p(x).
2. The code C is revealed to both sender and receiver. Both are also assumed to know the channel transition matrix p(y|x).
3. A message W is chosen according to a uniform distribution: Pr(W = w) = 2^{−nR}, w = 1, 2, . . . , 2^{nR}.
4. The w-th codeword X^n(w), corresponding to the w-th row of C, is sent over the channel.
5. The receiver receives a sequence Y^n according to the distribution p(y^n | x^n(w)) = Π_{i=1}^{n} p(y_i | x_i(w)).
44
Channel Coding Theorem
Theorem 7.7.1
6. Now consider jointly typical decoding: the receiver declares that the index Ŵ was sent if the following conditions are satisfied:
• (X^n(Ŵ), Y^n) is jointly typical;
• there is no other index W′ ≠ Ŵ such that (X^n(W′), Y^n) ∈ A_ε^(n).
If no such Ŵ exists, or if there is more than one, an error is declared.
45
Channel Coding Theorem
Theorem 7.7.1
Probability of error: We calculate the average over all codes
generated at random according to the distribution (7.62). By the
symmetry of the code construction, the average probability of
error does not depend on the particular index that was sent.
There are two different sources of error when we use jointly
typical decoding:
Either the output Y^n is not jointly typical with the transmitted codeword, or there is some other codeword that is jointly typical with Y^n.
46
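The Monte Carlo sketch below mimics the random-coding argument on a BSC: it draws a random codebook, transmits a uniformly chosen message, and decodes. For simplicity it uses minimum-Hamming-distance (maximum-likelihood) decoding rather than jointly typical decoding, and the rate, block length, and crossover probability are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
p, n, R = 0.1, 40, 0.3                 # BSC crossover, block length, rate (< 1 - H(0.1) ≈ 0.53)
M = int(2 ** (n * R))                  # number of messages, here 2^12 = 4096
trials, errors = 200, 0

for _ in range(trials):
    codebook = rng.integers(0, 2, size=(M, n))            # random code: entries i.i.d. Bernoulli(1/2)
    w = rng.integers(M)                                    # message chosen uniformly
    y = codebook[w] ^ (rng.random(n) < p)                  # transmit over the BSC
    w_hat = int(np.argmin((codebook != y).sum(axis=1)))    # minimum-distance (ML) decoding
    errors += (w_hat != w)

print(errors / trials)   # empirical error probability is small because R < C, and shrinks as n grows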
Channel Coding Theorem
Theorem 7.7.1
47
Channel Coding Theorem
Theorem 7.7.1
48
Channel Coding Theorem
Theorem 7.7.1
49
Channel Coding Theorem
Theorem 7.7.1
50
Channel Coding Theorem
Theorem 7.7.1
51
Zero-error Codes
52
Zero-error Codes
53
Channel Coding Theorem
Theorem 7.7.1
54
Universal Codes and Channel Capacity
We define the redundancy of the code as the difference between the expected length of the code and the lower limit for the expected length:
R(p, l) = E_p[ l(X) ] − H(p) = Σ_x p(x) l(x) − Σ_x p(x) log (1/p(x)) = Σ_x p(x) log ( p(x)/q(x) ) = D(p || q),
the relative entropy between p and q.
55
Universal Codes and Channel Capacity
Here q(x) = 2^{−l(x)} is the distribution that corresponds to the codeword lengths l(x).
We want to find a code that performs well without knowing the true distribution p_θ, so we define the minimax redundancy as
R* = min_q max_θ D(p_θ || q).
This minimax redundancy is achieved by a distribution q that is at the “center” of the information ball containing the distributions p_θ: the distribution q whose maximum distance from any of the distributions p_θ is minimized (Figure 13.1).
56
Universal Codes and Channel Capacity
To find the distribution q that is as close as possible to all the possible p_θ in relative entropy, consider the following channel:
θ → p_θ(x) → X.
This is a channel { θ, p_θ(x), X } with the rows of the transition matrix equal to the p_θ’s, the possible distributions of the source.
59
Universal Codes and Channel Capacity
60
Universal Codes and Channel Capacity
61
Universal Codes and Channel Capacity
Example:
62
Universal Codes and Channel Capacity
We will show that the minimax redundancy R* is equal to the capacity of this channel, and that the corresponding optimal coding distribution is the output distribution of this channel induced by the capacity-achieving input distribution. The capacity of this channel is given by
C = max_{π(θ)} I(θ; X),
where π(θ) is an input distribution on θ, the mutual information is
I(θ; X) = Σ_θ π(θ) Σ_x p_θ(x) log ( p_θ(x) / q_π(x) ),
and q_π(x) = Σ_θ π(θ) p_θ(x) is the induced output distribution.
63
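Since R* equals the capacity of this channel, both sides can be computed numerically. The self-contained sketch below does this by brute force for a family of two Bernoulli sources (an illustrative choice), checking that max_π I(θ; X) and min_q max_θ D(p_θ || q) agree; for larger families the Blahut–Arimoto sketch given earlier could be used instead.

import numpy as np

def kl(p, q):
    # D(p || q) in bits.
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

# Two assumed source distributions p_theta (Bernoulli(0.2) and Bernoulli(0.8)), stacked as channel rows.
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])

# Channel capacity: max over pi(theta) of I(theta; X) = sum_theta pi(theta) D(p_theta || q_pi).
capacity = max(
    pi * kl(P[0], pi * P[0] + (1 - pi) * P[1]) + (1 - pi) * kl(P[1], pi * P[0] + (1 - pi) * P[1])
    for pi in np.linspace(0, 1, 10001)
)

# Minimax redundancy: min over q of max_theta D(p_theta || q).
minimax = min(
    max(kl(P[0], np.array([a, 1 - a])), kl(P[1], np.array([a, 1 - a])))
    for a in np.linspace(0.001, 0.999, 9991)
)

print(capacity, minimax)   # both ~ 0.278 bits: R* equals the capacity of the channel with rows p_theta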
Universal Codes and Channel Capacity
64
Universal Codes and Channel Capacity
Proof: Let π(θ) be an input distribution on θ ∈ {1, 2, . . . , m}, and let the induced output distribution be q_π:
q_π(x) = Σ_{θ=1}^{m} π(θ) p_θ(x).
Then, for any distribution q on the output, we have:
65
Universal Codes and Channel Capacity
Proof (continued): For any distribution q on the output,
Σ_θ π(θ) D(p_θ || q) = Σ_θ π(θ) D(p_θ || q_π) + D(q_π || q) ≥ Σ_θ π(θ) D(p_θ || q_π)
for all q, with equality iff q = q_π.
66
Universal Codes and Channel Capacity
Proof (continued): Thus, for all q,
Σ_θ π(θ) D(p_θ || q) ≥ Σ_θ π(θ) D(p_θ || q_π),
and therefore
min_q Σ_θ π(θ) D(p_θ || q) = Σ_θ π(θ) D(p_θ || q_π) = I(θ; X)
is achieved when q = q_π.
67