Unit 2
Uncertainty and Information
1. Introduction
In computer science, mathematics, and information theory, uncertainty
refers to the lack of complete knowledge about a system, event, or
outcome, while information represents the amount of knowledge
gained when uncertainty is reduced.
These concepts are foundational in probability theory, data
compression, cryptography, and decision-making systems.
Importance in Computer Science
1. Data Compression:
• Files are compressed by removing predictable (low-uncertainty) parts.
• Example: Huffman coding assigns shorter codes to more probable symbols (see the sketch after this list).
2. Cryptography:
• Secure keys have high uncertainty (entropy), making them hard to guess.
3. Machine Learning:
• Models reduce uncertainty about unseen data by learning patterns.
4. Communication Systems:
• Efficient transmission depends on quantifying uncertainty and minimizing errors.
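As a rough illustration of the data-compression point above, here is a minimal sketch using Python's standard zlib module (the byte strings are arbitrary choices, not from the notes): highly predictable, low-uncertainty data compresses far better than unpredictable, high-uncertainty data.

import os
import zlib

# Highly predictable (low-uncertainty) data: one byte repeated 10,000 times.
predictable = b"A" * 10_000
# Unpredictable (high-uncertainty) data: 10,000 random bytes.
unpredictable = os.urandom(10_000)

print(len(zlib.compress(predictable)))    # a few dozen bytes
print(len(zlib.compress(unpredictable)))  # close to 10,000 bytes (almost no compression)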
2. Uncertainty
Definition
Uncertainty is the degree to which the outcome of an event is
unpredictable.
If we already know the outcome of an event, there is no uncertainty. If
multiple outcomes are possible and we cannot predict with certainty
which will occur, there is uncertainty.
• Mathematically, uncertainty is linked with probability:
• Certain event: Probability = 1 (no uncertainty)
• Impossible event: Probability = 0 (no uncertainty)
• Uncertain event: 0 < Probability < 1
3. Measuring Uncertainty
Shannon Entropy
1. Introduction
• Shannon entropy, introduced by Claude E. Shannon in his landmark
1948 paper “A Mathematical Theory of Communication”, is a measure
of uncertainty in a random variable or the average information
content per outcome.
• It tells us how many bits, on average, are needed to represent the
outcome of a random process.
Definition:
For a discrete random variable X that takes value x with probability p(x), the Shannon entropy is
H(X) = − Σ p(x) log₂ p(x)
measured in bits (logarithm base 2).
4. Properties of Uncertainty
• Higher entropy → more uncertainty; outcomes are more unpredictable.
• Example: Tossing a fair coin (p(H)=0.5, p(T)=0.5) gives maximum uncertainty.
• Lower entropy → less uncertainty; outcomes are more predictable.
• Example: A biased coin (p(H)=0.99, p(T)=0.01) has less uncertainty.
• Zero entropy → no uncertainty; the outcome is known in advance.
• Example: Tossing a coin with both sides heads.
5. Properties of Shannon Entropy
• Non-negativity: H(X) ≥ 0 for every distribution.
• H(X) = 0 exactly when one outcome has probability 1.
• Maximum for the uniform distribution: with n equally likely outcomes, H(X) = log₂ n, the largest possible value.
• Additivity: H(X, Y) = H(X) + H(Y) when X and Y are independent.
6. Examples
6.1 Fair Coin Toss
p(H) = p(T) = 0.5, so H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit.
6.2 Biased Coin Toss
p(H) = 0.99, p(T) = 0.01, so H = −(0.99 log₂ 0.99 + 0.01 log₂ 0.01) ≈ 0.08 bits.
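A minimal Python sketch of these two examples; the entropy helper below simply implements the definition H(X) = − Σ p(x) log₂ p(x).

import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin       -> 1.0 bit (maximum uncertainty)
print(entropy([0.99, 0.01]))  # biased coin     -> about 0.08 bits
print(entropy([1.0]))         # certain outcome -> 0.0 bits (no uncertainty)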
INFORMATION
Definition
Information is the reduction of uncertainty.
When we learn the outcome of an event, we gain information.
More uncertainty resolved → more information gained.
Shannon’s View of Information:
The information gained by observing an outcome x of probability p(x) is
I(x) = log₂(1 / p(x)) = − log₂ p(x) bits,
so rare (low-probability) outcomes carry more information than common ones.
Relationship Between Uncertainty and Information
• Before knowing the outcome: We have uncertainty.
• After knowing the outcome: Uncertainty decreases, and the reduction is information.
• Example:
• Before tossing a fair coin: H=1 bit (uncertainty).
• After knowing "Heads": Uncertainty = 0 → Information gained = 1 bit.
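A short sketch of this idea using self-information I(x) = − log₂ p(x); the probabilities below are illustrative.

import math

def self_information(p):
    # Information gained (in bits) from observing an outcome of probability p.
    return -math.log2(p)

print(self_information(0.5))   # fair-coin outcome      -> 1.0 bit
print(self_information(1/6))   # one face of a fair die -> about 2.585 bits
print(self_information(0.99))  # near-certain outcome   -> about 0.014 bits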
Joint Entropy, Conditional Entropy, and Mutual Information
• Joint Entropy
Definition:
For two discrete random variables X and Y with joint distribution p(x, y), the joint entropy is
H(X, Y) = − Σ Σ p(x, y) log₂ p(x, y),
the average uncertainty of the pair (X, Y) taken together.
Meaning of p(x,y)
• p(x,y) is the joint probability that two events happen at the same
time:
• Event X=x
• Event Y=y
• It’s the probability of both conditions being true simultaneously.
Formula:
p(x, y) = p(x) · p(y | x) = p(y) · p(x | y); if X and Y are independent, p(x, y) = p(x) · p(y).
Steps to Calculate p(x,y)
1. List all possible pairs (x, y) of outcomes.
2. For each pair, find the probability that X = x and Y = y occur together (from a joint probability table, or by dividing the joint frequency by the total count).
3. Check that the probabilities over all pairs sum to 1.
Example: Weather and Umbrella
Example: Two Independent Fair Coins
Example: Perfectly Correlated Dice Rolls
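A sketch computing H(X, Y) directly from a joint probability table for the last two examples above; the joint_entropy helper simply applies the definition.

import math

def joint_entropy(joint):
    # H(X, Y) = -sum over (x, y) of p(x, y) * log2 p(x, y)
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Two independent fair coins: p(x, y) = 0.25 for every pair.
coins = {(x, y): 0.25 for x in "HT" for y in "HT"}
print(joint_entropy(coins))  # 2.0 bits = H(X) + H(Y)

# Perfectly correlated dice: Y always equals X, so only the six diagonal pairs occur.
dice = {(i, i): 1 / 6 for i in range(1, 7)}
print(joint_entropy(dice))   # about 2.585 bits = H(X); Y adds no extra uncertainty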
Conditional Entropy
Relation between joint entropy and conditional entropy
Definitions:
The conditional entropy of Y given X is
H(Y | X) = − Σ Σ p(x, y) log₂ p(y | x),
the average uncertainty that remains about Y once X is known.
Relation (chain rule): H(X, Y) = H(X) + H(Y | X).
PROOF:
Since p(x, y) = p(x) · p(y | x),
H(X, Y) = − Σ p(x, y) log₂ p(x, y)
= − Σ p(x, y) [log₂ p(x) + log₂ p(y | x)]
= − Σ p(x, y) log₂ p(x) − Σ p(x, y) log₂ p(y | x)
= H(X) + H(Y | X),
where the first sum collapses to H(X) because Σ over y of p(x, y) equals p(x).
Numerical Example:
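A sketch that checks the chain rule H(X, Y) = H(X) + H(Y | X) numerically; the weather/umbrella joint table below is an illustrative assumption, not taken from the notes.

import math
from collections import defaultdict

# Illustrative joint distribution p(x, y); X = weather, Y = umbrella.
joint = {("rain", "yes"): 0.30, ("rain", "no"): 0.10,
         ("sun", "yes"): 0.05, ("sun", "no"): 0.55}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = defaultdict(float)              # marginal p(x)
for (x, _), p in joint.items():
    px[x] += p

H_XY = H(joint.values())             # joint entropy H(X, Y)
H_X = H(px.values())                 # marginal entropy H(X)
# Conditional entropy H(Y | X) = -sum p(x, y) * log2( p(x, y) / p(x) )
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

print(H_XY)               # joint entropy
print(H_X + H_Y_given_X)  # same value: the chain rule holds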
Properties of Conditional Entropy:
• H(Y | X) ≥ 0.
• H(Y | X) ≤ H(Y): knowing X can never increase the uncertainty about Y; equality holds when X and Y are independent.
• H(X | X) = 0: a variable leaves no uncertainty about itself.
• Chain rule: H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y).
Mutual Information (MI)
Definition
• Mutual information measures the amount of information that one random
variable contains about another.
It quantifies the reduction in uncertainty of one variable due to the knowledge of
the other.
Properties of MI:
• Non-negativity: I(X; Y) ≥ 0, with equality if and only if X and Y are independent.
• Symmetry: I(X; Y) = I(Y; X).
• Relation to entropies: I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y).
• I(X; X) = H(X): a variable shares all of its information with itself.
Interpretation
• High MI: Knowing X tells a lot about Y (strong relationship).
• Low MI: Knowing X tells little about Y.
• Zero MI: X and Y are independent.
Formula from Probabilities
I(X; Y) = Σ Σ p(x, y) log₂ [ p(x, y) / (p(x) p(y)) ].
Numerical Example
Case where MI > 0:
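A sketch computing I(X; Y) from a joint table: MI > 0 for the dependent pair below and MI = 0 for independent coins; both joint tables are illustrative.

import math
from collections import defaultdict

def mutual_information(joint):
    # I(X; Y) in bits, computed directly from a joint distribution p(x, y).
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Dependent variables (the same table as in the chain-rule sketch): MI > 0.
dependent = {("rain", "yes"): 0.30, ("rain", "no"): 0.10,
             ("sun", "yes"): 0.05, ("sun", "no"): 0.55}
print(mutual_information(dependent))    # about 0.36 bits

# Two independent fair coins: MI = 0.
independent = {(x, y): 0.25 for x in "HT" for y in "HT"}
print(mutual_information(independent))  # 0.0 bits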
Kullback–Leibler divergence
The Kullback–Leibler (KL) divergence, or relative entropy, between two distributions P and Q over the same outcomes is
D(P ∥ Q) = Σ p(x) log₂ [ p(x) / q(x) ].
It measures how different P is from Q: it is always non-negative, equals 0 only when P = Q, and is not symmetric.
Mutual information is the KL divergence between the joint distribution and the product of the marginals:
I(X; Y) = D( p(x, y) ∥ p(x) p(y) ).
Proof that Mutual Information is Always Non-negative
Since I(X; Y) = D( p(x, y) ∥ p(x) p(y) ) and the KL divergence is never negative (a consequence of Jensen's inequality applied to the concave logarithm), it follows that I(X; Y) ≥ 0, with equality exactly when p(x, y) = p(x) p(y) for all x, y, i.e. when X and Y are independent.
Example:
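A sketch of the KL divergence between two hand-picked distributions, illustrating the non-negativity used above; the distributions are arbitrary examples.

import math

def kl_divergence(p, q):
    # D(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]
biased = [0.9, 0.1]

print(kl_divergence(biased, fair))  # about 0.53 bits (> 0)
print(kl_divergence(fair, biased))  # about 0.74 bits (note: not symmetric)
print(kl_divergence(fair, fair))    # 0.0 bits (identical distributions)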
Definition of a Code
• A code is a mapping (assignment) of a sequence of bits (codewords)
to a set of source symbols.
• Purpose: Represent information (symbols, messages) in a binary form
suitable for storage or transmission.
Definition of Prefix Codes
• A prefix code is a variable-length code in which no codeword is a
prefix of another codeword.
• This ensures instantaneous decoding: as soon as a codeword is read,
it can be uniquely identified without looking ahead.
Example:
• Prefix code: {0, 10, 110, 111} (no codeword starts with another).
• Not a prefix code: {0, 01, 011} ✘ (because "0" is a prefix of "01" and "011").
Uniquely Decipherable (UD) Codes
Definition
• A code is uniquely decipherable if every encoded sequence can be
decoded into exactly one sequence of source symbols.
• This ensures that no two different symbol sequences produce the
same encoded string.
• UD codes may or may not be prefix-free.
Instantaneous Codes (Prefix-free Codes)
Definition
• An instantaneous code (or prefix-free code) is one where no
codeword is the prefix of another codeword.
• Guarantees immediate (instantaneous) decoding — as soon as a
codeword is read, you know where it ends.
• All instantaneous codes are uniquely decipherable.
Classes of Codes
• Source coding: Process of representing data (symbols from a source)
with as few bits as possible (compression). Example: Huffman coding,
Shannon-Fano coding.
• Entropy coding: A specific type of source coding where the code
length is based on the probability of symbols, approaching the
entropy limit. Examples: Arithmetic coding, Huffman coding.
Difference: Source coding is the broader concept; entropy coding
is a subset of it that uses symbol probabilities to achieve optimal
compression.
Kraft Inequality I
1. Introduction
• In information theory and coding, Kraft’s inequality gives a
fundamental condition that must be satisfied by the lengths of
codewords in a prefix code (also called instantaneous code).
• It tells us whether it is possible to construct a prefix code with given codeword lengths: for a binary alphabet, a prefix code with codeword lengths l1, l2, …, ln exists if and only if 2^(−l1) + 2^(−l2) + … + 2^(−ln) ≤ 1.
• It is the mathematical foundation for designing efficient codes like
Huffman codes.
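A small sketch that checks the Kraft sum 2^(−l1) + … + 2^(−ln) for proposed codeword lengths; the length sets below are illustrative.

def kraft_sum(lengths, D=2):
    # Kraft sum: total of D**(-l) over the proposed codeword lengths.
    return sum(D ** (-l) for l in lengths)

# Lengths of the prefix code {0, 10, 110, 111}: the sum is exactly 1 (a complete code).
print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a binary prefix code with these lengths exists
# Lengths 1, 1, 2 violate the inequality: no binary prefix code can have them.
print(kraft_sum([1, 1, 2]))     # 1.25 -> impossible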
Source Coding Theorem (Shannon’s Noiseless Coding Theorem)
No uniquely decipherable code can have an average codeword length smaller than the source entropy H(X); conversely, there always exists a prefix code whose average length is less than H(X) + 1.
Optimal Codes
Why Optimal Codes?
• Some symbols occur more frequently → they should get shorter
codewords.
• Rare symbols should get longer codewords.
• This reduces redundancy and saves storage/bandwidth.
The principle is the same as in Morse code, Huffman code,
Arithmetic coding, etc.
Bounds on the Optimal Codelength:
If L* denotes the average codeword length of an optimal prefix code for a source X, then
H(X) ≤ L* < H(X) + 1.
Block Coding
Definition
• Block coding is a source coding technique in which a sequence of k
source symbols is grouped into a block, and then encoded into a
fixed-length or variable-length codeword.
• Instead of encoding one symbol at a time, we encode multiple
symbols together.
• This improves efficiency and allows the average codeword length to
approach entropy more closely.
If Ln denotes the optimal average codeword length for blocks of n symbols, then n·H(X) ≤ Ln < n·H(X) + 1, i.e. H(X) ≤ Ln / n < H(X) + 1/n.
This means that as we group larger and larger blocks of symbols, the average length per symbol gets closer and closer to the entropy.
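A small numerical sketch of this bound for a biased binary source with probabilities 0.9 and 0.1 (an illustrative choice): the per-symbol upper bound H(X) + 1/n shrinks toward H(X) as the block length n grows.

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H = entropy([0.9, 0.1])  # about 0.469 bits/symbol
for n in (1, 2, 4, 8, 16):
    # The optimal per-symbol length for blocks of n symbols lies in [H, H + 1/n).
    print(f"n={n}: between {H:.3f} and {H + 1/n:.3f} bits/symbol")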
Shannon–Fano Coding
• Shannon-Fano coding, named after Claude Shannon and Robert Fano, was one of the first algorithms for constructing variable-length codes from symbol probabilities, although the codes it produces are not always optimal. We start with a set of n symbols with known probabilities (or frequencies) of occurrence.
• Idea: Construct a variable-length prefix code based on symbol
probabilities.
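A sketch of the Shannon-Fano construction: sort the symbols by decreasing probability, split the list into two groups of nearly equal total probability, assign 0 to one group and 1 to the other, and recurse on each group. The symbol probabilities below are illustrative.

def shannon_fano(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword (bit string).
    codes = {sym: "" for sym in probs}

    def split(group):
        # group: list of (symbol, probability) sorted by decreasing probability
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        running, cut, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)   # |prob(left part) - prob(right part)|
            if diff < best_diff:
                best_diff, cut = diff, i
        left, right = group[:cut], group[cut:]
        for sym, _ in left:
            codes[sym] += "0"
        for sym, _ in right:
            codes[sym] += "1"
        split(left)
        split(right)

    split(sorted(probs.items(), key=lambda item: -item[1]))
    return codes

print(shannon_fano({"a": 0.4, "b": 0.3, "c": 0.15, "d": 0.15}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}: more probable symbols get shorter codewords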
Huffman Coding
Idea: Construct an optimal prefix code by building a binary tree based on symbol probabilities.
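A compact Huffman sketch built with Python's heapq, using the same illustrative probabilities as the Shannon-Fano sketch; it also compares the average codeword length with the source entropy. In this particular example the two methods happen to produce the same code; in general Huffman coding is optimal, while Shannon-Fano can fall slightly short.

import heapq
import math

def huffman(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword.
    # Each heap entry: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in probs}  # degenerate single-symbol source
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)  # the two least probable subtrees
        p2, _, codes2 = heapq.heappop(heap)
        merged = {sym: "0" + cw for sym, cw in codes1.items()}
        merged.update({sym: "1" + cw for sym, cw in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.4, "b": 0.3, "c": 0.15, "d": 0.15}
codes = huffman(probs)
print(codes)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

avg_len = sum(probs[s] * len(codes[s]) for s in probs)
H = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, H)  # average length 1.9 bits/symbol vs entropy about 1.87 bits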