Data Compression
Techniques
By…
Sukanta Behera
Reg. No. 07SBSCA048
Data Compression
Lossless data compression:
Store/Transmit big files using fewer bytes so that the original files can be perfectly retrieved. Example: zip.
Lossy data compression:
Store/Transmit big files using fewer bytes so that the original files can be approximately retrieved. Example: mp3.
Motivation: Save storage space and/or
bandwidth.
Definition of Codec
Let Σ be an alphabet and let S ⊆ Σ* be a set of possible messages.
A lossless codec (c,d) consists of
A coder c : S → {0,1}*
A decoder d : {0,1}* → Σ*
so that
∀ x ∈ S: d(c(x)) = x
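A minimal sketch of such a codec in Python, assuming a made-up three-message set S = {a, b, c}; the table CODE and the names c and d are illustrative, not from the slides:

CODE = {"a": "0", "b": "10", "c": "11"}     # the coder c, as a lookup table
DECODE = {w: x for x, w in CODE.items()}    # the decoder d, the inverse map

def c(x):
    return CODE[x]

def d(w):
    return DECODE[w]

# The defining property of a lossless codec: for every x in S, d(c(x)) == x.
assert all(d(c(x)) == x for x in CODE)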
Remarks
It is necessary for c to be an injective map.
If we do not worry about efficiency, we don’t have
to specify d if we have specified c.
Terminology: Sometimes we just say “code”
rather than “codec”.
Terminology: The set c(S) is called the set of
code words of the codec. In examples to follow,
we often just state the set of code words.
Proposition
Let S = {0,1}^n. Then, for any codec (c,d), there is some x ∈ S so that |c(x)| ≥ n.
(There are 2^n messages but only 2^n – 1 binary strings of length less than n, so the injective map c cannot give every message a shorter code word.)
“Compression is impossible”
Proposition
For any message x, there is a codec
(c,d) so that |c(x)|=1.
“The Encyclopedia Britannica can be
compressed to 1 bit”.
Remarks
We cannot compress all data. Thus, we must
concentrate on compressing “relevant” data.
It is trivial to compress data known in advance.
We should concentrate on compressing data about
which there is uncertainty.
We will use probability theory as a tool to model
uncertainty about relevant data.
Can random data be
compressed?
Suppose Σ = {0,1} and S = {0,1}^2.
We know we cannot compress all data, but
can we do well on the average?
Let us assume the uniform distribution on
S and look at the expected length of the
code words.
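A small sketch of this calculation, assuming each of the four messages has probability 1/4; the two candidate codes below are illustrative:

fixed  = {"00": "00", "01": "01", "10": "10", "11": "11"}   # fixed-length code
skewed = {"00": "0", "01": "10", "10": "110", "11": "111"}  # favours message 00

def expected_length(code, p=0.25):
    # Expected code-word length under the uniform distribution on S.
    return sum(p * len(w) for w in code.values())

print(expected_length(fixed))    # 2.0
print(expected_length(skewed))   # 2.25 -- no better than 2 bits on average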
Definition of prefix codes
A prefix code c is a code with the property that
for all different messages x and y, c(x) is not a
prefix of c(y).
Example: Fixed-length codes (such as ASCII).
Example: {0,11,10}
All codes in this course will be prefix codes.
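A small sketch of how the prefix property can be checked mechanically (the function name is illustrative):

def is_prefix_code(codewords):
    # No code word may be a prefix of a different code word.
    for u in codewords:
        for v in codewords:
            if u != v and v.startswith(u):
                return False
    return True

print(is_prefix_code({"0", "11", "10"}))   # True
print(is_prefix_code({"0", "01", "11"}))   # False: "0" is a prefix of "01"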
Proposition
If c is a prefix code for S = Σ^1 then c^n is a prefix code for S = Σ^n, where
c^n(x_1 x_2 … x_n) = c(x_1)·c(x_2)·…·c(x_n).
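A sketch of the extended code c^n and of why the prefix property makes greedy decoding unambiguous, assuming the illustrative code {a: 0, b: 10, c: 11}:

code = {"a": "0", "b": "10", "c": "11"}     # a prefix code on Σ = {a, b, c}
inv = {w: x for x, w in code.items()}

def encode(xs):
    # c^n: concatenate the code words of the individual symbols.
    return "".join(code[x] for x in xs)

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:          # a complete code word; no other word extends it
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(encode("abcb"))            # 0101110
print(decode(encode("abcb")))    # abcb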
Prefix codes and trees
Set of code words of a prefix code:
{0,11,10}.
[Figure: binary tree with edge labels 0 and 1; the leaves are the code words 0, 10 and 11.]
Alternative view of prefix
codes
A prefix code is an assignment of the
messages of S to the leaves of a
rooted binary tree.
The codeword of a message x is
found by reading the labels on the
edges on the path from the root of the
tree to the leaf corresponding to x.
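A sketch of reading code words off such a tree, assuming a tree written as nested pairs (left, right) with messages at the leaves:

def codewords(tree, path=""):
    if isinstance(tree, tuple):              # internal node: recurse on both children
        left, right = tree
        return {**codewords(left, path + "0"), **codewords(right, path + "1")}
    return {tree: path}                      # leaf: the path so far is the code word

# The tree behind the code {0, 10, 11}: "a" below the 0-edge, "b" and "c" below the 1-edge.
print(codewords(("a", ("b", "c"))))          # {'a': '0', 'b': '10', 'c': '11'}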
Binary trees and the interval
[0,1)
[Figure: binary tree drawn over [0,1); the 0-edge of the root leads to [0,1/2), the 1-edge to [1/2,1), which splits into [1/2,3/4) and [3/4,1); tick marks at 0, 1/4, 1/2, 3/4, 1.]
Alternative view of prefix
codes
A prefix code is an assignment of the
messages of S to disjoint dyadic
intervals.
A dyadic interval is a real interval of the form [k·2^-m, (k+1)·2^-m) with k+1 ≤ 2^m. The corresponding code word is the m-bit binary representation of k.
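A sketch of the correspondence between code words and dyadic intervals (the function names are illustrative):

def interval_of(codeword):
    # An m-bit code word, read as the binary representation of k,
    # corresponds to the dyadic interval [k/2^m, (k+1)/2^m).
    m = len(codeword)
    k = int(codeword, 2)
    return (k / 2**m, (k + 1) / 2**m)

def codeword_of(k, m):
    return format(k, "b").zfill(m)           # the m-bit binary representation of k

print(interval_of("10"))     # (0.5, 0.75), i.e. [1/2, 3/4)
print(codeword_of(3, 2))     # 11, the code word for [3/4, 1)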
Kraft-McMillan Inequality
Let m_1, m_2, … be the lengths of the code words of a prefix code. Then ∑_i 2^(-m_i) ≤ 1.
Let m_1, m_2, … be integers with ∑_i 2^(-m_i) ≤ 1. Then there is a prefix code c so that {m_i} are the lengths of the code words of c.
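A sketch of both directions, assuming the code-word lengths are given as a list of integers; the greedy construction below only yields a valid prefix code when the Kraft sum is at most 1:

def kraft_sum(lengths):
    return sum(2.0 ** -m for m in lengths)

def code_from_lengths(lengths):
    # Assign code words greedily, shortest lengths first: each new word takes
    # the next free dyadic interval, so no word is a prefix of another.
    words, k, prev = [], 0, 0
    for m in sorted(lengths):
        k <<= (m - prev)
        words.append(format(k, "b").zfill(m))
        k += 1
        prev = m
    return words

print(kraft_sum([1, 2, 2]))           # 1.0 <= 1, so a prefix code exists
print(code_from_lengths([1, 2, 2]))   # ['0', '10', '11']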
Probability
A probability distribution p on S is a map p : S → [0,1] so that ∑_{x ∈ S} p(x) = 1.
A U-valued stochastic variable is a map Y : S → U.
If Y : S → R is a stochastic variable, its expected value E[Y] is ∑_{x ∈ S} p(x) Y(x).
Self-entropy
Given a probability distribution p on S, the self-entropy of x ∈ S is defined as
H(x) = – log2 p(x).
The self-entropy of a message with probability 1 is 0 bits.
The self-entropy of a message with probability 0 is +∞.
The self-entropy of a message with probability ½ is 1 bit.
We often measure entropy in the unit “bits”.
Entropy
Given a probability distribution p on S, its
entropy H[p] is defined as E[H], i.e.
H[p] = – ∑_{x ∈ S} p(x) log2 p(x).
For a stochastic variable X, its entropy H[X]
is the entropy of its underlying distribution:
H[X] = – ∑_i Pr[X=i] log2 Pr[X=i]
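A minimal sketch of these definitions in code; the distribution p below is an arbitrary example:

from math import log2

def self_entropy(p, x):
    return -log2(p[x])                 # H(x) = -log2 p(x)

def entropy(p):
    # H[p] = -sum over x of p(x) log2 p(x); terms with p(x) = 0 contribute 0.
    return -sum(px * log2(px) for px in p.values() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(self_entropy(p, "a"))   # 1.0 bit
print(entropy(p))             # 1.5 bits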
Facts
The entropy of the uniform distribution on {0,1}^n is n bits. Any other distribution on {0,1}^n has strictly smaller entropy.
If X1 and X2 are independent stochastic variables, then H(X1, X2) = H(X1) + H(X2).
For any function f, H(f(X)) ≤ H(X).
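A small numerical check of the last two facts, as a sketch on two independent fair bits; the entropy helper is the same as above:

from collections import defaultdict
from math import log2

def entropy(p):
    return -sum(px * log2(px) for px in p.values() if px > 0)

p1 = {0: 0.5, 1: 0.5}                                    # a fair bit
joint = {(x, y): p1[x] * p1[y] for x in p1 for y in p1}  # independent pair (X1, X2)
print(entropy(joint), entropy(p1) + entropy(p1))         # 2.0 and 2.0: additivity

# H(f(X)) <= H(X): push the joint distribution through f(x, y) = x XOR y.
f_dist = defaultdict(float)
for (x, y), pxy in joint.items():
    f_dist[x ^ y] += pxy
print(entropy(f_dist), "<=", entropy(joint))             # 1.0 <= 2.0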
Shannon’s theorem
Let S be a set of messages and let X be an S-valued stochastic variable.
For all prefix codes c on S,
E[ |c(X)| ] ≥ H[X].
There is a prefix code c on S so that
E[ |c(X)| ] < H[X] + 1.
In fact, for all x in S, |c(x)| < H(x) + 1.
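One standard way to obtain such a code (a sketch, assuming an arbitrary example distribution) is to choose code-word length ⌈– log2 p(x)⌉ for each message x: these lengths satisfy the Kraft-McMillan inequality, so a prefix code with them exists, and its expected length is below H[X] + 1.

from math import ceil, log2

p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = {x: ceil(-log2(px)) for x, px in p.items()}   # Shannon code lengths

kraft = sum(2.0 ** -m for m in lengths.values())
expected = sum(px * lengths[x] for x, px in p.items())
H = -sum(px * log2(px) for px in p.values())

print(kraft)          # 0.6875 <= 1, so a prefix code with these lengths exists
print(H, expected)    # approx 1.85 and 2.4: H[X] <= E[|c(X)|] < H[X] + 1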