0% found this document useful (0 votes)

99 views550 pages

Essential Coding Theory Guide

This document is the preface to a textbook on essential coding theory. It provides an overview of the topics that will be covered in the textbook, including definitions of codes, error correction, distance of codes, the Hamming code, bounds on code parameters, linear codes, probability, entropy, Reed-Solomon codes, Shannon's noise model, and list decoding. The preface also acknowledges support from the National Science Foundation and licenses the content as open source under a Creative Commons license. It is dated October 3, 2023 and provides a web address for the latest version online.

Uploaded by

vapabyssboi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

99 views550 pages

Essential Coding Theory Guide

Uploaded by

vapabyssboi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 550

Essential Coding Theory

Venkatesan Guruswami Atri Rudra1 Madhu Sudan

October 3, 2023

1
Department of Computer Science and Engineering, University at Buffalo, SUNY. Work supported by
NSF CAREER grant CCF-0844796.
ii
Foreword

This book is based on lecture notes from coding theory courses taught by Venkatesan Gu-
ruswami at University at Washington and CMU; by Atri Rudra at University at Buffalo, SUNY
and by Madhu Sudan at Harvard and MIT.
This version is dated October 3, 2023. For the latest version, please go to

http://www.cse.buffalo.edu/faculty/atri/courses/coding-theory/book/

The material in this book is supported in part by the National Science Foundation under CA-
REER grant CCF-0844796. Any opinions, findings and conclusions or recomendations expressed
in this material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation (NSF).

©Venkatesan Guruswami, Atri Rudra, Madhu Sudan, 2019.

This work is licensed under the Creative Commons Attribution-NonCommercial-
NoDerivs 3.0 Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/3.0/ or send a letter to Creative Commons, 444
Castro Street, Suite 900, Mountain View, California, 94041, USA.

iii
iv
Contents

1 The Fundamental Question 3

1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Some Definitions and Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Distance of a Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Hamming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Generalized Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Family of codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

I The Basics 27

2 A Look at Some Nicely Behaved Codes: Linear Codes 31

2.1 Groups and Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Vector Spaces and Linear Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Linear Codes and Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Hamming Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Efficient Decoding of Hamming codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Dual of a Linear Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3 Probability as Fancy Counting and the q-ary Entropy Function 53

3.1 A Crash Course on Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 The Probabilistic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 The q-ary Entropy Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

v
II The Combinatorics 69

4 What Can and Cannot Be Done-I 73

4.1 Asymptotic Version of the Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Gilbert-Varshamov Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Singleton Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Plotkin Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5 The Greatest Code of Them All: Reed-Solomon Codes 93

5.1 Polynomials and Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Maximum Distance Separable Codes and Properties . . . . . . . . . . . . . . . . . . 100
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6 What Happens When the Noise is Stochastic: Shannon’s Theorem 113

6.1 Overview of Shannon’s Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Shannon’s Noise Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 Shannon’s Result for BSCp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4 Hamming vs. Shannon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

7 Bridging the Gap Between Shannon and Hamming: List Decoding 131
7.1 Hamming versus Shannon: part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 List Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.3 The Johnson Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.4 List-Decoding Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.5 List Decoding from Random Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

8 What Cannot be Done-II 157

8.1 Elias-Bassalygo bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.2 The MRRW bound: A better upper bound . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.3 A Breather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

vi
III The Codes 163
9 When Polynomials Save the Day: Polynomial Based Codes 165
9.1 The generic construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2 The low degree case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.3 The case of the binary field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.4 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

10 From Large to Small Alphabets: Code Concatenation 179

10.1 Code Concatenation: The basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.2 Zyablov Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.3 Advanced Concatenation and Strongly Explicit Constructions . . . . . . . . . . . . 184
10.4 Summary of concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

11 When Graphs Come to the Party: Expander Codes 193

11.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
11.2 Expander Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
11.3 Basic Expander Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
11.4 Codes from weaker expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.5 Distance Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
11.6 Existence of lossless expanders: Proof of Theorem 11.2.3 . . . . . . . . . . . . . . . . 218
11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
11.8 Bibliographic notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

IV The Algorithms 225

12 Efficient Decoding of Reed-Solomon Codes 227
12.1 Unique decoding of Reed-Solomon codes . . . . . . . . . . . . . . . . . . . . . . . . 227
12.2 List Decoding Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

13 Decoding Reed-Muller Codes 259

13.1 A natural decoding algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.2 Majority Logic Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
13.3 Decoding by reduction to Reed-Solomon decoding . . . . . . . . . . . . . . . . . . . 268
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
13.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

vii
14 Decoding Concatenated Codes 279
14.1 A Natural Decoding Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
14.2 Decoding From Errors and Erasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
14.3 Generalized Minimum Distance Decoding . . . . . . . . . . . . . . . . . . . . . . . . 283
14.4 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

15 Efficiently Achieving the Capacity of the BSCp 289

15.1 Achieving capacity of B SC p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
15.2 Decoding Error Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
15.3 The Inner Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
15.4 The Outer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
15.5 Discussion and Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

16 Information Theory Strikes Back: Polar Codes 297

16.1 Achieving Gap to Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.2 Reduction to Linear Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
16.3 The Polarization Phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
16.4 Polar codes, Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
16.5 Analysis: Speed of Polarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.6 Entropic Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
16.7 Summary and additional information . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
16.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
16.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

17 Efficiently Achieving List Decoding Capacity 331

17.1 Folded Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
17.2 List Decoding Folded Reed-Solomon Codes: I . . . . . . . . . . . . . . . . . . . . . . 335
17.3 List Decoding Folded Reed-Solomon Codes: II . . . . . . . . . . . . . . . . . . . . . . 338
17.4 Bibliographic Notes and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

18 Fast encoding: linear time encodable codes 357

18.1 Overview of the construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
18.2 Low-density Error-Reduction Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
18.3 The error-correcting code: Recursive construction . . . . . . . . . . . . . . . . . . . 361
18.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
18.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

19 Recovering very locally: Locally Recoverable Codes 365

19.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
19.2 Definition of Locally Recoverable Codes . . . . . . . . . . . . . . . . . . . . . . . . . 366
19.3 A simple construction for message symbol LRCs . . . . . . . . . . . . . . . . . . . . . 367
19.4 A Singleton-type bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

viii
19.5 An LRC meeting the Singleton type bound . . . . . . . . . . . . . . . . . . . . . . . . 371
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
19.7 Bibliographic notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

V The Applications 377

20 Cutting Data Down to Size: Hashing 379
20.1 Why Should You Care About Hashing? . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
20.2 Avoiding Hash Collisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
20.3 Almost Universal Hash Function Families and Codes . . . . . . . . . . . . . . . . . . 384
20.4 Data Possession Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
20.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

21 Securing Your Fingerprints: Fuzzy Vaults 391

21.1 Some quick background on fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.2 The Fuzzy Vault Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
21.3 The Final Fuzzy Vault . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
21.4 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398

22 Finding Defectives: Group Testing 399

22.1 Formalization of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
22.2 Bounds on t a (d , N ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
22.3 Bounds on t (d , N ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
22.4 Coding Theory and Disjunct Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
22.5 An Application in Data Stream Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 409
22.6 Summary of best known bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
22.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
22.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

23 Complexity of Coding Problems 419

23.1 Nearest Codeword Problem (NCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
23.2 Decoding with Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
23.3 Approximate NCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
23.4 Distance bounded decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
23.5 Minimum distance problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
23.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
23.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
23.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437

24 Giving Computational Complexity a helping hand 439

24.1 Communication Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
24.2 Derandomization and Pseudorandomness . . . . . . . . . . . . . . . . . . . . . . . . 444
24.3 Hardcore Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454

ix
24.4 Average case complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
24.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
24.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

A Notation Table 481

B Some Useful Facts 483

B.1 Some Useful Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
B.2 Some Useful Identities and Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

C Background on Asymptotic notation, Algorithms and Complexity 487

C.1 Asymptotic Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
C.2 Bounding Algorithm run time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
C.3 Randomized Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
C.4 Efficient Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
C.5 More on intractability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
C.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
C.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505

D Basic Algebraic Algorithms 507

D.1 Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
D.2 Groups, Rings, Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
D.3 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
D.4 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
D.5 Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
D.6 Algorithmic aspects of Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
D.7 Algorithmic aspects of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
D.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526

E Some Information Theory Essentials 529

E.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
E.2 Joint and conditional entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
E.3 Mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534

x
List of Figures

1.1 Decoding for Akash English, one gets “I need little little (trail)mix." . . . . . . . . . 3
1.2 Coding process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Bad example for unique decoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Illustration for proof of Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 The q-ary Entropy Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.1 The Hamming and Gilbert-Varshamov (GV) bounds for binary codes . . . . . . . . 74
4.2 An illustration of Gilbert’s greedy algorithm (Algorithm 4.2.1) for the first five iterations. 76
4.3 Construction of a new code in the proof of the Singleton bound. . . . . . . . . . . . 80
4.4 The Hamming, GV and Singleton bound for binary codes. . . . . . . . . . . . . . . . 80
4.5 R vs δ tradeoffs for binary codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1 The communication process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6.2 Binary Symmetric Channel BSCp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 Binary Erasure Channel BECα . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4 The sets D m partition the ambient space {0, 1}n . . . . . . . . . . . . . . . . . . . . . . 118
6.5 The shell S m of inner radius (1 − γ)pn and outer radius (1 + γ)pn. . . . . . . . . . . 119
6.6 Illustration of Proof of Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . 121

7.1 Bad example of unique decoding revisited . . . . . . . . . . . . . . . . . . . . . . . . 133

7.2 Comparing the Johnson Bound with Unique decoding and Singleton bounds . . . 139
7.3 An error pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.4 Illustration of notation used in the proof of Theorem 7.5.1 . . . . . . . . . . . . . . . 145
7.5 An error pattern in the middle of the proof . . . . . . . . . . . . . . . . . . . . . . . . 146

8.1 Bounds on R vs δ for binary codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

10.1 Concatenated code C out ◦C in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

10.2 The Zyablov bound for binary codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

11.1 A bipartite graph G H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

11.2 The ‘triangle’ graph G on the left, and its double cover G ′ (see Definition 11.2.13) in the middle and H ,
11.3 The C in -Coded G-Amplication of C out . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

12.1 A received word in 2-D space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

xi
12.2 The closest polynomial to a received word . . . . . . . . . . . . . . . . . . . . . . . . 229
12.3 The tradeoff between rate R and the fraction of errors that can be corrected by Algorithm 12.2.1.238
12.4 A received word in 2-D space for the second Reed-Solomon . . . . . . . . . . . . . . 239
12.5 An interpolating polynomial Q(X , Y ) for the received word in Figure 12.4. . . . . . 240
12.6 The two polynomials that need to be output are shown in blue. . . . . . . . . . . . . 240
12.7 The tradeoff between rate R and the fraction of errors that can be corrected by Algorithm 12.2.1 and A
12.8 Multiplicity of 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.9 Multiplicity of 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.10Multiplicity of 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.11A received word in 2-D space for the third Reed-Solomon . . . . . . . . . . . . . . . 245
12.12An interpolating polynomial Q(X , Y ) for the received word in Figure 12.11. . . . . . 245
12.13The five polynomials that need to be output are shown in blue. . . . . . . . . . . . . 246

14.1 Encoding and Decoding of Concatenated Codes . . . . . . . . . . . . . . . . . . . . . 280

14.2 All values of θ ∈ [q i , q i +1 ) lead to the same outcome . . . . . . . . . . . . . . . . . . . 287

15.1 Efficiently achieving capacity of BSCp . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

15.2 Error Correction cannot decrease during “folding" . . . . . . . . . . . . . . . . . . . 294

16.1 The 2 × 2 Basic Polarizing Transform. Included in red are the conditional
³ entropies of´the variables, co
16.2 The n × n Basic Polarizing Transform defined as P n (Z) = P n (U, V) = P n (U + V), P n (V) . Acknowledgement
2 2
16.3 Block structure of the Basic Polarizing Transform. Circled are a block at the 2nd level and two 2nd leve

17.1 Encoding for Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

17.2 Folded Reed-Solomon code for m = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
17.3 Folded Reed-Solomon code for general m ≥ 1. . . . . . . . . . . . . . . . . . . . . . . 332
17.4 Error pattern under unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.5 Error pattern under folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
17.6 Performance of Algorithm 17.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
17.7 An agreement in position i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.8 More agreement with a sliding window of size 2. . . . . . . . . . . . . . . . . . . . . . 339
17.9 Performance of Algorithm 17.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
17.10An upper triangular system of linear equations . . . . . . . . . . . . . . . . . . . . . 343

18.1 The recursive construction of C k . The final code C̃ k is also shown. . . . . . . . . . . 362

21.1 The minutiae are unordered and form a set, not a vector. . . . . . . . . . . . . . . . 393

22.1 Pick a subset S (not necessarily contiguous). Then pick a column j that is not present in S. There will
22.2 Construction of the final matrix MC ∗ from MC out and MC in from Example 22.4.3. The rows in MC ∗ that

E.1 Relationship between entropy, joint entropy, conditional entropy, and mutual information for two ran

xii
List of Tables

3.1 Uniform distribution over F22×2 along with values of four random variables. . . . . . 54

8.1 High level summary of results seen so far. An asymptotically good code has R > 0 and δ > 0.160

10.1 Strongly explicit binary codes that we have seen so far. . . . . . . . . . . . . . . . . . 180

15.1 An overview of the results seen so far . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

15.2 Summary of properties of C out and C in . . . . . . . . . . . . . . . . . . . . . . . . . . 291

xiii
xiv
List of Algorithms

1.3.1Error Detector for Parity Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.4.1Naive Maximum Likelihood Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.1Naive Decoder for Hamming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2Decoder for Any Linear Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.3Efficient Decoder for Hamming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1Gilbert’s Greedy Code Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.1 q O(k) time algorithm to compute a code on the GV bound . . . . . . . . . . . . . . . 90
5.1.1Generating Irreducible Polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
12.1.1Welch-Berlekamp Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.2.1The Basic List-Decoder for Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . 236
12.2.2The Second List Decoding Algorithm for Reed-Solomon Codes . . . . . . . . . . . . 241
12.2.3The Third List Decoding Algorithm for Reed-Solomon Codes . . . . . . . . . . . . . 246
13.1.1S IMPLE R EED -M ULLER D ECODER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
13.2.1Majority Logic Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
13.3.1R EED -S OLOMON - BASED D ECODER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
14.1.1Natural Decoder for C out ◦C in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
14.3.1Generalized Minimum Decoder (ver 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 284
14.3.2Generalized Minimum Decoder (ver 2) . . . . . . . . . . . . . . . . . . . . . . . . . . 286
14.3.3Deterministic Generalized Minimum Decoder‘ . . . . . . . . . . . . . . . . . . . . . 287
15.1.1Decoder for efficiently achieving BSCp capacity . . . . . . . . . . . . . . . . . . . . . 291
16.3.1P OLAR C OMPRESSOR(Z, S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
16.3.2Successive Cancellation Decompressor SCD(W, P, S) . . . . . . . . . . . . . . . . . . 303
16.4.1B ASIC P OLAR E NCODER(Z; n, S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
16.4.2B ASIC P OLAR D ECODER : BPD(W; n, p) . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
17.1.1Decoding Folded Reed-Solomon Codes by Unfolding . . . . . . . . . . . . . . . . . . 333
17.2.1The First List Decoding Algorithm for Folded Reed-Solomon Codes . . . . . . . . . 336
17.3.1The Second List Decoding Algorithm for Folded Reed-Solomon Codes . . . . . . . 340
17.3.2The Root Finding Algorithm for Algorithm 17.3.1 . . . . . . . . . . . . . . . . . . . . 347
18.2.1G EN -FLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
18.4.1L INEAR-D ECODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
19.4.1Computing disjoint S and T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
20.4.1Pre-Processing for Data Possession Verification . . . . . . . . . . . . . . . . . . . . . 385
20.4.2Verification for Data Possession Verification . . . . . . . . . . . . . . . . . . . . . . . 386

xv
20.4.3Decompression Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
20.4.4Decompression Algorithm Using List Decoding . . . . . . . . . . . . . . . . . . . . . 388
21.2.1UNLOCK2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
21.3.1LOCK3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
21.3.2UNLOCK2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
22.3.1Decoder for Separable Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
22.3.2Naive Decoder for Disjunct Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
22.5.1Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
22.5.2Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
22.5.3Report Heavy Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
24.2.1R ANDOM t -SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
24.2.2De-randomized R ANDOM t -SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
24.3.1INVERT f . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
∗
24.5.1B bA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
C.2.1Simple Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
C.3.1Sampling algorithm for G AP H AMMING . . . . . . . . . . . . . . . . . . . . . . . . . . 495
C.3.2An average-case algorithm for G AP H AMMING . . . . . . . . . . . . . . . . . . . . . . 496
C.4.1Exponential time algorithm for M AX L INEAR E Q . . . . . . . . . . . . . . . . . . . . . 497
C.4.2Reduction from M AXC UT to M AX L INEAR E Q . . . . . . . . . . . . . . . . . . . . . . . 500
D.7.1R OOT-F IND (Fq , f ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
D.7.2L INEAR-R OOT-F IND (Fq , g ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524

1
2
Chapter 1

The Fundamental Question

1.1 Overview
Communication is a fundamental need of our modern lives. In fact, communication is some-
thing that humans have been doing for a long time. For simplicity, let us restrict ourselves to
English. It is quite remarkable that different people speaking English can be understood pretty
well: even if e.g. the speaker has an accent. This is because English has some built-in redun-
dancy, which allows for “errors" to be tolerated. We will pick an example from one of the au-
thor’s experiences conversing with his two-year-old son, Akash. When Akash started to speak
his own version of English, which we will dub “Akash English," we got examples such as the one
illustrated below:

Figure 1.1: Decoding for Akash English, one gets “I need little little (trail)mix."

3
With some practice Akash’s parents were able to “decode" what Akash really meant. In fact,
Akash could communicate even if he did not say an entire word properly and gobbled up part(s)
of word(s).
The above example shows that having redundancy in a language allows for communication
even in the presence of (small amounts of) differences and errors. Of course, in our modern
digital world, all kinds of entities communicate (and most of the entities do not communicate
in English, or any natural language for that matter). Errors are also present in the digital world,
so these digital communications also use redundancy.
Error-correcting codes (henceforth, just codes) are clever ways of representing data so that
one can recover the original information even if parts of it are corrupted. The basic idea is to
judiciously introduce redundancy so that the original information can be recovered even when
parts of the (redundant) data have been corrupted.
For example, when packets are transmitted over the Internet, some of the packets get cor-
rupted or dropped. Packet drops are resolved by the TCP layer by a combination of sequence
numbers and ACKs. To deal with data corruption, the TCP/IP protocol use a form of error cor-
rection called CRC Checksum [124]. From a theoretical point of view, the checksum is a terrible
code since it does not have good error correction properties (for that matter so is English). How-
ever, on the Internet, the current dominant mode of operation is to detect errors and if errors
have occurred, then ask for retransmission. This is the reason why the use of checksum has
been hugely successful in the Internet. However, there are other communication applications
where re-transmission is not an option. Codes are used when transmitting data over the tele-
phone line or via cell phones. They are also used in deep space communication and in satellite
broadcast (for example, TV signals are transmitted via satellite). Indeed, asking the Mars Rover
to re-send an image just because it got corrupted during transmission is not an option–this is
the reason that for such applications, the codes used have always been very sophisticated.
Codes also have applications in areas not directly related to communication. In particu-
lar, in the applications above, we want to communicate over space. Codes can also be used to
communicate over time. For example, codes are used heavily in data storage. CDs and DVDs
work fine even in presence of scratches precisely because they use codes. Codes are used in Re-
dundant Array of Inexpensive Disks (RAID) [29] and error correcting memory [28]. Sometimes,
in the Blue Screen of Death displayed by Microsoft Windows family of operating systems, you
might see a line saying something along the lines of “parity check failed"–this happens when
the code used in the error-correcting memory cannot recover from error(s). Also, certain con-
sumers of memory, e.g. banks, do not want to suffer from even one bit flipping (this e.g. could
mean someone’s bank balance either got halved or doubled–neither of which are welcome1 ).
Codes are also deployed in other applications such as paper bar codes; for example, the bar
code used by UPS called MaxiCode [27]. Unlike the Internet example, in all of these applica-
tions, there is no scope for “re-transmission."
In this book, we will mainly think of codes in the communication scenario. In this frame-
work, there is a sender who wants to send (say) k message symbols over a noisy channel. The

1
This is a bit tongue-in-cheek: in real life banks have more mechanisms to prevent one-bit flip from wreaking
havoc.

4
sender first encodes the k message symbols into n symbols (called a codeword) and then sends
it over the channel. The receiver gets a received word consisting of n symbols. The receiver then
tries to decode and recover the original k message symbols. Thus, encoding is the process of
adding redundancy and decoding is the process of removing errors.
Unless mentioned otherwise, in this book we will make the following assumption:

The sender and the receiver only communicate via the channel.a In other words, other than
some setup information about the code, the sender and the receiver do not have any other
information exchange (other than of course what was transmitted over the channel). In
particular, no message is more likely to be transmitted over another.
a
The scenario where the sender and receiver have a “side-channel" is an interesting topic that has been
studied but is outside the scope of this book.

The fundamental question that will occupy our attention for almost the entire book is the
tradeoff between the amount of redundancy used and the number of errors that can be cor-
rected by a code. In particular, we would like to understand:

Question 1.1.1. How much redundancy do we need to correct a given amount of errors? (We
would like to correct as many errors as possible with as little redundancy as possible.)

Note that maximizing error correction and minimizing redundancy are contradictory goals:
a code with higher redundancy should be able to tolerate a greater number of errors. By the end
of this chapter, we will see a formalization of this question.
Once we determine the optimal tradeoff, we will be interested in achieving this optimal
tradeoff with codes that come equipped with efficient encoding and decoding. (A DVD player
that tells its consumer that it will recover from a scratch on a DVD by tomorrow is not exactly
going to be a best-seller.) In this book, we will primarily define efficient algorithms to be ones
that run in polynomial time.2

1.2 Some Definitions and Codes

To formalize Question 1.1.1, we begin with the definition of a code.

Definition 1.2.1 (Code). A code of block length n over an alphabet Σ is a subset of Σn . Typically,
we will use q to denote the alphabet size |Σ|.3

2
Readers unfamiliar with runtime analysis are referred to Appendix C. Coming back to the claim on efficiency–
we are not claiming that this is the correct notion of efficiency in practice. However, we believe that it is a good
definition as the “first cut"– quadratic or cubic time algorithms are definitely more desirable than exponential time
algorithms: see Section C.4 for more on this.
3
Note that q need not be a constant and can depend on n: we’ll see codes in this book where this is true.

5
Remark 1.2.2. We note that the ambient space Σn can be viewed as a set of sequences, vectors or
functions. In other words, we can think of a vector (v 1 , . . . , v n ) ∈ Σn as just the sequence v 1 , . . . , v n
(in order) or a vector tuple (v 1 , . . . , v n ) or as the function f : [n] → Σ such that f (i ) = v i . Sequences
assume least structure on Σ and hence are most generic. Vectors work well when Σ has some struc-
ture (and in particular is what is known as a field, which we will see next chapter). Functional
representation will be convenient when the set of coordinates has structure (e.g., [n] may come
from a finite field of size n). For now, however, the exact representation does not matter and the
reader can work with representation as sequences.

We will also frequently use the following alternate way of looking at a code. Given a code
C ⊆ Σn , with |C | = M , we will think of C as a mapping of the following form:

C : [M ] → Σn . (1.1)

In the above equation (1.1), we have used the notation [M ] for any integer M ≥ 1 to denote the
set {1, 2, . . . , M }.
We will also need the notion of dimension of a code.

Definition 1.2.3 (Dimension of a code). Given a code C ⊆ Σn , its dimension is given by

def
k = logq |C |.

Let us begin by looking at two specific codes. Both codes are defined over Σ = {0, 1} (also
known as binary codes). In both cases |C | = 24 and we will think of each of the 16 messages as a
4 bit vector.
We first look at the so-called parity code, which we will denote by C ⊕ . Given a message
(x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 , its corresponding codeword is given by

C ⊕ (x 1 , x 2 , x 3 , x 4 ) = (x 1 , x 2 , x 3 , x 4 , x 1 ⊕ x 2 ⊕ x 3 ⊕ x 4 ), (1.2)

where the ⊕ denotes the XOR (also known as the EXOR or Exclusive-OR) operator. In other
words, the parity code appends the parity of the message bits (or takes the remainder of the
sum of the message bits when divided by 2) at the end of the message. For example, the message
(1, 0, 0, 1) will have a 0 appended at the end while (1, 0, 0, 0) will have a 1 appended at the end.
Note that such a code uses the minimum amount of non-zero redundancy.
The second code we will look at is the so-called repetition code. This is a very natural code
(and perhaps the first code one might think of). The idea is to repeat every message bit a fixed
number of times. For example, we repeat each of the 4 message bits 3 times and we use C 3,r ep to
denote this code. Given a message (x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 , its corresponding codeword is given
by
C 3,r ep (x 1 , x 2 , x 3 , x 4 ) = (x 1 , x 1 , x 1 , x 2 , x 2 , x 2 , x 3 , x 3 , x 3 , x 4 , x 4 , x 4 ). (1.3)
Let us now try to look at the tradeoff between the amount of redundancy and the number of
errors each of these codes can correct. Even before we begin to answer the question, we need
to define how we are going to measure the amount of redundancy. One natural way to define

6
redundancy for a code with dimension k and block length n is by their difference n − k. By
this definition, the parity code uses the least amount of redundancy. However, one “pitfall" of
such a definition is that it does not distinguish between a code with k = 100 and n = 102 and
another code with dimension and block length 2 and 4, respectively. The first code uses 0.02 bits
of redundancy per message bit while the second code uses 1 bit of redundancy per message
bit. Thus, in the relative sense, the latter code is using more redundancy. This motivates the
following notion of measuring redundancy.

Definition 1.2.4 (Rate of a code). The rate of a code with dimension k and block length n is given
by
def k
R = .
n
Note that the higher the rate, the lesser the amount of redundancy in the code. Thus, when
cosntructing or analyzing codes, we will be interested in lower bounding the rate of a code.
(Ocassionally we will also be sloppy and say that a code “has rate R” when we really mean it
“has rate at least R”.) Also note that as k ≤ n,4

R ≤ 1.

In other words, the rate of a code is the average amount of real information in each of the n
symbols transmitted over the channel. So, in some sense, rate captures the complement of
redundancy. However, for historical reasons, we will deal with the rate R (instead of the more
natural 1−R) as our notion of redundancy. Given the above definition, C ⊕ and C 3,r ep have rates
of 45 and 13 . As expected, the parity code has a higher rate than the repetition code.
We have formalized the notion of redundancy as the rate of a code as well as other param-
eters of a code. However, to formalize Question 1.1.1, we still need to formally define what it
means to correct errors. We do so next.

1.3 Error Correction

Before we formally define error correction, we will first formally define the notion of encoding.

Definition 1.3.1 (Encoding function). Let C ⊆ Σn . An equivalent description of the code C is an

injective mapping E : [|C |] → Σn called the encoding function.

Next we move to error correction. Informally, we can correct a received word if we can re-
cover the transmitted codeword (or equivalently the corresponding message). This “reverse"
process is called decoding.

Definition 1.3.2 (Decoding function). Let C ⊆ Σn be a code. A mapping D : Σn → [|C |] is called a

decoding function for C .
4
Further, in this book, we will always consider the case k > 0 and n < ∞ and hence, we can also assume that
R > 0.

7
The definition of a decoding function by itself does not give anything interesting. What we
really need from a decoding function is for the function to recover the transmitted message. To
understand this notion, we first need to understand the nature of errors that we aim to tackle. In
particular, if a transmitter transmits u ∈ Σn and the receiver receives v ∈ Σn , how do we quantify
the amount of “error” that has happened during this transmission? While multiple notions are
possible, the most central one, and the one we will focus on for most of this book, is based on
“Hamming distance,” a notion of distance that captures how close are two given sequences u
and v.

Definition 1.3.3 (Hamming distance). Given two vectors u, v ∈ Σn the Hamming distance be-
tween u and v, denoted by ∆(u, v), is the number of positions in which u and v differ. We also
define the relative Hamming distance, denoted δ(u, v), to be the quantity δ(u, v) = n1 ∆(u, v).

Note that the relative Hamming distance normalizes the distance so that δ(u, v) always lies
in the interval [0, 1] (for every n, Σ and strings u, v ∈ Σn ). This normalization will be useful when
we study the asymptotic behavior of encoding and decoding functions, i.e., as n → ∞. For now,
though we will focus mostly on the (non-relative) Hamming distance.
The Hamming distance is a distance in a very formal mathematical sense: see Exercise 1.5.
Note that the definition of Hamming distance depends only on the number of differences and
not the nature of the difference. For example, consider the vectors u = 00000 and v = 10001.
One can see that their Hamming distance is ∆(u, v) = 2. Now consider the vector w = 01010.
Note that even though v 6= w, we again have a Hamming distance ∆(u, w) = 2.
To return to the quantification of errors, from now on we will say that if u is transmitted and
v is received then ∆(u, v) errors occurred during transmission. This allows us to quantify the
performance of an encoding/decoding function, or equivalently the underlying code as we do
next.

Definition 1.3.4 (t -Error Channel). An n-symbol t -Error Channel over the alphabet Σ is a func-
tion Ch : Σn → Σn that satisfies ∆(v, Ch(v)) ≤ t for every v ∈ Σn .

Definition 1.3.5 (Error Correcting Code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C
is said to be a t -error-correcting code if there exists a decoding function D such that for every
message m ∈ [|C |] every t -error channel Ch we have D (Ch(C (m))) = m.

Thus, a t -error-correcting code is one where there is a decoding function that corrects any
pattern of t errors. For example, consider the case when the codeword (0, 0, 0, 0) is transmitted.
Then a 1-error-correcting code (over the alphabet {0, 1}) should be able to decode from any of
the following received words:

(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1).

Figure 1.2 illustrates how the definitions we have examined so far interact.
We will also very briefly look at a weaker form of error recovery called error detection.

8
Channel Ch
m 7→ C (m) v = Ch (C (m)) 7→ m

Encoding function Decoding function

Figure 1.2: Coding process

Definition 1.3.6 (Error detection code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C is said
to be t -error-detecting code if there exists a detecting procedure D such that for every message m
and every received vector v ∈ Σn satisfying ∆(C (m), v) ≤ t , it holds that D outputs a 1 if v = C (m)
and 0 otherwise. In other words
(
1 if v = C (m)
D(v) = .
0 otherwise

Thus, a t -error-detecting code is one where if the transmission has at least one error and at
most t errors, then the decoding function detects the error (by outputting 0). Note that a t -error
correcting code is also a t -error detecting code (but not necessarily the other way round): see
Exercise 1.1. Although error detection might seem like a weak error recovery model, it is useful
in settings where the receiver can ask the sender to re-send the message. For example, error
detection is used quite heavily in the Internet.
Finally, we also consider a more benign model of errors referred to as “erasures,” where a
symbol is merely (and explicitly) omitted from the transmission (as opposed to being replaced
by some other symbol). More specifically, if a symbols is erased, then it is replaced by a special
symbol “?” that is not a member of the alphabet Σ. For example, if (0, 0, 0, 0) was transmitted
and the second symbols was erased by the channel, then the vector (0, ?, 0, 0) will be received.

Definition 1.3.7 (t -Erasure Channel). An n-symbol t -Erasure Channel over the alphabet Σ is a
function Ch : Σn → (Σ ∪ {?})n that satisfies ∆(v, Ch(v)) ≤ t for every v ∈ Σn (where both arguments
to ∆(·, ·) are viewed as elements of (Σ ∪ {?})n ) and for every i ∈ [n] such that vi 6= Ch(v)i we have
Ch(v)i =?.

A coordinate i such that Ch(v)i =? is called an erasure. We may now define erasure correct-
ing codes analogously to error-correcting codes.

Definition 1.3.8 (Erasure Correcting Code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C
is said to be a t -erasure-correcting code if there exists a decoding function D such that for every
message m ∈ [|C |] and for every t -erasure channel Ch we have D (Ch(C (m))) = m.

With the above definitions in place, we are now ready to look at the error correcting capa-
bilities of the codes we looked at in the previous section.

9
1.3.1 Error-Correcting Capabilities of Parity and Repetition Codes
In Section 1.2, we looked at examples of parity code and repetition code with the following
properties:

C ⊕ : q = 2, k = 4, n = 5, R = 4/5.
C 3,r ep : q = 2, k = 4, n = 12, R = 1/3.
We will start with the repetition code. To study its error-correcting capabilities, we will con-
sider the following natural decoding function. Given a received word y ∈ {0, 1}12 (where re-
call the transmitted codeword is for the form (x 1 , x 1 , x 1 , x 2 , x 2 , x 2 , x 3 , x 3 , x 3 , x 4 , x 4 , x 4 ) for some
(x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 ), divide it up into four consecutive blocks (y 1 , y 2 , y 3 , y 4 ) where every block
consists of three bits. Then, for every block y i (1 ≤ i ≤ 4), output the majority bit as the message
bit. We claim this decoding function can correct any error pattern with at most 1 error (see Ex-
ercise 1.2.) For example, if a block of 010 is received, since there are two 0’s we know the original
message bit was 0. In other words, we have argued the following error correcting capability of
C 3,r ep :

Proposition 1.3.9. C 3,r ep is a 1-error correcting code.

However, it is not too hard to see that C 3,r ep cannot correct two errors. For example, if both
of the errors happen in the same block and a block in the received word is 010, then the original
block in the codeword could have been either 111 or 000. Therefore in this case, no decoder can
successfully recover the transmitted message.5
Thus, we have pin-pointed the error-correcting capabilities of the C 3,r ep code: it can correct
one error, but not two or more. However, note that the argument assumed that the error posi-
tions can be located arbitrarily. In other words, we are assuming that the channel noise behaves
arbitrarily (subject to a bound on the total number of errors). However, we can model the noise
differently. We now briefly digress to look at this issue in slightly more detail.

Digression: Channel Noise. As was mentioned above, until now we have been assuming the
following noise model, which was first studied by Hamming:

Any error pattern can occur during transmission as long as the total number of er-
rors is bounded. Note that this means that the location as well as the nature6 of the
errors is arbitrary.

We will frequently refer to Hamming’s model as the Adversarial Noise Model. It is important
to note that the atomic unit of error is a symbol from the alphabet. For example, if the error
pattern7 is (1, 0, 1, 0, 0, 0) and we consider the alphabet to be {0, 1}, then the pattern has two
5
Recall we are assuming that the decoder has no side information about the transmitted message.
6
For binary codes, there is only one kind of error: a bit flip. However, for codes over a larger alphabet, say {0, 1, 2},
0 being converted to a 1 and 0 being converted into a 2 are both errors, but are different kinds of errors.
7
If v is transmitted and Ch(v) is received then the ‘difference’ between Ch(v) and v is the error pattern. For
binary alphabet the difference is the XOR operator.

10
errors (since the first and the third locations in the vector have a non-zero value, i.e. value of
1). However, if our alphabet is {0, 1}3 (i.e. we think of the vector above as ((1, 0, 1), (0, 0, 0)), with
(0, 0, 0) corresponding to the zero element in {0, 1}3 ), then the pattern has only one error. Thus,
by increasing the alphabet size we can also change the adversarial noise model. As the book
progresses, we will see how error correction over a larger alphabet is easier than error correction
over a smaller alphabet.
However, the above is not the only way to model noise. For example, we could also have
following error model:
No more than 1 error can happen in any contiguous three-bit block.
First note that, for the error model above, no more than four errors can occur when a codeword
in C 3,r ep is transmitted. (Recall that in C 3,r ep , each of the four bits is repeated three times.) Sec-
ond, note that the decoding function that takes the majority vote of each block can successfully
recover the transmitted codeword for any error pattern, while in the worst-case noise model it
could only correct at most one error. This channel model is admittedly contrived, but it illus-
trates the point that the error-correcting capabilities of a code (and a decoding function) are
crucially dependent on the noise model.
A popular alternate noise model is to model the channel as a stochastic process. As a con-
crete example, let us briefly mention the binary symmetric channel with crossover probability
0 ≤ p ≤ 1, denoted by BSCp , which was first studied by Shannon. In this model, when a (binary)
codeword is transferred through the channel, every bit flips independently with probability p.
Note that the two noise models proposed by Hamming and Shannon are in some sense two
extremes: Hamming’s model assumes no knowledge about the channel (except that a bound
on the total number of errors is known8 while Shannon’s noise model assumes complete knowl-
edge about how noise is produced. In this book, we will consider only these two extreme noise
models. In real life, the situation often is somewhere in between.
For real life applications, modeling the noise model correctly is an extremely important
task, as we can tailor our codes to the noise model at hand. However, in this book we will not
study this aspect of designing codes at all, and will instead mostly consider the worst-case noise
model. Informally, if one can communicate over the worst-case noise model, then one could
use the same code to communicate over nearly every other noise model with the same amount
of noise.

We now return to C ⊕ and examine its error-correcting capabilities in the worst-case noise
model. We claim that C ⊕ cannot correct even one error. Suppose y = 10000 is the received word.
Then we know that an error has occurred, but we do not know which bit was flipped. This is
because the two codewords u = 00000 and v = 10001 differ from the received word y in exactly
one bit. As we are assuming that the receiver has no side information about the transmitted
codeword, no decoder can know what the transmitted codeword was.
Thus, from an error-correction point of view, C ⊕ is a terrible code (as it cannot correct even
1 error). However, we will now see that C ⊕ can detect one error. Consider Algorithm 1.3.1. Note
8
A bound on the total number of errors is necessary; otherwise, error correction would be impossible: see
Exercise 1.3.

11
Algorithm 1.3.1 Error Detector for Parity Code
I NPUT: Received word y = (y 1 , y 2 , y 3 , y 4 , y 5 )
O UTPUT: 1 if y ∈ C ⊕ and 0 otherwise

1: b ← y 1 ⊕ y 2 ⊕ y 3 ⊕ y 4 ⊕ y 5
2: RETURN 1 ⊕ b ⊲ If there is no error, then b = 0 and hence we need to "flip" the bit for the
answer

that when no error has occurred during transmission, y i = x i for 1 ≤ i ≤ 4 and y 5 = x 1 ⊕ x 2 ⊕ x 3 ⊕

x 4 , in which case b = 0 and we output 1 ⊕ 0 = 1 as required. If there is a single error then either
y i = x i ⊕ 1 (for exactly one 1 ≤ i ≤ 4) or y 5 = x 1 ⊕ x 2 ⊕ x 3 ⊕ x 4 ⊕ 1. It can be check that in this case,
b = 1. In fact, one can extend this argument to obtain the following result (see Exercise 1.4).

Proposition 1.3.10. The parity code C ⊕ can detect an odd number of errors.

Let us now revisit the example that showed that one cannot correct one error using C ⊕ .
Recall, we considered two codewords in C ⊕ , u = 00000 and v = 10001 (which are codewords
corresponding to messages 0000 and 1000, respectively). Now consider the scenarios in which
u and v are each transmitted and a single error occurs resulting in the received word r = 10000.
Thus, given the received word r and the fact that at most one error can occur, the decoder has
no way of knowing whether the original transmitted codeword was u or v. Looking back at the
example, it is clear that the decoder is “confused" because the two codewords u and v do not
differ in many positions. This notion is formalized in the next section.

1.4 Distance of a Code

We now turn to a new parameter associated with a code that we call the minimum distance of
a code. As we will see later, minimum distance is connected to the other parameters, including
the error-correction and error-detection capacity of the code. However, due to the cleanliness
of the definition, it will often be the first of the parameters we will explore when studying a new
error-correcting code.

Definition 1.4.1 (Minimum distance). Let C ⊆ Σn . The minimum distance (or just distance) of
C , denoted ∆(C ), is defined to be

∆(C ) = min ∆(c1 , c2 ).

c1 6=c2 ∈C

We also define the relative minimum distance of C to be δ(C ), is defined to be

δ(C ) = min δ(c1 , c2 ).

c1 6=c2 ∈C

In other words, ∆(C ) is the minimum distance between two distinct codewords in C . We
note that the repetition code C 3,r ep has distance 3 (recall (1.3)). Indeed, any two distinct mes-
sages will differ in at least one of the message bits. After encoding, the difference in one message

12
bit will translate into a difference of three bits in the corresponding codewords. For example

C 3,r ep (0, 0, 0, 0) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) and C 3,r ep (1, 0, 0, 0) = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0).

We now claim that the distance of C ⊕ is 2. This is a consequence of the following observa-
tions. If two messages m1 and m2 differ in at least two places then ∆(C ⊕ (m1 ),C ⊕ (m2 )) ≥ 2 (even
if we just ignored the parity bits). If two messages differ in exactly one place then the parity bits
in the corresponding codewords are different, which implies a Hamming distance of 2 between
the codewords. For example,

C ⊕ (1, 0, 0, 0) = (1, 0, 0, 0, 1) and C ⊕ (1, 0, 0, 1) = (1, 0, 0, 1, 0).

Thus, C ⊕ has a smaller distance than C 3,r ep and can correct less number of errors than C 3,r ep .
This suggests that a larger distance implies greater error-correcting capabilities. The next result
formalizes this intuition. As we will see, minimum distance exactly captures both the ability to
recover from errors as also the notion of erasures (Definition 1.3.8).

Proposition 1.4.2. Given a code C , the following are equivalent:

1. C has minimum distance d ≥ 2,

2. If d is odd, C can correct (d − 1)/2 errors.

3. C can detect d − 1 errors.

4. C can correct d − 1 erasures.

Remark 1.4.3. Property (2) above for even d is slightly different. In this case, one can correct up
to d2 − 1 errors but cannot correct d2 errors. (See Exercise 1.6.)

Before we prove Proposition 1.4.2, let us apply it to the codes C ⊕ and C 3,r ep which have
distances of 2 and 3, respectively. Proposition 1.4.2 implies the following facts that we have
already proved:

• C 3,r ep can correct 1 error (Proposition 1.3.9).

• C ⊕ can detect 1 error but cannot correct 1 error (Proposition 1.3.10).

The proof of Proposition 1.4.2 will need the following decoding function. Maximum likeli-
hood decoding (MLD) is a well-studied decoding method for error correcting codes. The MLD
function outputs the codeword c ∈ C , which is as close as possible to the received word in
Hamming distance (with ties broken arbitrarily). More formally, the MLD function denoted
by D M LD : Σn → C is defined as follows. For every y ∈ Σn ,

D M LD (y) = arg min ∆(c, y).

c∈C

Algorithm 1.4.1 is a naive implementation of the MLD.

13
Algorithm 1.4.1 Naive Maximum Likelihood Decoder
I NPUT: Received word y ∈ Σn
O UTPUT: D M LD (y)

1: Pick an arbitrary c ∈ C and assign z ← c

2: FOR every c′ ∈ C such that c 6= c′ DO
3: ∆(c′ , y) < ∆(z, y) THEN
IF
4: z ← c′
5: RETURN z

Proof of Proposition 1.4.2 We will complete the proof in two steps. First, we will show that if
property 1 is satisfied then so are properties 2, 3 and 4 (we prove this via three implications (1)
implies (2), (1) implies (3) and (1) implies (4)). Then we show that if property 1 is not satisfied
then none of the properties 2, 3 or 4 hold (again via the corresponding three implications).

Item 1. implies 2. Assume C has distance d . We first prove 2 (for this case assume that d =
2t + 1). We now need to show that there exists a decoding function such that for all error pat-
terns with at most t errors it always outputs the transmitted message. We claim that the MLD
function has this property. Assume this is not so and let c1 be the transmitted codeword and let
y be the received word. Note that
∆(y, c1 ) ≤ t . (1.4)
As we have assumed that MLD does not work, D M LD (y) = c2 6= c1 . Note that by the definition of
MLD,
∆(y, c2 ) ≤ ∆(y, c1 ). (1.5)
Consider the following set of inequalities:

∆(c1 , c2 ) ≤ ∆(c2 , y) + ∆(c1 , y) (1.6)

≤ 2∆(c1 , y) (1.7)
≤ 2t (1.8)
= d − 1, (1.9)

where (1.6) follows from the triangle inequality (see Exercise 1.5), (1.7) follows from (1.5) and
(1.8) follows from (1.4). (1.9) implies that the distance of C is at most d − 1, which is a contra-
diction.

Item 1. implies 3. We now show that property 3 holds. That is, we need to describe an algo-
rithm that can successfully detect whether errors have occurred during transmission (as long
as the total number of errors is bounded by d − 1). Consider the following error detection algo-
rithm: check if the received word y = c for some c ∈ C (this can be done via an exhaustive check).
If no errors occurred during transmission, y = c1 , where c1 was the transmitted codeword and
the algorithm above will accept (as it should). On the other hand if 1 ≤ ∆(y, c1 ) ≤ d − 1, then by
the fact that the distance of C is d , y 6∈ C and hence the algorithm rejects, as required.

14
Item 1. implies 4. Finally, we prove that property 4 holds. Let y ∈ (Σ∪{?})n be the received word.
First we claim that there is a unique c = (c 1 , . . . , c n ) ∈ C that agrees with y (i.e. y i = c i for every
i such that y i 6= ?). Indeed, for the sake of contradiction, assume that this is not true, i.e. there
exists two distinct codewords c1 , c2 ∈ C such that both c1 and c2 agree with y in the unerased
positions. Note that this implies that c1 and c2 agree in the positions i such that y i 6= ?. Thus,
∆(c1 , c2 ) ≤ |{i |y i = ?}| ≤ d − 1, which contradicts the assumption that C has distance d .
Given the uniqueness of the codeword c ∈ C that agrees with y in the unerased position,
an algorithm to find c is as follows: go through all the codewords in C and output the desired
codeword.

Item ¬1. implies ¬2. For the other direction of the proof, assume that property 1 does not
hold, that is, C has distance d − 1. We now show that property 2 cannot hold: i.e., for ev-
ery decoding function there exists a transmitted codeword c1 and a received word y (where
∆(y, c1 ) ≤ (d − 1)/2) such that the decoding function cannot output c1 . Let c1 6= c2 ∈ C be code-
words such that ∆(c1 , c2 ) = d − 1 (such a pair exists as C has distance d − 1). Now consider a
vector y such that ∆(y, c1 ) = ∆(y, c2 ) = (d − 1)/2. Such a y exists as d is odd and by the choice of
c1 and c2 . Figure 1.3 gives an illustration of such a y (matching color implies that the vectors
agree on those positions).

d −1 d −1
n −d +1 2 2

Figure 1.3: Bad example for unique decoding.

Now, since y could have been generated if either of c1 or c2 were the transmitted codeword,
no decoding function can work in this case.9

Item ¬1. implies ¬3. For the remainder of the proof, assume that the transmitted word is c1
and there exists another codeword c2 such that ∆(c2 , c1 ) = d − 1. To see why property 3 is not
true, let y = c2 . In this case, either the error detecting algorithm detects no error, or it declares
an error when c2 is the transmitted codeword and no error takes place during transmission.
9
Note that this argument is just a generalization of the argument that C ⊕ cannot correct 1 error.

15
Item ¬1. implies ¬4. We finally argue that property 4 does not hold. Let y be the received
word in which the positions that are erased are exactly those where c1 and c2 differ. Thus, given
y both c1 and c2 could have been the transmitted codeword, and no algorithm for correcting (at
most d − 1) erasures can work in this case. ■
Proposition 1.4.2 implies that Question 1.1.1 can be reframed as

Question 1.4.1. What is the largest rate R that a code with distance d can have?

We have seen that the repetition code C 3,r ep has distance 3 and rate 1/3. A natural follow-up
question (which is a special case of Question 1.4.1) is to ask

Question 1.4.2. Can we have a code with distance 3 and rate R > 31 ?

1.5 Hamming Code

With the above question in mind, let us consider the so-called Hamming code, which we will
denote by C H . Given a message (x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 , its corresponding codeword is given by

C H (x 1 , x 2 , x 3 , x 4 ) = (x 1 , x 2 , x 3 , x 4 , x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕ x 4 ).

It can be verified that this code has the following parameters:

C H : q = 2, k = 4, n = 7, R = 4/7.

We will show shortly that C H has a distance of 3. We would like to point out that we could
have picked the three parities differently. The reason we mention the three particular parities
above is due to historical reasons. We leave it as an exercise to define an alternate set of parities
such that the resulting code still has a distance of 3: see Exercise 1.9.
Before we move on to determining the distance of C H , we will need another definition.

Definition 1.5.1 (Hamming Weight). Let q ≥ 2. Given any vector v ∈ {0, 1, 2, . . . , q − 1}n , its Ham-
ming weight, denoted by w t (v) is the number of non-zero symbols in v.

For example, if v = 01203400, then w t (v) = 4.

We now look at the distance of C H .

Proposition 1.5.2. C H has a distance of 3.

16
Proof. We will prove the claimed distance by using two properties of C H :

min w t (c) = 3, (1.10)

c∈C H ,c6=0

and
min w t (c) = min ∆(c1 , c2 ) (1.11)
c∈C H ,c6=0 c1 6=c2 ∈C H

The proof of (1.10) follows from a case analysis on the Hamming weight of the message bits. Let
us use x = (x 1 , x 2 , x 3 , x 4 ) to denote the message vector.

• Case 0: If w t (x) = 0, then C H (x) = 0, which means we do not have to consider this code-
word.

• Case 1: If w t (x) = 1 then at least two parity check bits in (x 2 ⊕x 3 ⊕x 4 , x 1 ⊕x 3 ⊕x 4 , x 1 ⊕x 2 ⊕x 4 )

are 1 (see Exercise 1.10). So in this case, w t (C H (x)) ≥ 3.

• Case 2: If w t (x) = 2 then at least one parity check bit in (x 2 ⊕x 3 ⊕x 4 , x 1 ⊕x 3 ⊕x 4 , x 1 ⊕x 2 ⊕x 4 )

is 1 (see Exercise 1.11). So in this case, w t (C H (x)) ≥ 3.

• Case 3: If w t (x) ≥ 3 then those message bits themselves imply that w t (C H (x)) ≥ 3.

Thus, we can conclude that min w t (c) ≥ 3. Further, note that w t (C H (1, 0, 0, 0)) = 3, which
c∈C H ,c6=0
implies that min w t (c) ≤ 3. This along with the lower bound that we just obtained proves
c∈C H ,c6=0
(1.10).
We now turn to the proof of (1.11). For the rest of the proof, let x = (x 1 , x 2 , x 3 , x 4 ) and y =
(y 1 , y 2 , y 3 , y 4 ) denote the two distinct messages. Using associativity and commutativity of the ⊕
operator, we obtain that
C H (x) +C H (y) = C H (x + y),
where the “+" operator is just the bit-wise ⊕ of the operand vectors10 . Further, it can be verified
that for two vectors u, v ∈ {0, 1}n , we have:

∆(u, v) = w t (u + v)

(see Exercise 1.12). Thus, we have

min ∆(C H (x),C H (y)) = min w t (C H (x + y))

x6=y∈{0,1}4 x6=y∈{0,1}4

= min w t (C H (x)),
x6=0∈{0,1}4

where the second equality follows from the observation that {x+y|x 6= y ∈ {0, 1}n } = {x ∈ {0, 1}n |x 6=
0}. Recall that w t (C H (x)) = 0 if and only if x = 0 and this completes the proof of (1.11). Combin-
ing (1.10) and (1.11), we conclude that C H has a distance of 3.
10
E.g. (0, 1, 1, 0) + (1, 1, 1, 0) = (1, 0, 0, 0).

17
The second part of the proof could also be shown in the following manner. It can be verified
that the Hamming code is the set {x ·G H |x ∈ {0, 1}4 }, where G H is the following matrix (where we
think x as a row vector).11  
1 0 0 0 0 1 1
 
0 1 0 0 1 0 1
GH =  .
0 0 1 0 1 1 0
0 0 0 1 1 1 1
For example, the first column in G H gives the first codeword bit of x 1 and the fifth column of
G H gives the codeword bit x 2 ⊕ x 3 ⊕ x 4 .
In fact, any binary code (of dimension k and block length n) that is generated12 by a k × n
matrix is called a binary linear code. (Both C ⊕ and C 3,r ep are binary linear codes: see Exer-
cise 1.13.) This implies the following simple fact.

Lemma 1.5.3. For any binary linear code C and any two messages x and y, C (x)+C (y) = C (x+y).

Proof. For any binary linear code, we have a generator matrix G. The following sequence of
equalities (which follow from the distributivity and associativity properties of the Boolean EXOR
and AND operators) proves the lemma.

C (x) +C (y) = x · G + y · G
= (x + y) · G
= C (x + y)

We stress that in the lemma above, x and y need not be distinct. Note that due to the fact that
b ⊕b = 0 for every b ∈ {0, 1}, x+x = 0, which along with the lemma above implies that C (0) = 0.13
We can infer the following result from the above lemma and the arguments used to prove (1.11)
in the proof of Proposition 1.5.2.

Proposition 1.5.4. For any binary linear code, its minimum distance is equal to the minimum
Hamming weight of any non-zero codeword.

Thus, we have seen that C H has distance d = 3 and rate R = 47 while C 3,r ep has distance d = 3
and rate R = 31 . Thus, the Hamming code is provably better than the repetition code (in terms
of the tradeoff between rate and distance) and thus, answers Question 1.4.2 in the affirmative.
The next natural question is

11
Indeed (x 1 , x 2 , x 3 , x 4 ) · G H = (x 1 , x 2 , x 3 , x 4 , x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕ x 4 ), as desired.
12
That is, C = {x · G|x ∈ {0, 1}k }, where addition is the ⊕ operation and multiplication is the AND operation.
13
This of course should not be surprising as for any matrix G, we have 0 · G = 0.

18
Question 1.5.1. Can we have a distance 3 code with a rate higher than that of C H ?

We will address this question in the next section.

1.6 Hamming Bound

Now we switch gears to present our first tradeoff between redundancy (in the form of the di-
mension of a code) and its error-correction capability (in the form of its distance). In particular,
we will first prove a special case of the so-called Hamming bound for a distance of 3.
We begin with another definition.
Definition 1.6.1 (Hamming Ball). For any vector x ∈ [q]n ,

B (x, e) = {y ∈ [q]n |∆(x, y) ≤ e}.

In other words, a Hamming ball of radius e, centered at x, contains all vectors within Ham-
ming distance at most e of x.
Next, we prove an upper bound on the dimension of every code with distance 3.
Theorem 1.6.2 (Hamming bound for d = 3). Every binary code with block length n, dimension
k, distance d = 3 satisfies
k ≤ n − log2 (n + 1).
Proof. Given any two codewords, c1 6= c2 ∈ C , the following is true (as C has distance14 3):

B (c1 , 1) ∩ B (c2 , 1) = ;. (1.12)

See Figure 1.4 for an illustration.

Note that for all x ∈ {0, 1}n (see Exercise 1.16),

|B (x, 1)| = n + 1. (1.13)

Now consider the union of all Hamming balls centered around some codeword; their union is
a subset of {0, 1}n . In other words, ¯ ¯
¯[ ¯
¯ B (c, 1)¯ ≤ 2n . (1.14)
¯ ¯
c∈C
As (1.12) holds for every pair of distinct codewords,
¯ ¯
¯[ ¯ X
¯ B (c, 1)¯ = |B (c, 1)|
¯ ¯
c∈C c∈C
X
= (n + 1) (1.15)
c∈C
k
= 2 · (n + 1), (1.16)
14
Assume that y ∈ B (c1 , 1)∩B (c2 , 1), that is ∆(y, c1 ) ≤ 1 and ∆(y, c2 ) ≤ 1. Thus, by the triangle inequality ∆(c1 , c2 ) ≤
2 < 3, which is a contradiction.

19
{0, 1}n

1 1 1

c1 c2

1 1 1

Figure 1.4: Hamming balls of radius 1 are disjoint. The figure is technically not correct: the balls
above are actually balls in the Euclidean space, which is easier to visualize than the Hamming
space.

where (1.15) follows from (1.13) and (1.16) follows from the fact that C has dimension k. Com-
bining (1.16) and (1.14), we get
2k (n + 1) ≤ 2n ,
or equivalently
2n
2k ≤ .
n +1
Taking log2 of both sides we get the desired bound:

k ≤ n − log2 (n + 1).

Thus, Theorem 1.6.2 shows that for n = 7, C H has the largest possible dimension for any
binary code of block length 7 and distance 3 (as for n = 7, n − log2 (n + 1) = 4). In particular,
it also answers Question 1.5.1 for n = 7 in the negative. Next, will present the general form of
Hamming bound.

1.7 Generalized Hamming Bound

We start with a new notation.

Definition 1.7.1. A code C ⊆ Σn with dimension k and distance d will be called a (n, k, d )Σ code.
We will also refer it to as a (n, k, d )|Σ| code.

We now proceed to generalize Theorem 1.6.2 to any distance d (from d = 3).

20
Theorem 1.7.2 (Hamming Bound for any d ). For every (n, k, d )q code
j k 
(d −1) Ã !
2
 X n 
k ≤ n − logq  (q − 1)i  .
i =0 i

Proof. The proof is a straightforward

j k generalization of the proof of Theorem 1.6.2. For nota-
(d −1)
tional convenience, let e = 2 . Given any two codewords, c1 6= c2 ∈ C , the following is true
(as C has distance15 d ):
B (c1 , e) ∩ B (c2 , e) = ;. (1.17)
We claim that for all x ∈ [q]n , Ã !
X e n
|B (x, e)| = (q − 1)i . (1.18)
i =0 i
Indeed
¡n ¢ any vector in B (x, e) must differ from x in exactly 0 ≤ i ≤ e positions. In the summation,
i is the number of ways of choosing the differing i positions and in each such position, a
vector can differ from x in q − 1 ways.
Now consider the union of all Hamming balls centered around a codeword. Obviously, their
union is a subset of [q]n . In other words,
¯ ¯
¯[ ¯
¯ B (c, e)¯ ≤ q n . (1.19)
¯ ¯
c∈C

As (1.17) holds for every pair of distinct codewords,

¯ ¯
¯[ ¯ X
¯ B (c, e)¯ = |B (c, e)|
¯ ¯
c∈C c∈C
Ã !
X e n
= qk (q − 1)i , (1.20)
i =0 i

where (1.20) follows from (1.18) and the fact that C has dimension k. Combining (1.20) and
(1.19) and taking logq of both sides we will get the desired bound:
Ã Ã ! !
X e n
i
k ≤ n − logq (q − 1) .
i =0 i

Note that the Hamming bound gives a partial answer to Question 1.4.1. In particular, any
code of distance d can have rate R at most
¡P ¡ ¢ ¢
logq ei=0 ni (q − 1)i
1− .
n
Further, the Hamming bound also leads to the following definition:
15
Assume that y ∈ B (c1 , e)∩B (c2 , e), that is ∆(y, c1 ) ≤ e and ∆(y, c2 ) ≤ e. Thus, by the triangle inequality, ∆(c1 , c2 ) ≤
2e ≤ d − 1, which is a contradiction.

21
Definition 1.7.3. Codes that meet Hamming bound are called perfect codes.

In other words, a perfect

j codek leads to the following perfect “packing": if one constructs
d −1
Hamming balls of radius 2 around all the codewords, then we would cover the entire am-
bient space, i.e. every possible vector will lie in one of these Hamming balls.
One example of perfect code is the (7, 4, 3)2 Hamming code that we have seen in this chapter
(so is the family of general Hamming codes that we will see in the next chapter). A natural
question to ask is if

Question 1.7.1. Other than the Hamming codes, are there any other perfect (binary) codes?

We will see the answer in Section 2.4.

1.8 Family of codes

Until now, we have mostly studied specific codes with fixed block lengths and dimensions.
However, when we perform an asymptotic study of codes, it makes more sense to talk about
a family of codes and study their asymptotic rate and distance. We define these notions next.

Definition 1.8.1 (Code families, Rate and Distance). Let {n i }i ≥1 be an increasing sequence of
block lengths and suppose there exists sequences {k i }i ≥1 , {d i }i ≥1 and {q i }i ≥1 such that for all i ≥ 1
there exists an (n i , k i , d i )qi code C i . Then the sequence C = {C i }i ≥1 is a family of codes. The rate
of C is defined as ½ ¾
ki
R(C ) = lim ,
i →∞ n i

when the limit exists. The relative distance of C is defined as

½ ¾
di
δ(C ) = lim ,
i →∞ n i

when the limit exists. If for all i ≥ 1, q i = q then C is referred to as a family of q-ary codes.16 17

For instance, we will in Section 2.4 see that Hamming code of Section 1.5 can be extended
to an entire family of codes. Specifically, C H = {C i }i ∈Z+ , with C i being an (n i .k i , d i )-code with
n i = 2i − 1, k i = 2i − i − 1, d i = 3 and thus,
i
R(C H ) = lim 1 − = 1,
i →∞ 2i −1
16
In all codes we will study these limits will exist, but of course it is possible to construct families of codes where
the limits do not exist.
17
While a central goal is to understand q-ary families of codes, families over growing alphabets turn out to be
useful both to illustrate ideas and to get interesting q-ary families.

22
and
3
δ(C H ) = lim = 0.
i →∞ 2i
−1
A significant focus of this text from now on will be on families of codes. This is necessary as
we will be studying the asymptotic behavior of algorithms on codes, which does not make sense
for a fixed code. For example, when we say that a decoding algorithm for a code C takes O(n 2 )
time, we would be implicitly assuming that C is a family of codes and that the algorithm has
an O(n 2 ) running time when the block length is large enough. From now on, unless mentioned
otherwise, whenever we talk about a code, we will be implicitly assuming that we are talking
about a family of codes.
Given that we can only formally talk about asymptotic run time of algorithms, we now also
state our formal notion of efficient algorithms:

We’ll call an algorithm related to a code of block length n to be efficient if it runs in time
polynomial in n.

For all the specific codes that we will study in this book, the corresponding family of codes
will be a “family" in a more natural sense. By this we mean that all the specific codes in a family
of codes will be the “same" code except with different parameters. A bit more formally, we will
consider families {C i }i ≥1 , where given only the ‘index’ i , one can compute a sufficient descrip-
tion of C i efficiently.18
Finally, the definition of a family of codes allows us to present the final version of the big
motivating question for the book. The last formal version of the main question we considered
was Question 1.4.1, where we were interested in the tradeoff of rate R and distance d . The
comparison was somewhat unfair because R was a ratio while d was an integer. A more appro-
priate comparison should be between rate R and the relative distance δ. Further, we would be
interested in tackling the main motivating question for families of codes, which results in the
following final version:

Question 1.8.1. Given q, what is the optimal tradeoff between R(C ) and δ(C ) that can be
achieved by some family C of q-ary codes?

A natural special case of Question 1.8.1 is whether the rate and relative distance of a family
of codes can be simultaneously positive. We formulate this special case as a separate question
below.

18
We stress that this is not always going to be the case. In particular, we will consider “random" codes where this
efficient constructibility will not be true.

23
Question 1.8.2. Does there exist a constant q and a qq-ary family of codes C such that
R(C ) > 0 and δ(C ) > 0 hold simultaneously?

Codes that have the above property are called asymptotically good. For the curious reader,
we will present many asymptotically good codes in the rest of this book, though a priori the
existence of these is not immediate.

1.9 Exercises
Exercise 1.1. Show that every t -error correcting code is also t -error detecting but not necessarily
the other way around.

Exercise 1.2. Prove Proposition 1.3.9.

Exercise 1.3. Show that for every integer n, there is no code with block length n that can handle
arbitrary number of errors.

Exercise 1.4. Prove Proposition 1.3.10.

Exercise 1.5. A distance function on Σn (i.e. d : Σn × Σn → R) is called a metric if the following

conditions are satisfied for every x, y, z ∈ Σn :

1. d (x, y) ≥ 0.

2. d (x, y) = 0 if and only if x = y.

3. d (x, y) = d (y, x).

4. d (x, z) ≤ d (x, y) + d (y, z). (This property is called the triangle inequality.)

Prove that the Hamming distance is a metric.

Exercise 1.6. Let C be a code with distance d for even d . Then argue that C can correct up to
d /2 − 1 many errors but cannot correct d /2 errors. Using this or otherwise, argue that if a code C
is t -error correctable then it either has a distance of 2t + 1 or 2t + 2.

Exercise 1.7. In this exercise, we will see that one can convert arbitrary codes into code with
slightly different parameters:

1. Prove that if there exists an (n, k, d )Σ code then there also exists an (n − 1, k, d − 1)Σ code.
Specifically, show how to convert an (n, k, d )Σ code C into an (n − 1, k, d − 1)Σ code.

2. For odd d , prove that if an (n, k, d )2 code exists, then there also exists an (n + 1, k, d + 1)2
code. Specifically, show how to convert an (n, k, d )2 code C into an (n + 1, k, d + 1)2 code.

24
Note: Your conversion should not assume anything else about the code other than the parameters
of the code C . Also your conversion should work for every n, k, d ≥ 1 and every Σ.

Exercise 1.8. In this problem we will consider a noise model that has both errors and erasures. In
particular, let C be an (n, k, d )Σ code. As usual a codeword c ∈ C is transmitted over the channel
and the received word is a vector y ∈ (Σ ∪ {?})n , where as before a ? denotes an erasure. We will use
s to denote the number of erasures in y and e to denote the number of (non-erasure) errors that
occurred during transmission. To decode such a vector means to output a codeword c ∈ C such
that the number of positions where c disagree with y in the n − s non-erased positions is at most
e. For the rest of the problem assume that

2e + s < d . (1.21)

1. Argue that the output of the decoder for any C under (1.21) is unique.

2. Let C be a binary code (but not necessarily linear). Assume that there exists a decoder D
that can correct from < d /2 many errors in T (n) time. Then under (1.21) one can perform
decoding in time O(T (n)).

Exercise 1.9. Define codes other than C H with k = 4, n = 7 and d = 3.

Hint: Refer to the proof of Proposition 1.5.2 to figure out the properties needed from the three parities.

Exercise 1.10. Argue that if w t (x) = 1 then at least two parity check bits in (x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕
x 4 , x 1 ⊕ x 3 ⊕ x 4 ) are 1.

Exercise 1.11. Argue that if w t (x) = 2 then at least one parity check bit in (x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕
x 4 , x 1 ⊕ x 3 ⊕ x 4 ) is 1.

Exercise 1.12. Prove that for any u, v ∈ {0, 1}n , ∆(u, v) = w t (u + v).

Exercise 1.13. Argue that C ⊕ and C 3,r ep are binary linear codes.

Exercise 1.14. Let G be a generator matrix of an (n, k, d )2 binary linear code. Then G has at least
kd ones in it.

Exercise 1.15. Argue that in any binary linear code, either all all codewords begin with a 0 of
exactly half of the codewords begin with a 0.

Exercise 1.16. Prove (1.13).

Exercise 1.17. Show that there is no binary code with block length 4 that achieves the Hamming
bound.

Exercise 1.18. (∗) There are n people in a room, each of whom is given a black/white hat chosen
uniformly at random (and independent of the choices of all other people). Each person can see
the hat color of all other people, but not their own. Each person is asked if they wish to guess
their own hat color. They can either guess, or abstain. Each person makes their choice without

25
knowledge of what the other people are doing. They either win collectively, or lose collectively.
They win if at least one person does not abstain and all the people who don’t abstain guess their
hat color correctly. They lose if all people abstain, or if some person guesses their color incorrectly.
Your goal below is to come up with a strategy that will allow the n people to win with pretty high
probability. We begin with a simple warm-up:

(a) Argue that the n people can win with probability at least 12 .

Next we will see how one can really bump up the probability of success with some careful mod-
eling, and some knowledge of Hamming codes. (Below are assuming knowledge of the general
Hamming code (see Section 2.4). If you do not want to skip ahead, you can assume that n = 7 in
the last part of this problem.

(b) Lets say that a directed graph G is a subgraph of the n-dimensional hypercube if its vertex
set is {0, 1}n and if u → v is an edge in G, then u and v differ in at most one coordinate.
Let K (G) be the number of vertices of G with in-degree at least one, and out-degree zero.
Show that the probability of winning the hat problem equals the maximum, over directed
subgraphs G of the n-dimensional hypercube, of K (G)/2n .

(c) Using the fact that the out-degree of any vertex is at most n, show that K (G)/2n is at most
n
n+1 for any directed subgraph G of the n-dimensional hypercube.

(d) Show that if n = 2r − 1, then there exists a directed subgraph G of the n-dimensional hyper-
n
cube with K (G)/2n = n+1 .
Hint: This is where the Hamming code comes in.

1.10 Bibliographic Notes

Coding theory owes its origin to two remarkable papers: one by Shannon [138] and the other
by Hamming [83] both of which were published within a couple of years of each other. Shan-
non’s paper defined the BSCp channel (among others) and defined codes in terms of its encod-
ing function. Shannon’s paper also explicitly defined the decoding function. Hamming’s work
defined the notion of codes as in Definition 1.2.1 as well as the notion of Hamming distance.
Both the Hamming bound and the Hamming code are (not surprisingly) due to Hamming. The
specific definition of Hamming code that we used in this book was the one proposed by Ham-
ming and is also mentioned in Shannon’s paper (which pre-dates Hamming’s) with attribution
to Hamming. The notion of erasures was defined by Elias [48]. Most exercises of this chapter
are based on [83]. The hat problem in Exercise 1.18 is from Ebert, Merkle and Vollmer [45].

26
Part I

The Basics

27
Overview of the Basics

Chapter 1 introduced the fundamental quests of this book — namely error-correcting codes
over some given alphabet and block length that achieve the best possible tradeoff between
dimension and distance, and the asymptotics of these parameters. We refer to all the theory
pertinent to this quest as coding theory.
To construct codes, it turns out that a little bit of algebra goes a long way. Specifically if
the alphabet is viewed as a finite field and the code is required to be a vector space over this
finite field, then we get the subclass of error-correcting codes called linear codes. Linear codes
tend to be interesting in their own right, but also focussing on linear codes somehow allows
us to come up with creative ways to construct codes. In Chapter 2 we review the basic algebra
concepts needed to define these linear codes and embark on a preliminary study of these codes
to illustrate their power.
A second subfield of mathematics that turns out to be central to coding theory is probability
theory. The probability theory considered in this text is mostly over finite sets, and thus is just
a fancy way to ‘count.’ Yet the basic tools of probability theory turn out to be very useful in
eliciting proofs of basic facts in coding theory. So in Chapter 3 we review some basics of coding
theory and also introduce the entropy function which also plays a prominent role in coding
theory.
Future parts of this book return to our fundamental question and use the tools from Chap-
ters 2 and 3 to make progress on them.

29
30
Chapter 2

A Look at Some Nicely Behaved Codes:

Linear Codes

One motivation for the topic of this chapter is the following question: How we can represent a
code? Or more specifically, how many bits does it take to describe a code C : [q]k −→ [q]n ? In
general, a code C : [q]k −→ [q]n can be stored using nq k symbols from [q] (n symbols for each
of the q k codewords) or nq k log q bits. For constant rate codes, this is exponential space, which
is prohibitive even for modest values of k like k = 100. A natural question is whether we can do
better. To have any hope of doing so, a succinct representation the code must have some extra
structure. It turns out that one broad class of codes that do possess extra structure than general
codes, is what are called linear codes. We have already seen binary linear codes in Section 1.5,
that is: C ⊆ {0, 1}n is a linear code if for all c1 , c2 ∈ C , c1 + c2 ∈ C , where the “+" denotes bit-
wise XOR. In this chapter, we will see more general linear codes. We will see that they not only
offer enough structure to get succinct representations, but they also possess several other nice
properties.
To define general linear codes, we first need to introduce general finite fields and vector
spaces over such fields and we do so first before returning to codes.

2.1 Groups and Finite Fields

To define linear subspaces, we will need to work with (finite) fields. At a high level, we need
finite fields since when we talk about codes, we deal with finite symbols/numbers and we want
to endow these symbols with the same math that makes arithmetic over real numbers work.
Finite fields accomplish this precise task. We begin with a quick overview of fields. We start
with the more elementary notion of a group.
Definition 2.1.1. A group G is given by a pair (S, ◦), where S is the set of elements and ◦ is a
function S × S → S with the following properties:
• C LOSURE : For every a, b ∈ S, we have a ◦ b ∈ S.

• A SSOCIATIVITY: ◦ is associative: that is, for every a, b, c ∈ S, a ◦ (b ◦ c) = (a ◦ b) ◦ c.

31
• I DENTITY: There exists distinct a special elements e ∈ S such that for every a ∈ S we have
a ◦ e = e ◦ a = a.

• I NVERSE : For every a ∈ S, there exists its unique inverse a −1 such that a ◦ a −1 = a −1 ◦ a = e.

If G = (S, ◦) satisfies all the properties except the existence of inverses then G is called a monoid.
We say G is commutative if for every a, b ∈ S, a ◦ b = b ◦ a.

We often use the same letter to denote the group (or other algebraic structures) and the set
of elements.
We now turn to the definition of a field. Informally speaking, a field is a set of elements on
which one can do addition, subtraction, multiplication and division and still stay in the set.

Definition 2.1.2. A field F is given by a triple (S, +, ·), where S is the set of elements and +, · are
functions S × S → S with the following properties:

• Addition: (S, +) form a commutative group with identity element denoted 0 ∈ S.

• Multiplication: (S \ {0}, ·) form a commutative group with identity element 1 ∈ S \ {0}.1

• Distributivity: · distributes over +: that is, for every a, b, c ∈ S, a · (b + c) = a · b + a · c.

Again we typically use the same letter to denote the field and its set of elements. We also
use −a to denote the additive inverse of a ∈ F and a −1 to denote the multiplicative inverse of
a ∈ F \ {0}.
We note that in the above definition we have not explicitly argued that a · 0 = 0 = 0 · a for any
a ∈ S. (Technically this means (S, ·) is a commutative monoid.) This is because this property is
implied by Definition 2.1.2– see Exercise 2.1.
With the usual semantics for + and ·, R (set of real number) is a field, but Z (set of integers)
is not a field as division of two integers results in a rational number that need not be an inte-
ger (the set of rational numbers itself is a field though: see Exercise 2.2). In this course, we will
exclusively deal with finite fields. As the name suggests these are fields with a finite set of ele-
ments. (We will overload notation and denote the size of a field |F| = |S|.) The following is a well
known result.

Theorem 2.1.3 (Size of Finite Fields). Every finite field has size p s for some prime p and integer
s ≥ 1. Conversely for every prime p and integer s ≥ 1 there exists a field F of size p s .

One example of a finite field that we have seen is the field with S = {0, 1}, which we will
denote by F2 (we have seen this field in the context of binary linear codes). For F2 , addition
is the XOR operation, while multiplication is the AND operation. The additive inverse of an
element in F2 is the number itself while the multiplicative inverse of 1 is 1 itself.
Let p be a prime number. Then the integers modulo p form a field, denoted by Fp (and also
by Zp ), where the addition and multiplication are carried out modulo p. For example, consider
F7 , where the elements are {0, 1, 2, 3, 4, 5, 6}. We have (4 + 3) mod 7 = 0 and 4 · 4 mod 7 = 2.
1
Note that we do not include 0 since it does not have a multiplicative inverse.

32
Further, the additive inverse of 4 is 3 as (3 + 4) mod 7 = 0 and the multiplicative inverse of 4 is 2
as 4 · 2 mod 7 = 1.
More formally, we prove the following result.

Lemma 2.1.4. Let p be a prime. Then Fp = ({0, 1, . . . , p − 1}, +p , ·p ) is a field, where +p and ·p are
addition and multiplication modulo p.

Proof. The properties of associativity, commutativity, distributivity and identities hold for in-
tegers and hence, they hold for Fp . The closure property follows since both the “addition" and
“multiplication" are done modulo p, which implies that for any a, b ∈ {0, . . . , p −1}, a +p b, a ·p b ∈
{0, . . . , p −1}. Thus, to complete the proof, we need to prove the existence of unique additive and
multiplicative inverses.
Fix an arbitrary a ∈ {0, . . . , p − 1}. Then we claim that its additive inverse is p − a mod p. It
can be verified that a + p − a = 0 mod p. Next we argue that this is the unique additive inverse.
To see this note that the sequence a, a + 1, a + 2, . . . , a + p − 1 are p consecutive numbers and
thus, exactly one of them is a multiple of p, which happens for b = p − a mod p, as desired.
Now fix an a ∈ {1, . . . , p − 1}. Next we argue for the existence of a unique multiplicative uni-
verse a −1 . Consider the set of numbers T = {a ·p b|b ∈ {1, . . . , p − 1}}. We claim that all these
numbers are unique. To see this, note that if this is not the case, then there exist b 1 6= b 2 ∈
{0, 1, . . . , p − 1} such that a · b 1 = a · b 2 mod p, which in turn implies that a · (b 1 − b 2 ) = 0 mod p.
Since a and b 1 −b 2 are non-zero numbers, this implies that p divides a · (b 1 −b 2 ). Further, since
a and |b 1 −b 2 | are both at most p −1, this implies that multiplying a and (b 1 −b 2 ) mod p results
in p, which is a contradiction since p is prime. Thus, we have argued that |T | = p − 1 and since
each number in T is in [p − 1], we have that T = [p − 1]. Thus, we can conclude that there exists
a unique element b such that a · b = 1 mod p and thus, b is the required a −1 .

One might think that there could be different finite fields with the same number of elements.
However, this is not the case:

Theorem 2.1.5. For every prime power q there is a unique finite field with q elements (up to
isomorphism2 ).

Thus, we are justified in just using Fq to denote a finite field on q elements.

2.2 Vector Spaces and Linear Subspaces

Definition 2.2.1 (Vector Space). A vector space V over a field F is given by a triple (T, +, ·) such
that (T, +) form a commutative group and ·, referred to as the scalar product, is a function F×T →
T such that for every a, b ∈ F and u, v ∈ T we have (a +b)·u = a ·u+b ·u and a ·(u+v) = a ·u+ a ·v.
2
An isomorphism φ : S → S ′ is a bijective map (such that F = (S, +, ·) and F′ = (S ′ , ⊕, ◦) are fields) where for every
a 1 , a 2 ∈ S, we have φ(a 1 + a 2 ) = φ(a 1 ) ⊕ φ(a 2 ) and φ(a 1 · a 2 ) = φ(a 1 ) ◦ φ(a 2 ). In other words, an ismorphisms is a
map between representations that ‘preserves’ the effect of operators on elements.

33
The most common vector space we will focus on is Fn with + representing coordinatewise
addition in F and a · u representing the coordinatewise scaling of u by a.
We are finally ready to define the notion of linear subspaces of Fn .

Definition 2.2.2 (Linear Subspace). A non-empty subset S ⊆ Fn is a linear subspace if the follow-
ing properties hold:

1. For every x, y ∈ S, x + y ∈ S, where the addition is vector addition over F (that is, do addition
componentwise over F).

2. For every a ∈ F and x ∈ S, a · x ∈ S, where the multiplication is done componentwise over F.

Here is a (trivial) example of a linear subspace of F35 :

S 1 = {(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)}. (2.1)

Note that for example (1, 1, 1) + (3, 3, 3) = (4, 4, 4) ∈ S 1 and 2 · (4, 4, 4) = (3, 3, 3) ∈ S 1 as required
by the definition. Here is another somewhat less trivial example of a linear subspace over F33 :

S 2 = {(0, 0, 0), (1, 0, 1), (2, 0, 2), (0, 1, 1), (0, 2, 2), (1, 1, 2), (1, 2, 0), (2, 1, 0), (2, 2, 1)} . (2.2)

Note that (1, 0, 1) + (0, 2, 2) = (1, 2, 0) ∈ S 2 and 2 · (2, 0, 2) = (1, 0, 1) ∈ S 2 as required.

Remark 2.2.3. Note that the second property implies that 0 is contained in every linear subspace.
Further for any subspace over F2 , the second property is redundant: see Exercise 2.5.

Before we state some properties of linear subspaces, we state some relevant definitions.

Definition 2.2.4 (Span). Given a set B = {v1 , . . . , vℓ }. The span of B is the set of vectors
( )
ℓ
X ¯
a i · vi ¯a i ∈ Fq for every i ∈ [ℓ] .
i =1

Definition 2.2.5 (Linear (in)dependence of vectors). We say that v1 , v2 , . . . vk are linearly inde-
pendent if for every 1 ≤ i ≤ k and for every (k − 1)-tuple (a 1 , a 2 , . . . , a i −1 , a i +1 , . . . , a k ) ∈ Fk−1
q ,

vi 6= a 1 v1 + . . . + a i −1 vi −1 + a i +1 vi +1 + . . . + a k vk .

In other words, vi is not in the span of the set {v1 , . . . , vi −1 , vi +1 , . . . , vn } for every 1 ≤ i ≤ k. We say
that v1 , v2 , . . . vk are linearly dependent if they are not linearly independent.

For example the vectors (1, 0, 1), (1, 1, 1) ∈ S 2 are linearly independent since

• a 1 · (1, 0, 1) = (a 1 , 0, a 1 ) 6= (1, 1, 1) for any a 1 ∈ {0, 1}.

• a 2 · (1, 0, 1) = (a 2 , a 2 , a 2 ) 6= (1, 0, 1) for any a 2 ∈ {0, 1}.

34
Definition 2.2.6 (Rank of a matrix). The rank of matrix in Fk×k
q is the maximum number of lin-
early independent rows (or columns). A matrix in Fk×n
q with rank min(k, n) is said to have full
rank.

One can define the row (column) rank of a matrix as the maximum number of linearly indepen-
dent rows (columns). However, it is a well-known theorem that the row rank of a matrix is the
same as its column rank. For example, the matrix below over F3 has full rank (see Exercise 2.6):
µ ¶
1 0 1
G2 = . (2.3)
0 1 1

Any linear subspace satisfies the following properties (the full proof can be found in any
standard linear algebra textbook).

Theorem 2.2.7. If S ⊆ Fq n is a linear subspace then

1. |S| = q k for some k ≥ 0. The parameter k is called the dimension of S.

2. There exists at least one set of linearly independent vectors v1 , ..., vk ∈ S called basis ele-
ments such that every x ∈ S can be expressed as x = a 1 v1 + a 2 v2 + ... + a k vk where a i ∈ Fq for
1 ≤ i ≤ k. In other words, there exists a full rank k × n matrix G (also known as a generator
matrix) with entries from Fq such that every x ∈ S, x = (a 1 , a 2 , . . . , a k ) · G where
 
←− v1 −→
 ←− v2 −→ 
 
G = .. .
 . 
←− vk −→

3. There exists a full rank (n − k) × n matrix H (called a parity check matrix) such that for
every x ∈ S, H xT = 0.

4. G and H are orthogonal, that is, G · H T = 0.

Proof Sketch.
Property 1. We begin with the proof of the first property. For the sake of contradiction, let
us assume that q k < |S| < q k+1 , for some k ≥ 0. Iteratively, we will construct a set of linearly
independent vectors B ⊆ S such that |B | ≥ k + 1. Note that by the definition of a linear subspace
the span of B should be contained in S. However, this is a contradiction as the size of the span
of B is at least3 q k+1 > |S|.
To complete the proof, we show how to construct the set B in a greedy fashion. In the first
step pick v1 to be any non-zero vector in S and set B ← {v1 } (we can find such a vector as |S| >
q k ≥ 1). Now say after the step t (for some t ≤ k), |B | = t . Now the size of the span of the current
B is q t ≤ q k < |S|. Thus there exists a vector vt +1 ∈ S \ B that is linearly independent of vectors
in B . Set B ← B ∪ {vt +1 }. Thus, we can continue building B until |B | = k + 1, as desired.
3
See Exercise 2.8.

35
Property 2. We first note that we can pick B = {v1 , . . . , vk } to be any set of k linearly indepen-
dent vectors– this just follows from the argument above for Property 1.1. This is because the
span of B is contained in S. However, since |S| = q k and the span of B has q k vectors, the two
have to be the same.

Property 3. Property 3 above follows from another fact that every linear subspace S has a null
space N ⊆ Fnq such that for every x ∈ S and y ∈ N , 〈x, y〉 = 0. Further, it is known that N itself is
a linear subspace of dimension n − k. (The claim that N is also a linear subspace follows from
the following two facts: for every x, y, z ∈ Fnq , (i) 〈x, y + z〉 = 〈x, y〉 + 〈x, z〉 and (ii) for any a ∈ Fq ,
〈x, ay〉 = a ·〈x, y〉.) In other words, there exists a generator matrix H for it. This matrix H is called
the parity check matrix of S.

Property 4. See Exercise 2.9.

As examples, the linear subspace S 1 in (2.1) has as one of its generator matrices
¡ ¢
G1 = 1 1 1

and as one of its parity check matrices

µ ¶
1 2 2
H1 = .
2 2 1

Further, the linear subspace S 2 in (2.2) has G 2 as one of its generator matrices and has the fol-
lowing as one of its parity check matrices
¡ ¢
H2 = 1 1 2 .

Finally, we state another property of linear subspaces that is useful.

Lemma 2.2.8. Given matrix G of dimension k × n that is a generator matrix of subspace S 1 and
matrix H of dimension (n −k)×n that is a parity check matrix of subspace S 2 such that G H T = 0,
then S 1 = S 2 .

Proof. We first prove that S 1 ⊆ S 2 . Given any c ∈ S 1 , there exists x ∈ Fkq such that c = xG. Then,
¡ ¢T
H · cT = H · (xG)T = HG T xT = G H T xT = 0,

which implies that c ∈ S 2 , as desired.

To complete the proof note that as H has full rank, its null space (or S 2 ) has dimension
n − (n − k) = k (this follows from a well known fact from linear algebra called the rank-nullity
theorem). Now as G has full rank, the dimension of S 1 is also k. Thus, as S 1 ⊆ S 2 , it has to be the
case that S 1 = S 2 .4
4
If not, S 1 ⊂ S 2 which implies that that |S 2 | ≥ |S 1 | + 1. The latter is not possible if both S 1 and S 2 have the same
dimension.

36
2.3 Linear Codes and Basic Properties
We now return to the topic of codes and introduce the central concept for this chapter as well
as much of this text.

Definition 2.3.1 (Linear Codes). Let q be a prime power (i.e. q = p s for some prime p and integer
s ≥ 1). C ⊆ Fnq is a linear code if it is a linear subspace of Fnq . If C has dimension k and distance d
then it will be referred to as an [n, k, d ]q or just an [n, k]q code.

Theorem 2.2.7 now gives two alternate characterizations of an [n, k]q linear code C : first, C
is generated by a k × n generator matrix G. Second, C is defined by a (n − k) × n parity check
matrix H . Since these are important concepts for us, we define these formally below before
giving examples and consequences.

Definition 2.3.2 (Generator and Parity Check Matrices). If C is an [n, k]q linear code then there
exists a matrix G ∈ Fk×n
q of rank k satisfying

C = {x · G|x ∈ Fkq }.

G is referred to as a generator matrix of C . In other words, the code C is the set of all possible
linear combinations of rows of G.
If C is an [n, k]q linear code then there exists a matrix H ∈ F(n−k)×n
q of rank n − k satisfying

C = {y ∈ Fnq |H · yT = 0}.

H is referred to as a parity check matrix of C .

Note that we require G and H to have full row rank (i.e., the rows of G are linearly indepen-
dent and the same holds for H ). Sometimes we will consider matrices M ∈ Fm×n q that are not of
full row rank. These can still be used to generate a code C = {{x · G|x ∈ Fmq } though the code C
will not be an [n, m]q code. We will still refer to C as the code generated by M in such a case,
though the phrase “generator matrix” will be reserved for full rank matrices.
Note that neither the generator matrix nor the parity check matrix are unique for a given
code. However, all generator matrices (and parity check matrices) have the same dimensions,
i.e. all are k × n (and (n − k) × n respectively) matrices. We give examples of these matrices for
the case of the [7, 4, 3]2 Hamming code below.

• The [7, 4, 3]2 Hamming code has the following generator matrix:
 
1 0 0 0 0 1 1
 
0 1 0 0 1 0 1
G = 
0 0 1 0 1 1 0
0 0 0 1 1 1 1

37
• The following matrix is a parity check matrix of the [7, 4, 3]2 Hamming code:
 
0 0 0 1 1 1 1
H =  0 1 1 0 0 1 1
1 0 1 0 1 0 1

Indeed, it can be easily verified that G ·H T = 0. Then Lemma 2.2.8 proves that H is a parity
check matrix of the [7, 4, 3]2 Hamming code.

We now look at some consequences of the above characterizations of an [n, k]q linear code
C . We started this chapter with a quest for succinct representation of a code. Note that both
the generator matrix and the parity check matrix can be represented using O(n 2 ) symbols from
Fq . Note that this is much smaller than the exponential representation of a general code. More
precisely we have the following result on succinct representations of a linear code (see also
Exercise 2.11):

Proposition 2.3.3. Any [n, k]q linear code can be represented with min(nk, n(n − k)) symbols
from Fq .

There is an encoding algorithm for C that runs in O(n 2 ) (in particular O(kn)) time– given a
message m ∈ Fkq , the corresponding codeword C (m) = m · G, where G is the generator matrix of
C . (See Exercise 2.12.)

Proposition 2.3.4. For any [n, k]q linear code, given its generator matrix, encoding can be done
with O(nk) operations over Fq .

There is an error-detecting algorithm for C that runs in O(n 2 ). This is a big improvement
over the naive brute force exponential time algorithm (that goes through all possible codewords
c ∈ C and checks if y = c). (See Exercise 2.13.)

Proposition 2.3.5. For any [n, k]q linear code, given its parity check matrix, error detection can
be performed in O(n(n − k)) operations over Fq .

Next, we look at some alternate characterizations of the distance of a linear code.

2.3.1 On the Distance of a Linear Code

Linear codes admit a nice characterization of minimum distance in terms of the Hamming
weight of non-zero codewords, which we have seen for the special case of binary linear codes
(Proposition 1.5.4). Recall that we use wt(x) to denote the Hamming weight of a vector x ∈ Σn ,
i.e., the number of non-zero coordinates in x.

Proposition 2.3.6. For every [n, k, d ]q code C , we have

d = min wt(c).
c∈C ,
c6=0

38
Proof. To show that d is the same as the minimum weight we show that d is no more than the
minimum weight and d is no less than the minimum weight.
First, we show that d is no more than the minimum weight. We can see this by considering
∆(0, c′ ) where c′ is the non-zero codeword in C with minimum weight; its distance from 0 is
equal to its weight. Thus, we have d ≤ w t (c′ ), as desired.
Now, to show that d is no less than the minimum weight, consider c1 6= c2 ∈ C such that
∆(c1 , c2 ) = d . Note that c1 − c2 ∈ C (this is because −c2 = −1 · c2 ∈ C , where −1 is the additive
inverse of 1 in Fq and c1 − c2 = c1 + (−c2 ), which is in C by the definition of linear codes). Now
note that w t (c1 − c2 ) = ∆(c1 , c2 ) = d , since the non-zero symbols in c1 − c2 occur exactly in the
positions where the two codewords differ. Further, since c1 6= c2 , c1 − c2 6= 0, which implies that
the minimum Hamming weight of any non-zero codeword in C is at most d .

Next, we look at another property implied by the parity check matrix of a linear code.

Proposition 2.3.7. For every [n, k, d ]q code C with parity check matrix H , d equals the size of the
smallest subset of columns of H that are linearly dependent.

Proof. By Proposition 2.3.6, we need to show that the minimum weight of a non-zero codeword
in C is the minimum number of linearly dependent columns. Let t be the minimum number of
linearly dependent columns in H . To prove the claim we will show that t ≤ d and t ≥ d .
For the first direction, Let c 6= 0 ∈ C be a codeword with w t (c) = d . Now note that, by the
definition of the parity check matrix, H ·cT = 0. Working through the matrix multiplication, this
P
gives us that ni=1 c i H i = 0, where
 
↑ ↑ ↑ ↑
 1 2
H = H H ··· Hi ··· Hn 
↓ ↓ ↓ ↓

and c = (c 1 , . . . , c n ). Note that we can skip multiplication for those columns for which the cor-
responding bit c i is zero, so for H · cT to be zero, those H i with c i 6= 0 are linearly dependent.
This means that d ≥ t , as the columns corresponding to non-zero entries in c are one instance
of linearly dependent columns.
For the other direction, consider the minimum set of columns from H , H i 1 , H i 2 , . . . , H i t that
are linearly dependent. This implies that there exists non-zero elements c i′ , . . . , c i′ ∈ Fq such
1 t
that c i′ H i 1 + . . . + c i′ H i t = 0. (Note that all the c i′ are non-zero as no set of less than t columns
i t j
are linearly dependent.) Now extend c i′ , . . . , c i′ to the vector c′ such that c ′j = 0 for j 6∈ {i 1 , . . . , i t }.
1 t
¡ ¢T
Note that we have H · c′ = 0 and thus, we have c′ ∈ C . This in turn implies that d ≤ w t (c′ ) = t
(where recall t is the minimum number of linearly independent columns in H ).

2.4 Hamming Codes

We now change gears and look at the general family of linear codes, which were discovered by
Hamming. So far, we have seen the [7, 4, 3]2 Hamming code (in Section 1.5). In fact, for any r ≥ 2

39
there is a [2r − 1, 2r − r − 1, 3]2 Hamming code. Thus in Section 1.5, we have seen this code for
r = 3.

Definition 2.4.1 (Binary Hamming Codes). For any positive integer r , define the matrix Hr ∈
r
F2r ×(2 −1) to be the r × (2r − 1) matrix whose i th column Hir is the binary representation of i , for
1 ≤ i ≤ 2r − 1. (Note that such a representation is a vector in {0, 1}r .)
The [2r − 1, 2r − r − 1]2 Hamming code, denoted by C H ,r , is the code with parity check matrix
Hr .

In other words, the general [2r − 1, 2r − r − 1]2 Hamming code is the code
r
{c ∈ {0, 1}2 −1
|Hr · cT = 0}.

For example, for the case we have seen (r = 3),

 
0 0 0 1 1 1 1
H3 = 0 1 1 0 0 1 1 ,
1 0 1 0 1 0 1

and the resulting code was a [7, 4, 3]2 code.

Next we argue that the above Hamming code has distance 3 (in Proposition 1.5.2, we argued
this for r = 3).

Proposition 2.4.2. The Hamming code [2r − 1, 2r − r − 1, 3]2 has distance 3.

j
Proof. No two columns in Hr are linearly dependent. If they were, we would have Hir + Hr =
0, but this is impossible since they differ in at least one bit (being binary representations of
integers, i 6= j ). Thus, by Proposition 2.3.7, the distance is at least 3. It is at most 3, since (e.g.)
H1r + H2r + H3r = 0.

Now note that under the Hamming bound for d = 3 (Theorem 1.6.2), k ≤ n − log2 (n + 1), so
for n = 2r − 1, k ≤ 2r − r − 1. Hence, the Hamming code is a perfect code. (See Definition 1.7.3.)
In Question 1.7.1, we asked which codes are perfect codes. Interestingly, the only perfect
binary codes are the following:

• The Hamming codes which we just studied.

• The trivial [n, 1, n]2 codes for odd n (which have 0n and 1n as the only codewords): see
Exercise 2.24.

• Two codes due to Golay [63].

40
2.5 Efficient Decoding of Hamming codes
We have shown that the Hamming code has a distance of 3 and thus, by Proposition 1.4.2, can
correct one error. However, this is a combinatorial result and does not give us an efficient al-
gorithm. One obvious candidate for decoding is the MLD function (Algorithm 1.4.1). Unfortu-
nately, the only implementation of MLD that we know is the one in Algorithm 1.4.1, which will
take time 2Θ(n) , where n is the block length of the Hamming code.
However, we can do much better. Consider the following simple algorithm: given the re-
ceived word y, first check if it is indeed a valid codeword. If it is, we are done. Otherwise, flip
each of the n bits and check if the resulting vector is a valid codeword. If so, we have success-
fully decoded from one error. If none of the checks are successful, then we declare a decoding
failure. Algorithm 2.5.1 formally presents this algorithm (where C H ,r is the [2r − 1, 2r − r − 1, 3]2
Hamming code).5

Algorithm 2.5.1 Naive Decoder for Hamming Code

I NPUT: Received word y
O UTPUT: c if ∆(y, c) ≤ 1 else Fail

1: IF y ∈ C H ,r THEN
2: RETURN y
3: FOR i = 1 . . . n DO
4: y′ ← y + ei ⊲ ei is the i th standard basis vector
5: IF y′ ∈ C H ,r THEN
6: RETURN y′
7: RETURN Fail

It can be verified that Algorithm 2.5.1 can correct up to 1 error. If each of the checks y′ ∈ C H ,r
can be done in T (n) time, then the time complexity of the proposed algorithm will be O(nT (n)).
Note that since C H ,r is a linear code (and dimension k = n − O(log n)) by Proposition 2.3.5, we
have T (n) = O(n log n). Thus, the proposed algorithm has running time O(n 2 log n).
Note that Algorithm 2.5.1 can be generalized to work for any linear code C with distance
2t + 1 (and hence, can correct up to t errors): go through all possible error vectors z ∈ [q]n (with
w t (z) ≤ t ) and check if y − z is in the code or not. Algorithm 2.5.2 presents the formal algorithm
(where C is an [n, k, 2t + 1]q code).
P ¡ ¢
The number of error patterns z considered by Algorithm 2.5.2 is6 it =0 ni (q −1)i ≤ O((nq)t ).
Furthermore by Proposition 2.3.5, Step 4 can be performed with O(n 2 ) operations over Fq . Thus,
Algorithm 2.5.2 runs with O(n t +2 q t ) operations over Fq , which for q being a small polynomial in
n, is n O(t ) operations. In other words, the algorithm will have polynomial running time for codes
5
Formally speaking, a decoding algorithm should return the transmitted message x but Algorithm 2.5.1 actually
returns C H ,r (x). However, since C H ,r is a linear code, it is not too hard to see that one can obtain x from C H ,r (x) in
O(n 3 ) time: see Exercise 2.25. Further, for C H ,r one can do this in O(n) time: see Exercise 2.26.
6
Recall (1.18).

41
Algorithm 2.5.2 Decoder for Any Linear Code
I NPUT: Received word y
O UTPUT: c ∈ C if ∆(y, c) ≤ t else Fail

1: FOR i = 0 . . . t DO
2: FOR S ⊆ [n] such that |S| = i DO
3: FOR z ∈ Fnq such that w t (zS ) = w t (z) = i DO
4: IF y − z ∈ C THEN
5: RETURN y − z
6: RETURN Fail

with a constant distance (though the running time would not be practical even for moderate
values of t ).
However, it turns out that for Hamming codes there exists a decoding algorithm with an
O(n 2 ) running time. To see this, first note that if the received word y has no errors, then Hr ·yT =
0. If not, then y = c + ei , where c ∈ C and ei is the unit vector with the only nonzero element at
the i -th position. Thus, if Hir stands for the i -th column of Hr ,

Hr · yT = Hr · cT + Hr · (ei )T = Hr · (ei )T = Hir ,

where the second equality follows as Hr · cT = 0, which in turn follows from the fact that c ∈ C .
In other words, Hr · yT gives the location of the error. This leads to Algorithm 2.5.3.

Algorithm 2.5.3 Efficient Decoder for Hamming Code

I NPUT: Received word y
O UTPUT: c if ∆(y, c) ≤ 1 else Fail

1: b ← Hr · yT .
2: Let i ∈ [n] be the number whose binary representation is b
3: IF y − ei ∈ C H THEN
4: RETURN y − ei
5: RETURN Fail

Note that Hr is an r × n matrix where n = 2r − 1 and thus, r = Θ(log n). This implies Step 1
in Algorithm 2.5.3, which is a matrix vector multiplication can be done in time O(n log n). By
a similar argument and by Proposition 2.3.5 Step 3 can be performed in O(n log n) time, and
therefore Algorithm 2.5.3 overall runs in O(n log n) time. Thus,

Theorem 2.5.1. The [n = 2r −1, 2r −r −1, 3]2 Hamming code is 1-error correctable. Furthermore,
decoding can be performed in time O(n log n).

42
2.6 Dual of a Linear Code
Until now, we have thought of parity check matrix as defining a code via its null space. However,
we are not beholden to think of the parity check matrix in this way. A natural alternative is to use
the parity check matrix as a generator matrix. The following definition addresses this question.
Definition 2.6.1 (Dual of a code). Let H be a parity check matrix of a code C , then the code
generated by H is called the dual of C . The dual of a code C is denoted by C ⊥ .
It is obvious from the definition that if C is an [n, k]q code, then C ⊥ is an [n, n −k]q code. Ap-
plying duality to the Hamming codes and a close relative, we get two families of codes described
below.
Definition 2.6.2 (Simplex and Hadamard Codes). For positive integer r the Simplex Code C Si m,r
⊥
is the code generated by Hr . (Equivalently C Si m,r = C H ,r .) For positive integer r the Hadamard
r r
Code C H ad ,r is the [2 , r ]2 code generated by the r × 2 matrix Hr′ obtained by adding the all zero
column to (say in front of columns in) Hr .
We claim that C Si m,r and C H ad ,r are [2r − 1, r, 2r −1 ]2 and [2r , r, 2r −1 ]2 codes respectively. The
claimed block length and dimension follow from the definition of the codes, while the distance
follows from the following result.
Proposition 2.6.3. C Si m,r and C H ad ,r both have distances of 2r −1 .
Proof. We first show the result for C H ad ,r . In fact, we will show something stronger: every non-
zero codeword in C H ad ,r has weight exactly equal to 2r −1 (the claimed distance follows from
Proposition 2.3.6). Consider a message x 6= 0. Let its i th entry be x i = 1. x is encoded as
r
c = (x 1 , x 2 , . . . , x r )(Hr0 , Hr1 , . . . , Hr2 −1
),
j j
where Hr is the binary representation of 0 ≤ j ≤ 2r − 1 (that is, the set set of vector Hr is exactly
j
the set of all the vectors in {0, 1}r ). Further note that the j th bit of the codeword c is 〈x, Hr 〉.
Group all the columns of the generator matrix into pairs (u, v) such that v = u + ei (i.e. v and u
are the same except in the i th position). For example for r = 3 and i = 2, the paired up columns
are marked with the same color below:
 
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
Notice that this partitions all the columns into 2r −1 disjoint pairs. Then,
〈x, v〉 = 〈x, u + ei 〉 = 〈x, u〉 + 〈x, ei 〉 = 〈x, u〉 + x i = 〈x, u〉 + 1.
Thus we have that 〈x, v〉 is the negation of 〈x, u〉, i.e. exactly one of 〈x, v〉 and 〈x, u〉 is 1. As the
choice of the pair (u, v) was arbitrary, we have proved that for any non-zero codeword c such
that c ∈ C H ad ,r , w t (c) = 2r −1 .
For the simplex code, we observe that all codewords of C H ad ,r are obtained by padding a 0 to
the beginning of the codewords in C Si m,r , which implies that all non-zero codewords in C Si m,r
also have a weight of 2r −1 , which completes the proof.

43
We remark that the family of Hamming code has a rate of 1 and a (relative) distance of 0
while the families of Simplex/Hadamard codes have a rate of 0 and a relative distance of 1/2.
Thus neither gives a positive answer to Question 1.8.2 and so the quest for an asymptotically
good code remains ongoing for now (and we will get to these in future chapters).

2.7 Exercises
Exercise 2.1. Let (S, +, ·) be a field (as per Definition 2.1.2). Then argue that a · 0 = 0 · a = 0 for
every a ∈ S.

Exercise 2.2. Prove that the set of rationals (i.e. the set of reals of the form ba , where both a and
b 6= 0 are integers), denoted by Q, is a field.

Exercise 2.3. Let q be a prime power. Let x ∈ Fq such that x 6∈ {0, 1}. Then prove that for any
n ≤ q − 1:
X n x n+1 − 1
xi = .
i =0 x −1

Exercise 2.4. The main aim of this exercise is to prove the following identity that is true for any
α ∈ Fq :
αq = α (2.4)
To make progress towards the above we will prove a sequence of properties of groups. A group G
is a pair (S, ◦) where the operator ◦ : G ×G → G such that ◦ is commutative7 and the elements of S
are closed under ◦. Further, there is a special element ι ∈ S that is the identity element and every
element a ∈ S has an inverse element b ∈ S such that a ◦ b = ι. Note that a finite field Fq consists
of an additive group with the + operator (and 0 as additive identity) and a multiplicative group
on the non-zero elements of Fq (which is also denoted by F∗q ) with the · operator (and 1 as the
multiplicative identity).8
For the rest of the problem let G = (S, ·) be a multiplicative group with |G| = m. Prove the
following statements.

1. For any β ∈ G, let o(β) be the smallest integer o such that βo = 1. Prove that such an o ≤ m
always exists. Further, argue that T = {1, β, . . . , βo−1 } also forms a group. (T, ·) is called a
sub-group of G and o(β) is called the order of β.

2. For any g ∈ G, define the coset (w.r.t. T ) as

g T = {g · β|β ∈ T }.

Prove that if h −1 · g ∈ T then g T = hT and g T ∩hT = ; otherwise. Further argue that these
cosets partition the group G into disjoint sets.
7
Technically, G is an abelian group.
8
Recall Definition 2.1.2.

44
3. Argue that for any g ∈ G, we have |g T | = |T |.

4. Using the above results or otherwise, argue that for any β ∈ G, we have

βm = 1.

5. Prove (2.4).

Exercise 2.5. Prove that for q = 2, the second condition in Definition 2.2.2 is implied by the first
condition.

Exercise 2.6. Prove that G 2 from (2.3) has full rank.

Exercise 2.7. In this problem we will look at the problem of solving a system of linear equations
over Fq . That is, one needs to solve for unknowns x 1 , . . . , x n given the following m linear equations
(where a i , j , b i ∈ Fq for 1 ≤ i ≤ m and 1 ≤ j ≤ n):

a 1,1 x 1 + a 1,2 x 2 + · · · + a 1,n x n = b 1 .

a 2,1 x 1 + a 2,2 x 2 + · · · + a 2,n x n = b 2 .

..
.

a m,1 x 1 + a m,2 x 2 + · · · + a m,n x n = b m .

1. (Warm-up) Convince yourself that the above problem can be stated as A · xT = bT , where A
is an m × n matrix over Fq , x ∈ Fnq and b ∈ Fm
q .

2. (Upper Triangular Matrix) Assume n = m and that A is upper triangular, i.e. all diagonal
elements (a i ,i ) are non-zero and all lower triangular elements (a i , j , i > j ) are 0. Then
present an O(n 2 ) time9 algorithm to compute the unknown vector x.

3. (Gaussian Elimination) Assume that A has full rank (or equivalently a rank of n.)

(a) Prove that the following algorithm due to Gauss converts A into an upper triangular
matrix. By permuting the columns if necessary make sure that a 1,1 6= 0. (Why can one
a
assume w.l.o.g. that this can be done?) Multiply all rows 1 < i ≤ n with a1,1 and then
i ,1
subtract a 1, j from the (i , j )th entry 1 ≤ j ≤ n. Recurse with the same algorithm on the
(n −1)×(n −1) matrix A ′ obtained by removing the first row and column from A. (Stop
when n = 1.)
(b) What happens if A does not have full rank? Show how one can modify the algorithm
above to either upper triangulate a matrix or report that it does not have full rank.
(Convince yourself that your modification works.)
9
For this problem, any basic operation over Fq takes unit time.

45
(c) Call a system of equations A · xT = bT consistent if there exists a solution to x ∈ Fnq .
Show that there exists an O(n 3 ) algorithm that finds the solution if the system of equa-
tions is consistent and A has full rank (and report “fail" otherwise).

4. (m < n case) Assume that A has full rank, i.e. has a rank of m. In this scenario either the
system of equations is inconsistent or there are q n−m solutions to x. Modify the algorithm
from above to design an O(m 2 n) time algorithm to output the solutions (or report that the
system is inconsistent).

• Note that in case the system is consistent there will be q n−m solutions, which might be
much bigger than O(m 2 n). Show that this is not a problem as one can represent the
solutions as system of linear equations. (I.e. one can have n − m “free" variables and
m “bound" variables.)

5. (m > n case) Assume that A has full rank, i.e. a rank of n. In this scenario either the system
of equations is inconsistent or there is a unique solution to x. Modify the algorithm from
above to design an O(m 2 n) time algorithm to output the solution (or report that the system
is inconsistent).

6. (Non-full rank case) Give an O(m 2 n) algorithm for the general case, i.e. the m × n matrix
A need not have full rank. (The algorithm should either report that the system of equations
is inconsistent or output the solution(s) to x.)

Exercise 2.8. Prove that the span of k linearly independent vectors over Fq has size exactly q k .

Exercise 2.9. Let G and H be a generator and parity check matrix of the same linear code of
dimension k and block length n. Then G · H T = 0.

Exercise 2.10. Let C be an [n, k]q linear code with a generator matrix with no all zeros columns.
Then for every position i ∈ [n] and α ∈ Fq , the number of codewords c ∈ C such that c i = α is
exactly q k−1 .

Exercise 2.11. Prove Proposition 2.3.3.

Exercise 2.12. Prove Proposition 2.3.4.

Exercise 2.13. Prove Proposition 2.3.5.

Exercise 2.14. A set of vector S ⊆ Fnq is called t -wise independent if for every set of positions I with
|I | = t , the set S projected to I has each of the vectors in Ftq appear the same number of times. (In
other words, if one picks a vector (s 1 , . . . , s n ) from S at random then any of the t random variables
are uniformly and independently random over Fq ).
Prove that any linear code C whose dual C ⊥ has distance d ⊥ is (d ⊥ − 1)-wise independent.

46
Exercise 2.15. A set of vectors S ⊆ Fk2 is called ε-biased sample space if the following property
holds. Pick a vector X = (x 1 , . . . , x k ) uniformly at random from S. Then X has bias at most ε, that
is, for every I ⊆ [k], ¯ Ã ! Ã !¯
¯ X X ¯
¯ ¯
¯Pr x i = 0 − Pr x i = 1 ¯ ≤ ε.
¯ i ∈I i ∈I
¯

We will look at some connections of such sets to codes.

be an¢ [n,¡k]2 ¢code

1. Let C £¡ ¤ such that all non-zero codewords have Hamming weight in the
range 1−ε2
n, 1+ε
2
n . Then there exists an ε-biased space of size n in F k
2 .

£¡ 1an [n,
2. Let C be ¢ k]
¡ 12 code
¢ such
¤ that all non-zero codewords have Hamming weight in the
range 2 − γ n, 2 + γ n for some constant 0 < γ < 1/2. Then there exists an ε-biased
−1
space in Fk2 of size n O(γ ·log(1/ε))
.

Exercise 2.16. Let C be an [n, k, d ]q code. Let y = (y 1 , . . . , y n ) ∈ (Fq ∪{?})n be a received word10 such
that y i =? for at most d − 1 values of i . Present an O(n 3 ) time algorithm that outputs a codeword
c = (c 1 , . . . , c n ) ∈ C that agrees with y in all un-erased positions (i.e., c i = y i if y i 6=?) or states that
no such c exists. (Recall that if such a c exists then it is unique.)

Exercise 2.17. In the chapter, we did not talk about how to obtain the parity check matrix of a
linear code from its generator matrix. In this problem, we will look at this “conversion" procedure.

(a) Prove that any generator matrix G of an [n, k]q code C (recall that G is a k × n matrix) can
be converted into another equivalent generator matrix of the form G′ = [Ik |A], where Ik is
the k × k identity matrix and A is some k × (n − k) matrix. By “equivalent," we mean that
the code generated by G′ has a linear bijective map to C .
Note that the code generated by G′ has the message symbols as its first k symbols in the cor-
responding codeword. Such codes are called systematic codes. In other words, every linear
code can be converted into a systematic code. Systematic codes are popular in practice as
they allow for immediate access to the message symbols.

(b) Given an k × n generator matrix of the form [Ik |A], give a corresponding (n − k) × n parity
check matrix. Briefly justify why your construction of the parity check matrix is correct.

Hint: Try to think of a parity check matrix that can be decomposed into two submatrices: one will be closely
related to A and the other will be an identity matrix, though the latter might not be a k × k matrix).

Exercise 2.18. So far in this book we have seen that one can modify one code to get another code
with interesting properties (for example, the construction of the Hadamard code from the Simplex
code from Section 2.6 and Exercise 1.7). In this problem you will need to come up with more ways
of constructing new codes from existing ones.
10
A ? denotes an erasure.

47
Prove the following statements (recall that the notation (n, k, d )q code is used for general codes
with q k codewords where k need not be an integer, whereas the notation [n, k, d ]q code stands for
a linear code of dimension k):

1. If there exists an (n, k, d )2m code, then there also exists an (nm, km, d ′ ≥ d )2 code.

2. If there exists an [n, k, d ]2m code, then there also exists an [nm, km, d ′ ≥ d ]2 code.

3. If there exists an [n, k, d ]q code, then there also exists an [n − d , k − 1, d ′ ≥ ⌈d /q⌉]q code.

m
¡ an [n, k,mδn]
4. ¡If there exists ¢ qmcode,
¢ then for every m ≥ 1, there also exists an
n , k/m, 1 − (1 − δ) · n q m code.

5. If
£ there
m
¡ an [n, k,m
exists
1
δn] ¤ then for every odd m ≥ 1, there also exists an
¢ 2 code,
m
n , k, 2 · 1 − (1 − 2δ) · n 2 code.

Note: In all the parts, the only things that you can assume about the original code are only the
parameters given by its definition– nothing else!

Exercise 2.19. Let C 1 be an [n, k 1 , d 1 ]q code and C 2 be an [n, k 2 , d 2 ]q code. Then define a new
code as follows:
C 1 ⊖C 2 = {(c1 , c1 + c2 )|c1 ∈ C 1 , c2 ∈ C 2 }.

Next we will prove interesting properties of this operations on codes:

1. If G i is the generator matrix for C i for i ∈ [2], what is a generator matrix for C 1 ⊖C 2 ?

def
2. Argue that C 1 ⊖C 2 is an [2n, k 1 + k 2 , d = min(2d 1 , d 2 )]q code.

3. Assume there exists algorithms Ai for code C i for i ∈ [2] such that: (i) A1 can decode from
e errors and s erasures such that 2e + s < d 1 and (ii) A2 can decode from ⌊(d 2 − 1)/2⌋ errors.
Then argue that one can correct ⌊(d − 1)/2⌋ errors for C 1 ⊖C 2 .
Hint: Given a received word (y1 , y2 ) ∈ Fnq × Fnq , first apply A2 on y2 − y1 . Then create an intermediate received
word for A1 .

4. We will now consider a recursive construction of a binary linear code that uses the ⊖ oper-
ator. For integers 0 ≤ r ≤ m, we define the code C (r, m) as follows:

• C (r, r ) = Fr2 and C (0, r ) is the code with only two codewords: the all ones and all zeroes
vector in Fr2 .
• For 1 < r < m, C (r, m) = C (r, m − 1) ⊖C (r − 1, m − 1).

Determine the parameters of the code C (r, m).

48
Exercise 2.20. Let C 1 be an [n 1 , k 1 , d 1 ]2 binary linear code, and C 2 an [n 2 , k 2 , d 2 ] binary linear
n ×n
code. Let C ⊆ F2 1 2 be the subset of n 2 ×n 1 matrices whose rows belong to C 1 and whose columns
belong to C 2 . C is called the tensor of C 1 and C 2 and is denoted by C 1 ⊗C 2 .
Prove that C is an [n 1 n 2 , k 1 k 2 , d 1 d 2 ]2 binary linear code.
Further, if G1 and G2 are generator matrices of C 1 and C 2 , construct a genertor matrix of
C 1 ⊗ C 2 from G1 and G2 . In particular, argue that given G1 and G2 , a generator matrix of C 1 ⊗ C 2
can be computed in polynomimal time.
Hint: For the latter problem, it might be useful to think of the codewords and messages as vectors instead of matrices.

Exercise 2.21. In Section 2.4 we considered the binary Hamming code. In this problem we will
consider the more general q-ary Hamming code. In particular, let q be a prime power and r ≥ 1
be an integer. Define the following r × n matrix H q,r , where each column is an non-zero vector
from Frq such that the first non-zero entry is 1. For example,
µ ¶
0 1 1 1
H3,2 =
1 0 1 2

In this problem we will derive the parameters of the code. Define the generalized Hamming code
C H ,r,q to be the linear code whose parity check matrix is H q,r . Argue that
q r −1
1. The block length of C H ,r,q is n = q−1 .

2. C H ,q,r has dimension n − r .

3. C H ,q,r has distance 3.

Exercise 2.22. In Section 2.6, we considered the binary Hadamard code. In this problem we will
consider the more general q-ary Hadamard code. In particular, let q be a prime power and r ≥ 1
be an integer. Define the following r × q r matrix H q,r , where each columns in a vector in Frq . In
this problem we will derive the parameters of the code. Define the generalized Hadamard code
C H ad ,r,q to be the linear code whose parity check matrix is H q,r . Argue that

1. The block length of C H ad ,r,q is n = q r .

2. C H ad ,q,r has dimension r .

³ ´
3. C H ad ,q,r has distance 1 − q1 · n.

Exercise 2.23. Design the best 6-ary code (family) with distance 3 that you can.

Hint: Start with a 7-ary Hamming code.

Exercise 2.24. Prove that the [n, 1, n]2 code for odd n (i.e. the code with the all zeros and all ones
vector as it only two codewords) attains the Hamming bound (Theorem 1.7.2).

Exercise 2.25. Let C be an [n, k]q code with generator matrix G. Then given a codeword c ∈ C
one can compute the corresponding message in time O(kn 2 ).

49
Exercise 2.26. Given a c ∈ C H ,r , one can compute the corresponding message in time O(n).
Exercise 2.27. Let C be an (n, k)q code. Prove that if C can be decoded from e errors in time T (n),
then it can be decoded from n + c errors in time O((nq)c · T (n)).
Exercise 2.28. Show that the bound of kd of the number of ones in the generator matrix of any
binary linear code (see Exercise 1.14) cannot be improved for every code.
¡ ¢⊥
Exercise 2.29. Let C be a linear code. Then prove that C ⊥ = C .
Exercise 2.30. Note that for any linear code C , the codewords 0 is in both C and C ⊥ . Show that
there exists a linear code C such that it shares a non-zero codeword with C ⊥ .
Exercise 2.31. We go into a bit of diversion and look at how finite fields are different from infinite
fields (e.g. R). Most of the properties of linear subspaces that we have used for linear codes (e.g.
notion of dimension, the existence of generator and parity check matrices, notion of duals) also
hold for linear subspaces over R.11 One trivial property that holds for linear subspaces over finite
fields that does not hold over R is that linear subspaces over Fq with dimension k has size q k
(though this is a trivial consequence that Fq is a finite field while R is an infinite field). Next, we
consider a more subtle distinction.
Let S ⊆ Rn be a linear subspace over R and let S ⊥ is the dual of S. Then show that

S ∩ S ⊥ = {0} .

By contrast, linear subspaces over finite fields can have non-trivial intersection with their duals
(see e.g. Exercise 2.30).
Exercise 2.32. A linear code C is called self-orthogonal if C ⊆ C ⊥ . Show that
1. The binary repetition code with even number of repetitions is self-orthogonal.

2. The Hadamard code C H ad ,r is self-orthogonal.

Exercise 2.33. A linear code C is called self dual if C = C ⊥ . Show that for
1. Any self dual code has dimension n/2.

2. Prove that the following code is self-dual

{(x, x)|x ∈ Fk2 }.

Exercise 2.34. Given a code C a puncturing of C is another code C ′ where the same set of positions
are dropped in all codewords of C . More precisely, if C ⊆ Σn and the set of punctured positions is
P ⊆ [n], then the punctured code is {(c i )i 6∈P |(c 1 , . . . , c n ) ∈ C }.
Prove that a linear code with no repetitions (i.e. there are no two positions i 6= j such that for
every codeword c ∈ C , c i = c i ) is a puncturing of the Hadamard code. Hence, Hadamard code is
the “longest" linear code that does not repeat.
11
A linear subspace S ⊆ Rn is the same as in Definition 2.2.2 where all occurrences of the finite field Fq is replaced
by R.

50
Exercise 2.35. In this problem we will consider the long code. For the definition, we will use
the functional way of looking at the ambient space as mentioned in Remark 1.2.2. A long code
of dimension k is a binary code such that the codeword corresponding to x = Fk2 , is the function
k k
f : {0, 1}2 → {0, 1} defined as follows. For any m ∈ {0, 1}F2 , we have f ((m α )α∈Fk ) = m x . Derive the
2
parameters of the long code.
Finally, argue that the long code is the code with the longest block length such that the code-
words do not have a repeated coordinate (i.e. there does not exists i 6= j such that for every code-
word c, c i = c j ). (Contrast this with the property of Hadamard code above.)

Exercise 2.36. Given a linear code C ⊆ Fn2 , define its generating function to be a 2n-variate poly-
P
nomial over ¡Qvariables x =¢(x¡Q 1 , . . . , x n ) and y¢ = (y 1 , . . . , y n ) given by G C (x, y) = w∈C P w (x, y) where
P w (x, y) = {i ∈[n]|w i =0} x i · {i ∈[n]|w i =1} y i . For w ∈ {0, . . . , n}, let ACw denote the number of code-
P
words of weight w and let AC (z) = nw=0 ACw z w be the “weight enumerator” polynomial of C .
P
1. For every w ∈ Fn2 , prove that P w (x + y, x − y) = v∈Fn2 (−1)〈v,w〉 P v (x, y).

2. Prove that GC ⊥ (x, y) = |C1 | GC (x + y, x − y).

3. Prove that AC (z) = GC (1, . . . , 1, z, . . . , z).

n ¡ 1−z ¢
4. Prove that AC ⊥ (z) = (1+z)
|C | A C 1+z .
P ¡P ¡ ¢¡ ¢¢
5. Conclude that ACw⊥ = |C1 | nu=0 ACw ui=0 (−1)i ui n−u w−i . In other words, the distributions
of the weights (AC0 , . . . , ACn ) of the primal code completely determine the distributions of
weights (AC0 ⊥ , . . . , ACn ⊥ ) of the dual code!

2.8 Bibliographic Notes

The background material on algebra is essentially folklore. Readers interested in a more exten-
sive treatment are referred to classical texts such as by Artin [10]. For a perspective focussing
more on finite fields, see the text by Lidl and Niederreiter [106]. Linear codes arose already in the
paper of Hamming [83] and were systematically studied by Slepian [145]. The answer to Ques-
tion 1.7.1 was given by van Lint [158] and Tietavainen [156]. Hadamard codes (Definition 2.6.2)
are named after the work of mathematician Jacques Hadamard and in particular the notion of
Hadamard matrices which are self-orthogonal matrices with +1/-1 entries.
Exercises 2.14 and 2.15 come from the theory of pseudorandomness, which we will cover
more extensively in Chapter 24. The long codes in Exercise 2.35 were introduced by Bellare, Gol-
dreich and Sudan [16]. Exercise 2.36 is based on the MacWilliams Identity proved by MacWilliams [110].

51
52
Chapter 3

Probability as Fancy Counting and the q-ary

Entropy Function

In the chapters to come we will explore questions of the form: “Given n, k, d and q does an
(n, k, d )q code exist?” To answer such questions, we will apply the “probabilistic method” —
the method that demonstrates the existence of an object with a given property by showing that
a randomly chosen object has the property with positive probability. To elaborate on this sen-
tence, we need to introduce the basic language and tools of probability theory which we do in
Section 3.1.
We then introduce the probabilistic method in Section 3.2. We even apply the method to
answer a very simple question:

Question 3.0.1. Does there exist a [2, 2, 1]2 code?

We note that the answer to the above question is trivially yes: just pick the generator matrix
to be the 2 × 2 identity matrix. But our proof will have the advantage of generalizing to broader
settings, though we save the generalizations for later chapters.
Finally in Section 3.3 we introduce the “entropy function” which turns out to be central in
the understanding of limits of codes (both existence and non-existence).

3.1 A Crash Course on Probability

In this section we review basic concepts in probability theory, specialized to the needs of this
book. Specifically„ we introduce distributions, events and random variables, and give some
tools to analyze them.
In this book, we will only consider probability distributions defined over finite spaces. In
particular, given a finite domain D, a probability distribution is defined as a function
X
p : D → [0, 1] such that p(x) = 1,
x∈D

53
G U (G) V00 V01 V10 V11 G U (G) V00 V01 V10 V11
µ ¶ µ ¶
0 0 1 1 0 1
16
0 0 0 0 16
0 0 1 1
0 0 0 0
µ ¶ µ ¶
0 0 1 1 0 1
16 0 1 0 1 16 0 1 1 2
0 1 0 1
µ ¶ µ ¶
0 0 1 1 0 1
16 0 1 0 1 16 0 1 1 0
1 0 1 0
µ ¶ µ ¶
0 0 1 1 0 1
16 0 2 0 2 16 0 2 1 1
1 1 1 1
µ ¶ µ ¶
0 1 1 1 1 1
16 0 0 1 1 16 0 0 2 2
0 0 0 0
µ ¶ µ ¶
0 1 1 1 1 1
16
0 1 1 0 16
0 1 2 1
0 1 0 1
µ ¶ µ ¶
0 1 1 1 1 1
16 0 1 1 2 16 0 1 2 1
1 0 1 0
µ ¶ µ ¶
0 1 1 1 1 1
16 0 2 1 1 16 0 2 2 0
1 1 1 1

Table 3.1: Uniform distribution over F22×2 along with values of four random variables.

where [0, 1] is shorthand for the interval of all real numbers between 0 and 1.
An event E is a predicate over the domain D, i.e. it maps every element of D to “true” or “false”.
Equivalently an event is a subset of the domain D, i.e., those elements that are mapped to true.
We switch between “logical” or ”set-theoretic” notation to denote combinations of events. So
the disjunction of events E1 and E2 may be denoted E1 ∨ E2 or E1 ∪ E2 . Similarly, the conjunction
of E1 and E2 may be denoted E1 ∧ E2 or E1 ∩ E2 ; and the negation of E1 may be denote ¬E1 or E1 .
In this book, we will primarily deal with the following special distribution:

Definition 3.1.1 (Uniform Distribution). The uniform distribution over D, denoted by U D , is

given by
1
Pr (x) = for every x ∈ D.
UD |D|
Typically, we will drop the subscript when the domain D is clear from the context.

For example, consider the domain D = F22×2 , i.e. the set of all 2 × 2 matrices over F2 . (Note
that each such matrix is a generator matrix of some [2, 2]2 code.) The first two columns of Ta-
ble 3.1 list the elements of this D along with the corresponding probabilities for the uniform
distribution.
Typically, we will be interested in a real-valued function defined on D and how it behaves
under a probability distribution defined over D. This is captured by the notion of a random
variable1 :
1
We note that the literature on probability theory allows for more general random variables, but for our purposes
we restrict only to real-valued ones.

54
Definition 3.1.2 (Random Variable). Let D be a finite domain and I ⊂ R be a finite2 subset. Let p
be a probability distribution defined over D. A random variable is a function:

V : D → I.

The expectation of V is defined as

X
E[V ] = p(x) · V (x).
x∈D
¡ ¢
For example, given (i , j ) ∈ {0, 1}2 , let Vi j denote the random variable Vi j (G) = w t (i , j ) · G ,
for any G ∈ F22×2 . The last four columns of Table 3.1 list the values of these four random variables.
Of particular interest in this book will be binary random variables, i.e., with I = {0, 1}. In
particular, given an event E over D, we will define its indicator variable to be a function 1E : D →
{0, 1} such that for any x ∈ D: (
1 if x ∈ E
1E (x) =
0 otherwise.
For example, µµ ¶¶ µµ ¶¶
0 1 0 1
1V01 =0 = 1 and 1V01 =0 = 0.
0 0 1 1
In most cases we will shorten this notation to 1E (x) or simply 1E . Finally, sometimes we will
abuse notation and use E instead of 1E .
As a further use of indicator variables, consider the expectations of the four indicator vari-
ables:
£ ¤ 1
E 1V00 =0 = 16 · = 1.
16
£ ¤ 1 1
E 1V01 =0 = 4 · = . (3.1)
16 4
£ ¤ 1 1
E 1V10 =0 = 4 · = . (3.2)
16 4
£ ¤ 1 1
E 1V11 =0 = 4 · = . (3.3)
16 4

3.1.1 Some Useful Results

Before we proceed, we record a simple property of indicator variables that will be useful. (See
Exercise 3.1.)

Lemma 3.1.3. Let E be any event. Then

E [1E ] = Pr [E is true] .

Next, we state a simple yet useful property of expectation of a sum of random variables:
2
In general, I need not be finite. However, for this book this definition suffices.

55
Proposition 3.1.4 (Linearity of Expectation). Given random variables V1 , . . . ,Vm defined over the
same domain D and with the same probability distribution p, we have
" #
m
X Xm
E Vi = E [Vi ] .
i =1 i =1

Proof. For notational convenience, define V = V1 + · · · + Vm . Thus, we have

X
E[V ] = V (x) · p(x) (3.4)
x∈D
Ã !
X Xm
= Vi (x) · p(x) (3.5)
x∈D i =1
Xm X
= Vi (x) · p(x) (3.6)
i =1 x∈D
Xm
= E[Vi ]. (3.7)
i =1

In the equalities above, (3.4) and (3.7) follow from the definition of expectation of a random
variable. (3.5) follows from the definition of V and (3.6) follows by switching the order of the
two summations.

As an example, we have
£ ¤ 3
E 1V01 =0 + 1V10 =0 + 1V11 =0 = (3.8)
4
Frequently, we will need to deal with the probability of the union of events. We will use the
following result to upper bound such probabilities:
Proposition 3.1.5 (Union Bound). Given m binary random variables A 1 , . . . , A m , we have
"Ã ! #
m
_ m
X
Pr Ai = 1 ≤ Pr [A i = 1] .
i =1 i =1

Proof. For every i ∈ [m], define

S i = {x ∈ D|A i (x) = 1}.
Then we have
"Ã ! #
m
_ X
Pr Ai = 1 = p(x) (3.9)
i =1 x∈∪m S
i =1 i
m X
X
≤ p(x) (3.10)
i =1 x∈S i
Xm
= Pr[A i = 1]. (3.11)
i =1

In the above, (3.9) and (3.11) follow from the definition of S i . (3.10) follows from the fact that
some of the x ∈ ∪i S i get counted more than once.

56
We remark that the union bound is tight when the events are disjoint. (In other words, using
the notation in the proof above, when S i ∩ S j = ; for every i 6= j .)
As an example, let A 1 = 1V01 =0 , A 2 = 1V10 =0 and A 3 = 1V11 =0 . Note that in this case the event
A 1 ∨ A 2 ∨ A 3 is the same as the event that there exists a non-zero m ∈ {0, 1}2 such that w t (m·G) =
0. Thus, the union bound implies (that under the uniform distribution over F22×2 )

£ ¤ 3
Pr There exists an m ∈ {0, 1}2 \ {(0, 0)}, such that w t (mG) = 0 ≤ . (3.12)
4

Finally, we present two bounds on the probability of a random variable deviating signifi-
cantly from its expectation. The first bound holds for any random variable:

Lemma 3.1.6 (Markov Bound). Let V be a non-negative random variable. Then for any t > 0,

E[V ]
Pr[V ≥ t ] ≤ .
t

In particular, for any a ≥ 1,

1
Pr[V ≥ a · E[V ]] ≤ .
a

Proof. The second bound follows from the first bound by substituting t = a · E[V ]. Thus, to
complete the proof, we argue the first bound. Consider the following sequence of relations:
X X
E[V ] = i · Pr[V = i ] + i · Pr[V = i ] (3.13)
i ∈[0,t ) i ∈[t ,∞)
X
≥ i · Pr[V = i ] (3.14)
i ≥t
X
≥t· Pr[V = i ] (3.15)
i ≥t
= t · Pr[V ≥ t ]. (3.16)

In the above relations, (3.13) follows from the definition of expectation of a random variable and
the fact that V is non-negative. (3.14) follows as we have dropped some non-negative terms.
(3.15) follows by noting that in the summands i ≥ t . (3.16) follows from the definition of Pr[V ≥
t ].
The proof is complete by noting that (3.16) implies the claimed bound.

The second bound works only for sums of independent random variables. We begin by
defining independent random variables:

Definition 3.1.7 (Independence). Two random variables A and B are called independent if for
every a and b in the ranges of A and B respectively, we have

Pr[A = a ∧ B = b] = Pr[A = a] · Pr[B = b].

57
For example, for the uniform distribution in Table 3.1, let A denote the bit G 0,0 and B denote
the bit G 0,1 . It can be verified that these two random variables are independent. In fact, it can be
verified all the random variables corresponding to the four bits in G are independent random
variables. (We’ll come to a related comment shortly.)
Another related concept that we will use is that of probability of an event happening condi-
tioned on another event happening:

Definition 3.1.8 (Conditional Probability). Given two events A and B defined over the same do-
main and probability distribution, we define the probability of A conditioned on B as

Pr[A and B ]
Pr[A|B ] = .
Pr[B ]

For example, note that

4/16 1
Pr[1V01 =1 |G 0,0 = 0] = = .
1/2 2
The above definition implies that two events A and B are independent if and only if Pr[A] =
Pr[A|B ]. We will also use the following result later on in the book (see Exercise 3.2):

Lemma 3.1.9. For any two events A and B defined on the same domain and the probability
distribution:
Pr[A] = Pr[A|B ] · Pr[B ] + Pr[A|¬B ] · Pr[¬B ].

Next, we state a deviation bound that asserts that the sum of independent random variables
takes values close to its expectation with high probability. We only state it for sums of binary
random variables, which is the form that will be needed in the book. We refer to this bound as
the “Chernoff bound” though we note that this is part of a larger body of work and the biblio-
graphic notes give more details.

Theorem 3.1.10 (Chernoff Bound). Let X 1 , . . . , X m be independent binary random variables and
P
define X = X i . Then the multiplicative Chernoff bound sates that for 0 < ε ≤ 1,
2
E(X )/3
Pr [|X − E(X )| > εE(X )] < 2e −ε ,

and the additive Chernoff bound states that

2
m/2
Pr [|X − E(X )| > εm] < 2e −ε .

We omit the proof, which can be found in any standard textbook on randomized algorithms.
Finally, we present an alternate view of uniform distribution over product spaces and then
use that view to prove a result that we will use later in the book. Given probability distributions
p 1 and p 2 over domains D1 and D2 respectively, we define the product distribution p 1 × p 2 over
D1 × D2 as follows: every element (x, y) ∈ D1 × D2 under p 1 × p 2 is picked by choosing x from
D1 according to p 1 and y is picked independently from D2 under p 2 . This leads to the following
observation (see Exercise 3.4).

58
Lemma 3.1.11. For any m ≥ 1, the distribution U D1 ×D2 ×···×Dm is identical3 to the distribution
U D1 × U D2 × · · · × U Dm .
For example, the uniform distribution in Table 3.1 can be described equivalently as follows:
pick each of the four bits in G independently and uniformly at random from {0, 1}.
We conclude this section by proving the following result:
Lemma 3.1.12. Given a non-zero vector m ∈ Fkq and a uniformly random k × n matrix G over Fq ,
the vector m · G is uniformly distributed over Fnq .
Proof. Let the ( j , i )th entry in G (1 ≤ j ≤ k, 1 ≤ i ≤ n) be denoted by g j i . Note that as G is a ran-
dom k ×n matrix over Fq , by Lemma 3.1.11, each of the g j i is an independent uniformly random
element from Fq . Now, note that we would be done if we can show that for every 1 ≤ i ≤ n, the
i th entry in m · G (call it b i ) is an independent uniformly random element from Fq . To finish
P
the proof, we prove this latter fact. If we denote m = (m 1 , . . . , m k ), then b i = kj=1 m j g j i . Note
that the disjoint entries of G participate in the sums for b i and b j for i 6= j . Given our choice of
G, this implies that the random variables b i and b j are independent. Hence, to complete the
proof we need to prove that b i is a uniformly independent element of Fq . The rest of the proof
is a generalization of the argument we used in the proof of Proposition 2.6.3.
Note that to show that b i is uniformly distributed over Fq , it is sufficient to prove that b i
takes every value in Fq equally often over all the choices of values that can be assigned to
g 1i , g 2i , . . . , g ki . Now, as m is non-zero, at least one of the its element is non-zero. Without
P
loss of generality assume that m 1 6= 0. Thus, we can write b i = m 1 g 1i + kj=2 m j g j i . Now, for
every fixed assignment of values to g 2i , g 3i , . . . , g ki (note that there are q k−1 such assignments),
b i takes a different value for each of the q distinct possible assignments to g 1i (this is where we
use the assumption that m 1 6= 0). Thus, over all the possible assignments of g 1i , . . . , g ki , b i takes
each of the values in Fq exactly q k−1 times, which proves our claim.

3.2 The Probabilistic Method

The probabilistic method is a very powerful method in combinatorics which can be used to
show the existence of objects that satisfy certain properties. In this course, we will use the prob-
abilistic method to prove existence of a code C with certain property P . Towards that end, we
define a distribution D over all possible codes and prove that when C is chosen according to D:
£ ¤ £ ¤
Pr C has property P > 0 or equivalently Pr C doesn’t have property P < 1.
Note that the above inequality proves the existence of C with property P .
As an example consider Question 3.0.1. To answer this in the affirmative, we note that the
set of all [2, 2]2 linear codes is covered by the set of all 2 × 2 matrices over F2 . Then, we let D be
the uniform distribution over F22×2 . Then by Proposition 2.3.6 and (3.12), we get that
3
Pr [There is no [2, 2, 1]2 code] ≤ < 1,
U F2×2 4
2

3
We say two distributions p 1 and p 2 on D are identical if for every x ∈ D, p 1 (x) = p 2 (x).

59
which by the probabilistic method answers the Question 3.0.1 in the affirmative.
For the more general case, when we apply the probabilistic method, the typical approach
will be to define (sub-)properties P 1 , . . . , P m such that P = P 1 ∧P 2 ∧P 3 . . .∧P m and show that for
every 1 ≤ i ≤ m:
£ ¤ h i 1
Pr C doesn’t have property P i = Pr P i < .
m
4
£ ¤
Finally, by the union bound, the above will prove that Pr C doesn’t have property P < 1, as
desired.
As an example, an alternate way to answer Question 3.0.1 in the affirmative is the following.
Define P 1 = 1V01 ≥1 , P 2 = 1V10 ≥1 and P 3 = 1V11 ≥1 . (Note that we want a [2, 2]2 code that satisfies
P 1 ∧ P 2 ∧ P 3 .) Then, by (3.1), (3.2) and (3.3), we have for i ∈ [3],
£ ¤ h i 1 1
Pr C doesn’t have property P i = Pr P i = < ,
4 3
as desired.
Finally, we mention a special case of the general probabilistic method that we outlined
above. In particular, let P denote the property that the randomly chosen C satisfies f (C ) ≤ b.
Then we claim (see Exercise 3.5) that E[ f (C )] ≤ b implies that Pr[C has property P ] > 0. Note
that this implies that E[ f (C )] ≤ b implies that there exists a code C such that f (C ) ≤ b.

3.3 The q-ary Entropy Function

Finally, in this chapter we introduce a fundamental function — the “entropy” function — that
plays a central role in the analysis of the limits of codes. For example, in Section 4.1 of Chapter 4
we will show how this function captures an upper bound on the rate of codes as a function of
the relative distance. Later in Section 4.2 of Chapter 4 we will see that this function captures a
lower bound on the rate of codes obtained by the probabilistic method.
We begin with the definition of the entropy function.
Definition 3.3.1 (q-ary Entropy Function). Let q be an integer and x be a real number such that
q ≥ 2 and 0 ≤ x ≤ 1. Then the q-ary entropy function is defined as follows:
H q (x) = x logq (q − 1) − x logq (x) − (1 − x) logq (1 − x).
Figure 3.1 presents a pictorial representation of the H q function for the first few values of q.
For the special case of q = 2, we will drop the subscript from the entropy function and denote
H2 (x) by just H (x), that is, H (x) = −x log x − (1 − x) log(1 − x), where log x is defined as log2 (x)
(we are going to follow this convention for the rest of the book).
Under the lens of Shannon’s entropy function, H (x) denotes the entropy of the distribution
over {0, 1} that selects 1 with probability x and 0 with probability 1 − x. However, there is no
similar analogue for the more general H q (x). The reason why this quantity will turn out to be
so central in this book is that it is very closely related to the “volume" of a Hamming ball. We
make this connection precise in the next subsection.
4
Note that P = P 1 ∨ P 2 ∨ · · · ∨ P m .

60
1
q=2
q=3
0.9 q=4

0.8

0.7

Hq(x) ---> 0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.2 0.4 0.6 0.8 1
x --->

Figure 3.1: A plot of H q (x) for q = 2, 3 and 4. The maximum value of 1 is achieved at x = 1−1/q.

3.3.1 Volume of Hamming Balls

It turns out that in many of our combinatorial results, we will need good upper and lower
bounds on the volume of a Hamming ball. Next we formalize the notion of the volume of a
Hamming ball:

Definition 3.3.2 (Volume of a Hamming Ball). Let q ≥ 2 and n ≥ r ≥ 1 be integers. Then the
volume of a Hamming ball of radius r is given by
Ã !
X r n
V ol q (r, n) = |B q (0, r )| = (q − 1)i .
i =0 i

The choice of 0 as the center for the Hamming ball above was arbitrary: since the volume
of a Hamming ball is independent of its center (as is evident from the last equality above), we
could have picked any point as the center.
We will prove the following result:

Proposition 3.3.3. Let q ≥ 2 be an integer and 0 ≤ p ≤ 1 − q1 be a real number. Then:

(i) V ol q (pn, n) ≤ q Hq (p)n ; and

(ii) for large enough n, V ol q (pn, n) ≥ q Hq (p)n−o(n) .

Proof. We start with the proof of (i). Consider the following sequence of relations:

1 = (p + (1 − p))n

61
Ã !
X n n
= p i (1 − p)n−i (3.17)
i =0 i
Ã ! Ã !
Xpn Xn
n i n−i n i
= p (1 − p) + p (1 − p)n−i
i =0 i i =pn+1 i
Ã !
Xpn
n i
≥ p (1 − p)n−i (3.18)
i =0 i
Ã ! µ ¶
pn
X n i p i
= (q − 1) (1 − p)n−i
i =0 i q − 1
Ã ! µ ¶i
Xpn
n i n p
= (q − 1) (1 − p)
i =0 i (q − 1)(1 − p)
Ã ! µ ¶pn
Xpn
n i n p
≥ (q − 1) (1 − p) (3.19)
i =0 i (q − 1)(1 − p)
Ã ! µ ¶
pn
X n p pn
= (q − 1)i (1 − p)(1−p)n (3.20)
i =0 i q − 1
≥ V ol q (pn, n)q −Hq (p)n . (3.21)

In the above, (3.17) follows from the binomial expansion. (3.18) follows by dropping the second
p
sum and (3.19) follows from the facts that (q−1)(1−p) ≤ 1 (as5 p ≤ 1−1/q). Rest of the steps except
³ ´
p pn
(3.21) follow from rearranging the terms. (3.21) follows as q −Hq (p)n = q−1 (1 − p)(1−p)n .
(3.21) implies that
1 ≥ V ol q (pn, n)q −Hq (p)n ,

which proves (i).

We now turn to the proof of part (ii). For this part, we will need Stirling’s approximation for
n! (Lemma B.1.2).
By the Stirling’s approximation, we have the following inequality:
Ã !
n n!
=
pn (pn)!((1 − p)n)!
(n/e)n 1
> ·p · e λ1 (n)−λ2 (pn)−λ2 ((1−p)n)
(pn/e)pn ((1 − p)n/e)(1−p)n 2πp(1 − p)n
1
= · ℓ(n), (3.22)
p pn (1 − p)(1−p)n
λ1 (n)−λ2 (pn)−λ2 ((1−p)n)
where ℓ(n) = e p .
2πp(1−p)n

5 p p q−1 q−1
Indeed, note that (q−1)(1−p) ≤ 1 is true if 1−p ≤ 1 , which in turn is true if p ≤ q , where the last step follows
from Lemma B.2.1.

62
Now consider the following sequence of relations that complete the proof:
Ã !
n
V ol q (pn, n) ≥ (q − 1)pn (3.23)
pn
(q − 1)pn
> · ℓ(n) (3.24)
p pn (1 − p)(1−p)n
≥ q Hq (p)n−o(n) . (3.25)

In the above (3.23) follows by only looking at the last term in the sum that defined V ol q (pn, n).
(3.24) follows from (3.22) while (3.25) follows from the definition of H q (·) and the fact that for
large enough n, ℓ(n) is q −o(n) .

Next, we consider how the q-ary entropy function behaves for various ranges of its parame-
ters.

3.3.2 Other Properties of the q-ary Entropy function

This section uses asymptotic analysis in few places. Reader who wish to brush up their knowl-
edge of asymptotic analysis are referred to Appendix C.1.
We begin by recording the behavior of the q-ary entropy function for large q.

Proposition 3.3.4. For small enough ε, 1 − H q (ρ) ≥ 1 − ρ − ε for every 0 < ρ ≤ 1 − 1/q if and only
if q is 2Ω(1/ε) .

Proof. We first note that by definition of H q (ρ) and H (ρ),

H q (ρ) = ρ logq (q − 1) − ρ logq ρ − (1 − ρ) logq (1 − ρ)

= ρ logq (q − 1) + H (ρ)/ log2 q.

Now if q ≥ 21/ε , we get that

H q (ρ) ≤ ρ + ε
as logq (q − 1) ≤ 1 and H (ρ) ≤ 1. Thus, we have argued that for q ≥ 21/ε , we have 1 − H q (ρ) ≥
1 − ρ − ε, as desired.
Next, we consider the case when q = 2o(1/ε) . We begin by claiming that for small enough ε,

if q ≥ 1/ε2 then logq (q − 1) ≥ 1 − ε.

³ ´
1
Indeed, logq (q − 1) = 1 + (1/ ln q) ln(1 − 1/q) = 1 − O q ln q ,6 which is at least 1 − ε for q ≥ 1/ε2
(and small enough ¡ε).¢
1
Finally, if q = 2o ε , then for fixed ρ,

H (ρ)/ log q = ε · ω(1).

6
The last equality follows from the fact that by Lemma B.2.2, for 0 < x < 1, ln(1 − x) = −O(x).

63
¡1¢
Then for q = 2o ε (but q ≥ 1/ε2 ) we have

ρ logq (q − 1) + H (ρ)/ log q ≥ ρ − ε + ε · ω(1) > ρ + ε,

which implies that

1 − H q (ρ) < 1 − ρ − ε,
as desired. For q ≤ 1/ε2 , Lemma 3.3.5 shows that 1 − H q (ρ) ≤ 1 − H1/ε2 (ρ) < 1 − ρ − ε, as desired.

We will also be interested in how H q (x) behaves for fixed x and increasing q:

Lemma 3.3.5. Let q ≥ 2 be an integer and let 0 ≤ ρ ≤ 1 − 1/q, then for any real m ≥ 1 such that
µ ¶q−1
m−1 1
q ≥ 1+ , (3.26)
q −1

we have
H q (ρ) ≥ H q m (ρ).

Proof. Note that H q (0) = H q m (0) = 0. Thus, for the rest of the proof we will assume that ρ ∈
(0, 1 − 1/q].
As observed in the proof of Proposition 3.3.4, we have

log(q − 1) 1
H q (ρ) = ρ · + H (ρ) · .
log q log q

Using this, we obtain

µ ¶ µ ¶
log(q − 1) log(q m − 1) 1 1
H q (ρ) − H q m (ρ) = ρ − + H (ρ) − .
log q m log q log q m log q

The above in turn implies that

1 H (ρ)
· m log q · (H q (ρ) − H q m (ρ)) = log(q − 1)m − log(q m − 1) + (m − 1)
ρ ρ
H (1 − 1/q)
≥ log(q − 1)m − log(q m − 1) + (m − 1) (3.27)
1 − 1/q
µ ¶
m m q log q
= log(q − 1) − log(q − 1) + (m − 1) log +
q −1 q −1
µ µ ¶ ¶
(q − 1)m q m−1 m−1
= log · ·q q−1
qm − 1 q −1
Ã m−1 !
(q − 1) · q m−1 · q q−1
= log
qm − 1
≥0 (3.28)

64
In the above (3.27) follows from the fact that H (ρ)/ρ is decreasing7 in ρ and that ρ ≤ 1 − 1/q.
(3.28) follows from the claim that
m−1
(q − 1) · q q−1 ≥ q.
Indeed the above follows from (3.26).
Finally, note that (3.28) completes the proof.

Since (1 + 1/x)x ≤ e (by Lemma B.2.5), we also have that (3.26) is also satisfied for m ≥ 1 +
1
Further, we note that (3.26) is satisfied for every m ≥ 2 (for any q ≥ 3), which leads to the
ln q .
following (also see Exercise 3.6):

Corollary 3.3.6. Let q ≥ 3 be an integer and let 0 ≤ ρ ≤ 1 − 1/q, then for any m ≥ 2, we have

H q (ρ) ≥ H q m (ρ).

Next, we look at the entropy function when its input is very close to 1.

Proposition 3.3.7. For small enough ε > 0,

µ ¶
1
H q 1 − − ε ≤ 1 − c q ε2 ,
q

where c q is a constant that only depends on q.

Proof. The intuition behind the proof is the following. Since the derivative of H q (x) is zero at
x = 1 − 1/q, in the Taylor expansion of H q (1 − 1/q − ε) the ε term will vanish. We will now make
this intuition more concrete. We will think of q as fixed and 1/ε as growing. In particular, we
will assume that ε < 1/q. Consider the following equalities:
µ ¶ µ ¶ µ ¶ µ ¶
1 1 − 1/q − ε 1 1
H q (1 − 1/q − ε) = − 1 − − ε logq − + ε logq +ε
q q −1 q q
µ µ ¶¶ µ ¶ µ ¶
1 εq 1 1 − (εq)/(q − 1)
= − logq 1− + + ε logq
q q −1 q 1 + εq
· µ ¶ µ ¶ µ ¶¸
1 εq 1 1 − (εq)/(q − 1)
= 1− ln 1 − − + ε ln
ln q q −1 q 1 + εq
· 2 2 µ ¶µ
2 1 εq ε q 1 εq
= 1 + o(ε ) − − − 2
− +ε −
ln q q − 1 2(q − 1) q q −1
2 2 2 2 ¶¸
ε q ε q
− 2
− εq + (3.29)
2(q − 1) 2
·
2 1 εq ε2 q 2
= 1 + o(ε ) − − −
ln q q − 1 2(q − 1)2
7
Indeed, H (ρ)/ρ = log(1/ρ) − (1/ρ − 1) log(1 − ρ). Note that the first term is decreasing in ρ. We claim that the
second term is also decreasing in ρ – this e.g. follows from the observation that −(1/ρ − 1) ln(1 − ρ) = (1 − ρ)(1 +
ρ/2! + ρ 2 /3! + · · · ) = 1 − ρ/2 − ρ 2 (1/2 − 1/3!) − · · · is also decreasing in ρ.

65
µ ¶µ ¶¸
1 εq 2 ε2 q 3 (q − 2)
− +ε − +
q q −1 2(q − 1)2
· ¸
2 1 ε2 q 2 ε2 q 2 ε2 q 2 (q − 2)
= 1 + o(ε ) − − + − (3.30)
ln q 2(q − 1)2 q − 1 2(q − 1)2
ε2 q 2
= 1− + o(ε2 )
2 ln q(q − 1)
ε2 q 2
≤ 1−
4 ln q(q − 1)
(3.31)

(3.29) follows from the fact that for |x| < 1, ln(1 + x) = x − x 2 /2 + x 3 /3 − . . . (Lemma B.2.2) and
by collecting the ε3 and smaller terms in o(ε2 ). (3.30) follows by rearranging the terms and by
absorbing the ε3 terms in o(ε2 ). The last step is true assuming ε is small enough.

Next, we look at the entropy function when its input is very close to 0.

Proposition 3.3.8. For small enough ε > 0,

µ µ ¶¶
1 1
H q (ε) = Θ · ε log .
log q ε

Proof. By definition

H q (ε) = ε logq (q − 1) + ε logq (1/ε) + (1 − ε) logq (1/(1 − ε)).

Since all the terms in the RHS are positive we have

H q (ε) ≥ ε log(1/ε)/ log q. (3.32)

Further, by Lemma B.2.2, (1 − ε) logq (1/(1 − ε)) ≤ 2ε/ ln q for small enough ε. Thus, this implies
that µ ¶
2 + ln(q − 1) 1 1
H q (ε) ≤ ·ε+ · ε ln . (3.33)
ln q ln q ε
(3.32) and (3.33) proves the claimed bound.

We will also work with the inverse of the q-ary entropy function. Note that H q (·) on the
domain [0, 1 − 1/q] is a bijective map into [0, 1]. Thus, we define H q−1 (y) = x such that H q (x) = y
and 0 ≤ x ≤ 1 − 1/q. Finally, we will need the following lower bound:

Lemma 3.3.9. For every 0 < y ≤ 1 − 1/q and for every small enough ε > 0,

H q−1 (y − ε2 /c q′ ) ≥ H q−1 (y) − ε,

where c q′ ≥ 1 is a constant that depends only on q.

66
Proof. It is easy to check that H q−1 (y) is a strictly increasing convex function when y ∈ [0, 1].
This implies that the derivative of H q−1 (y) increases with y. In particular, (H q−1 )′ (1) ≥ (H q−1 )′ (y)
for every 0 ≤ y ≤ 1. In other words, for every 0 < y ≤ 1, and (small enough) δ > 0,

H q−1 (y) − H q−1 (y − δ) H q−1 (1) − H q−1 (1 − δ)

≤ .
δ δ

Proposition 3.3.7 along with the facts that H q−1 (1) = 1−1/q and H q−1 is increasing completes the
proof if one picks c q′ = max(1, 1/c q ) and δ = ε2 /c q′ .

3.4 Exercises
Exercise 3.1. Prove Lemma 3.1.3.

Exercise 3.2. Prove Lemma 3.1.9.

Exercise 3.3. In this exercise, we will see a common use of the Chernoff bound (Theorem 3.1.10).
Say we are trying to determine an (unknown) value x ∈ F to which we have access to via a ran-
domized algorithm A that on input (random) input r ∈ {0, 1}m outputs an estimate A (r) of x
such that
1
Pr [A(r) = x] ≥ + γ,
r 2
³ ´
1
for some 0 < γ < 2 . Then show that for any t ≥ 1 with O γt2 calls to A one can determine x with
probability at least 1 − e −t .
Hint: Call A with independent random bits and take majority of the answer and then use the Chernoff bound.

Exercise 3.4. Prove Lemma 3.1.11.

Exercise 3.5. Let P denote the property that the randomly chosen C satisfies f (C ) ≤ b. Then
E[ f (C )] ≤ b implies that Pr[C has property P ] > 0.

Exercise 3.6. Prove that for any Q ≥ q ≥ 2 and ρ ≤ 1 − 1/q, we have HQ (ρ) ≤ H q (ρ).
¡ ¢
Exercise 3.7. Prove that for p < 21 , we have H2 (p) ≤ O p log p .

3.5 Bibliographic Notes

The Chernoff bounds of this chapter come from a family of bounds on the concentration of
sums of random variables around their expectation. They originate with the work of Cher-
noff [31] though Chernoff himself attributes the bound to personal communication with Ru-
bin [14, Page 340]. These bounds and variations are ubiquitous in information theory and com-
puter science — see for instance [35, 117, 115]. Proofs of various concentration bounds can e.g.
be found in [40].

67
The use of the probabilistic method in combinatorics seems to have originated in the early
40s and became especially well known after works of Erdös, notably [51]. Shannon’s adoption
of the method in [138] is one of the first applications in a broader setting. For more on the
probabilistic method, see the book by Alon and Spencer [6].
The entropy function also dates back to Shannon [138]. Shannon’s definition is more gen-
eral and applies to discrete random variables. Our specialization to a two parameter function
(namely a function of q and p) is a special case derived from applying the original definition to
some special random variables.

68
Part II

The Combinatorics

69
Overview of the Combinatorics

Now that we have described the overaching goal of this book, and reviewed some background
mathematics, we are now ready to explore our first quest, namely to understand the ‘limits of
error-correcting codes.’ More specifically we will ask, given an alphabet parameter q and block
length n, what values of dimension k and distance d are simultaneously achievable. Alternately
we may study the relative versions, namely the rate nk and the relative distance dn that can be
achieved in the asymptotic limit where n → ∞.
Chapter 4 will mostly describe simple ‘negative results,’ i.e., the impossibility of the exis-
tence of codes with certain choices of parameters. The chapter will also include some existen-
tial results, which shows that some choices of parameters are achievable, without necessarily
showing how to ‘construct’ such codes. (As an aside, Chapter 6 includes formal mathematical
definitions of explicit constructions (Definition 6.3.4) and even strongly explicit constructions
(Definition 6.3.5).) The results in Chapter 4 are chosen mostly due to the simplicity and versa-
tility of the proof techniques which could be applied in broad contexts beyond coding theory.
But these are not necessarily tight results, nor are they even the best known results. Some of the
best-known asymptotic results appear without proof in Chapter 8, which also includes some
other results with slightly more sophisticated proofs. One negative result from Chapter 4 with a
pretty simple proof turns out to be tight over large alphabets. The codes showing this tightness
are the famed Reed-Solomon codes whose analysis is also extremely elegant. These codes are
introduced in Chapter 5.
Finally in this part we also introduce two ‘weaker’ models of errors and correction that are
related to the Hamming model. Whereas in the Hamming model our focus was on correcting
all patterns of errors subject to a limit on the total number of errors, Chapter 6 considers errors
that are generated randomly and explores the limits when the goal is to correct these errors in
most cases (i.e., with high probability over the randomness of the errors). This leads us to the
ground breaking work of Shannon and this chapter reviews his work. It turns out the tradeoffs
between the rate and the (typical) number of errors that can be corrected in this setting is tightly
understood; and this model does achieve better tradeoffs for many settings of the parameters.
We note that despite the difference in objectives, error-correcting codes (as studied in the rest
of this book) continue to be the primary tool to correct random errors.
Finally in Chapter 7, we relax the notion of recovery in a different way. Instead of weakening
the model only to consider random error patterns, we return to the notion of worst-case error
patterns. But we relax the notion of recovery to allow the decoding algorithm to output a small
list of codewords and deem the recovery to be successful if this list includes the transmitted

71
codeword. This notion, called list-decoding, turns out to be a very useful bridge between the
Hamming and Shannon models of recovery; and also leads to new insights on the combinatorial
limits of error-correcting codes.

72
Chapter 4

What Can and Cannot Be Done-I

In this chapter, we will try to tackle Question 1.8.1. We will approach this trade-off in the fol-
lowing way:
If we fix the relative distance of the code to be δ, what is the best rate R that we can
achieve?
While we will not be able to pin down the exact optimal relationship between R and δ, we will
start establishing some limits. Note that an upper bound on R is a negative result in that it
establishes that codes with certain parameters do not exist. Similarly, a lower bound on R is a
positive result.
In this chapter, we will consider only one positive result, i.e. a lower bound on R called the
Gilbert-Varshamov bound in Section 4.2. In Section 4.1, we recall a negative result that we have
already seen– Hamming bound and state its asymptotic version to obtain an upper bound on
R. We will consider two other upper bounds: the Singleton bound (Section 4.3), which gives
a tight upper bound for large enough alphabets (but not binary codes) and the Plotkin bound
(Section 4.4), which gives a stronger upper bound than Singleton bound for binary codes.

4.1 Asymptotic Version of the Hamming Bound

We have already seen an upper bound in Section 1.7 due to Hamming. However, we had stated
this as an upper bound on the dimension k in terms of n, q and d . In this section we convert
this into a relation on R versus δ.
Consider any (n, k, d )q code with rate R = k/n and relative distance δ = d /n. Recall that
Theorem 1.7.2 implies the following:
³j k ´
d −1
k log q V ol q 2 ,n
R = ≤ 1−
n n
Recall further that Proposition 3.3.3 states the following lower bound on the volume of a Ham-
ming ball: µ¹ º ¶ ³ ´
d −1 H q δ2 n−o(n)
V ol q ,n ≥ q .
2

73
Taking logarithms to base q of both sides above, and dividing by n yields that the second term
in the right hand side of the inequality above is lower bounded by H q (δ/2) − o(1), where the
o(1) term tends to 0 as n → ∞. Thus Theorem 1.7.2 implies that for a q-ary of code C of rate R,
relative distance δ and block length n, we have:
µ ¶
δ
R ≤ 1 − Hq + o(1), (4.1)
2
where the o(1) term tends to 0 as n → ∞. Thus for an infinite family of q-ary codes C , by taking
limits as n → ∞, we get the following asymptotic Hamming bound (see Exercise 4.1).
Proposition 4.1.1 (Asymptotic Hamming Bound). Let C be an infinite family of q-ary codes
with rate R = R(C ) and relative distance δ = δ(C ). Then we have:
µ ¶
δ
R ≤ 1 − Hq .
2
Figure 4.1 gives a pictorial description of the asymptotic Hamming bound for binary codes.

1
Hamming bound
GV bound
0.9

0.8

0.7

0.6

0.5
R

0.4

0.3

0.2

0.1

0
0 0.2 0.4 0.6 0.8 1
δ

Figure 4.1: The Hamming and GV bounds for binary codes. Note that any point below the GV
bound is achievable by some code while no point above the Hamming bound is achievable by
any code. In this part of the book we would like to push the GV bound as much up as possible
while at the same time try and push down the Hamming bound as much as possible.

4.2 Gilbert-Varshamov Bound

Next, we will switch gears by proving our first non-trivial lower bound on R in terms of δ. (In
fact, this is the only positive result on the R vs δ tradeoff question that we will see in this book.)
In particular, we will prove the following result:

74
Theorem 4.2.1 (Gilbert-Varshamov Bound). Let q ≥ 2. For every 0 ≤ δ < 1− q1 there exists a family
of q-ary codes C with rate R(C ) ≥ 1 − H q (δ) and relative distance δ(C ) ≥ δ. If q is a prime power
then there exists such a q-ary family of linear codes. Furthermore, for every 0 ≤ ε ≤ 1 − H q (δ) and
integer n, if a matrix G is picked uniformly from Fk×nq for k = n(1 − H q (δ) − ε), then G generates
a code of rate 1 − H q (δ) − ε and relative distance at least δ with probability strictly greater than
1 − q −εn .

The bound of the theorem is referred to as the GV bound. For a pictorial description of the
GV bound for binary codes, see Figure 4.1. We will present the proofs for general codes and
linear codes in Sections 4.2.1 and 4.2.2 respectively.
In what follows we first prove the existence of a non-linear code of rate 1−H q (δ) and relative
distance at least δ. Later we show how to get a linear code, and with high probability (when
ε > 0). (Note that the existence of a linear code is implied by the final part using ε = 0.)

4.2.1 Greedy Construction

We will prove Theorem 4.2.1 for general codes by a greedy construction described next: Fix an
integer n and let d = δn. Start with the empty code C ⊆ [q]n and then keep on adding strings
to C that are at Hamming distance at least d from all the existing words in C . Algorithm 4.2.1
presents a formal description of the algorithm and Figure 4.2 illustrates the first few executions
of this algorithm.

Algorithm 4.2.1 Gilbert’s Greedy Code Construction

I NPUT: n, q, d
O UTPUT: A code C ⊆ [q]n of distance d ≥ 1

1: C ← ;
2: WHILE there exists a v ∈ [q]n such that ∆(v, c) ≥ d for every c ∈ C DO
3: Add v to C
4: RETURN C

We claim that Algorithm 4.2.1 terminates and the C that it outputs has distance d . The latter
is true by step 2, which makes sure that in Step 3 we never add a vector c that will make the
distance of C fall below d . For the former claim, note that, if we cannot add v at some point, we
cannot add it later. Indeed, since we only add vectors to C , if a vector v ∈ [q]n is ruled out in a
certain iteration of Step 2 because ∆(c, v) < d , then in all future iterations, we have ∆(v, c) < d
and thus, this v will never be added in Step 3 in any future iteration.
The running time of Algorithm 4.2.1 is q O(n) . To see this, note that Step 2 in the worst-case
could be repeated for every vector in [q]n , that is at most q n times. In a naive implementa-
tion, for each iteration, we cycle through all vectors in [q]n and for each vector v ∈ [q]n , iterate
through all (at most q n ) vectors c ∈ C to check whether ∆(c, v) < d . If no such c exists, then we
add v to C . Otherwise, we move to the next v. However, note that we can do slightly better–
since we know that once a v is “rejected" in an iteration, it’ll keep on being rejected in the future

75
c3
c2

c1
c4
d −1

[q]n

Figure 4.2: An illustration of Gilbert’s greedy algorithm (Algorithm 4.2.1) for the first five itera-
tions.

iterations, we can fix up an ordering of vectors in [q]n and for each vector v in this order, check
whether it can be added to C or not. If so, we add v to C , else we move to the next vector in the
order. This algorithm has time complexity O(nq 2n ), which is still q O(n) .
Further, we claim that after termination of Algorithm 4.2.1
[
B (c, d − 1) = [q]n .
c∈C

This is because if the above is not true, then there exists a vector v ∈ [q]n \C , such that ∆(v, c) ≥ d
and hence v can be added to C . However, this contradicts the fact that Algorithm 4.2.1 has
terminated. Therefore,
¯ ¯
¯[ ¯
¯ B (c, d − 1)¯ = q n . (4.2)
¯ ¯
c∈C

It is not too hard to see that

¯ ¯
X ¯[ ¯
|B (c, d − 1)| ≥ ¯¯ B (c, d − 1)¯¯ ,
c∈C c∈C

which by (4.2) implies that

X
|B (c, d − 1)| ≥ q n
c∈C

or since the volume of a Hamming ball is translation invariant,

X
V ol q (d − 1, n) ≥ q n .
c∈C

76
P
Since c∈C V ol q (d − 1, n) = V ol q (d − 1, n) · |C |, we have

qn
|C | ≥
V ol q (d − 1, n)
qn
≥ (4.3)
q nHq (δ)
= q n(1−Hq (δ)) ,

as desired. In the above, (4.3) follows from the fact that

V ol q (d − 1, n) ≤ V ol q (δn, n)
≤ q nHq (δ) , (4.4)

where the second inequality follows from the upper bound on the volume of a Hamming ball in
Proposition 3.3.3.
We thus conclude that for every q, n and δ there exists a code of rate at least n(1 − H q (δ)).
We state this formally as a lemma below.

Lemma 4.2.2. For every pair of positive integers n, q and real δ ∈ [0, 1] there exists an (n, k, δn)q
qn
code satisfying q k ≥ V ol q (d −1,n) .
In particular, for every positive integer q and real δ ∈ [0, 1−1/q] there exists an infinite family
of q-ary codes C of rate R and distance δ satisfying R ≥ 1 − H q (δ).

It is worth noting that the code from Algorithm 4.2.1 is not guaranteed to have any special
structure. In particular, even storing the code can take exponential space. We have seen in
Proposition 2.3.3 that linear codes have a much more succinct representation. Thus, a natural
question is:

Question 4.2.1. Do linear codes achieve the R ≥ 1 − H q (δ) tradeoff that the greedy construc-
tion achieves?

Next, we will answer the question in the affirmative.

4.2.2 Linear Code Construction

Now we will show that a random linear code, with high probability, lies on the GV bound. The
construction is a use of the probabilistic method (Section 3.2).

Proof of Theorem 4.2.1. By Proposition 2.3.6, we are done if we can show that there exists a k ×n
matrix G of full rank (for k = (1 − H q (δ) − ε)n) such that

For every m ∈ Fkq \ {0}, w t (mG) ≥ d .

77
We will prove the existence of such a G by the probabilistic method. Pick a random linear code
by picking a random k × n matrix G where each of kn entries is chosen uniformly and indepen-
dently at random from Fq . Fix m ∈ Fkq \ {0}. Recall that by Lemma 3.1.12, for a random G, mG is
a uniformly random vector from Fnq . Thus, for every non-zero vector m, we have

V ol q (d − 1, n)
Pr[w t (mG) < d ] =
G qn
q nHq (δ)
≤ (4.5)
qn
≤ q −k · q −εn , (4.6)

where (4.5) follows from the fact that the condition w t (mG) < d is equivalent to the condition
that mG ∈ B (0, d − 1) and the fact that mG is uniformly random in Fnq , (4.5) follows from (4.4)
and (4.6) uses k ≤ n(1 − H q (δ) − ε). There are q k − 1 non-zero vectors m and taking the union
over all such vectors and applying the union bound (Lemma 3.1.5) we have

Pr[There exists a non-zero m s.t. wt(mG) < d ] ≤ (q k − 1) · q −k · q −εn

G
< q −ε·n .

Fix a matrix G ∈ Fk×n

q such that for every non-zero M we have wt(mG) ≥ d . The argument above
has shown that a random matrix has this property with probability strictly greater than 1−q −εn .
By Proposition 2.3.6 this implies that the code generated by G has distance at least d . To con-
clude the theorem we only need to argue that the code has dimension k, i.e., that G has full rank.
But this also follows immediately from the property that for every ε ≥ 0, we have that the prob-
ability that the code generated by a uniformly random matrix has distance less than or equal to
d is strictly less than 1. Thus using the probabilistic method we conclude there exists a matrix
G such that the code it generates in an [n, k, d ]q code. Furthermore if ε > 0 then the probability
that the code does not have distance d is exponentially small, specifically at most q εn .
To conclude we need to verify that the code generated by G has dimension k, i.e., that G
has full rank. But note that an equivalent definition of G not having full rank is that there ex-
ists a non-zero vector M such that mG = 0. But the existence of such a vector m would imply
wt(mG) = 0 < d contradicting the property that for every non-zero M we have wt(mG) ≥ d . We
thus conclude that G generates a code of rate k/n = 1 − H q (δ) − ε and relative distance δ. The
theorem follows.

Discussion. We now digress a bit to stress some aspects of the GV bound and its proof. First,
note that that proof by the probabilistic method shows something stronger than just the exis-
tence of a code, but rather gives a high probability result. Furthermore, as pointed out explicitly
for the non-linear setting in Lemma 4.2.2, the result gives a lower bound not only in the asymp-
totic case but also one for every choice of n and k. The proof of the GV bound in the non-linear
case gives a similar non-asymptotic bound in the linear setting also.

78
Note that we can also pick a random linear code by picking a random (n −k)×n parity check
matrix. This also leads to a alternate proof of the GV bound: see Exercise 4.2.
Finally, we note that Theorem 4.2.1 requires δ < 1 − q1 . An inspection of Gilbert and Var-
shamov’s proofs shows that the only reason the proof required that δ ≤ 1 − q1 is because it is
needed for the volume bound (recall the bound in Proposition 3.3.3)– V ol q (δn, n) ≤ q Hq (δ)n – to
hold. It is natural to wonder if the above is just an artifact of the proof or if better codes exist.
This leads to the following question:

Question 4.2.2. Does there exists a code with R > 0 and δ > 1 − q1 ?

We will return to this question in Section 4.4.

4.3 Singleton Bound

We will now change gears again and prove an upper bound on R (for fixed δ). We start by proving
the Singleton bound.

Theorem 4.3.1 (Singleton Bound). For every (n, k, d )q code,

k ≤ n − d + 1.

Consequently, if C is an infinite family of codes of rate R and relative distance δ then R ≤ 1 − δ.

Note that the asymptotic bound hold for any family of codes, even those where the alphabet
may grow (arbitrarily) with the length of the code.

Proof. We start by proving the non-asymptotic bound first. The asymptotic version follows eas-
ily and is shown at the end.
Let c1 , c2 , . . . , cM be the codewords of an (n, k, d )q code C . Note that we need to show M ≤
n−d +1
q . To this end, we define c′i to be the prefix of the codeword ci of length n − d + 1 for every
i ∈ [M ]. See Figure 4.3 for a pictorial description.
We now claim that for every i 6= j , c′i 6= c′j . For the sake of contradiction, assume that there
exits an i 6= j such that c′i = c′j . Notice this implies that ci and c j agree in all the first n − d +
1 positions, which in turn implies that ∆(ci , c j ) ≤ d − 1. This contradicts the fact that C has
distance d . Thus, M is the number of prefixes of codewords in C of length n − d + 1, which
implies that M ≤ q n−d +1 as desired.
To get the asymptotic bound, assume some infinite family of codes C has rate R = R(C ) =
1 − δ + ε for some ε > 0. Then there must exist an n > 2/ε and a code C n ∈ C that is an (n, k, d )q
code with k ≥ n(1−δ+ε) and d ≥ δn. By our choice of n we thus have k ≥ n −d +2 contradicting
the non-asymptotic bound proved above.

79
c1

ci c′i

cj c′j

n −d +1 d −1

Figure 4.3: Construction of a new code in the proof of the Singleton bound.

1
Hamming bound
GV bound
Singleton bound

0.8

0.6
R

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1
δ

Figure 4.4: The Hamming, GV and Singleton bound for binary codes.

80
Figure 4.4 presents a pictorial description of the asymptotic version of the Singleton bound.
It is worth noting that the bound is independent of the alphabet size. As is evident from Fig-
ure 4.4, the Singleton bound is worse than the Hamming bound for binary codes. However, this
bound is better for larger alphabet sizes. In fact, we will look at a family of codes called Reed-
Solomon codes in Chapter 5 that meets the Singleton bound. However, the alphabet size of the
Reed-Solomon codes increases with the block length n. Thus, a natural follow-up question is
the following:

Question 4.3.1. Given a fixed q ≥ 2, does there exist a q-ary code that meets the Singleton
bound?

We’ll see an answer to this question in the next section.

4.4 Plotkin Bound

In this section, we will study the Plotkin bound, which will answer Questions 4.2.2 and 4.3.1.
We start by stating the bound.

Theorem 4.4.1 (Plotkin bound). The following hold for any code C ⊆ [q]n with distance at least
d:
³ ´
1
1. If d = 1 − q n, |C | ≤ 2qn.
³ ´
qd
2. If d > 1 − q1 n, |C | ≤ qd −(q−1)n
.

Note that the Plotkin bound (Theorem 4.4.1) implies that a code with relative distance δ ≥
1 − q1 , must necessarily have R = 0, which answers Question 4.2.2 in the negative.
Before proving Theorem 4.4.1, we make a few remarks. We first note that the upper bound in
the first part of Theorem 4.4.1 can be improved to 2(q−1)n for q ≥ 2. (See Exercise 4.13.) Second,
it can be shown that this bound is tight for q = 2. (See Exercise 4.14.) Third, the statement of
Theorem 4.4.1 gives a trade-off only for relative distance greater than 1 − 1/q. However, as the
following corollary shows, the result can be extended to work for 0 ≤ δ ≤ 1 − 1/q. (See Figure 4.5
for an illustration for binary codes.)

Corollary 4.4.2. Let C be an infinite family of q-ary codes with relative distance 0 ≤ δ ≤ 1− q1 and
rate R. Then µ ¶
q
R ≤ 1− δ.
q −1

´ Assume for contradiction that C is an infinite family of q-ary codes with rate R = 1 −
³Proof.
q
q−1
δ + ε for some ε > 0. Let C ∈ C be a code of block length n ≥ 2/ε with distance d ≤ δn

81
and message length k ≥ Rn. We argue now that an appropriate “shortening” of C yields a code
contradicting Theorem 4.4.1.
Partition the codewords
j of k C so that codewords within a partition agree on the first n −
qd
n ′ symbols, where n ′ = q−1 − 1. (We will see later why this choice of n ′ makes sense.) In
n−n ′
particular, for every x ∈ [q] , define the ‘prefix code’

C x = {(c n−n ′ +1 , . . . c n ) | (c 1 . . . c N ) ∈ C , (c 1 . . . c n−n ′ ) = x} .

In other words C x consists of the n ′ -length suffixes of all codewords of C that start with the
string x.) j k
′ qd
By definition C x is a q-ary code of block length n = q−1 − 1. We claim that it also has
distance at least d for every x: To see this, suppose for some x, c1 6= c2 ∈ C x , ∆(c1 , c2 ) < d . But
this yields two codewords of C , namely (x, c1 ) and (x, c2 ), a Hamming distance is less than d
from each other,
³ contradicting
´ the assumption that ∆(C ) ≥³ d. ´
q
Since n ′ < q−1
d (by definition of n ′ ) and thus, d > 1 − q1 n ′ . Applying Theorem 4.4.1 to
C x we get that
qd
|C x | ≤ ≤ qd ≤ q 2 , (4.7)
qd − (q − 1)n ′
where the second inequality follows from the fact that qd −(q −1)n ′ is a positive integer and the
third is immediate from d ≤ q.
We now use the bound on |C x | for all x to get a bound on |C |. Note that by the definition of
Cx: X
|C | = |C x | ,
′
x∈[q]n−n

which by (4.7) implies that

³ ´
X 2 n−n ′ +2
q
n− q−1 d +2
q
n 1−δ· q−1 +ε
|C | ≤ q =q ≤q ≤q ,
′
x∈[q]n−n

where the first inequality uses the definition

³ ´ of n ′ and the final inequality uses the fact that
q
n ≥ 2/ε. We conclude that R ≤ 1 − q−1 δ + ε. Since this holds for every ε > 0 the corollary
follows.

Note that Corollary 4.4.2 implies that for any q-ary code of rate R and relative distance δ
(where q is a constant independent of the block length of the code), R < 1 − δ. In other words,
this answers Question 4.3.1 in the negative.
Let us pause for a bit at this point and recollect the bounds on R versus δ that we have proved
till now, which are all depicted in Figure 4.5 (for q = 2). The GV bound is the best known lower
bound at the time of writing of this book. Better upper bounds are known and we will see one
such trade-off (called the Elias-Bassalygo bound) in Section 8.1.
Now, we turn to the proof of Theorem 4.4.1, for which we will need two more lemmas. The
first lemma deals with vectors over real spaces. We quickly recap the necessary definitions.

82
1
Hamming bound
GV bound
Singleton bound
Plotkin bound

0.8

R 0.6

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1
δ

Figure 4.5: The current bounds on the rate R vs. relative distance δ for binary codes. The GV
bound is a lower bound on R while the other three bounds are upper bounds on R.

n
Consider
q a vector v in R , that is, a tuple of n real numbers. This vector has (Euclidean) norm
kvk = v 12 + v 22 + . . . + v n2 , and is a unit vector if and only if its norm is 1. The inner product of
P
two vectors, u and v, is 〈u, v〉 = i u i · v i . The following lemma gives a bound on the number of
vectors that can exist such that every pair is at an obtuse angle with each other.

Lemma 4.4.3 (Geometric Lemma). Let v1 , v2 , . . . , vm ∈ RN be non-zero vectors.

1. If 〈vi , v j 〉 ≤ 0 for all i 6= j , then m ≤ 2N .

2. Let vi be unit vectors for 1 ≤ i ≤ m. Further, if 〈vi , v j 〉 ≤ −ε < 0 for all i 6= j , then m ≤ 1 + 1ε .1

(Both items 1 and 2 are tight: see Exercises 4.15 and 4.16.) The proof of the Plotkin bound
will need the existence of a map from codewords to real vectors with certain properties, which
the next lemma guarantees.

Lemma 4.4.4 (Mapping Lemma). For every q and n, there exists a function f : [q]n −→ Rnq such
that for every c1 , c2 ∈ [q]n we have
µ ¶µ ¶
q ∆(c1 , c2 )
〈 f (c1 ), f (c2 )〉 = 1 − .
q −1 n

Consequently we get:

1. For every c ∈ [q]n , k f (c)k = 1.

1
Note that since vi and v j are both unit vectors, 〈vi , v j 〉 is the cosine of the angle between them.

83
³ ´³ ´
q d
2. If ∆(c1 , c2 ) ≥ d then we have 〈 f (c1 ), f (c2 )〉 ≤ 1 − q−1 n .

We defer the proofs of the Geometric Lemma and the Mapping Lemma to the end of the
section and turn instead to proving Theorem 4.4.1 using the lemmas.

Proof of Theorem 4.4.1. Let C = {c1 , c2 , . . . , cm } be a q-ary code of block length n and distance d .
Let f : [q]n → Rnq be the function from Lemma 4.4.4. Then for all i we have that f (ci ) is a unit
length vector in R nq . Furthermore for all i 6= j , we have
µ ¶
® q d
f (ci ), f (c j ) ≤ 1 − .
q −1 n

Thus f (c1 ), . . . , f (cm ) give us unit vectors in R nq to which we can apply Lemma 4.4.3 and this
will yield the upper bounds claimed³ on m´= |C | in the theorem statement.
(q−1)n
For part 1 of the theorem, if d = 1 − q1 n = q
, then for all i 6= j , we have
®
f (ci ), f (c j ) ≤ 0.

So by the first part of³ Lemma

´ 4.4.3, m ≤ 2nq, as desired.
q−1
For part 2, if d > q
n then for all i 6= j we have
µ ¶ µ ¶
q d qd − (q − 1)n
〈 f (ci ), f (c j )〉 ≤ 1 − =− .
q −1 n (q − 1)n
³ ´
def qd −(q−1)n
Let ε = (q−1)n > 0. We can apply the second part of Lemma 4.4.3 to f (c1 ), . . . , f (cm ) and ε
(q−1)n qd
to get m ≤ 1 + qd −(q−1)n = qd −(q−1)n , as desired.

4.4.1 Proof of Geometric and Mapping Lemmas

We now prove Lemmas 4.4.3 and 4.4.4. We start with Lemma 4.4.3, namely the Geometric
Lemma.

Proof of Lemma 4.4.3. We prove both parts using linear algebra over the reals.
We start by proving the first part of the lemma. This part is also linear algebraic but involves
a few more steps.
We first focus on a subset of the m vectors that has a positive inner product with some fixed
vector u. Specifically we pick u to be a generic vector in RN so that 〈u, vi 〉 6= 0 for every i . Such
vector exists since the set of vectors satisfying 〈u, vi 〉 = 0 is a dimension N − 1 linear subpace of
RN (since vi 6= 0). And the union of N such linear subspaces (one for each i ∈ [N ]) cannot cover
all of RN .
Assume w.l.o.g. that at least half of the vi ’s have a positive inner product with u (if not we
can work with −u instead) and assume further that these are the first ℓ ≥ m/2 vectors by renum-
bering the vectors. We now show that v1 , . . . , vℓ are linearly independent. This suffices to prove
the first part, since linear independence implies ℓ ≤ N and thus m ≤ 2ℓ ≤ 2N .

84
Assume for contradiction that there is a linear dependency among the vectors v1 , . . . , vℓ , i.e.,
P
there exist α1 , . . . , αℓ with at least one αi 6= 0 such that i ∈[ℓ] αi zi = 0. Note we can assume that
at least one αi is positive since if all are non-negative we can negate all αi ’s to get a positive αi .
Further, by renumbering the indices we can assume that there exists k ≥ 1 such that α1 , . . . , αk >
0 and αk+1 , . . . , αℓ ≤ 0.
P P
Let w = ki=1 αi vi By the definition of αi ’s we have that w = − ℓj =k+1 α j v j . We first argue
that w 6= 0 by using the vector u. Note that we have

k
X k
X
〈u, w〉 = 〈u, αi vi 〉 = αi 〈u, vi 〉 ≥ α1 〈u, v1 〉 > 0.
i =1 i =1

We thus conclude w has a non-zero inner product with some vector and hence can not be the
zero vector.
But now we have the following contradiction:
* +
k
X ℓ
X k,ℓ
X
0 < 〈w, w〉 = αi vi , − αj vj =− αi α j 〈vi , v j 〉 ≤ 0,
i =1 j =k+1 i =1, j =k+1

where the first inequality uses w 6= 0, the first equality uses the two definitions of w namely
P P
w = ki=1 αi vi = − ℓj =k+1 α j v j , and the final inequality holds for every term in the summation.
Specifically for every 0 ≤ i ≤ k and k + 1 ≤ j ≤ ℓ we have αi ≥ 0, α j ≤ 0 and 〈vi , v j 〉 ≤ 0 and so
−αi α j 〈vi , v j 〉 ≤ 0. We conclude that v1 , . . . , vℓ must be linearly independent and this proves the
first part of the lemma.
We now move on to the proof of the second part. Define z = v1 + . . . + vm . Now consider the
following sequence of relationships:
Ã !
m
X X m
kzk2 = kvi k2 + 2 〈vi , v j 〉 ≤ m + 2 · · (−ε) = m(1 − εm + ε).
i =1 i<j 2

The inequality follows from the facts that each vi is a unit vector and the assumption that for
every i 6= j , 〈vi .v j 〉 ≤ −ε. As kzk2 ≥ 0,

m(1 − εm + ε) ≥ 0.

Since m ≥ 1, we have that

1 − εm + ε ≥ 0

or
εm ≤ 1 + ε.

Thus, we have m ≤ 1 + 1ε , as desired.

85
Alternate proof of first part. We now present an alternate proof of the first result, which we
do by induction on n. Note that in the base case of N = 0, we have m = 0, which satisfies the
claimed inequality m ≤ 2N .
In the general case, we have m ≥ 1 non-zero vectors v1 , . . . , vm ∈ RN such that for every i 6= j ,

〈vi , v j 〉 ≤ 0. (4.8)

Since rotating all the vectors by the same amount does not change the sign of the inner
product (nor does scaling any of the vectors), w.l.o.g. we can assume that vm = 〈1, 0, . . . , 0〉. For
1 ≤ i ≤ m − 1, denote the vectors as vi = 〈αi , yi 〉, for some αi ∈ R and yi ∈ RN −1 . Now, for any
P
i 6= 1, 〈v1 , vi 〉 = 1 · αi + m
i =2 0 = αi . However, we know from (4.8) that 〈v1 , vi 〉 ≤ 0, which in turn
implies that
αi ≤ 0. (4.9)
Next, we claim that at most one of y1 , . . . , ym−1 can be the all zeroes vector, 0. If not, assume
w.l.o.g., that y1 = y2 = 0. This in turn implies that

〈v1 , v2 〉 = α1 · α2 + 〈y1 , y2 〉
= α1 · α2 + 0
= α1 · α2
> 0,

where the last inequality follows from the subsequent argument. As v1 = 〈α1 , 0〉 and v2 = 〈α2 , 0〉
are non-zero, we have that α1 , α2 6= 0. (4.9) then implies that α1 , α2 < 0. However, 〈v1 , v2 〉 > 0
contradicts (4.8).
Thus, w.l.o.g., assume that y1 , . . . , ym−2 are all non-zero vectors. Further, note that for every
i 6= j ∈ [m − 2], 〈yi , y j 〉 = 〈vi , v j 〉 − αi · α j ≤ 〈vi , v j 〉 ≤ 0. Thus, we have reduced problem on m
vectors with dimension N to an equivalent problem on m − 2 vectors with dimension N − 1. By
induction we have m − 2 ≤ 2(N − 1) and thus implying m ≤ 2N .

Finally, we prove the Mapping Lemma, i.e., Lemma 4.4.4.

Proof of Lemma 4.4.4. We begin by defining a map φ : [q] → Rq which essentially satisfies the
requirements of the lemma statement for the case n = 1 (up to some normalization constant).
Then, we essentially apply φ separately to each coordinates of a word to get the map f : [q]n →
Rnq that satisfies the claimed properties. We now fill in the details.
Let ei denote the unit vector along the i th direction in Rq , i.e.,
* +
ei = 0, 0, . . . , 1
|{z} ,...,0 .
i th position

P
Let e = q1 i ∈[q] ei = 〈1/q, 1/q, . . . , 1/q〉. Note that we have 〈ei , e j 〉 = 1 if i = j and 0 otherwise.
Note also 〈e, ei 〉 = 〈e, e〉 = 1/q for every i .

86
Now we define φ : [q] → Rq to be φ(i ) = ei − e. For every pair i , j ∈ [q] we have

〈φ(i ), φ( j )〉 = 〈ei − e, e j − e〉 = 〈ei , e j 〉 − 〈ei , e〉 − 〈e, e j 〉 + 〈e, e〉 = 〈ei , e j 〉 − 1/q.

Thus, for every i ∈ [q], we get:

(q − 1)
kφ(i )k2 = 〈ei , ei 〉 − 1/q = . (4.10)
q
Also for every i 6= j ∈ [q], we have:
1
〈φ(i ), φ( j )〉 = − . (4.11)
q
We are now ready to define our final map f : [q]n → Rnq . For every c = (c 1 , . . . , c n ) ∈ [q]n ,
define s
q ¡ ¢
f (c) = · φ(c 1 ), φ(c 2 ), . . . , φ(c n ) .
n(q − 1)
q
q
(The multiplicative factor n(q−1) will be used to ensure below that f (c) for every c ∈ [q]n is a
unit vector.)
To complete the proof, we will show that f satisfies the claimed properties. We begin with
condition 1. Note that
q Xn ° °
k f (c)k2 = · °φ(i )°2 = 1,
(q − 1)n i =1
where the first equality follows from the definition of f and the second equality follows from
(4.10).
We now turn to the second condition. For notational convenience, define c1 = (x 1 , . . . , x n )
and c2 = (y 1 , . . . , y n ). Consider the following sequence of relations:
® Xn ®
f (c1 ), f (c2 ) = f (x ℓ ), f (y ℓ )
ℓ=1
" # µ ¶
X ® X ® q
= φ(x ℓ ), φ(y ℓ ) + φ(x ℓ ), φ(y ℓ ) ·
ℓ:x ℓ 6= y ℓ ℓ:x ℓ =y ℓ n(q − 1)
" µ ¶ ¶# µ
µ ¶
X −1 X q −1 q
= + · (4.12)
ℓ:x ℓ 6= y ℓ q ℓ:x ℓ =y ℓ q n(q − 1)
· µ ¶ µ ¶¸ µ ¶
−1 q −1 q
= ∆(c1 , c2 ) + (n − ∆(c1 , c2 )) · (4.13)
q q n(q − 1)
µ ¶· ¸
q 1 q −1
= 1 − ∆(c1 , c2 ) +
n(q − 1) q q
µ ¶µ ¶
q ∆(c1 , c2 )
= 1− ,
q −1 n
as desired. In the above, (4.12) is obtained using (4.11) and (4.10) while (4.13) follows from the
definition of the Hamming distance.

87
4.5 Exercises
Exercise 4.1. Given an infinite family of q-ary codes C of relative distance , and ε > 0 prove that
there exists an n 0 such that for all n ≥ n 0 , if C n ∈ C is an [n, k]q code, then k/n < 1 − H q (δ/2) + ε.
Use this to conclude Proposition 4.1.1.
Exercise 4.2. Pick a (n − k) × n matrix H over Fq at random. Show that with high probability the
code whose parity check matrix is H achieves the GV bound.
Exercise 4.3. Recall the definition of an ε-biased space from Exercise 2.15. Show that there exists
an ε-biased space of size O(k/ε2 ).

Hint: Recall part 1 of Exercise 2.15.

Exercise 4.4. Argue that a random linear code as well as its dual both lie on the corresponding
GV bound.
Exercise 4.5. In Section 4.2.2, we saw that random linear code meets the GV bound. It is natural
to ask the question for general random codes. (By a random (n, k)q code, we mean the follow-
ing: for each of the q k messages, pick a random vector from [q]n . Further, the choices for each
codeword is independent.) We will do so in this problem.
1. Prove that a random q-ary code with rate R > 0 with high probability has relative distance
δ ≥ H q−1 (1 − 2R − ε). Note that this is worse than the bound for random linear codes in
Theorem 4.2.1.

2. Prove that with high probability the relative distance of a random q-ary code of rate R is at
most H q−1 (1 − 2R) + ε. In other words, general random codes are worse than random linear
codes in terms of their distance.
Hint: Use Chebyshev’s inequality.

Exercise 4.6. We saw that Algorithm 4.2.1 can compute an (n, k)q code on the GV bound in time
q O(n) . Now the construction for linear codes is a randomized construction and it is natural to ask
how quickly can we compute an [n, k]q code that meets the GV bound. In this problem, we will
see that this can also be done in q O(n) deterministic time, though the deterministic algorithm is
not that straight-forward anymore.
1. Argue that Theorem 4.2.1 gives a q O(kn) time algorithm that constructs an [n, k]q code on
the GV bound. (Thus, the goal of this problem is to “shave" off a factor of k from the expo-
nent.)
,n
2. A k × n Toeplitz Matrix A = {A i , j }ki =1, j =1
satisfies the property that A i , j = A i −1, j −1 . In other
words, any diagonal has the same value. For example, the following is a 4 × 6 Toeplitz
matrix:  
1 2 3 4 5 6
 
 7 1 2 3 4 5 
 
 8 7 1 2 3 4 
9 8 7 1 2 3

88
A random k ×n Toeplitz matrix T ∈ Fk×n
q is chosen by picking the entries in the first row and
column uniformly (and independently) at random.
Prove the following claim: For any non-zero m ∈ Fkq , the vector m·T is uniformly distributed
£ ¤
over Fnq , that is for every y ∈ Fnq , Pr m · T = y = q −n .

Hint: Write down the expression for the value at each of the n positions in the vector m · T in terms of the
values in the first row and column of T . Think of the values in the first row and column as variables. Then
divide these variables into two sets (this “division" will depend on m) say S and S. Then argue the following:
for every fixed y ∈ Fnq and for every fixed assignment to variables in S, there is a unique assignment to variables
in S such that mT = y.

3. Briefly argue why the claim in part 2 implies that a random code defined by picking its
generator matrix as a random Toeplitz matrix with high probability lies on the GV bound.

4. Conclude that an [n, k]q code on the GV bound can be constructed in time q O(k+n) .

Exercise 4.7. Show that one can construct the parity check matrix of an [n, k]q code that lies on
the GV bound in time q O(n) .

Exercise 4.8. So far in Exercises 4.6 and 4.7, we have seen two constructions of [n, k]q code on the
GV bound that can be constructed in q O(n) time. For constant rate codes, at the time of writing of
this book, this is fastest known construction of any code that meets the GV bound. For k = o(n),
there is a better construction known, which we explore in this exercise.
We begin with some notation. For the rest of the exercise we will target a distance of d = δn.
Given a message m ∈ Fkq and an [n, k]q code C , define the indicator variable:
½
1 if w t (C (m)) < d
Wm (C ) =
0 otherwise.

Further, define
X
D(C ) = Wm (C ).
m∈Fkq \{0}

We will also use D(G) and Wm (G) to denote the variables above for the code C generated by G.
Given an k × n matrix M , we will use M i to denote the i th column of M and M ≤i to denote
the column submatrix of M that contains the first i columns. Finally below we will use G to
denote a uniformly random k ×n generator matrix and G to denote a specific instantiation of the
generator matrix. We will arrive at the final construction in a sequence of steps. In what follows
define k < (1 − H q (δ))n for large enough n.

1. Argue that C has a distance d if and only if D(C ) < 1.

2. Argue that E [D(G )] < 1.

89
3. Argue that for any 1 ≤ i < n and fixed k × n matrix G,
h i h i
min E D(G )|G ≤i = G ≤i , G i +1 = v ≤ E D(G )|G ≤i = G ≤i .
v∈Fkq

4. We are now ready to define the algorithm to compute the final generator matrix G: see
Algorithm 4.5.1. Prove that Algorithm 4.5.1 outputs a matrix G such that the linear code

Algorithm 4.5.1 q O(k) time algorithm to compute a code on the GV bound

I NPUT: Integer parameters 1 ≤ k 6= n such that k < (1 − H q (δ)n)
O UTPUT: An k × n generator matrix G for a code with distance δn

1: Initialize G to be the all 0s matrix ⊲ This initialization is arbitrary

2: FOR every 1 ≤ i ≤ n DO
£ ¤
3: G i ← arg minv∈Fk E D(G )|G ≤i = G ≤i , G i +1 = v
q

4: RETURN G

generated by G is an [n, k, δn]q code. Conclude that this code lies on the GV bound.

5. Finally, we will analyze

¡ ¢the run time of Algorithm 4.5.1. Argue that Step 2 can be im-
k
plemented
¡ k
¢in poly n, q time. Conclude Algorithm 4.5.1 can be implemented in time
poly n, q .

Hint: It might be useful to maintain a data structure that keeps track of one number for every non-zero m ∈ Fkq
throughout the run of Algorithm 4.5.1.

Exercise 4.9. In this problem we will derive the GV bound using a graph-theoretic proof, which
is actually equivalent to the greedy proof we saw in Section 4.2.1. Let 1 ≤ d ≤ n and q ≥ 1 be
integers. Now consider the graph G n,d ,q = (V, E ), where the vertex set is the set of all vectors in
[q]n . Given two vertices u 6= v ∈ [q]n , we have the edge (u, v) ∈ E if and only if ∆(u, v) < d . An
independent set of a graph G = (V, E ) is a subset I ⊆ V such that for every u 6= v ∈ I , we have that
(u, v) is not an edge. We now consider the following sub-problems:

1. Argue that any independent set C of G n,d ,q is a q-ary code of distance d .

2. The degree of a vertex in a graph G is the number of edges incident on that vertex. Let ∆ be
the maximum degree of any vertex in G = (V, E ).Then argue that G has an independent set
|V |
of size at least ∆+1 .

3. Using parts 1 and 2 argue the GV bound.

Exercise 4.10. In this problem we will improve slightly on the GV bound using a more sophisti-
cated graph-theoretic proof. Let G n,d ,q and N and ∆ be as in the previous exercise (Exercise 4.9).

90
So far we used the fact that G n,d ,q has many vertices and small degree to prove it has a large in-
dependent set, and thus to prove there is a large code of minimum distance d . In this exercise we
will see how a better result can be obtained by counting the number of “triangles” in the graph. A
triangle in a graph G = (V, E ) is a set {u, v, w} ⊂ V of three vertices such that all three vertices are
adjancent, i.e., (u, v), (v, w), (w, u) ∈ E . For simplicity we will focus on the case where q = 2 and
d = n/5, and consider the limit as n → ∞.

1. Prove that a graph on N vertices of maximum degree ∆ has at most O(N ∆2 ) triangles.

2. Prove that the number of triangle in graph G n,d ,2 is at most

Ã !
X n
2n · · 3e .
0≤e≤3d /2 e

Hint: Fix u and let e count the number of coordinates where at least one of v or w disagree
with u. Prove that e is at most 3d /2.

3. Simplify the expression in the case where d = n/5 to show that the number of triangles in
G n,n/5,2 is O(N · ∆2−η ) for some η > 0.

4. A famous result in the “probabilistic method” shows (and you don’t have to prove this), that
if a graph on N vertices of maximum degree ∆ has at most O(N ·∆2−η ) triangles, then it has
an independent set of size Ω( N
∆
log ∆). Use this result
¡ nto¢ conclude that there is a binary code
of block length n and distance n/5 of size Ω(n2n / n/5 ). (Note that this improves over the
GV-bound by an Ω(n) factor.)

Exercise 4.11. Use part 1 from Exercise 1.7 to prove the Singleton bound.

Exercise 4.12. Let C be an (n, k, d )q code. Then prove that fixing any n −d +1 positions uniquely
determines the corresponding codeword.

Exercise 4.13. Our goal in this problem is to improve the bound in part 1 in Theorem 4.4.1.
Towards that end,

1. Prove that the following holds for every k ≥ 1. There exists k +1 vectors vki ∈ Rk for i ∈ [k +1]
° °2 D E
such that (1) °vk ° = 1 for every i ∈ [k + 1] and (2) vk , vk = − 1 for every i 6= j ∈ [k + 1].
i 2 i j k

2. Using the above part, or

³ otherwise,
´ prove the following result. Let C be a q code of block
1
length n and distance 1 − q n. Then |C | ≤ 2(q − 1)n. (Note that this is a factor q/(q − 1)
better than part 1 in Theorem 4.4.1.)

Exercise 4.14. Prove that the bound in Exercise 4.13 is tight for q = 2– i.e. there exists binary
codes C with block length n and distance n/2 such that |C | = 2n.

Exercise 4.15. Prove that part 1 of Lemma 4.4.3 is tight.

Exercise 4.16. Prove that part 2 of Lemma 4.4.3 is tight.

91
Exercise 4.17. In this exercise we will prove the Plotkin bound (at least part 2 of Theorem 4.4.1)
via a purely combinatorial proof. ³ ´
Given an (n, k, d )q code C with d > 1 − q1 n define
X
S= ∆(c1 , c2 ).
c1 6=c2 ∈C

For the rest of the problem think of C has an |C | × n matrix where each row corresponds to a
codeword in C . Now consider the following:
1. Looking at the contribution of each column in the matrix above, argue that
µ ¶
1
S ≤ 1− · n|C |2 .
q

2. Look at the contribution of the rows in the matrix above, argue that

S ≥ |C | (|C | − 1) · d .

3. Conclude part 2 of Theorem 4.4.1.

Exercise 4.18. In this exercise, we will prove the so called Griesmer Bound. For any [n, k, d ]q ,
prove that
k−1 » ¼
X d
n≥ i
.
i =0 q

Hint: Recall Exercise 2.18.

Exercise 4.19. Use Exercise 4.18 to prove part 2 of Theorem 4.4.1 for linear codes.
Exercise 4.20. Use Exercise 4.18 to prove Theorem 4.3.1 for linear codes.

4.6 Bibliographic Notes

Theorem 4.2.1 was proved for general codes by Gilbert ([62]) and for linear codes by Varshamov
([160]). Hence, the bound is called the Gilbert-Varshamov bound. The Singleton bound (The-
orem 4.3.1) is due to Singleton [142], though versions of this result with the same simple proof
seem to have appeared earlier in the work of Joshi [93] who only states the bound for the case
q = 2. For larger (but still constant) values of q, better lower bounds than the GV bound (i.e.,
results on the existence of codes) are known. In particular, for every prime power q ≥ 49, there
exist linear codes, called algebraic geometric (or AG) codes that outperform the corresponding
GV bound2 . AG codes are out of the scope of this book. An introduction to this class of codes
can be found, for instance, in a survey by Høholdt, van Lint, and Pellikaan [88]. Exercise 4.10 is
from the work of Jiang and Vardy [92].
2 1
AG codes are only defined for q being a square or a prime and achieve a rate R ≥ 1−δ− pq−1 . The lower bound
of 49 comes from the fact that it is the smallest square of a prime for which this bound improves on the q-ary GV
bound.

92
Chapter 5

The Greatest Code of Them All:

Reed-Solomon Codes

Reed-Solomon codes have been studied a lot in coding theory, and are ubiquitous in practice.
These codes are basic and based only very elementary algebra. Yet they are optimal in the sense
that they exactly meet the Singleton bound (Theorem 4.3.1). For every choice of n and k satis-
fying k ≤ n there is a Reed-Solomon code of dimension k, block length n and distance n − k + 1.
As if this were not enough, Reed-Solomon codes turn out to be more versatile: they are fully ex-
plicit and they have many applications outside of coding theory. (We will see some applications
later in the book.)
These codes are defined in terms of univariate polynomials (i.e. polynomials in one un-
known/variable) with coefficients from a finite field Fq . It turns out that polynomials over Fp ,
for prime p, also help us describe finite fields Fp s , for s > 1. We start with a quick review of poly-
nomials over finite fields (for a more careful review, please see Appendix D). This will allow us
to define Reed-Solomon codes over every field Fq , which we do in the second part of this chap-
ter. Finally in the third part of this chapter we discuss “Maximum Distance Separable” (MDS)
codes, which are codes that meet the Singleton bound. We discuss their properties (which in
turn are also properties of the Reed-Solomon codes, since they are MDS codes).

5.1 Polynomials and Finite Fields

We start by reviewing the notion of a (univariate) polynomial over a field and define basic
notions such as degree, evaluation and root of a polynomial. We conclude with the “degree
mantra” that relates the degree to the number of roots.
We begin with the formal definition of a (univariate) polynomial.

Definition 5.1.1. A polynomial over a variable X and a finite field Fq is given by a finite sequence
P
( f 0 , f 1 , . . . , f d ) with f i ∈ Fq and is denoted by F (X ) = di=0 f i X i . The degree of F (X ), denoted
deg(F ), is the largest index i such that f i 6= 0.

For example, 2X 3 + X 2 +5X +6 is a polynomial over F7 of degree 3. We ignore leading zeroes

93
in the definition of a polynomial. For example 0X 4 + 2X 3 + X 2 + 5X + 6 is the same polynomial
as 2X 3 + X 2 + 5X + 6.
Next, we define some useful notions related to polynomials. We begin with the notion of
degree of a polynomial.
We let Fq [X ] denote the set of polynomials over Fq , that is, with coefficients from Fq . Let
F (X ),G(X ) ∈ Fq [X ] be polynomials. Then Fq [X ] has the following natural operations defined
on it:

Addition:
max(deg(F
X),deg(G))
F (X ) +G(X ) = ( f i + g i )X i ,
i =0

where the addition on the coefficients is done over Fq . For example, over F2 ,

X + (1 + X ) = X · (1 + 1) + 1 · (0 + 1) = 1

(recall that over F2 , 1 + 1 = 0).1

Multiplication: Ã !
deg(F )+deg(G)
X min(iX
,deg(F ))
F (X ) · G(X ) = f j · gi −j X i ,
i =0 j =0

where all the operations on the coefficients are over Fq . For example, over F2 , X (1 + X ) =
X + X 2 ; (1+ X )2 = 1+2X + X 2 = 1+ X 2 , where the latter equality follows since 2 ≡ 0 mod 2.

Next, we define evaluations of a polynomial.

Definition 5.1.2. Given a polynomial F (X ) ∈ Fq [X ] and α ∈ Fq , the evaluation of F (X ) at α,

Pdeg F
denoted F (α) is i =0 f i αi . Note that F (α) ∈ Fq .2

Finally, polynomials don’t have multiplicative inverses, but one can divide polynomials by
each other and get quotients and residues. The following proposition defines this notion and
states some basic properties.

Proposition 5.1.3 (Polynomial Division). Given polynomial f (X ), g (X ) ∈ Fq [X ] there exist unique

polynomials q(X ), the quotient, and r (X ), the remainder, with deg(r ) < deg(g ) such that f (X ) =
q(X )g (X ) + r (X ). If g (X ) = X − α for α ∈ Fq , then r (X ) is the degree 0 polynomial f (α), i.e., the
evaluation of f at α.

Definition 5.1.4. α ∈ Fq is a root of a polynomial F (X ) if F (α) = 0.

1
This will be a good time to remember that operations over a finite field are much different from operations over
integers/reals. For example, over reals/integers X + (X + 1) = 2X + 1.
2
While this definition requires the coefficients of F and α to come from the same field, it also extends naturally
to the case where one of these is from a field FQ extending Fq . Since Fq ⊆ FQ , if α ∈ Fq and F (X ) ∈ FQ [X ] then the
evaluation is well-defined since α ∈ FQ . If F (X ) ∈ Fq [X ] then we use the fact that Fq [X ] ⊆ FQ [X ] to get a definition
of F (α). In both cases F (α) ∈ FQ .

94
For instance, 1 is a root of 1 + X 2 over F2 .
We now state a basic property of polynomials, the “Degree Mantra”, that will be crucial to our
use of polynomials to build error-correcting codes. We also introduce the notion of irreducible
polynomials whose existence is closely related to the existence of finite fields of prime power
size. Finally, motivated by the need to make fields and field operations fully constructive, we
briefly remark on the construction of irreducible polynomials.

Proposition 5.1.5 (“Degree Mantra"). A nonzero polynomial f (X ) of degree t over a field Fq has
at most t distinct roots in Fq .

Proof. We will prove the theorem by induction on t . If t = 0, we are done. Now, consider f (X ) of
degree t > 0. If f has no roots then we are done, else let α ∈ Fq be a root of f . Let g (X ) = X − α.
By the fundamental rule of division of polynomials (Proposition 5.1.3) we have that f (X ) = (X −
α)q(X ) + f (α) = (X − α)q(X ). It follows that the degree of q(X ) satisfies deg( f ) = 1 + deg(q), and
thus deg(q) = t −1. Note further that if β 6= α is a root of f then we have that q(α) = f (β)·(β−α)−1
and so β is also a root of q. By induction we have that q has at most t − 1 roots, and this f has at
most t distinct roots (the at most t − 1 roots of q plus the root at α).

The codes we will construct in this chapter do not need any more algebra, except to describe
the finite fields that they work over. To understand finite fields beyond those of prime size, we
now describe some more basic properties of polynomials.

5.1.1 Irreducibility and Field Extensions

We will start with a special class of polynomials, called irreducible polynomials, which are anal-
ogous to how prime numbers are special for natural numbers.

Definition 5.1.6. A polynomial F (X ) is irreducible if for every G 1 (X ),G 2 (X ) such that F (X ) =

G 1 (X )G 2 (X ), we have min(deg(G 1 ), deg(G 2 )) = 0

For example, 1 + X 2 is not irreducible over F2 , as

(1 + X )(1 + X ) = 1 + X 2 .

However, 1 + X + X 2 is irreducible, since its non-trivial factors have to be from the linear terms
X or X + 1. However, it can be checked that neither is a factor of 1 + X + X 2 . (In fact, one can
show that 1 + X + X 2 is the only irreducible polynomial of degree 2 over F2 – see Exercise 5.4.)
A word of caution: if a polynomial E (X ) ∈ Fq [X ] has no root in Fq , it does not mean that E (X )
is irreducible. For example consider the polynomial (1 + X + X 2 )2 over F2 – it does not have any
root in F2 but it obviously is not irreducible.
The main reason we consider irreducibility of polynomials in this book is that irreducible
polynomials lead us to non-prime fields. Just as the set of integers modulo a prime is a field,
so is the set of polynomials modulo an irreducible polynomial, and these fields can have non-
prime size. We start by first asserting that they form a field; and then turn to properties such as
size later.

95
Theorem 5.1.7. Let E (X ) be an irreducible polynomial of degree s ≥ 2 over Fp , p prime. Then the
set of polynomials in Fp [X ] modulo E (X ), denoted by Fp [X ]/E (X ), is a field.

The proof of the theorem above is similar to the proof of Lemma 2.1.4, so we only sketch the
proof here. In particular, we will explicitly state the basic tenets of Fp [X ]/E (X ).

• Elements are polynomials in Fp [X ] of degree at most s − 1. Note that there are p s such
polynomials.

• Addition: (F (X ) +G(X )) mod E (X ) = F (X ) mod E (X ) + G(X ) mod E (X ) = F (X ) + G(X ).

(Since F (X ) and G(X ) are of degree at most s − 1, addition modulo E (X ) is just plain poly-
nomial addition.)

• Multiplication: (F (X ) · G(X )) mod E (X ) is the unique polynomial R(X ) with degree at

most s − 1 such that for some A(X ), R(X ) + A(X )E (X ) = F (X ) · G(X )

• The additive identity is the zero polynomial, and the additive inverse of any element F (X )
is −F (X ).

• The multiplicative identity is the constant polynomial 1. It can be shown that for every
element F (X ), there exists a unique multiplicative inverse (F (X ))−1 .

For example, for p = 2 and E (X ) = 1 + X + X 2 , F2 [X ]/(1 + X + X 2 ) has as its elements

{0, 1, X , 1 + X }.

The additive inverse of any element in F2 [X ]/(1 + X + X 2 ) is the element itself while the multi-
plicative inverses of
1, X and 1 + X
in F2 [X ]/(1 + X + X 2 ) are
1, 1 + X and X
respectively.
Next we turn to the size of the field Fq [x]/E (X ) for an irreducible polynomial E .

Lemma 5.1.8. Let E (x) ∈ Fq [x] be an irreducible polynomial of degree s. Then Fq [x]/E (x) is a
field of size q s .

Proof. This follows from the fact that the elements of Fq [x]/E (x) are in one to one correspon-
dence with set of remainders of all polynomials in Fq [X ] when divided by E (X ) which in turn is
simply the set of all polynomials of degree less than s. The number of such polynomials equals
q s (there are q possibilities for the coefficient of X i for every 0 ≤ i < s).

Thus a natural question to ask is if an irreducible polynomials exist for every degree. Indeed,
they do. The following theorem asserts this and the reader may find a proof in Appendix D.

96
Theorem 5.1.9. For all s ≥ 2 and Fp , there exists an irreducible
µ polynomial of degree s over Fp . In
s¶
p
fact, the number of such monic irreducible polynomials is Θ .
s
The result is true even for general finite fields Fq and not just prime fields but we stated the
version over prime fields for simplicity.
Now recall that Theorem 2.1.5 states that for every prime power p s , there is a unique field
Fp s . This along with Theorems 5.1.7, Lemma 5.1.8 and 5.1.9 imply that:

Corollary 5.1.10. The field Fp s is Fp [X ]/E (X ), where E (X ) is an irreducible polynomial of degree

The facts about irreducible polynomials listed above give sufficient information not only
to determine when finite fields exist, but also how to represent them so as to be able to add,
multiply or invert elements, given an irreducible polynomial of degree s over Fp . To make our
ability to work with fields completely algorithmic we need one more ingredient — one that
allows us to find an irreducible polynomial of degree s in Fp fast. We now turn to this question.

5.1.2 Finding Irreducible Polynomials

Given any monic 3 polynomial E (X ) of degree s, it can be verified whether it is an irreducible
s
polynomial by checking if gcd(E (X ), X q −X ) = E (X ), where gcd(F (X ),G(X )) is the greatest com-
mon denominator (or factor) of polynomials F (X ) and G(X ). This is true as e