Web Coding Book
1 Department of Computer Science and Engineering, University at Buffalo, SUNY. Work supported by NSF CAREER grant CCF-0844796.
Foreword
This book is based on lecture notes from coding theory courses taught by Venkatesan Gu-
ruswami at the University of Washington and CMU; by Atri Rudra at the University at Buffalo, SUNY;
and by Madhu Sudan at Harvard and MIT.
This version is dated January 31, 2022. For the latest version, please go to
http://www.cse.buffalo.edu/faculty/atri/courses/coding-theory/book/
The material in this book is supported in part by the National Science Foundation under CA-
REER grant CCF-0844796. Any opinions, findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation (NSF).
Contents
I The Basics 17
1 The Fundamental Question 19
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2 Some definitions and codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3 Error correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 Distance of a code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.5 Hamming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.6 Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.7 Generalized Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.8 Family of codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
II The Combinatorics 79
4 What Can and Cannot Be Done-I 81
4.1 Asymptotic Version of the Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Gilbert-Varshamov Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Singleton Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4 Plotkin Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7 Bridging the Gap Between Shannon and Hamming: List Decoding 133
7.1 Hamming versus Shannon: part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 List Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3 Johnson Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 List-Decoding Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.5 List Decoding from Random Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.4 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14.4 The Outer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14.5 Discussion and Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
20.2 Avoiding Hash Collisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
20.3 Almost Universal Hash Function Families and Codes . . . . . . . . . . . . . . . . . . 356
20.4 Data Possession Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
20.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
C.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
List of Figures
1.1 Decoding for Akash English, one gets “I need little little (trail)mix." . . . . . . . . . 19
1.2 Coding process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3 Bad example for unique decoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.4 Illustration for proof of Hamming Bound . . . . . . . . . . . . . . . . . . . . . . . . . 34
12.1 The 2 × 2 Basic Polarizing Transform. Included in red are the conditional entropies of the variables.
12.2 The n × n Basic Polarizing Transform defined as P n (Z) = P n (U, V) = (P n/2 (U + V), P n/2 (V)).
12.3 Block structure of the Basic Polarizing Transform. Circled are a block at the 2nd level and two 2nd level blocks.
16.1 The recursive construction of C k . The final code C̃ k is also shown. . . . . . . . . . . 288
21.1 The minutiae are unordered and form a set, not a vector. . . . . . . . . . . . . . . . 365
22.1 Pick a subset S (not necessarily contiguous). Then pick a column j that is not present in S.
22.2 Construction of the final matrix M C ∗ from M C out and M C in from Example 22.4.3.
E.1 Relationship between entropy, joint entropy, conditional entropy, and mutual information for two random variables.
List of Tables
3.1 Uniform distribution over F_2^{2×2} along with values of four random variables. . . . . 64
10.1 Strongly explicit binary codes that we have seen so far. . . . . . . . . . . . . . . . . . 173
List of Algorithms
34 Decompression Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
35 Decompression Algorithm Using List Decoding . . . . . . . . . . . . . . . . . . . . . 360
36 UNLOCK 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
37 LOCK 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
38 UNLOCK 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
39 Decoder for Separable Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
40 Naive Decoder for Disjunct Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
41 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
42 Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
43 Report Heavy Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
44 Simple Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
45 Sampling algorithm for GapHamming . . . . . . . . . . . . . . . . . . . . . . . . . . 435
46 An average-case algorithm for GapHamming . . . . . . . . . . . . . . . . . . . . . . 436
47 Exponential time algorithm for MaxLinearEq . . . . . . . . . . . . . . . . . . . . . 437
48 Reduction from MaxCut to MaxLinearEq . . . . . . . . . . . . . . . . . . . . . . . 440
49 Root-Find (Fq , f ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
50 Linear-Root-Find (Fq , g ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
Part I
The Basics
Chapter 1

The Fundamental Question
1.1 Overview
Communication is a fundamental need of our modern lives. In fact, communication is some-
thing that humans have been doing for a long time. For simplicity, let us restrict ourselves to
English. It is quite remarkable that different people speaking English can be understood pretty
well, even if, e.g., the speaker has an accent. This is because English has some built-in redun-
dancy, which allows for “errors" to be tolerated. We will pick an example from one of the au-
thor’s experiences conversing with his two year old son, Akash. When Akash started to speak
his own version of English, which we will dub “Akash English," we got examples such as the one
illustrated below:
Figure 1.1: Decoding for Akash English, one gets “I need little little (trail)mix."
With some practice Akash’s parents were able to “decode" what Akash really meant. In fact,
Akash could communicate even if he did not say an entire word properly and gobbled up part(s)
of word(s).
The above example shows that having redundancy in a language allows for communication
even in the presence of (small amounts of) differences and errors. Of course in our modern
digital world, all kinds of entities communicate (and most of the entities do not communicate
in English or any natural language for that matter). Errors are also present in the digital world,
so these digital communications also use redundancy.
Error-correcting codes (henceforth, just codes) are clever ways of representing data so that
one can recover the original information even if parts of it are corrupted. The basic idea is to
judiciously introduce redundancy so that the original information can be recovered even when
parts of the (redundant) data have been corrupted.
For example, when packets are transmitted over the Internet, some of the packets get cor-
rupted or dropped. Packet drops are resolved by the TCP layer by a combination of sequence
numbers and ACKs. To deal with data corruption, multiple layers of the TCP/IP stack use a form
of error correction called CRC Checksum [102]. From a theoretical point of view, the checksum
is a terrible code (for that matter so is English). However, on the Internet, the current dominant
mode of operation is to detect errors and if errors have occurred, then ask for retransmission.
This is the reason why the use of checksum has been hugely successful in the Internet. However,
there are other communication applications where re-transmission is not an option. Codes are
used when transmitting data over the telephone line or via cell phones. They are also used in
deep space communication and in satellite broadcast (for example, TV signals are transmitted
via satellite). Indeed, asking the Mars Rover to re-send an image just because it got corrupted
during transmission is not an option–this is the reason that for such applications, the codes
used have always been very sophisticated.
Codes also have applications in areas not directly related to communication. In particu-
lar, in the applications above, we want to communicate over space. Codes can also be used to
communicate over time. For example, codes are used heavily in data storage. CDs and DVDs
work fine even in presence of scratches precisely because they use codes. Codes are used in Re-
dundant Array of Inexpensive Disks (RAID) [21] and error correcting memory [20]. Sometimes,
in the Blue Screen of Death displayed by Microsoft Windows family of operating systems, you
might see a line saying something along the lines of “parity check failed"–this happens when
the code used in the error-correcting memory cannot recover from error(s). Also, certain con-
sumers of memory, e.g. banks, do not want to suffer from even one bit flipping (this e.g. could
mean someone’s bank balance either got halved or doubled–neither of which is welcome1).
Codes are also deployed in other applications such as paper bar codes; for example, the bar
code used by UPS called MaxiCode [19]. Unlike the Internet example, in all of these applica-
tions, there is no scope for “re-transmission."
In this book, we will mainly think of codes in the communication scenario. In this frame-
work, there is a sender who wants to send (say) k message symbols over a noisy channel. The
1 This is a bit tongue-in-cheek: in real life banks have more mechanisms to prevent one bit flip from wreaking havoc.
sender first encodes the k message symbols into n symbols (called a codeword) and then sends
it over the channel. The receiver gets a received word consisting of n symbols. The receiver then
tries to decode and recover the original k message symbols. Thus, encoding is the process of
adding redundancy and decoding is the process of removing errors.
Unless mentioned otherwise, in this book we will make the following assumption:
The sender and the receiver only communicate via the channel.a In other words, other than
some setup information about the code, the sender and the receiver do not have any other
information exchange (other than of course what was transmitted over the channel). In
particular, no message is more likely to be transmitted than another.
a The scenario where the sender and receiver have a “side-channel" is an interesting topic that has been studied but is outside the scope of this book.
The fundamental question that will occupy our attention for almost the entire book is the
tradeoff between the amount of redundancy used and the number of errors that can be cor-
rected by a code. In particular, we would like to understand:
Question 1.1.1. How much redundancy do we need to correct a given amount of errors? (We
would like to correct as many errors as possible with as little redundancy as possible.)
Intuitively, maximizing error correction and minimizing redundancy are contradictory goals:
a code with higher redundancy should be able to tolerate a larger number of errors. By the end of
this chapter, we will see a formalization of this question.
Once we determine the optimal tradeoff, we will be interested in achieving this optimal
tradeoff with codes that come equipped with efficient encoding and decoding. (A DVD player
that tells its consumer that it will recover from a scratch on a DVD by tomorrow is not exactly
going to be a best-seller.) In this book, we will primarily define efficient algorithms to be ones
that run in polynomial time.2
1.2 Some definitions and codes

Definition 1.2.1 (Code). A code of block length n over an alphabet Σ is a subset of Σn . Typically,
we will use q to denote |Σ|.3
Remark 1.2.2. We note that the ambient space Σn can be viewed as a set of sequences, vectors or
functions. In other words, we can think of a vector (v 1 , . . . , v n ) ∈ Σn as just the sequence v 1 , . . . , v n
2 We are not claiming that this is the correct notion of efficiency in practice. However, we believe that it is a good definition as the “first cut"– quadratic or cubic time algorithms are definitely more desirable than exponential time algorithms: see Section C.4 for more on this.
3 Note that q need not be a constant and can depend on n: we’ll see codes in this book where this is true.
(in order) or a vector tuple (v 1 , . . . , v n ) or as the function f : [n] → Σ such that f (i ) = v i . Sequences
assume least structure on Σ and hence are most generic. Vectors work well when Σ has some struc-
ture (and in particular is what is known as a field, which we will see in the next chapter). Functional
representation will be convenient when the set of coordinates has structure (e.g., [n] may come
from a finite field of size n). For now, however, the exact representation does not matter and the
reader can work with representation as sequences.
We will also frequently use the following alternate way of looking at a code. Given a code
C ⊆ Σn , with |C | = M , we will think of C as a mapping of the following form:
C : [M ] → Σn .
In the above, we have used the notation [M ] for any integer M ≥ 1 to denote the set {1, 2, . . . , M }.
We will also need the notion of dimension of a code.
Definition 1.2.3 (Dimension of a code). Given a code C ⊆ Σn , its dimension is defined as

k = logq |C |.
Let us begin by looking at two specific codes. Both codes are defined over Σ = {0, 1} (also
known as binary codes). In both cases |C | = 2^4 = 16 and we will think of each of the 16 messages as a
4 bit vector.
We first look at the so-called parity code, which we will denote by C ⊕ . Given a message
(x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 , its corresponding codeword is given by
C ⊕ (x 1 , x 2 , x 3 , x 4 ) = (x 1 , x 2 , x 3 , x 4 , x 1 ⊕ x 2 ⊕ x 3 ⊕ x 4 ),
where the ⊕ denotes the EXOR (also known as the XOR or Exclusive-OR) operator. In other
words, the parity code appends the parity of the message bits (or takes the remainder of the
sum of the message bits when divided by 2) at the end of the message. Note that such a code
uses the minimum amount of non-zero redundancy.
The second code we will look at is the so-called repetition code. This is a very natural code
(and perhaps the first code one might think of). The idea is to repeat every message bit a fixed
number of times. For example, we repeat each of the 4 message bits 3 times and we use C 3,r ep
to denote this code.
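To make the two encodings concrete, here is a small Python sketch (not from the book; the function names are illustrative only) of the maps C ⊕ and C 3,r ep :

    def encode_parity(msg):
        # C_plus: append the parity (XOR) of the 4 message bits.
        assert len(msg) == 4 and all(b in (0, 1) for b in msg)
        return list(msg) + [msg[0] ^ msg[1] ^ msg[2] ^ msg[3]]

    def encode_repetition(msg, reps=3):
        # C_{3,rep}: repeat every message bit `reps` times.
        return [b for b in msg for _ in range(reps)]

    print(encode_parity([1, 0, 1, 1]))      # [1, 0, 1, 1, 1]
    print(encode_repetition([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]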
Let us now try to look at the tradeoff between the amount of redundancy and the number of
errors each of these codes can correct. Even before we begin to answer the question, we need
to define how we are going to measure the amount of redundancy. One natural way to define
redundancy for a code with dimension k and block length n is by their difference n − k. By this
definition, the parity code uses the least amount of redundancy. However, one “pitfall" of such
a definition is that it does not distinguish between a code with k = 100 and n = 102 and another
code with dimension and block length 2 and 4, respectively. Intuitively, the latter code is using
more redundancy. This motivates the following notion of measuring redundancy.
Definition 1.2.4 (Rate of a code). The rate of a code with dimension k and block length n is given
by
R = k/n.
Note that the higher the rate, the lesser the amount of redundancy in the code. Also note that
as k ≤ n, R ≤ 1.4 Intuitively, the rate of a code is the average amount of real information in each
of the n symbols transmitted over the channel. So, in some sense, rate captures the complement
of redundancy. However, for historical reasons, we will deal with the rate R (instead of the more
obvious 1−R) as our notion of redundancy. Given the above definition, C ⊕ and C 3,r ep have rates
of 4/5 and 1/3, respectively. As was to be expected, the parity code has a higher rate than the repetition code.
We have formalized the notion of redundancy as the rate of a code as well as other param-
eters of a code. However, to formalize Question 1.1.1, we still need to formally define what it
means to correct errors. We do so next.
1.3 Error correction

Next we move to error correction. Intuitively, we can correct a received word if we can re-
cover the transmitted codeword (or equivalently the corresponding message). This “reverse"
process is called decoding.
The definition of a decoding function by itself does not give anything interesting. What we
really need from a decoding function is that it recovers the transmitted message. To understand
this notion, we first need to understand what is the nature of errors that we aim to tackle. In
particular, if a transmitter transmits u ∈ Σn and the receiver receives v ∈ Σn , how do we quantify
the amount of “error” that has happened during this transmission? While multiple notions are
possible, the most central one, and the one we will focus on for most of this book is based on
“Hamming distance”, a notion of distance that captures how close are two given sequences u
and v.
Definition 1.3.3 (Hamming distance). Given two vectors u, v ∈ Σn the Hamming distance be-
tween u and v, denoted by ∆(u, v), is the number of positions in which u and v differ.
4 Further, in this book, we will always consider the case k > 0 and n < ∞ and hence, we can also assume that R > 0.
The Hamming distance is a distance in a very formal mathematical sense: see Exercise 1.5.
Note that the definition of Hamming distance just depends on the number of differences and
not the nature of the difference. For example, consider the vectors u = 00000 and v = 10001.
One can see that their Hamming distance is ∆(u, v) = 2. Now consider the vector w = 01010.
Note that even though v ≠ w, we have that the Hamming distance ∆(u, w) = 2.
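The Hamming distance is also easy to compute in code; the following one-liner (an illustrative sketch, not part of the book) can be used to check the example above:

    def hamming_distance(u, v):
        # Number of positions in which the equal-length vectors u and v differ.
        assert len(u) == len(v)
        return sum(1 for a, b in zip(u, v) if a != b)

    print(hamming_distance("00000", "10001"))  # 2
    print(hamming_distance("00000", "01010"))  # 2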
To return to the quantification of errors, from now on we will say that if u is transmitted and
v is received then ∆(u, v) errors occurred during transmission. This allows us to quantify the
performance of an encoding function, or equivalently the underlying code, as we do next.
Definition 1.3.4 (t -Error Channel). An n-symbol t -Error Channel over the alphabet Σ is a func-
tion Ch : Σn → Σn that satisfies ∆(v, Ch(v)) ≤ t for every v ∈ Σn .
Definition 1.3.5 (Error Correcting Code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C
is said to be a t -error-correcting code if there exists a decoding function D such that for every
message m ∈ [|C |] and every t -error channel Ch we have D (Ch(C (m))) = m.
Thus a t -error-correcting code is one where there is a decoding function that corrects any
pattern of t errors.
Figure 1.3 illustrates how the definitions we have examined so far interact.
m ↦ C (m) −→ Channel Ch −→ v = Ch(C (m)) ↦ m
We will also very briefly look at a weaker form of error recovery called error detection.
Definition 1.3.6 (Error detection code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C is said
to be a t -error-detecting code if there exists a detecting procedure D such that for every message m
and every received vector v ∈ Σn satisfying ∆(C (m), v) ≤ t , it holds that D outputs a 1 if v = C (m)
and 0 otherwise.
Thus a t -error-detecting code is one where if the transmission has at least one error and at
most t errors, then the decoding function detects the error (by outputting 0). Note that a t -error
correcting code is also a t -error detecting code (but not necessarily the other way round): see
Exercise 1.1. Although error detection might seem like a weak error recovery model, it is useful
in settings where the receiver can ask the sender to re-send the message. For example, error
detection is used quite heavily in the Internet.
Finally we also consider a more benign model of errors referred to as “erasures” where a
symbol is merely (and explicitly) omitted from the transmission (as opposed to being replaced by
some other symbol). To define this model we use a special symbol “?” that is not a member of
the alphabet Σ.
Definition 1.3.7 (t -Erasure Channel). An n-symbol t -Erasure Channel over the alphabet Σ is a
function Ch : Σn → (Σ ∪ {?})n that satisfies ∆(v, Ch(v)) ≤ t for every v ∈ Σn (where both arguments
to ∆(·, ·) are viewed as elements of (Σ ∪ {?})n ) and for every i ∈ [n] such that vi ≠ Ch(v)i we have
Ch(v)i =?.
A coordinate i such that Ch(v)i =? is called an erasure. We may now define erasure correct-
ing codes analogously to error-correcting codes.
Definition 1.3.8 (Erasure Correcting Code). Let C ⊆ Σn be a code and let t ≥ 1 be an integer. C
is said to be a t -erasure-correcting code if there exists a decoding function D such that for every
message m ∈ [|C |] and every t -erasure channel Ch we have D (Ch(C (m))) = m.
With the above definitions in place, we are now ready to look at the error correcting capa-
bilities of the codes we looked at in the previous section.
C ⊕ : q = 2, k = 4, n = 5, R = 4/5.
We will start with the repetition code. To study its error-correcting capabilities, we will con-
sider the following natural decoding function. Given a received word y ∈ {0, 1}12 , divide it up
into four consecutive blocks (y 1 , y 2 , y 3 , y 4 ) where every block consists of three bits. Then, for
every block y i (1 ≤ i ≤ 4), output the majority bit as the message bit. We claim this decoding
function can correct any error pattern with at most 1 error. (See Exercise 1.2.) For example, if a
block of 010 is received, since there are two 0’s we know the original message bit was 0. In other
words, we have argued the following:

Proposition 1.3.9. C 3,r ep is a 1-error-correcting code.
However, it is not too hard to see that C 3,r ep cannot correct two errors. For example, if both
of the errors happen in the same block and a block in the received word is 010, then the original
block in the codeword could have been either 111 or 000. Therefore in this case, no decoder can
successfully recover the transmitted message.5
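The majority-vote decoder described above can be sketched as follows (illustrative code, not from the book); note that on a block such as 010 it outputs 0, which is exactly the wrong answer when two errors hit a block whose original bit was 1:

    def decode_repetition(received, reps=3):
        # Majority vote within each block of `reps` consecutive bits.
        assert len(received) % reps == 0
        return [1 if sum(received[i:i + reps]) > reps // 2 else 0
                for i in range(0, len(received), reps)]

    # One error per block is corrected, e.g. the first block 010 decodes to 0:
    print(decode_repetition([0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]))  # [0, 0, 1, 1]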
Thus, we have pin-pointed the error-correcting capabilities of the C 3,r ep code: it can cor-
rect one error, but not two or more. However, note that the argument assumed that the error
positions can be located arbitrarily. In other words, we are assuming that the channel noise
behaves arbitrarily (subject to a bound on the total number of errors). Obviously, we can model
the noise differently. We now briefly digress to look at this issue in slightly more detail.
5 Recall we are assuming that the decoder has no side information about the transmitted message.
Digression: Channel Noise. As was mentioned above, until now we have been assuming the
following noise model, which was first studied by Hamming:
Any error pattern can occur during transmission as long as the total number of er-
rors is bounded. Note that this means that the location as well as the nature6 of the
errors is arbitrary.
We will frequently refer to Hamming’s model as the Adversarial Noise Model. It is important
to note that the atomic unit of error is a symbol from the alphabet. So for example, if the error
pattern is (1, 0, 1, 0, 0, 0) and we consider the alphabet to be {0, 1}, then the pattern has two errors.
However, if our alphabet is {0, 1}3 (i.e. we think of the vector above as ((1, 0, 1), (0, 0, 0)), with
(0, 0, 0) corresponding to the zero element in {0, 1}3 ), then the pattern has only one error. Thus,
by increasing the alphabet size we can also change the adversarial noise model. As the book
progresses, we will see how error correction over a larger alphabet is easier than error correction
over a smaller alphabet.
However, the above is not the only way to model noise. For example, we could also have
the following error model:
First note that, for the channel model above, no more than four errors can occur when a code-
word in C 3,r ep is transmitted. (Recall that in C 3,r ep , each of the four bits is repeated three times.)
Second, note that the decoding function that takes the majority vote of each block can suc-
cessfully recover the transmitted codeword for any error pattern, while in the worst-case noise
model it could only correct at most one error. This channel model is admittedly contrived, but
it illustrates the point that the error-correcting capabilities of a code (and a decoding function)
are crucially dependent on the noise model.
A popular alternate noise model is to model the channel as a stochastic process. As a con-
crete example, let us briefly mention the binary symmetric channel with crossover probability
0 ≤ p ≤ 1, denoted by BSCp , which was first studied by Shannon. In this model, when a (binary)
codeword is transferred through the channel, every bit flips independently with probability p.
Note that the two noise models proposed by Hamming and Shannon are in some sense two
extremes: Hamming’s model assumes no knowledge about the channel (except that a bound
on the total number of errors is known7 ) while Shannon’s noise model assumes complete knowl-
edge about how noise is produced. In this book, we will consider only these two extreme noise
models. In real life, the situation often is somewhere in between.
For real life applications, modeling the noise model correctly is an extremely important
task, as we can tailor our codes to the noise model at hand. However, in this book we will
not study this aspect of designing codes at all, and will instead mostly consider the worst-case
noise model. Intuitively, if one can communicate over the worst-case noise model, then one
6 For binary codes, there is only one kind of error: a bit flip. However, for codes over a larger alphabet, say {0, 1, 2}, 0 being converted to a 1 and 0 being converted into a 2 are both errors, but are different kinds of errors.
7 A bound on the total number of errors is necessary; otherwise, error correction would be impossible: see Exercise 1.3.
could use the same code to communicate over nearly every other noise model with the same
amount of noise.
We now return to C ⊕ and examine its error-correcting capabilities in the worst-case noise
model. We claim that C ⊕ cannot correct even one error. Suppose y = 10000 is the received word.
Then we know that an error has occurred, but we do not know which bit was flipped. This is
because the two codewords u = 00000 and v = 10001 differ from the received word y in exactly
one bit. As we are assuming that the receiver has no side information about the transmitted
codeword, no decoder can know what the transmitted codeword was.
Thus, from an error-correction point of view, C ⊕ is a terrible code (as it cannot correct even
1 error). However, we will now see that C ⊕ can detect one error. Consider Algorithm 1.

Algorithm 1 Error Detection for the Parity Code C ⊕
INPUT: Received word y = (y 1 , y 2 , y 3 , y 4 , y 5 ) ∈ {0, 1}5
OUTPUT: 1 if no error is detected and 0 otherwise

1: b ← y 1 ⊕ y 2 ⊕ y 3 ⊕ y 4 ⊕ y 5
2: RETURN 1 ⊕ b ⊲ If there is no error, then b = 0 and hence we need to "flip" the bit for the answer
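A direct translation of Algorithm 1 into Python (a sketch; following Definition 1.3.6, it returns 1 when no error is detected and 0 otherwise) looks as follows:

    def detect_parity(y):
        # Error detector for the parity code C_plus on 5-bit received words.
        assert len(y) == 5
        b = y[0] ^ y[1] ^ y[2] ^ y[3] ^ y[4]  # b = 0 exactly when y has even parity
        return 1 ^ b

    print(detect_parity([0, 0, 0, 0, 0]))  # 1 (a codeword of C_plus)
    print(detect_parity([1, 0, 0, 0, 0]))  # 0 (an error is detected)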
Proposition 1.3.10. The parity code C ⊕ can detect an odd number of errors.
Let us now revisit the example that showed that one cannot correct one error using C ⊕ .
Recall, we considered two codewords in C ⊕ , u = 00000 and v = 10001 (which are codewords
corresponding to messages 0000 and 1000, respectively). Now consider the scenarios in which
u and v are each transmitted and a single error occurs resulting in the received word r = 10000.
Thus, given the received word r and the fact that at most one error can occur, the decoder has
no way of knowing whether the original transmitted codeword was u or v. Looking back at the
example, it is clear that the decoder is “confused" because the two codewords u and v do not
differ in many positions. This notion is formalized in the next section.
1.4 Distance of a code

Definition 1.4.1 (Minimum distance). Let C ⊆ Σn . The minimum distance (or just distance) of
C , denoted ∆(C ), is defined to be

∆(C ) = min_{c1 ≠ c2 ∈ C} ∆(c1 , c2 ).
In other words, ∆(C ) is the minimum distance between two distinct codewords in C . It is
easy to check that the repetition code C 3,r ep has distance 3. Indeed, any two distinct messages
will differ in at least one of the message bits. After encoding, the difference in one message bit
will translate into a difference of three bits in the corresponding codewords. We now claim that
the distance of C ⊕ is 2. This is a consequence of the following observations. If two messages
m1 and m2 differ in at least two places then ∆(C ⊕ (m1 ),C ⊕ (m2 )) ≥ 2 (even if we just ignored the
parity bits). If two messages differ in exactly one place then the parity bits in the correspond-
ing codewords are different which implies a Hamming distance of 2 between the codewords.
Thus, C ⊕ has smaller distance than C 3,r ep and can correct fewer errors than C 3,r ep .
This suggests that a larger distance implies greater error-correcting capabilities. The next result
formalizes this intuition. Recall the milder notion of corruption called "erasures" from Definition 1.3.7; as we will see, the minimum distance exactly captures both the ability to recover from errors and the ability to recover from erasures.

Proposition 1.4.2. The following statements are equivalent for a code C :

1. C has minimum distance d ≥ 2.

2. If d is odd, C can correct (d − 1)/2 errors.

3. C can detect d − 1 errors.

4. C can correct d − 1 erasures.
Remark 1.4.3. Property (2) above for even d is slightly different. In this case, one can correct up
to d2 − 1 errors but cannot correct d2 errors. (See Exercise 1.6.)
Before we prove Proposition 1.4.2, let us apply it to the codes C ⊕ and C 3,r ep which have
distances of 2 and 3 respectively. Proposition 1.4.2 implies the following facts that we have
already proved:

• C 3,r ep can correct 1 error.

• C ⊕ can detect 1 error.
The proof of Proposition 1.4.2 will need the following decoding function. Maximum like-
lihood decoding (MLD) is a well-studied decoding method for error correcting codes, which
outputs the codeword closest to the received word in Hamming distance (with ties broken arbi-
trarily). More formally, the MLD function denoted by D M LD : Σn → C is defined as follows. For
every y ∈ Σn ,
D MLD (y) = arg min_{c∈C } ∆(c, y).
Algorithm 2 is a naive implementation of the MLD.
Algorithm 2 Naive Maximum Likelihood Decoder
INPUT: Received word y ∈ Σn
OUTPUT: D MLD (y)
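Since the body of Algorithm 2 is simply an exhaustive search over C , one possible rendering in Python is the following sketch (it assumes the code is given as an explicit list of codewords, which is exactly what makes the decoder naive):

    def mld(received, codewords):
        # Return the codeword closest to `received` in Hamming distance
        # (ties broken by the order in which the codewords are listed).
        def dist(u, v):
            return sum(1 for a, b in zip(u, v) if a != b)
        return min(codewords, key=lambda c: dist(c, received))

    # With the two codewords 000 and 111, the received word 010 decodes to 000:
    print(mld("010", ["000", "111"]))  # 000

This runs in time proportional to |C | · n, which for a code of dimension k is exponential in k.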
Proof of Proposition 1.4.2 We will complete the proof in two steps. First, we will show that if
property 1 is satisfied then so are properties 2,3 and 4. Then we show that if property 1 is not
satisfied then none of properties 2,3 or 4 hold.
1. implies 2. Assume C has distance d . We first prove 2 (for this case assume that d = 2t + 1).
We now need to show that there exists a decoding function such that for all error patterns with
at most t errors it always outputs the transmitted message. We claim that the MLD function
has this property. Assume this is not so and let c1 be the transmitted codeword and let y be the
received word. Note that
∆(y, c1 ) ≤ t . (1.1)
As we have assumed that MLD does not work, D MLD (y) = c2 ≠ c1 . Note that by the definition of
MLD,
∆(y, c2 ) ≤ ∆(y, c1 ). (1.2)
Consider the following set of inequalities:

∆(c2 , c1 ) ≤ ∆(c2 , y) + ∆(y, c1 )      (1.3)
          ≤ 2∆(y, c1 )                  (1.4)
          ≤ 2t                          (1.5)
          = d − 1,                      (1.6)
where (1.3) follows from the triangle inequality (see Exercise 1.5), (1.4) follows from (1.2) and
(1.5) follows from (1.1). (1.6) implies that the distance of C is at most d − 1, which is a contra-
diction.
1. implies 3. We now show that property 3 holds, that is, we need to describe an algorithm
that can successfully detect whether errors have occurred during transmission (as long as the
total number of errors is bounded by d − 1). Consider the following error detection algorithm:
check if the received word y = c for some c ∈ C (this can be done via an exhaustive check). If
no errors occurred during transmission, y = c1 , where c1 was the transmitted codeword and the
algorithm above will accept (as it should). On the other hand if 1 ≤ ∆(y, c1 ) ≤ d − 1, then by the
fact that the distance of C is d , y 6∈ C and hence the algorithm rejects, as required.
1. implies 4. Finally, we prove that property 4 holds. Let y ∈ (Σ ∪ {?})n be the received word.
First we claim that there is a unique c = (c 1 , . . . , c n ) ∈ C that agrees with y (i.e. y i = c i for ev-
ery i such that y i ≠ ?). (For the sake of contradiction, assume that this is not true, i.e. there
exist two distinct codewords c1 , c2 ∈ C such that both c1 and c2 agree with y in the unerased
positions. Note that this implies that c1 and c2 agree in the positions i such that y i ≠ ?. Thus,
∆(c1 , c2 ) ≤ |{i |y i = ?}| ≤ d − 1, which contradicts the assumption that C has distance d .) Given
the uniqueness of the codeword c ∈ C that agrees with y in the unerased positions, an algorithm
to find c is as follows: go through all the codewords in C and output the desired codeword.
¬1. implies ¬2. For the other direction of the proof, assume that property 1 does not hold,
that is, C has distance d − 1. We now show that property 2 cannot hold: i.e., for every decoding
function there exists a transmitted codeword c1 and a received word y (where ∆(y, c1 ) ≤ (d −1)/2)
such that the decoding function cannot output c1 . Let c1 ≠ c2 ∈ C be codewords such that
∆(c1 , c2 ) = d − 1 (such a pair exists as C has distance d − 1). Now consider a vector y such that
∆(y, c1 ) = ∆(y, c2 ) = (d − 1)/2. Such a y exists as d is odd and by the choice of c1 and c2 . Below is
an illustration of such a y (matching color implies that the vectors agree on those positions):
[Illustration: c1 , c2 and y are split into three blocks of lengths n − d + 1, (d − 1)/2 and (d − 1)/2; y agrees with both c1 and c2 on the first block, with c1 on the second block, and with c2 on the third block.]
Now, since y could have been generated if either of c1 or c2 were the transmitted codeword,
no decoding function can work in this case.8
¬1. implies ¬3. For the remainder of the proof, assume that the transmitted word is c1 and
there exists another codeword c2 such that ∆(c2 , c1 ) = d − 1. To see why property 3 is not true,
let y = c2 . In this case, either the error detecting algorithm detects no error, or it declares an
error when c2 is the transmitted codeword and no error takes place during transmission.
8 Note that this argument is just a generalization of the argument that C ⊕ cannot correct 1 error.
¬1. implies ¬4. We finally argue that property 4 does not hold. Let y be the received word in
which the positions that are erased are exactly those where c1 and c2 differ. Thus, given y both
c1 and c2 could have been the transmitted codeword, and no algorithm for correcting (at most
d − 1) erasures can work in this case. ■
Proposition 1.4.2 implies that Question 1.1.1 can be reframed as
Question 1.4.1. What is the largest rate R that a code with distance d can have?
We have seen that the repetition code C 3,r ep has distance 3 and rate 1/3. A natural follow-up
question (which is a special case of Question 1.4.1) is to ask
Question 1.4.2. Can we have a code with distance 3 and rate R > 1/3 ?
1.5 Hamming Code

The following code, which we will show answers Question 1.4.2 in the affirmative, was proposed by Hamming. It maps a message (x 1 , x 2 , x 3 , x 4 ) ∈ {0, 1}4 to the codeword

C H (x 1 , x 2 , x 3 , x 4 ) = (x 1 , x 2 , x 3 , x 4 , x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕ x 4 ).
C H : q = 2, k = 4, n = 7, R = 4/7.
We will show shortly that C H has a distance of 3. We would like to point out that we could
have picked the three parities differently. The reason we mention the three particular parities
above is due to historical reasons. We leave it as an exercise to define an alternate set of parities
such that the resulting code still has a distance of 3: see Exercise 1.9.
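For concreteness, here is an illustrative Python sketch (not from the book) of the encoding map of C H together with a brute-force computation of its distance over all 16 codewords; it confirms the distance of 3 established below:

    from itertools import product

    def encode_hamming(x):
        x1, x2, x3, x4 = x
        return (x1, x2, x3, x4, x2 ^ x3 ^ x4, x1 ^ x3 ^ x4, x1 ^ x2 ^ x4)

    def min_distance(code):
        # Minimum Hamming distance over all pairs of distinct codewords.
        return min(sum(a != b for a, b in zip(c1, c2))
                   for i, c1 in enumerate(code) for c2 in code[i + 1:])

    codewords = [encode_hamming(x) for x in product((0, 1), repeat=4)]
    print(min_distance(codewords))  # 3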
Before we move on to determining the distance of C H , we will need another definition.
Definition 1.5.1 (Hamming Weight). Let q ≥ 2. Given any vector v ∈ {0, 1, 2, . . . , q − 1}n , its Ham-
ming weight, denoted by w t (v), is the number of non-zero symbols in v.
Proposition 1.5.2. The code C H has a distance of 3.

Proof. We will prove the claimed property by using two properties of C H :

min_{c∈C H , c≠0} w t (c) = 3,      (1.7)

and

min_{c∈C H , c≠0} w t (c) = min_{c1 ≠c2 ∈C H } ∆(c1 , c2 ).      (1.8)
The proof of (1.7) follows from a case analysis on the Hamming weight of the message bits. Let
us use x = (x 1 , x 2 , x 3 , x 4 ) to denote the message vector.
• Case 0: If w t (x) = 0, then C H (x) = 0, which means we do not have to consider this code-
word.

• Case 1: If w t (x) = 1, then at least two of the three parity check bits are 1 (see Exercise 1.10), so w t (C H (x)) ≥ 1 + 2 = 3.

• Case 2: If w t (x) = 2, then at least one of the three parity check bits is 1 (see Exercise 1.11), so w t (C H (x)) ≥ 2 + 1 = 3.

• Case 3: If w t (x) ≥ 3, then w t (C H (x)) ≥ 3, since the message bits themselves appear in C H (x).

Thus, we can conclude that min_{c∈C H , c≠0} w t (c) ≥ 3. Further, note that w t (C H (1, 0, 0, 0)) = 3, which
along with the lower bound that we just obtained proves (1.7).
We now turn to the proof of (1.8). For the rest of the proof, let x = (x 1 , x 2 , x 3 , x 4 ) and y =
(y 1 , y 2 , y 3 , y 4 ) denote the two distinct messages. Using associativity and commutativity of the ⊕
operator, we obtain that
C H (x) +C H (y) = C H (x + y),
where the “+" operator is just the bit-wise ⊕ of the operand vectors9 . Further, it is easy to verify
that for two vectors u, v ∈ {0, 1}n , ∆(u, v) = w t (u + v) (see Exercise 1.12). Thus, we have
min_{c1 ≠c2 ∈C H } ∆(c1 , c2 ) = min_{x≠y∈{0,1}4 } w t (C H (x) +C H (y)) = min_{x≠0∈{0,1}4 } w t (C H (x)),

where the second equality follows from the observation that {x + y|x ≠ y ∈ {0, 1}n } = {x ∈ {0, 1}n |x ≠
0}. Recall that w t (C H (x)) = 0 if and only if x = 0 and this completes the proof of (1.8). Combining
(1.7) and (1.8), we conclude that C H has a distance of 3.
The second part of the proof could also be shown in the following manner. It can be verified
easily that the Hamming code is the set {x · G H |x ∈ {0, 1}4 }, where G H is the following matrix
9 E.g. (0, 1, 1, 0) + (1, 1, 1, 0) = (1, 0, 0, 0).
(where we think of x as a row vector).10

        1 0 0 0 0 1 1
G H =   0 1 0 0 1 0 1
        0 0 1 0 1 1 0
        0 0 0 1 1 1 1
In fact, any binary code (of dimension k and block length n) that is generated11 by a k × n
matrix is called a binary linear code. (Both C ⊕ and C 3,r ep are binary linear codes: see Exer-
cise 1.13.) This implies the following simple fact.
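The generator-matrix view can be checked numerically; the sketch below (illustrative, not from the book) computes the product x · G H over F2 and reproduces the parity-based description of C H :

    G_H = [
        [1, 0, 0, 0, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 1],
        [0, 0, 1, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 1, 1],
    ]

    def encode_by_matrix(x, G=G_H):
        # Vector-matrix product over F_2: each output bit is a parity of message bits.
        return tuple(sum(x[i] * G[i][j] for i in range(len(x))) % 2
                     for j in range(len(G[0])))

    print(encode_by_matrix((1, 0, 0, 0)))  # (1, 0, 0, 0, 0, 1, 1), i.e. C_H(1, 0, 0, 0)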
Lemma 1.5.3. For any binary linear code C and any two messages x and y, C (x)+C (y) = C (x+y).
Proof. For any binary linear code, we have a generator matrix G. The following sequence of
equalities (which follow from the distributivity and associativity properties of the Boolean EXOR
and AND operators) proves the lemma.
C (x) +C (y) = x · G + y · G
= (x + y) · G
= C (x + y)
We stress that in the lemma above, x and y need not be distinct. Note that due to the fact that
b ⊕b = 0 for every b ∈ {0, 1}, x+x = 0, which along with the lemma above implies that C (0) = 0.12
We can infer the following result from the above lemma and the arguments used to prove (1.8)
in the proof of Proposition 1.5.2.
Proposition 1.5.4. For any binary linear code, its minimum distance is equal to the minimum
Hamming weight of any non-zero codeword.
Thus, we have seen that C H has distance d = 3 and rate R = 4/7 while C 3,r ep has distance d = 3
and rate R = 1/3. Thus, the Hamming code is provably better than the repetition code (in terms
of the tradeoff between rate and distance) and thus, answers Question 1.4.2 in the affirmative.
The next natural question is
Question 1.5.1. Can we have a distance 3 code with a rate higher than that of C H ?
1.6 Hamming Bound
Now we switch gears to present our first tradeoff between redundancy (in the form of the di-
mension of a code) and its error-correction capability (in the form of its distance). In particular,
we will first prove a special case of the so-called Hamming bound for a distance of 3.
We begin with another definition.

Definition 1.6.1 (Hamming Ball). For any vector x ∈ [q]n and integer e ≥ 0,

B (x, e) = {y ∈ [q]n | ∆(x, y) ≤ e}.
Next, we prove an upper bound on the dimension of every code with distance 3.
Theorem 1.6.2 (Hamming bound for d = 3). Every binary code with block length n, dimension
k, distance d = 3 satisfies
k ≤ n − log2 (n + 1).
Proof. Given any two codewords, c1 ≠ c2 ∈ C , the following is true (as C has distance13 3):

B (c1 , 1) ∩ B (c2 , 1) = ∅.      (1.9)

See Figure 1.4 for an illustration.

Figure 1.4: Hamming balls of radius 1 are disjoint. The figure is technically not correct: the balls
above are actually balls in the Euclidean space, which is easier to visualize than the Hamming
space.

We also claim that for every x ∈ {0, 1}n ,

|B (x, 1)| = n + 1.      (1.10)

(See Exercise 1.16.)
Now consider the union of all Hamming balls centered around some codeword. Obviously, their
union is a subset of {0, 1}n . In other words,
|∪_{c∈C } B (c, 1)| ≤ 2^n .      (1.11)
On the other hand, since the balls B (c, 1) for c ∈ C are pairwise disjoint by (1.9), we have

|∪_{c∈C } B (c, 1)| = Σ_{c∈C } |B (c, 1)| = |C | · (n + 1)      (1.12)
                    = 2^k · (n + 1),      (1.13)

where (1.12) follows from (1.10) and (1.13) from the fact that C has dimension k. Combining (1.13)
and (1.11), we get
2^k (n + 1) ≤ 2^n ,
or equivalently
2^k ≤ 2^n / (n + 1).
Taking log2 of both sides we get the desired bound:
k ≤ n − log2 (n + 1).
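The bound is easy to evaluate numerically; a small sketch (illustrative) shows, for instance, that k = 4 is the largest dimension allowed for n = 7, so the Hamming code meets the bound with equality:

    import math

    def hamming_bound_d3(n):
        # Largest (possibly fractional) dimension allowed by the d = 3 Hamming bound.
        return n - math.log2(n + 1)

    print(hamming_bound_d3(7))   # 4.0
    print(hamming_bound_d3(15))  # 11.0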
Thus, Theorem 1.6.2 shows that for n = 7, C H has the largest possible dimension for any
binary code of block length 7 and distance 3 (as for n = 7, n − log2 (n + 1) = 4). In particular,
it also answers Question 1.5.1 for n = 7 in the negative. Next, we will present the general form of
Hamming bound.
1.7 Generalized Hamming Bound

Definition 1.7.1. A code C ⊆ Σn with dimension k and distance d will be called an (n, k, d )Σ code.
We will also refer to it as an (n, k, d )|Σ| code.
Theorem 1.7.2 (Hamming Bound for any d ). For every (n, k, d )q code
k ≤ n − logq ( Σ_{i=0}^{⌊(d−1)/2⌋} (n choose i) (q − 1)^i ).
Proof. The proof is a straightforward generalization of the proof of Theorem 1.6.2. For nota-
tional convenience, let e = ⌊(d − 1)/2⌋. Given any two codewords, c1 ≠ c2 ∈ C , the following is true
(as C has distance14 d ):
B (c1 , e) ∩ B (c2 , e) = ∅.      (1.14)
We claim that for all x ∈ [q]n ,

|B (x, e)| = Σ_{i=0}^{e} (n choose i) (q − 1)^i .      (1.15)
Indeed, any vector in B (x, e) must differ from x in exactly i positions for some 0 ≤ i ≤ e. In the
summation, (n choose i) is the number of ways of choosing the differing i positions and in each
such position, a vector can differ from x in q − 1 ways.
Now consider the union of all Hamming balls centered around some codeword. Obviously,
their union is a subset of [q]n . In other words,
|∪_{c∈C } B (c, e)| ≤ q^n .      (1.16)
On the other hand, since the balls B (c, e) for c ∈ C are pairwise disjoint by (1.14), we have

|∪_{c∈C } B (c, e)| = Σ_{c∈C } |B (c, e)| = q^k · Σ_{i=0}^{e} (n choose i) (q − 1)^i ,      (1.17)

where (1.17) follows from (1.15) and the fact that C has dimension k. Combining (1.17) and
(1.16) and taking logq of both sides we will get the desired bound:
k ≤ n − logq ( Σ_{i=0}^{e} (n choose i) (q − 1)^i ).
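The general bound can likewise be evaluated directly; the sketch below (illustrative, with e = ⌊(d − 1)/2⌋ as in the proof) computes the right-hand side of Theorem 1.7.2 for given n, d and q:

    import math
    from math import comb

    def hamming_bound(n, d, q):
        # Upper bound on the dimension k from the generalized Hamming bound.
        e = (d - 1) // 2
        volume = sum(comb(n, i) * (q - 1) ** i for i in range(e + 1))
        return n - math.log(volume, q)

    print(hamming_bound(7, 3, 2))   # 4.0  (met with equality by C_H)
    print(hamming_bound(15, 3, 2))  # 11.0 (met by the next Hamming code)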
Note that the Hamming bound gives a partial answer to Question 1.4.1. In particular, any
code of distance d can have rate R at most
1 − logq ( Σ_{i=0}^{e} (n choose i) (q − 1)^i ) / n.
Definition 1.7.3. Codes that meet the Hamming bound are called perfect codes.
14 Assume that y ∈ B (c1 , e) ∩ B (c2 , e), that is ∆(y, c1 ) ≤ e and ∆(y, c2 ) ≤ e. Thus, by the triangle inequality, ∆(c1 , c2 ) ≤ 2e ≤ d − 1, which is a contradiction.
Intuitively, a perfect code leads to the following perfect "packing": if one constructs Ham-
ming balls of radius ⌊(d − 1)/2⌋ around all the codewords, then we would cover the entire ambient
space, i.e. every possible vector will lie in one of these Hamming balls.
One example of a perfect code is the (7, 4, 3)2 Hamming code that we have seen in this chapter
(so is the family of general Hamming codes that we will see in the next chapter). A natural
question to ask is if
Question 1.7.1. Other than the Hamming codes, are there any other perfect (binary) codes?
1.8 Family of codes

Definition 1.8.1 (Code families, Rate and Distance). Let q ≥ 2. Let {n i }i ≥1 be an increasing se-
quence of block lengths and suppose there exist sequences {k i }i ≥1 and {d i }i ≥1 such that for all
i ≥ 1 there exists an (n i , k i , d i )q code C i . Then the sequence C = {C i }i ≥1 is a family of codes. The
rate of C is defined as

R(C ) = lim_{i→∞} k i /n i ,

and the relative distance of C is defined as

δ(C ) = lim_{i→∞} d i /n i .
For instance, we will shortly see that the Hamming code of Section 1.5 can be extended to an
entire family of codes C H = {C i }i ∈Z+ , with C i being an (n i , k i , d i )2 code with n i = 2^i − 1, k i =
2^i − i − 1 and d i = 3. Thus,

R(C H ) = lim_{i→∞} (1 − i /(2^i − 1)) = 1,

and

δ(C H ) = lim_{i→∞} 3/(2^i − 1) = 0.
15 In all codes we will study these limits will exist, but of course it is possible to construct families of codes where the limits do not exist.
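These limits can also be seen numerically; the following sketch (illustrative) prints the rate and relative distance of the first few members of the Hamming family:

    for i in range(3, 11):
        n = 2 ** i - 1
        k = 2 ** i - i - 1
        print(i, round(k / n, 4), round(3 / n, 4))
    # As i grows, the rate k/n approaches 1 while the relative distance 3/n approaches 0.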
A significant focus of this text from now on will be on families of codes. This is necessary
as we will study the asymptotic behavior of algorithms on codes, which does not make sense
for a fixed code. For example, when we say that a decoding algorithm for a code C takes O(n 2 )
time, we would be implicitly assuming that C is a family of codes and that the algorithm has
an O(n 2 ) running time when the block length is large enough. From now on, unless mentioned
otherwise, whenever we talk about a code, we will be implicitly assuming that we are talking
about a family of codes.
Given that we can only formally talk about asymptotic run time of algorithms, we now also
state our formal notion of efficient algorithms:
We will call an algorithm related to a code of block length n efficient if it runs in time
polynomial in n.
For all the specific codes that we will study in this book, the corresponding family of codes
will be a “family" in a more natural sense. By this we mean that all the specific codes in a family
of codes will be the “same" code except with different parameters. A bit more formally, we will
consider families {C i }i ≥1 , where given only the ‘index’ i , one can compute a sufficient descrip-
tion of C i efficiently.16
Finally, the definition of a family of codes allows us to present the final version of the big
motivating question for the book. The last formal version of the main question we considered
was Question 1.4.1, where we were interested in the tradeoff of rate R and distance d . The
comparison was somewhat unfair because R was a ratio while d was an integer. A more appro-
priate comparison should be between rate R and the relative distance δ. Further, we would be
interested in tackling the main motivating question for families of codes, which results in the
following final version:
Question 1.8.1. What is the optimal tradeoff between R(C ) and δ(C ) that can be achieved by
some code family C ?
A natural special case of Question 1.8.1 is whether the rate and relative distance of a family
of codes can be simultaneously positive. We formulate this special case as a separate question
below.
Question 1.8.2. Does there exist a family of codes C such that R(C ) > 0 and δ(C ) > 0 hold
simultaneously?
16 We stress that this is not always going to be the case. In particular, we will consider "random" codes where this efficient constructibility will not be true.
Codes that have the above property are called asymptotically good. For the curious reader,
we will present many asymptotically good codes in the rest of this book, though a priori the
existence of these is not immediate.
1.9 Exercises
Exercise 1.1. Show that any t -error correcting code is also t -error detecting but not necessarily
the other way around.
Exercise 1.3. Show that for every integer n, there is no code with block length n that can handle
an arbitrary number of errors.
Exercise 1.5. Show that the Hamming distance ∆(·, ·) satisfies the following properties of a metric. For every x, y, z ∈ Σn (writing d (·, ·) for ∆(·, ·)):

1. d (x, y) ≥ 0.

2. d (x, y) = 0 if and only if x = y.

3. d (x, y) = d (y, x).

4. d (x, z) ≤ d (x, y) + d (y, z). (This property is called the triangle inequality.)
Exercise 1.6. Let C be a code with distance d for even d . Then argue that C can correct up to
d /2 − 1 many errors but cannot correct d /2 errors. Using this or otherwise, argue that if a code C
is t -error correctable then it either has a distance of 2t + 1 or 2t + 2.
Exercise 1.7. In this exercise, we will see that one can convert arbitrary codes into codes with
slightly different parameters:
1. Let C be an (n, k, d )2 code with d odd. Then it can be converted into an (n + 1, k, d + 1)2
code.
Note: Other than the parameters of the code C , you should not assume anything else about the
code. Also your conversion should work for every n, k, d ≥ 1.
Exercise 1.8. In this problem we will consider a noise model that has both errors and erasures. In
particular, let C be an (n, k, d )Σ code. As usual a codeword c ∈ C is transmitted over the channel
and the received word is a vector y ∈ (Σ ∪ {?})n , where as before a ? denotes an erasure. We will use
s to denote the number of erasures in y and e to denote the number of (non-erasure) errors that
occurred during transmission. To decode such a vector means to output a codeword c ∈ C such
that the number of positions where c disagree with y in the n − s non-erased positions is at most
e. For the rest of the problem assume that
2e + s < d . (1.18)
1. Argue that the output of the decoder for any C under (1.18) is unique.
2. Let C be a binary code (but not necessarily linear). Assume that there exists a decoder D
that can correct from < d /2 many errors in T (n) time. Then under (1.18) one can perform
decoding in time O(T (n)).
Exercise 1.9. Define codes other than C H with k = 4, n = 7 and d = 3.
Hint: Refer to the proof of Proposition 1.5.2 to figure out the properties needed from the three parities.
Exercise 1.10. Argue that if w t (x) = 1 then at least two parity check bits in (x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕
x 4 , x 1 ⊕ x 3 ⊕ x 4 ) are 1.
Exercise 1.11. Argue that if w t (x) = 2 then at least one parity check bit in (x 2 ⊕ x 3 ⊕ x 4 , x 1 ⊕ x 2 ⊕
x 4 , x 1 ⊕ x 3 ⊕ x 4 ) is 1.
Exercise 1.12. Prove that for any u, v ∈ {0, 1}n , ∆(u, v) = w t (u + v).
Exercise 1.13. Argue that C ⊕ and C 3,r ep are binary linear codes.
Exercise 1.14. Let G be a generator matrix of an (n, k, d )2 binary linear code. Then G has at least
kd ones in it.
Exercise 1.15. Argue that in any binary linear code, either all codewords begin with a 0 or
exactly half of the codewords begin with a 0.
Exercise 1.16. Prove (1.10).
Exercise 1.17. Show that there is no binary code with block length 4 that achieves the Hamming
bound.
Exercise 1.18. (∗) There are n people in a room, each of whom is given a black/white hat chosen
uniformly at random (and independent of the choices of all other people). Each person can see
the hat color of all other people, but not their own. Each person is asked if (s)he wishes to guess
their own hat color. They can either guess, or abstain. Each person makes their choice without
knowledge of what the other people are doing. They either win collectively, or lose collectively.
They win if all the people who don’t abstain guess their hat color correctly and at least one person
does not abstain. They lose if all people abstain, or if some person guesses their color incorrectly.
Your goal below is to come up with a strategy that will allow the n people to win with pretty high
probability. We begin with a simple warmup:
(a) Argue that the n people can win with probability at least 1/2.
Next we will see how one can really bump up the probability of success with some careful mod-
eling, and some knowledge of Hamming codes. (Below we are assuming knowledge of the general
Hamming code (see Section 2.4). If you do not want to skip ahead, you can assume that n = 7 in
the last part of this problem.)
(b) Let's say that a directed graph G is a subgraph of the n-dimensional hypercube if its vertex
set is {0, 1}n and if u → v is an edge in G, then u and v differ in at most one coordinate.
Let K (G) be the number of vertices of G with in-degree at least one, and out-degree zero.
Show that the probability of winning the hat problem equals the maximum, over directed
subgraphs G of the n-dimensional hypercube, of K (G)/2^n .
(c) Using the fact that the out-degree of any vertex is at most n, show that K (G)/2^n is at most
n/(n + 1) for any directed subgraph G of the n-dimensional hypercube.
(d) Show that if n = 2^r − 1, then there exists a directed subgraph G of the n-dimensional hyper-
cube with K (G)/2^n = n/(n + 1).
Hint: This is where the Hamming code comes in.
Chapter 2
One motivation for the topic of this chapter is the following question: How can we represent
a code? Or more specifically, how many bits does it take to describe a code C : [q]k −→ [q]n ?
In general, a code C : [q]k −→ [q]n can be stored using nq k symbols from [q] (n symbols for
each of the q k codewords) or nq k log q bits. For constant rate codes, this is exponential space,
which is prohibitive even for modest values of k like k = 100. A natural question is whether we
can do better. Intuitively to facilitate a succinct representation the code must have some extra
structure. It turns out that one broad class of codes that do possess extra structure than general
codes, is what are called linear codes. We have already seen binary linear codes in Section 1.5,
that is: C ⊆ {0, 1}n is a linear code if for all c1 , c2 ∈ C , c1 + c2 ∈ C , where the “+" denotes bit-
wise XOR. In this chapter, we will see more general linear codes. We will see that they not only
offer enough structure to get succinct representations, but they also possess several other nice
properties.
To define general linear codes, we first need to introduce general finite fields and vector
spaces over such fields and we do so first before returning to codes.
Definition 2.1.1 (Group). A group G = (S, ◦) is given by a set of elements S together with a binary operation ◦ : S × S → S that satisfies the following properties:

• ASSOCIATIVITY: For every a, b, c ∈ S, we have (a ◦ b) ◦ c = a ◦ (b ◦ c).

• IDENTITY: There exists a special element e ∈ S such that for every a ∈ S we have
a ◦ e = e ◦ a = a.

• INVERSE : For every a ∈ S, there exists its unique inverse a −1 such that a ◦ a −1 = a −1 ◦ a = e.
If G = (S, ◦) satisfies all the properties except the existence of inverses then G is called a monoid.
We say G is commutative if for every a, b ∈ S, a ◦ b = b ◦ a.
We often use the same letter to denote the group (or other algebraic structures) and the set
of elements.
We now turn to the definition of a field. Informally speaking, a field is a set of elements on
which one can do addition, subtraction, multiplication and division and still stay in the set.
Definition 2.1.2. A field F is given by a triple (S, +, ·), where S is the set of elements and +, · are
functions S × S → S with the following properties:

• (S, +) is a commutative group with the (additive) identity element denoted by 0.

• (S \ {0}, ·) is a commutative group with the (multiplicative) identity element denoted by 1.

• DISTRIBUTIVITY: For every a, b, c ∈ S, we have a · (b + c) = a · b + a · c.
Again we typically use the same letter to denote the field and its set of elements. We also
use −a to denote the additive inverse of a ∈ F and a −1 to denote the multiplicative inverse of
a ∈ F \ {0}.
With the usual semantics for + and ·, R (the set of real numbers) is a field, but Z (the set of integers)
is not a field as division of two integers results in a rational number that need not be an inte-
ger (the set of rational numbers itself is a field though: see Exercise 2.1). In this course, we will
exclusively deal with finite fields. As the name suggests these are fields with a finite set of ele-
ments. (We will overload notation and denote the size of a field |F| = |S|.) The following is a well
known result.
Theorem 2.1.3 (Size of Finite Fields). Every finite field has size p s for some prime p and integer
s ≥ 1. Conversely for every prime p and integer s ≥ 1 there exists a field F of size p s .
One example of a finite field that we have seen is the field with S = {0, 1}, which we will
denote by F2 (we have seen this field in the context of binary linear codes). For F2 , addition
is the XOR operation, while multiplication is the AND operation. The additive inverse of an
element in F2 is the number itself while the multiplicative inverse of 1 is 1 itself.
Let p be a prime number. Then the integers modulo p form a field, denoted by Fp (and also
by Zp ), where the addition and multiplication are carried out modulo p. For example, consider
F7 , where the elements are {0, 1, 2, 3, 4, 5, 6}. We have (4 + 3) mod 7 = 0 and 4 · 4 mod 7 = 2.
Further, the additive inverse of 4 is 3 as (3 + 4) mod 7 = 0 and the multiplicative inverse of 4 is 2
as 4 · 2 mod 7 = 1.
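These computations are easy to reproduce; the short sketch below (illustrative) checks the F7 examples and finds the inverses by brute force:

    p = 7
    print((4 + 3) % p, (4 * 4) % p)  # 0 2
    print(next(b for b in range(p) if (4 + b) % p == 0))     # 3, the additive inverse of 4
    print(next(b for b in range(1, p) if (4 * b) % p == 1))  # 2, the multiplicative inverse of 4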
More formally, we prove the following result.
Lemma 2.1.4. Let p be a prime. Then Fp = ({0, 1, . . . , p − 1}, +p , ·p ) is a field, where +p and ·p are
addition and multiplication modulo p.
Proof. The properties of associativity, commutativity, distributivity and identities hold for in-
tegers and hence, they hold for Fp . The closure property follows since both the “addition" and
“multiplication" are done modulo p, which implies that for any a, b ∈ {0, . . . , p −1}, a +p b, a ·p b ∈
{0, . . . , p −1}. Thus, to complete the proof, we need to prove the existence of unique additive and
multiplicative inverses.
Fix an arbitrary a ∈ {0, . . . , p − 1}. Then we claim that its unique additive inverse is b = (p − a) mod p. It is
easy to check that (a + b) mod p = 0. Next we argue that this is the unique additive inverse.
To see this, note that a + 0, a + 1, . . . , a + (p − 1) are p consecutive integers and
thus exactly one of them is divisible by p, which happens precisely for b = (p − a) mod p, as desired.
Now fix an a ∈ {1, . . . , p − 1}. Next we argue for the existence of a unique multiplicative in-
verse a^{−1}. Consider the set of numbers T = {a ·_p b | b ∈ {1, . . . , p − 1}}. We claim that all these
numbers are distinct. To see this, note that if this is not the case, then there exist b_1 ≠ b_2 ∈
{1, . . . , p − 1} such that a · b_1 = a · b_2 mod p, which in turn implies that a · (b_1 − b_2) = 0 mod p,
i.e., p divides a · (b_1 − b_2). Since p is prime, p must divide either a or b_1 − b_2. However, a and
|b_1 − b_2| are both non-zero and at most p − 1, so neither is divisible by p, which is a contradiction.
Thus, we have argued that |T| = p − 1 and since
each number in T is in [p − 1], we have that T = [p − 1]. Thus, we can conclude that there exists
a unique element b such that a · b = 1 mod p and thus, b is the required a^{−1}.
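To see the lemma in action, here is a minimal Python sketch (not from the book; the function names are ours) that finds additive and multiplicative inverses in F_p by brute force and checks them for p = 7.

def additive_inverse(a, p):
    return (p - a) % p

def multiplicative_inverse(a, p):
    # Brute-force search, mirroring the counting argument in the proof above.
    for b in range(1, p):
        if (a * b) % p == 1:
            return b
    raise ValueError("no inverse found")

p = 7
for a in range(1, p):
    inv = multiplicative_inverse(a, p)
    assert (a * inv) % p == 1 and (a + additive_inverse(a, p)) % p == 0
    print("In F_%d: -%d = %d and %d^-1 = %d" % (p, a, additive_inverse(a, p), a, inv))
# For p = 7 this reproduces the examples in the text: -4 = 3 and 4^-1 = 2.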
One might think that there could be different finite fields with the same number of elements.
However, this is not the case:
Theorem 2.1.5. For every prime power q there is a unique finite field with q elements (up to
isomorphism1 ).
The most common vector space we will focus on is Fn with + representing coordinatewise
addition in F and a · u representing the coordinatewise scaling of u by a.
We are finally ready to define the notion of linear subspaces of Fn .
Definition 2.2.2 (Linear Subspace). A non-empty subset S ⊆ Fn is a linear subspace if the follow-
ing properties hold:
1
An isomorphism φ : S → S ′ is a bijective map (such that F = (S, +, ·) and F′ = (S ′ , ⊕, ◦) are fields) where for every
a 1 , a 2 ∈ S, we have φ(a 1 + a 2 ) = φ(a 1 ) ⊕ φ(a 2 ) and φ(a 1 · a 2 ) = φ(a 1 ) ◦ φ(a 2 ).
1. For every x, y ∈ S, x + y ∈ S, where the addition is vector addition over F (that is, done
component-wise over F).

2. For every a ∈ F and x ∈ S, a · x ∈ S, where a · x denotes component-wise scaling of x by a.

For example, the following is a linear subspace of F_5^3:
S 1 = {(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)}. (2.1)
Note that for example (1, 1, 1) + (3, 3, 3) = (4, 4, 4) ∈ S 1 and 2 · (4, 4, 4) = (3, 3, 3) ∈ S 1 as required
by the definition. Here is another somewhat less trivial example of a linear subspace over F_3^3:
S_2 = {(0, 0, 0), (1, 0, 1), (2, 0, 2), (0, 1, 1), (0, 2, 2), (1, 1, 2), (1, 2, 0), (2, 1, 0), (2, 2, 1)}. (2.2)
Note that (1, 0, 1) + (0, 2, 2) = (1, 2, 0) ∈ S 2 and 2 · (2, 0, 2) = (1, 0, 1) ∈ S 2 as required.
Remark 2.2.3. Note that the second property implies that 0 is contained in every linear subspace.
Further for any subspace over F2 , the second property is redundant: see Exercise 2.4.
Before we state some properties of linear subspaces, we state some relevant definitions.
Definition 2.2.4 (Span). Given a set of vectors B = {v_1, . . . , v_ℓ}, the span of B is the set of vectors

{ \sum_{i=1}^{ℓ} a_i · v_i | a_i ∈ F_q for every i ∈ [ℓ] }.
Definition 2.2.5 (Linear (in)dependence of vectors). We say that v_1, v_2, . . . , v_k are linearly inde-
pendent if for every 1 ≤ i ≤ k and for every (k − 1)-tuple (a_1, a_2, . . . , a_{i−1}, a_{i+1}, . . . , a_k) ∈ F_q^{k−1},

v_i ≠ a_1 v_1 + . . . + a_{i−1} v_{i−1} + a_{i+1} v_{i+1} + . . . + a_k v_k.

In other words, v_i is not in the span of the set {v_1, . . . , v_{i−1}, v_{i+1}, . . . , v_k} for every 1 ≤ i ≤ k. We say
that v_1, v_2, . . . , v_k are linearly dependent if they are not linearly independent.
For example, the vectors (1, 0, 1), (0, 1, 1) ∈ S_2 are linearly independent.
One can define the row (column) rank of a matrix as the maximum number of linearly in-
dependent rows (columns). However, it is a well-known theorem that the row rank of a matrix
is the same as its column rank. For example, the matrix below over F_3 has full rank (see Exer-
cise 2.5):

G_2 = [ 1 0 1 ]
      [ 0 1 1 ]                                                        (2.3)
Any linear subspace satisfies the following properties (the full proof can be found in any
standard linear algebra textbook).
Theorem 2.2.7. If S ⊆ F_q^n is a linear subspace then

1. |S| = q^k for some integer k ≥ 0. (The parameter k is called the dimension of S.)
2. There exists at least one set of vectors v_1, ..., v_k ∈ S, called basis elements, such that every x ∈ S
can be expressed as x = a_1 v_1 + a_2 v_2 + ... + a_k v_k where a_i ∈ F_q for 1 ≤ i ≤ k. In other words,
there exists a full rank k × n matrix G (also known as a generator matrix) with entries from
F_q such that every x ∈ S can be written as x = (a_1, a_2, ..., a_k) · G where

G = [ ←− v_1 −→ ]
    [ ←− v_2 −→ ]
    [     ...    ]
    [ ←− v_k −→ ]
3. There exists a full rank (n − k) × n matrix H (called a parity check matrix) such that for
every x ∈ S, H xT = 0.
Proof Sketch.
Property 1. We begin with the proof of the first property. For the sake of contradiction, let
us assume that q k < |S| < q k+1 , for some k ≥ 0. Iteratively, we will construct a set of linearly
independent vectors B ⊆ S such that |B | ≥ k + 1. Note that by the definition of a linear subspace
the span of B should be contained in S. However, this is a contradiction as the size of the span
of B is at least2 q k+1 > |S|.
To complete the proof, we show how to construct the set B in a greedy fashion. In the first
step pick v1 to be any non-zero vector in S and set B ← {v1 } (we can find such a vector as |S| >
q k ≥ 1). Now say after the step t (for some t ≤ k), |B | = t . Now the size of the span of the current
B is q t ≤ q k < |S|. Thus there exists a vector vt +1 ∈ S \ B that is linearly independent of vectors
in B . Set B ← B ∪ {vt +1 }. Thus, we can continue building B until |B | = k + 1, as desired.
Property 2. We first note that we can pick B = {v1 , . . . , vk } to be any set of k linearly indepen-
dent vectors– this just follows from the argument above for Property 1. This is because the
span of B is contained in S. However, since |S| = q k and the span of B has q k vectors, the two
have to be the same.
Property 3. Property 3 above follows from another fact that every linear subspace S has a null
space N ⊆ Fnq such that for every x ∈ S and y ∈ N , 〈x, y〉 = 0. Further, it is known that N itself is
a linear subspace of dimension n − k. (The claim that N is also a linear subspace follows from
the following two facts: for every x, y, z ∈ Fnq , (i) 〈x, y + z〉 = 〈x, y〉 + 〈x, z〉 and (ii) for any a ∈ Fq ,
〈x, ay〉 = a ·〈x, y〉.) In other words, there exists a generator matrix H for it. This matrix H is called
the parity check matrix of S.
2
See Exercise 2.7.
Property 4. See Exercise 2.8.
As examples, the linear subspace S 1 in (2.1) has as one of its generator matrices
G_1 = ( 1 1 1 ).
Further, the linear subspace S 2 in (2.2) has G 2 as one of its generator matrices and has the fol-
lowing as one of its parity check matrices
H_2 = ( 1 1 2 ).
Lemma 2.2.8. Given matrix G of dimension k × n that is a generator matrix of subspace S 1 and
matrix H of dimension (n −k)×n that is a parity check matrix of subspace S 2 such that G H T = 0,
then S 1 = S 2 .
Proof. We first prove that S_1 ⊆ S_2. Given any c ∈ S_1, there exists x ∈ F_q^k such that c = xG. Then,

H · c^T = H · (xG)^T = H G^T x^T = (G H^T)^T x^T = 0,

which implies that c ∈ S_2. Thus S_1 ⊆ S_2 and, since S_1 and S_2 have the same dimension, S_1 = S_2.^3
Definition 2.3.1 (Linear Codes). Let q be a prime power (i.e. q = p s for some prime p and integer
s ≥ 1). C ⊆ F_q^n is a linear code if it is a linear subspace of F_q^n. If C has dimension k and distance d
then it will be referred to as an [n, k, d ]q or just an [n, k]q code.
Theorem 2.2.7 now gives two alternate characterizations of an [n, k]q linear code C : C is
generated by a k × n generator matrix G. Alternately C is characterized by a (n − k) × n parity
check matrix H . Since these are important concepts for us, we define these formally below
before giving examples and consequences.
3
If not, S_1 ⊂ S_2, which implies that |S_2| ≥ |S_1| + 1. The latter is not possible if both S_1 and S_2 have the same
dimension.
Definition 2.3.2 (Generator and Parity Check Matrices). If C is an [n, k]_q linear code then there
exists a matrix G ∈ F_q^{k×n} of rank k (called a generator matrix of C) satisfying

C = {x · G | x ∈ F_q^k}.

Further, there exists a matrix H ∈ F_q^{(n−k)×n} of rank n − k (called a parity check matrix of C) satisfying

C = {y ∈ F_q^n | H · y^T = 0}.
Note that we require G and H to have full row rank (i.e., the rows of G are linearly indepen-
dent and the same holds for H). Sometimes we will consider matrices M ∈ F_q^{m×n} that are not of
full row rank. These can still be used to generate a code C = {x · M | x ∈ F_q^m}, though the code C
will not be an [n, m]_q code. We will still refer to C as the code generated by M in such a case,
though the phrase “generator matrix” will be reserved for full rank matrices.
Note that neither the generator matrix nor the parity check matrix are unique for a given
code. However their dimensions are. We give examples of these matrices for the case of the
[7, 4, 3]2 Hamming code below.
• The [7, 4, 3]2 Hamming code has the following generator matrix:
G = [ 1 0 0 0 0 1 1 ]
    [ 0 1 0 0 1 0 1 ]
    [ 0 0 1 0 1 1 0 ]
    [ 0 0 0 1 1 1 1 ]
• The following matrix is a parity check matrix of the [7, 4, 3]2 Hamming code:
H = [ 0 0 0 1 1 1 1 ]
    [ 0 1 1 0 0 1 1 ]
    [ 1 0 1 0 1 0 1 ]
Indeed, it can be easily verified that G · H T = 0. Then Lemma 2.2.8 proves that H is indeed
a parity check matrix of the [7, 4, 3]2 Hamming code.
We now look at some consequences of the above characterizations of an [n, k]q linear code
C . We started this chapter with a quest for succinct representation of a code. Note that both the
generator matrix and the parity check matrix can be represented using O(n 2 ) symbols from Fq
(which is much smaller than the exponential representation of a general code). More precisely
we have the following (see also Exercise 2.10):
Proposition 2.3.3. Any [n, k]q linear code can be represented with min(nk, n(n − k)) symbols
from Fq .
There is an encoding algorithm for C that runs in O(n 2 ) (in particular O(kn)) time– given a
message m ∈ Fkq , the corresponding codeword C (m) = m · G, where G is the generator matrix of
C . (See Exercise 2.11.)
Proposition 2.3.4. For any [n, k]q linear code, given its generator matrix, encoding can be done
with O(nk) operations over Fq .
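As an illustration, the following Python sketch (not from the book; the function name is ours) implements this O(kn) encoder for the [7, 4, 3]_2 Hamming generator matrix given earlier.

G = [
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(m, G, q=2):
    # Codeword = m * G; one inner product per column, so O(kn) operations over F_q.
    n = len(G[0])
    return [sum(m[i] * G[i][j] for i in range(len(m))) % q for j in range(n)]

print(encode([1, 0, 1, 1], G))  # [1, 0, 1, 1, 0, 1, 0]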
There is an error-detecting algorithm for C that runs in O(n 2 ). This is a big improvement
over the naive brute force exponential time algorithm (that goes through all possible codewords
c ∈ C and checks if y = c). (See Exercise 2.12.)
Proposition 2.3.5. For any [n, k]q linear code, given its parity check matrix, error detection can
be performed in O(n(n − k)) operations over Fq .
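Similarly, here is a small sketch (again ours, not from the book) of the error detector: it simply checks H · y^T = 0 row by row, using the parity check matrix of the [7, 4, 3]_2 Hamming code from above.

H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def is_codeword(y, H, q=2):
    # O(n(n - k)) operations: one inner product per row of H.
    return all(sum(h * v for h, v in zip(row, y)) % q == 0 for row in H)

c = [1, 0, 1, 1, 0, 1, 0]      # a codeword of the [7, 4, 3]_2 Hamming code
assert is_codeword(c, H)
c[2] ^= 1                      # flip one bit
assert not is_codeword(c, H)   # the error is detected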
Proposition 2.3.6. For any [n, k, d]_q linear code C, its distance satisfies

d = min_{c ∈ C, c ≠ 0} wt(c).
Proof. To show that d is the same as the minimum weight we show that d is no more than the
minimum weight and d is no less than the minimum weight.
First, we show that d is no more than the minimum weight. We can see this by considering
∆(0, c′ ) where c′ is the non-zero codeword in C with minimum weight; its distance from 0 is
equal to its weight. Thus, we have d ≤ w t (c′ ), as desired.
Now, to show that d is no less than the minimum weight, consider c1 6= c2 ∈ C such that
∆(c1 , c2 ) = d . Note that c1 − c2 ∈ C (this is because −c2 = −1 · c2 ∈ C , where −1 is the additive
inverse of 1 in F_q and c_1 − c_2 = c_1 + (−c_2), which is in C by the definition of linear codes). Now
note that w t (c1 − c2 ) = ∆(c1 , c2 ) = d , since the non-zero symbols in c1 − c2 occur exactly in the
positions where the two codewords differ. Further, since c1 6= c2 , c1 − c2 6= 0, which implies that
the minimum Hamming weight of any non-zero codeword in C is at most d .
Next, we look at another property implied by the parity check matrix of a linear code.
Proposition 2.3.7. For every [n, k, d ]q code C with parity check matrix H , d equals the size of the
smallest subset of columns of H that are linearly dependent.
Proof. By Proposition 2.3.6, we need to show that the minimum weight of a non-zero codeword
in C is the minimum number of linearly dependent columns. Let t be the minimum number of
linearly dependent columns in H . To prove the claim we will show that t ≤ d and t ≥ d .
For the first direction, let c ≠ 0 ∈ C be a codeword with wt(c) = d. Now note that, by the
definition of the parity check matrix, H · c^T = 0. Working through the matrix multiplication, this
gives us that \sum_{i=1}^{n} c_i H^i = 0, where H^i denotes the i-th column of

H = ( H^1 H^2 · · · H^i · · · H^n )

and c = (c_1, . . . , c_n). We can ignore the columns for which the corresponding entry c_i is zero; for the
sum to be zero, the columns H^i with c_i ≠ 0 must therefore be linearly dependent. This
means that d ≥ t, as the columns corresponding to non-zero entries in c are one instance of
linearly dependent columns.
For the other direction, consider a minimum-size set of columns of H, say H^{i_1}, H^{i_2}, . . . , H^{i_t}, that
is linearly dependent. This implies that there exist non-zero elements c′_{i_1}, . . . , c′_{i_t} ∈ F_q such
that c′_{i_1} H^{i_1} + . . . + c′_{i_t} H^{i_t} = 0. (Note that all the c′_{i_j} are non-zero, as no set of fewer than t columns
is linearly dependent.) Now extend c′_{i_1}, . . . , c′_{i_t} to the vector c′ by setting c′_j = 0 for j ∉ {i_1, . . . , i_t}.
Note that we have H · (c′)^T = 0 and thus c′ ∈ C. This in turn implies that d ≤ wt(c′) = t
(where recall t is the minimum number of linearly dependent columns in H).
In other words, the general [2^r − 1, 2^r − r − 1]_2 Hamming code is the code

{c ∈ {0, 1}^{2^r − 1} | H_r · c^T = 0}.
and the resulting code was a [7, 4, 3]2 code.
Next we argue that the above Hamming code has distance 3 (in Proposition 1.5.2, we argued
this for r = 3).
Now note that under the Hamming bound for d = 3 (Theorem 1.6.2), k ≤ n − log_2(n + 1), so
for n = 2^r − 1, k ≤ 2^r − r − 1. Hence, the Hamming code is a perfect code. (See Definition 1.7.3.)
In Question 1.7.1, we asked which codes are perfect codes. Interestingly, the only perfect
binary codes are the following:
• The trivial [n, 1, n]2 codes for odd n (which have 0n and 1n as the only codewords): see
Exercise 2.22.
Algorithm 3 Naive Decoder for Hamming Code
I NPUT: Received word y
O UTPUT: c if ∆(y, c) ≤ 1 else Fail
1: IF y ∈ C H ,r THEN
2: RETURN y
3: FOR i = 1 . . . n DO
4: y′ ← y + ei ⊲ ei is the i th standard basis vector
5: IF y′ ∈ C H ,r THEN
6: RETURN y′
7: RETURN Fail
w t (z) ≤ t ) and check if y − z is in the code or not. Algorithm 4 presents the formal algorithm
(where C is an [n, k, 2t + 1]q code).
1: FOR i = 0 . . . t DO
2: FOR S ⊆ [n] such that |S| = i DO
3: FOR z ∈ Fnq such that w t (zS ) = w t (z) = i DO
4: IF y − z ∈ C THEN
5: RETURN y − z
6: RETURN Fail
The number of error patterns z considered by Algorithm 4 is^5 \sum_{i=0}^{t} \binom{n}{i} (q − 1)^i ≤ O((nq)^t).
Further by Proposition 2.3.5, Step 4 can be performed with O(n 2 ) operations over Fq . Thus, Al-
gorithm 4 runs with O(n t +2 q t ) operations over Fq , which for q being a small polynomial in n,
is n O(t ) operations. In other words, the algorithm will have polynomial running time for codes
with constant distance (though the running time would not be practical even for moderate val-
ues of t ).
However, it turns out that for Hamming codes there exists a decoding algorithm with an
O(n 2 ) running time. To see this, first note that if the received word y has no errors, then Hr ·yT =
0. If not, then y = c + ei , where c ∈ C and ei is the unit vector with the only nonzero element at
the i-th position. Thus, if H_r^i stands for the i-th column of H_r,

H_r · y^T = H_r · c^T + H_r · e_i^T = H_r · e_i^T = H_r^i,

where the second equality follows as H_r · c^T = 0, which in turn follows from the fact that c ∈ C.
In other words, Hr · yT gives the location of the error. This leads to Algorithm 5.
5
Recall (1.15).
Algorithm 5 Efficient Decoder for Hamming Code
I NPUT: Received word y
O UTPUT: c if ∆(y, c) ≤ 1 else Fail
1: b ← Hr · yT .
2: Let i ∈ [n] be the number whose binary representation is b
3: IF y − ei ∈ C H THEN
4: RETURN y − ei
5: RETURN Fail
Note that H_r is an r × n matrix where n = 2^r − 1 and thus r = Θ(log n). This implies that Step 1 in
Algorithm 5, which is a matrix-vector multiplication, can be done in time O(n log n). By a similar
argument and by Proposition 2.3.5, Step 3 can be performed in O(n log n) time, and therefore
Algorithm 5 overall runs in O(n log n) time. Thus,
Theorem 2.5.1. The [n = 2^r − 1, 2^r − r − 1, 3]_2 Hamming code is 1-error correctable. Furthermore,
decoding can be performed in time O(n log n).
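The following Python sketch (not from the book) implements Algorithm 5 for r = 3; as the algorithm assumes, it corrects at most one error.

H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def decode(y):
    # Column j of H is the binary representation of j, so the syndrome H * y^T,
    # read as a binary number, is the position of the (single) error, or 0 if none.
    s = [sum(h * v for h, v in zip(row, y)) % 2 for row in H]
    i = 4 * s[0] + 2 * s[1] + s[2]
    if i == 0:
        return y
    corrected = list(y)
    corrected[i - 1] ^= 1
    return corrected

c = [1, 0, 1, 1, 0, 1, 0]
y = list(c); y[4] ^= 1         # introduce one error at position 5
assert decode(y) == c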
Definition 2.6.1 (Dual of a code). Let H be a parity check matrix of a code C , then the code
generated by H is called the dual of C . The dual of a code C is denoted by C ⊥ .
It is obvious from the definition that if C is an [n, k]q code, then C ⊥ is an [n, n −k]q code. Ap-
plying duality to the Hamming codes and a close relative, we get two families of codes described
below.
Definition 2.6.2 (Simplex and Hadamard Codes). For positive integer r, the Simplex code C_{Sim,r}
is the code generated by H_r. (Equivalently, C_{Sim,r} = C_{H,r}^⊥.) For positive integer r, the Hadamard
code C_{Had,r} is the [2^r, r]_2 code generated by the r × 2^r matrix H_r′ obtained by adding the all-zeros
column to H_r.
We claim that C_{Sim,r} and C_{Had,r} are [2^r − 1, r, 2^{r−1}]_2 and [2^r, r, 2^{r−1}]_2 codes respectively. The
claimed block length and dimension follow from the definition of the codes, while the distance
follows from the following result.
Proof. We first show the result for C_{Had,r}. In fact, we will show something stronger: every non-
zero codeword in C_{Had,r} has weight exactly equal to 2^{r−1} (the claimed distance follows from
Proposition 2.3.6). Consider a message x ≠ 0. Let its i-th entry be x_i = 1. x is encoded as

c = (x_1, x_2, . . . , x_r) · (H_r^0, H_r^1, . . . , H_r^{2^r − 1}),

where H_r^j is the binary representation of 0 ≤ j ≤ 2^r − 1 (that is, the columns range over all the vectors in
{0, 1}^r). Further note that the j-th bit of the codeword c is 〈x, H_r^j〉. Group all the columns of the
generator matrix into pairs (u, v) such that v = u + ei (i.e. v and u are the same except in the i th
position). Notice that this partitions all the columns into 2^{r−1} disjoint pairs. Then,

〈x, v〉 = 〈x, u + e_i〉 = 〈x, u〉 + 〈x, e_i〉 = 〈x, u〉 + x_i = 〈x, u〉 + 1.

Thus we have that exactly one of 〈x, v〉 and 〈x, u〉 is 1. As the choice of the pair (u, v) was arbitrary,
we have proved that for any non-zero codeword c ∈ C_{Had,r}, wt(c) = 2^{r−1}.
For the simplex code, we observe that all codewords of C_{Had,r} are obtained by padding a 0 to
the beginning of the codewords in C_{Sim,r}, which implies that all non-zero codewords in C_{Sim,r}
also have a weight of 2^{r−1}, which completes the proof.
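The following small sketch (not from the book; the helper function is ours) verifies the weight claim exhaustively for a few values of r.

from itertools import product

def hadamard_codeword(x, r):
    # The coordinate indexed by j in {0,1}^r is the inner product <x, j> over F_2.
    return [sum(a * b for a, b in zip(x, j)) % 2 for j in product([0, 1], repeat=r)]

for r in [2, 3, 4]:
    for x in product([0, 1], repeat=r):
        if any(x):
            assert sum(hadamard_codeword(x, r)) == 2 ** (r - 1)
    print("r = %d: every non-zero codeword has weight %d" % (r, 2 ** (r - 1)))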
We remark that the family of Hamming codes has rate approaching 1 and relative distance approaching 0,
while the families of Simplex/Hadamard codes have rate approaching 0 and relative distance 1/2.
Thus neither gives a positive answer to Question 1.8.2 and so the quest for an asymptotically
good code remains ongoing for now (and we will get to these in future chapters).
2.7 Exercises
Exercise 2.1. Prove that the set of rationals (i.e. the set of reals of the form a/b, where both a and
b ≠ 0 are integers), denoted by Q, is a field.
Exercise 2.2. Let q be a prime power. Let x ∈ Fq such that x 6∈ {0, 1}. Then prove that for any
n ≤ q − 1:
\sum_{i=0}^{n} x^i = (x^{n+1} − 1)/(x − 1).
Exercise 2.3. The main aim of this exercise is to prove the following identity that is true for any
α ∈ Fq :
α^q = α.                                                               (2.4)
To make progress towards the above we will prove a sequence of properties of groups. A group G
is a pair (S, ◦) where ◦ : S × S → S is an operator that is commutative^6 and under which the elements of S
are closed. Further, there is a special element ι ∈ S that is the identity element and every
element a ∈ S has an inverse element b ∈ S such that a ◦ b = ι. Note that a finite field Fq consists
of an additive group with the + operator (and 0 as additive identity) and a multiplicative group
on the non-zero elements of Fq (which is also denoted by F∗q ) with the · operator (and 1 as the
multiplicative identity).7
6
Technically, G is an abelian group.
7
Recall Definition 2.1.2.
For the rest of the problem let G = (S, ·) be a multiplicative group with |G| = m. Prove the
following statements.
1. For any β ∈ G, let o(β) be the smallest integer o such that β^o = 1. Prove that such an o ≤ m
always exists. Further, argue that T = {1, β, . . . , β^{o−1}} also forms a group. (T, ·) is called a
sub-group of G and o(β) is called the order of β.
g T = {g · β|β ∈ T }.
Prove that if h^{−1} · g ∈ T then g T = hT, and g T ∩ hT = ∅ otherwise. Further argue that these
cosets partition the group G into disjoint sets.
4. Using the above results or otherwise, argue that for any β ∈ G, we have

β^m = 1.
5. Prove (2.4).
Exercise 2.4. Prove that for q = 2, the second condition in Definition 2.2.2 is implied by the first
condition.
Exercise 2.6. In this problem we will look at the problem of solving a system of linear equations
over Fq . That is, one needs to solve for unknowns x 1 , . . . , x n given the following m linear equations
(where a i , j , b i ∈ Fq for 1 ≤ i ≤ m and 1 ≤ j ≤ n):
1. (Warm-up) Convince yourself that the above problem can be stated as A · xT = bT , where A
is an m × n matrix over Fq , x ∈ Fnq and b ∈ Fm
q .
2. (Upper Triangular Matrix) Assume n = m and that A is upper triangular, i.e. all diagonal
elements (a i ,i ) are non-zero and all lower triangular elements (a i , j , i > j ) are 0. Then
present an O(n 2 ) time8 algorithm to compute the unknown vector x.
8
For this problem, any basic operation over Fq takes unit time.
3. (Gaussian Elimination) Assume that A has full rank (or equivalently a rank of n.)
(a) Prove that the following algorithm due to Gauss converts A into an upper triangular
matrix. By permuting the columns if necessary make sure that a 1,1 6= 0. (Why can one
assume w.l.o.g. that this can be done?) Multiply each row 1 < i ≤ n by a_{1,1}/a_{i,1} and then
subtract a_{1,j} from the (i, j)-th entry for 1 ≤ j ≤ n. Recurse with the same algorithm on the
(n −1)×(n −1) matrix A ′ obtained by removing the first row and column from A. (Stop
when n = 1.)
(b) What happens if A does not have full rank? Show how one can modify the algorithm
above to either upper triangulate a matrix or report that it does not have full rank.
(Convince yourself that your modification works.)
(c) Call a system of equations A · xT = bT consistent if there exists a solution to x ∈ Fnq .
Show that there exists an O(n 3 ) algorithm that finds the solution if the system of equa-
tions is consistent and A has full rank (and report “fail" otherwise).
4. (m < n case) Assume that A has full rank, i.e. has a rank of m. In this scenario either the
system of equations is inconsistent or there are q n−m solutions to x. Modify the algorithm
from above to design an O(m 2 n) time algorithm to output the solutions (or report that the
system is inconsistent).
• Note that in case the system is consistent there will be q n−m solutions, which might be
much bigger than O(m 2 n). Show that this is not a problem as one can represent the
solutions as system of linear equations. (I.e. one can have n − m “free" variables and
m “bound" variables.)
5. (m > n case) Assume that A has full rank, i.e. a rank of n. In this scenario either the system
of equations is inconsistent or there is a unique solution to x. Modify the algorithm from
above to design an O(m 2 n) time algorithm to output the solution (or report that the system
is inconsistent).
6. (Non-full rank case) Give an O(m 2 n) algorithm for the general case, i.e. the m × n matrix
A need not have full rank. (The algorithm should either report that the system of equations
is inconsistent or output the solution(s) to x.)
Exercise 2.7. Prove that the span of k linearly independent vectors over Fq has size exactly q k .
Exercise 2.8. Let G and H be a generator and parity check matrix of the same linear code of
dimension k and block length n. Then G · H T = 0.
Exercise 2.9. Let C be an [n, k]q linear code with a generator matrix with no all zeros columns.
Then for every position i ∈ [n] and α ∈ Fq , the number of codewords c ∈ C such that c i = α is
exactly q k−1 .
Exercise 2.11. Prove Proposition 2.3.4.
Exercise 2.13. A set of vector S ⊆ Fnq is called t -wise independent if for every set of positions I with
|I | = t , the set S projected to I has each of the vectors in Ftq appear the same number of times. (In
other words, if one picks a vector (s 1 , . . . , s n ) from S at random then any of the t random variables
are uniformly and independently random over Fq ).
Prove that any linear code C whose dual C ⊥ has distance d ⊥ is (d ⊥ − 1)-wise independent.
Exercise 2.14. A set of vectors S ⊆ Fk2 is called ε-biased sample space if the following property
holds. Pick a vector X = (x 1 , . . . , x k ) uniformly at random from S. Then X has bias at most ε, that
is, for every I ⊆ [k],
| Pr[ \sum_{i∈I} x_i = 0 ] − Pr[ \sum_{i∈I} x_i = 1 ] | ≤ ε.
2. Let C be an [n, k]_2 code such that all non-zero codewords have Hamming weight in the
range [(1/2 − γ)n, (1/2 + γ)n] for some constant 0 < γ < 1/2. Then there exists an ε-biased
space of size n^{O(γ^{−1}·log(1/ε))}.
Exercise 2.15. Let C be an [n, k, d ]q code. Let y = (y 1 , . . . , y n ) ∈ (Fq ∪{?})n be a received word9 such
that y i =? for at most d − 1 values of i . Present an O(n 3 ) time algorithm that outputs a codeword
c = (c 1 , . . . , c n ) ∈ C that agrees with y in all un-erased positions (i.e., c i = y i if y i 6=?) or states that
no such c exists. (Recall that if such a c exists then it is unique.)
Exercise 2.16. In the chapter, we did not talk about how to obtain the parity check matrix of a
linear code from its generator matrix. In this problem, we will look at this “conversion" procedure.
(a) Prove that any generator matrix G of an [n, k]q code C (recall that G is a k × n matrix) can
be converted into another equivalent generator matrix of the form G′ = [Ik |A], where Ik is
the k × k identity matrix and A is some k × (n − k) matrix. By “equivalent," we mean that
the code generated by G′ has a linear bijective map to C .
Note that the code generated by G′ has the message symbols as its first k symbols in the cor-
responding codeword. Such codes are called systematic codes. In other words, every linear
code can be converted into a systematic code. Systematic codes are popular in practice as
they allow for immediate access to the message symbols.
9
A ? denotes an erasure.
(b) Given an k × n generator matrix of the form [Ik |A], give a corresponding (n − k) × n parity
check matrix. Briefly justify why your construction of the parity check matrix is correct.
Hint: Try to think of a parity check matrix that can be decomposed into two submatrices: one will be closely
related to A and the other will be an identity matrix, though the latter might not be a k × k matrix).
(c) Use part (b) to present a generator matrix for the [2r − 1, 2r − r − 1, 3]2 Hamming code.
Exercise 2.17. So far in this book we have seen that one can modify one code to get another code
with interesting properties (for example, the construction of the Hadamard code from the Simplex
code from Section 2.6 and Exercise 1.7). In this problem you will need to come up with more ways
of constructing new codes from existing ones.
Prove the following statements (recall that the notation (n, k, d )q code is used for general codes
with q k codewords where k need not be an integer, whereas the notation [n, k, d ]q code stands for
a linear code of dimension k):
1. If there exists an (n, k, d )2m code, then there also exists an (nm, km, d ′ ≥ d )2 code.
2. If there exists an [n, k, d ]2m code, then there also exists an [nm, km, d ′ ≥ d ]2 code.
3. If there exists an [n, k, d ]q code, then there also exists an [n − d , k − 1, d ′ ≥ ⌈d /q⌉]q code.
4. If there exists an [n, k, δn]_q code, then for every m ≥ 1, there also exists an
(n^m, k/m, (1 − (1 − δ)^m) · n^m)_{q^m} code.

5. If there exists an [n, k, δn]_2 code, then for every odd m ≥ 1, there also exists an
[n^m, k, (1/2) · (1 − (1 − 2δ)^m) · n^m]_2 code.
Note: In all the parts, the only things that you can assume about the original code are only the
parameters given by its definition– nothing else!
Exercise 2.18. Let C 1 be an [n, k 1 , d 1 ]q code and C 2 be an [n, k 2 , d 2 ]q code. Then define a new
code as follows:
C 1 ⊖C 2 = {(c1 , c1 + c2 )|c1 ∈ C 1 , c2 ∈ C 2 }.
Next we will prove interesting properties of this operations on codes:
1. If G i is the generator matrix for C i for i ∈ [2], what is a generator matrix for C 1 ⊖C 2 ?
2. Argue that C_1 ⊖ C_2 is a [2n, k_1 + k_2, d]_q code, where d := min(2d_1, d_2).
3. Assume there exists algorithms Ai for code C i for i ∈ [2] such that: (i) A1 can decode from
e errors and s erasures such that 2e + s < d 1 and (ii) A2 can decode from ⌊(d 2 − 1)/2⌋ errors.
Then argue that one can correct ⌊(d − 1)/2⌋ errors for C 1 ⊖C 2 .
Hint: Given a received word (y1 , y2 ) ∈ Fnq × Fnq , first apply A2 on y2 − y1 . Then create an intermediate received
word for A1 .
4. We will now consider a recursive construction of a binary linear code that uses the ⊖ oper-
ator. For integers 0 ≤ r ≤ m, we define the code C (r, m) as follows:
• C (r, r ) = Fr2 and C (0, r ) is the code with only two codewords: the all ones and all zeroes
vector in Fr2 .
• For 1 < r < m, C (r, m) = C (r, m − 1) ⊖C (r − 1, m − 1).
Exercise 2.20. In Section 2.4 we considered the binary Hamming code. In this problem we will
consider the more general q-ary Hamming code. In particular, let q be a prime power and r ≥ 1
be an integer. Define the following r × n matrix H q,r , where each column is an non-zero vector
from Frq such that the first non-zero entry is 1. For example,
H_{3,2} = [ 0 1 1 1 ]
          [ 1 0 1 2 ]
In this problem we will derive the parameters of the code. Define the generalized Hamming code
C H ,r,q to be the linear code whose parity check matrix is H q,r . Argue that
1. The block length of C_{H,r,q} is n = (q^r − 1)/(q − 1).
Exercise 2.21. Design the best 6-ary code (family) with distance 3 that you can.
Exercise 2.22. Prove that the [n, 1, n]2 code for odd n (i.e. the code with the all zeros and all ones
vector as it only two codewords) attains the Hamming bound (Theorem 1.7.2).
Exercise 2.23. Let C be an [n, k]q code with generator matrix G. Then given a codeword c ∈ C
one can compute the corresponding message in time O(kn 2 ).
Exercise 2.24. Given a c ∈ C H ,r , one can compute the corresponding message in time O(n).
Exercise 2.25. Let C be an (n, k)_q code. Prove that if C can be decoded from e errors in time T(n),
then it can be decoded from e + c errors in time O((nq)^c · T(n)).
Exercise 2.26. Show that the bound of kd of the number of ones in the generator matrix of any
binary linear code (see Exercise 1.14) cannot be improved for every code.
Exercise 2.27. Let C be a linear code. Then prove that (C^⊥)^⊥ = C.
Exercise 2.28. Note that for any linear code C, the codeword 0 is in both C and C^⊥. Show that
there exists a linear code C such that it shares a non-zero codeword with C ⊥ .
Exercise 2.29. We go into a bit of diversion and look at how finite fields are different from infinite
fields (e.g. R). Most of the properties of linear subspaces that we have used for linear codes (e.g.
notion of dimension, the existence of generator and parity check matrices, notion of duals) also
hold for linear subspaces over R.10 One trivial property that holds for linear subspaces over finite
fields that does not hold over R is that a linear subspace over F_q with dimension k has size q^k
(though this is a trivial consequence of the fact that F_q is a finite field while R is an infinite field). Next, we
consider a more subtle distinction.
Let S ⊆ R^n be a linear subspace over R and let S^⊥ be the dual of S. Then show that
S ∩ S ⊥ = {0} .
By contrast, linear subspaces over finite fields can have non-trivial intersection with their duals
(see e.g. Exercise 2.28).
Exercise 2.31. A linear code C is called self dual if C = C ⊥ . Show that for
Exercise 2.32. Given a code C a puncturing of C is another code C ′ where the same set of positions
are dropped in all codewords of C . More precisely, if C ⊆ Σn and the set of punctured positions is
P ⊆ [n], then the punctured code is {(c i )i 6∈P |(c 1 , . . . , c n ) ∈ C }.
Prove that a linear code with no repetitions (i.e. there are no two positions i ≠ j such that for
every codeword c ∈ C, c_i = c_j) is a puncturing of the Hadamard code. Hence, the Hadamard code is
the “longest” linear code that does not repeat.
10
A linear subspace S ⊆ Rn is the same as in Definition 2.2.2 where all occurrences of the finite field Fq is replaced
by R.
Exercise 2.33. In this problem we will consider the long code. For the definition, we will use
the functional way of looking at the ambient space as mentioned in Remark 1.2.2. A long code
of dimension k is a binary code in which the codeword corresponding to x ∈ F_2^k is the function
f : {0, 1}^{2^k} → {0, 1} defined as follows: for any m = (m_α)_{α ∈ F_2^k} ∈ {0, 1}^{2^k}, we have f(m) = m_x. Derive the
parameters of the long code.
Finally, argue that the long code is the code with the longest block length such that the code-
words do not have a repeated coordinate (i.e. there does not exists i 6= j such that for every code-
word c, c i = c j ). (Contrast this with the property of Hadamard code above.)
Chapter 3
In the chapters to come we will explore questions of the form: “Given n, k, d and q does an
(n, k, d )q code exist?” To answer such questions we will apply the “probabilistic method” — the
method that demonstrates the existence of an object with a given property by showing that a
randomly chosen object has the property with positive probability. To elaborate on this sen-
tence we need to introduce the basic language and tools of probability theory which we do in
Section 3.1.
We then introduce the probabilistic method in Section 3.2. We even apply the method to
answer a very simple question:

Question 3.0.1. Does there exist a [2, 2, 1]_2 code?

We note that the answer to the above question is trivially yes: just pick the generator matrix
to be the 2 × 2 identity matrix. But our proof will have the advantage of generalizing to broader
settings, though we save the generalizations for later chapters.
Finally in Section 3.3 we introduce the “entropy function” which turns out to be central in
the understanding of limits of codes (both existence and non-existence).
G = [r_1; r_2]   U(G)    V_00   V_01   V_10   V_11
[00; 00]         1/16     0      0      0      0
[00; 01]         1/16     0      1      0      1
[00; 10]         1/16     0      1      0      1
[00; 11]         1/16     0      2      0      2
[01; 00]         1/16     0      0      1      1
[01; 01]         1/16     0      1      1      0
[01; 10]         1/16     0      1      1      2
[01; 11]         1/16     0      2      1      1
[10; 00]         1/16     0      0      1      1
[10; 01]         1/16     0      1      1      2
[10; 10]         1/16     0      1      1      0
[10; 11]         1/16     0      2      1      1
[11; 00]         1/16     0      0      2      2
[11; 01]         1/16     0      1      2      1
[11; 10]         1/16     0      1      2      1
[11; 11]         1/16     0      2      2      0

Table 3.1: Uniform distribution over F_2^{2×2} along with values of four random variables (each matrix G is written by listing its two rows, and V_m(G) = wt(m · G)).
where [0, 1] is shorthand for the interval of all real numbers between 0 and 1.
An event E is a predicate over the domain D, i.e. it maps every element of D to “true” or “false”.
Equivalently an event is a subset of the domain D, i.e., those elements that are mapped to true.
We switch between “logical” or ”set-theoretic” notation to denote combinations of events. So
the disjunction of events E1 and E2 may be denoted E1 ∨ E2 or E1 ∪ E2 . Similarly the conjunction
of E1 and E2 may be denoted E1 ∧ E2 or E1 ∩ E2 ; and the negation of E1 may be denote ¬E1 or E1 .
In this book, we will primarily deal with the following special distribution:
For example, consider the domain D = F22×2 , i.e. the set of all 2 × 2 matrices over F2 . (Note
that each such matrix is a generator matrix of some [2, 2]2 code.) The first two columns of Ta-
ble 3.1 list the elements of this D along with the corresponding probabilities for the uniform
distribution.
Typically, we will be interested in a real-valued function defined on D and how it behaves
under a probability distribution defined over D. This is captured by the notion of a random
variable1 :
1
We note that the literature on probability theory allows for more general random variables, but for our purposes
we restrict only to real-valued ones.
Definition 3.1.2 (Random Variable). Let D be a finite domain and I ⊂ R be a finite2 subset. Let p
be a probability distribution defined over D. A random variable is a function:
V : D → I.
E [1E ] = Pr [E is true] .
Next, we state a simple yet useful property of expectation of a sum of random variables:
2
In general, I need not be finite. However, for this book this definition suffices.
Proposition 3.1.4 (Linearity of Expectation). Given random variables V1 , . . . ,Vm defined over the
same domain D and with the same probability distribution p, we have

E[ \sum_{i=1}^{m} V_i ] = \sum_{i=1}^{m} E[V_i].
In the equalities above, (3.4) and (3.7) follow from the definition of expectation of a random
variable. (3.5) follows from the definition of V and (3.6) follows by switching the order of the
two summations.
As an example, we have
E[ 1_{V_01 = 0} + 1_{V_10 = 0} + 1_{V_11 = 0} ] = 3/4.                 (3.8)
Frequently, we will need to deal with the probability of the union of events. We will use the
following result to upper bound such probabilities:
Proposition 3.1.5 (Union Bound). Given m binary random variables A_1, . . . , A_m, we have

Pr[ (A_1 ∨ A_2 ∨ · · · ∨ A_m) = 1 ] ≤ \sum_{i=1}^{m} Pr[A_i = 1].
In the above, (3.9) and (3.11) follow from the definition of S i . (3.10) follows from the fact that
some of the x ∈ ∪i S i get counted more than once.
We remark that the union bound is tight when the events are disjoint. (In other words, using
the notation in the proof above, when S i ∩ S j = ; for every i 6= j .)
As an example, let A 1 = 1V01 =0 , A 2 = 1V10 =0 and A 3 = 1V11 =0 . Note that in this case the event
A 1 ∨ A 2 ∨ A 3 is the same as the event that there exists a non-zero m ∈ {0, 1}2 such that w t (m·G) =
0. Thus, the union bound implies (that under the uniform distribution over F22×2 )
Pr[ there exists an m ∈ {0, 1}^2 \ {(0, 0)} such that wt(mG) = 0 ] ≤ 3/4.      (3.12)
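The following sketch (not from the book) computes the exact probability on the left-hand side of (3.12) by enumerating all sixteen matrices; it comes out to 10/16, consistent with the union bound of 3/4.

from itertools import product

bad = 0
for entries in product([0, 1], repeat=4):
    G = [entries[:2], entries[2:]]
    # Check whether some non-zero message m has wt(m * G) = 0.
    if any(sum(sum(m[i] * G[i][j] for i in range(2)) % 2 for j in range(2)) == 0
           for m in [(0, 1), (1, 0), (1, 1)]):
        bad += 1

print(bad / 16)   # 0.625, which is indeed at most the union bound of 3/4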
Finally, we present two bounds on the probability of a random variable deviating signifi-
cantly from its expectation. The first bound holds for any random variable:
Lemma 3.1.6 (Markov Bound). Let V be a non-negative random variable. Then for any t > 0,

Pr[V ≥ t] ≤ E[V] / t.

In particular, for any a ≥ 1,

Pr[V ≥ a · E[V]] ≤ 1/a.
Proof. The second bound follows from the first bound by substituting t = a · E[V ]. Thus, to
complete the proof, we argue the first bound. Consider the following sequence of relations:
E[V] = \sum_{i ∈ [0,t)} i · Pr[V = i] + \sum_{i ∈ [t,∞)} i · Pr[V = i]          (3.13)
     ≥ \sum_{i ≥ t} i · Pr[V = i]                                               (3.14)
     ≥ t · \sum_{i ≥ t} Pr[V = i]                                               (3.15)
     = t · Pr[V ≥ t].                                                           (3.16)

In the above relations, (3.13) follows from the definition of expectation of a random variable and
the fact that V is non-negative. (3.14) follows as we have dropped some non-negative terms. (3.15)
follows by noting that in the summands i ≥ t . (3.16) follows from the definition of Pr[V ≥ t ].
The proof is complete by noting that (3.16) implies the claimed bound.
The second bound works only for sums of independent random variables. We begin by
defining independent random variables:
Definition 3.1.7 (Independence). Two random variables A and B are called independent if for
every a and b in the ranges of A and B respectively, we have
Pr[A = a ∧ B = b] = Pr[A = a] · Pr[B = b].
For example, for the uniform distribution in Table 3.1, let A denote the bit G 0,0 and B denote
the bit G 0,1 . It can be verified that these two random variables are independent. In fact, it can be
verified that all the random variables corresponding to the four bits in G are independent random
variables. (We’ll come to a related comment shortly.)
Another related concept that we will use is that of probability of an event happening condi-
tioned on another event happening:
Definition 3.1.8 (Conditional Probability). Given two events A and B defined over the same do-
main and probability distribution, we define the probability of A conditioned on B as
Pr[A | B] = Pr[A and B] / Pr[B].
For example, note that

Pr[ 1_{V_01 = 1} | G_{0,0} = 0 ] = (4/16) / (1/2) = 1/2.
The above definition implies that two events A and B are independent if and only if Pr[A] =
Pr[A|B ]. We will also use the following result later on in the book (see Exercise 3.2):
Lemma 3.1.9. For any two events A and B defined on the same domain and the probability
distribution:
Pr[A] = Pr[A|B ] · Pr[B ] + Pr[A|¬B ] · Pr[¬B ].
Next, we state a deviation bound that asserts that the sum of independent random variables
takes values close to its expectation with high probability. We only state it for sums of binary
random variables, which is the form that will be needed in the book. We refer to this bound as
the “Chernoff bound” though we note that this is part of a larger body of work and the biblio-
graphic notes give more details.
Theorem 3.1.10 (Chernoff Bound). Let X_1, . . . , X_m be independent binary random variables and
define X = \sum_i X_i. Then the multiplicative Chernoff bound states that for 0 < ε ≤ 1,

Pr[ |X − E(X)| > ε E(X) ] < 2e^{−ε² E(X)/3}.
We omit the proof, which can be found in any standard textbook on randomized algorithms.
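As a sanity check, here is a small simulation sketch (not from the book; all parameter choices are ours) comparing an empirical deviation probability for fair coin flips with the bound above.

import math
import random

m, eps, trials = 500, 0.1, 5000
mean = m * 0.5                                  # E[X] for fair coin flips
random.seed(0)

deviations = sum(
    abs(sum(random.random() < 0.5 for _ in range(m)) - mean) > eps * mean
    for _ in range(trials)
)

print(deviations / trials)                      # empirical deviation probability ...
print(2 * math.exp(-eps ** 2 * mean / 3))       # ... is well below the Chernoff bound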
Finally, we present an alternate view of uniform distribution over product spaces and then
use that view to prove a result that we will use later in the book. Given probability distributions
p 1 and p 2 over domains D1 and D2 respectively, we define the product distribution p 1 × p 2 over
D1 × D2 as follows: every element (x, y) ∈ D1 × D2 under p 1 × p 2 is picked by choosing x from
D1 according to p 1 and y is picked independently from D2 under p 2 . This leads to the following
observation (see Exercise 3.4).
Lemma 3.1.11. For any m ≥ 1, the distribution U D1 ×D2 ×···×Dm is identical3 to the distribution
U D1 × U D2 × · · · × U Dm .
For example, the uniform distribution in Table 3.1 can be described equivalently as follows:
pick each of the four bits in G independently and uniformly at random from {0, 1}.
We conclude this section by proving the following result:
3
We say two distributions p 1 and p 2 on D are identical if for every x ∈ D, p 1 (x) = p 2 (x).
Lemma 3.1.12. Given a non-zero vector m ∈ Fkq and a uniformly random k × n matrix G over Fq ,
the vector m · G is uniformly distributed over Fnq .
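The following sketch (not from the book) verifies the lemma exhaustively for q = 2, k = 2 and n = 3: for a fixed non-zero m, the map G → m · G hits every vector in F_2^3 equally often.

from itertools import product
from collections import Counter

k, n = 2, 3
m = (1, 1)                      # any fixed non-zero message works

counts = Counter()
for entries in product([0, 1], repeat=k * n):
    G = [entries[i * n:(i + 1) * n] for i in range(k)]
    mG = tuple(sum(m[i] * G[i][j] for i in range(k)) % 2 for j in range(n))
    counts[mG] += 1

# Every vector in {0,1}^3 appears the same number of times (2^(kn) / 2^n = 8).
assert len(counts) == 2 ** n and len(set(counts.values())) == 1
print(counts)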
Note that the above inequality proves the existence of C with property P .
As an example consider Question 3.0.1. To answer this in the affirmative, we note that the
set of all [2, 2]2 linear codes is covered by the set of all 2 × 2 matrices over F2 . Then, we let D be
the uniform distribution over F22×2 . Then by Proposition 2.3.6 and (3.12), we get that
Pr_{U_{F_2^{2×2}}}[ there is no [2, 2, 1]_2 code ] ≤ 3/4 < 1,
which by the probabilistic method answers the Question 3.0.1 in the affirmative.
For the more general case, when we apply the probabilistic method, the typical approach
will be to define (sub-)properties P 1 , . . . , P m such that P = P 1 ∧P 2 ∧P 3 . . .∧P m and show that for
every 1 ≤ i ≤ m:
Pr[ C doesn’t have property P_i ] = Pr[ ¬P_i ] < 1/m.
Finally, by the union bound, the above will prove that^4 Pr[ C doesn’t have property P ] < 1, as
desired.
As an example, an alternate way to answer Question 3.0.1 in the affirmative is the following.
Define P 1 = 1V01 ≥1 , P 2 = 1V10 ≥1 and P 3 = 1V11 ≥1 . (Note that we want a [2, 2]2 code that satisfies
P 1 ∧ P 2 ∧ P 3 .) Then, by (3.1), (3.2) and (3.3), we have for i ∈ [3],
Pr[ C doesn’t have property P_i ] = Pr[ ¬P_i ] = 1/4 < 1/3,
as desired.
Finally, we mention a special case of the general probabilistic method that we outlined
above. In particular, let P denote the property that the randomly chosen C satisfies f (C ) ≤ b.
Then we claim (see Exercise 3.5) that E[ f (C )] ≤ b implies that Pr[C has property P ] > 0. Note
that this implies that E[ f (C )] ≤ b implies that there exists a code C such that f (C ) ≤ b.
Figure 3.1: A plot of H q (x) for q = 2, 3 and 4. The maximum value of 1 is achieved at x = 1−1/q.
Definition 3.3.2 (Volume of a Hamming Ball). Let q ≥ 2 and n ≥ r ≥ 1 be integers. Then the
volume of a Hamming ball of radius r is given by
à !
X r n
V ol q (r, n) = |B q (0, r )| = (q − 1)i .
i =0 i
The choice of 0 as the center for the Hamming ball above was arbitrary: since the volume
of a Hamming ball is independent of its center (as is evident from the last equality above), we
could have picked any point as the center.
We will prove the following result:

Proposition 3.3.3. Let q ≥ 2 be an integer and let 0 ≤ p ≤ 1 − 1/q. Then:
(i) Vol_q(pn, n) ≤ q^{H_q(p)·n}; and
(ii) Vol_q(pn, n) ≥ q^{H_q(p)·n − o(n)}.
Proof. We start with the proof of (i). Consider the following sequence of relations:

1 = (p + (1 − p))^n
  = \sum_{i=0}^{n} \binom{n}{i} p^i (1 − p)^{n−i}                                           (3.17)
  = \sum_{i=0}^{pn} \binom{n}{i} p^i (1 − p)^{n−i} + \sum_{i=pn+1}^{n} \binom{n}{i} p^i (1 − p)^{n−i}
  ≥ \sum_{i=0}^{pn} \binom{n}{i} p^i (1 − p)^{n−i}                                          (3.18)
  = \sum_{i=0}^{pn} \binom{n}{i} (q − 1)^i (p/(q − 1))^i (1 − p)^{n−i}
  = \sum_{i=0}^{pn} \binom{n}{i} (q − 1)^i (1 − p)^n (p/((q − 1)(1 − p)))^i
  ≥ \sum_{i=0}^{pn} \binom{n}{i} (q − 1)^i (1 − p)^n (p/((q − 1)(1 − p)))^{pn}              (3.19)
  = \sum_{i=0}^{pn} \binom{n}{i} (q − 1)^i (p/(q − 1))^{pn} (1 − p)^{(1−p)n}                (3.20)
  ≥ Vol_q(pn, n) · q^{−H_q(p)n}.                                                            (3.21)
In the above, (3.17) follows from the binomial expansion. (3.18) follows by dropping the second
sum and (3.19) follows from the fact that p/((q − 1)(1 − p)) ≤ 1 (as^5 p ≤ 1 − 1/q). The rest of the steps except
(3.21) follow from rearranging the terms. (3.21) follows as q^{−H_q(p)n} = (p/(q − 1))^{pn} (1 − p)^{(1−p)n}.
Finally, (3.21) implies that

1 ≥ Vol_q(pn, n) · q^{−H_q(p)n},

which proves (i).
We now turn to the proof of part (ii). For this part, we will need Stirling’s approximation for
n! (Lemma B.1.2).
By Stirling’s approximation, we have the following inequality:

\binom{n}{pn} = n! / ((pn)! ((1 − p)n)!)
             > (n/e)^n / ((pn/e)^{pn} ((1 − p)n/e)^{(1−p)n}) · (1/\sqrt{2πp(1 − p)n}) · e^{λ_1(n) − λ_2(pn) − λ_2((1−p)n)}
             = (1/(p^{pn} (1 − p)^{(1−p)n})) · ℓ(n),                                        (3.22)

where ℓ(n) = e^{λ_1(n) − λ_2(pn) − λ_2((1−p)n)} / \sqrt{2πp(1 − p)n}.
Now consider the following sequence of relations that complete the proof:
à !
n
V ol q (pn, n) ≥ (q − 1)pn (3.23)
pn
(q − 1)pn
> · ℓ(n) (3.24)
p pn (1 − p)(1−p)n
≥ q Hq (p)n−o(n) . (3.25)
5
Indeed, note that p/((q − 1)(1 − p)) ≤ 1 is true if p/(1 − p) ≤ q − 1, which in turn is true if p ≤ (q − 1)/q, where the last step follows
from Lemma B.2.1.
In the above (3.23) follows by only looking at one term. (3.24) follows from (3.22) while (3.25)
follows from the definition of H q (·) and the fact that for large enough n, ℓ(n) is q −o(n) .
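The following sketch (not from the book; the helper functions are ours) checks Proposition 3.3.3 numerically for one parameter setting.

from math import comb, log

def H_q(x, q):
    if x == 0:
        return 0.0
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)

def vol_q(r, n, q):
    return sum(comb(n, i) * (q - 1) ** i for i in range(r + 1))

q, n, p = 2, 100, 0.3
print(vol_q(int(p * n), n, q) <= q ** (H_q(p, q) * n))        # True, as in part (i)
print(log(vol_q(int(p * n), n, q), q) / n, H_q(p, q))          # close, as part (ii) predicts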
Next, we consider how the q-ary entropy function behaves for various ranges of its parame-
ters.
Proposition 3.3.4. For small enough ε, 1 − H_q(ρ) ≥ 1 − ρ − ε for every 0 < ρ ≤ 1 − 1/q if and only
if q is 2^{Ω(1/ε)}.
We will also be interested in how H q (x) behaves for fixed x and increasing q:
6
The last equality follows from the fact that by Lemma B.2.2, for 0 < x < 1, ln(1 − x) = −O(x).
Lemma 3.3.5. Let q ≥ 2 be an integer and let 0 ≤ ρ ≤ 1 − 1/q, then for any real m ≥ 1 such that
q^{m−1} ≥ (1 + 1/(q − 1))^{q−1},                                                            (3.26)

we have

H_q(ρ) ≥ H_{q^m}(ρ).
Proof. Note that H q (0) = H q m (0) = 0. Thus, for the rest of the proof we will assume that ρ ∈
(0, 1 − 1/q].
As observed in the proof of Proposition 3.3.4, we have
H_q(ρ) = ρ · (log(q − 1)/log q) + H(ρ) · (1/log q).
Thus, we have

(1/ρ) · m log q · (H_q(ρ) − H_{q^m}(ρ))
    = log((q − 1)^m) − log(q^m − 1) + (m − 1) · H(ρ)/ρ
    ≥ log((q − 1)^m) − log(q^m − 1) + (m − 1) · H(1 − 1/q)/(1 − 1/q)                        (3.27)
    = log((q − 1)^m) − log(q^m − 1) + (m − 1) · ( log(q/(q − 1)) + (log q)/(q − 1) )
    = log( ((q − 1)^m/(q^m − 1)) · (q/(q − 1))^{m−1} · q^{(m−1)/(q−1)} )
    = log( ((q − 1) · q^{m−1} · q^{(m−1)/(q−1)}) / (q^m − 1) )
    ≥ 0.                                                                                    (3.28)

In the above, (3.27) follows from the fact that H(ρ)/ρ is decreasing^7 in ρ and that ρ ≤ 1 − 1/q.
(3.28) follows from the claim that

(q − 1) · q^{(m−1)/(q−1)} ≥ q,

which is equivalent to (3.26).
Since (1 + 1/x)^x ≤ e (by Lemma B.2.5), we also have that (3.26) is satisfied for every m ≥ 1 + 1/ln q.
Further, we note that (3.26) is satisfied for every m ≥ 2 (for any q ≥ 3), which leads to the
following (also see Exercise 3.6):
Corollary 3.3.6. Let q ≥ 3 be an integer and let 0 ≤ ρ ≤ 1 − 1/q, then for any m ≥ 2, we have
H q (ρ) ≥ H q m (ρ).
Next, we look at the entropy function when its input is very close to 1.
Proposition 3.3.7. For every integer q ≥ 2 there exists a constant c_q > 0 such that for every small enough ε > 0,

H_q(1 − 1/q − ε) ≤ 1 − c_q · ε².

Proof. The intuition behind the proof is the following. Since the derivative of H_q(x) is zero at
x = 1 − 1/q, in the Taylor expansion of H q (1 − 1/q − ε) the ε term will vanish. We will now make
this intuition more concrete. We will think of q as fixed and 1/ε as growing. In particular, we
will assume that ε < 1/q. Consider the following equalities:
H_q(1 − 1/q − ε)
  = −(1 − 1/q − ε) log_q((1 − 1/q − ε)/(q − 1)) − (1/q + ε) log_q(1/q + ε)
  = −log_q( (1/q)(1 − εq/(q − 1)) ) + (1/q + ε) log_q( (1 − εq/(q − 1))/(1 + εq) )
  = 1 − (1/ln q)[ ln(1 − εq/(q − 1)) − (1/q + ε) ln( (1 − εq/(q − 1))/(1 + εq) ) ]
  = 1 + o(ε²) − (1/ln q)[ −εq/(q − 1) − ε²q²/(2(q − 1)²)
        − (1/q + ε)( −εq/(q − 1) − ε²q²/(2(q − 1)²) − εq + ε²q²/2 ) ]                        (3.29)
  = 1 + o(ε²) − (1/ln q)[ −εq/(q − 1) − ε²q²/(2(q − 1)²)
        − (1/q + ε)( −εq²/(q − 1) + ε²q³(q − 2)/(2(q − 1)²) ) ]
  = 1 + o(ε²) − (1/ln q)[ −ε²q²/(2(q − 1)²) + ε²q²/(q − 1) − ε²q²(q − 2)/(2(q − 1)²) ]       (3.30)
  = 1 − ε²q²/(2 ln q (q − 1)) + o(ε²)
  ≤ 1 − ε²q²/(4 ln q (q − 1)).                                                               (3.31)
(3.29) follows from the fact that for |x| < 1, ln(1 + x) = x − x 2 /2 + x 3 /3 − . . . (Lemma B.2.2) and
by collecting the ε3 and smaller terms in o(ε2 ). (3.30) follows by rearranging the terms and by
absorbing the ε3 terms in o(ε2 ). The last step is true assuming ε is small enough.
Next, we look at the entropy function when its input is very close to 0.
Proof. By definition,

H_q(ε) = ε log_q(q − 1) + ε log_q(1/ε) + (1 − ε) log_q(1/(1 − ε)).                          (3.32)

Further, by Lemma B.2.2, (1 − ε) log_q(1/(1 − ε)) ≤ 2ε/ln q for small enough ε. Thus, this implies
that

H_q(ε) ≤ ((2 + ln(q − 1))/ln q) · ε + (1/ln q) · ε ln(1/ε).                                 (3.33)

(3.32) and (3.33) prove the claimed bound.
We will also work with the inverse of the q-ary entropy function. Note that H q (·) on the
domain [0, 1 − 1/q] is a bijective map into [0, 1]. Thus, we define H q−1 (y) = x such that H q (x) = y
and 0 ≤ x ≤ 1 − 1/q. Finally, we will need the following lower bound:
Lemma 3.3.9. For every 0 ≤ y ≤ 1 − 1/q and for every small enough ε > 0,
Proof. It is easy to check that H q−1 (y) is a strictly increasing convex function when y ∈ [0, 1]. This
implies that the derivative of H q−1 (y) increases with y. In particular, (H q−1 )′ (1) ≥ (H q−1 )′ (y) for
H q−1 (y)−H q−1 (y−δ)
every 0 ≤ y ≤ 1. In other words, for every 0 < y ≤ 1, and (small enough) δ > 0, δ ≤
H q−1 (1)−H q−1 (1−δ)
δ
Proposition 3.3.7 along with the facts that H q−1 (1) = 1−1/q
. and H q−1 is increasing
completes the proof if one picks c q′ = max(1, 1/c q ) and δ = ε2 /c q′ .
3.4 Exercises
Exercise 3.1. Prove Lemma 3.1.3.
Exercise 3.3. In this exercise, we will see a common use of the Chernoff bound (Theorem 3.1.10).
Say we are trying to determine an (unknown) value x ∈ F to which we have access via a ran-
domized algorithm A that on (random) input r ∈ {0, 1}^m outputs an estimate A(r) of x
such that

Pr_r[ A(r) = x ] ≥ 1/2 + γ,

for some 0 < γ < 1/2. Then show that for any t ≥ 1, with O(t/γ²) calls to A one can determine x with
probability at least 1 − e^{−t}.
Hint: Call A with independent random bits and take majority of the answer and then use the Chernoff bound.
Exercise 3.5. Let P denote the property that the randomly chosen C satisfies f (C ) ≤ b. Then
E[ f (C )] ≤ b implies that Pr[C has property P ] > 0.
Exercise 3.6. Show that for any Q ≥ q ≥ 2 and ρ ≤ 1 − 1/q, we have HQ (ρ) ≤ H q (ρ).
Part II
The Combinatorics
Chapter 4
In this chapter, we will try to tackle Question 1.8.1. We will approach this trade-off in the fol-
lowing way:
If we fix the relative distance of the code to be δ, what is the best rate R that we can
achieve?
While we will not be able to pin down the exact optimal relationship between R and δ, we will
start establishing some limits. Note that an upper bound on R is a negative result in that it
establishes that codes with certain parameters do not exist. Similarly, a lower bound on R is a
positive result.
In this chapter, we will consider only one positive result, i.e. a lower bound on R called the
Gilbert-Varshamov bound in Section 4.2. In Section 4.1, we recall a negative result that we have
already seen– Hamming bound and state its asymptotic version to obtain an upper bound on
R. We will consider two other upper bounds: the Singleton bound (Section 4.3), which gives
a tight upper bound for large enough alphabets (but not binary codes) and the Plotkin bound
(Section 4.4).
81
Taking logarithms to base q of both sides above, and dividing by n, yields that the second term
on the right-hand side of the inequality above is lower bounded by H_q(δ/2) − o(1), and thus we
can get an asymptotic implication from Theorem 1.7.2. For a q-ary code C of rate R, relative
distance δ and block length n, we have:

R ≤ 1 − H_q(δ/2) + o(1),
where the o(1) term tends to 0 as n → ∞. Thus for an infinite family of codes C , taking limits as
n → ∞, we get the following asymptotic Hamming bound.
Proposition 4.1.1 (Asymptotic Hamming Bound). Let C be an infinite family of q-ary codes with
rate R and relative distance δ. Then we have:
R ≤ 1 − H_q(δ/2).
Figure 4.1 gives a pictorial description of the asymptotic Hamming bound for binary codes.
Figure 4.1: The Hamming and GV bounds for binary codes. Note that any point below the GV
bound is achievable by some code while no point above the Hamming bound is achievable by
any code. In this part of the book we would like to push the GV bound as much up as possible
while at the same time try and push down the Hamming bound as much as possible.
Theorem 4.2.1 (Gilbert-Varshamov Bound). Let q ≥ 2. For every 0 ≤ δ < 1 − 1/q there exists a
(linear) code with rate R ≥ 1 − H q (δ) and relative distance δ. Furthermore, for every 0 ≤ ε ≤
1− H q (δ) and integer n, if the generator matrix of a code of block length n and rate 1− H q (δ)−ε is
picked uniformly at random, then the code has relative distance at least δ with probability strictly
greater than 1 − q −εn .
The bound is generally referred to as the GV bound. For a pictorial description of the GV
bound for binary codes, see Figure 4.1. We will present the proofs for general codes and linear
codes in Sections 4.2.1 and 4.2.2 respectively.
Note that the first part of the theorem follows from the second part by setting ε = 0. In what
follows we first prove the existence of a non-linear code of rate 1 − H q (δ) and relative distance
at least δ. Later we show how to get a linear code, and with high probability (when ε > 0).
1: C ← ∅
2: WHILE there exists a v ∈ [q]n such that ∆(v, c) ≥ d for every c ∈ C DO
3: Add v to C
4: RETURN C
We claim that Algorithm 6 terminates and the C that it outputs has distance d . The latter
is true by step 2, which makes sure that in Step 3 we never add a vector c that will make the
distance of C fall below d . For the former claim, note that, if we cannot add v at some point, we
cannot add it later. Indeed, since we only add vectors to C , if a vector v ∈ [q]n is ruled out in a
certain iteration of Step 2 because ∆(c, v) < d , then in all future iterations, we have ∆(v, c) < d
and thus, this v will never be added in Step 3 in any future iteration.
The running time of Algorithm 6 is q O(n) . To see this, note that Step 2 in the worst-case could
be repeated for every vector in [q]n , that is at most q n times. In a naive implementation, for each
iteration, we cycle through all vectors in [q]n and for each vector v ∈ [q]n , iterate through all (at
most q n ) vectors c ∈ C to check whether ∆(c, v) < d . If no such c exists, then we add v to C .
Otherwise, we move to the next v. However, note that we can do slightly better– since we know
that once a v is “rejected" in an iteration, it’ll keep on being rejected in the future iterations, we
can fix up an ordering of vectors in [q]n and for each vector v in this order, check whether it can
Figure 4.2: An illustration of Gilbert’s greedy algorithm (Algorithm 6) for the first five iterations.
be added to C or not. If so, we add v to C , else we move to the next vector in the order. This
algorithm has time complexity O(nq 2n ), which is still q O(n) .
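Here is a short Python sketch (not from the book) of Algorithm 6 for tiny parameters; the fixed iteration order plays the role of the ordering discussed above, and the running time is exponential exactly as described.

from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def gilbert_greedy(n, q, d):
    C = []
    for v in product(range(q), repeat=n):       # a fixed ordering of [q]^n
        if all(hamming(v, c) >= d for c in C):
            C.append(v)
    return C

C = gilbert_greedy(n=7, q=2, d=3)
print(len(C))   # at least q^n / Vol_q(d-1, n) = 128/29 (about 4.4), as Lemma 4.2.2 guarantees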
Further, we claim that after termination of Algorithm 6
\bigcup_{c ∈ C} B(c, d − 1) = [q]^n.
This is because if the above is not true, then there exists a vector v ∈ [q]n \C , such that ∆(v, c) ≥ d
and hence v can be added to C . However, this contradicts the fact that Algorithm 6 has termi-
nated. Therefore,

| \bigcup_{c ∈ C} B(c, d − 1) | = q^n.                                                      (4.1)
It is not too hard to see that
\sum_{c ∈ C} |B(c, d − 1)| ≥ | \bigcup_{c ∈ C} B(c, d − 1) |,
which, together with (4.1), implies that |C| · Vol_q(d − 1, n) ≥ q^n, i.e.
    |C| ≥ q^n / Vol_q(d − 1, n) ≥ q^{n(1 − H_q(δ))},    (4.2)
as desired. In the above, (4.2) follows from the fact that
    Vol_q(d − 1, n) ≤ Vol_q(δn, n) ≤ q^{n·H_q(δ)},    (4.3)
where the second inequality follows from the upper bound on the volume of a Hamming ball in
Proposition 3.3.3.
We thus conclude that for every q, n and δ there exists a code of dimension at least n(1 − H_q(δ)).
We state this formally as a lemma below.
Lemma 4.2.2. For every pair of positive integers n, q and real δ ∈ [0, 1], there exists an (n, k, δn)_q
code satisfying
    q^k ≥ q^n / Vol_q(δn − 1, n).
In particular, for every positive integer q and real δ ∈ [0, 1] there exists an infinite family of
q-ary codes C of rate R and distance δ satisfying R ≥ 1 − H_q(δ).
It is worth noting that the code from Algorithm 6 is not guaranteed to have any special struc-
ture. In particular, even storing the code can take exponential space. We have seen in Proposi-
tion 2.3.3 that linear codes have a much more succinct representation. Thus, a natural question
is:
Question 4.2.1. Do linear codes achieve the R ≥ 1 − H q (δ) tradeoff that the greedy construc-
tion achieves?
Proof of Theorem 4.2.1. By Proposition 2.3.6, we are done if we can show that there exists a k × n
matrix G of full rank (for k = (1 − H_q(δ) − ε)n) such that every non-zero codeword mG, for
m ∈ F_q^k \ {0}, has Hamming weight at least d = δn. (Note that this weight condition in particular
implies mG ≠ 0 for every non-zero m, so such a G automatically has full rank.)
We will prove the existence of such a G by the probabilistic method. Pick a random linear code
by picking a random k × n matrix G where each of the kn entries is chosen uniformly and indepen-
dently at random from F_q. Fix m ∈ F_q^k \ {0}. Recall that by Lemma 3.1.12, for a random G, mG is
a uniformly random vector from F_q^n. Thus, for every non-zero vector m, we have
    Pr_G[wt(mG) < d] = Vol_q(d − 1, n) / q^n
                     ≤ q^{n·H_q(δ)} / q^n    (4.4)
                     ≤ q^{−k} · q^{−εn},    (4.5)
where (4.4) follows from (4.3) and (4.5) uses k ≤ n(1 − H_q(δ) − ε). There are q^k − 1 non-zero vec-
tors m and taking the union over all such vectors and applying the union bound (Lemma 3.1.5)
we have
    Pr_G[there exists a non-zero m with wt(mG) < d] ≤ (q^k − 1) · q^{−k} · q^{−εn} < q^{−εn}.
In other words, with probability strictly greater than 1 − q^{−εn} the code generated by G has
relative distance at least δ, as claimed.
Discussion. We now digress a bit to stress some aspects of the GV bound and its proof. First,
note that the proof by the probabilistic method shows something stronger than just the exis-
tence of a code, but rather gives a high probability result. Furthermore, as pointed out explicitly
for the non-linear setting in Lemma 4.2.2, the result gives a lower bound not only in the asymp-
totic case but also one for every choice of n and k. The proof above gives a similar non-asymptotic
bound in the linear setting as well.
Note that we can also pick a random linear code by picking a random (n − k) × n parity check
matrix. This also leads to an alternate proof of the GV bound: see Exercise 4.1.
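For very small parameters one can also simulate the probabilistic argument directly. The sketch below (names ours) draws random binary generator matrices, computes the exact minimum distance by brute force, and compares it against a target distance d chosen so that Vol_2(d − 1, n) ≤ 2^{n−k}, in the spirit of the counting bounds above. It is illustrative only, not an efficient construction.

import random
from itertools import product
from math import comb

def vol2(r, n):
    # Volume of a Hamming ball of radius r in {0,1}^n.
    return sum(comb(n, i) for i in range(r + 1))

def random_generator_matrix(k, n):
    return [[random.randrange(2) for _ in range(n)] for _ in range(k)]

def min_distance(G, k, n):
    # Minimum weight over all non-zero codewords mG (arithmetic over F_2).
    best = n
    for m in product(range(2), repeat=k):
        if not any(m):
            continue
        codeword = [sum(m[i] * G[i][j] for i in range(k)) % 2 for j in range(n)]
        best = min(best, sum(codeword))
    return best

if __name__ == "__main__":
    n, k = 12, 4
    # Largest d with Vol_2(d - 1, n) <= 2^(n - k): the counting arguments above
    # suggest a code of rate k/n with this distance should exist.
    d_target = max(d for d in range(1, n + 1) if vol2(d - 1, n) <= 2 ** (n - k))
    trials = 200
    hits = sum(min_distance(random_generator_matrix(k, n), k, n) >= d_target
               for _ in range(trials))
    print(f"target distance {d_target}; achieved by {hits}/{trials} random [12,4]_2 codes")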
Finally, we note that Theorem 4.2.1 requires δ < 1 − 1/q. An inspection of Gilbert's and Var-
shamov's proofs shows that the only reason the proofs required δ ≤ 1 − 1/q is that this is needed
for the volume bound Vol_q(δn, n) ≤ q^{H_q(δ)·n} (recall Proposition 3.3.3) to hold.
Figure 4.3: Construction of a new code in the proof of the Singleton bound.
It is natural to wonder if the above is just an artifact of the proof or if better codes exist.
This leads to the following question:
Question 4.2.2. Does there exist a code with R > 0 and δ > 1 − 1/q?
4.3 Singleton Bound
Theorem 4.3.1 (Singleton Bound). For every (n, k, d)_q code,
    k ≤ n − d + 1.
Proof. Let c_1, c_2, . . . , c_M be the codewords of an (n, k, d)_q code C. Note that we need to show
M ≤ q^{n−d+1}. To this end, we define c′_i to be the prefix of the codeword c_i of length n − d + 1 for
every i ∈ [M]. See Figure 4.3 for a pictorial description.
We now claim that for every i ≠ j, c′_i ≠ c′_j. For the sake of contradiction, assume that there
exists an i ≠ j such that c′_i = c′_j. Notice that this implies that c_i and c_j agree in all of the first
n − d + 1 positions, which in turn implies that Δ(c_i, c_j) ≤ d − 1. This contradicts the fact that C
has distance d.
Figure 4.4: The Hamming, GV and Singleton bound for binary codes.
Thus, M is the number of (distinct) prefixes of codewords in C of length n − d + 1, which
implies that M ≤ q^{n−d+1}, as desired.
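The prefix argument can be checked mechanically on a small example. The sketch below (names ours) computes the distance of a toy code, projects each codeword onto its first n − d + 1 coordinates, and verifies that the projections are all distinct, which is exactly what forces |C| ≤ q^{n−d+1}.

from itertools import product

def hamming_distance(u, v):
    return sum(a != b for a, b in zip(u, v))

def check_prefix_injective(code, n):
    # Distance-d codewords must have distinct prefixes of length n - d + 1.
    d = min(hamming_distance(u, v) for u in code for v in code if u != v)
    prefixes = {c[: n - d + 1] for c in code}
    return len(prefixes) == len(code), d

if __name__ == "__main__":
    # The [3, 2, 2]_2 parity-check code: all even-weight vectors of length 3.
    parity_code = [v for v in product(range(2), repeat=3) if sum(v) % 2 == 0]
    ok, d = check_prefix_injective(parity_code, n=3)
    print(ok, d)  # True, 2 -- and indeed |C| = 4 <= 2^(3 - 2 + 1) = 4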
Note that the asymptotic version of the Singleton bound states that k/n ≤ 1 − d /n + 1/n. In
other words,
R ≤ 1 − δ + o(1).
Figure 4.4 presents a pictorial description of the asymptotic version of the Singleton bound.
It is worth noting that the bound is independent of the alphabet size. As is evident from Fig-
ure 4.4, the Singleton bound is worse than the Hamming bound for binary codes. However, this
bound is better for larger alphabet sizes. In fact, we will look at a family of codes called Reed-
Solomon codes in Chapter 5 that meets the Singleton bound. However, the alphabet size of the
Reed-Solomon codes increases with the block length n. Thus, a natural follow-up question is
the following:
Question 4.3.1. Given a fixed q ≥ 2, does there exist a q-ary code that meets the Singleton
bound?
4.4 Plotkin Bound
In this section, we will study the Plotkin bound, which will answer Questions 4.2.2 and 4.3.1.
We start by stating the bound.
Theorem 4.4.1 (Plotkin bound). The following holds for any code C ⊆ [q]^n with distance d:
1. If d = (1 − 1/q)·n, then |C| ≤ 2qn.
2. If d > (1 − 1/q)·n, then |C| ≤ qd / (qd − (q − 1)n).
Note that the Plotkin bound (Theorem 4.4.1) implies that a code with relative distance δ ≥
1 − 1/q must necessarily have R = 0, which answers Question 4.2.2 in the negative.
Before we prove Theorem 4.4.1, we make a few remarks. We first note that the upper bound
in the first part of Theorem 4.4.1 can be improved to 2n for q = 2. (See Exercise 4.12.) Second, it
can be shown that this bound is tight. (See Exercise 4.13.) Third, the statement of Theorem 4.4.1
gives a trade-off only for relative distance greater than 1 − 1/q. However, as the following corol-
lary shows, the result can be extended to work for 0 ≤ δ ≤ 1 − 1/q. (See Figure 4.5 for an illustra-
tion for binary codes.)
    |C_x| ≤ qd / (qd − (q − 1)n′) ≤ qd,    (4.6)
where the second inequality follows from the fact that qd − (q − 1)n′ is an integer.
Note that by the definition of C_x:
    |C| = Σ_{x ∈ [q]^{n−n′}} |C_x|,
¹ If for some x, c_1 ≠ c_2 ∈ C_x with Δ(c_1, c_2) < d, then Δ((x, c_1), (x, c_2)) < d, which implies that the distance of C is less
than d (as by definition of C_x, both (x, c_1), (x, c_2) ∈ C), which in turn is a contradiction.
Figure 4.5: The current bounds on the rate R vs. relative distance δ for binary codes. The GV
bound is a lower bound on R while the other three bounds are upper bounds on R.
In other words, R ≤ 1 − (q/(q − 1))·δ + o(1), as desired.
Note that Corollary 4.4.2 implies that for any q-ary code of rate R and relative distance δ
(where q is a constant independent of the block length of the code), R < 1 − δ. In other words,
this answers Question 4.3.1 in the negative.
Let us pause for a bit at this point and recollect the bounds on R versus δ that we have proved
till now. Figure 4.5 depicts all the bounds we have seen till now (for q = 2). The GV bound is
the best known lower bound at the time of writing of this book. Better upper bounds are known
and we will see one such trade-off (called the Elias-Bassalygo bound) in Section 8.1.
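The curves in Figures 4.4 and 4.5 can be reproduced numerically from the asymptotic forms discussed in this chapter. Below is a small sketch (names ours) that evaluates, for q = 2, the Hamming bound R ≤ 1 − H(δ/2), the GV bound R ≥ 1 − H(δ), the Singleton bound R ≤ 1 − δ, and the Plotkin-type bound R ≤ 1 − 2δ.

from math import log2

def h2(x):
    # Binary entropy function H(x).
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def binary_bounds(delta):
    # Asymptotic rate bounds at relative distance delta for q = 2.
    return {
        "Hamming (upper)": 1 - h2(delta / 2),
        "GV (lower)": 1 - h2(delta) if delta <= 0.5 else 0.0,
        "Singleton (upper)": 1 - delta,
        "Plotkin (upper)": max(0.0, 1 - 2 * delta),
    }

if __name__ == "__main__":
    for delta in (0.1, 0.25, 0.4):
        print(delta, binary_bounds(delta))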
Now, we turn to the proof of Theorem 4.4.1, for which we will need two more lemmas. The
first lemma deals with vectors over real spaces. We quickly recap the necessary definitions.
Consider a vector v in R^n, that is, a tuple of n real numbers. This vector has (Euclidean) norm
‖v‖ = √(v_1^2 + v_2^2 + · · · + v_n^2), and is a unit vector if and only if its norm is 1. The inner product of
two vectors, u and v, is 〈u, v〉 = Σ_i u_i · v_i. The following lemma gives a bound on the number of
vectors that can exist such that every pair is at an obtuse angle with each other.
Lemma 4.4.3. Let v_1, . . . , v_m ∈ R^N be non-zero vectors.
1. If 〈v_i, v_j〉 ≤ 0 for all i ≠ j, then m ≤ 2N.
2. Let the v_i be unit vectors for 1 ≤ i ≤ m. Further, if 〈v_i, v_j〉 ≤ −ε < 0 for all i ≠ j, then m ≤ 1 + 1/ε.²
(Item 1 is tight: see Exercise 4.14.) The proof of the Plotkin bound will need the existence of a
map from codewords to real vectors with certain properties, which the next lemma guarantees.
Lemma 4.4.4 (Mapping Lemma). Let C ⊆ [q]^n. Then there exists a function f : C → R^{nq} such
that
1. For every c ∈ C, f(c) is a unit vector, i.e. ‖f(c)‖ = 1.
2. For every c_1, c_2 ∈ C, 〈f(c_1), f(c_2)〉 = 1 − (q/(q − 1)) · (Δ(c_1, c_2)/n).
We defer the proofs of the lemmas above to the end of the section. We are now in a position
to prove Theorem 4.4.1.
Proof of Theorem 4.4.1. Let C = {c_1, c_2, . . . , c_m}, and let f be the map guaranteed by Lemma 4.4.4.
For all i ≠ j,
    〈f(c_i), f(c_j)〉 = 1 − (q/(q − 1)) · (Δ(c_i, c_j)/n) ≤ 1 − (q/(q − 1)) · (d/n).
For the first part, d = (1 − 1/q)·n implies that
    〈f(c_i), f(c_j)〉 ≤ 0
for all i ≠ j, so the first part of Lemma 4.4.3 (applied with N = nq) gives m ≤ 2nq, as desired.
For the second part, d > (1 − 1/q)·n implies that
    〈f(c_i), f(c_j)〉 ≤ 1 − (q/(q − 1)) · (d/n) = −(qd − (q − 1)n)/((q − 1)n)
and, since ε := (qd − (q − 1)n)/((q − 1)n) > 0, we can apply the second part of Lemma 4.4.3. Thus, m ≤ 1 +
(q − 1)n/(qd − (q − 1)n) = qd/(qd − (q − 1)n), as desired. ✷
² Note that since v_i and v_j are both unit vectors, 〈v_i, v_j〉 is the cosine of the angle between them.
Proof of Lemma 4.4.3. We begin with a proof of the first result. The proof is by induction on
N. Note that in the base case of N = 0, there are no non-zero vectors and hence m = 0, which satisfies
the claimed inequality m ≤ 2N.
In the general case, we have m ≥ 1 non-zero vectors v_1, . . . , v_m ∈ R^N such that for every i ≠ j,
    〈v_i, v_j〉 ≤ 0.    (4.7)
Since rotating all the vectors by the same amount does not change the sign of the inner
product (nor does scaling any of the vectors), w.l.o.g. we can assume that v_m = 〈1, 0, . . . , 0〉. For
1 ≤ i ≤ m − 1, denote the vectors as v_i = 〈α_i, y_i〉, for some α_i ∈ R and y_i ∈ R^{N−1}. Now, for any
i ≤ m − 1, 〈v_m, v_i〉 = 1·α_i + 0 = α_i. However, note that (4.7) implies that 〈v_m, v_i〉 ≤ 0, which in turn
implies that
    α_i ≤ 0.    (4.8)
Next, we claim that at most one of y1 , . . . , ym−1 can be the all zeroes vector, 0. If not, assume
w.l.o.g., that y1 = y2 = 0. This in turn implies that
〈v1 , v2 〉 = α1 · α2 + 〈y1 , y2 〉
= α1 · α2 + 0
= α1 · α2
> 0,
where the last inequality follows from the subsequent argument. As v1 = 〈α1 , 0〉 and v2 = 〈α2 , 0〉
are non-zero, this implies that α1 , α2 6= 0. (4.8) then implies that α1 , α2 < 0. However, 〈v1 , v2 〉 > 0
contradicts (4.7).
Thus, w.l.o.g., assume that y_1, . . . , y_{m−2} are all non-zero vectors. Further, note that for every
i ≠ j ∈ [m − 2], 〈y_i, y_j〉 = 〈v_i, v_j〉 − α_i · α_j ≤ 〈v_i, v_j〉 ≤ 0. Thus, we have reduced the problem on m
vectors of dimension N to an equivalent problem on m − 2 vectors of dimension N − 1.
Continuing this process, each reduction of the dimension by one loses at most two of the vectors
in the set. Induction then implies that m ≤ 2N, as desired.
We now move on to the proof of the second part. Define z = v_1 + . . . + v_m. Now consider the
following sequence of relationships:
    ‖z‖² = Σ_{i=1}^{m} ‖v_i‖² + 2·Σ_{i<j} 〈v_i, v_j〉 ≤ m + 2·(m choose 2)·(−ε) = m(1 − εm + ε).
The inequality follows from the facts that each v_i is a unit vector and the assumption that for
every i ≠ j, 〈v_i, v_j〉 ≤ −ε. As ‖z‖² ≥ 0,
    m(1 − εm + ε) ≥ 0,
or
    εm ≤ 1 + ε.
Thus, we have m ≤ 1 + 1/ε, as desired. ✷
Finally, we prove Lemma 4.4.4.
Proof of Lemma 4.4.4. We begin by defining a map φ : [q] → R^q with certain properties. Then
we apply φ to all the coordinates of a codeword to define the map f : C → R^{nq} that satisfies the
claimed properties. We now fill in the details.
Define φ : [q] → R^q as follows. For every i ∈ [q], we define
    φ(i) = 〈1/q, 1/q, . . . , −(q − 1)/q, . . . , 1/q〉,
where the entry −(q − 1)/q appears in the i-th position. That is, all but the i-th position in
φ(i) ∈ R^q have a value of 1/q and the i-th position has value −(q − 1)/q.
Next, we record two properties of φ that follow immediately from its definition. For every
i ∈ [q],
    ‖φ(i)‖² = (q − 1)/q² + (q − 1)²/q² = (q − 1)/q.    (4.9)
Also for every i ≠ j ∈ [q],
    〈φ(i), φ(j)〉 = (q − 2)/q² − 2(q − 1)/q² = −1/q.    (4.10)
We are now ready to define our final map f : C → R^{nq}. For every c = (c_1, . . . , c_n) ∈ C, define
    f(c) = √(q/(n(q − 1))) · (φ(c_1), φ(c_2), . . . , φ(c_n)).
(The multiplicative factor √(q/(n(q − 1))) is to ensure that f(c) for any c ∈ C is a unit vector.)
To complete the proof, we will show that f satisfies the claimed properties. We begin with
condition 1. Note that
    ‖f(c)‖² = (q/((q − 1)n)) · Σ_{i=1}^{n} ‖φ(c_i)‖² = 1,
where the first equality follows from the definition of f and the second equality follows from
(4.9).
We now turn to the second condition. For notational convenience, define c_1 = (x_1, . . . , x_n)
and c_2 = (y_1, . . . , y_n). Consider the following sequence of relations:
    〈f(c_1), f(c_2)〉 = [Σ_{ℓ: x_ℓ ≠ y_ℓ} 〈φ(x_ℓ), φ(y_ℓ)〉 + Σ_{ℓ: x_ℓ = y_ℓ} 〈φ(x_ℓ), φ(y_ℓ)〉] · (q/(n(q − 1)))
                    = [Σ_{ℓ: x_ℓ ≠ y_ℓ} (−1/q) + Σ_{ℓ: x_ℓ = y_ℓ} (q − 1)/q] · (q/(n(q − 1)))    (4.11)
                    = [Δ(c_1, c_2)·(−1/q) + (n − Δ(c_1, c_2))·((q − 1)/q)] · (q/(n(q − 1)))    (4.12)
                    = (q/(n(q − 1))) · [((q − 1)/q)·n − Δ(c_1, c_2)]
                    = 1 − (q/(q − 1)) · (Δ(c_1, c_2)/n),
as desired. In the above, (4.11) is obtained using (4.10) and (4.9) while (4.12) follows from the
definition of the Hamming distance. ✷
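Both properties of f are easy to confirm numerically. The sketch below (names ours) implements φ and f for a small q and checks, on random words, that f(c) is a unit vector and that 〈f(c_1), f(c_2)〉 = 1 − (q/(q − 1))·Δ(c_1, c_2)/n.

import math
import random

def phi(i, q):
    # phi(i): all coordinates 1/q except coordinate i, which is -(q-1)/q.
    v = [1.0 / q] * q
    v[i] = -(q - 1.0) / q
    return v

def f(c, q):
    # Map a word c in [q]^n to a unit vector in R^(nq).
    n = len(c)
    scale = math.sqrt(q / (n * (q - 1.0)))
    return [scale * x for sym in c for x in phi(sym, q)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

if __name__ == "__main__":
    q, n = 3, 8
    c1 = [random.randrange(q) for _ in range(n)]
    c2 = [random.randrange(q) for _ in range(n)]
    delta = sum(a != b for a, b in zip(c1, c2))
    print(abs(inner(f(c1, q), f(c1, q)) - 1.0) < 1e-9)        # unit norm
    expected = 1 - (q / (q - 1.0)) * delta / n
    print(abs(inner(f(c1, q), f(c2, q)) - expected) < 1e-9)   # inner product formula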
4.5 Exercises
Exercise 4.1. Pick a (n − k) × n matrix H over Fq at random. Show that with high probability the
code whose parity check matrix is H achieves the GV bound.
Exercise 4.2. Recall the definition of an ε-biased space from Exercise 2.14. Show that there exists
an ε-biased space of size O(k/ε²).
Exercise 4.3. Argue that a random linear code as well as its dual both lie on the corresponding
GV bound.
Exercise 4.4. In Section 4.2.2, we saw that a random linear code meets the GV bound. It is natural
to ask the same question for general random codes. (By a random (n, k)_q code, we mean the follow-
ing: for each of the q^k messages, pick a random vector from [q]^n. Further, the choices for the
codewords are independent.) We will do so in this problem.
1. Prove that a random q-ary code with rate R > 0 with high probability has relative distance
δ ≥ H_q^{−1}(1 − 2R − ε). Note that this is worse than the bound for random linear codes in
Theorem 4.2.1.
2. Prove that with high probability the relative distance of a random q-ary code of rate R is at
most H_q^{−1}(1 − 2R) + ε. In other words, general random codes are worse than random linear
codes in terms of their distance.
Hint: Use Chebyshev’s inequality.
Exercise 4.5. We saw that Algorithm 6 can compute an (n, k)q code on the GV bound in time
q O(n) . Now the construction for linear codes is a randomized construction and it is natural to ask
94
how quickly can we compute an [n, k]q code that meets the GV bound. In this problem, we will
see that this can also be done in q O(n) deterministic time, though the deterministic algorithm is
not that straight-forward anymore.
1. Argue that Theorem 4.2.1 gives a q O(kn) time algorithm that constructs an [n, k]q code on
the GV bound. (Thus, the goal of this problem is to “shave" off a factor of k from the expo-
nent.)
2. A k × n Toeplitz matrix A = {A_{i,j}}_{i=1,j=1}^{k,n} satisfies the property that A_{i,j} = A_{i−1,j−1}. In other
words, any diagonal has the same value. For example, the following is a 4 × 6 Toeplitz
matrix:
    1 2 3 4 5 6
    7 1 2 3 4 5
    8 7 1 2 3 4
    9 8 7 1 2 3
Prove that if T is a uniformly random k × n Toeplitz matrix over F_q, then for every non-zero
m ∈ F_q^k, the vector m · T is distributed uniformly over F_q^n.
Hint: Write down the expression for the value at each of the n positions in the vector m · T in terms of the
values in the first row and column of T. Think of the values in the first row and column as variables. Then
divide these variables into two sets (this “division" will depend on m), say S and S̄. Then argue the following:
for every fixed y ∈ F_q^n and for every fixed assignment to the variables in S̄, there is a unique assignment to the
variables in S such that mT = y.
3. Briefly argue why the claim in part 2 implies that a random code defined by picking its
generator matrix as a random Toeplitz matrix with high probability lies on the GV bound.
4. Conclude that an [n, k]q code on the GV bound can be constructed in time q O(k+n) .
Exercise 4.6. Show that one can construct the parity check matrix of an [n, k]q code that lies on
the GV bound in time q O(n) .
Exercise 4.7. So far in Exercises 4.5 and 4.6, we have seen two constructions of [n, k]q code on the
GV bound that can be constructed in q O(n) time. For constant rate codes, at the time of writing of
this book, this is the fastest known construction of any code that meets the GV bound. For k = o(n),
there is a better construction known, which we explore in this exercise.
We begin with some notation. For the rest of the exercise we will target a distance of d = δn.
Given a message m ∈ F_q^k and an [n, k]_q code C, define the indicator variable:
    W_m(C) = 1 if wt(C(m)) < d, and W_m(C) = 0 otherwise.
Further, define
    D(C) = Σ_{m ∈ F_q^k \ {0}} W_m(C).
We will also use D(G) and Wm (G) to denote the variables above for the code C generated by G.
Given a k × n matrix M, we will use M_i to denote the i-th column of M and M_{≤i} to denote
the column submatrix of M that contains the first i columns. Finally below we will use G to
denote a uniformly random k ×n generator matrix and G to denote a specific instantiation of the
generator matrix. We will arrive at the final construction in a sequence of steps. In what follows
define k < (1 − H q (δ))n for large enough n.
1. Argue that C has a distance d if and only if D(C ) < 1.
4. We are now ready to define the algorithm to compute the final generator matrix G: see
Algorithm 7. Prove that Algorithm 7 outputs a matrix G such that the linear code generated
by G is an [n, k, δn]_q code. Conclude that this code lies on the GV bound.
5. Finally, we will analyze the run time of Algorithm 7. Argue that Step 2 can be implemented
in poly(n, q^k) time. Conclude that Algorithm 7 can be implemented in time poly(n, q^k).
Hint: It might be useful to maintain a data structure that keeps track of one number for every non-zero m ∈ Fkq
throughout the run of Algorithm 7.
Exercise 4.8. In this problem we will derive the GV bound using a graph-theoretic proof, which
is actually equivalent to the greedy proof we saw in Section 4.2.1. Let 1 ≤ d ≤ n and q ≥ 1 be
integers. Now consider the graph G n,d ,q = (V, E ), where the vertex set is the set of all vectors in
[q]n . Given two vertices u 6= v ∈ [q]n , we have the edge (u, v) ∈ E if and only if ∆(u, v) < d . An
independent set of a graph G = (V, E ) is a subset I ⊆ V such that for every u 6= v ∈ I , we have that
(u, v) is not an edge. We now consider the following sub-problems:
1. Argue that any independent set C of G n,d ,q is a q-ary code of distance d .
2. The degree of a vertex in a graph G is the number of edges incident on that vertex. Let Δ be
the maximum degree of any vertex in G = (V, E). Then argue that G has an independent set
of size at least |V|/(Δ + 1).
Exercise 4.9. In this problem we will improve slightly on the GV bound using a more sophisti-
cated graph-theoretic proof. Let G n,d ,q and N and ∆ be as in the previous exercise (Exercise 4.8).
So far we used the fact that G n,d ,q has many vertices and small degree to prove it has a large in-
dependent set, and thus to prove there is a large code of minimum distance d . In this exercise we
will see how a better result can be obtained by counting the number of “triangles” in the graph. A
triangle in a graph G = (V, E ) is a set {u, v, w} ⊂ V of three vertices such that all three vertices are
adjacent, i.e., (u, v), (v, w), (w, u) ∈ E. For simplicity we will focus on the case where q = 2 and
d = n/5, and consider the limit as n → ∞.
1. Prove that a graph on N vertices of maximum degree ∆ has at most O(N ∆2 ) triangles.
Hint: Fix u and let e count the number of coordinates where at least one of v or w disagree
with u. Prove that e is at most 3d /2.
3. Simplify the expression in the case where d = n/5 to show that the number of triangles in
G n,n/5,2 is O(N · ∆2−η ) for some η > 0.
4. A famous result in the “probabilistic method” shows (and you don’t have to prove this) that
if a graph on N vertices of maximum degree Δ has at most O(N·Δ^{2−η}) triangles, then it has
an independent set of size Ω((N/Δ)·log Δ). Use this result to conclude that there is a binary code
of block length n and distance n/5 of size Ω(n·2^n / (n choose n/5)). (Note that this improves over the
GV bound by an Ω(n) factor.)
Exercise 4.10. Use part 2 from Exercise 1.7 to prove the Singleton bound.
Exercise 4.11. Let C be an (n, k, d )q code. Then prove that fixing any n −d +1 positions uniquely
determines the corresponding codeword.
Exercise 4.12. Let C be a binary code of block length n and distance n/2. Then |C | ≤ 2n. (Note
that this is a factor 2 better than part 1 in Theorem 4.4.1.)
Exercise 4.13. Prove that the bound in Exercise 4.12 is tight, i.e. there exist binary codes C with
block length n and distance n/2 such that |C| = 2n.
Exercise 4.14. Prove that part 1 of Lemma 4.4.3 is tight.
Exercise 4.15. In this exercise we will prove the Plotkin bound (at least part 2 of Theorem 4.4.1)
via a purely combinatorial proof. Given an (n, k, d)_q code C with d > (1 − 1/q)·n, define
    S = Σ_{c_1 ≠ c_2 ∈ C} Δ(c_1, c_2).
For the rest of the problem think of C as an |C| × n matrix where each row corresponds to a
codeword in C. Now consider the following:
1. Looking at the contribution of each column in the matrix above, argue that
    S ≤ (1 − 1/q) · n·|C|².
2. Looking at the contribution of the rows in the matrix above, argue that
    S ≥ |C|·(|C| − 1) · d.
Exercise 4.17. Use Exercise 4.16 to prove part 2 of Theorem 4.4.1 for linear codes.
Exercise 4.18. Use Exercise 4.16 to prove Theorem 4.3.1 for linear codes.
Chapter 5
Reed-Solomon Codes
In this chapter, we will study the Reed-Solomon codes. Reed-Solomon codes have been studied
a lot in coding theory. These codes are optimal in the sense that they meet the Singleton bound
(Theorem 4.3.1). We would like to emphasize that these codes meet the Singleton bound not
just asymptotically in terms of rate and relative distance but also in terms of the dimension,
block length and distance. As if this were not enough, Reed-Solomon codes turn out to be more
versatile: they have many applications outside of coding theory. (We will see some applications
later in the book.)
These codes are defined in terms of univariate polynomials (i.e. polynomials in one un-
known/variable) with coefficients from a finite field Fq . It turns out that polynomials over Fp ,
for prime p, also help us define finite fields Fp s , for s > 1. To kill two birds with one stone1 , we
first do a quick review of polynomials over finite fields. Then we will define and study some
properties of Reed-Solomon codes.
Definition 5.1.2. For F(X) = Σ_{i=0}^{d} f_i X^i with f_d ≠ 0, we call d the degree of F(X). We denote the
degree of the polynomial F(X) by deg(F).
Addition:
    F(X) + G(X) = Σ_{i=0}^{max(deg(F), deg(G))} (f_i + g_i)·X^i,
where the addition on the coefficients is done over F_q. For example, over F_2, X + (1 + X) =
X·(1 + 1) + 1·(0 + 1) = 1 (recall that over F_2, 1 + 1 = 0).²
Multiplication:
    F(X) · G(X) = Σ_{i=0}^{deg(F)+deg(G)} (Σ_{j=0}^{min(i, deg(F))} f_j · g_{i−j})·X^i,
where all the operations on the coefficients are over F_q. For example, over F_2, X·(1 + X) =
X + X²; (1 + X)² = 1 + 2X + X² = 1 + X², where the latter equality follows since 2 ≡ 0 mod 2.
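These coefficient-wise rules translate directly into code. Below is a minimal Python sketch over a prime field F_p (function names ours), with a polynomial stored as its list of coefficients f_0, f_1, . . . ; the two F_2 examples above are reproduced at the bottom.

def poly_add(f, g, p):
    # Add two polynomials (coefficient lists, lowest degree first) over F_p.
    out = [0] * max(len(f), len(g))
    for i, fi in enumerate(f):
        out[i] = (out[i] + fi) % p
    for i, gi in enumerate(g):
        out[i] = (out[i] + gi) % p
    return out

def poly_mul(f, g, p):
    # Multiply two polynomials over F_p.
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] = (out[i + j] + fi * gj) % p
    return out

if __name__ == "__main__":
    # Over F_2: X + (1 + X) = 1 and (1 + X)^2 = 1 + X^2.
    X = [0, 1]
    one_plus_X = [1, 1]
    print(poly_add(X, one_plus_X, 2))           # [1, 0], i.e. the constant 1
    print(poly_mul(one_plus_X, one_plus_X, 2))  # [1, 0, 1], i.e. 1 + X^2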
Theorem 5.1.5. Let E (X ) be an irreducible polynomial with degree at least 2 over Fp , p prime.
Then the set of polynomials in Fp [X ] modulo E (X ), denoted by Fp [X ]/E (X ), is a field.
² This will be a good time to remember that operations over a finite field are much different from operations over
integers/reals. For example, over reals/integers X + (X + 1) = 2X + 1.
The proof of the theorem above is similar to the proof of Lemma 2.1.4, so we only sketch the
proof here. In particular, we will explicitly state the basic tenets of Fp [X ]/E (X ).
• Elements are polynomials in F_p[X] of degree at most s − 1. Note that there are p^s such
polynomials.
• The additive identity is the zero polynomial, and the additive inverse of any element F (X )
is −F (X ).
• The multiplicative identity is the constant polynomial 1. It can be shown that for every
element F (X ), there exists a unique multiplicative inverse (F (X ))−1 .
For example, for p = 2 and E (X ) = 1+X +X 2 , F2 [X ]/(1+X +X 2 ) has as its elements {0, 1, X , 1+
X }. The additive inverse of any element in F2 [X ]/(1 + X + X 2 ) is the element itself while the
multiplicative inverses of 1, X and 1 + X in F2 [X ]/(1 + X + X 2 ) are 1, 1 + X and X respectively.
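This small field can be tabulated explicitly. The sketch below (names ours) multiplies polynomials over F_2 and reduces the products modulo E(X) = 1 + X + X², from which the multiplication table and the inverses listed above can be read off.

def poly_mul(f, g, p):
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] = (out[i + j] + fi * gj) % p
    return out

def poly_mod(f, e, p):
    # Remainder of f modulo e over F_p (p prime, leading coefficient of e non-zero).
    f = f[:]
    while len(f) >= len(e) and any(f):
        if f[-1] == 0:
            f.pop()
            continue
        shift = len(f) - len(e)
        coef = f[-1] * pow(e[-1], p - 2, p) % p  # inverse of the leading coefficient
        for i, ei in enumerate(e):
            f[shift + i] = (f[shift + i] - coef * ei) % p
        f.pop()
    return f

if __name__ == "__main__":
    p, E = 2, [1, 1, 1]  # E(X) = 1 + X + X^2, irreducible over F_2
    elements = {"0": [0], "1": [1], "X": [0, 1], "1+X": [1, 1]}
    # Multiplication table of F_2[X]/(E(X)); e.g. X * (1+X) reduces to 1,
    # confirming that X and 1+X are each other's multiplicative inverses.
    for a_name, a in elements.items():
        for b_name, b in elements.items():
            print(a_name, "*", b_name, "=", poly_mod(poly_mul(a, b, p), E, p))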
A natural question to ask is if irreducible polynomials exist for every degree. Indeed, they
do:
Theorem 5.1.6. For every prime power q and every integer s ≥ 1, there exists a monic irreducible
polynomial of degree s over F_q. In fact, a uniformly random monic polynomial of degree s over F_q
is irreducible with probability Ω(1/s).
Algorithm 8 Generating Irreducible Polynomial
INPUT: Prime power q and an integer s > 1
OUTPUT: A monic irreducible polynomial of degree s over F_q
1: b ← 0
2: WHILE b = 0 DO
3:    F(X) ← X^s + Σ_{i=0}^{s−1} f_i X^i, where each f_i is chosen uniformly at random from F_q.
4:    IF gcd(F(X), X^{q^s} − X) = F(X) THEN
5:        b ← 1.
6: RETURN F(X)
Corollary 5.1.7. There is a Las Vegas algorithm to generate an irreducible polynomial of degree s
over any Fq in expected time poly(s, log q).
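For small parameters the existence claim can be checked by brute force. The sketch below (names ours) lists the monic irreducible polynomials of a given degree over F_2 by trial division over all lower-degree monic factors; note that this is a simple sanity check, not the gcd-based test used in Algorithm 8.

from itertools import product

def poly_mul(f, g):
    # Polynomial multiplication over F_2 (coefficients 0/1, XOR accumulates mod 2).
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] ^= fi & gj
    return out

def monic_polys(degree):
    # All monic polynomials of the given degree over F_2 (low-degree coefficients first).
    for low in product(range(2), repeat=degree):
        yield list(low) + [1]

def is_irreducible(f):
    # Trial division: f (monic, degree >= 1) has no monic factorization g * h with deg >= 1.
    d = len(f) - 1
    for dg in range(1, d // 2 + 1):
        for g in monic_polys(dg):
            for h in monic_polys(d - dg):
                if poly_mul(g, h) == f:
                    return False
    return True

if __name__ == "__main__":
    s = 3
    for f in monic_polys(s):
        if is_irreducible(f):
            print(f)  # e.g. [1, 1, 0, 1] is 1 + X + X^3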
Now recall that Theorem 2.1.5 states that for every prime power p^s, there is a unique field F_{p^s}.
This along with Theorems 5.1.5 and 5.1.6 imply that:
Corollary 5.1.8. The field Fp s is Fp [X ]/E (X ), where E (X ) is an irreducible polynomial of degree
s.
For example, the first row below lists all the codewords in the [3, 2]_3 Reed-Solomon code
where the evaluation points are F_3 (and the codewords are ordered by the corresponding mes-
sages from F_3^2 in lexicographic order, where for clarity the second row shows the polynomial
f_m(X) for the corresponding m ∈ F_3^2 in gray):
(0,0,0), (1,1,1), (2,2,2), (0,1,2), (1,2,0), (2,0,1), (0,2,1), (1,0,2), (2,1,0)
0, 1, 2, X, X+1, X+2, 2X, 2X+1, 2X+2
Notice that by definition, the entries in {α_1, ..., α_n} are distinct and thus, we must have n ≤ q.
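The codeword table above can be regenerated in a few lines. In the sketch below (names ours) each polynomial m_0 + m_1·X is evaluated at the points 0, 1, 2 over F_3; the messages are iterated so that the polynomials appear in the same order as in the table, and the minimum distance comes out to n − k + 1 = 2.

def rs_encode(coeffs, alphas, q):
    # Evaluate f(X) = coeffs[0] + coeffs[1] X + ... at each point in alphas, over F_q.
    return tuple(sum(c * pow(a, i, q) for i, c in enumerate(coeffs)) % q for a in alphas)

if __name__ == "__main__":
    q = 3
    alphas = [0, 1, 2]  # evaluation points: all of F_3
    codewords = []
    for m1 in range(q):       # coefficient of X
        for m0 in range(q):   # constant coefficient
            codewords.append(rs_encode((m0, m1), alphas, q))
    print(codewords)
    dist = min(sum(a != b for a, b in zip(u, v))
               for u in codewords for v in codewords if u != v)
    print("minimum distance:", dist)  # expect n - k + 1 = 2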
We now turn to some properties of Reed-Solomon codes.
Claim 5.2.2. RS is an [n, k]_q linear code.
Proof. The proof follows from the fact that if a ∈ F_q and f(X), g(X) ∈ F_q[X] are polynomials of
degree ≤ k −1, then a f (X ) and f (X )+ g (X ) are also polynomials of degree ≤ k −1. In particular,
let messages m1 and m2 be mapped to f m1 (X ) and f m2 (X ) where f m1 (X ), f m2 (X ) ∈ Fq [X ] are
polynomials of degree at most k − 1 and because of the mapping defined in (5.1), it is easy to
verify that:
f m1 (X ) + f m2 (X ) = f m1 +m2 (X ),
and
a f m1 (X ) = f am1 (X ).
In other words,
RS(m1 ) + RS(m2 ) = RS(m1 + m2 )
aRS(m1 ) = RS(am1 ).
Therefore RS is an [n, k]_q linear code.
Claim 5.2.3. RS is a [n, k, n − k + 1]q code. That is, it matches the Singleton bound.
The claim on the distance follows from the fact that every non-zero polynomial of degree
k − 1 over Fq [X ] has at most k − 1 (not necessarily distinct) roots, which we prove first (see
Proposition 5.2.4 below). This implies that if two polynomials agree on more than k − 1 places
then they must be the same polynomial. Note that this implies that two distinct polynomials of degree
at most k − 1, when evaluated at the same n (distinct) points, must differ in at least n − (k − 1) = n − k + 1
positions, which is what we want.
Proposition 5.2.4 (“Degree Mantra"). A nonzero polynomial f(X) of degree t over a field F_q has
at most t roots in F_q.
Proof. We will prove the theorem by induction on t . If t = 0, we are done. Now, consider f (X )
of degree t > 0. Let α ∈ Fq be a root such that f (α) = 0. If no such root α exists, we are done. If
there is a root α, then we can write
f (X ) = (X − α)g (X )
where deg(g ) = deg( f ) − 1 (i.e. X − α divides f (X )). Note that g (X ) is non-zero since f (X ) is
non-zero. This is because by the fundamental rule of division of polynomials:
f (X ) = (X − α)g (X ) + R(X )
where deg(R) ≤ 0 (as the degree cannot be negative this in turn implies that deg(R) = 0) and
since f (α) = 0,
f (α) = 0 + R(α),
which implies that R(α) = 0. Since R(X ) has degree zero (i.e. it is a constant polynomial), this
implies that R(X ) ≡ 0.
Finally, as g (X ) is non-zero and has degree t − 1, by induction, g (X ) has at most t − 1 roots,
which implies that f (X ) has at most t roots.
Proof of Claim 5.2.3. We start by proving the claim on the distance. Fix arbitrary m1 6= m2 ∈
Fkq . Note that f m1 (X ), f m2 (X ) ∈ Fq [X ] are distinct polynomials of degree at most k − 1 since
m1 6= m2 ∈ Fkq . Then f m1 (X ) − f m2 (X ) 6= 0 also has degree at most k − 1. Note that w t (RS(m2 ) −
RS(m1 )) = ∆(RS(m1 ), RS(m2 )). The weight of RS(m2 )−RS(m1 ) is n minus the number of zeroes
in RS(m2 ) − RS(m1 ), which is equal to n minus the number of roots that f m1 (X ) − f m2 (X ) has
among {α_1, ..., α_n}. That is,
    Δ(RS(m_1), RS(m_2)) = wt(RS(m_2) − RS(m_1)) ≥ n − (k − 1) = n − k + 1,
as desired. (The claim on the dimension follows since the distance bound above in particular shows
that distinct messages are mapped to distinct codewords, so the code has q^k codewords.)
Finally, we note that the generator matrix of RS (with the monomial basis 1, X, . . . , X^{k−1} and
evaluation points α_1, . . . , α_n) is the k × n matrix whose (i, j)-th entry, for 0 ≤ i ≤ k − 1 and j ∈ [n],
is α_j^i:
    1            1            · · ·   1            · · ·   1
    α_1          α_2          · · ·   α_j          · · ·   α_n
    α_1^2        α_2^2        · · ·   α_j^2        · · ·   α_n^2
    ⋮            ⋮                    ⋮                    ⋮
    α_1^i        α_2^i        · · ·   α_j^i        · · ·   α_n^i
    ⋮            ⋮                    ⋮                    ⋮
    α_1^{k−1}    α_2^{k−1}    · · ·   α_j^{k−1}    · · ·   α_n^{k−1}
The class of codes that match the Singleton bound have their own name, which we define
and study next.
Definition 5.3.1 (MDS codes). An (n, k, d)_q code is called a Maximum Distance Separable (MDS)
code if d = n − k + 1.
Definition 5.3.2. For any subset of indices S ⊆ [n] of size exactly k and a code C ⊆ Σn , C S is the
set of all codewords in C projected onto the indices in S.
MDS codes have the following nice property that we shall prove for the special case of Reed-
Solomon codes first and subsequently for the general case as well.
Proposition 5.3.3. Let C ⊆ Σ^n be an MDS code of integral dimension k. Then for all S ⊆ [n] such
that |S| = k, we have C_S = Σ^k.
Before proving Proposition 5.3.3 in its full generality, we present its proof for the special case of
Reed-Solomon codes.
Consider any S ⊆ [n] of size k and fix an arbitrary v = (v_1, . . . , v_k) ∈ F_q^k; we need to show that
there exists a codeword c ∈ RS (assume that the RS code evaluates polynomials of degree at
most k − 1 over {α_1, . . . , α_n} ⊆ F_q) such that c_S = v. Consider a generic degree k − 1 polynomial
    F(X) = Σ_{i=0}^{k−1} f_i X^i.
Thus, we need to show that there exists F(X) such that F(α_i) = v_i for all i ∈ S, where |S| = k.
For notational simplicity, assume that S = [k]. We think of f i ’s as unknowns in the equations
that arise out of the relations F (αi ) = v i . Thus, we need to show that there is a solution to the
following system of linear equations:
    (f_0, f_1, . . . , f_{k−1}) · V = (v_1, v_2, . . . , v_k),
where V is the k × k matrix whose (i, j)-th entry, for 0 ≤ i ≤ k − 1 and j ∈ [k], is α_j^i:
    1            1            · · ·   1
    α_1          α_2          · · ·   α_k
    α_1^2        α_2^2        · · ·   α_k^2
    ⋮            ⋮                    ⋮
    α_1^{k−1}    α_2^{k−1}    · · ·   α_k^{k−1}
The above constraint matrix is a Vandermonde matrix and is known to have full rank (see Ex-
ercise 5.7). Hence, by Exercise 2.6, there always exists a unique solution for (f_0, . . . , f_{k−1}). This
completes the proof for Reed-Solomon codes.
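For Reed-Solomon codes this argument is constructive: solving the Vandermonde system is just polynomial interpolation. The sketch below (names ours) uses Lagrange interpolation over a prime field to find the unique polynomial of degree at most k − 1 taking prescribed values on S, and checks it on the [3, 2]_3 example code above.

def interpolate(points, values, q):
    # Coefficients of the unique degree <= k-1 polynomial over F_q (q prime)
    # with f(points[i]) = values[i] for all i.
    k = len(points)
    coeffs = [0] * k
    for i in range(k):
        # Lagrange basis L_i(X) = prod_{j != i} (X - points[j]) / (points[i] - points[j]).
        basis = [1]
        denom = 1
        for j in range(k):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for t, b in enumerate(basis):
                new[t] = (new[t] - points[j] * b) % q
                new[t + 1] = (new[t + 1] + b) % q
            basis = new
            denom = denom * (points[i] - points[j]) % q
        scale = values[i] * pow(denom, q - 2, q) % q
        for t, b in enumerate(basis):
            coeffs[t] = (coeffs[t] + scale * b) % q
    return coeffs

if __name__ == "__main__":
    q = 3
    # Find a codeword of the [3, 2]_3 RS code whose projection onto the first
    # two evaluation points (0 and 1) equals v = (2, 0).
    coeffs = interpolate(points=[0, 1], values=[2, 0], q=q)
    print(coeffs)                                                # [2, 1], i.e. f(X) = 2 + X
    print([(coeffs[0] + coeffs[1] * a) % q for a in [0, 1, 2]])  # full codeword (2, 0, 1)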
Next, we prove the property for the general case.
Proof of Proposition 5.3.3. Consider a |C | × n matrix where each row represents a codeword
in C . Hence, there are |C | = |Σ|k rows in the matrix. The number of columns is equal to the
block length n of the code. Since C is Maximum Distance Separable, its distance d = n − k + 1.
Let S ⊆ [n] be of size exactly k. It is easy to see that for any c_i ≠ c_j ∈ C, the corresponding
projections (c_i)_S and (c_j)_S ∈ C_S are not the same, as otherwise the two codewords would agree on
the k positions in S and hence Δ(c_i, c_j) ≤ n − k = d − 1, which is not possible
as the minimum distance of the code C is d . Therefore, every codeword in C gets mapped to a
distinct codeword in C S . As a result, |C S | = |C | = |Σ|k . As C S ⊆ Σk , this implies that C S = Σk , as
desired. ✷
Proposition 5.3.3 implies an important property in pseudorandomness: see Exercise 5.8 for
more.
5.4 Exercises
Exercise 5.1. Prove that X 2 + X + 1 is the unique irreducible polynomial of degree two over F2 .
Exercise 5.2. Argue that any function f : Fq → Fq is equivalent to a polynomial P (X ) ∈ Fq [X ] of
degree at most q − 1: that is, for every α ∈ Fq
f (α) = P (α).
Exercise 5.3. For any [n, k]q Reed-Solomon code, exhibit two codewords that are at Hamming
distance exactly n − k + 1.
Exercise 5.4. Let RS_{F_q}[n, k] denote the Reed-Solomon code over F_q where the set of evaluation
points is F_q (i.e. n = q). Prove that
    (RS_{F_q}[n, k])^⊥ = RS_{F_q}[n, n − k],
that is, the dual of these Reed-Solomon codes are Reed-Solomon codes themselves. Conclude that
Reed-Solomon codes contain self-dual codes (see Exercise 2.31 for a definition).
Exercise 5.5. Since Reed-Solomon codes are linear codes, by Proposition 2.3.5, one can do error
detection for Reed-Solomon codes in quadratic time. In this problem, we will see that one can
design even more efficient error detection algorithm for Reed-Solomon codes. In particular, we
will consider data streaming algorithms (see Section 22.5 for more motivation on this class of al-
gorithms). A data stream algorithm makes a sequential pass on the input, uses poly-logarithmic
space and spends only poly-logarithmic time on each location in the input. In this problem we
show that there exists a randomized data stream algorithm to solve the error detection problem
for Reed-Solomon codes.
2. Given [q, k]q Reed-Solomon code C (i.e. with the evaluation points being Fq ), present a
data stream algorithm for error detection of C with O(log q) space and polylog(q) time per
position of the received word. The algorithm should work correctly with probability at least
2/3. You should assume that the data stream algorithm has access to the values of k and q
(and knows that C has Fq as its evaluation points).
Hint: Part 1 and Exercise 5.4 should be helpful.
Exercise 5.6. We have defined Reed-Solomon in this chapter and Hadamard codes in Section 2.6.
In this problem we will prove that certain alternate definitions also suffice.
1. Consider the Reed-Solomon code over a field Fq and block length n = q − 1 defined as
2. Recall that the [2^r, r, 2^{r−1}]_2 Hadamard code is generated by the r × 2^r matrix whose i-th (for
0 ≤ i ≤ 2^r − 1) column is the binary representation of i. Briefly argue that the Hadamard
⁶ This means that F_q^* = {1, α, . . . , α^{n−1}}. Further, α^n = 1.
codeword for the message (m 1 , m 2 , . . . , m r ) ∈ {0, 1}r is the evaluation of the (multivariate)
polynomial m 1 X 1 + m 2 X 2 + · · · + m r X r (where X 1 , . . . , X r are the r variables) over all the
possible assignments to the variables (X 1 , . . . , X r ) from {0, 1}r .
Using the definition of Hadamard codes above (re)prove the fact that the code has distance
2r −1 .
Exercise 5.7. Prove that the k × k Vandermonde matrix (where the (i, j)-th entry is α_j^i) has full
rank (where α_1, . . . , α_k are distinct).
Exercise 5.8. A set S ⊆ Fnq is said to be a t -wise independent source (for some 1 ≤ t ≤ n) if given
a uniformly random sample (X 1 , . . . , X n ) from S, the n random variables are t -wise independent:
i.e. any subset of t variables are uniformly independent random variables over Fq . We will explore
properties of these objects in this exercise.
1. Argue that the definition of t -wise independent source is equivalent to the definition in
Exercise 2.13.
2. Argue that for any k ≥ 1, any [n, k]q code C is a 1-wise independent source.
3. Prove that any [n, k]q MDS code is a k-wise independent source.
4. Using part 3 or otherwise, prove that there exists a k-wise independent source over F_2 of
size at most (2n)^k. Conclude that k(log_2 n + 1) uniformly random and independent bits
are enough to compute n random bits that are k-wise independent. Improve the bound
slightly to show that k(log_2 n − log_2 log_2 n + O(1)) random bits are enough to generate a k-
wise independent source over F_2.
5. For 0 < p ≤ 1/2, we say the n binary random variables X_1, . . . , X_n are p-biased and t-
wise independent if any t of the random variables are independent and Pr[X_i = 1] = p
for every i ∈ [n]. For the rest of the problem, let p be a power of 1/2. Then show that
any t·log_2(1/p)-wise independent random variables can be converted into t-wise inde-
pendent p-biased random variables. Conclude that one can construct such sources with
t·log_2(1/p)·(1 + log_2(n·log_2(1/p))) uniformly random bits. Then improve this bound to t·(1 +
max(log_2(1/p), log_2 n)) uniformly random bits.
Exercise 5.9. In many applications, errors occur in “bursts", i.e. all the error locations are con-
tained in a contiguous region (think of a scratch on a DVD or disk). In this problem we will see
how one can use Reed-Solomon codes to correct bursty errors.
An error vector e ∈ {0, 1}^n is called a t-single burst error pattern if all the non-zero bits in e
occur in the range [i, i + t − 1] for some 1 ≤ i ≤ n − t + 1. Further, a vector e ∈ {0, 1}^n is called an
(s, t)-burst error pattern if it is the union of at most s t-single burst error patterns (i.e. all non-zero
bits in e are contained in one of at most s contiguous ranges in [n]).
We call a binary code C ⊆ {0, 1}n to be (s, t )-burst error correcting if one can uniquely decode
from any (s, t )-burst error pattern. More precisely, given an (s, t )-burst error pattern e and any
codeword c ∈ C , the only codeword c′ ∈ C such that (c + e) − c′ is an (s, t )-burst error pattern
satisfies c′ = c.
1. Argue that if C is (st)-error correcting (in the sense of Definition 1.3.5), then it is also (s, t)-
burst error correcting. Conclude that for any ε > 0, there exists a code with rate Ω(ε²) and
block length n that is (s, t)-burst error correcting for any s, t such that s·t ≤ (1/4 − ε)·n.
2. Argue that for any rate R > 0 and for large enough n, there exist (s, t)-burst error correcting
codes of rate R as long as s·t ≤ ((1 − R − ε)/2)·n and t ≥ Ω(log n / ε). In particular, one can correct
from a 1/2 − ε fraction of burst errors (as long as each burst is “long enough") with rate Ω(ε)
(compare this with item 1).
Hint: Use Reed-Solomon codes.
Exercise 5.10. In this problem we will look at a very important class of codes called BCH codes⁷.
Let F = F_{2^m}. Consider the binary code C_BCH defined as RS_F[n, k, n − k + 1] ∩ F_2^n.
1. Prove that C_BCH is a binary linear code of distance at least d = n − k + 1 and dimension at
least n − (d − 1)·log_2(n + 1).
Hint: Use the characterization (5.2) of the Reed-Solomon code from Exercise 5.6.
2. Prove a better lower bound of n − ⌈(d − 1)/2⌉·log_2(n + 1) on the dimension of C_BCH.
Hint: Try to find redundant checks among the “natural” parity checks defining C_BCH.
3. For d = 3, C_BCH is the same as another code we have seen. What is that code?
4. For constant d (and growing n), prove that C_BCH has nearly optimal dimension for dis-
tance d, in that the dimension cannot be n − t·log_2(n + 1) for t < (d − 1)/2.
Exercise 5.11. Show that for 1 ≤ k ≤ n, ⌈k/2⌉·log_2(n + 1) random bits are enough to compute n bits
that are k-wise independent. Note that this is an improvement of almost a factor of 2 over the
bound from Exercise 5.8 part 4 (and this new bound is known to be optimal).
Hint: Use Exercises 2.13 and 5.10.
Exercise 5.12. In this exercise, we continue in the theme of Exercise 5.10 and look at the inter-
section of a Reed-Solomon code with Fn2 to get a binary code. Let F = F2m . Fix positive integers
d , n with (d − 1)m < n < 2m , and a set S = {α1 , α2 , . . . , αn } of n distinct nonzero elements of F. For
a vector v = (v 1 , . . . , v n ) ∈ (F∗ )n of n not necessarily distinct nonzero elements from F, define the
Generalized Reed-Solomon code GRSS,v,d as follows:
2. Argue that GRS_{S,v,d} ∩ F_2^n is a binary linear code of rate at least 1 − (d − 1)m/n.
⁷ The acronym BCH stands for Bose-Chaudhuri-Hocquenghem, the discoverers of this family of codes.
3. Let c ∈ F_2^n be a nonzero binary vector. Prove that (for every choice of d, S) there are at most
(2^m − 1)^{n−d+1} choices of the vector v for which c ∈ GRS_{S,v,d}.
4. Using the above, prove that if the integer D satisfies Vol_2(n, D − 1) < (2^m − 1)^{d−1} (where
Vol_2(n, D − 1) = Σ_{i=0}^{D−1} (n choose i)), then there exists a vector v ∈ (F^*)^n such that the minimum dis-
tance of the binary code GRS_{S,v,d} ∩ F_2^n is at least D.
5. Using parts 2 and 4 above (or otherwise), argue that the family of codes GRSS,v,d ∩ Fn2 con-
tains binary linear codes that meet the Gilbert-Varshamov bound.
Exercise 5.13. In this exercise we will show that the dual of a GRS code is a GRS itself with differ-
ent parameters. First, we state the obvious definition of GRS codes over a general finite field Fq (as
opposed to the definition over fields of characteristic two in Exercise 5.12). In particular, define
the code GRSS,v,d ,q as follows:
Exercise 5.14. In Exercise 2.16, we saw that any linear code can be converted in to a systematic
code. In other words, there is a map to convert Reed-Solomon codes into a systematic one. In
this exercise the goal is to come up with an explicit encoding function that results in a systematic
Reed-Solomon code.
In particular, given the set of evaluation points α1 , . . . , αn , design an explicit map f from Fkq
to a polynomial of degree at most k −1 such that the following holds. For every message m ∈ Fkq , if
the corresponding polynomial is f_m(X), then the vector (f_m(α_i))_{i∈[n]} has the message m appear
in the corresponding codeword (say in its first k positions). Further, argue that this map results in
an [n, k, n − k + 1]q code.
Exercise 5.15. In this problem, we will consider the number-theoretic counterpart of Reed-Solomon
codes. Let 1 ≤ k < n be integers and let p_1 < p_2 < · · · < p_n be n distinct primes. Denote K = Π_{i=1}^{k} p_i
and N = Π_{i=1}^{n} p_i. The notation Z_M stands for the integers modulo M, i.e., the set {0, 1, . . . , M − 1}. Con-
sider the Chinese Remainder code defined by the encoding map E : ZK → Zp 1 × Zp 2 × · · · × Zp n
defined by:
E (m) = (m mod p 1 , m mod p 2 , · · · , m mod p n ) .
(Note that this is not a code in the usual sense we have been studying since the symbols at different
positions belong to different alphabets. Still notions such as distance of this code make sense and
are studied in the question below.)
Suppose that m_1 ≠ m_2. For 1 ≤ i ≤ n, define the indicator variable b_i = 1 if E(m_1)_i ≠ E(m_2)_i
and b_i = 0 otherwise. Prove that Π_{i=1}^{n} p_i^{b_i} > N/K.
Use the above to deduce that when m 1 6= m 2 , the encodings E (m 1 ) and E (m 2 ) differ in at least
n − k + 1 locations.
Exercise 5.16. In this problem, we will consider derivatives over a finite field Fq . Unlike the
case of derivatives over reals, derivatives over finite fields do not have any physical interpretation
but as we shall see shortly, the notion of derivatives over finite fields is still a useful concept. In
particular, given a polynomial f(X) = Σ_{i=0}^{t} f_i X^i over F_q, we define its derivative as
    f′(X) = Σ_{i=0}^{t−1} (i + 1) · f_{i+1} · X^i.
Further, we will denote by f (i ) (X ), the result of applying the derivative on f i times. In this prob-
lem, we record some useful facts about derivatives.
1. Define R(X, Z) = f(X + Z) = Σ_{i=0}^{t} r_i(X) · Z^i. Then for any j ≥ 1,
    f^{(j)}(X) = j! · r_j(X).
3. Let j ≤ char(Fq ). Further, assume that for every 0 ≤ i < j , f (i ) (α) = 0 for some α ∈ Fq . Then
prove that (X − α) j divides f (X ).
4. Finally, we will prove the following generalization of the degree mantra (Proposition 5.2.4).
Let f(X) be a non-zero polynomial of degree t and m ≤ char(F_q). Then there exist at most
⌊t/m⌋ distinct elements α ∈ F_q such that f^{(j)}(α) = 0 for every 0 ≤ j < m.
Exercise 5.17. In this exercise, we will consider a code that is related to Reed-Solomon codes and
uses derivatives from Exercise 5.16. These codes are called derivative codes.
Let m ≥ 1 be an integer parameter and consider parameters k < char(Fq ) and n such that
m < k < nm. Then the derivative code with parameters (n, k, m) is defined as follow. Consider
any message m ∈ Fkq and let f m (X ) be the message polynomial as defined for the Reed-Solomon
code. Let α1 , . . . , αn ∈ Fq be distinct elements. Then the codeword for m is given by
    f_m(α_1)          f_m(α_2)          · · ·   f_m(α_n)
    f_m^{(1)}(α_1)    f_m^{(1)}(α_2)    · · ·   f_m^{(1)}(α_n)
    ⋮                 ⋮                        ⋮