
Dive into Deep Learning

Release 0.7.0

Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola


Contents

Preface

Installation

1 Introduction
1.1 A Motivating Example
1.2 The Key Components: Data, Models, and Algorithms
1.3 Kinds of Machine Learning
1.4 Roots
1.5 The Road to Deep Learning
1.6 Success Stories

2 Preliminaries
2.1 Data Manipulation
2.1.1 Getting Started
2.1.2 Operations
2.1.3 Broadcasting Mechanism
2.1.4 Indexing and Slicing
2.1.5 Saving Memory
2.1.6 Conversion to Other Python Objects
2.2 Data Preprocessing
2.2.1 Reading the Dataset
2.2.2 Handling Missing Data
2.2.3 Conversion to the ndarray Format
2.3 Linear Algebra
2.3.1 Scalars
2.3.2 Vectors
2.3.3 Matrices
2.3.4 Tensors
2.3.5 Basic Properties of Tensor Arithmetic
2.3.6 Reduction
2.3.7 Dot Products
2.3.8 Matrix-Vector Products
2.3.9 Matrix-Matrix Multiplication
2.3.10 Norms
2.3.11 More on Linear Algebra
2.4 Calculus
2.4.1 Derivatives and Differentiation
2.4.2 Partial Derivatives
2.4.3 Gradients
2.4.4 Chain Rule
2.5 Automatic Differentiation
2.5.1 A Simple Example
2.5.2 Backward for Non-Scalar Variables
2.5.3 Detaching Computation
2.5.4 Computing the Gradient of Python Control Flow
2.5.5 Training Mode and Prediction Mode
2.6 Probability
2.6.1 Basic Probability Theory
2.6.2 Dealing with Multiple Random Variables
2.6.3 Expectation and Variance
2.7 Documentation
2.7.1 Finding All the Functions and Classes in a Module
2.7.2 Finding the Usage of Specific Functions and Classes
2.7.3 API Documentation

3 Linear Neural Networks
3.1 Linear Regression
3.1.1 Basic Elements of Linear Regression
3.1.2 The Normal Distribution and Squared Loss
3.1.3 From Linear Regression to Deep Networks
3.2 Linear Regression Implementation from Scratch
3.2.1 Generating the Dataset
3.2.2 Reading the Dataset
3.2.3 Initializing Model Parameters
3.2.4 Defining the Model
3.2.5 Defining the Loss Function
3.2.6 Defining the Optimization Algorithm
3.2.7 Training
3.3 Concise Implementation of Linear Regression
3.3.1 Generating the Dataset
3.3.2 Reading the Dataset
3.3.3 Defining the Model
3.3.4 Initializing Model Parameters
3.3.5 Defining the Loss Function
3.3.6 Defining the Optimization Algorithm
3.3.7 Training
3.4 Softmax Regression
3.4.1 Classification Problems
3.4.2 Loss Function
3.4.3 Information Theory Basics
3.4.4 Model Prediction and Evaluation
3.5 The Image Classification Dataset (Fashion-MNIST)
3.5.1 Getting the Dataset
3.5.2 Reading a Minibatch
3.5.3 Putting All Things Together
3.6 Implementation of Softmax Regression from Scratch
3.6.1 Initializing Model Parameters
3.6.2 The Softmax
3.6.3 The Model
3.6.4 The Loss Function
3.6.5 Classification Accuracy
3.6.6 Model Training
3.6.7 Prediction
3.7 Concise Implementation of Softmax Regression
3.7.1 Initializing Model Parameters
3.7.2 The Softmax
3.7.3 Optimization Algorithm
3.7.4 Training

4 Multilayer Perceptrons
4.1 Multilayer Perceptron
4.1.1 Hidden Layers
4.1.2 Activation Functions
4.2 Implementation of Multilayer Perceptron from Scratch
4.2.1 Initializing Model Parameters
4.2.2 Activation Function
4.2.3 The Model
4.2.4 The Loss Function
4.2.5 Training
4.3 Concise Implementation of Multilayer Perceptron
4.3.1 The Model
4.4 Model Selection, Underfitting and Overfitting
4.4.1 Training Error and Generalization Error
4.4.2 Model Selection
4.4.3 Underfitting or Overfitting?
4.4.4 Polynomial Regression
4.5 Weight Decay
4.5.1 Squared Norm Regularization
4.5.2 High-Dimensional Linear Regression
4.5.3 Implementation from Scratch
4.5.4 Concise Implementation
4.6 Dropout
4.6.1 Overfitting Revisited
4.6.2 Robustness through Perturbations
4.6.3 Dropout in Practice
4.6.4 Implementation from Scratch
4.6.5 Concise Implementation
4.7 Forward Propagation, Backward Propagation, and Computational Graphs
4.7.1 Forward Propagation
4.7.2 Computational Graph of Forward Propagation
4.7.3 Backpropagation
4.7.4 Training a Model
4.8 Numerical Stability and Initialization
4.8.1 Vanishing and Exploding Gradients
4.8.2 Parameter Initialization
4.9 Considering the Environment
4.9.1 Distribution Shift
4.9.2 A Taxonomy of Learning Problems
4.9.3 Fairness, Accountability, and Transparency in Machine Learning
4.10 Predicting House Prices on Kaggle
4.10.1 Kaggle
4.10.2 Accessing and Reading the Dataset
4.10.3 Data Preprocessing
4.10.4 Training
4.10.5 k-Fold Cross-Validation
4.10.6 Model Selection
4.10.7 Predict and Submit

5 Deep Learning Computation
5.1 Layers and Blocks
5.1.1 A Custom Block
5.1.2 The Sequential Block
5.1.3 Blocks with Code
5.1.4 Compilation
5.2 Parameter Management
5.2.1 Parameter Access
5.2.2 Parameter Initialization
5.2.3 Tied Parameters
5.3 Deferred Initialization
5.3.1 Instantiating a Network
5.3.2 Deferred Initialization in Practice
5.3.3 Forced Initialization
5.4 Custom Layers
5.4.1 Layers without Parameters
5.4.2 Layers with Parameters
5.5 File I/O
5.5.1 Loading and Saving ndarrays
5.5.2 Gluon Model Parameters
5.6 GPUs
5.6.1 Computing Devices
5.6.2 ndarray and GPUs
5.6.3 Gluon and GPUs

6 Convolutional Neural Networks
6.1 From Dense Layers to Convolutions
6.1.1 Invariances
6.1.2 Constraining the MLP
6.1.3 Convolutions
6.1.4 Waldo Revisited
6.2 Convolutions for Images
6.2.1 The Cross-Correlation Operator
6.2.2 Convolutional Layers
6.2.3 Object Edge Detection in Images
6.2.4 Learning a Kernel
6.2.5 Cross-Correlation and Convolution
6.3 Padding and Stride
6.3.1 Padding
6.3.2 Stride
6.4 Multiple Input and Output Channels
6.4.1 Multiple Input Channels
6.4.2 Multiple Output Channels
6.4.3 1 × 1 Convolutional Layer
6.5 Pooling
6.5.1 Maximum Pooling and Average Pooling
6.5.2 Padding and Stride
6.5.3 Multiple Channels
6.6 Convolutional Neural Networks (LeNet)
6.6.1 LeNet
6.6.2 Data Acquisition and Training

7 Modern Convolutional Networks
7.1 Deep Convolutional Neural Networks (AlexNet)
7.1.1 Learning Feature Representation
7.1.2 AlexNet
7.1.3 Reading the Dataset
7.1.4 Training
7.2 Networks Using Blocks (VGG)
7.2.1 VGG Blocks
7.2.2 VGG Network
7.2.3 Model Training
7.3 Network in Network (NiN)
7.3.1 NiN Blocks
7.3.2 NiN Model
7.3.3 Data Acquisition and Training
7.4 Networks with Parallel Concatenations (GoogLeNet)
7.4.1 Inception Blocks
7.4.2 GoogLeNet Model
7.4.3 Data Acquisition and Training
7.5 Batch Normalization
7.5.1 Training Deep Networks
7.5.2 Batch Normalization Layers
7.5.3 Implementation from Scratch
7.5.4 Using a Batch Normalization LeNet
7.5.5 Concise Implementation
7.5.6 Controversy
7.6 Residual Networks (ResNet)
7.6.1 Function Classes
7.6.2 Residual Blocks
7.6.3 ResNet Model
7.6.4 Data Acquisition and Training
7.7 Densely Connected Networks (DenseNet)
7.7.1 Function Decomposition
7.7.2 Dense Blocks
7.7.3 Transition Layers
7.7.4 DenseNet Model
7.7.5 Data Acquisition and Training

8 Recurrent Neural Networks
8.1 Sequence Models
8.1.1 Statistical Tools
8.1.2 A Toy Example
8.1.3 Predictions
8.2 Text Preprocessing
8.2.1 Reading the Dataset
8.2.2 Tokenization
8.2.3 Vocabulary
8.2.4 Putting All Things Together
8.3 Language Models and the Dataset
8.3.1 Estimating a Language Model
8.3.2 Markov Models and n-grams
8.3.3 Natural Language Statistics
8.3.4 Training Data Preparation
8.4 Recurrent Neural Networks
8.4.1 Recurrent Networks Without Hidden States
8.4.2 Recurrent Networks with Hidden States
8.4.3 Steps in a Language Model
8.4.4 Perplexity
8.5 Implementation of Recurrent Neural Networks from Scratch
8.5.1 One-hot Encoding
8.5.2 Initializing the Model Parameters
8.5.3 RNN Model
8.5.4 Prediction
8.5.5 Gradient Clipping
8.5.6 Training
8.6 Concise Implementation of Recurrent Neural Networks
8.6.1 Defining the Model
8.6.2 Training and Predicting
8.7 Backpropagation Through Time
8.7.1 A Simplified Recurrent Network
8.7.2 The Computational Graph
8.7.3 BPTT in Detail

9 Modern Recurrent Networks
9.1 Gated Recurrent Units (GRU)
9.1.1 Gating the Hidden State
9.1.2 Implementation from Scratch
9.1.3 Concise Implementation
9.2 Long Short-Term Memory (LSTM)
9.2.1 Gated Memory Cells
9.2.2 Implementation from Scratch
9.2.3 Concise Implementation
9.3 Deep Recurrent Neural Networks
9.3.1 Functional Dependencies
9.3.2 Concise Implementation
9.3.3 Training
9.4 Bidirectional Recurrent Neural Networks
9.4.1 Dynamic Programming
9.4.2 Bidirectional Model
9.5 Machine Translation and the Dataset
9.5.1 Reading and Preprocessing the Dataset
9.5.2 Tokenization
9.5.3 Vocabulary
9.5.4 Loading the Dataset
9.5.5 Putting All Things Together
9.6 Encoder-Decoder Architecture
9.6.1 Encoder
9.6.2 Decoder
9.6.3 Model
9.7 Sequence to Sequence
9.7.1 Encoder
9.7.2 Decoder
9.7.3 The Loss Function
9.7.4 Training
9.7.5 Predicting
9.8 Beam Search
9.8.1 Greedy Search
9.8.2 Exhaustive Search
9.8.3 Beam Search

10 Attention Mechanisms
10.1 Attention Mechanism
10.1.1 Dot Product Attention
10.1.2 Multilayer Perceptron Attention
10.2 Sequence to Sequence with Attention Mechanism
10.2.1 Decoder
10.2.2 Training
10.3 Transformer
10.3.1 Multi-Head Attention
10.3.2 Position-wise Feed-Forward Networks
10.3.3 Add and Norm
10.3.4 Positional Encoding
10.3.5 Encoder
10.3.6 Decoder
10.3.7 Training

11 Optimization Algorithms
11.1 Optimization and Deep Learning
11.1.1 Optimization and Estimation
11.1.2 Optimization Challenges in Deep Learning
11.2 Convexity
11.2.1 Basics
11.2.2 Properties
11.2.3 Constraints
11.3 Gradient Descent
11.3.1 Gradient Descent in One Dimension
11.3.2 Multivariate Gradient Descent
11.3.3 Adaptive Methods
11.4 Stochastic Gradient Descent
11.4.1 Stochastic Gradient Updates
11.4.2 Dynamic Learning Rate
11.4.3 Convergence Analysis for Convex Objectives
11.4.4 Stochastic Gradients and Finite Samples
11.5 Minibatch Stochastic Gradient Descent
11.5.1 Vectorization and Caches
11.5.2 Minibatches
11.5.3 Reading the Dataset
11.5.4 Implementation from Scratch
11.5.5 Concise Implementation
11.6 Momentum
11.6.1 Basics
11.6.2 Practical Experiments
11.6.3 Theoretical Analysis
11.7 Adagrad
11.7.1 Sparse Features and Learning Rates
11.7.2 Preconditioning
11.7.3 The Algorithm
11.7.4 Implementation from Scratch
11.7.5 Concise Implementation
11.8 RMSProp
11.8.1 The Algorithm
11.8.2 Implementation from Scratch
11.8.3 Concise Implementation
11.9 Adadelta
11.9.1 The Algorithm
11.9.2 Implementation
11.10 Adam
11.10.1 The Algorithm
11.10.2 Implementation
11.10.3 Yogi
11.11 Learning Rate Scheduling
11.11.1 Toy Problem
11.11.2 Schedulers
11.11.3 Policies

12 Computational Performance
12.1 Compilers and Interpreters
12.1.1 Symbolic Programming
12.1.2 Hybrid Programming
12.1.3 HybridSequential
12.2 Asynchronous Computing
12.2.1 Asynchronous Programming in MXNet
12.2.2 Using the Synchronization Function to Make the Front End Wait for Computation Results
12.2.3 Using Asynchronous Programming to Improve Computing Performance
12.2.4 The Impact of Asynchronous Programming on Memory
12.3 Automatic Parallelism
12.3.1 Parallel Computation Using CPUs and GPUs
12.3.2 Parallelizing Computation and Communication
12.4 Multi-GPU Computation Implementation from Scratch
12.4.1 Data Parallelism
12.4.2 Defining the Model
12.4.3 Synchronizing Data Among Multiple GPUs
12.4.4 Splitting a Data Batch Across Multiple GPUs
12.4.5 Multi-GPU Training on a Single Minibatch
12.4.6 Training Functions
12.4.7 Multi-GPU Training Experiment
12.5 Concise Implementation of Multi-GPU Computation
12.5.1 Initializing Model Parameters on Multiple GPUs
12.5.2 Multi-GPU Model Training

13 Computer Vision
13.1 Image Augmentation
13.1.1 Common Image Augmentation Methods
13.1.2 Training a Model with Image Augmentation
13.2 Fine Tuning
13.2.1 Hot Dog Recognition
13.3 Object Detection and Bounding Boxes
13.3.1 Bounding Box
13.4 Anchor Boxes
13.4.1 Generating Multiple Anchor Boxes
13.4.2 Intersection over Union
13.4.3 Labeling Training Set Anchor Boxes
13.4.4 Bounding Boxes for Prediction
13.5 Multiscale Object Detection
13.6 The Object Detection Dataset (Pikachu)
13.6.1 Downloading the Dataset
13.6.2 Reading the Dataset
13.6.3 Demonstration
13.7 Single Shot Multibox Detection (SSD)
13.7.1 Model
13.7.2 Training
13.7.3 Prediction
13.8 Region-based CNNs (R-CNNs)
13.8.1 R-CNNs
13.8.2 Fast R-CNN
13.8.3 Faster R-CNN
13.8.4 Mask R-CNN
13.9 Semantic Segmentation and the Dataset
13.9.1 Image Segmentation and Instance Segmentation
13.9.2 The Pascal VOC2012 Semantic Segmentation Dataset
13.10 Transposed Convolution
13.10.1 Basic 2D Transposed Convolution
13.10.2 Padding, Strides, and Channels
13.10.3 Analogy to Matrix Transposition
13.11 Fully Convolutional Networks (FCN)
13.11.1 Constructing a Model
13.11.2 Initializing the Transposed Convolution Layer
13.11.3 Reading the Dataset
13.11.4 Training
13.11.5 Prediction
13.12 Neural Style Transfer
13.12.1 Technique
13.12.2 Reading the Content and Style Images
13.12.3 Preprocessing and Postprocessing
13.12.4 Extracting Features
13.12.5 Defining the Loss Function
13.12.6 Creating and Initializing the Composite Image
13.12.7 Training
13.13 Image Classification (CIFAR-10) on Kaggle
13.13.1 Obtaining and Organizing the Dataset
13.13.2 Image Augmentation
13.13.3 Reading the Dataset
13.13.4 Defining the Model
13.13.5 Defining the Training Functions
13.13.6 Training and Validating the Model
13.13.7 Classifying the Testing Set and Submitting Results on Kaggle
13.14 Dog Breed Identification (ImageNet Dogs) on Kaggle
13.14.1 Obtaining and Organizing the Dataset
13.14.2 Image Augmentation
13.14.3 Reading the Dataset
13.14.4 Defining the Model
13.14.5 Defining the Training Functions
13.14.6 Training and Validating the Model
13.14.7 Classifying the Testing Set and Submitting Results on Kaggle

14 Natural Language Processing
14.1 Word Embedding (word2vec)
14.1.1 Why Not Use One-hot Vectors?
14.1.2 The Skip-Gram Model
14.1.3 The Continuous Bag of Words (CBOW) Model
14.2 Approximate Training for Word2vec
14.2.1 Negative Sampling
14.2.2 Hierarchical Softmax
14.3 The Dataset for Word2vec
14.3.1 Reading and Preprocessing the Dataset
14.3.2 Subsampling
14.3.3 Loading the Dataset
14.3.4 Putting All Things Together
14.4 Implementation of Word2vec
14.4.1 The Skip-Gram Model
14.4.2 Training
14.4.3 Applying the Word Embedding Model
14.5 Subword Embedding (fastText)
14.6 Word Embedding with Global Vectors (GloVe)
14.6.1 The GloVe Model
14.6.2 Understanding GloVe from Conditional Probability Ratios
14.7 Finding Synonyms and Analogies
14.7.1 Using Pre-Trained Word Vectors
14.7.2 Applying Pre-Trained Word Vectors
14.8 Text Classification and the Dataset
14.8.1 The Text Sentiment Classification Dataset
14.8.2 Putting All Things Together
14.9 Text Sentiment Classification: Using Recurrent Neural Networks
14.9.1 Using a Recurrent Neural Network Model
14.10 Text Sentiment Classification: Using Convolutional Neural Networks (textCNN)
14.10.1 One-Dimensional Convolutional Layer
14.10.2 Max-Over-Time Pooling Layer
14.10.3 The TextCNN Model

15 Recommender Systems
15.1 Overview of Recommender Systems
15.1.1 Collaborative Filtering
15.1.2 Explicit Feedback and Implicit Feedback
15.1.3 Recommendation Tasks
15.2 The MovieLens Dataset
15.2.1 Getting the Data
15.2.2 Statistics of the Dataset
15.2.3 Splitting the Dataset
15.2.4 Loading the Data
15.3 Matrix Factorization
15.3.1 The Matrix Factorization Model
15.3.2 Model Implementation
15.3.3 Evaluation Measures
15.3.4 Training and Evaluating the Model
15.4 AutoRec: Rating Prediction with Autoencoders
15.4.1 Model
15.4.2 Implementing the Model
15.4.3 Reimplementing the Evaluator
15.4.4 Training and Evaluating the Model
15.5 Personalized Ranking for Recommender Systems
15.5.1 Bayesian Personalized Ranking Loss and Its Implementation
15.5.2 Hinge Loss and Its Implementation
15.6 Neural Collaborative Filtering for Personalized Ranking
15.6.1 The NeuMF Model
15.6.2 Model Implementation
15.6.3 Negative Sampling
15.6.4 Evaluator
15.6.5 Training and Evaluating the Model
15.7 Sequence-Aware Recommender Systems
15.7.1 Model Architectures
15.7.2 Model Implementation
15.7.3 Sequential DataLoader
15.7.4 Loading the MovieLens 100K Dataset
15.7.5 Training the Model
15.8 Feature-Rich Recommender Systems
15.8.1 An Online Advertising Dataset
15.8.2 Dataset Wrapper
15.9 Factorization Machines
15.9.1 2-Way Factorization Machines
15.9.2 An Efficient Optimization Criterion
15.9.3 Model Implementation
15.9.4 Loading the Advertising Dataset
15.9.5 Training the Model
15.10 Deep Factorization Machines
15.10.1 Model Architectures
15.10.2 Implementation of DeepFM
15.10.3 Training and Evaluating the Model

16 Generative Adversarial Networks
16.1 Generative Adversarial Networks
16.1.1 Generating Some “Real” Data
16.1.2 Generator
16.1.3 Discriminator
16.1.4 Training
16.2 Deep Convolutional Generative Adversarial Networks
16.2.1 The Pokemon Dataset
16.2.2 The Generator
16.2.3 The Discriminator
16.2.4 Training

17 Appendix: Mathematics for Deep Learning
17.1 Geometry and Linear Algebraic Operations
17.1.1 Geometry of Vectors
17.1.2 Dot Products and Angles
17.1.3 Hyperplanes
17.1.4 Geometry of Linear Transformations
17.1.5 Linear Dependence
17.1.6 Rank
17.1.7 Invertibility
17.1.8 Determinant
17.1.9 Tensors and Common Linear Algebra Operations
17.2 Eigendecompositions
17.2.1 Finding Eigenvalues
17.2.2 Decomposing Matrices
17.2.3 Operations on Eigendecompositions
17.2.4 Eigendecompositions of Symmetric Matrices
17.2.5 Gershgorin Circle Theorem
17.2.6 A Useful Application: The Growth of Iterated Maps
17.2.7 Conclusions
17.3 Single Variable Calculus
17.3.1 Differential Calculus
17.3.2 Rules of Calculus
17.4 Multivariable Calculus
17.4.1 Higher-Dimensional Differentiation
17.4.2 Geometry of Gradients and Gradient Descent
17.4.3 A Note on Mathematical Optimization
17.4.4 Multivariate Chain Rule
17.4.5 The Backpropagation Algorithm
17.4.6 Hessians
17.4.7 A Little Matrix Calculus
17.5 Integral Calculus
17.5.1 Geometric Interpretation
17.5.2 The Fundamental Theorem of Calculus
17.5.3 Change of Variables
17.5.4 A Comment on Sign Conventions
17.5.5 Multiple Integrals
17.5.6 Change of Variables in Multiple Integrals
17.6 Random Variables
17.6.1 Continuous Random Variables
17.7 Maximum Likelihood
17.7.1 The Maximum Likelihood Principle
17.7.2 Numerical Optimization and the Negative Log-Likelihood
17.7.3 Maximum Likelihood for Continuous Variables
17.8 Distributions
17.8.1 Bernoulli
17.8.2 Discrete Uniform
17.8.3 Continuous Uniform
17.8.4 Binomial
17.8.5 Poisson
17.8.6 Gaussian
17.9 Naive Bayes
17.9.1 Optical Character Recognition
17.9.2 The Probabilistic Model for Classification
17.9.3 The Naive Bayes Classifier
17.9.4 Training
17.10 Statistics
17.10.1 Evaluating and Comparing Estimators
17.10.2 Conducting Hypothesis Tests
17.10.3 Constructing Confidence Intervals
17.11 Information Theory
17.11.1 Information
17.11.2 Entropy
17.11.3 Mutual Information
17.11.4 Kullback–Leibler Divergence
17.11.5 Cross Entropy

18 Appendix: Tools for Deep Learning
18.1 Using Jupyter
18.1.1 Editing and Running the Code Locally
18.1.2 Advanced Options
18.2 Using AWS Instances
18.2.1 Registering an Account and Logging In
18.2.2 Creating and Running an EC2 Instance
18.2.3 Installing CUDA
18.2.4 Installing MXNet and Downloading the D2L Notebooks
18.2.5 Running Jupyter
18.2.6 Closing Unused Instances
18.3 Selecting Servers and GPUs
18.3.1 Selecting Servers
18.3.2 Selecting GPUs
18.4 Contributing to This Book
18.4.1 From Reader to Contributor in 6 Steps
18.5 d2l API Document

Bibliography

Preface

Just a few years ago, there were no legions of deep learning scientists developing intelligent prod-
ucts and services at major companies and startups. When the youngest among us (the authors)
entered the field, machine learning did not command headlines in daily newspapers. Our parents
had no idea what machine learning was, let alone why we might prefer it to a career in medicine or
law. Machine learning was a forward-looking academic discipline with a narrow set of real-world
applications. And those applications, e.g., speech recognition and computer vision, required so
much domain knowledge that they were often regarded as separate areas entirely for which ma-
chine learning was one small component. Neural networks then, the antecedents of the deep
learning models that we focus on in this book, were regarded as outmoded tools.
In just the past five years, deep learning has taken the world by surprise, driving rapid progress
in fields as diverse as computer vision, natural language processing, automatic speech recogni-
tion, reinforcement learning, and statistical modeling. With these advances in hand, we can now
build cars that drive themselves with more autonomy than ever before (and less autonomy than
some companies might have you believe), smart reply systems that automatically draft the most
mundane emails, helping people dig out from oppressively large inboxes, and software agents that
dominate the worldʼs best humans at board games like Go, a feat once thought to be decades away.
Already, these tools exert ever-wider impacts on industry and society, changing the way movies
are made and diseases are diagnosed, and playing a growing role in basic sciences—from astrophysics
to biology.

About This Book

This book represents our attempt to make deep learning approachable, teaching you the
concepts, the context, and the code.

One Medium Combining Code, Math, and HTML

For any computing technology to reach its full impact, it must be well-understood, well-
documented, and supported by mature, well-maintained tools. The key ideas should be clearly
distilled, minimizing the onboarding time needed to bring new practitioners up to date. Mature
libraries should automate common tasks, and exemplar code should make it easy for practitioners
to modify, apply, and extend common applications to suit their needs. Take dynamic web appli-
cations as an example. Despite a large number of companies, like Amazon, developing successful
database-driven web applications in the 1990s, the potential of this technology to aid creative en-
trepreneurs has been realized to a far greater degree in the past ten years, owing in part to the
development of powerful, well-documented frameworks.

Testing the potential of deep learning presents unique challenges because any single application
brings together various disciplines. Applying deep learning requires simultaneously understand-
ing (i) the motivations for casting a problem in a particular way; (ii) the mathematics of a given
modeling approach; (iii) the optimization algorithms for fitting the models to data; and (iv) the
engineering required to train models efficiently, navigating the pitfalls of numerical computing
and getting the most out of available hardware. Teaching both the critical thinking skills required
to formulate problems, the mathematics to solve them, and the software tools to implement those
solutions all in one place presents formidable challenges. Our goal in this book is to present a
unified resource to bring would-be practitioners up to speed.
We started this book project in July 2017 when we needed to explain MXNetʼs (then new) Gluon in-
terface to our users. At the time, there were no resources that simultaneously (i) were up to date;
(ii) covered the full breadth of modern machine learning with substantial technical depth; and
(iii) interleaved exposition of the quality one expects from an engaging textbook with the clean
runnable code that one expects to find in hands-on tutorials. We found plenty of code exam-
ples for how to use a given deep learning framework (e.g., how to do basic numerical computing
with matrices in TensorFlow) or for implementing particular techniques (e.g., code snippets for
LeNet, AlexNet, ResNets, etc.) scattered across various blog posts and GitHub repositories. How-
ever, these examples typically focused on how to implement a given approach, but left out the
discussion of why certain algorithmic decisions are made. While some interactive resources have
popped up sporadically to address a particular topic, e.g., the engaging blog posts published on
the website Distill, or personal blogs, they only covered selected topics in deep learning, and
often lacked associated code. On the other hand, while several textbooks have emerged, most no-
tably (Goodfellow et al., 2016), which offers a comprehensive survey of the concepts behind deep
learning, these resources do not marry the descriptions to realizations of the concepts in code,
sometimes leaving readers clueless as to how to implement them. Moreover, too many resources
are hidden behind the paywalls of commercial course providers.
We set out to create a resource that could (1) be freely available for everyone; (2) offer sufficient
technical depth to provide a starting point on the path to actually becoming an applied machine
learning scientist; (3) include runnable code, showing readers how to solve problems in practice;
(4) allow for rapid updates, both by us and also by the community at large; and (5) be com-
plemented by a forum for interactive discussion of technical details and to answer questions.
These goals were often in conflict. Equations, theorems, and citations are best managed and laid
out in LaTeX. Code is best described in Python. And webpages are native in HTML and JavaScript.
Furthermore, we want the content to be accessible as executable code, as a physical book,
as a downloadable PDF, and on the internet as a website. At present there exist no tools and no
workflow perfectly suited to these demands, so we had to assemble our own. We describe our
approach in detail in Section 18.4. We settled on GitHub to share the source and to allow for edits,
Jupyter notebooks for mixing code, equations and text, Sphinx as a rendering engine to generate
multiple outputs, and Discourse for the forum. While our system is not yet perfect, these choices
provide a good compromise among the competing concerns. We believe that this might be the
first book published using such an integrated workflow.

Learning by Doing

Many textbooks teach a series of topics, each in exhaustive detail. For example, Chris Bishopʼs
excellent textbook (Bishop, 2006) teaches each topic so thoroughly that getting to the chapter on
linear regression requires a non-trivial amount of work. While experts love this book precisely
for its thoroughness, for beginners, this property limits its usefulness as an introductory text.
In this book, we will teach most concepts just in time. In other words, you will learn concepts at the
very moment that they are needed to accomplish some practical end. While we take some time at
the outset to teach fundamental preliminaries, like linear algebra and probability, we want you to
taste the satisfaction of training your first model before worrying about more esoteric probability
distributions.
Aside from a few preliminary notebooks that provide a crash course in the basic mathematical
background, each subsequent chapter introduces both a reasonable number of new concepts and
provides single self-contained working examples—using real datasets. This presents an organi-
zational challenge. Some models might logically be grouped together in a single notebook. And
some ideas might be best taught by executing several models in succession. On the other hand,
there is a big advantage to adhering to a policy of 1 working example, 1 notebook: This makes it as
easy as possible for you to start your own research projects by leveraging our code. Just copy a
notebook and start modifying it.
We will interleave the runnable code with background material as needed. In general, we will
often err on the side of making tools available before explaining them fully (and we will follow up
by explaining the background later). For instance, we might use stochastic gradient descent before
fully explaining why it is useful or why it works. This helps to give practitioners the necessary
ammunition to solve problems quickly, at the expense of requiring the reader to trust us with
some curatorial decisions.
Throughout, we will be working with the MXNet library, which has the rare property of being
flexible enough for research while being fast enough for production. This book will teach deep
learning concepts from scratch. Sometimes, we want to delve into fine details about the models
that would typically be hidden from the user by Gluonʼs advanced abstractions. This comes up
especially in the basic tutorials, where we want you to understand everything that happens in a
given layer or optimizer. In these cases, we will often present two versions of the example: one
where we implement everything from scratch, relying only on the NumPy interface and auto-
matic differentiation, and another, more practical example, where we write succinct code using
Gluon. Once we have taught you how some component works, we can just use the Gluon version
in subsequent tutorials.

Content and Structure

The book can be roughly divided into three parts, which are presented by different colors in Fig.
1:

Fig. 1: Book structure

• The first part covers basics and preliminaries. Chapter 1 offers an introduction to deep learn-
ing. Then, in Chapter 2, we quickly bring you up to speed on the prerequisites required for
hands-on deep learning, such as how to store and manipulate data, and how to apply various
numerical operations based on basic concepts from linear algebra, calculus, and probabil-
ity. Chapter 3 and Chapter 4 cover the most basic concepts and techniques of deep learning,
such as linear regression, multilayer perceptrons, and regularization.
• The next five chapters focus on modern deep learning techniques. Chapter 5 describes the
various key components of deep learning calculations and lays the groundwork for us to
subsequently implement more complex models. Next, in Chapter 6 and Chapter 7, we intro-
duce convolutional neural networks (CNNs), powerful tools that form the backbone of most
modern computer vision systems. Subsequently, in Chapter 8 and Chapter 9, we introduce
recurrent neural networks (RNNs), models that exploit temporal or sequential structure in
data, and are commonly used for natural language processing and time series prediction.
In Chapter 10, we introduce a new class of models that employ attention mechanisms,
models that have recently begun to displace RNNs in natural language processing.
These sections will get you up to speed on the basic tools behind most modern applications
of deep learning.
• Part three discusses scalability, efficiency, and applications. First, in Chapter 11, we dis-
cuss several common optimization algorithms used to train deep learning models. The next
chapter, Chapter 12, examines several key factors that influence the computational perfor-
mance of your deep learning code. In Chapter 13 and Chapter 14, we illustrate major appli-
cations of deep learning in computer vision and natural language processing, respectively.

Code

Most sections of this book feature executable code because of our belief in the importance of an
interactive learning experience in deep learning. At present, certain intuitions can only be devel-
oped through trial and error, tweaking the code in small ways and observing the results. Ideally,
an elegant mathematical theory might tell us precisely how to tweak our code to achieve a desired
result. Unfortunately, at present, such elegant theories elude us. Despite our best attempts, for-
mal explanations for various techniques are still lacking, both because the mathematics to char-
acterize these models can be so difficult and also because serious inquiry on these topics has only
just recently kicked into high gear. We are hopeful that as the theory of deep learning progresses,
future editions of this book will be able to provide insights in places the present edition cannot.
Most of the code in this book is based on Apache MXNet. MXNet is an open-source framework for
deep learning and the preferred choice of AWS (Amazon Web Services), as well as many colleges
and companies. All of the code in this book has passed tests under the newest MXNet version.
However, due to the rapid development of deep learning, some code in the print edition may not
work properly in future versions of MXNet. We plan, however, to keep the online version
up-to-date. In case you encounter any such problems, please consult Installation (page 9) to update
your code and runtime environment.
At times, to avoid unnecessary repetition, we encapsulate the frequently-imported and referred-
to functions, classes, etc. in this book in the d2l package. For any block such as a function,
a class, or multiple imports to be saved in the package, we will mark it with # Saved in the d2l
package for later use. The d2l package is light-weight and only requires the following packages
and modules as dependencies:

# Saved in the d2l package for later use


import collections
from collections import defaultdict
from IPython import display
import math
from matplotlib import pyplot as plt
from mxnet import autograd, context, gluon, image, init, np, npx
from mxnet.gluon import nn, rnn
import os
import pandas as pd
import random
import re
import sys
import tarfile
import time
import zipfile

We offer a detailed overview of these functions and classes in Section 18.5.

Target Audience

This book is for students (undergraduate or graduate), engineers, and researchers, who seek a
solid grasp of the practical techniques of deep learning. Because we explain every concept from
scratch, no previous background in deep learning or machine learning is required. Fully explain-
ing the methods of deep learning requires some mathematics and programming, but we will only
assume that you come in with some basics, including (the very basics of) linear algebra, calculus,
probability, and Python programming. Moreover, in the Appendix, we provide a refresher
on most of the mathematics covered in this book. Most of the time, we will prioritize intuition
and ideas over mathematical rigor. There are many terrific books which can lead the interested
reader further. For instance, Linear Analysis by Bela Bollobas (Bollobas, 1999) covers linear alge-
bra and functional analysis in great depth. All of Statistics (Wasserman, 2013) is a terrific guide to
statistics. And if you have not used Python before, you may want to peruse this Python tutorial.

Forum

Associated with this book, we have launched a discussion forum, located at [Link].io.
When you have questions on any section of the book, you can find the associated discussion page
by scanning the QR code at the end of the section to participate in its discussions. The authors of
this book and the broader MXNet developer community frequently participate in forum discussions.

Acknowledgments

We are indebted to the hundreds of contributors for both the English and the Chinese drafts. They
helped improve the content and offered valuable feedback. Specifically, we thank every contrib-
utor of this English draft for making it better for everyone. Their GitHub IDs or names are (in
no particular order): alxnorden, avinashingit, bowen0701, brettkoonce, Chaitanya Prakash Ba-
pat, cryptonaut, Davide Fiocco, edgarroman, gkutiel, John Mitro, Liang Pu, Rahul Agarwal, Mo-
hamed Ali Jamaoui, Michael (Stu) Stewart, Mike Müller, NRauschmayr, Prakhar Srivastav, sad-,
sfermigier, Sheng Zha, sundeepteki, topecongiro, tpdi, vermicelli, Vishaal Kapoor, vishwesh5,
YaYaB, Yuhong Chen, Evgeniy Smirnov, lgov, Simon Corston-Oliver, IgorDzreyev, Ha Nguyen,
pmuens, alukovenko, senorcinco, vfdev-5, dsweet, Mohammad Mahdi Rahimi, Abhishek Gupta,
uwsd, DomKM, Lisa Oakley, Bowen Li, Aarush Ahuja, prasanth5reddy, brianhendee, mani2106,
mtn, lkevinzc, caojilin, Lakshya, Fiete Lüer, Surbhi Vijayvargeeya, Muhyun Kim, dennismalm-
gren, adursun, Anirudh Dagar, liqingnz, Pedro Larroy, lgov, ati-ozgur, Jun Wu, Matthias Blume,
Lin Yuan, geogunow, Josh Gardner, Maximilian Böther, Rakib Islam, Leonard Lausen, Abhinav
Upadhyay, rongruosong, Steve Sedlmeyer, ruslo, Rafael Schlatter, liusy182, Giannis Pappas, ruslo,
ati-ozgur, qbaza, dchoi77, Adam Gerson. Notably, Brent Werness (Amazon) and Rachel Hu (Ama-
zon) co-authored the Mathematics for Deep Learning chapter in the Appendix with us and are the
major contributors to that chapter.
We thank Amazon Web Services, especially Swami Sivasubramanian, Raju Gulabani, Charlie Bell,
and Andrew Jassy for their generous support in writing this book. Without the available time,
resources, discussions with colleagues, and continuous encouragement this book would not have
happened.

Summary

• Deep learning has revolutionized pattern recognition, introducing technology that now
powers a wide range of applications, including computer vision, natural language processing,
and automatic speech recognition.
• To successfully apply deep learning, you must understand how to cast a problem, the math-
ematics of modeling, the algorithms for fitting your models to data, and the engineering
techniques to implement it all.
• This book presents a comprehensive resource, including prose, figures, mathematics, and
code, all in one place.
• To answer questions related to this book, visit our forum at [Link].
• Apache MXNet is a powerful library for coding up deep learning models and running them
in parallel across GPU cores.
• Gluon is a high level library that makes it easy to code up deep learning models using Apache
MXNet.
• Conda is a Python package manager that ensures that all software dependencies are met.
• All notebooks are available for download on GitHub and the conda configurations needed to
run this bookʼs code are expressed in the [Link] file.
• If you plan to run this code on GPUs, do not forget to install the necessary drivers and update
your configuration.

Exercises

1. Register an account on the discussion forum of this book, [Link].io.


2. Install Python on your computer.
3. Follow the links at the bottom of the section to the forum, where you will be able to seek out
help and discuss the book and find answers to your questions by engaging the authors and
broader community.
4. Create an account on the forum and introduce yourself.


Installation

In order to get you up and running for hands-on learning experience, we need to set you up with an
environment for running Python, Jupyter notebooks, the relevant libraries, and the code needed
to run the book itself.

Installing Miniconda

The simplest way to get going will be to install Miniconda. The Python 3.x version is recom-
mended. You can skip the following steps if conda has already been installed. Download the
corresponding Miniconda sh file from the website and then execute the installation from the com-
mand line using sh <FILENAME> -b. For macOS users:

# The file name is subject to changes


sh Miniconda3-latest-MacOSX-x86_64.sh -b

For Linux users:

# The file name is subject to changes


sh Miniconda3-latest-Linux-x86_64.sh -b

Next, initialize the shell so we can run conda directly.

~/miniconda3/bin/conda init

Now close and re-open your current shell. You should be able to create a new environment as
follows:

conda create --name d2l -y

Downloading the D2L Notebooks

Next, we need to download the code of this book. You can use the link to download and unzip
the code. Alternatively, if you have unzip (otherwise run sudo apt install unzip) available:

mkdir d2l-en && cd d2l-en
curl [Link] -o [Link]
unzip [Link] && rm [Link]

Now we will want to activate the d2l environment and install pip. Enter y for the queries that
follow this command.

conda activate d2l


conda install python=3.7 pip -y

Installing MXNet and the d2l Package

Before installing MXNet, please first check whether or not you have proper GPUs on your machine
(the GPUs that power the display on a standard laptop do not count for our purposes). If you are
installing on a GPU server, proceed to GPU Support (page 11) for instructions to install a GPU-
supported MXNet.
Otherwise, you can install the CPU version. That will be more than enough horsepower to get you
through the first few chapters but you will want to access GPUs before running larger models.

# For Windows users


pip install mxnet==1.6.0b20190926

# For Linux and macOS users


pip install mxnet==1.6.0b20191122

We also install the d2l package that encapsulates frequently used functions and classes in this
book.

pip install d2l==0.11.0

Once they are installed, we can open the Jupyter notebook by running:

jupyter notebook

At this point, you can open [Link] (it usually opens automatically) in your Web
browser. Then we can run the code for each section of the book. Please always execute conda
activate d2l to activate the runtime environment before running the code of the book or updat-
ing MXNet or the d2l package. To exit the environment, run conda deactivate.

Upgrading to a New Version

Both this book and MXNet keep improving. Please check for a new version from time to time.
1. The URL [Link] always points to the latest contents.
2. Please upgrade the d2l package by pip install d2l --upgrade.
3. For the CPU version, MXNet can be upgraded by pip install -U --pre mxnet.

GPU Support

By default, MXNet is installed without GPU support to ensure that it will run on any computer
(including most laptops). Part of this book requires or recommends running with GPU. If your
computer has NVIDIA graphics cards and has installed CUDA, then you should install a GPU-
enabled MXNet. If you have installed the CPU-only version, you may need to remove it first by
running:

pip uninstall mxnet

Then we need to find the CUDA version you installed. You may check it through nvcc --version
or cat /usr/local/cuda/[Link]. Assuming that you have installed CUDA 10.1, you can
install MXNet with the following command:

# For Windows users


pip install mxnet-cu101==1.6.0b20190926

# For Linux and macOS users


pip install mxnet-cu101==1.6.0b20191122

Like the CPU version, the GPU-enabled MXNet can be upgraded by pip install -U --pre mxnet-cu101.
You may change the last digits according to your CUDA version, e.g., cu100 for CUDA 10.0
and cu90 for CUDA 9.0. You can find all available MXNet versions via pip search mxnet.

Exercises

1. Download the code for the book and install the runtime environment.


1 | Introduction

Until recently, nearly every computer program that we interacted with daily was coded by software
developers from first principles. Say that we wanted to write an application to manage an e-
commerce platform. After huddling around a whiteboard for a few hours to ponder the prob-
lem, we would come up with the broad strokes of a working solution that would probably look
something like this: (i) users interact with the application through an interface running in a web
browser or mobile application; (ii) our application interacts with a commercial-grade database
engine to keep track of each userʼs state and maintain records of historical transactions; and (iii)
at the heart of our application, the business logic (you might say, the brains) of our application spells
out in methodical detail the appropriate action that our program should take in every conceivable
circumstance.
To build the brains of our application, weʼd have to step through every possible corner case that
we anticipate encountering, devising appropriate rules. Each time a customer clicks to add an
item to their shopping cart, we add an entry to the shopping cart database table, associating that
userʼs ID with the requested productʼs ID. While few developers ever get it completely right the
first time (it might take some test runs to work out the kinks), for the most part, we could write
such a program from first principles and confidently launch it before ever seeing a real customer.
Our ability to design automated systems from first principles that drive functioning products and
systems, often in novel situations, is a remarkable cognitive feat. And when you are able to devise
solutions that work 100% of the time, you should not be using machine learning.
Fortunately for the growing community of ML scientists, many tasks that we would like to auto-
mate do not bend so easily to human ingenuity. Imagine huddling around the whiteboard with
the smartest minds you know, but this time you are tackling one of the following problems:
• Write a program that predicts tomorrowʼs weather given geographic information, satellite
images, and a trailing window of past weather.
• Write a program that takes in a question, expressed in free-form text, and answers it cor-
rectly.
• Write a program that given an image can identify all the people it contains, drawing outlines
around each.
• Write a program that presents users with products that they are likely to enjoy but unlikely,
in the natural course of browsing, to encounter.
In each of these cases, even elite programmers are incapable of coding up solutions from scratch.
The reasons for this can vary. Sometimes the program that we are looking for follows a pattern
that changes over time, and we need our programs to adapt. In other cases, the relationship (say
between pixels, and abstract categories) may be too complicated, requiring thousands or millions
of computations that are beyond our conscious understanding (even if our eyes manage the task

effortlessly). Machine learning (ML) is the study of powerful techniques that can learn from expe-
rience. As an ML algorithm accumulates more experience, typically in the form of observational data
or interactions with an environment, its performance improves. Contrast this with our deter-
ministic e-commerce platform, which performs according to the same business logic, no matter
how much experience accrues, until the developers themselves learn and decide that it is time to
update the software. In this book, we will teach you the fundamentals of machine learning, and
focus in particular on deep learning, a powerful set of techniques driving innovations in areas as
diverse as computer vision, natural language processing, healthcare, and genomics.

1.1 A Motivating Example

Before we could begin writing, the authors of this book, like much of the work force, had to be-
come caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out
“Hey Siri”, awakening the phoneʼs voice recognition system. Then Mu commanded “directions to
Blue Bottle coffee shop”. The phone quickly displayed the transcription of his command. It also
recognized that we were asking for directions and launched the Maps application to fulfill our re-
quest. Once launched, the Maps app identified a number of routes. Next to each route, the phone
displayed a predicted transit time. While we fabricated this story for pedagogical convenience, it
demonstrates that in the span of just a few seconds, our everyday interactions with a smart phone
can engage several machine learning models.
Imagine just writing a program to respond to a wake word like “Alexa”, “Okay, Google” or “Siri”. Try
coding it up in a room by yourself with nothing but a computer and a code editor, as illustrated
in Fig. 1.1.1. How would you write such a program from first principles? Think about it… the
problem is hard. Every second, the microphone will collect roughly 44,000 samples. Each sample
is a measurement of the amplitude of the sound wave. What rule could map reliably from a snippet
of raw audio to confident predictions {yes, no} on whether the snippet contains the wake word?
If you are stuck, do not worry. We do not know how to write such a program from scratch either.
That is why we use ML.

Fig. 1.1.1: Identify a wake word.

Hereʼs the trick. Often, even when we do not know how to tell a computer explicitly how to map
from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In
other words, even if you do not know how to program a computer to recognize the word “Alexa”,
you yourself are able to recognize the word “Alexa”. Armed with this ability, we can collect a huge
dataset containing examples of audio and label those that do and that do not contain the wake word.
In the ML approach, we do not attempt to design a system explicitly to recognize wake words.
Instead, we define a flexible program whose behavior is determined by a number of parameters.
Then we use the dataset to determine the best possible set of parameters, those that improve the
performance of our program with respect to some measure of performance on the task of interest.
You can think of the parameters as knobs that we can turn, manipulating the behavior of the pro-
gram. Fixing the parameters, we call the program a model. The set of all distinct programs (input-
output mappings) that we can produce just by manipulating the parameters is called a family of

models. And the meta-program that uses our dataset to choose the parameters is called a learning
algorithm.
Before we can go ahead and engage the learning algorithm, we have to define the problem pre-
cisely, pinning down the exact nature of the inputs and outputs, and choosing an appropriate
model family. In this case, our model receives a snippet of audio as input, and it generates a se-
lection among {yes, no} as output. If all goes according to plan, the modelʼs guesses will typically
be correct as to whether (or not) the snippet contains the wake word.
If we choose the right family of models, then there should exist one setting of the knobs such
that the model fires yes every time it hears the word “Alexa”. Because the exact choice of the
wake word is arbitrary, we will probably need a model family sufficiently rich that, via another
setting of the knobs, it could fire yes only upon hearing the word “Apricot”. We expect that the
same model family should be suitable for “Alexa” recognition and “Apricot” recognition because
they seem, intuitively, to be similar tasks. However, we might need a different family of models
entirely if we want to deal with fundamentally different inputs or outputs, say if we wanted to map
from images to captions, or from English sentences to Chinese sentences.
As you might guess, if we just set all of the knobs randomly, it is not likely that our model will rec-
ognize “Alexa”, “Apricot”, or any other English word. In deep learning, the learning is the process
by which we discover the right setting of the knobs coercing the desired behavior from our model.
As shown in Fig. 1.1.2, the training process usually looks like this:
1. Start off with a randomly initialized model that cannot do anything useful.
2. Grab some of your labeled data (e.g., audio snippets and corresponding {yes, no} labels).
3. Tweak the knobs so the model sucks less with respect to those examples.
4. Repeat until the model is awesome.

Fig. 1.1.2: A typical training process.

To summarize, rather than code up a wake word recognizer, we code up a program that can learn
to recognize wake words, if we present it with a large labeled dataset. You can think of this act of
determining a programʼs behavior by presenting it with a dataset as programming with data. We
can “program” a cat detector by providing our machine learning system with many examples of
cats and dogs, such as the images below:

(Four example images: two labeled cat, two labeled dog.)

This way the detector will eventually learn to emit a very large positive number if it is a cat, a very
large negative number if it is a dog, and something closer to zero if it is not sure. And this barely
scratches the surface of what ML can do.
Deep learning is just one among many popular methods for solving machine learning problems.
Thus far, we have only talked about machine learning broadly and not deep learning. To see why
deep learning is important, we should pause for a moment to highlight a couple of crucial points.
First, the problems that we have discussed thus far—learning from raw audio signal, the raw pixel
values of images, or mapping between sentences of arbitrary lengths and their counterparts in
foreign languages—are problems where deep learning excels and where traditional ML methods
faltered. Deep models are deep in precisely the sense that they learn many layers of computation.
It turns out that these many-layered (or hierarchical) models are capable of addressing low-level
perceptual data in a way that previous tools could not. In bygone days, the crucial part of applying
ML to these problems consisted of coming up with manually-engineered ways of transforming the
data into some form amenable to shallow models. One key advantage of deep learning is that it
replaces not only the shallow models at the end of traditional learning pipelines, but also the labor-
intensive process of feature engineering. Second, by replacing much of the domain-specific prepro-
cessing, deep learning has eliminated many of the boundaries that previously separated computer
vision, speech recognition, natural language processing, medical informatics, and other applica-
tion areas, offering a unified set of tools for tackling diverse problems.

1.2 The Key Components: Data, Models, and Algorithms

In our wake-word example, we described a dataset consisting of audio snippets and binary labels,
and gave a hand-wavy sense of how we might train a model to approximate a mapping from snippets
to classifications. This sort of problem, where we try to predict a designated unknown label given
known inputs and a dataset of examples for which the labels are known, is called
supervised learning, and it is just one among many kinds of machine learning problems. In the
next section, we will take a deep dive into the different ML problems. First, weʼd like to shed more
light on some core components that will follow us around, no matter what kind of ML problem
we take on:
1. The data that we can learn from.
2. A model of how to transform the data.

3. A loss function that quantifies the badness of our model.
4. An algorithm to adjust the modelʼs parameters to minimize the loss.

1.2.1 Data

It might go without saying that you cannot do data science without data. We could lose hundreds
of pages pondering what precisely constitutes data, but for now we will err on the practical side
and focus on the key properties to be concerned with. Generally we are concerned with a collec-
tion of examples (also called data points, samples, or instances). In order to work with data usefully,
we typically need to come up with a suitable numerical representation. Each example typically
consists of a collection of numerical attributes called features. In the supervised learning prob-
lems above, a special feature is designated as the prediction target, (sometimes called the label or
dependent variable). The given features from which the model must make its predictions can then
simply be called the features, (or often, the inputs, covariates, or independent variables).
If we were working with image data, each individual photograph might constitute an example,
each represented by an ordered list of numerical values corresponding to the brightness of each
pixel. A 200 × 200 color photograph would consist of 200 × 200 × 3 = 120000 numerical values,
corresponding to the brightness of the red, green, and blue channels for each spatial location.
In a more traditional task, we might try to predict whether or not a patient will survive, given a
standard set of features such as age, vital signs, diagnoses, etc.
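
To make the photograph example concrete, here is a quick check, sketched in plain NumPy (the book itself works with MXNetʼs ndarray, introduced in Chapter 2):

import numpy as np

# A 200 x 200 color photograph: one brightness value per pixel per
# color channel (red, green, blue).
photo = np.zeros((200, 200, 3))
print(photo.size)  # 200 * 200 * 3 = 120000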
When every example is characterized by the same number of numerical values, we say that the
data consists of fixed-length vectors and we describe the (constant) length of the vectors as the
dimensionality of the data. As you might imagine, fixed length can be a convenient property. If we
wanted to train a model to recognize cancer in microscopy images, fixed-length inputs means we
have one less thing to worry about.
However, not all data can easily be represented as fixed length vectors. While we might expect
microscope images to come from standard equipment, we cannot expect images mined from the
Internet to all show up with the same resolution or shape. For images, we might consider crop-
ping them all to a standard size, but that strategy only gets us so far. We risk losing information
in the cropped out portions. Moreover, text data resists fixed-length representations even more
stubbornly. Consider the customer reviews left on e-commerce sites like Amazon, IMDB, or Tri-
pAdvisor. Some are short: “it stinks!”. Others ramble for pages. One major advantage of deep
learning over traditional methods is the comparative grace with which modern models can han-
dle varying-length data.
Generally, the more data we have, the easier our job becomes. When we have more data, we can
train more powerful models, and rely less heavily on pre-conceived assumptions. The regime
change from (comparatively small) to big data is a major contributor to the success of modern
deep learning. To drive the point home, many of the most exciting models in deep learning
do not work without large datasets; others work in the low-data regime, but no better than
traditional approaches.
Finally, it is not enough to have lots of data and to process it cleverly. We need the right data. If
the data is full of mistakes, or if the chosen features are not predictive of the target quantity of
interest, learning is going to fail. The situation is captured well by the cliché: garbage in, garbage
out. Moreover, poor predictive performance is not the only potential consequence. In sensitive
applications of machine learning, like predictive policing, resumé screening, and risk models
used for lending, we must be especially alert to the consequences of garbage data. One common
failure mode occurs in datasets where some groups of people are unrepresented in the training
data. Imagine applying a skin cancer recognition system in the wild that had never seen black
skin before. Failure can also occur when the data does not merely under-represent some groups,
but reflects societal prejudices. For example if past hiring decisions are used to train a predictive
model that will be used to screen resumes, then machine learning models could inadvertently
capture and automate historical injustices. Note that this can all happen without the data scientist
actively conspiring, or even being aware.

1.2.2 Models

Most machine learning involves transforming the data in some sense. We might want to build a
system that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of
sensor readings and predict how normal vs anomalous the readings are. By model, we denote the
computational machinery for ingesting data of one type, and spitting out predictions of a possibly
different type. In particular, we are interested in statistical models that can be estimated from
data. While simple models are perfectly capable of addressing appropriately simple problems, the
problems that we focus on in this book stretch the limits of classical methods. Deep learning is
differentiated from classical approaches principally by the set of powerful models that it focuses
on. These models consist of many successive transformations of the data that are chained together
top to bottom, thus the name deep learning. On our way to discussing deep neural networks, we
will discuss some more traditional methods.

1.2.3 Objective functions

Earlier, we introduced machine learning as “learning from experience”. By learning here, we mean
improving at some task over time. But who is to say what constitutes an improvement? You might
imagine that we could propose to update our model, and some people might disagree on whether
the proposed update constituted an improvement or a decline.
In order to develop a formal mathematical system of learning machines, we need to have formal
measures of how good (or bad) our models are. In machine learning, and optimization more
generally, we call these objective functions. By convention, we usually define objective functions
so that lower is better. This is merely a convention. You can take any function f for which higher is
better, and turn it into a new function f ′ that is qualitatively identical but for which lower is better
by setting f ′ = −f . Because lower is better, these functions are sometimes called loss functions or
cost functions.
When trying to predict numerical values, the most common objective function is squared error
(y−ŷ)2 . For classification, the most common objective is to minimize error rate, i.e., the fraction of
instances on which our predictions disagree with the ground truth. Some objectives (like squared
error) are easy to optimize. Others (like error rate) are difficult to optimize directly, owing to
non-differentiability or other complications. In these cases, it is common to optimize a surrogate
objective.
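
As a minimal sketch of the two objectives just mentioned (in plain NumPy, with illustrative values of our own):

import numpy as np

# Squared error for regression: average of (y - y_hat)^2.
y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.1])
squared_error = ((y - y_hat) ** 2).mean()

# Error rate for classification: fraction of wrong predictions.
labels = np.array([0, 1, 0, 0])
labels_hat = np.array([0, 1, 1, 0])
error_rate = (labels != labels_hat).mean()

print(squared_error, error_rate)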
Typically, the loss function is defined with respect to the modelʼs parameters and depends upon the
dataset. The best values of our modelʼs parameters are learned by minimizing the loss incurred on
a training set consisting of some number of examples collected for training. However, doing well on
the training data does not guarantee that we will do well on (unseen) test data. So we will typically
want to split the available data into two partitions: the training data (for fitting model parameters)
and the test data (which is held out for evaluation), reporting the following two quantities:

• Training Error: The error on that data on which the model was trained. You could think of
this as being like a studentʼs scores on practice exams used to prepare for some real exam.
Even if the results are encouraging, that does not guarantee success on the final exam.
• Test Error: This is the error incurred on an unseen test set. This can deviate significantly
from the training error. When a model performs well on the training data but fails to gen-
eralize to unseen data, we say that it is overfitting. In real-life terms, this is like flunking the
real exam despite doing well on practice exams.
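
To make the train/test split concrete, here is a minimal sketch; the 80/20 ratio and the synthetic data are illustrative choices of ours, not a prescription:

import numpy as np

X = np.random.rand(100, 5)  # 100 examples with 5 features each
y = np.random.rand(100)     # one target per example

indices = np.random.permutation(len(X))  # shuffle before splitting
split = int(0.8 * len(X))                # 80% training, 20% test
X_train, y_train = X[indices[:split]], y[indices[:split]]
X_test, y_test = X[indices[split:]], y[indices[split:]]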

1.2.4 Optimization algorithms

Once we have got some data source and representation, a model, and a well-defined objective
function, we need an algorithm capable of searching for the best possible parameters for mini-
mizing the loss function. The most popular optimization algorithms for neural networks follow
an approach called gradient descent. In short, at each step, they check to see, for each parameter,
which way the training set loss would move if you perturbed that parameter just a small amount.
They then update the parameter in the direction that reduces the loss.
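
The following bare-bones sketch illustrates this idea for a single parameter. It estimates the gradient with a finite difference rather than the automatic differentiation used throughout the book, and the toy loss function is ours:

def loss(w):
    return (w - 3.0) ** 2  # a toy loss, minimized at w = 3

w, lr, eps = 0.0, 0.1, 1e-6
for _ in range(100):
    # Check how the loss would move if w were perturbed slightly.
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    # Update w in the direction that reduces the loss.
    w -= lr * grad
print(w)  # close to 3.0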

1.3 Kinds of Machine Learning

In the following sections, we discuss a few kinds of machine learning problems in greater detail.
We begin with a list of objectives, i.e., a list of things that we would like machine learning to do.
Note that the objectives are complemented with a set of techniques of how to accomplish them,
including types of data, models, training techniques, etc. The list below is just a sampling of the
problems ML can tackle to motivate the reader and provide us with some common language for
when we talk about more problems throughout the book.

1.3.1 Supervised learning

Supervised learning addresses the task of predicting targets given inputs. The targets, which we
often call labels, are generally denoted by y. The input data, also called the features or covariates,
are typically denoted x. Each (input, target) pair is called an example or an instance. Sometimes,
when the context is clear, we may use the term examples to refer to a collection of inputs, even
when the corresponding targets are unknown. We denote any particular instance with a subscript,
typically i, for instance (xi , yi ). A dataset is a collection of n instances {(xi , yi )}, i = 1, . . . , n. Our goal is to
produce a model fθ that maps any input xi to a prediction fθ (xi ).
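
In code, such a model is just a parameterized function; a minimal sketch with a linear fθ of our own choosing:

import numpy as np

theta = np.array([0.5, -0.2, 0.1])  # the parameters

def f_theta(x):
    # A linear model: the prediction is a weighted sum of the features.
    return np.dot(theta, x)

x_i = np.array([1.0, 2.0, 3.0])  # one example with three features
print(f_theta(x_i))              # the prediction for x_i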
To ground this description in a concrete example, if we were working in healthcare, then we might
want to predict whether or not a patient would have a heart attack. This observation, heart attack
or no heart attack, would be our label y. The input data x might be vital signs such as heart rate,
diastolic and systolic blood pressure, etc.
The supervision comes into play because for choosing the parameters θ, we (the supervisors)
provide the model with a dataset consisting of labeled examples (xi , yi ), where each example xi
is matched with the correct label.
In probabilistic terms, we typically are interested in estimating the conditional probability P (y|x).
While it is just one among several paradigms within machine learning, supervised learning ac-
counts for the majority of successful applications of machine learning in industry. Partly, that is

1.3. Kinds of Machine Learning 19


because many important tasks can be described crisply as estimating the probability of something
unknown given a particular set of available data:
• Predict cancer vs not cancer, given a CT image.
• Predict the correct translation in French, given a sentence in English.
• Predict the price of a stock next month based on this monthʼs financial reporting data.
Even with the simple description “predict targets from inputs”, supervised learning can take a great
many forms and require a great many modeling decisions, depending on (among other considera-
tions) the type, size, and the number of inputs and outputs. For example, we use different models
to process sequences (like strings of text or time series data) and for processing fixed-length vec-
tor representations. We will visit many of these problems in depth throughout the first 9 parts of
this book.
Informally, the learning process looks something like this: Grab a big collection of examples for
which the covariates are known and select from them a random subset, acquiring the ground truth
labels for each. Sometimes these labels might be available data that has already been collected
(e.g., did a patient die within the following year?) and other times we might need to employ human
annotators to label the data, (e.g., assigning images to categories).
Together, these inputs and corresponding labels comprise the training set. We feed the training
dataset into a supervised learning algorithm, a function that takes as input a dataset and outputs
another function, the learned model. Finally, we can feed previously unseen inputs to the learned
model, using its outputs as predictions of the corresponding label. The full process is drawn in
Fig. 1.3.1.

Fig. 1.3.1: Supervised learning.

Regression

Perhaps the simplest supervised learning task to wrap your head around is regression. Consider,
for example a set of data harvested from a database of home sales. We might construct a table,
where each row corresponds to a different house, and each column corresponds to some relevant
attribute, such as the square footage of a house, the number of bedrooms, the number of bath-
rooms, and the number of minutes (walking) to the center of town. In this dataset each example
would be a specific house, and the corresponding feature vector would be one row in the table.
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or
Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vector
for your home might look something like: [100, 0, .5, 60]. However, if you live in Pittsburgh, it
might look more like [3000, 4, 3, 10]. Feature vectors like this are essential for most classic machine
learning algorithms. We will continue to denote the feature vector corresponding to any example i
as xi and we can compactly refer to the full table containing all of the feature vectors as X.

What makes a problem a regression is actually the outputs. Say that you are in the market for a new
home. You might want to estimate the fair market value of a house, given some features like these.
The target value, the price of sale, is a real number. If you remember the formal definition of the
reals you might be scratching your head now. Homes probably never sell for fractions of a cent,
let alone prices expressed as irrational numbers. In cases like this, when the target is actually
discrete, but where the rounding takes place on a sufficiently fine scale, we will abuse language
just a bit and continue to describe our outputs and targets as real-valued numbers.
We denote any individual target yi (corresponding to example xi ) and the set of all targets y (cor-
responding to all examples X). When our targets take on arbitrary values in some range, we call
this a regression problem. Our goal is to produce a model whose predictions closely approximate
the actual target values. We denote the predicted target for any instance as ŷi . Do not worry if the
notation is bogging you down. We will unpack it more thoroughly in the subsequent chapters.
Lots of practical problems are well-described regression problems. Predicting the rating that a
user will assign to a movie can be thought of as a regression problem and if you designed a great
algorithm to accomplish this feat in 2009, you might have won the 1-million-dollar Netflix prize.
Predicting the length of stay for patients in the hospital is also a regression problem. A good rule
of thumb is that any How much? or How many? problem should suggest regression.
• “How many hours will this surgery take?”: regression.
• “How many dogs are in this photo?”: regression.
However, if you can easily pose your problem as “Is this a _ ?”, then it is likely classification, a
different kind of supervised problem that we will cover next. Even if you have never worked with
machine learning before, you have probably worked through a regression problem informally.
Imagine, for example, that you had your drains repaired and that your contractor spent x1 = 3
hours removing gunk from your sewage pipes. Then she sent you a bill of y1 = $350. Now imagine
that your friend hired the same contractor for x2 = 2 hours and that she received a bill of y2 =
$250. If someone then asked you how much to expect on their upcoming gunk-removal invoice
you might make some reasonable assumptions, such as more hours worked costs more dollars.
You might also assume that there is some base charge and that the contractor then charges per
hour. If these assumptions held true, then given these two data points, you could already identify
the contractorʼs pricing structure: $100 per hour plus $50 to show up at your house. If you followed
that much then you already understand the high-level idea behind linear regression (and you just
implicitly designed a linear model with a bias term).
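
Recovering those two parameters from the two data points amounts to solving a small system of linear equations; a sketch in NumPy:

import numpy as np

hours = np.array([3.0, 2.0])      # hours worked for each job
bills = np.array([350.0, 250.0])  # corresponding amounts billed

# Model: bill = rate * hours + base_charge.
A = np.stack([hours, np.ones_like(hours)], axis=1)
rate, base_charge = np.linalg.solve(A, bills)
print(rate, base_charge)  # 100.0 per hour, 50.0 base charge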
In this case, we could produce the parameters that exactly matched the contractorʼs prices. Some-
times that is not possible, e.g., if some of the variance owes to some factors besides your two fea-
tures. In these cases, we will try to learn models that minimize the distance between our predic-
tions and the observed values. In most of our chapters, we will focus on one of two very common
losses, the L1 loss, where

l(y, y′) = ∑i |yi − yi′|  (1.3.1)

and the least mean squares loss, or L2 loss, where

l(y, y′) = ∑i (yi − yi′)².  (1.3.2)



As we will see later, the L2 loss corresponds to the assumption that our data was corrupted by
Gaussian noise, whereas the L1 loss corresponds to an assumption of noise from a Laplace distri-
bution.
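
A minimal sketch of both losses, with the sums taken over examples as in (1.3.1) and (1.3.2) (the numbers are illustrative):

import numpy as np

y = np.array([1.0, 2.0, 3.0])        # targets
y_prime = np.array([0.9, 2.5, 2.8])  # predictions

l1_loss = np.abs(y - y_prime).sum()    # (1.3.1)
l2_loss = ((y - y_prime) ** 2).sum()   # (1.3.2)
print(l1_loss, l2_loss)  # 0.8 and 0.3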

Classification

While regression models are great for addressing how many? questions, lots of problems do not
bend comfortably to this template. For example, a bank might want to add check scanning to its
mobile app. This would involve the customer snapping a photo of a check with their smart phoneʼs
camera and the machine learning model would need to be able to automatically understand text
seen in the image. It would also need to understand hand-written text to be even more robust.
This kind of system is referred to as optical character recognition (OCR), and the kind of problem
it addresses is called classification. It is treated with a different set of algorithms than those used
for regression (although many techniques will carry over).
In classification, we want our model to look at a feature vector, e.g., the pixel values in an image,
and then predict to which category (formally called a class), among some (discrete) set of options, an
example belongs. For hand-written digits, we might have 10 classes, corresponding to the digits 0
through 9. The simplest form of classification is when there are only two classes, a problem which
we call binary classification. For example, our dataset X could consist of images of animals and
our labels Y might be the classes {cat, dog}. While in regression, we sought a regressor to output a
real value ŷ, in classification, we seek a classifier, whose output ŷ is the predicted class assignment.
For reasons that we will get into as the book gets more technical, it can be hard to optimize a
model that can only output a hard categorical assignment, e.g., either cat or dog. In these cases,
it is usually much easier to instead express our model in the language of probabilities. Given an
example x, our model assigns a probability ŷk to each label k. Because these are probabilities,
they need to be positive numbers and add up to 1 and thus we only need K − 1 numbers to assign
probabilities of K categories. This is easy to see for binary classification. If there is a 0.6 (60%)
probability that an unfair coin comes up heads, then there is a 0.4 (40%) probability that it comes
up tails. Returning to our animal classification example, a classifier might see an image and output
the probability that the image is a cat P (y = cat | x) = 0.9. We can interpret this number by saying
that the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for
the predicted class conveys one notion of uncertainty. It is not the only notion of uncertainty and
we will discuss others in more advanced chapters.
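
For instance, a model typically produces one raw score per class and squashes them into probabilities; the softmax function (treated in detail later in the book) is the standard way to do this. A sketch with made-up scores:

import numpy as np

scores = np.array([2.0, 0.5])                   # raw scores for {cat, dog}
probs = np.exp(scores) / np.exp(scores).sum()   # positive, sums to 1
print(probs)  # roughly [0.82, 0.18]: "82% sure it is a cat"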
When we have more than two possible classes, we call the problem multiclass classification. Com-
mon examples include hand-written character recognition [0, 1, 2, 3 ... 9, a, b, c, ...].
While we attacked regression problems by trying to minimize the L1 or L2 loss functions, the
common loss function for classification problems is called cross-entropy. In MXNet Gluon, the
corresponding loss function can be found here.
Note that the most likely class is not necessarily the one that you are going to use for your decision.
Assume that you find this beautiful mushroom in your backyard as shown in Fig. 1.3.2.

Fig. 1.3.2: Death cap—do not eat!

Now, assume that you built a classifier and trained it to predict if a mushroom is poisonous based
on a photograph. Say our poison-detection classifier outputs P (y = deathcap|image) = 0.2. In
other words, the classifier is 80% sure that our mushroom is not a death cap. Still, youʼd have to be
a fool to eat it. That is because the certain benefit of a delicious dinner is not worth a 20% risk of
dying from it. In other words, the effect of the uncertain risk outweighs the benefit by far. We can
look at this more formally. Basically, we need to compute the expected risk that we incur, i.e., we
need to multiply the probability of the outcome with the benefit (or harm) associated with it:

L(action|x) = Ey∼p(y|x) [loss(action, y)]. (1.3.3)

Hence, the loss L incurred by eating the mushroom is L(a = eat|x) = 0.2 ∗ ∞ + 0.8 ∗ 0 = ∞,
whereas the cost of discarding it is L(a = discard|x) = 0.2 ∗ 0 + 0.8 ∗ 1 = 0.8.
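
The same expected-risk computation in code, with a large finite number standing in for the infinite harm of eating a death cap:

p_death_cap = 0.2
HARM_DEATH = 1e9  # a stand-in for "infinite" harm

risk_eat = p_death_cap * HARM_DEATH + (1 - p_death_cap) * 0.0
risk_discard = p_death_cap * 0.0 + (1 - p_death_cap) * 1.0
print(risk_eat, risk_discard)  # eating is vastly riskier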
Our caution was justified: as any mycologist would tell us, the above mushroom actually is a death
cap. Classification can get much more complicated than just binary, multiclass, or even multi-
label classification. For instance, there are some variants of classification for addressing hierar-
chies. Hierarchies assume that there exist some relationships among the many classes. So not all
errors are equal—if we must err, we would prefer to misclassify to a related class rather than to a
distant class. Usually, this is referred to as hierarchical classification. One early example is due to
Linnaeus, who organized the animals in a hierarchy.
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer,
but our model would pay a huge penalty if it confused a poodle for a dinosaur. Which hierarchy
is relevant might depend on how you plan to use the model. For example, rattlesnakes and garter
snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be deadly.

Tagging

Some classification problems do not fit neatly into the binary or multiclass classification setups.
For example, we could train a normal binary classifier to distinguish cats from dogs. Given the
current state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no
matter how accurate our model gets, we might find ourselves in trouble when the classifier en-
counters an image of the Town Musicians of Bremen.


Fig. 1.3.3: A cat, a rooster, a dog and a donkey

As you can see, there is a cat in the picture, and a rooster, a dog, a donkey and a bird, with some
trees in the background. Depending on what we want to do with our model ultimately, treating
this as a binary classification problem might not make a lot of sense. Instead, we might want to
give the model the option of saying the image depicts a cat and a dog and a donkey and a rooster
and a bird.
The problem of learning to predict classes that are not mutually exclusive is called multi-label clas-
sification. Auto-tagging problems are typically best described as multi-label classification prob-
lems. Think of the tags people might apply to posts on a tech blog, e.g., “machine learning”, “tech-
nology”, “gadgets”, “programming languages”, “linux”, “cloud computing”, “AWS”. A typical article
might have 5-10 tags applied because these concepts are correlated. Posts about “cloud comput-
ing” are likely to mention “AWS” and posts about “machine learning” could also deal with “pro-
gramming languages”.
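As a minimal sketch of this setup (the tags, targets, and probabilities below are all invented), each example's labels become a multi-hot vector, and each tag gets its own probability that is thresholded independently:

tags = ['machine learning', 'cloud computing', 'AWS', 'linux', 'gadgets']

target = [1, 1, 1, 0, 0]  # a post tagged with the first three tags, as a multi-hot vector

# Hypothetical per-tag probabilities; unlike softmax outputs, these need not
# sum to 1, since the tags are not mutually exclusive
probs = [0.92, 0.81, 0.77, 0.30, 0.05]

predicted = [tag for tag, p in zip(tags, probs) if p > 0.5]
print(predicted)  # ['machine learning', 'cloud computing', 'AWS']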
We also have to deal with this kind of problem when dealing with the biomedical literature, where
correctly tagging articles is important because it allows researchers to do exhaustive reviews of
the literature. At the National Library of Medicine, a number of professional annotators go over
each article that gets indexed in PubMed to associate it with the relevant terms from MeSH, a
collection of roughly 28k tags. This is a time-consuming process, and there is typically a one-year
lag between archiving and tagging. Machine learning can be used here to provide
provisional tags until each article can have a proper manual review. Indeed, for several years, the
BioASQ organization has hosted a competition to do precisely this.
Search and ranking

Sometimes we do not just want to assign each example to a bucket or to a real value. In the field of
information retrieval, we want to impose a ranking on a set of items. Take web search, for example:
the goal is less to determine whether a particular page is relevant for a query, but rather, which
one of the plethora of search results is most relevant for a particular user. We really care about
the ordering of the relevant search results and our learning algorithm needs to produce ordered
subsets of elements from a larger set. In other words, if we are asked to produce the first 5 letters
from the alphabet, there is a difference between returning A B C D E and C A B E D. Even if the
result set is the same, the ordering within the set matters.
One possible solution to this problem is to first assign to every element in the set a corresponding
relevance score and then to retrieve the top-rated elements. PageRank, the original secret sauce
behind the Google search engine, was an early example of such a scoring system, but it was peculiar
in that it did not depend on the actual query. Here, a simple relevance filter identified the set of
relevant items, and PageRank then ordered those results that contained the query term.
query term. Nowadays, search engines use machine learning and behavioral models to obtain
query-dependent relevance scores. There are entire academic conferences devoted to this subject.
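In its simplest score-then-sort form, the idea can be sketched as follows (the pages and relevance scores are invented for illustration):

# Hypothetical query-dependent relevance scores for candidate pages
scores = {'page_a': 0.31, 'page_b': 0.97, 'page_c': 0.58, 'page_d': 0.12}

# The output is an ordering, not just a set: retrieve the top-rated elements
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)      # ['page_b', 'page_c', 'page_a', 'page_d']
print(ranking[:3])  # the three most relevant results, in order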

Recommender systems

Recommender systems are another problem setting that is related to search and ranking. The
problems are similar insofar as the goal is to display a set of relevant items to the user. The main
difference is the emphasis on personalization to specific users in the context of recommender sys-
tems. For instance, for movie recommendations, the results page for a SciFi fan and the results
page for a connoisseur of Peter Sellers comedies might differ significantly. Similar problems pop
up in other recommendation settings, e.g., for retail products, music, or news recommendation.
In some cases, customers provide explicit feedback communicating how much they liked a partic-
ular product (e.g., the product ratings and reviews on Amazon, IMDB, GoodReads, etc.). In some
other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which might in-
dicate dissatisfaction but might just indicate that the song was inappropriate in context. In the
simplest formulations, these systems are trained to estimate some score y_{ij}, such as an estimated
rating or the probability of purchase, given a user u_i and product p_j.
Given such a model, for any given user, we could retrieve the set of objects with the largest
scores y_{ij}, which could then be recommended to the customer. Production systems are consid-
erably more advanced and take detailed user activity and item characteristics into account when
computing such scores. Fig. 1.3.4 is an example of deep learning books recommended by Amazon
based on personalization algorithms tuned to capture the authorʼs preferences.
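A minimal sketch of the simplest formulation above: assuming (purely illustrative) learned vectors for one user and a few products, we can score each pair by an inner product and recommend the highest-scoring items. Production systems, as noted, are far more elaborate.

from mxnet import np, npx
npx.set_np()

user_vec = np.array([0.9, 0.1, 0.4])  # hypothetical embedding for user i
product_vecs = {'book_1': np.array([1.0, 0.0, 0.2]),   # hypothetical product embeddings
                'book_2': np.array([0.1, 0.9, 0.0]),
                'book_3': np.array([0.8, 0.2, 0.5])}

# Estimated score y_ij for each (user, product) pair via an inner product
scores = {name: float(np.dot(user_vec, v)) for name, v in product_vecs.items()}
recommended = sorted(scores, key=scores.get, reverse=True)
print(recommended)  # ['book_1', 'book_3', 'book_2']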


Fig. 1.3.4: Deep learning books recommended by Amazon.

Despite their tremendous economic value, recommendation systems naively built on top of pre-
dictive models suffer some serious conceptual flaws. To start, we only observe censored feedback.
Users preferentially rate movies that they feel strongly about: you might notice that items receive
many 5 and 1 star ratings but that there are conspicuously few 3-star ratings. Moreover, current
purchase habits are often a result of the recommendation algorithm currently in place, but learn-
ing algorithms do not always take this detail into account. Thus it is possible for feedback loops
to form where a recommender system preferentially pushes an item that is then taken to be bet-
ter (due to greater purchases) and in turn is recommended even more frequently. Many of these
problems about how to deal with censoring, incentives, and feedback loops, are important open
research questions.

Sequence Learning

So far, we have looked at problems where we have some fixed number of inputs and produce a fixed
number of outputs. Earlier, we considered predicting home prices from a fixed set of features:
square footage, number of bedrooms, number of bathrooms, walking time to downtown. We
also discussed mapping from an image (of fixed dimension) to the predicted probabilities that it
belongs to each of a fixed number of classes, or taking a user ID and a product ID, and predicting
a star rating. In these cases, once we feed our fixed-length input into the model to generate an
output, the model immediately forgets what it just saw.
This might be fine if our inputs truly all have the same dimensions and if successive inputs truly
have nothing to do with each other. But how would we deal with video snippets? In this case,
each snippet might consist of a different number of frames. And our guess of what is going on in
each frame might be much stronger if we take into account the previous or succeeding frames.

The same goes for language. One popular deep learning problem is machine translation: the task of
ingesting sentences in some source language and predicting their translation in another language.
These problems also occur in medicine. We might want a model to monitor patients in the in-
tensive care unit and to fire off alerts if their risk of death in the next 24 hours exceeds some
threshold. We definitely would not want this model to throw away everything it knows about the
patient history each hour and just make its predictions based on the most recent measurements.
These problems are among the most exciting applications of machine learning and they are in-
stances of sequence learning. They require a model to either ingest sequences of inputs or to emit
sequences of outputs (or both!). These latter problems are sometimes referred to as seq2seq prob-
lems. Language translation is a seq2seq problem. Transcribing spoken speech to text is also
a seq2seq problem. While it is impossible to consider all types of sequence transformations, a
number of special cases are worth mentioning:
Tagging and Parsing. This involves annotating a text sequence with attributes. In other words,
the number of inputs and outputs is essentially the same. For instance, we might want to know
where the verbs and subjects are. Alternatively, we might want to know which words are the
named entities. In general, the goal is to decompose and annotate text based on structural and
grammatical assumptions to get some annotation. This sounds more complex than it actually is.
Below is a very simple example of annotating a sentence with tags indicating which words refer
to named entities.

Tom  has  dinner  in  Washington  with  Sally.
Ent  -    -       -   Ent         -     Ent

Automatic Speech Recognition. With speech recognition, the input sequence x is an audio
recording of a speaker (shown in Fig. 1.3.5), and the output y is the textual transcript of what the
speaker said. The challenge is that there are many more audio frames (sound is typically sampled
at 8kHz or 16kHz) than text, i.e., there is no 1:1 correspondence between audio and text, since
thousands of samples correspond to a single spoken word. These are seq2seq problems where
the output is much shorter than the input.

Fig. 1.3.5: -D-e-e-p- L-ea-r-ni-ng-

Text to Speech. Text-to-Speech (TTS) is the inverse of speech recognition. In other words, the
input x is text and the output y is an audio file. In this case, the output is much longer than the
input. While it is easy for humans to recognize a bad audio file, this is not quite so trivial for
computers.
Machine Translation. Unlike the case of speech recognition, where corresponding inputs and
outputs occur in the same order (after alignment), in machine translation, order inversion can
be vital. In other words, while we are still converting one sequence into another, neither the



number of inputs and outputs nor the order of corresponding data points is assumed to be the
same. Consider the following illustrative example of the peculiar tendency of Germans to place
the verbs at the end of sentences.

German: Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?


English: Did you already check out this excellent tutorial?
Wrong alignment: Did you yourself already this excellent tutorial looked-at?

Many related problems pop up in other learning tasks. For instance, determining the order in
which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue problems
exhibit all kinds of additional complications, where determining what to say next requires taking
into account real-world knowledge and the prior state of the conversation across long temporal
distances. This is an active area of research.

1.3.2 Unsupervised learning

All the examples so far were related to Supervised Learning, i.e., situations where we feed the model
a giant dataset containing both the features and corresponding target values. You could think of
the supervised learner as having an extremely specialized job and an extremely anal boss. The
boss stands over your shoulder and tells you exactly what to do in every situation until you learn
to map from situations to actions. Working for such a boss sounds pretty lame. On the other hand,
it is easy to please this boss. You just recognize the pattern as quickly as possible and imitate their
actions.
In a completely opposite way, it could be frustrating to work for a boss who has no idea what
they want you to do. However, if you plan to be a data scientist, youʼd better get used to it. The
boss might just hand you a giant dump of data and tell you to do some data science with it! This
sounds vague because it is. We call this class of problems unsupervised learning, and the type and
number of questions we could ask is limited only by our creativity. We will address a number of
unsupervised learning techniques in later chapters. To whet your appetite for now, we describe a
few of the questions you might ask:
• Can we find a small number of prototypes that accurately summarize the data? Given a set of
photos, can we group them into landscape photos, pictures of dogs, babies, cats, mountain
peaks, etc.? Likewise, given a collection of usersʼ browsing activity, can we group them into
users with similar behavior? This problem is typically known as clustering.
• Can we find a small number of parameters that accurately capture the relevant properties of
the data? The trajectory of a ball is quite well described by its velocity, diameter, and mass.
Tailors have developed a small number of parameters that describe human body
shape fairly accurately for the purpose of fitting clothes. These problems are referred to
as subspace estimation problems. If the dependence is linear, it is called principal component
analysis.
• Is there a representation of (arbitrarily structured) objects in Euclidean space (i.e., the space
of vectors in R^n) such that symbolic properties can be well matched? This is called represen-
tation learning and it is used to describe entities and their relations, such as Rome − Italy +
France = Paris (see the sketch after this list).
• Is there a description of the root causes of much of the data that we observe? For instance,
if we have demographic data about house prices, pollution, crime, location, education,

salaries, etc., can we discover how they are related simply based on empirical data? The
fields concerned with causality and probabilistic graphical models address this problem.
• Another important and exciting recent development in unsupervised learning is the advent
of generative adversarial networks. These give us a procedural way to synthesize data, even
complicated structured data like images and audio. The underlying statistical mechanisms
are tests to check whether real and fake data are the same. We will devote a few notebooks
to them.
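As a toy sketch of the representation-learning analogy mentioned above (with made-up 2-dimensional vectors; real learned representations live in much higher dimensions), the analogy is answered by vector arithmetic followed by a nearest-neighbor lookup:

from mxnet import np, npx
npx.set_np()

# Invented vectors in which capitals sit at a fixed offset from their countries
vec = {'Italy': np.array([1.0, 0.0]), 'Rome': np.array([1.0, 1.0]),
       'France': np.array([2.0, 0.0]), 'Paris': np.array([2.0, 1.0])}

query = vec['Rome'] - vec['Italy'] + vec['France']

# Nearest neighbor by squared Euclidean distance
nearest = min(vec, key=lambda w: float(((vec[w] - query) ** 2).sum()))
print(nearest)  # 'Paris'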

1.3.3 Interacting with an Environment

So far, we have not discussed where data actually comes from, or what actually happens when a
machine learning model generates an output. That is because supervised learning and unsuper-
vised learning do not address these issues in a very sophisticated way. In either case, we grab a
big pile of data up front, then set our pattern recognition machines in motion without ever in-
teracting with the environment again. Because all of the learning takes place after the algorithm
is disconnected from the environment, this is sometimes called offline learning. For supervised
learning, the process looks like Fig. 1.3.6.

Fig. 1.3.6: Collect data for supervised learning from an environment.

This simplicity of offline learning has its charms. The upside is we can worry about pattern recog-
nition in isolation, without any distraction from these other problems. But the downside is that
the problem formulation is quite limiting. If you are more ambitious, or if you grew up reading
Asimovʼs Robot Series, then you might imagine artificially intelligent bots capable not only of mak-
ing predictions, but of taking actions in the world. We want to think about intelligent agents, not
just predictive models. That means we need to think about choosing actions, not just making predic-
tions. Moreover, unlike predictions, actions actually impact the environment. If we want to train
an intelligent agent, we must account for the way its actions might impact the future observations
of the agent.
Considering the interaction with an environment opens a whole set of new modeling questions.
Does the environment:
• Remember what we did previously?
• Want to help us, e.g., a user reading text into a speech recognizer?
• Want to beat us, i.e., an adversarial setting like spam filtering (against spammers) or playing
a game (vs an opponent)?



• Not care (as in many cases)?
• Have shifting dynamics (does future data always resemble the past or do the patterns change
over time, either naturally or in response to our automated tools)?
This last question raises the problem of distribution shift (when training and test data are differ-
ent). It is a problem that most of us have experienced when taking exams written by a lecturer,
while the homeworks were composed by her TAs. We will briefly describe reinforcement learning
and adversarial learning, two settings that explicitly consider interaction with an environment.

1.3.4 Reinforcement learning

If you are interested in using machine learning to develop an agent that interacts with an environ-
ment and takes actions, then you are probably going to wind up focusing on reinforcement learning
(RL). This might include applications to robotics, to dialogue systems, and even to developing AI
for video games. Deep reinforcement learning (DRL), which applies deep neural networks to RL
problems, has surged in popularity. The breakthrough deep Q-network that beat humans at Atari
games using only the visual input, and the AlphaGo program that dethroned the world champion
at the board game Go are two prominent examples.
Reinforcement learning gives a very general statement of a problem, in which an agent interacts
with an environment over a series of timesteps. At each timestep t, the agent receives some ob-
servation o_t from the environment and must choose an action a_t that is subsequently transmitted
back to the environment via some mechanism (sometimes called an actuator). Finally, the agent
receives a reward r_t from the environment. The agent then receives a subsequent observation,
and chooses a subsequent action, and so on. The behavior of an RL agent is governed by a policy.
In short, a policy is just a function that maps from observations (of the environment) to actions.
The goal of reinforcement learning is to produce a good policy.
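The interaction protocol itself is just a loop. Everything below (the toy environment and the random policy) is a stand-in to show the shape of the problem, not a real RL algorithm:

import random

class ToyEnvironment:
    # A stand-in environment with trivial observations, rewards, and dynamics
    def reset(self):
        return 0.0  # initial observation
    def step(self, action):
        observation = random.random()                # o_t
        reward = 1.0 if action == 'right' else 0.0   # r_t
        return observation, reward

def policy(observation):
    # A policy is just a function from observations to actions
    return random.choice(['left', 'right'])

env = ToyEnvironment()
obs, total_reward = env.reset(), 0.0
for t in range(5):
    action = policy(obs)             # choose a_t from o_t
    obs, reward = env.step(action)   # the environment responds
    total_reward += reward
print(total_reward)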

Fig. 1.3.7: The interaction between reinforcement learning and an environment.

It is hard to overstate the generality of the RL framework. For example, we can cast any supervised
learning problem as an RL problem. Say we had a classification problem. We could create an RL
agent with one action corresponding to each class. We could then create an environment which
gave a reward that was exactly equal to the loss function from the original supervised problem.
That being said, RL can also address many problems that supervised learning cannot. For exam-
ple, in supervised learning we always expect that the training input comes associated with the
correct label. But in RL, we do not assume that for each observation, the environment tells us the
optimal action. In general, we just get some reward. Moreover, the environment may not even
tell us which actions led to the reward.
Consider for example the game of chess. The only real reward signal comes at the end of the
game, when we either win, to which we might assign a reward of 1, or lose, to which we could
assign a reward of -1. So reinforcement learners must deal with the credit assignment problem:
determining which actions to credit or blame for an outcome. The same goes for an employee
who gets a promotion on October 11. That promotion likely reflects a large number of well-chosen
actions over the previous year. Getting more promotions in the future requires figuring out what
actions along the way led to the promotion.
Reinforcement learners may also have to deal with the problem of partial observability. That is,
the current observation might not tell you everything about your current state. Say a cleaning
robot found itself trapped in one of many identical closets in a house. Inferring the precise lo-
cation (and thus state) of the robot might require considering its previous observations before
entering the closet.
Finally, at any given point, reinforcement learners might know of one good policy, but there might
be many other better policies that the agent has never tried. The reinforcement learner must
constantly choose whether to exploit the best currently-known strategy as a policy, or to explore
the space of strategies, potentially giving up some short-run reward in exchange for knowledge.

MDPs, bandits, and friends

The general reinforcement learning problem is a very broad setting. Actions affect subsequent
observations. Rewards are only observed corresponding to the chosen actions. The environment
may be either fully or partially observed. Accounting for all this complexity at once may ask too
much of researchers. Moreover, not every practical problem exhibits all this complexity. As a
result, researchers have studied a number of special cases of reinforcement learning problems.
When the environment is fully observed, we call the RL problem a Markov Decision Process (MDP).
When the state does not depend on the previous actions, we call the problem a contextual bandit
problem. When there is no state, just a set of available actions with initially unknown rewards, this
problem is the classic multi-armed bandit problem.
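The multi-armed bandit also gives the explore/exploit trade-off from the previous section a compact illustration. Below is a minimal ε-greedy sketch with invented payout probabilities:

import random

random.seed(0)
true_payout = [0.3, 0.5, 0.7]   # unknown to the learner; arm 2 is best
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
epsilon = 0.1                   # probability of exploring a random arm

for t in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit current best estimate
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean of rewards

print(max(range(3), key=lambda a: values[a]))  # usually prints 2, the best arm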

1.4 Roots

Although many deep learning methods are recent inventions, humans have held the desire to an-
alyze data and to predict future outcomes for centuries. In fact, much of natural science has its
roots in this. For instance, the Bernoulli distribution is named after Jacob Bernoulli (1655-1705),
and the Gaussian distribution was discovered by Carl Friedrich Gauss (1777-1855). He invented,
for instance, the least mean squares algorithm, which is still used today for countless problems
from insurance calculations to medical diagnostics. These tools gave rise to an experimental ap-
proach in the natural sciences—for instance, Ohmʼs law relating current and voltage in a resistor
is perfectly described by a linear model.
Even in the Middle Ages, mathematicians had a keen intuition for estimates. For instance, the
geometry book of Jacob Köbel (1460-1533) illustrates averaging the length of 16 adult menʼs feet
to obtain the average foot length.

Fig. 1.4.1: Estimating the length of a foot

Fig. 1.4.1 illustrates how this estimator works. The 16 adult men were asked to line up in a row
when leaving church. The aggregate length of their feet was then divided by 16 to obtain an estimate
for what now amounts to 1 foot. This “algorithm” was later improved to deal with misshapen feet—
the 2 men with the shortest and longest feet respectively were sent away, averaging only over the
remainder. This is one of the earliest examples of the trimmed mean estimate.
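In code, the improved “algorithm” is simply a trimmed mean (the foot lengths below, in cm, are invented for illustration):

feet = [24.1, 29.8, 26.5, 27.0, 25.3, 31.2, 26.9, 27.4,
        25.8, 26.2, 28.1, 27.7, 22.9, 26.6, 27.3, 28.0]  # 16 hypothetical measurements

plain_mean = sum(feet) / len(feet)

# Send away the men with the shortest and the longest feet, then average the rest
trimmed = sorted(feet)[1:-1]
trimmed_mean = sum(trimmed) / len(trimmed)

print(plain_mean, trimmed_mean)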
Statistics really took off with the collection and availability of data. One of its titans, Ronald Fisher
(1890-1962), contributed significantly to its theory and also its applications in genetics. Many of
his algorithms (such as Linear Discriminant Analysis) and formulas (such as the Fisher Information
Matrix) are still in frequent use today (even the Iris dataset that he released in 1936 is still used
sometimes to illustrate machine learning algorithms). Fisher was also a proponent of eugenics,
which should remind us that the morally dubious use of data science has as long and enduring a
history as its productive use in industry and the natural sciences.
A second influence for machine learning came from Information Theory (Claude Shannon, 1916-
2001) and the Theory of Computation via Alan Turing (1912-1954). Turing posed the question
“can machines think?” in his famous paper Computing machinery and intelligence (Mind, October
1950). In what he described as the Turing test, a machine can be considered intelligent if it
is difficult for a human evaluator to distinguish between the replies from a machine and a human
based on textual interactions.
Another influence can be found in neuroscience and psychology. After all, humans clearly exhibit
intelligent behavior. It is thus only reasonable to ask whether one could explain and possibly re-
verse engineer this capacity. One of the oldest algorithms inspired in this fashion was formulated
by Donald Hebb (1904-1985). In his groundbreaking book The Organization of Behavior (Hebb
& Hebb, 1949), he posited that neurons learn by positive reinforcement. This became known as
the Hebbian learning rule. It is the prototype of Rosenblattʼs perceptron learning algorithm and it
laid the foundations of many stochastic gradient descent algorithms that underpin deep learning
today: reinforce desirable behavior and diminish undesirable behavior to obtain good settings of
the parameters in a neural network.
Biological inspiration is what gave neural networks their name. For over a century (dating back
to the models of Alexander Bain, 1873 and James Sherrington, 1890), researchers have tried to
assemble computational circuits that resemble networks of interacting neurons. Over time, the
interpretation of biology has become less literal but the name stuck. At its heart, lie a few key
principles that can be found in most networks today:
• The alternation of linear and nonlinear processing units, often referred to as layers.
• The use of the chain rule (also known as backpropagation) for adjusting parameters in the
entire network at once.
After initial rapid progress, research in neural networks languished from around 1995 until 2005.
This was due to a number of reasons. Training a network is computationally very expensive.
While RAM was plentiful at the end of the past century, computational power was scarce. Sec-
ond, datasets were relatively small. In fact, Fisherʼs Iris dataset from 1936 was a popular tool for
testing the efficacy of algorithms. MNIST with its 60,000 handwritten digits was considered huge.
Given the scarcity of data and computation, strong statistical tools such as Kernel Methods, Deci-
sion Trees and Graphical Models proved empirically superior. Unlike neural networks, they did
not require weeks to train and provided predictable results with strong theoretical guarantees.

1.5 The Road to Deep Learning

Much of this changed with the ready availability of large amounts of data, due to the World Wide
Web, the advent of companies serving hundreds of millions of users online, a dissemination of
cheap, high quality sensors, cheap data storage (Kryderʼs law), and cheap computation (Mooreʼs
law), in particular in the form of GPUs, originally engineered for computer gaming. Suddenly
algorithms and models that seemed computationally infeasible became relevant (and vice versa).
This is best illustrated in Table 1.5.1.


Table 1.5.1: Dataset versus computer memory and computational power

Decade   Dataset                                 Memory   Floating Point Calculations per Second
1970     100 (Iris)                              1 KB     100 KF (Intel 8080)
1980     1 K (house prices in Boston)            100 KB   1 MF (Intel 80186)
1990     10 K (optical character recognition)    10 MB    10 MF (Intel 80486)
2000     10 M (web pages)                        100 MB   1 GF (Intel Core)
2010     10 G (advertising)                      1 GB     1 TF (Nvidia C2050)
2020     1 T (social network)                    100 GB   1 PF (Nvidia DGX-2)

It is evident that RAM has not kept pace with the growth in data. At the same time, the increase
in computational power has outpaced that of the data available. This means that statistical mod-
els needed to become more memory efficient (this is typically achieved by adding nonlineari-
ties) while simultaneously being able to spend more time on optimizing these parameters, due
to an increased compute budget. Consequently the sweet spot in machine learning and statis-
tics moved from (generalized) linear models and kernel methods to deep networks. This is also
one of the reasons why many of the mainstays of deep learning, such as multilayer perceptrons
(McCulloch & Pitts, 1943), convolutional neural networks (LeCun et al., 1998), Long Short-Term
Memory (Hochreiter & Schmidhuber, 1997), and Q-Learning (Watkins & Dayan, 1992), were es-
sentially “rediscovered” in the past decade, after laying comparatively dormant for considerable
time.
The recent progress in statistical models, applications, and algorithms, has sometimes been
likened to the Cambrian Explosion: a moment of rapid progress in the evolution of species. In-
deed, the state of the art is not just a mere consequence of available resources, applied to decades
old algorithms. Note that the list below barely scratches the surface of the ideas that have helped
researchers achieve tremendous progress over the past decade.
• Novel methods for capacity control, such as Dropout (Srivastava et al., 2014) have helped
to mitigate the danger of overfitting. This was achieved by applying noise injection (Bishop,
1995) throughout the network, replacing weights by random variables for training purposes.
• Attention mechanisms solved a second problem that had plagued statistics for over a cen-
tury: how to increase the memory and complexity of a system without increasing the num-
ber of learnable parameters. (Bahdanau et al., 2014) found an elegant solution by using what
can only be viewed as a learnable pointer structure. Rather than having to remember an en-
tire sentence, e.g., for machine translation in a fixed-dimensional representation, all that
needed to be stored was a pointer to the intermediate state of the translation process. This
allowed for significantly increased accuracy for long sentences, since the model no longer
needed to remember the entire sentence before commencing the generation of a new sen-
tence.
• Multi-stage designs, e.g., via the Memory Networks (MemNets) (Sukhbaatar et al., 2015) and
the Neural Programmer-Interpreter (Reed & DeFreitas, 2015) allowed statistical modelers
to describe iterative approaches to reasoning. These tools allow for an internal state of the
deep network to be modified repeatedly, thus carrying out subsequent steps in a chain of
reasoning, similar to how a processor can modify memory for a computation.
• Another key development was the invention of GANs (Goodfellow et al., 2014). Traditionally,
statistical methods for density estimation and generative models focused on finding proper

probability distributions and (often approximate) algorithms for sampling from them. As
a result, these algorithms were largely limited by the lack of flexibility inherent in the sta-
tistical models. The crucial innovation in GANs was to replace the sampler by an arbitrary
algorithm with differentiable parameters. These are then adjusted in such a way that the dis-
criminator (effectively a two-sample test) cannot distinguish fake from real data. Through
the ability to use arbitrary algorithms to generate data, it opened up density estimation to
a wide variety of techniques. Examples of galloping Zebras (Zhu et al., 2017) and of fake
celebrity faces (Karras et al., 2017) are both testimony to this progress.
• In many cases, a single GPU is insufficient to process the large amounts of data available for
training. Over the past decade the ability to build parallel distributed training algorithms
has improved significantly. One of the key challenges in designing scalable algorithms is
that the workhorse of deep learning optimization, stochastic gradient descent, relies on rel-
atively small minibatches of data to be processed. At the same time, small batches limit the
efficiency of GPUs. Hence, training on 1024 GPUs with a minibatch size of, say, 32 images
per GPU amounts to an aggregate minibatch of 32k images. Recent work, first by Li (Li,
2017), and subsequently by (You et al., 2017) and (Jia et al., 2018) pushed the size up to 64k
observations, reducing training time for ResNet50 on ImageNet to less than 7 minutes. For
comparison—initially training times were measured in the order of days.
• The ability to parallelize computation has also contributed quite crucially to progress in re-
inforcement learning, at least whenever simulation is an option. This has led to significant
progress in computers achieving superhuman performance in Go, Atari games, Starcraft,
and in physics simulations (e.g., using MuJoCo). See e.g., (Silver et al., 2016) for a descrip-
tion of how to achieve this in AlphaGo. In a nutshell, reinforcement learning works best if
plenty of (state, action, reward) triples are available, i.e., whenever it is possible to try out
lots of things to learn how they relate to each other. Simulation provides such an avenue.
• Deep Learning frameworks have played a crucial role in disseminating ideas. The first
generation of frameworks allowing for easy modeling encompassed Caffe, Torch, and
Theano. Many seminal papers were written using these tools. By now, they have been su-
perseded by TensorFlow, often used via its high-level API Keras, CNTK, Caffe 2, and
Apache MXNet. The third generation of tools, namely imperative tools for deep learning,
was arguably spearheaded by Chainer, which used a syntax similar to Python NumPy to
describe models. This idea was adopted by PyTorch and the Gluon API of MXNet. It is
the latter group that this course uses to teach deep learning.
The division of labor between systems researchers building better tools and statistical modelers
building better networks has greatly simplified things. For instance, training a linear logistic re-
gression model used to be a nontrivial homework problem, worthy to give to new machine learn-
ing PhD students at Carnegie Mellon University in 2014. By now, this task can be accomplished
with less than 10 lines of code, putting it firmly into the grasp of programmers.


1.6 Success Stories

Artificial Intelligence has a long history of delivering results that would be difficult to accomplish
otherwise. For instance, mail is sorted using optical character recognition. These systems have
been deployed since the 90s (this is, after all, the source of the famous MNIST and USPS sets of
handwritten digits). The same applies to reading checks for bank deposits and scoring creditwor-
thiness of applicants. Financial transactions are checked for fraud automatically. This forms the
backbone of many e-commerce payment systems, such as PayPal, Stripe, AliPay, WeChat, Apple,
Visa, MasterCard. Computer programs for chess have been competitive for decades. Machine
learning feeds search, recommendation, personalization and ranking on the Internet. In other
words, artificial intelligence and machine learning are pervasive, albeit often hidden from sight.
It is only recently that AI has been in the limelight, mostly due to solutions to problems that were
considered intractable previously.
• Intelligent assistants, such as Appleʼs Siri, Amazonʼs Alexa, or Googleʼs assistant are able to
answer spoken questions with a reasonable degree of accuracy. This includes menial tasks
such as turning on light switches (a boon to the disabled) up to making barberʼs appoint-
ments and offering phone support dialog. This is likely the most noticeable sign that AI is
affecting our lives.
• A key ingredient in digital assistants is the ability to recognize speech accurately. Gradually
the accuracy of such systems has increased to the point where they reach human parity
(Xiong et al., 2018) for certain applications.
• Object recognition likewise has come a long way. Estimating the object in a picture was a
fairly challenging task in 2010. On the ImageNet benchmark, (Lin et al., 2010) achieved a
top-5 error rate of 28%. By 2017, (Hu et al., 2018) had reduced this error rate to 2.25%. Similarly
stunning results have been achieved for identifying birds, or diagnosing skin cancer.
• Games used to be a bastion of human intelligence. Starting from TD-Gammon (Tesauro, 1995), a pro-
gram for playing Backgammon using temporal difference (TD) reinforcement learning, al-
gorithmic and computational progress has led to algorithms for a wide range of applications.
Unlike Backgammon, chess has a much more complex state space and set of actions. Deep
Blue beat Garry Kasparov (Campbell et al., 2002), using massive parallelism,
special purpose hardware and efficient search through the game tree. Go is more difficult
still, due to its huge state space. AlphaGo reached human parity in 2015 (Silver et al., 2016),
using deep learning combined with Monte Carlo tree search. The challenge in Poker
was that the state space is large and it is not fully observed (we do not know the opponentsʼ
cards). Libratus exceeded human performance in Poker using efficiently structured strate-
gies (Brown & Sandholm, 2017). This illustrates the impressive progress in games and the
fact that advanced algorithms played a crucial part in them.
• Another indication of progress in AI is the advent of self-driving cars and trucks. While
full autonomy is not quite within reach yet, excellent progress has been made in this direc-
tion, with companies such as Tesla, NVIDIA, and Waymo shipping products that enable at
least partial autonomy. What makes full autonomy so challenging is that proper driving re-
quires the ability to perceive, to reason and to incorporate rules into a system. At present,
deep learning is used primarily in the computer vision aspect of these problems. The rest is
heavily tuned by engineers.
Again, the above list barely scratches the surface of where machine learning has impacted prac-
tical applications. For instance, robotics, logistics, computational biology, particle physics, and

astronomy owe some of their most impressive recent advances at least in parts to machine learn-
ing. ML is thus becoming a ubiquitous tool for engineers and scientists.
Frequently, the question of the AI apocalypse, or the AI singularity has been raised in non-
technical articles on AI. The fear is that somehow machine learning systems will become sen-
tient and decide independently from their programmers (and masters) about things that directly
affect the livelihood of humans. To some extent, AI already affects the livelihood of humans in
an immediate way—creditworthiness is assessed automatically, autopilots mostly navigate cars,
decisions about whether to grant bail use statistical data as input. More frivolously, we can ask
Alexa to switch on the coffee machine.
Fortunately, we are far from a sentient AI system that is ready to manipulate its human creators
(or burn their coffee). First, AI systems are engineered, trained and deployed in a specific, goal-
oriented manner. While their behavior might give the illusion of general intelligence, it is a com-
bination of rules, heuristics and statistical models that underlie the design. Second, tools for
artificial general intelligence that are able to improve themselves, reason about themselves, and
modify, extend, and improve their own architecture while trying to solve general tasks simply do
not exist at present.
A much more pressing concern is how AI is being used in our daily lives. It is likely that many me-
nial tasks fulfilled by truck drivers and shop assistants can and will be automated. Farm robots will
likely reduce the cost for organic farming but they will also automate harvesting operations. This
phase of the industrial revolution may have profound consequences on large swaths of society
(truck drivers and shop assistants are some of the most common jobs in many states). Further-
more, statistical models, when applied without care can lead to racial, gender or age bias and raise
reasonable concerns about procedural fairness if automated to drive consequential decisions. It
is important to ensure that these algorithms are used with care. With what we know today, this
strikes us as a much more pressing concern than the potential of malevolent superintelligence to
destroy humanity.

Summary

• Machine learning studies how computer systems can leverage experience (often data) to im-
prove performance at specific tasks. It combines ideas from statistics, data mining, artificial
intelligence, and optimization. Often, it is used as a means of implementing artificially-
intelligent solutions.
• As a class of machine learning, representation learning focuses on how to automatically
find the appropriate way to represent data. This is often accomplished by a progression of
learned transformations.
• Much of the recent progress in deep learning has been triggered by an abundance of data
arising from cheap sensors and Internet-scale applications, and by significant progress in
computation, mostly through GPUs.
• Whole system optimization is a key component in obtaining good performance. The avail-
ability of efficient deep learning frameworks has made design and implementation of this
significantly easier.



Exercises

1. Which parts of code that you are currently writing could be “learned”, i.e., improved by
learning and automatically determining design choices that are made in your code? Does
your code include heuristic design choices?
2. Which problems that you encounter have many examples for how to solve them, yet no spe-
cific way to automate them? These may be prime candidates for using deep learning.
3. Viewing the development of artificial intelligence as a new industrial revolution, what is the
relationship between algorithms and data? Is it similar to steam engines and coal (what is
the fundamental difference)?
4. Where else can you apply the end-to-end training approach? Physics? Engineering? Econo-
metrics?

2 | Preliminaries

To get started with deep learning, we will need to develop a few basic skills. All machine learning
is concerned with extracting information from data. So we will begin by learning the practical
skills for storing, manipulating, and preprocessing data.
Moreover, machine learning typically requires working with large datasets, which we can think
of as tables, where the rows correspond to examples and the columns correspond to attributes.
Linear algebra gives us a powerful set of techniques for working with tabular data. We will not go
too far into the weeds but rather focus on the basics of matrix operations and their implementation.
Additionally, deep learning is all about optimization. We have a model with some parameters and
we want to find those that fit our data the best. Determining which way to move each parameter at
each step of an algorithm requires a little bit of calculus, which will be briefly introduced. Fortu-
nately, the autograd package computes derivatives for us automatically, and we will cover it
next.
Next, machine learning is concerned with making predictions: what is the likely value of some un-
known attribute, given the information that we observe? To reason rigorously under uncertainty
we will need to invoke the language of probability.
Finally, the official documentation provides plenty of descriptions and examples that are be-
yond the scope of this book. To conclude the chapter, we will show you how to look up documentation for the
needed information.
This book has kept the mathematical content to the minimum necessary to get a proper under-
standing of deep learning. However, it does not mean that this book is mathematics free. Thus,
this chapter provides a rapid introduction to basic and frequently-used mathematics to allow any-
one to understand at least most of the mathematical content of the book. If you wish to understand
all of the mathematical content, further reviewing Chapter 17 should be sufficient.

2.1 Data Manipulation

In order to get anything done, we need some way to store and manipulate data. Generally, there
are two important things we need to do with data: (i) acquire them; and (ii) process them once they
are inside the computer. There is no point in acquiring data absent some way to store it, so letʼs
get our hands dirty first by playing with synthetic data. To start, we introduce the n-dimensional
array (ndarray), MXNetʼs primary tool for storing and transforming data. In MXNet, ndarray is a
class and we call any instance “an ndarray”.
If you have worked with NumPy, the most widely-used scientific computing package in Python,
then you will find this section familiar. Thatʼs by design. We designed MXNetʼs ndarray to be an exten-
sion to NumPyʼs ndarray with a few killer features. First, MXNetʼs ndarray supports asynchronous

computation on CPU, GPU, and distributed cloud architectures, whereas NumPy only supports
CPU computation. Second, MXNetʼs ndarray supports automatic differentiation. These proper-
ties make MXNetʼs ndarray suitable for deep learning. Throughout the book, when we say ndarray,
we are referring to MXNetʼs ndarray unless otherwise stated.

2.1.1 Getting Started

In this section, we aim to get you up and running, equipping you with the basic math and
numerical computing tools that you will build on as you progress through the book. Do not worry
if you struggle to grok some of the mathematical concepts or library functions. The following
sections will revisit this material in the context of practical examples, and it will sink in. On the other
hand, if you already have some background and want to go deeper into the mathematical content,
just skip this section.
To start, we import the np (numpy) and npx (numpy_extension) modules from MXNet. Here, the np
module includes functions supported by NumPy, while the npx module contains a set of extensions
developed to empower deep learning within a NumPy-like environment. When using ndarray, we
almost always invoke the set_np function: this is for compatibility of ndarray processing by other
components of MXNet.

from mxnet import np, npx

npx.set_np()

An ndarray represents a (possibly multi-dimensional) array of numerical values. With one axis,
an ndarray corresponds (in math) to a vector. With two axes, an ndarray corresponds to a matrix.
Arrays with more than two axes do not have special mathematical names—we simply call them
tensors.
To start, we can use arange to create a row vector x containing the first 12 integers starting with 0,
though they are created as floats by default. Each of the values in an ndarray is called an element
of the ndarray. For instance, there are 12 elements in the ndarray x. Unless otherwise specified,
a new ndarray will be stored in main memory and designated for CPU-based computation.

x = np.arange(12)
x

array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])

We can access an ndarrayʼs shape (the length along each axis) by inspecting its shape property.

x.shape

(12,)

If we just want to know the total number of elements in an ndarray, i.e., the product of all of the
shape elements, we can inspect its size property. Because we are dealing with a vector here, the
single element of its shape is identical to its size.

x.size

12

To change the shape of an ndarray without altering either the number of elements or their values,
we can invoke the reshape function. For example, we can transform our ndarray, x, from a row
vector with shape (12,) to a matrix with shape (3, 4). This new ndarray contains the exact same
values, but views them as a matrix organized as 3 rows and 4 columns. To reiterate, although the
shape has changed, the elements in x have not. Note that the size is unaltered by reshaping.

x = x.reshape(3, 4)
x

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

Reshaping by manually specifying every dimension is unnecessary. If our target shape is a matrix
with shape (height, width), then after we know the width, the height is given implicitly. Why
should we have to perform the division ourselves? In the example above, to get a matrix with
3 rows, we specified both that it should have 3 rows and 4 columns. Fortunately, ndarray can
automatically work out one dimension given the rest. We invoke this capability by placing -1 for
the dimension that we would like ndarray to automatically infer. In our case, instead of calling
x.reshape(3, 4), we could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1).
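As a quick check of this equivalence (reusing the x defined above), both calls infer the same shape:

x.reshape(-1, 4).shape, x.reshape(3, -1).shape

((3, 4), (3, 4))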
The empty method grabs a chunk of memory and hands us back a matrix without bothering to
change the value of any of its entries. This is remarkably efficient but we must be careful because
the entries might take arbitrary values, including very big ones!

np.empty((3, 4))

array([[1.4624006e-24, 4.5861696e-41, 1.8142162e+18, 3.0880414e-41],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00]])

Typically, we will want our matrices initialized either with zeros, ones, some other constants, or
numbers randomly sampled from a specific distribution. We can create an ndarray representing
a tensor with all elements set to 0 and a shape of (2, 3, 4) as follows:

np.zeros((2, 3, 4))

array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]])

Similarly, we can create tensors with each element set to 1 as follows:



np.ones((2, 3, 4))

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

Often, we want to randomly sample the values for each element in an ndarray from some
probability distribution. For example, when we construct arrays to serve as parameters in a
neural network, we will typically initialize their values randomly. The following snippet creates
an ndarray with shape (3, 4). Each of its elements is randomly sampled
from a standard Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1.

np.random.normal(0, 1, size=(3, 4))

array([[ 2.2122064 ,  1.1630787 ,  0.7740038 ,  0.4838046 ],
       [ 1.0434405 ,  0.29956347,  1.1839255 ,  0.15302546],
       [ 1.8917114 , -1.1688148 , -1.2347414 ,  1.5580711 ]])

We can also specify the exact values for each element in the desired ndarray by supplying a Python
list (or list of lists) containing the numerical values. Here, the outermost list corresponds to axis
0, and the inner list to axis 1.

np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

array([[2., 1., 4., 3.],
       [1., 2., 3., 4.],
       [4., 3., 2., 1.]])

2.1.2 Operations

This book is not about software engineering. Our interests are not limited to simply reading and
writing data from/to arrays. We want to perform mathematical operations on those arrays. Some
of the simplest and most useful operations are the elementwise operations. These apply a stan-
dard scalar operation to each element of an array. For functions that take two arrays as inputs,
elementwise operations apply some standard binary operator on each pair of corresponding ele-
ments from the two arrays. We can create an elementwise function from any function that maps
from a scalar to a scalar.
In mathematical notation, we would denote such a unary scalar operator (taking one input) by the
signature f : R → R. This just means that the function maps from any real number (R) onto
another. Likewise, we denote a binary scalar operator (taking two real inputs, and yielding one
output) by the signature f : R, R → R. Given any two vectors u and v of the same shape, and a binary
operator f, we can produce a vector c = F(u, v) by setting c_i ← f(u_i, v_i) for all i, where c_i, u_i, and
v_i are the i-th elements of vectors c, u, and v. Here, we produced the vector-valued F : R^d, R^d → R^d
by lifting the scalar function to an elementwise vector operation.
In MXNet, the common standard arithmetic operators (+, -, *, /, and **) have all been lifted to el-
ementwise operations for any identically-shaped tensors of arbitrary shape. We can call element-
wise operations on any two tensors of the same shape. In the following example, we use commas
to formulate a 5-element tuple, where each element is the result of an elementwise operation.

x = np.array([1, 2, 4, 8])
y = np.array([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation

(array([ 3.,  4.,  6., 10.]),
 array([-1.,  0.,  2.,  6.]),
 array([ 2.,  4.,  8., 16.]),
 array([0.5, 1. , 2. , 4. ]),
 array([ 1.,  4., 16., 64.]))

Many more operations can be applied elementwise, including unary operators like exponentia-
tion.

np.exp(x)

array([2.7182817e+00, 7.3890562e+00, 5.4598148e+01, 2.9809580e+03])

In addition to elementwise computations, we can also perform linear algebra operations, includ-
ing vector dot products and matrix multiplication. We will explain the crucial bits of linear algebra
(with no assumed prior knowledge) in Section 2.3.
We can also concatenate multiple ndarrays together, stacking them end-to-end to form a larger
ndarray. We just need to provide a list of ndarrays and tell the system along which axis to con-
catenate. The example below shows what happens when we concatenate two matrices along rows
(axis 0, the first element of the shape) vs. columns (axis 1, the second element of the shape). We
can see that the first output ndarrayʼs axis-0 length (6) is the sum of the two input ndarraysʼ axis-0
lengths (3 + 3), while the second output ndarrayʼs axis-1 length (8) is the sum of the two input
ndarraysʼ axis-1 lengths (4 + 4).

x = np.arange(12).reshape(3, 4)
y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
np.concatenate([x, y], axis=0), np.concatenate([x, y], axis=1)

(array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]]),
 array([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

Sometimes, we want to construct a binary ndarray via logical statements. Take x == y as an ex-
ample. For each position, if x and y are equal at that position, the corresponding entry in the
new ndarray takes a value of 1, meaning that the logical statement x == y is true at that position;
otherwise that position takes 0.

x == y

array([[False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False]])

Summing all the elements in the ndarray yields an ndarray with only one element.

x.sum()

array(66.)

For stylistic convenience, we can write x.sum() as np.sum(x).

2.1.3 Broadcasting Mechanism

In the above section, we saw how to perform elementwise operations on two ndarrays of the same
shape. Under certain conditions, even when shapes differ, we can still perform elementwise op-
erations by invoking the broadcasting mechanism. This mechanism works in the following way:
First, expand one or both arrays by copying elements appropriately so that after this transforma-
tion, the two ndarrays have the same shape. Second, carry out the elementwise operations on the
resulting arrays.
In most cases, we broadcast along an axis where an array initially only has length 1, such as in the
following example:

a = np.arange(3).reshape(3, 1)
b = np.arange(2).reshape(1, 2)
a, b

(array([[0.],
        [1.],
        [2.]]),
 array([[0., 1.]]))

Since a and b are 3 × 1 and 1 × 2 matrices respectively, their shapes do not match up if we want
to add them. We broadcast the entries of both matrices into a larger 3 × 2 matrix as follows: for
matrix a it replicates the columns and for matrix b it replicates the rows before adding up both
elementwise.

a + b

array([[0., 1.],
       [1., 2.],
       [2., 3.]])

2.1.4 Indexing and Slicing

Just as in any other Python array, elements in an ndarray can be accessed by index. As in any
Python array, the first element has index 0, and ranges include the first element but exclude the
last. As in standard Python lists, we can access elements by their position relative to the end of
the list using negative indices.
Thus, [-1] selects the last element and [1:3] selects the second and the third elements as follows:

x[-1], x[1:3]

(array([ 8., 9., 10., 11.]), array([[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]]))

Beyond reading, we can also write elements of a matrix by specifying indices.

x[1, 2] = 9
x

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  9.,  7.],
       [ 8.,  9., 10., 11.]])

If we want to assign multiple elements the same value, we simply index all of them and then assign
them the value. For instance, [0:2, :] accesses the first and second rows, where : takes all the
elements along axis 1 (column). While we discussed indexing for matrices, this obviously also
works for vectors and for tensors of more than 2 dimensions.

x[0:2, :] = 12
x

array([[12., 12., 12., 12.],
       [12., 12., 12., 12.],
       [ 8.,  9., 10., 11.]])

2.1.5 Saving Memory

In the previous example, every time we ran an operation, we allocated new memory to host its
results. For example, if we write y = x + y, we will dereference the ndarray that y used to point to
and instead point y at the newly allocated memory. In the following example, we demonstrate this
with Pythonʼs id() function, which gives us the exact address of the referenced object in memory.

After running y = y + x, we will find that id(y) points to a different location. That is because
Python first evaluates y + x, allocating new memory for the result and then makes y point to this
new location in memory.

before = id(y)
y = y + x
id(y) == before

False

This might be undesirable for two reasons. First, we do not want to run around allocating mem-
ory unnecessarily all the time. In machine learning, we might have hundreds of megabytes of
parameters and update all of them multiple times per second. Typically, we will want to perform
these updates in place. Second, we might point at the same parameters from multiple variables.
If we do not update in place, discarded memory may not be released promptly, and parts of our
code might inadvertently reference stale parameters.
Fortunately, performing in-place operations in MXNet is easy. We can assign the result of an op-
eration to a previously allocated array with slice notation, e.g., y[:] = <expression>. To illustrate
this concept, we first create a new matrix z with the same shape as y, using zeros_like to
allocate a block of 0 entries.

z = np.zeros_like(y)
print('id(z):', id(z))
z[:] = x + y
print('id(z):', id(z))

id(z): 140561335411888
id(z): 140561335411888

If the value of x is not reused in subsequent computations, we can also use x[:] = x + y or x +=
y to reduce the memory overhead of the operation.

before = id(x)
x += y
id(x) == before

True
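The slice-notation form behaves the same way. As a small sketch mirroring the snippet above, we can overwrite x in place via x[:] and confirm that its memory address is unchanged:

before = id(x)
x[:] = x + y  # write the result into the memory x already occupies
id(x) == before  # True: no new memory was allocated for x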

2.1.6 Conversion to Other Python Objects

Converting an MXNet ndarray to a NumPy ndarray, or vice versa, is easy. The converted result
does not share memory. This minor inconvenience is actually quite important: when you perform
operations on the CPU or on GPUs, you do not want MXNet to halt computation, waiting to see
whether the NumPy package of Python might want to be doing something else with the same
chunk of memory. The array and asnumpy functions do the trick.

46 Chapter 2. Preliminaries
a = x.asnumpy()
b = np.array(a)
type(a), type(b)

(numpy.ndarray, mxnet.numpy.ndarray)

To convert a size-1 ndarray to a Python scalar, we can invoke the item function or Pythonʼs built-in
functions.

a = np.array([3.5])
a, a.item(), float(a), int(a)

(array([3.5]), 3.5, 3.5, 3)

Summary

• MXNetʼs ndarray is an extension to NumPyʼs ndarray with a few killer advantages that make
it suitable for deep learning.
• MXNetʼs ndarray provides a variety of functionalities including basic mathematics opera-
tions, broadcasting, indexing, slicing, memory saving, and conversion to other Python ob-
jects.

Exercises

1. Run the code in this section. Change the conditional statement x == y in this section to x <
y or x > y, and then see what kind of ndarray you can get.
2. Replace the two ndarrays that operate by element in the broadcasting mechanism with other
shapes, e.g., three dimensional tensors. Is the result the same as expected?

2.2 Data Preprocessing

So far we have introduced a variety of techniques for manipulating data that are already stored
in ndarrays. To apply deep learning to real-world problems, we often begin by preprocessing
raw data, rather than data nicely prepared in the ndarray format. Among popular
data analytic tools in Python, the pandas package is commonly used. Like many other extension
packages in the vast ecosystem of Python, pandas can work together with ndarray. So, we will
briefly walk through steps for preprocessing raw data with pandas and converting them into the
ndarray format. We will cover more data preprocessing techniques in later chapters.


2.2.1 Reading the Dataset

As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated
values) file. Data stored in other formats may be processed in similar ways.

# Write the dataset row by row into a csv file
data_file = '../data/house_tiny.csv'
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row is a data point
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

To load the raw dataset from the created csv file, we import the pandas package and invoke the
read_csv function. This dataset has 4 rows and 3 columns, where each row describes the number
of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.

# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000

2.2.2 Handling Missing Data

Note that “NaN” entries are missing values. To handle missing data, typical methods include im-
putation and deletion, where imputation replaces missing values with substituted ones, while dele-
tion ignores missing values. Here we will consider imputation.
By integer-location based indexing (iloc), we split data into inputs and outputs, where the former
takes the first two columns while the latter only keeps the last column. For numerical values in inputs
that are missing, we replace the “NaN” entries with the mean value of the same column.

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN

For categorical or discrete values in inputs, we consider “NaN” as a category. Since the “Alley”
column only takes 2 types of categorical values “Pave” and “NaN”, pandas can automatically con-
vert this column to 2 columns “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave” will
set values of “Alley_Pave” and “Alley_nan” to 1 and 0. A row with a missing alley type will set their
values to 0 and 1.

inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
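Deletion, the other method mentioned above, is not demonstrated in this section. As a minimal sketch, pandas provides the dropna function for this purpose. Note that on this tiny dataset every row contains at least one missing value, so row-wise deletion leaves nothing, while column-wise deletion keeps only the “Price” column.

# A minimal sketch of the deletion strategy, applied to the raw dataset read above
print(data.dropna())        # drop rows that contain any missing value
print(data.dropna(axis=1))  # drop columns that contain any missing value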

2.2.3 Conversion to the ndarray Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the ndar-
ray format. Once data are in this format, they can be further manipulated with those ndarray
functionalities that we have introduced in Section 2.1.

from mxnet import np

X, y = np.array(inputs.values), np.array(outputs.values)
X, y

(array([[3., 1., 0.],
        [2., 0., 1.],
        [4., 0., 1.],
        [3., 0., 1.]], dtype=float64),
 array([127500, 106000, 178100, 140000], dtype=int64))

Summary

• Like many other extension packages in the vast ecosystem of Python, pandas can work to-
gether with ndarray.
• Imputation and deletion can be used to handle missing data.

Exercises

Create a raw dataset with more rows and columns.

1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the ndarray format.


2.3 Linear Algebra

Now that you can store and manipulate data, letʼs briefly review the subset of basic linear algebra
that you will need to understand and implement most of the models covered in this book. Below, we
introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing
each both through mathematical notation and the corresponding implementation in code.

2.3.1 Scalars

If you never studied linear algebra or machine learning, then your past experience with math
probably consisted of thinking about one number at a time. And, if you ever balanced a check-
book or even paid for dinner at a restaurant then you already know how to do basic things like
adding and multiplying pairs of numbers. For example, the temperature in Palo Alto is 52 de-
grees Fahrenheit. Formally, we call values consisting of just one numerical quantity scalars. If
you wanted to convert this value to Celsius (the metric systemʼs more sensible temperature scale),
you would evaluate the expression c = 5/9(f − 32), setting f to 52. In this equation, each of the
terms 5, 9, and 32 is a scalar value. The placeholders c and f are called variables and they represent
unknown scalar values.
In this book, we adopt the mathematical notation where scalar variables are denoted by ordinary
lower-cased letters (e.g., x, y, and z). We denote the space of all (continuous) real-valued scalars
by R. For expedience, we will punt on a rigorous definition of what precisely a space is, but just
remember for now that the expression x ∈ R is a formal way to say that x is a real-valued scalar.
The symbol ∈ can be pronounced “in” and simply denotes membership in a set. Analogously, we
could write x, y ∈ {0, 1} to state that x and y are numbers whose value can only be 0 or 1.
In MXNet code, a scalar is represented by an ndarray with just one element. In the next snippet,
we instantiate two scalars and perform some familiar arithmetic operations with them, namely
addition, multiplication, division, and exponentiation.

from mxnet import np, npx
npx.set_np()

x = np.array(3.0)
y = np.array(2.0)

x + y, x * y, x / y, x ** y

(array(5.), array(6.), array(1.5), array(9.))

2.3.2 Vectors

You can think of a vector as simply a list of scalar values. We call these values the elements (entries
or components) of the vector. When our vectors represent examples from our dataset, their values
hold some real-world significance. For example, if we were training a model to predict the risk that
a loan defaults, we might associate each applicant with a vector whose components correspond
to their income, length of employment, number of previous defaults, and other factors. If we
were studying the risk of heart attacks that hospital patients potentially face, we might represent each
patient by a vector whose components capture their most recent vital signs, cholesterol levels,
minutes of exercise per day, etc. In math notation, we will usually denote vectors as bold-faced,
lower-cased letters (e.g., x, y, and z).
In MXNet, we work with vectors via 1-dimensional ndarrays. In general ndarrays can have arbi-
trary lengths, subject to the memory limits of your machine.

x = np.arange(4)
x

array([0., 1., 2., 3.])

We can refer to any element of a vector by using a subscript. For example, we can refer to the ith
element of x by xi . Note that the element xi is a scalar, so we do not bold-face the font when refer-
ring to it. Extensive literature considers column vectors to be the default orientation of vectors,
and so does this book. In math, a vector x can be written as

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \tag{2.3.1}$$

where x1 , . . . , xn are elements of the vector. In code, we access any element by indexing into the
ndarray.

x[3]

array(3.)

Length, Dimensionality, and Shape

Letʼs revisit some concepts from Section 2.1. A vector is just an array of numbers. And just as every
array has a length, so does every vector. In math notation, if we want to say that a vector x consists
of n real-valued scalars, we can express this as x ∈ Rn . The length of a vector is commonly called
the dimension of the vector.
As with an ordinary Python array, we can access the length of an ndarray by calling Pythonʼs built-
in len() function.

len(x)

4

When an ndarray represents a vector (with precisely one axis), we can also access its length via
the .shape attribute. The shape is a tuple that lists the length (dimensionality) along each axis of
the ndarray. For ndarrays with just one axis, the shape has just one element.

x.shape

(4,)

Note that the word “dimension” tends to get overloaded in these contexts and this tends to confuse
people. To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the
number of elements of a vector or an axis. However, we use the dimensionality of an ndarray to
refer to the number of axes that an ndarray has. In this sense, the dimensionality of some axis
of an ndarray will be the length of that axis.

2.3.3 Matrices

Just as vectors generalize scalars from order 0 to order 1, matrices generalize vectors from order
1 to order 2. Matrices, which we will typically denote with bold-faced, capital letters (e.g., X, Y,
and Z), are represented in code as ndarrays with 2 axes.
In math notation, we use A ∈ Rm×n to express that the matrix A consists of m rows and n columns
of real-valued scalars. Visually, we can illustrate any matrix A ∈ Rm×n as a table, where each
element aij belongs to the ith row and jth column:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.2}$$

For any A ∈ Rm×n, the shape of A is (m, n) or m × n. When a matrix has the same number of
rows and columns, it is called a square matrix.
We can create an m × n matrix in MXNet by specifying a shape with two components m and n
when calling any of our favorite functions for instantiating an ndarray.

A = np.arange(20).reshape(5, 4)
A

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.],
       [16., 17., 18., 19.]])

We can access the scalar element aij of a matrix A in (2.3.2) by specifying the indices for the row
(i) and column (j), such as [A]ij . When the scalar elements of a matrix A, such as in (2.3.2), are not
given, we may simply use the lower-case letter of the matrix A with the index subscript, aij, to refer
to [A]ij. To keep notation simple, commas are inserted to separate indices only when necessary,
such as a2,3j and [A]2i−1,3 .
Sometimes, we want to flip the axes. When we exchange a matrixʼs rows and columns, the result is
called the transpose of the matrix. Formally, we signify a matrix Aʼs transpose by A⊤ and if B = A⊤ ,
then bij = aji for any i and j. Thus, the transpose of A in (2.3.2) is an n × m matrix:

$$\mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.3}$$
In code, we access a matrixʼs transpose via the T attribute.

A.T

array([[ 0.,  4.,  8., 12., 16.],
       [ 1.,  5.,  9., 13., 17.],
       [ 2.,  6., 10., 14., 18.],
       [ 3.,  7., 11., 15., 19.]])

As a special type of the square matrix, a symmetric matrix A is equal to its transpose: A = A⊤ .

B = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B

array([[1., 2., 3.],
       [2., 0., 4.],
       [3., 4., 5.]])

B == B.T

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Matrices are useful data structures: they allow us to organize data that have different modalities
of variation. For example, rows in our matrix might correspond to different houses (data points),
while columns might correspond to different attributes. This should sound familiar if you have
ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of
a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conven-
tional to treat each data point as a row vector in the matrix. And, as we will see in later chapters,
this convention will enable common deep learning practices. For example, along the outermost
axis of an ndarray, we can access or enumerate minibatches of data points, or just data points if
no minibatch exists.

2.3.4 Tensors

Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures
with even more axes. Tensors give us a generic way of describing ndarrays with an arbitrary
number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Tensors are denoted with capital letters of a special font face (e.g., X, Y, and Z) and their indexing
mechanism (e.g., xijk and [X]1,2i−1,3 ) is similar to that of matrices.
Tensors will become more important when we start working with images, which arrive as ndarrays
with 3 axes corresponding to the height, width, and a channel axis for stacking the color channels
(red, green, and blue). For now, we will skip over higher order tensors and focus on the basics.

X = np.arange(24).reshape(2, 3, 4)
X

array([[[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]],

       [[12., 13., 14., 15.],
        [16., 17., 18., 19.],
        [20., 21., 22., 23.]]])

2.3.5 Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and tensors of an arbitrary number of axes have some nice properties
that often come in handy. For example, you might have noticed from the definition of an elemen-
twise operation that any elementwise unary operation does not change the shape of its operand.
Similarly, given any two tensors with the same shape, the result of any binary elementwise oper-
ation will be a tensor of that same shape. For example, adding two matrices of the same shape
performs elementwise addition over these two matrices.

A = np.arange(20).reshape(5, 4)
B = A.copy()  # Assign a copy of A to B by allocating new memory
A, A + B

(array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]]),
 array([[ 0.,  2.,  4.,  6.],
        [ 8., 10., 12., 14.],
        [16., 18., 20., 22.],
        [24., 26., 28., 30.],
        [32., 34., 36., 38.]]))

Specifically, elementwise multiplication of two matrices is called their Hadamard product (math
notation ⊙). Consider matrix B ∈ Rm×n whose element of row i and column j is bij . The Hadamard
product of matrices A (defined in (2.3.2)) and B is

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix}. \tag{2.3.4}$$

A * B

array([[  0.,   1.,   4.,   9.],
       [ 16.,  25.,  36.,  49.],
       [ 64.,  81., 100., 121.],
       [144., 169., 196., 225.],
       [256., 289., 324., 361.]])

Multiplying a tensor by a scalar or adding a scalar to a tensor also does not change the shape of
the tensor; each element of the operand tensor is multiplied by or added to the scalar.

a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(array([[[ 2.,  3.,  4.,  5.],
         [ 6.,  7.,  8.,  9.],
         [10., 11., 12., 13.]],

        [[14., 15., 16., 17.],
         [18., 19., 20., 21.],
         [22., 23., 24., 25.]]]), (2, 3, 4))

2.3.6 Reduction

One useful operation that we can perform with arbitrary tensors is to calculate the sum of their
elements. In mathematical notation, we express sums using the $\sum$ symbol. To express the sum of
the elements in a vector x of length d, we write $\sum_{i=1}^{d} x_i$. In code, we can just call the sum function.

x = np.arange(4)
x, x.sum()

(array([0., 1., 2., 3.]), array(6.))

We can express sums over the elements of tensors of arbitrary shape. For example, the sum of the
elements of an m × n matrix A could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.

A.shape, A.sum()

((5, 4), array(190.))

By default, invoking the sum function reduces a tensor along all its axes to a scalar. We can also
specify the axes along which the tensor is reduced via summation. Take matrices as an example.
To reduce the row dimension (axis 0) by summing up elements of all the rows, we specify axis=0
when invoking sum. Since the input matrix reduces along axis 0 to generate the output vector, the
dimension of axis 0 of the input is lost in the output shape.


A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape

(array([40., 45., 50., 55.]), (4,))

Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of all the
columns. Thus, the dimension of axis 1 of the input is lost in the output shape.

A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape

(array([ 6., 22., 38., 54., 70.]), (5,))

Reducing a matrix along both rows and columns via summation is equivalent to summing up all
the elements of the matrix.

A.sum(axis=[0, 1])  # Same as A.sum()

array(190.)

A related quantity is the mean, which is also called the average. We calculate the mean by dividing
the sum by the total number of elements. In code, we could just call mean on tensors of arbitrary
shape.

A.mean(), A.sum() / A.size

(array(9.5), array(9.5))

Like sum, mean can also reduce a tensor along the specified axes.

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(array([ 8., 9., 10., 11.]), array([ 8., 9., 10., 11.]))

Non-Reduction Sum

However, sometimes it can be useful to keep the number of axes unchanged when invoking sum
or mean by setting keepdims=True.

sum_A = A.sum(axis=1, keepdims=True)
sum_A

array([[ 6.],
       [22.],
       [38.],
       [54.],
       [70.]])

For instance, since sum_A still keeps its 2 axes after summing each row, we can divide A by sum_A
with broadcasting.

A / sum_A

array([[0.        , 0.16666667, 0.33333334, 0.5       ],
       [0.18181819, 0.22727273, 0.27272728, 0.3181818 ],
       [0.21052632, 0.23684211, 0.2631579 , 0.28947368],
       [0.22222222, 0.24074075, 0.25925925, 0.2777778 ],
       [0.22857143, 0.24285714, 0.25714287, 0.27142859]])

If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by
row), we can call the cumsum function. This function will not reduce the input tensor along any
axis.

A.cumsum(axis=0)

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  6.,  8., 10.],
       [12., 15., 18., 21.],
       [24., 28., 32., 36.],
       [40., 45., 50., 55.]])

2.3.7 Dot Products

So far, we have only performed elementwise operations, sums, and averages. And if this was all
we could do, linear algebra probably would not deserve its own section. However, one of the most
fundamental operations is the dot product. Given two vectors x, y ∈ Rd, their dot product x⊤y (or
⟨x, y⟩) is a sum over the products of the elements at the same position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.

y = np.ones(4)
x, y, np.dot(x, y)

(array([0., 1., 2., 3.]), array([1., 1., 1., 1.]), array(6.))

Note that we can express the dot product of two vectors equivalently by performing an element-
wise multiplication and then a sum:

np.sum(x * y)

array(6.)

Dot products are useful in a wide range of contexts. For example, given some set of values, denoted
by a vector x ∈ Rd, and a set of weights denoted by w ∈ Rd, the weighted sum of the values in x
according to the weights w could be expressed as the dot product x⊤w. When the weights are
non-negative and sum to one (i.e., $\sum_{i=1}^{d} w_i = 1$), the dot product expresses a weighted average.
After normalizing two vectors to have unit length, the dot product expresses the cosine of the
angle between them. We will formally introduce this notion of length later in this section.
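To make the cosine claim concrete, here is a small sketch (the vector values are chosen purely for illustration) that uses np.linalg.norm, formally introduced later in this section, to normalize two vectors before taking their dot product:

u = np.array([1., 0.])
v = np.array([1., 1.])
np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine of 45 degrees, about 0.7071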

2.3.8 Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand matrix-vector prod-
ucts. Recall the matrix A ∈ Rm×n and the vector x ∈ Rn defined and visualized in (2.3.2) and (2.3.1)
respectively. Letʼs start off by visualizing the matrix A in terms of its row vectors
 ⊤
a1
 a⊤ 
 2
A =  . , (2.3.5)
 .. 
a⊤
m

where each a⊤i ∈ R is a row vector representing the i row of the matrix A. The matrix-vector
n th

product Ax is simply a column vector of length m, whose ith element is the dot product a⊤
i x:
 ⊤  ⊤ 
a1 a1 x
 a⊤   a⊤ x 
 2  2 
Ax =  .  x =  .  . (2.3.6)
 . 
.  .. 
a⊤m a⊤mx

We can think of multiplication by a matrix A ∈ Rm×n as a transformation that projects vectors
from Rn to Rm. These transformations turn out to be remarkably useful. For example, we can
represent rotations as multiplications by a square matrix. As we will see in subsequent chapters,
we can also use matrix-vector products to describe the most intensive calculations required when
computing each layer in a neural network given the values of the previous layer.
Expressing matrix-vector products in code with ndarrays, we use the same dot function as for dot
products. When we call np.dot(A, x) with a matrix A and a vector x, the matrix-vector product is
performed. Note that the column dimension of A (its length along axis 1) must be the same as the
dimension of x (its length).

A.shape, x.shape, np.dot(A, x)

((5, 4), (4,), array([ 14., 38., 62., 86., 110.]))

2.3.9 Matrix-Matrix Multiplication

If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multi-
plication should be straightforward.
Say that we have two matrices A ∈ Rn×k and B ∈ Rk×m:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{bmatrix}. \tag{2.3.7}$$

Denote by $\mathbf{a}_i^\top \in \mathbb{R}^k$ the row vector representing the ith row of the matrix A, and let $\mathbf{b}_j \in \mathbb{R}^k$ be the
column vector from the jth column of the matrix B. To produce the matrix product C = AB, it is
easiest to think of A in terms of its row vectors and B in terms of its column vectors:

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix}. \tag{2.3.8}$$

Then the matrix product C ∈ Rn×m is produced as we simply compute each element $c_{ij}$ as the dot
product $\mathbf{a}_i^\top \mathbf{b}_j$:

$$\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^\top \mathbf{b}_1 & \mathbf{a}_1^\top \mathbf{b}_2 & \cdots & \mathbf{a}_1^\top \mathbf{b}_m \\ \mathbf{a}_2^\top \mathbf{b}_1 & \mathbf{a}_2^\top \mathbf{b}_2 & \cdots & \mathbf{a}_2^\top \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_n^\top \mathbf{b}_1 & \mathbf{a}_n^\top \mathbf{b}_2 & \cdots & \mathbf{a}_n^\top \mathbf{b}_m \end{bmatrix}. \tag{2.3.9}$$

We can think of the matrix-matrix multiplication AB as simply performing m matrix-vector
products and stitching the results together to form an n × m matrix. Just as with ordinary dot products
and matrix-vector products, we can compute matrix-matrix multiplication by using the dot func-
tion. In the following snippet, we perform matrix multiplication on A and B. Here, A is a matrix
with 5 rows and 4 columns, and B is a matrix with 4 rows and 3 columns. After multiplication, we
obtain a matrix with 5 rows and 3 columns.

B = np.ones(shape=(4, 3))
np.dot(A, B)

array([[ 6.,  6.,  6.],
       [22., 22., 22.],
       [38., 38., 38.],
       [54., 54., 54.],
       [70., 70., 70.]])

Matrix-matrix multiplication can be simply called matrix multiplication, and should not be con-
fused with the Hadamard product.

2.3.10 Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector
tells us how big a vector is. The notion of size under consideration here concerns not dimension-
ality but rather the magnitude of the components.
In linear algebra, a vector norm is a function f that maps a vector to a scalar, satisfying a handful
of properties. Given any vector x, the first property says that if we scale all the elements of a vector
by a constant factor α, its norm also scales by the absolute value of the same constant factor:

f (αx) = |α|f (x). (2.3.10)

The second property is the familiar triangle inequality:

f (x + y) ≤ f (x) + f (y). (2.3.11)


The third property simply says that the norm must be non-negative:

f (x) ≥ 0. (2.3.12)

That makes sense, as in most contexts the smallest size for anything is 0. The final property re-
quires that the smallest norm is achieved and only achieved by a vector consisting of all zeros.

∀i, [x]i = 0 ⇔ f (x) = 0. (2.3.13)

You might notice that norms sound a lot like measures of distance. And if you remember Euclidean
distances (think Pythagorasʼ theorem) from grade school, then the concepts of non-negativity and
the triangle inequality might ring a bell. In fact, the Euclidean distance is a norm: specifically it is
the ℓ2 norm. Suppose that the elements in the n-dimensional vector x are x1 , . . . , xn . The ℓ2 norm
of x is the square root of the sum of the squares of the vector elements:
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}, \tag{2.3.14}$$

where the subscript 2 is often omitted in ℓ2 norms, i.e., ∥x∥ is equivalent to ∥x∥2 . In code, we can
calculate the ℓ2 norm of a vector by calling np.linalg.norm.

u = np.array([3, -4])
np.linalg.norm(u)

array(5.)
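As a quick numerical spot-check of the properties above (a sketch, not a proof; the scalar and the second vector are arbitrary illustrative choices), we can verify homogeneity and the triangle inequality for the vector u just defined:

alpha = -3.0
v = np.array([1., 2.])
(np.linalg.norm(alpha * u) == abs(alpha) * np.linalg.norm(u),  # homogeneity
 np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v))  # triangle inequality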

In deep learning, we work more often with the squared ℓ2 norm. You will also frequently en-
counter the ℓ1 norm, which is expressed as the sum of the absolute values of the vector elements:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|. \tag{2.3.15}$$

As compared with the ℓ2 norm, it is less influenced by outliers. To calculate the ℓ1 norm, we
compose the absolute value function with a sum over the elements.

np.abs(u).sum()

array(7.)

Both the ℓ2 norm and the ℓ1 norm are special cases of the more general ℓp norm:
$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \tag{2.3.16}$$

Analogous to ℓ2 norms of vectors, the Frobenius norm of a matrix X ∈ Rm×n is the square root of
the sum of the squares of the matrix elements:
$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}. \tag{2.3.17}$$

The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an ℓ2 norm
of a matrix-shaped vector. Invoking np.linalg.norm will calculate the Frobenius norm of a matrix.

np.linalg.norm(np.ones((4, 9)))

array(6.)

Norms and Objectives

While we do not want to get too far ahead of ourselves, we can plant some intuition already about
why these concepts are useful. In deep learning, we are often trying to solve optimization prob-
lems: maximize the probability assigned to observed data; minimize the distance between
predictions and the ground-truth observations; assign vector representations to items (like words,
products, or news articles) such that the distance between similar items is minimized and the
distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most
important components of deep learning algorithms (besides the data), are expressed as norms.

2.3.11 More on Linear Algebra

In just this section, we have taught you all the linear algebra that you will need to understand a
remarkable chunk of modern deep learning. There is a lot more to linear algebra and a lot of
that mathematics is useful for machine learning. For example, matrices can be decomposed into
factors, and these decompositions can reveal low-dimensional structure in real-world datasets.
There are entire subfields of machine learning that focus on using matrix decompositions and
their generalizations to high-order tensors to discover structure in datasets and solve prediction
problems. But this book focuses on deep learning. And we believe you will be much more inclined
to learn more mathematics once you have gotten your hands dirty deploying useful machine learn-
ing models on real datasets. So while we reserve the right to introduce more mathematics much
later on, we will wrap up this section here.
If you are eager to learn more about linear algebra, you may refer to either Section 17.1 or other
excellent resources (Strang, 1993; Kolter, 2008; Petersen et al., 2008).

Summary

• Scalars, vectors, matrices, and tensors are basic mathematical objects in linear algebra.
• Vectors generalize scalars, and matrices generalize vectors.
• In the ndarray representation, scalars, vectors, matrices, and tensors have 0, 1, 2, and an
arbitrary number of axes, respectively.
• A tensor can be reduced along the specified axes by sum and mean.
• Elementwise multiplication of two matrices is called their Hadamard product. It is different
from matrix multiplication.
• In deep learning, we often work with norms such as the ℓ1 norm, the ℓ2 norm, and the Frobe-
nius norm.
• We can perform a variety of operations over scalars, vectors, matrices, and tensors with
ndarray functions.


Exercises

1. Prove that the transpose of a matrix Aʼs transpose is A: (A⊤)⊤ = A.
2. Given two matrices A and B, show that the sum of transposes is equal to the transpose of a
sum: A⊤ + B⊤ = (A + B)⊤.
3. Given any square matrix A, is A + A⊤ always symmetric? Why?
4. We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)?
5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a certain
axis of X? What is that axis?
6. Run A / A.sum(axis=1) and see what happens. Can you analyze the reason?
7. When traveling between two points in Manhattan, what is the distance that you need to cover
in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
8. Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along
axis 0, 1, and 2?
9. Feed a tensor with 3 or more axes to the np.linalg.norm function and observe its output. What
does this function compute for ndarrays of arbitrary shape?

2.4 Calculus

Finding the area of a polygon had remained mysterious until at least 2,500 years ago, when ancient
Greeks divided a polygon into triangles and summed their areas. To find the area of curved shapes,
such as a circle, ancient Greeks inscribed polygons in such shapes. As shown in Fig. 2.4.1, an
inscribed polygon with more sides of equal length better approximates the circle. This process is
also known as the method of exhaustion.

Fig. 2.4.1: Find the area of a circle with the method of exhaustion.

In fact, the method of exhaustion is where integral calculus (to be described in Section 17.5)
originates from. More than 2,000 years later, the other branch of calculus, differential calculus, was
invented. Among the most critical applications of differential calculus, optimization problems

consider how to do something the best. As discussed in Section 2.3.10, such problems are ubiqui-
tous in deep learning.
In deep learning, we train models, updating them successively so that they get better and better
as they see more and more data. Usually, getting better means minimizing a loss function, a score
that answers the question “how bad is our model?” This question is more subtle than it appears.
Ultimately, what we really care about is producing a model that performs well on data that we have
never seen before. But we can only fit the model to data that we can actually see. Thus we can
decompose the task of fitting models into two key concerns: i) optimization: the process of fitting
our models to observed data; ii) generalization: the mathematical principles and practitionersʼ
wisdom that guide us as to how to produce models whose validity extends beyond the exact set of
data points used to train them.
To help you understand optimization problems and methods in later chapters, here we give a very
brief primer on differential calculus that is commonly used in deep learning.

2.4.1 Derivatives and Differentiation

We begin by addressing the calculation of derivatives, a crucial step in nearly all deep learning
optimization algorithms. In deep learning, we typically choose loss functions that are differen-
tiable with respect to our modelʼs parameters. Put simply, this means that for each parameter,
we can determine how rapidly the loss would increase or decrease, were we to increase or decrease
that parameter by an infinitesimally small amount.
Suppose that we have a function f : R → R, whose input and output are both scalars. The derivative
of f is defined as

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, \tag{2.4.1}$$
if this limit exists. If f ′ (a) exists, f is said to be differentiable at a. If f is differentiable at every
number of an interval, then this function is differentiable on this interval. We can interpret the
derivative f ′ (x) in (2.4.1) as the instantaneous rate of change of f (x) with respect to x. The so-called
instantaneous rate of change is based on the variation h in x, which approaches 0.
To illustrate derivatives, letʼs experiment with an example. Define u = f(x) = 3x² − 4x.

%matplotlib inline
import d2l
from IPython import display
from mxnet import np, npx
npx.set_np()

def f(x):
    return 3 * x ** 2 - 4 * x

By setting x = 1 and letting h approach 0, the numerical result of $\frac{f(x+h) - f(x)}{h}$ in (2.4.1) approaches
2. Though this experiment is not a mathematical proof, we will see later that the derivative u′ is 2
when x = 1.

def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h

h = 0.1
for i in range(5):
    print('h=%.5f, numerical limit=%.5f' % (h, numerical_lim(f, 1, h)))
    h *= 0.1

h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003

Letʼs familiarize ourselves with a few equivalent notations for derivatives. Given y = f (x), where
x and y are the independent variable and the dependent variable of the function f , respectively.
The following expressions are equivalent:
$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x), \tag{2.4.2}$$

where the symbols $\frac{d}{dx}$ and $D$ are differentiation operators that indicate the operation of differentiation. We
can use the following rules to differentiate common functions:
• DC = 0 (C is a constant),
• $Dx^n = nx^{n-1}$ (the power rule, n is any real number),
• $De^x = e^x$,
• $D\ln(x) = 1/x$.
To differentiate a function that is formed from a few simpler functions such as the above com-
mon functions, the following rules can be handy for us. Suppose that functions f and g are both
differentiable and C is a constant; then we have the constant multiple rule

$$\frac{d}{dx}[Cf(x)] = C \frac{d}{dx} f(x), \tag{2.4.3}$$

the sum rule

$$\frac{d}{dx}[f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x), \tag{2.4.4}$$

the product rule

$$\frac{d}{dx}[f(x)g(x)] = f(x) \frac{d}{dx}[g(x)] + g(x) \frac{d}{dx}[f(x)], \tag{2.4.5}$$

and the quotient rule

$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{g(x) \frac{d}{dx}[f(x)] - f(x) \frac{d}{dx}[g(x)]}{[g(x)]^2}. \tag{2.4.6}$$

Now we can apply a few of the above rules to find $u' = f'(x) = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4$. Thus, by
setting x = 1, we have u′ = 2: this is supported by our earlier experiment in this section where
the numerical result approaches 2. This derivative is also the slope of the tangent line to the curve
u = f(x) when x = 1.

To visualize such an interpretation of derivatives, we will use matplotlib, a popular plotting li-
brary in Python. To configure properties of the figures produced by matplotlib, we need to define
a few functions. In the following, the use_svg_display function specifies the matplotlib pack-
age to output the svg figures for sharper images. The comment # Saved in the d2l package for
later use is a special mark where the following function, class, or import statements are also
saved in the d2l package so that we can directly invoke d2l.use_svg_display() later.

# Saved in the d2l package for later use
def use_svg_display():
    """Use the svg format to display a plot in Jupyter."""
    display.set_matplotlib_formats('svg')

We define the set_figsize function to specify the figure sizes. Note that here we directly use
d2l.plt since the import statement from matplotlib import pyplot as plt has been marked for
being saved in the d2l package in the preface.

# Saved in the d2l package for later use
def set_figsize(figsize=(3.5, 2.5)):
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize

The following set_axes function sets properties of axes of figures produced by matplotlib.

# Saved in the d2l package for later use
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

With these 3 functions for figure configurations, we define the plot function to plot multiple
curves succinctly since we will need to visualize many curves throughout the book.

# Saved in the d2l package for later use
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=['-', 'm--', 'g-.', 'r:'], figsize=(3.5, 2.5), axes=None):
    """Plot data points."""
    d2l.set_figsize(figsize)
    axes = axes if axes else d2l.plt.gca()

    # Return True if X (ndarray or list) has 1 axis
    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)

Now we can plot the function u = f (x) and its tangent line y = 2x−3 at x = 1, where the coefficient
2 is the slope of the tangent line.

x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])

2.4.2 Partial Derivatives

So far we have dealt with the differentiation of functions of just one variable. In deep learning,
functions often depend on many variables. Thus, we need to extend the ideas of differentiation to
these multivariate functions.
Let y = f (x1 , x2 , . . . , xn ) be a function with n variables. The partial derivative of y with respect to
its ith parameter xi is

$$\frac{\partial y}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}. \tag{2.4.7}$$
To calculate $\frac{\partial y}{\partial x_i}$, we can simply treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the
derivative of y with respect to $x_i$. For notation of partial derivatives, the following are equivalent:

$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f. \tag{2.4.8}$$
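Although this section gives no code for partial derivatives, the definition in (2.4.7) translates directly into a numerical sketch; the two-variable function g and the step size h below are illustrative assumptions, not part of the original text:

# A numerical sketch of a partial derivative (illustrative example)
def g(x1, x2):
    return 3 * x1 ** 2 + 5 * x2  # an arbitrary function of two variables

h = 1e-4
x1, x2 = 1.0, 2.0
print((g(x1 + h, x2) - g(x1, x2)) / h)  # hold x2 fixed; close to 6 * x1 = 6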

2.4.3 Gradients

We can concatenate partial derivatives of a multivariate function with respect to all its variables
to obtain the gradient vector of the function. Suppose that the input of function f : Rn → R is an
n-dimensional vector x = [x1 , x2 , . . . , xn ]⊤ and the output is a scalar. The gradient of the function
f (x) with respect to x is a vector of n partial derivatives:
$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[ \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n} \right]^\top, \tag{2.4.9}$$

where ∇x f (x) is often replaced by ∇f (x) when there is no ambiguity.


Let A be an m × n matrix and x an n-dimensional vector; the following rules are often used when
differentiating multivariate functions:
• ∇x Ax = A⊤,
• ∇x x⊤A = A,
• ∇x x⊤Ax = (A + A⊤)x,
• ∇x ∥x∥² = ∇x x⊤x = 2x (spot-checked numerically in the sketch below).
Similarly, for any matrix X, we have ∇X ∥X∥²F = 2X. As we will see later, gradients are useful for
designing optimization algorithms in deep learning.
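As a quick sanity check of the last vector rule (a finite-difference sketch, not a proof; all values are illustrative), we can approximate the gradient of x⊤x at a point and compare it with 2x:

# Numerically spot-check that the gradient of x^T x is 2x
x0 = [1.0, 2.0, 3.0]
h = 1e-4

def sq_sum(v):
    return sum(vi ** 2 for vi in v)

for i in range(len(x0)):
    bumped = [vi + (h if j == i else 0.0) for j, vi in enumerate(x0)]
    print((sq_sum(bumped) - sq_sum(x0)) / h)  # approximately 2 * x0[i]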

2.4.4 Chain Rule

However, such gradients can be hard to find. This is because multivariate functions in deep learn-
ing are often composite, so we may not apply any of the aforementioned rules to differentiate these
functions. Fortunately, the chain rule enables us to differentiate composite functions.
Letʼs first consider functions of a single variable. Suppose that functions y = f (u) and u = g(x)
are both differentiable; then the chain rule states that

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. \tag{2.4.10}$$
Now letʼs turn our attention to a more general scenario where functions have an arbitrary number
of variables. Suppose that the differentiable function y has variables u1 , u2 , . . . , um , where each
differentiable function ui has variables x1 , x2 , . . . , xn . Note that y is a function of x1 , x2 , . . . , xn .
Then the chain rule gives

$$\frac{dy}{dx_i} = \frac{dy}{du_1} \frac{du_1}{dx_i} + \frac{dy}{du_2} \frac{du_2}{dx_i} + \cdots + \frac{dy}{du_m} \frac{du_m}{dx_i} \tag{2.4.11}$$

for any i = 1, 2, . . . , n.
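To see the single-variable chain rule in action, the following sketch (the composite function is an illustrative choice) checks dy/dx for y = sin(x²), i.e., y = sin(u) with u = x², against the chain-rule prediction cos(x²) · 2x:

import math

x0, h = 1.5, 1e-6
numerical = (math.sin((x0 + h) ** 2) - math.sin(x0 ** 2)) / h
chain_rule = math.cos(x0 ** 2) * 2 * x0
print(numerical, chain_rule)  # the two values should nearly agree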

Summary

• Differential calculus and integral calculus are two branches of calculus, where the former
can be applied to the ubiquitous optimization problems in deep learning.
• A derivative can be interpreted as the instantaneous rate of change of a function with respect
to its variable. It is also the slope of the tangent line to the curve of the function.
• A gradient is a vector whose components are the partial derivatives of a multivariate function
with respect to all its variables.
• The chain rule enables us to differentiate composite functions.

Exercises

1. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and its tangent line when x = 1.
2. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.
3. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$?
4. Can you write out the chain rule for the case where u = f (x, y, z) and x = x(a, b), y = y(a, b),
and z = z(a, b)?

2.5 Automatic Differentiation

As we have explained in Section 2.4, differentiation is a crucial step in nearly all deep learning
optimization algorithms. While the calculations for taking these derivatives are straightforward,
requiring only some basic calculus, for complex models, working out the updates by hand can be
a pain (and often error-prone).
The autograd package expedites this work by automatically calculating derivatives, i.e., automatic
differentiation. And while many other libraries require that we compile a symbolic graph to take
automatic derivatives, autograd allows us to take derivatives while writing ordinary imperative
code. Every time we pass data through our model, autograd builds a graph on the fly, tracking
which data combined through which operations to produce the output. This graph enables auto-
grad to subsequently backpropagate gradients on command. Here, backpropagate simply means
to trace through the computational graph, filling in the partial derivatives with respect to each pa-
rameter.

from mxnet import autograd, np, npx
npx.set_np()

2.5.1 A Simple Example

As a toy example, say that we are interested in differentiating the function y = 2x⊤ x with respect
to the column vector x. To start, letʼs create the variable x and assign it an initial value.

x = np.arange(4)
x

array([0., 1., 2., 3.])

Note that before we even calculate the gradient of y with respect to x, we will need a place to store
it. It is important that we do not allocate new memory every time we take a derivative with respect
to a parameter because we will often update the same parameters thousands or millions of times
and could quickly run out of memory.
Note also that a gradient of a scalar-valued function with respect to a vector x is itself vector-valued
and has the same shape as x. Thus it is intuitive that in code, we will access a gradient taken with
respect to x as an attribute of the ndarray x itself. We allocate memory for an ndarrayʼs gradient
by invoking its attach_grad method.

x.attach_grad()

After we calculate a gradient taken with respect to x, we will be able to access it via the grad at-
tribute. As a safe default, x.grad is initialized as an array containing all zeros. That is sensible
because our most common use case for taking gradient in deep learning is to subsequently update
parameters by adding (or subtracting) the gradient to maximize (or minimize) the differentiated
function. By initializing the gradient to an array of zeros, we ensure that any update accidentally
executed before a gradient has actually been calculated will not alter the parametersʼ value.

x.grad

array([0., 0., 0., 0.])

Now letʼs calculate y. Because we wish to subsequently calculate gradients, we want MXNet to
generate a computational graph on the fly. We could imagine that MXNet would be turning on a
recording device to capture the exact path by which each variable is generated.
Note that building the computational graph requires a nontrivial amount of computation. So
MXNet will only build the graph when explicitly told to do so. We can invoke this behavior by
placing our code inside an autograd.record() scope.

with autograd.record():
    y = 2 * np.dot(x, x)
y

array(28.)

Since x is an ndarray of length 4, np.dot will perform an inner product of x and x, yielding the
scalar output that we assign to y. Next, we can automatically calculate the gradient of y with
respect to each component of x by calling yʼs backward function.


y.backward()

If we recheck the value of x.grad, we will find its contents overwritten by the newly calculated
gradient.

x.grad

array([ 0., 4., 8., 12.])

The gradient of the function y = 2x⊤ x with respect to x should be 4x. Letʼs quickly verify that
our desired gradient was calculated correctly. If the two ndarrays are indeed the same, then the
equality between them holds at every position.

x.grad == 4 * x

array([ True, True, True, True])

If we subsequently compute the gradient of another variable whose value was calculated as a func-
tion of x, the contents of x.grad will be overwritten.

with autograd.record():
    y = x.sum()
y.backward()
x.grad

array([1., 1., 1., 1.])

2.5.2 Backward for Non-Scalar Variables

Technically, when y is not a scalar, the most natural interpretation of the gradient of y (a vector of
length m) with respect to x (a vector of length n) is the Jacobian (an m×n matrix). For higher-order
and higher-dimensional y and x, the Jacobian could be a gnarly high-order tensor.
However, while these more exotic objects do show up in advanced machine learning (including in
deep learning), more often when we are calling backward on a vector, we are trying to calculate
the derivatives of the loss functions for each constituent of a batch of training examples. Here,
our intent is not to calculate the Jacobian but rather the sum of the partial derivatives computed
individually for each example in the batch.
Thus when we invoke backward on a vector-valued variable y, which is a function of x, MXNet
assumes that we want the sum of the gradients. In short, MXNet will create a new scalar variable
by summing the elements in y, and compute the gradient of that scalar variable with respect to x.

with autograd.record():
    y = x * x  # y is a vector
y.backward()

u = x.copy()
u.attach_grad()
with autograd.record():
    v = (u * u).sum()  # v is a scalar
v.backward()

x.grad == u.grad

array([ True, True, True, True])

2.5.3 Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For
example, say that y was calculated as a function of x, and that subsequently z was calculated as a
function of both y and x. Now, imagine that we wanted to calculate the gradient of z with respect
to x, but wanted for some reason to treat y as a constant, and only take into account the role that
x played after y was calculated.
Here, we can call u = y.detach() to return a new variable u that has the same value as y but discards
any information about how y was computed in the computational graph. In other words,
the gradient will not flow backwards through u to x. This will provide the same functionality as
if we had calculated u as a function of x outside of the autograd.record() scope, yielding a u that
will be treated as a constant in any backward call. Thus, the following backward function computes
the partial derivative of z = u * x with respect to x while treating u as a constant, instead of the
partial derivative of z = x * x * x with respect to x.

with autograd.record():
    y = x * x
    u = y.detach()
    z = u * x
z.backward()
x.grad == u

array([ True, True, True, True])

Since the computation of y was recorded, we can subsequently call y.backward() to get the deriva-
tive of y = x * x with respect to x, which is 2 * x.

y.backward()
x.grad == 2 * x

array([ True, True, True, True])

Note that attaching gradients to a variable x implicitly calls x = x.detach(). If x is computed based
on other variables, this part of computation will not be used in the backward function.

y = np.ones(4) * 2
y.attach_grad()
with autograd.record():
    u = x * y
    u.attach_grad()  # Implicitly run u = u.detach()
    z = 5 * u - x
z.backward()
x.grad, u.grad, y.grad

(array([-1., -1., -1., -1.]), array([5., 5., 5., 5.]), array([0., 0., 0., 0.]))

2.5.4 Computing the Gradient of Python Control Flow

One benefit of using automatic differentiation is that even if building the computational graph
of a function required passing through a maze of Python control flow (e.g., conditionals, loops,
and arbitrary function calls), we can still calculate the gradient of the resulting variable. In the
following snippet, note that the number of iterations of the while loop and the evaluation of the
if statement both depend on the value of the input a.

def f(a):
    b = a * 2
    while np.linalg.norm(b) < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Again to compute gradients, we just need to record the calculation and then call the backward
function.

a = np.random.normal()
a.attach_grad()
with autograd.record():
    d = f(a)
d.backward()

We can now analyze the f function defined above. Note that it is piecewise linear in its input a. In
other words, for any a there exists some constant scalar k such that f(a) = k * a, where the value
of k depends on the input a. Consequently d / a allows us to verify that the gradient is correct.

a.grad == d / a

array(True)

2.5.5 Training Mode and Prediction Mode

As we have seen, after we call autograd.record(), MXNet logs the operations in the following block.
There is one more subtle detail to be aware of: autograd.record() will also change the
running mode from prediction mode to training mode. We can verify this behavior by calling the
is_training function.

print(autograd.is_training())
with autograd.record():
    print(autograd.is_training())

False
True

When we get to complicated deep learning models, we will encounter some algorithms where the
model behaves differently during training and when we subsequently use it to make predictions.
We will cover these differences in detail in later chapters.

Summary

• MXNet provides the autograd package to automate the calculation of derivatives. To use it,
we first attach gradients to those variables with respect to which we desire partial deriva-
tives. We then record the computation of our target value, execute its backward function,
and access the resulting gradient via our variableʼs grad attribute.
• We can detach gradients to control the part of the computation that will be used in the back-
ward function.
• The running modes of MXNet include training mode and prediction mode. We can deter-
mine the running mode by calling the is_training function.

Exercises

1. Why is the second derivative much more expensive to compute than the first derivative?
2. After running y.backward(), immediately run it again and see what happens.
3. In the control flow example where we calculate the derivative of d with respect to a, what
would happen if we changed the variable a to a random vector or matrix. At this point, the
result of the calculation f(a) is no longer a scalar. What happens to the result? How do we
analyze this?
4. Redesign an example of finding the gradient of the control flow. Run and analyze the result.
5. Let f(x) = sin(x). Plot f(x) and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that
f′(x) = cos(x).
6. In a second-price auction (such as in eBay or in computational advertising), the winning
bidder pays the second-highest price. Compute the gradient of the final price with respect
to the winning bidderʼs bid using autograd. What does the result tell you about the mecha-
nism? If you are curious to learn more about second-price auctions, check out the paper by
Edelman et al. (Edelman et al., 2007).


2.6 Probability

In some form or another, machine learning is all about making predictions. We might want to
predict the probability of a patient suffering a heart attack in the next year, given their clinical his-
tory. In anomaly detection, we might want to assess how likely a set of readings from an airplaneʼs
jet engine would be, were it operating normally. In reinforcement learning, we want an agent to
act intelligently in an environment. This means we need to think about the probability of getting
a high reward under each of the available actions. And when we build recommender systems we
also need to think about probability. For example, say hypothetically that we worked for a large
online bookseller. We might want to estimate the probability that a particular user would buy
a particular book. For this we need to use the language of probability. Entire courses, majors,
theses, careers, and even departments, are devoted to probability. So naturally, our goal in this
section is not to teach the whole subject. Instead we hope to get you off the ground, to teach you
just enough that you can start building your first deep learning models, and to give you enough of
a flavor for the subject that you can begin to explore it on your own if you wish.
We have already invoked probabilities in previous sections without articulating what precisely
they are or giving a concrete example. Letʼs get more serious now by considering the first case:
distinguishing cats and dogs based on photographs. This might sound simple but it is actually a
formidable challenge. To start with, the difficulty of the problem may depend on the resolution
of the image.

Fig. 2.6.1: Images of varying resolutions (10 × 10, 20 × 20, 40 × 40, 80 × 80, and 160 × 160 pixels).

As shown in Fig. 2.6.1, while it is easy for humans to recognize cats and dogs at the resolution of
160 × 160 pixels, it becomes challenging at 40 × 40 pixels and next to impossible at 10 × 10 pixels.

In other words, our ability to tell cats and dogs apart at a large distance (and thus low resolution)
might approach uninformed guessing. Probability gives us a formal way of reasoning about our
level of certainty. If we are completely sure that the image depicts a cat, we say that the probability
that the corresponding label y is “cat”, denoted P (y = “cat”), equals 1. If we had no evidence to
suggest that y = “cat” or that y = “dog”, then we might say that the two possibilities were equally
likely, expressing this as P (y = “cat”) = P (y = “dog”) = 0.5. If we were reasonably confident, but
not sure that the image depicted a cat, we might assign a probability 0.5 < P (y = “cat”) < 1.
Now consider the second case: given some weather monitoring data, we want to predict the proba-
bility that it will rain in Taipei tomorrow. If it is summertime, the rain might come with probability
0.5.
In both cases, we have some value of interest. And in both cases we are uncertain about the out-
come. But there is a key difference between the two cases. In the first case, the image is in fact
either a dog or a cat, and we just do not know which. In the second case, the outcome may actu-
ally be a random event, if you believe in such things (and most physicists do). So probability is a
flexible language for reasoning about our level of certainty, and it can be applied effectively in a
broad set of contexts.

2.6.1 Basic Probability Theory

Say that we cast a die and want to know what the chance is of seeing a 1 rather than another digit.
If the die is fair, all the 6 outcomes {1, . . . , 6} are equally likely to occur, and thus we would see a
1 in one out of six cases. Formally we state that 1 occurs with probability 1/6.
For a real die that we receive from a factory, we might not know those proportions and we would
need to check whether it is tainted. The only way to investigate the die is by casting it many times
and recording the outcomes. For each cast of the die, we will observe a value in {1, . . . , 6}. Given
these outcomes, we want to investigate the probability of observing each outcome.
One natural approach for each value is to take the individual count for that value and to divide it
by the total number of tosses. This gives us an estimate of the probability of a given event. The law
of large numbers tells us that as the number of tosses grows, this estimate will draw closer and closer
to the true underlying probability. Before going into the details of what is going on here, letʼs try it
out.
To start, letʼs import the necessary packages.

%matplotlib inline
import d2l
from mxnet import np, npx
import random
npx.set_np()

Next, we will want to be able to cast the die. In statistics we call this process of drawing ex-
amples from probability distributions sampling. The distribution that assigns probabilities to a
number of discrete choices is called the multinomial distribution. We will give a more formal def-
inition of distribution later, but at a high level, think of it as just an assignment of probabilities
to events. In MXNet, we can sample from the multinomial distribution via the aptly named
np.random.multinomial function. The function can be called in many ways, but we will focus on the
simplest. To draw a single sample, we simply pass in a vector of probabilities. The output of the
np.random.multinomial function is another vector of the same length: its value at index i is the
number of times the sampling outcome corresponds to i.

fair_probs = [1.0 / 6] * 6
np.random.multinomial(1, fair_probs)

array([0, 0, 0, 1, 0, 0], dtype=int64)

If you run the sampler a bunch of times, you will find that you get out random values each time.
As with estimating the fairness of a die, we often want to generate many samples from the same
distribution. It would be unbearably slow to do this with a Python for loop, so np.random.multinomial
supports drawing multiple samples at once, returning an array of independent samples in any
shape we might desire.

np.random.multinomial(10, fair_probs)

array([1, 1, 5, 1, 1, 1], dtype=int64)

We can also conduct, say 3, groups of experiments, where each group draws 10 samples, all at
once.

counts = np.random.multinomial(10, fair_probs, size=3)
counts

array([[1, 2, 1, 2, 4, 0],
[3, 2, 2, 1, 0, 2],
[1, 2, 1, 3, 1, 2]], dtype=int64)

Now that we know how to sample rolls of a die, we can simulate 1000 rolls. We can then go through
and count, after each of the 1000 rolls, how many times each number was rolled. Specifically, we
calculate the relative frequency as the estimate of the true probability.

# Store the results as 32-bit floats for division
counts = np.random.multinomial(1000, fair_probs).astype(np.float32)
counts / 1000  # Relative frequency as the estimate

array([0.164, 0.153, 0.181, 0.163, 0.163, 0.176])

Because we generated the data from a fair die, we know that each outcome has true probability 1/6,
roughly 0.167, so the above output estimates look good.
We can also visualize how these probabilities converge over time towards the true probability.
Letʼs conduct 500 groups of experiments where each group draws 10 samples.

counts = np.random.multinomial(10, fair_probs, size=500)
cum_counts = counts.astype(np.float32).cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)

d2l.set_figsize((6, 4.5))
for i in range(6):
    d2l.plt.plot(estimates[:, i].asnumpy(),
                 label=("P(die=" + str(i + 1) + ")"))
d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Groups of experiments')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();

Each solid curve corresponds to one of the six values of the die and gives our estimated probability
that the die turns up that value as assessed after each group of experiments. The dashed black line
gives the true underlying probability. As we get more data by conducting more experiments, the
6 solid curves converge towards the true probability.

Axioms of Probability Theory

When dealing with the rolls of a die, we call the set S = {1, 2, 3, 4, 5, 6} the sample space or outcome
space, where each element is an outcome. An event is a set of outcomes from a given sample space.
For instance, “seeing a 5” ({5}) and “seeing an odd number” ({1, 3, 5}) are both valid events of
rolling a die. Note that if the outcome of a random experiment is in event A, then event A has
occurred. That is to say, if 3 dots faced up after rolling a die, since 3 ∈ {1, 3, 5}, we can say that the
event “seeing an odd number” has occurred.
Formally, probability can be thought of as a function that maps a set to a real value. The probability
of an event A in the given sample space S, denoted as P (A), satisfies the following properties:
• For any event A, its probability is never negative, i.e., P (A) ≥ 0;

• The probability of the entire sample space is 1, i.e., P(S) = 1;
• For any countable sequence of events A1, A2, . . . that are mutually exclusive (Ai ∩ Aj = ∅ for all
i ≠ j), the probability that any of them happens is equal to the sum of their individual probabilities,
i.e., P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).

These are also the axioms of probability theory, proposed by Kolmogorov in 1933. Thanks to this
axiom system, we can avoid any philosophical dispute on randomness; instead, we can reason
rigorously with a mathematical language. For instance, by letting event A1 be the entire sample
space and Ai = ∅ for all i > 1, we can prove that P (∅) = 0, i.e., the probability of an impossible
event is 0.
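As a quick sanity check, we can verify that the empirical frequencies from a simulated die obey the
first two axioms, and that the frequency of a union of disjoint events equals the sum of their
frequencies. This snippet is a sketch of ours (it uses plain NumPy, imported as numpy, rather than
MXNetʼs np module):

import numpy  # plain NumPy, not MXNet's np module

rolls = numpy.random.randint(1, 7, size=1000)            # 1000 fair die rolls
freqs = numpy.bincount(rolls, minlength=7)[1:] / 1000    # empirical P(1), ..., P(6)
assert (freqs >= 0).all()                                # axiom 1: P(A) >= 0
assert abs(freqs.sum() - 1) < 1e-9                       # axiom 2: P(S) = 1
# Axiom 3 for the disjoint events {1}, {3}, {5}: P(odd) = P(1) + P(3) + P(5)
p_odd = (rolls % 2 == 1).mean()
assert abs(p_odd - (freqs[0] + freqs[2] + freqs[4])) < 1e-9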

Random Variables

In our random experiment of casting a die, we introduced the notion of a random variable. A ran-
dom variable can be pretty much any quantity and is not deterministic. It could take one value
among a set of possibilities in a random experiment. Consider a random variable X whose value
is in the sample space S = {1, 2, 3, 4, 5, 6} of rolling a die. We can denote the event “seeing a 5”
as {X = 5} or X = 5, and its probability as P ({X = 5}) or P (X = 5). By P (X = a), we make a
distinction between the random variable X and the values (e.g., a) that X can take. However, such
pedantry results in a cumbersome notation. For a compact notation, on one hand, we can just de-
note P (X) as the distribution over the random variable X: the distribution tells us the probability
that X takes any value. On the other hand, we can simply write P (a) to denote the probability that
a random variable takes the value a. Since an event in probability theory is a set of outcomes from
the sample space, we can specify a range of values for a random variable to take. For example,
P (1 ≤ X ≤ 3) denotes the probability of the event {1 ≤ X ≤ 3}, which means {X = 1, 2, or 3}.
Equivalently, P (1 ≤ X ≤ 3) represents the probability that the random variable X can take a
value from {1, 2, 3}.
Note that there is a subtle difference between discrete random variables, like the sides of a die,
and continuous ones, like the weight and the height of a person. There is little point in ask-
ing whether two people have exactly the same height. If we take precise enough measure-
ments you will find that no two people on the planet have the exact same height. In fact, if
we take a fine enough measurement, you will not have the same height when you wake up and
when you go to sleep. So there is no purpose in asking about the probability that someone is
1.80139278291028719210196740527486202 meters tall. Given the world population of humans the
probability is virtually 0. It makes more sense in this case to ask whether someoneʼs height falls
into a given interval, say between 1.79 and 1.81 meters. In these cases we quantify the likelihood
that we see a value as a density. The height of exactly 1.80 meters has no probability, but nonzero
density. In the interval between any two different heights we have nonzero probability. In the rest
of this section, we consider probability in discrete space. For probability over continuous random
variables, you may refer to Section 17.6.

2.6.2 Dealing with Multiple Random Variables

Very often, we will want to consider more than one random variable at a time. For instance, we
may want to model the relationship between diseases and symptoms. Given a disease and a symp-
tom, say “flu” and “cough”, either may or may not occur in a patient with some probability. While
we hope that the probability of both would be close to zero, we may want to estimate these prob-
abilities and their relationships to each other so that we may apply our inferences to effect better

medical care.
As a more complicated example, images contain millions of pixels, thus millions of random vari-
ables. And in many cases images will come with a label, identifying objects in the image. We can
also think of the label as a random variable. We can even think of all the metadata as random
variables such as location, time, aperture, focal length, ISO, focus distance, and camera type. All
of these are random variables that occur jointly. When we deal with multiple random variables,
there are several quantities of interest.

Joint Probability

The first is called the joint probability P (A = a, B = b). Given any values a and b, the joint proba-
bility lets us answer, what is the probability that A = a and B = b simultaneously? Note that for
any values a and b, P (A = a, B = b) ≤ P (A = a). This has to be the case, since for A = a and
B = b to happen, A = a has to happen and B = b also has to happen (and vice versa). Thus, A = a
and B = b cannot be more likely than A = a or B = b individually.

Conditional Probability

This brings us to an interesting ratio: 0 ≤ P(A = a, B = b)/P(A = a) ≤ 1. We call this ratio a conditional probability
and denote it by P (B = b | A = a): it is the probability of B = b, provided that A = a has occurred.
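To make this concrete, here is a small simulation (a sketch of ours, using only Pythonʼs standard
library) that estimates the conditional probability P(die = 5 | die is odd); by the ratio above, the
exact value is (1/6)/(1/2) = 1/3:

import random

rolls = [random.randint(1, 6) for _ in range(100000)]
odd_rolls = [r for r in rolls if r % 2 == 1]             # condition on the event "odd"
print(sum(r == 5 for r in odd_rolls) / len(odd_rolls))   # close to 1/3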

Bayes’ theorem

Using the definition of conditional probabilities, we can derive one of the most useful and cel-
ebrated equations in statistics: Bayes’ theorem. It goes as follows. By construction, we have the
multiplication rule that P (A, B) = P (B | A)P (A). By symmetry, this also holds for P (A, B) =
P (A | B)P (B). Assume that P (B) > 0. Solving for one of the conditional variables we get

P(A | B) = P(B | A)P(A) / P(B).    (2.6.1)

Note that here we use the more compact notation where P (A, B) is a joint distribution and P (A | B)
is a conditional distribution. Such distributions can be evaluated for particular values A = a, B = b.

Marginalization

Bayesʼ theorem is very useful if we want to infer one thing from the other, say cause and effect,
but we only know the properties in the reverse direction, as we will see later in this section. One
important operation that we need, to make this work, is marginalization. It is the operation of
determining P (B) from P (A, B). We can see that the probability of B amounts to accounting for
all possible choices of A and aggregating the joint probabilities over all of them:

P(B) = ∑_A P(A, B),    (2.6.2)

which is also known as the sum rule. The probability or distribution as a result of marginalization
is called a marginal probability or a marginal distribution.

Independence

Another useful property to check for is dependence vs. independence. Two random variables A and
B are independent if the occurrence of one event of A reveals no information about the occurrence
of an event of B. In this case P (B | A) = P (B). Statisticians typically
express this as A ⊥ B. From Bayesʼ theorem, it follows immediately that also P (A | B) = P (A).
In all the other cases we call A and B dependent. For instance, two successive rolls of a die are
independent. In contrast, the position of a light switch and the brightness in the room are not
(they are not perfectly deterministic, though, since we could always have a broken light bulb,
power failure, or a broken switch).
Since P(A | B) = P(A, B)/P(B) = P(A) is equivalent to P(A, B) = P(A)P(B), two random variables are
independent if and only if their joint distribution is the product of their individual distributions.
Likewise, two random variables A and B are conditionally independent given another random vari-
able C if and only if P (A, B | C) = P (A | C)P (B | C). This is expressed as A ⊥ B | C.
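As an illustration (again a sketch of ours, standard library only), we can check empirically that two
successive rolls of a die are independent: the joint frequency of the event (first = 1, second = 6)
should be close to the product of the marginal frequencies, namely 1/36 ≈ 0.0278:

import random

n = 100000
pairs = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(n)]
p_first = sum(a == 1 for a, b in pairs) / n              # marginal P(first = 1)
p_second = sum(b == 6 for a, b in pairs) / n             # marginal P(second = 6)
p_joint = sum(a == 1 and b == 6 for a, b in pairs) / n   # joint P(first = 1, second = 6)
print(p_joint, p_first * p_second)                       # both close to 1/36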

Application

Letʼs put our skills to the test. Assume that a doctor administers an AIDS test to a patient. This test
is fairly accurate: if the patient is healthy, it fails with only 1% probability, reporting him as
diseased. Moreover, it never fails to detect HIV if the patient actually has it. We use D1 to indicate
the diagnosis (1 if positive and 0 if negative) and H to denote the HIV status (1 if positive and 0 if
negative). Table 2.6.1 lists these conditional probabilities.

Table 2.6.1: Conditional probability of P (D1 | H).


Conditional probability H = 1 H = 0
P (D1 = 1 | H) 1 0.01
P (D1 = 0 | H) 0 0.99

Note that the column sums are all 1 (but the row sums are not), since the conditional probability
needs to sum up to 1, just like the probability. Letʼs work out the probability of the patient having
AIDS if the test comes back positive, i.e., P (H = 1 | D1 = 1). Obviously this is going to depend
on how common the disease is, since it affects the number of false alarms. Assume that the pop-
ulation is quite healthy, e.g., P(H = 1) = 0.0015. To apply Bayesʼ theorem, we need to apply
marginalization and the multiplication rule to determine

P(D1 = 1)
= P(D1 = 1, H = 0) + P(D1 = 1, H = 1)
= P(D1 = 1 | H = 0)P(H = 0) + P(D1 = 1 | H = 1)P(H = 1)
= 0.011485.    (2.6.3)
Thus, we get

P(H = 1 | D1 = 1) = P(D1 = 1 | H = 1)P(H = 1) / P(D1 = 1) = 0.1306.    (2.6.4)
In other words, there is only a 13.06% chance that the patient actually has AIDS, despite using a
very accurate test. As we can see, probability can be quite counterintuitive.

What should a patient do upon receiving such terrifying news? Likely, the patient would ask the
physician to administer another test to get clarity. The second test has different characteristics
and it is not as good as the first one, as shown in Table 2.6.2.

Table 2.6.2: Conditional probability of P (D2 | H).


Conditional probability H = 1 H = 0
P (D2 = 1 | H) 0.98 0.03
P (D2 = 0 | H) 0.02 0.97

Unfortunately, the second test comes back positive, too. Letʼs work out the requisite probabilities
to invoke Bayesʼ theorem, assuming conditional independence:

P(D1 = 1, D2 = 1 | H = 0) = P(D1 = 1 | H = 0)P(D2 = 1 | H = 0) = 0.0003,    (2.6.5)

P(D1 = 1, D2 = 1 | H = 1) = P(D1 = 1 | H = 1)P(D2 = 1 | H = 1) = 0.98.    (2.6.6)
Now we can apply marginalization and the multiplication rule:

P(D1 = 1, D2 = 1)
= P(D1 = 1, D2 = 1, H = 0) + P(D1 = 1, D2 = 1, H = 1)
= P(D1 = 1, D2 = 1 | H = 0)P(H = 0) + P(D1 = 1, D2 = 1 | H = 1)P(H = 1)
= 0.00176955.    (2.6.7)

In the end, the probability of the patient having AIDS given both positive tests is

P(H = 1 | D1 = 1, D2 = 1) = P(D1 = 1, D2 = 1 | H = 1)P(H = 1) / P(D1 = 1, D2 = 1) = 0.8307.    (2.6.8)

That is, the second test allowed us to gain much higher confidence that not all is well. Despite the
second test being considerably less accurate than the first one, it still significantly improved our
estimate.
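The entire calculation above is just a few lines of arithmetic, so it is worth verifying in code. The
following snippet is a sketch of ours in plain Python, with the probabilities taken from Tables 2.6.1
and 2.6.2 and the prior P(H = 1) = 0.0015; it reproduces both posteriors:

P_H1 = 0.0015                     # prior P(H = 1)
P_D1_H1, P_D1_H0 = 1.0, 0.01      # P(D1 = 1 | H = 1), P(D1 = 1 | H = 0)
P_D2_H1, P_D2_H0 = 0.98, 0.03     # P(D2 = 1 | H = 1), P(D2 = 1 | H = 0)

# First test: marginalize to get P(D1 = 1), then apply Bayes' theorem
P_D1 = P_D1_H0 * (1 - P_H1) + P_D1_H1 * P_H1
print(P_D1_H1 * P_H1 / P_D1)      # 0.1306...

# Both tests positive, assuming conditional independence given H
P_D12_H1 = P_D1_H1 * P_D2_H1
P_D12_H0 = P_D1_H0 * P_D2_H0
P_D12 = P_D12_H0 * (1 - P_H1) + P_D12_H1 * P_H1
print(P_D12_H1 * P_H1 / P_D12)    # 0.8307...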

2.6.3 Expectation and Variance

To summarize key characteristics of probability distributions, we need some measures. The expectation (or average) of the random variable X is denoted as

E[X] = ∑_x x P(X = x).    (2.6.9)

When the input of a function f (x) is a random variable drawn from the distribution P with differ-
ent values x, the expectation of f (x) is computed as

E_{x∼P}[f(x)] = ∑_x f(x) P(x).    (2.6.10)

In many cases we want to measure by how much the random variable X deviates from its expec-
tation. This can be quantified by the variance
Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².    (2.6.11)

Its square root is called the standard deviation. The variance of a function of a random variable
measures by how much the function deviates from the expectation of the function, as different
values x of the random variable are sampled from its distribution:
Var[f(x)] = E[(f(x) − E[f(x)])²].    (2.6.12)
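For instance (a small sketch of ours), for a single roll of a fair die we can evaluate these
definitions directly:

outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6
E = sum(x * p for x in outcomes)        # E[X] = 3.5
E2 = sum(x**2 * p for x in outcomes)    # E[X^2] = 91/6
print(E, E2 - E**2)                     # Var[X] = E[X^2] - E[X]^2 = 2.9166...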

Summary

• We can use MXNet to sample from probability distributions.


• We can analyze multiple random variables using joint distribution, conditional distribution,
Bayesʼ theorem, marginalization, and independence assumptions.
• Expectation and variance offer useful measures to summarize key characteristics of proba-
bility distributions.

Exercises

1. We conducted m = 500 groups of experiments where each group draws n = 10 samples. Vary m and n. Observe and analyze the experimental results.
2. Given two events with probability P (A) and P (B), compute upper and lower bounds on
P (A ∪ B) and P (A ∩ B). (Hint: display the situation using a Venn Diagram45 .)
3. Assume that we have a sequence of random variables, say A, B, and C, where B only de-
pends on A, and C only depends on B, can you simplify the joint probability P (A, B, C)?
(Hint: this is a Markov Chain46 .)
4. In Section 2.6.2, the first test is more accurate. Why not just run the first test a second time?

2.7 Documentation

Due to constraints on the length of this book, we cannot possibly introduce every single MXNet
function and class (and you probably would not want us to). The API documentation and addi-
tional tutorials and examples provide plenty of documentation beyond the book. In this section
we provide you with some guidance to exploring the MXNet API.
45 https://en.wikipedia.org/wiki/Venn_diagram
46 https://en.wikipedia.org/wiki/Markov_chain

2.7.1 Finding All the Functions and Classes in a Module

In order to know which functions and classes can be called in a module, we invoke the dir func-
tion. For instance, we can query all properties in the np.random module as follows:

from mxnet import np


print(dir(np.random))

['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_mx_nd_np', 'absolute_import', 'choice', 'multinomial', 'normal', 'rand', 'randint', 'shuffle', 'uniform']

Generally, we can ignore functions that start and end with __ (special objects in Python) or func-
tions that start with a single _ (usually internal functions). Based on the remaining function or
attribute names, we might hazard a guess that this module offers various methods for generating
random numbers, including sampling from the uniform distribution (uniform), normal distribu-
tion (normal), and multinomial distribution (multinomial).

2.7.2 Finding the Usage of Specific Functions and Classes

For more specific instructions on how to use a given function or class, we can invoke the help
function. As an example, letʼs explore the usage instructions for ndarrayʼs ones_like function.

help(np.ones_like)

Help on function ones_like in module mxnet.numpy:

ones_like(a)
Return an array of ones with the same shape and type as a given array.

Parameters
----------
a : ndarray
The shape and data-type of a define these same attributes of
the returned array.

Returns
-------
out : ndarray
Array of ones with the same shape and type as a.

Examples
--------
>>> x = np.arange(6)
>>> x = x.reshape((2, 3))
>>> x
array([[0., 1., 2.],
[3., 4., 5.]])
>>> np.ones_like(x)

array([[1., 1., 1.],
[1., 1., 1.]])

>>> y = np.arange(3, dtype=float)
>>> y
array([0., 1., 2.], dtype=float64)
>>>
>>> np.ones_like(y)
array([1., 1., 1.], dtype=float64)
From the documentation, we can see that the ones_like function creates a new array with the
same shape as the supplied ndarray and sets all the elements to 1. Whenever possible, you should
run a quick test to confirm your interpretation:

x = np.array([[0, 0, 0], [2, 2, 2]])
np.ones_like(x)

array([[1., 1., 1.],
       [1., 1., 1.]])

In the Jupyter notebook, we can use ? to display the document in another window. For example,
np.ones_like? will create content that is almost identical to help(np.ones_like), displaying it in
a new browser window. In addition, if we use two question marks, such as np.ones_like??, the
code implementing the function will also be displayed.

2.7.3 API Documentation

For further details on the API, check the MXNet website at http://mxnet.apache.org/. You
can find the details under the appropriate headings (also for programming languages other than
Python).

Summary

• The official documentation provides plenty of descriptions and examples that are beyond
this book.
• We can look up documentation for the usage of the MXNet API by calling the dir and help functions, or by checking the MXNet website.

Exercises

1. Look up ones_like and autograd on the MXNet website.
2. What are all the possible outputs after running np.random.choice(4, 2)?
3. Can you rewrite np.random.choice(4, 2) by using the np.random.randint function?

3 | Linear Neural Networks

Before we get into the details of deep neural networks, we need to cover the basics of neural net-
work training. In this chapter, we will cover the entire training process, including defining simple
neural network architectures, handling data, specifying a loss function, and training the model.
In order to make things easier to grasp, we begin with the simplest concepts. Fortunately, classic
statistical learning techniques such as linear and logistic regression can be cast as shallow neural
networks. Starting from these classic algorithms, we will introduce you to the basics, providing
the basis for more complex techniques such as softmax regression (introduced at the end of this
chapter) and multilayer perceptrons (introduced in the next chapter).

3.1 Linear Regression

Regression refers to a set of methods for modeling the relationship between data points x and
corresponding real-valued targets y. In the natural sciences and social sciences, the purpose of
regression is most often to characterize the relationship between the inputs and outputs. Machine
learning, on the other hand, is most often concerned with prediction.
Regression problems pop up whenever we want to predict a numerical value. Common exam-
ples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in
the hospital), demand forecasting (for retail sales), among countless others. Not every prediction
problem is a classic regression problem. In subsequent sections, we will introduce classification
problems, where the goal is to predict membership among a set of categories.

3.1.1 Basic Elements of Linear Regression

Linear regression may be both the simplest and most popular among the standard tools for regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple
assumptions. First, we assume that the relationship between the features x and targets y is linear,
i.e., that y can be expressed as a weighted sum of the inputs x, give or take some noise on the ob-
servations. Second, we assume that any noise is well-behaved (following a Gaussian distribution).
To motivate the approach, letʼs start with a running example. Suppose that we wish to estimate
the prices of houses (in dollars) based on their area (in square feet) and age (in years).
To actually fit a model for predicting house prices, we would need to get our hands on a dataset
consisting of sales for which we know the sale price, area and age for each home. In the termi-
nology of machine learning, the dataset is called training data or a training set, and each row (here
the data corresponding to one sale) is called an instance or example. The thing we are trying to
predict (here, the price) is called a target or label. The variables (here age and area) upon which
the predictions are based are called features or covariates.

Typically, we will use n to denote the number of examples in our dataset. We index the samples
by i, denoting each input data point as x^(i) = [x_1^(i), x_2^(i)] and the corresponding label as y^(i).

Linear Model

The linearity assumption just says that the target (price) can be expressed as a weighted sum of
the features (area and age):

price = warea · area + wage · age + b. (3.1.1)

Here, warea and wage are called weights, and b is called a bias (also called an offset or intercept). The
weights determine the influence of each feature on our prediction and the bias just says what value
the predicted price should take when all of the features take value 0. Even if we will never see any
homes with zero area, or that are precisely zero years old, we still need the intercept or else we
will limit the expressivity of our linear model.
Given a dataset, our goal is to choose the weights w and bias b such that, on average, the predictions
made according to our model best fit the true prices observed in the data.
In disciplines where it is common to focus on datasets with just a few features, explicitly ex-
pressing models long-form like this is common. In ML, we usually work with high-dimensional
datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of
d features, we express our prediction ŷ as

ŷ = w1 · x1 + ... + wd · xd + b. (3.1.2)

Collecting all features into a vector x and all weights into a vector w, we can express our model
compactly using a dot product:

ŷ = w⊤x + b.    (3.1.3)

Here, the vector x corresponds to a single data point. We will often find it convenient to refer to
our entire dataset via the design matrix X. Here, X contains one row for every example and one
column for every feature.
For a collection of data points X, the predictions ŷ can be expressed via the matrix-vector product:

ŷ = Xw + b. (3.1.4)
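For instance, here is a minimal sketch of this matrix-vector product in code; the values of X, w,
and b below are made up purely for illustration:

from mxnet import np, npx
npx.set_np()

X = np.array([[100., 5.], [150., 10.], [80., 2.]])  # 3 examples, 2 features
w = np.array([2., -3.4])
b = 4.2
print(np.dot(X, w) + b)  # one prediction per row of X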

Given a training dataset X and corresponding (known) targets y, the goal of linear regression is
to find the weight vector w and bias term b such that, given a new data point x_i sampled from the
same distribution as the training data, the model will (in expectation) predict the target y_i with
the lowest error.
Even if we believe that the best model for predicting y given x is linear, we would not expect to
find real-world data where y_i exactly equals w⊤x + b for all points (x, y). For example, whatever
instruments we use to observe the features X and labels y might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will
incorporate a noise term to account for such errors.
Before we can go about searching for the best parameters w and b, we will need two more things:
(i) a quality measure for some given model; and (ii) a procedure for updating the model to improve
its quality.



Loss Function

Before we start thinking about how to fit our model, we need to determine a measure of fitness.
The loss function quantifies the distance between the real and predicted value of the target. The loss
will usually be a non-negative number where smaller values are better and perfect predictions
incur a loss of 0. The most popular loss function in regression problems is the sum of squared
errors. When our prediction for some example i is ŷ (i) and the corresponding true label is y (i) , the
squared error is given by:

l^(i)(w, b) = (1/2) (ŷ^(i) − y^(i))².    (3.1.5)
The constant 1/2 makes no real difference but will prove notationally convenient, cancelling out
when we take the derivative of the loss. Since the training dataset is given to us, and thus out of
our control, the empirical error is only a function of the model parameters. To make things more
concrete, consider the example below where we plot a regression problem for a one-dimensional
case as shown in Fig. 3.1.1.

Fig. 3.1.1: Fit data with a linear model.

Note that large differences between estimates ŷ (i) and observations y (i) lead to even larger contri-
butions to the loss, due to the quadratic dependence. To measure the quality of a model on the
entire dataset, we simply average (or equivalently, sum) the losses on the training set.

L(w, b) = (1/n) ∑_{i=1}^n l^(i)(w, b) = (1/n) ∑_{i=1}^n (1/2) (w⊤x^(i) + b − y^(i))².    (3.1.6)

When training the model, we want to find parameters (w∗ , b∗ ) that minimize the total loss across
all training samples:

w∗, b∗ = argmin_{w,b} L(w, b).    (3.1.7)

Analytic Solution

Linear regression happens to be an unusually simple optimization problem. Unlike most other
models that we will encounter in this book, linear regression can be solved analytically by apply-
ing a simple formula, yielding a global optimum. To start, we can subsume the bias b into the

parameter w by appending a column to the design matrix consisting of all 1s. Then our predic-
tion problem is to minimize ||y − Xw||. Because this expression has a quadratic form, it is convex,
and so long as the problem is not degenerate (our features are linearly independent), it is strictly
convex.
Thus there is just one critical point on the loss surface and it corresponds to the global minimum.
Taking the derivative of the loss with respect to w and setting it equal to 0 yields the analytic solu-
tion:

w∗ = (X⊤X)⁻¹ X⊤y.    (3.1.8)

While simple problems like linear regression may admit analytic solutions, you should not get
used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the
requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
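As a concrete sketch (ours, using plain NumPy so that the matrix inverse is available in a
self-contained snippet), we can recover known parameters from synthetic data with the formula above:

import numpy as onp  # plain NumPy, for onp.linalg

n = 100
X = onp.random.normal(size=(n, 2))
X = onp.concatenate([X, onp.ones((n, 1))], axis=1)   # column of 1s subsumes the bias
true_w = onp.array([2.0, -3.4, 4.2])
y = X @ true_w + onp.random.normal(scale=0.01, size=n)
w_star = onp.linalg.inv(X.T @ X) @ X.T @ y           # w* = (X^T X)^{-1} X^T y
print(w_star)                                        # close to [2, -3.4, 4.2]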

Gradient descent

Even in cases where we cannot solve the models analytically, and even when the loss surfaces are
high-dimensional and nonconvex, it turns out that we can still train models effectively in practice.
Moreover, for many tasks, these difficult-to-optimize models turn out to be so much better that
figuring out how to train them ends up being well worth the trouble.
The key technique for optimizing nearly any deep learning model, and which we will call upon
throughout this book, consists of iteratively reducing the error by updating the parameters in the
direction that incrementally lowers the loss function. This algorithm is called gradient descent. On
convex loss surfaces, it will eventually converge to a global minimum, and while the same cannot
be said for nonconvex surfaces, it will at least lead towards a (hopefully good) local minimum.
The most naive application of gradient descent consists of taking the derivative of the true loss,
which is an average of the losses computed on every single example in the dataset. In practice,
this can be extremely slow. We must pass over the entire dataset before making a single update.
Thus, we will often settle for sampling a random minibatch of examples every time we need to
compute the update, a variant called stochastic gradient descent.
In each iteration, we first randomly sample a minibatch B consisting of a fixed number of training
data examples. We then compute the derivative (gradient) of the average loss on the minibatch
with regard to the model parameters. Finally, we multiply the gradient by a predetermined step
size η > 0 and subtract the resulting term from the current parameter values.
We can express the update mathematically as follows (∂ denotes the partial derivative):

(w, b) ← (w, b) − (η/|B|) ∑_{i∈B} ∂_{(w,b)} l^(i)(w, b).    (3.1.9)

To summarize, steps of the algorithm are the following: (i) we initialize the values of the model pa-
rameters, typically at random; (ii) we iteratively sample random batches from the data (many
times), updating the parameters in the direction of the negative gradient.
For quadratic losses and linear functions, we can write this out explicitly as follows. Note that w
and x are vectors. Here, the more elegant vector notation makes the math much more readable
than expressing things in terms of coefficients, say w_1, w_2, . . . , w_d:

w ← w − (η/|B|) ∑_{i∈B} ∂_w l^(i)(w, b) = w − (η/|B|) ∑_{i∈B} x^(i) (w⊤x^(i) + b − y^(i)),
b ← b − (η/|B|) ∑_{i∈B} ∂_b l^(i)(w, b) = b − (η/|B|) ∑_{i∈B} (w⊤x^(i) + b − y^(i)).    (3.1.10)

In the above equation, |B| represents the number of examples in each minibatch (the batch size)
and η denotes the learning rate. We emphasize that the values of the batch size and learning rate
are manually pre-specified and not typically learned through model training. These parameters
that are tunable but not updated in the training loop are called hyperparameters. Hyperparameter
tuning is the process by which these are chosen, and typically requires that we adjust the hyperparameters based on the results of the inner (training) loop as assessed on a separate validation
split of the data.
After training for some predetermined number of iterations (or until some other stopping criterion
is met), we record the estimated model parameters, denoted ŵ, b̂ (in general the “hat” symbol
denotes estimates). Note that even if our function is truly linear and noiseless, these parameters
will not be the exact minimizers of the loss because, although the algorithm converges slowly
towards a local minimum, it cannot achieve it exactly in a finite number of steps.
Linear regression happens to be a convex learning problem, and thus there is only one (global)
minimum. However, for more complicated models, like deep networks, the loss surfaces contain
many minima. Fortunately, for reasons that are not yet fully understood, deep learning prac-
titioners seldom struggle to find parameters that minimize the loss on training data. The more
formidable task is to find parameters that will achieve low loss on data that we have not seen be-
fore, a challenge called generalization. We return to these topics throughout the book.

Making Predictions with the Learned Model

Given the learned linear regression model ŵ⊤ x + b̂, we can now estimate the price of a new house
(not contained in the training data) given its area x1 and age (year) x2 . Estimating targets given
features is commonly called prediction or inference.
We will try to stick with prediction because calling this step inference, despite emerging as standard
jargon in deep learning, is somewhat of a misnomer. In statistics, inference more often denotes
estimating parameters based on a dataset. This misuse of terminology is a common source of
confusion when deep learning practitioners talk to statisticians.

Vectorization for Speed

When training our models, we typically want to process whole minibatches of examples simulta-
neously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear
algebra libraries rather than writing costly for-loops in Python.
To illustrate why this matters so much, we can consider two methods for adding vectors. To start
we instantiate two 10000-dimensional vectors containing all ones. In one method we will loop
over the vectors with a Python for loop. In the other method we will rely on a single call to np.



%matplotlib inline
import d2l
import math
from mxnet import np
import time

n = 10000
a = np.ones(n)
b = np.ones(n)

Since we will benchmark the running time frequently in this book, letʼs define a timer (hereafter
accessed via the d2l package) to track the running time.

# Saved in the d2l package for later use
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # Start the timer
        self.start_time = time.time()

    def stop(self):
        # Stop the timer and record the time in a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # Return the average time
        return sum(self.times) / len(self.times)

    def sum(self):
        # Return the sum of time
        return sum(self.times)

    def cumsum(self):
        # Return the accumulated times
        return np.array(self.times).cumsum().tolist()

Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a for
loop.

timer = Timer()
c = np.zeros(n)
for i in range(n):
c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()

'4.38336 sec'

Alternatively, we rely on np to compute the elementwise sum:



timer.start()
d = a + b
'%.5f sec' % timer.stop()

'0.00024 sec'

You probably noticed that the second method is dramatically faster than the first. Vectorizing code
often yields order-of-magnitude speedups. Moreover, we push more of the math to the library and
need not write as many calculations ourselves, reducing the potential for errors.

3.1.2 The Normal Distribution and Squared Loss

While you can already get your hands dirty using only the information above, in the following
section we can more formally motivate the square loss objective via assumptions about the distri-
bution of noise.
Recall from the above that the squared loss l(y, ŷ) = (1/2)(y − ŷ)² has many convenient properties.
These include a simple derivative ∂_ŷ l(y, ŷ) = (ŷ − y).
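As a quick check (a sketch of ours, reusing the autograd tools from Section 2.5), we can confirm
this derivative numerically:

from mxnet import autograd, np, npx
npx.set_np()

y = np.array([1., 4., 4.])
y_hat = np.array([2., 3., 5.])
y_hat.attach_grad()
with autograd.record():
    l = (0.5 * (y - y_hat) ** 2).sum()
l.backward()
print(y_hat.grad)  # equals y_hat - y, i.e. [1., -1., 1.]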
As we mentioned earlier, linear regression was invented by Gauss in 1795, who also discovered
the normal distribution (also called the Gaussian). It turns out that the connection between the
normal distribution and linear regression runs deeper than common parentage. To refresh your
memory, the probability density of a normal distribution with mean µ and variance σ 2 is given as
follows:
p(z) = (1/√(2πσ²)) exp(−(1/(2σ²)) (z − µ)²).    (3.1.11)

Below we define a Python function to compute the normal distribution.

x = np.arange(-7, 7, 0.01)

def normal(z, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(- 0.5 / sigma**2 * (z - mu)**2)

We can now visualize the normal distributions.

# Mean and variance pairs
parameters = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in parameters], xlabel='z',
         ylabel='p(z)', figsize=(4.5, 2.5),
         legend=['mean %d, var %d' % (mu, sigma) for mu, sigma in parameters])



As you can see, changing the mean corresponds to a shift along the x axis, and increasing the
variance spreads the distribution out, lowering its peak.
One way to motivate linear regression with the mean squared error loss function is to formally
assume that observations arise from noisy measurements, where the noise is normally distributed
as follows:

y = w⊤ x + b + ϵ where ϵ ∼ N (0, σ 2 ). (3.1.12)

Thus, we can now write out the likelihood of seeing a particular y for a given x via
p(y|x) = (1/√(2πσ²)) exp(−(1/(2σ²)) (y − w⊤x − b)²).    (3.1.13)

Now, according to the maximum likelihood principle, the best values of b and w are those that max-
imize the likelihood of the entire dataset:

P(Y | X) = ∏_{i=1}^n p(y^(i) | x^(i)).    (3.1.14)

Estimators chosen according to the maximum likelihood principle are called Maximum Likelihood
Estimators (MLE). While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the log
of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can minimize the Negative
Log-Likelihood (NLL) − log p(y|X). Working out the math gives us:

−log p(y|X) = ∑_{i=1}^n ( (1/2) log(2πσ²) + (1/(2σ²)) (y^(i) − w⊤x^(i) − b)² ).    (3.1.15)

Now we just need one more assumption: that σ is some fixed constant. Thus we can ignore the first
term because it does not depend on w or b. Now the second term is identical to the squared error
objective introduced earlier, but for the multiplicative constant 1/σ². Fortunately, the solution does
not depend on σ. It follows that minimizing squared error is equivalent to maximum likelihood
estimation of a linear model under the assumption of additive Gaussian noise.



3.1.3 From Linear Regression to Deep Networks

So far we only talked about linear functions. While neural networks cover a much richer family
of models, we can begin thinking of the linear model as a neural network by expressing it in the
language of neural networks. To begin, letʼs start by rewriting things in ʻlayerʼ notation.

Neural Network Diagram

Deep learning practitioners like to draw diagrams to visualize what is happening in their models.
In Fig. 3.1.2, we depict our linear model as a neural network. Note that these diagrams indicate
the connectivity pattern (here, each input is connected to the output) but not the values taken by
the weights or biases.

Fig. 3.1.2: Linear regression is a single-layer neural network.

Because there is just a single computed neuron (node) in the graph (the input values are not com-
puted but given), we can think of linear models as neural networks consisting of just a single ar-
tificial neuron. Since for this model, every input is connected to every output (in this case there
is only one output!), we can regard this transformation as a fully-connected layer, also commonly
called a dense layer. We will talk a lot more about networks composed of such layers in the next
chapter on multilayer perceptrons.

Biology

Since linear regression (invented in 1795) predates computational neuroscience, it might seem
anachronistic to describe linear regression as a neural network. To see why linear models were a
natural place to begin when the cyberneticists/neurophysiologists Warren McCulloch and Walter
Pitts started to develop models of artificial neurons, consider the cartoonish picture of a biological
neuron in Fig. 3.1.3, consisting of dendrites (input terminals), the nucleus (CPU), the axon (output
wire), and the axon terminals (output terminals), enabling connections to other neurons via
synapses.



Fig. 3.1.3: The real neuron (dendrites, cell body with nucleus, myelinated axon with Schwann cells and nodes of Ranvier, and axon terminals).

Information xi arriving from other neurons (or environmental sensors such as the retina) is re-
ceived in the dendrites. In particular, that information is weighted by synaptic weights wi determin-
ing the effect of the inputs (e.g., activation or inhibition via the product x_i w_i). The weighted inputs
arriving from multiple sources are aggregated in the nucleus as a weighted sum y = ∑_i x_i w_i + b,
and this information is then sent for further processing in the axon y, typically after some nonlin-
ear processing via σ(y). From there it either reaches its destination (e.g., a muscle) or is fed into
another neuron via its dendrites.
Certainly, the high-level idea that many such units could be cobbled together with the right con-
nectivity and right learning algorithm to produce far more interesting and complex behavior than
any one neuron alone could express owes to our study of real biological neural systems.
At the same time, most research in deep learning today draws little direct inspiration from neuroscience. We invoke Stuart Russell and Peter Norvig who, in their classic AI textbook Artificial Intelligence: A Modern Approach (Russell & Norvig, 2016), pointed out that although airplanes might
have been inspired by birds, ornithology has not been the primary driver of aeronautics inno-
vation for some centuries. Likewise, inspiration in deep learning these days comes in equal or
greater measure from mathematics, statistics, and computer science.

Summary

• Key ingredients in a machine learning model are training data, a loss function, an optimiza-
tion algorithm, and quite obviously, the model itself.
• Vectorizing makes everything better (mostly math) and faster (mostly code).
• Minimizing an objective function and performing maximum likelihood can mean the same
thing.
• Linear models are neural networks, too.

Exercises

1. Assume that we have some data x_1, . . . , x_n ∈ ℝ. Our goal is to find a constant b such that ∑_i (x_i − b)² is minimized.



• Find a closed-form solution for the optimal value of b.
• How does this problem and its solution relate to the normal distribution?
2. Derive the closed-form solution to the optimization problem for linear regression with
squared error. To keep things simple, you can omit the bias b from the problem (we can
do this in a principled fashion by adding one column to X consisting of all ones).
• Write out the optimization problem in matrix and vector notation (treat all the data as
a single matrix, all the target values as a single vector).
• Compute the gradient of the loss with respect to w.
• Find the closed form solution by setting the gradient equal to zero and solving the ma-
trix equation.
• When might this be better than using stochastic gradient descent? When might this
method break?
3. Assume that the noise model governing the additive noise ϵ is the exponential distribution.
That is, p(ϵ) = (1/2) exp(−|ϵ|).
• Write out the negative log-likelihood of the data under the model − log P (Y | X).
• Can you find a closed form solution?
• Suggest a stochastic gradient descent algorithm to solve this problem. What could pos-
sibly go wrong (hint - what happens near the stationary point as we keep on updating
the parameters). Can you fix this?

3.2 Linear Regression Implementation from Scratch

Now that you understand the key ideas behind linear regression, we can begin to work through
a hands-on implementation in code. In this section, we will implement the entire method from
scratch, including the data pipeline, the model, the loss function, and the gradient descent op-
timizer. While modern deep learning frameworks can automate nearly all of this work, imple-
menting things from scratch is the only way to make sure that you really know what you are doing.
Moreover, when it comes time to customize models, defining our own layers, loss functions, etc.,
understanding how things work under the hood will prove handy. In this section, we will rely only
on ndarray and autograd. Afterwards, we will introduce a more compact implementation, taking
advantage of Gluonʼs bells and whistles. To start off, we import the few required packages.

%matplotlib inline
import d2l
from mxnet import autograd, np, npx
import random
npx.set_np()



3.2.1 Generating the Dataset

To keep things simple, we will construct an artificial dataset according to a linear model with
additive noise. Our task will be to recover this modelʼs parameters using the finite set of examples
contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In
the following code snippet, we generate a dataset containing 1000 examples, each consisting of
2 features sampled from a standard normal distribution. Thus our synthetic dataset will be an
object X ∈ R1000×2 .
The true parameters generating our data will be w = [2, −3.4]⊤ and b = 4.2 and our synthetic
labels will be assigned according to the following linear model with noise term ϵ:

y = Xw + b + ϵ. (3.2.1)

You could think of ϵ as capturing potential measurement errors on the features and labels. We
will assume that the standard assumptions hold and thus that ϵ obeys a normal distribution with
mean of 0. To make our problem easy, we will set its standard deviation to 0.01. The following
code generates our synthetic dataset:

# Saved in the d2l package for later use
def synthetic_data(w, b, num_examples):
    """generate y = X w + b + noise"""
    X = np.random.normal(0, 1, (num_examples, len(w)))
    y = np.dot(X, w) + b
    y += np.random.normal(0, 0.01, y.shape)
    return X, y

true_w = np.array([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Note that each row in features consists of a 2-dimensional data point and that each row in labels
consists of a 1-dimensional target value (a scalar).

print('features:', features[0],'\nlabel:', labels[0])

features: [2.2122064 1.1630787]


label: 4.662078

By generating a scatter plot using the second feature features[:, 1] and labels, we can clearly observe
the linear correlation between the two.

d2l.set_figsize((3.5, 2.5))
d2l.plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);



3.2.2 Reading the Dataset

Recall that training models consists of making multiple passes over the dataset, grabbing one
minibatch of examples at a time, and using them to update our model. Since this process is so fun-
damental to training machine learning algorithms, it is worth defining a utility function to shuffle
the data and access it in minibatches.
In the following code, we define a data_iter function to demonstrate one possible implementa-
tion of this functionality. The function takes a batch size, a design matrix, and a vector of labels,
yielding minibatches of size batch_size. Each minibatch consists of a tuple of features and labels.

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = np.array(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

In general, note that we want to use reasonably sized minibatches to take advantage of the GPU
hardware, which excels at parallelizing operations. Because each example can be fed through our
models in parallel and the gradient of the loss function for each example can also be taken in
parallel, GPUs allow us to process hundreds of examples in scarcely more time than it might take
to process just a single example.
To build some intuition, letʼs read and print the first small batch of data examples. The shape of
the features in each minibatch tells us both the minibatch size and the number of input features.
Likewise, our minibatch of labels will have a shape given by batch_size.

batch_size = 10

for X, y in data_iter(batch_size, features, labels):


print(X, '\n', y)
break



[[ 0.7727994 0.83015364]
[ 1.7198472 0.6671155 ]
[-1.88909 -2.515245 ]
[-1.6774265 -0.80320734]
[ 0.32338732 -0.2544687 ]
[-0.00534759 1.0762215 ]
[-1.0730928 0.20131476]
[ 2.083999 -0.25189096]
[-0.3979261 -1.2483113 ]
[ 0.43164727 -0.31814137]]
[2.9260201 5.3652325 8.953278 3.5695467 5.7129297 0.5446847 1.373391
9.23388 7.651542 6.156946 ]

As we run the iterator, we obtain distinct minibatches successively until all the data has been
exhausted (try this). While the iterator implemented above is good for didactic purposes, it is
inefficient in ways that might get us in trouble on real problems. For example, it requires that we
load all data in memory and that we perform lots of random memory access. The built-in iterators
implemented in Apache MXNet are considerably more efficient and they can deal both with data stored
in files and data fed via a data stream.

3.2.3 Initializing Model Parameters

Before we can begin optimizing our modelʼs parameters by gradient descent, we need to have some
parameters in the first place. In the following code, we initialize weights by sampling random
numbers from a normal distribution with mean 0 and a standard deviation of 0.01, setting the
bias b to 0.

w = np.random.normal(0, 0.01, (2, 1))
b = np.zeros(1)

Now that we have initialized our parameters, our next task is to update them until they fit our data
sufficiently well. Each update requires taking the gradient (a multi-dimensional derivative) of our
loss function with respect to the parameters. Given this gradient, we can update each parameter
in the direction that reduces the loss.
Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use au-
tomatic differentiation to compute the gradient. See Section 2.5 for more details. Recall from the
autograd chapter that in order for autograd to know that it should store a gradient for our param-
eters, we need to invoke the attach_grad function, allocating memory to store the gradients that
we plan to take.

w.attach_grad()
b.attach_grad()

3.2.4 Defining the Model

Next, we must define our model, relating its inputs and parameters to its outputs. Recall that
to calculate the output of the linear model, we simply take the matrix-vector dot product of the



examples X and the modelʼs weights w, and add the offset b to each example. Note that below
np.dot(X, w) is a vector and b is a scalar. Recall that when we add a vector and a scalar, the scalar is
added to each component of the vector.

# Saved in the d2l package for later use
def linreg(X, w, b):
    return np.dot(X, w) + b

3.2.5 Defining the Loss Function

Since updating our model requires taking the gradient of our loss function, we ought to define
the loss function first. Here we will use the squared loss function as described in the previous
section. In the implementation, we need to transform the true value y into the shape of the predicted
value y_hat. The result returned by the following function will also have the same shape as y_hat.

# Saved in the d2l package for later use
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

3.2.6 Defining the Optimization Algorithm

As we discussed in the previous section, linear regression has a closed-form solution. However,
this is not a book about linear regression, it is a book about deep learning. Since none of the
other models that this book introduces can be solved analytically, we will take this opportunity to
introduce your first working example of stochastic gradient descent (SGD).
At each step, using one batch randomly drawn from our dataset, we will estimate the gradient of
the loss with respect to our parameters. Next, we will update our parameters (a small amount)
in the direction that reduces the loss. Recall from Section 2.5 that after we call backward each
parameter (param) will have its gradient stored in param.grad. The following code applies the SGD
update, given a set of parameters, a learning rate, and a batch size. The size of the update step
is determined by the learning rate lr. Because our loss is calculated as a sum over the batch of
examples, we normalize our step size by the batch size (batch_size), so that the magnitude of a
typical step size does not depend heavily on our choice of the batch size.

# Saved in the d2l package for later use


def sgd(params, lr, batch_size):
    for param in params:
        param[:] = param - lr * param.grad / batch_size

3.2.7 Training

Now that we have all of the parts in place, we are ready to implement the main training loop. It
is crucial that you understand this code because you will see nearly identical training loops over
and over again throughout your career in deep learning.



In each iteration, we will grab minibatches of training examples, first passing them through our model to
obtain a set of predictions. After calculating the loss, we call the backward function to initiate the
backwards pass through the network, storing the gradients with respect to each parameter in its
corresponding .grad attribute. Finally, we will call the optimization algorithm sgd to update the
model parameters. Since we previously set the batch size batch_size to 10, the loss shape l for
each minibatch is (10, 1).
In summary, we will execute the following loop:
• Initialize parameters (w, b)
• Repeat until done

– Compute gradient g ← ∂_{(w,b)} (1/|B|) ∑_{i∈B} l(x^{(i)}, y^{(i)}, w, b)
– Update parameters (w, b) ← (w, b) − ηg
In the code below, l is a vector of the losses for each example in the minibatch. Because l is not a
scalar variable, running l.backward() adds together the elements in l to obtain a new scalar variable
and then calculates its gradient.
In each epoch (a pass through the data), we will iterate through the entire dataset (using the
data_iter function) once, passing through every example in the training dataset (assuming the
number of examples is divisible by the batch size). The number of epochs num_epochs and the
learning rate lr are both hyper-parameters, which we set here to 3 and 0.03, respectively. Unfor-
tunately, setting hyper-parameters is tricky and requires some adjustment by trial and error. We
elide these details for now but revisit them later in Chapter 11.

lr = 0.03  # Learning rate


num_epochs = 3  # Number of epochs
net = linreg  # Our fancy linear model
loss = squared_loss  # 0.5 (y-y')^2

for epoch in range(num_epochs):
    # Assuming the number of examples can be divided by the batch size, all
    # the examples in the training dataset are used once in one epoch
    # iteration. The features and labels of minibatch examples are given by X
    # and y respectively
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # Minibatch loss in X and y
        l.backward()  # Compute gradient on l with respect to [w, b]
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

epoch 1, loss 0.024993


epoch 2, loss 0.000088
epoch 3, loss 0.000051

In this case, because we synthesized the data ourselves, we know precisely what the true param-
eters are. Thus, we can evaluate our success in training by comparing the true parameters with
those that we learned through our training loop. Indeed they turn out to be very close to each
other.



print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)

Error in estimating w [0.00069368 0.00018263]


Error in estimating b [0.00043964]

Note that we should not take it for granted that we are able to recover the parameters accurately.
This only happens for a special category of problems: strongly convex optimization problems with
“enough” data to ensure that the noisy samples allow us to recover the underlying dependency.
Most problems do not satisfy these conditions. In fact, the parameters of a deep network are rarely the same (or
even close) between two different runs, unless all conditions are identical, including the order in
which the data is traversed. However, in machine learning, we are typically less concerned with
recovering true underlying parameters, and more concerned with parameters that lead to accu-
rate prediction. Fortunately, even on difficult optimization problems, stochastic gradient descent
can often find remarkably good solutions, owing partly to the fact that, for deep networks, there
exist many configurations of the parameters that lead to accurate prediction.

Summary

We saw how a deep network can be implemented and optimized from scratch, using just ndarray
and autograd, without any need for defining layers, fancy optimizers, etc. This only scratches the
surface of what is possible. In the following sections, we will describe additional models based
on the concepts that we have just introduced and learn how to implement them more concisely.

Exercises

1. What would happen if we were to initialize the weights w = 0? Would the algorithm still
work?
2. Assume that you are Georg Simon Ohm50 trying to come up with a model between voltage
and current. Can you use autograd to learn the parameters of your model?
3. Can you use Planckʼs Law51 to determine the temperature of an object using spectral energy
density?
4. What are the problems you might encounter if you wanted to extend autograd to second
derivatives? How would you fix them?
5. Why is the reshape function needed in the squared_loss function?
6. Experiment using different learning rates to find out how fast the loss function value drops.
7. If the number of examples cannot be divided by the batch size, what happens to the
data_iter functionʼs behavior?
50 https://en.wikipedia.org/wiki/Georg_Ohm
51 https://en.wikipedia.org/wiki/Planck%27s_law



3.3 Concise Implementation of Linear Regression

Broad and intense interest in deep learning for the past several years has inspired companies,
academics, and hobbyists alike to develop a variety of mature open source frameworks for automating
the repetitive work of implementing gradient-based learning algorithms. In the previous section,
we relied only on (i) ndarray for data storage and linear algebra; and (ii) autograd for calculating
derivatives. In practice, because data iterators, loss functions, optimizers, and neural network
layers (and some whole architectures) are so common, modern libraries implement these com-
ponents for us as well.
In this section, we will show you how to implement the linear regression model from Section 3.2
concisely by using Gluon.

3.3.1 Generating the Dataset

To start, we will generate the same dataset as in the previous section.

import d2l
from mxnet import autograd, gluon, np, npx
npx.set_np()

true_w = np.array([2, -3.4])


true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)

3.3.2 Reading the Dataset

Rather than rolling our own iterator, we can call upon Gluonʼs data module to read data. The first
step will be to instantiate an ArrayDataset. This objectʼs constructor takes one or more ndarrays
as arguments. Here, we pass in features and labels as arguments. Next, we will use the Array-
Dataset to instantiate a DataLoader, which also requires that we specify a batch_size and specify
a Boolean value shuffle indicating whether or not we want the DataLoader to shuffle the data on
each epoch (pass through the dataset).

# Saved in the d2l package for later use


def load_array(data_arrays, batch_size, is_train=True):
    """Construct a Gluon data loader"""
    dataset = gluon.data.ArrayDataset(*data_arrays)
    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 10
data_iter = load_array((features, labels), batch_size)



Now we can use data_iter in much the same way as we called the data_iter function in the pre-
vious section. To verify that it is working, we can read and print the first minibatch of instances.

for X, y in data_iter:
    print(X, '\n', y)
    break

[[ 1.5702168 1.11278 ]
[-0.1742568 1.9691626 ]
[-1.2996627 -1.483092 ]
[ 0.76639044 -0.04300109]
[ 0.681858 1.3600351 ]
[ 0.02715491 -0.26509324]
[-0.44584066 -0.6373412 ]
[ 0.31657252 0.91473174]
[ 0.955866 1.0097522 ]
[ 0.01555612 -1.2977124 ]]
[ 3.5531938 -2.8460252 6.6499796 5.8846426 0.9486723 5.1568856
5.4724007 1.7201893 2.67091 8.643943 ]

3.3.3 Defining the Model

When we implemented linear regression from scratch (in Section 3.2), we de-


fined our model parameters explicitly and coded up the calculations to produce output using ba-
sic linear algebra operations. You should know how to do this. But once your models get more
complex, and once you have to do this nearly every day, you will be glad for the assistance. The
situation is similar to coding up your own blog from scratch. Doing it once or twice is rewarding
and instructive, but you would be a lousy web developer if every time you needed a blog you spent
a month reinventing the wheel.
For standard operations, we can use Gluonʼs predefined layers, which allow us to focus especially
on the layers used to construct the model rather than having to focus on the implementation.
To define a linear model, we first import the nn module, which defines a large number of neural
network layers (note that “nn” is an abbreviation for neural networks). We will first define a model
variable net, which will refer to an instance of the Sequential class. In Gluon, Sequential defines
a container for several layers that will be chained together. Given input data, a Sequential passes
it through the first layer, in turn passing the output as the second layerʼs input and so forth. In
the following example, our model consists of only one layer, so we do not really need Sequential.
But since nearly all of our future models will involve multiple layers, we will use it anyway just to
familiarize you with the most standard workflow.

from mxnet.gluon import nn


net = nn.Sequential()

Recall the architecture of a single-layer network as shown in Fig. 3.3.1. The layer is said to be
fully-connected because each of its inputs is connected to each of its outputs by means of a matrix-
vector multiplication. In Gluon, the fully-connected layer is defined in the Dense class. Since we
only want to generate a single scalar output, we set that number to 1.



Fig. 3.3.1: Linear regression is a single-layer neural network.

net.add(nn.Dense(1))

It is worth noting that, for convenience, Gluon does not require us to specify the input shape for
each layer. So here, we do not need to tell Gluon how many inputs go into this linear layer. When
we first try to pass data through our model, e.g., when we execute net(X) later, Gluon will auto-
matically infer the number of inputs to each layer. We will describe how this works in more detail
in the chapter “Deep Learning Computation”.

3.3.4 Initializing Model Parameters

Before using net, we need to initialize the model parameters, such as the weights and biases in
the linear regression model. We will import the initializer module from MXNet. This module
provides various methods for model parameter initialization. Gluon makes init available as a
shortcut (abbreviation) to access the initializer package. By calling init.Normal(sigma=0.01),
we specify that each weight parameter should be randomly sampled from a normal distribution
with mean 0 and standard deviation 0.01. The bias parameter will be initialized to zero by default.
Both the weight vector and bias will have attached gradients.

from mxnet import init


net.initialize(init.Normal(sigma=0.01))

The code above may look straightforward but you should note that something strange is happening
here. We are initializing parameters for a network even though Gluon does not yet know how
many dimensions the input will have! It might be 2 as in our example or it might be 2000. Gluon
lets us get away with this because, behind the scenes, the initialization is actually deferred. The
real initialization will take place only when we attempt to pass data through the network for the
first time. Just be careful to remember that since the parameters have not been initialized yet, we
cannot access or manipulate them.
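
For instance (a small illustrative check, not part of the original code), attempting to read a
parameter before the first forward pass raises a deferred-initialization error:

try:
    net[0].weight.data()
except Exception as e:
    print('Parameters are not initialized yet:', type(e).__name__)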

3.3.5 Defining the Loss Function

In Gluon, the loss module defines various loss functions. We will use the imported module loss with
the pseudonym gloss, to avoid confusing it with the variable holding our chosen loss function. In
this example, we will use the Gluon implementation of squared loss (L2Loss).

from mxnet.gluon import loss as gloss


loss = gloss.L2Loss() # The squared loss is also known as the L2 norm loss



3.3.6 Defining the Optimization Algorithm

Minibatch SGD and related variants are standard tools for optimizing neural networks and thus
Gluon supports SGD alongside a number of variations on this algorithm through its Trainer class.
When we instantiate the Trainer, we will specify the parameters to optimize over (obtainable from
our net via net.collect_params()), the optimization algorithm we wish to use (sgd), and a dictio-
nary of hyper-parameters required by our optimization algorithm. SGD just requires that we set
the value learning_rate (here we set it to 0.03).

from mxnet import gluon


trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})

3.3.7 Training

You might have noticed that expressing our model through Gluon requires comparatively few lines
of code. We did not have to individually allocate parameters, define our loss function, or im-
plement stochastic gradient descent. Once we start working with much more complex models,
Gluonʼs advantages will grow considerably. However, once we have all the basic pieces in place,
the training loop itself is strikingly similar to what we did when implementing everything from
scratch.
To refresh your memory: for some number of epochs, we will make a complete pass over the
dataset (train_data), iteratively grabbing one minibatch of inputs and the corresponding ground-
truth labels. For each minibatch, we go through the following ritual:
• Generate predictions by calling net(X) and calculate the loss l (the forward pass).
• Calculate gradients by calling l.backward() (the backward pass).
• Update the model parameters by invoking our SGD optimizer (note that trainer already
knows which parameters to optimize over, so we just need to pass in the minibatch size).
For good measure, we compute the loss after each epoch and print it to monitor progress.

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(features), labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 0.024890


epoch 2, loss: 0.000086
epoch 3, loss: 0.000052

Below, we compare the model parameters learned by training on finite data and the actual param-
eters that generated our dataset. To access parameters with Gluon, we first access the layer that we
need from net and then access that layerʼs weight (weight) and bias (bias). To access each param-



eterʼs values as an ndarray, we invoke its data method. As in our from-scratch implementation,
note that our estimated parameters are close to their ground truth counterparts.

w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)
b = net[0].bias.data()
print('Error in estimating b', true_b - b)

Error in estimating w [[ 0.00081301 -0.00010157]]


Error in estimating b [0.00095034]

Summary

• Using Gluon, we can implement models much more succinctly.


• In Gluon, the data module provides tools for data processing, the nn module defines a large
number of neural network layers, and the loss module defines many common loss func-
tions.
• MXNetʼs module initializer provides various methods for model parameter initialization.
• Dimensionality and storage are automatically inferred (but be careful not to attempt to ac-
cess parameters before they have been initialized).

Exercises

1. If we replace l = loss(output, y) with l = loss(output, y).mean(), we need to change
trainer.step(batch_size) to trainer.step(1) for the code to behave identically. Why?
2. Review the MXNet documentation to see what loss functions and initialization methods are
provided in the modules gluon.loss and init. Replace the loss by Huberʼs loss.
3. How do you access the gradient of net[0].weight?

3.4 Softmax Regression

In Section 3.1, we introduced linear regression, working through implementations from scratch
in Section 3.2 and again using Gluon in Section 3.3 to do the heavy lifting.
Regression is the hammer we reach for when we want to answer how much? or how many? ques-
tions. If you want to predict the number of dollars (the price) at which a house will be sold, or
the number of wins a baseball team might have, or the number of days that a patient will remain
hospitalized before being discharged, then you are probably looking for a regression model.
In practice, we are more often interested in classification: asking not how much? but which one?



• Does this email belong in the spam folder or the inbox?
• Is this customer more likely to sign up or not to sign up for a subscription service?
• Does this image depict a donkey, a dog, a cat, or a rooster?
• Which movie is Aston most likely to watch next?
Colloquially, machine learning practitioners overload the word classification to describe two subtly
different problems: (i) those where we are interested only in hard assignments of examples to
categories; and (ii) those where we wish to make soft assignments, i.e., to assess the probability that
each category applies. The distinction tends to get blurred, in part, because often, even when we
only care about hard assignments, we still use models that make soft assignments.

3.4.1 Classification Problems

To get our feet wet, letʼs start off with a simple image classification problem. Here, each input
consists of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar, giving
us four features x1 , x2 , x3 , x4 . Further, letʼs assume that each image belongs to one among the
categories “cat”, “chicken” and “dog”.
Next, we have to choose how to represent the labels. We have two obvious choices. Perhaps the
most natural impulse would be to choose y ∈ {1, 2, 3}, where the integers represent {dog, cat,
chicken} respectively. This is a great way of storing such information on a computer. If the cat-
egories had some natural ordering among them, say if we were trying to predict {baby, toddler,
adolescent, young adult, adult, geriatric}, then it might even make sense to cast this problem as
regression and keep the labels in this format.
But general classification problems do not come with natural orderings among the classes. For-
tunately, statisticians long ago invented a simple way to represent categorical data: the one-hot
encoding. A one-hot encoding is a vector with as many components as we have categories. The
component corresponding to a particular instanceʼs category is set to 1 and all other components
are set to 0.

y ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. (3.4.1)

In our case, y would be a three-dimensional vector, with (1, 0, 0) corresponding to “cat”, (0, 1, 0) to
“chicken” and (0, 0, 1) to “dog”.
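
As an aside, one-hot vectors are easy to produce in code. Here is a minimal sketch using MXNetʼs
npx.one_hot operator (assuming the np/npx setup used throughout this book):

from mxnet import np, npx
npx.set_np()

labels = np.array([0, 2])  # class indices, e.g., "cat" and "dog"
print(npx.one_hot(labels, 3))  # [[1. 0. 0.], [0. 0. 1.]]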

Network Architecture

In order to estimate the conditional probabilities associated with each class, we need a model
with multiple outputs, one per class. To address classification with linear models, we will need as
many linear functions as we have outputs. Each output will correspond to its own linear function.
In our case, since we have 4 features and 3 possible output categories, we will need 12 scalars to
represent the weights, (w with subscripts) and 3 scalars to represent the biases (b with subscripts).
We compute these three logits, o1 , o2 , and o3 , for each input:

o1 = x1 w11 + x2 w12 + x3 w13 + x4 w14 + b1 ,


o2 = x1 w21 + x2 w22 + x3 w23 + x4 w24 + b2 , (3.4.2)
o3 = x1 w31 + x2 w32 + x3 w33 + x4 w34 + b3 .



We can depict this calculation with the neural network diagram shown in Fig. 3.4.1. Just as in lin-
ear regression, softmax regression is also a single-layer neural network. And since the calculation
of each output, o1 , o2 , and o3 , depends on all inputs, x1 , x2 , x3 , and x4 , the output layer of softmax
regression can also be described as fully-connected layer.

Fig. 3.4.1: Softmax regression is a single-layer neural network.

To express the model more compactly, we can use linear algebra notation. In vector form, we
arrive at o = Wx + b, a form better suited both for mathematics, and for writing code. Note that
we have gathered all of our weights into a 3 × 4 matrix and that for a given example x, our outputs
are given by a matrix-vector product of our weights by our inputs plus our biases b.

Softmax Operation

The main approach that we are going to take here is to interpret the outputs of our model as proba-
bilities. We will optimize our parameters to produce probabilities that maximize the likelihood of
the observed data. Then, to generate predictions, we will set a threshold, for example, choosing
the argmax of the predicted probabilities.
Put formally, we would like outputs ŷk that we can interpret as the probability that a given item
belongs to class k. Then we can choose the class with the largest output value as our prediction
argmax_k ŷ_k. For example, if ŷ1, ŷ2, and ŷ3 are 0.1, 0.8, and 0.1, respectively, then we predict category
2, which (in our example) represents “chicken”.
You might be tempted to suggest that we interpret the logits o directly as our outputs of interest.
However, there are some problems with directly interpreting the output of the linear layer as a
probability. Nothing constrains these numbers to sum to 1. Moreover, depending on the inputs,
they can take negative values. These violate basic axioms of probability presented in Section 2.6.
To interpret our outputs as probabilities, we must guarantee that (even on new data), they will be
nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model
to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half
of those examples will actually belong to the predicted class. This is a property called calibration.
The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of choice
models, does precisely this. To transform our logits such that they become nonnegative and sum to
1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring
non-negativity) and then divide by their sum (ensuring that they sum to 1).

ŷ = softmax(o) where ŷ_i = exp(o_i) / ∑_j exp(o_j).    (3.4.3)

It is easy to see ŷ1 + ŷ2 + ŷ3 = 1 with 0 ≤ ŷi ≤ 1 for all i. Thus, ŷ is a proper probability distribu-
tion and the values of ŷ can be interpreted accordingly. Note that the softmax operation does not



change the ordering among the logits, and thus we can still pick out the most likely class by:

ı̂(o) = argmax_i o_i = argmax_i ŷ_i.    (3.4.4)

The logits o then are simply the pre-softmax values that determine the probabilities assigned
to each category. Summarizing it all in vector notation we get o(i) = Wx(i) + b, where ŷ(i) =
softmax(o(i) ).

Vectorization for Minibatches

To improve computational efficiency and take advantage of GPUs, we typically carry out vector
calculations for minibatches of data. Assume that we are given a minibatch X of examples with
dimensionality d and batch size n. Moreover, assume that we have q categories (outputs). Then
the minibatch features X are in Rn×d , weights W ∈ Rd×q , and the bias satisfies b ∈ Rq .

O = XW + b,
(3.4.5)
Ŷ = softmax(O).

This accelerates the dominant operation into a matrix-matrix product XW vs. the matrix-vector
products we would be executing if we processed one example at a time. The softmax itself can be
computed by exponentiating all entries in O and then normalizing them by the sum.
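
A quick shape check makes this bookkeeping concrete; the sizes below are made up for
illustration (not part of the d2l package):

from mxnet import np, npx
npx.set_np()

n, d, q = 2, 4, 3  # batch size, number of features, number of categories
X = np.random.normal(size=(n, d))
W = np.random.normal(size=(d, q))
b = np.zeros(q)
O = np.dot(X, W) + b  # b is broadcast across the n rows of the minibatch
print(O.shape)  # (2, 3): one row of q logits per example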

3.4.2 Loss Function

Next, we need a loss function to measure the quality of our predicted probabilities. We will rely on
likelihood maximization, the very same concept that we encountered when providing a probabilis-
tic justification for the least squares objective in linear regression (Section 3.1).

Log-Likelihood

The softmax function gives us a vector ŷ, which we can interpret as estimated conditional prob-
abilities of each class given the input x, e.g., ŷ1 = P̂ (y = cat | x). We can compare the estimates
with reality by checking how probable the actual classes are according to our model, given the
features.

P(Y | X) = ∏_{i=1}^{n} P(y^{(i)} | x^{(i)}) and thus − log P(Y | X) = ∑_{i=1}^{n} − log P(y^{(i)} | x^{(i)}).    (3.4.6)

Maximizing P (Y | X) (and thus equivalently minimizing − log P (Y | X)) corresponds to predict-


ing the label well. This yields the loss function (we dropped the superscript (i) to avoid notation
clutter):

l = − log P(y | x) = − ∑_j y_j log ŷ_j.    (3.4.7)

For reasons explained later on, this loss function is commonly called the cross-entropy loss. Here,
we used that by construction ŷ is a discrete probability distribution and that the vector y is a one-
hot vector. Hence the sum over all coordinates j vanishes for all but one term. Since all ŷj are
probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be



minimized any further if we correctly predict y with certainty, i.e., if P (y | x) = 1 for the correct
label. Note that this is often not possible. For example, there might be label noise in the dataset
(some examples may be mislabeled). It may also not be possible when the input features are not
sufficiently informative to classify every example perfectly.

Softmax and Derivatives

Since the softmax and the corresponding loss are so common, it is worthwhile understanding a
bit better how it is computed. Plugging o into the definition of the loss l and using the definition
of the softmax we obtain:
l = − ∑_j y_j log ŷ_j = ∑_j y_j log ∑_k exp(o_k) − ∑_j y_j o_j = log ∑_k exp(o_k) − ∑_j y_j o_j.    (3.4.8)

To understand a bit better what is going on, consider the derivative with respect to o. We get

∂_{o_j} l = exp(o_j) / ∑_k exp(o_k) − y_j = softmax(o)_j − y_j = P(y = j | x) − y_j.    (3.4.9)

In other words, the gradient is the difference between the probability assigned to the true class
by our model, as expressed by the probability P (y | x), and what actually happened, as expressed
by y. In this sense, it is very similar to what we saw in regression, where the gradient was the
difference between the observation y and estimate ŷ. This is no coincidence. In any exponential
family54 model, the gradients of the log-likelihood are given by precisely this term. This fact makes
computing gradients easy in practice.
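
We can verify this identity numerically with autograd; the following small sanity check (not
part of the d2l package) compares the automatically computed gradient against softmax(o) − y:

from mxnet import autograd, np, npx
npx.set_np()

o = np.array([1.0, -2.0, 3.0])
y = np.array([0.0, 0.0, 1.0])  # one-hot label
o.attach_grad()
with autograd.record():
    l = np.log(np.exp(o).sum()) - (y * o).sum()  # the loss in eq. (3.4.8)
l.backward()
print(o.grad)  # gradient from automatic differentiation
print(np.exp(o) / np.exp(o).sum() - y)  # softmax(o) - y, per eq. (3.4.9)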

Cross-Entropy Loss

Now consider the case where we observe not just a single outcome but an entire distribution over
outcomes. We can use the same representation as before for y. The only difference is that rather
than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector,
say (0.1, 0.2, 0.7). The math that we used previously to define the loss l still works out fine, just that
the interpretation is slightly more general. It is the expected value of the loss for a distribution
over labels.

l(y, ŷ) = − ∑_j y_j log ŷ_j.    (3.4.10)

This loss is called the cross-entropy loss and it is one of the most commonly used losses for multi-
class classification. We can demystify the name by introducing the basics of information theory.

3.4.3 Information Theory Basics

Information theory deals with the problem of encoding, decoding, transmitting, and manipulating
information (also known as data) in as concise a form as possible.
54 https://en.wikipedia.org/wiki/Exponential_family



Entropy

The central idea in information theory is to quantify the information content in data. This quantity
places a hard limit on our ability to compress the data. In information theory, this quantity is
called the entropy55 of a distribution p, and it is captured by the following equation:

H[p] = − ∑_j p(j) log p(j).    (3.4.11)

One of the fundamental theorems of information theory states that in order to encode data drawn
randomly from the distribution p, we need at least H[p] “nats” to encode it. If you wonder what a
“nat” is, it is the equivalent of bit but when using a code with base e rather than one with base 2.
One nat is 1/log(2) ≈ 1.44 bit. H[p]/2 is often also called the binary entropy.

Surprisal

You might be wondering what compression has to do with prediction. Imagine that we have a
stream of data that we want to compress. If it is always easy for us to predict the next token,
then this data is easy to compress! Take the extreme example where every token in the stream
always takes the same value. That is a very boring data stream! And not only is it boring, but it is
easy to predict. Because they are always the same, we do not have to transmit any information to
communicate the contents of the stream. Easy to predict, easy to compress.
However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our
surprise is greater when we assigned an event lower probability. For reasons that we will elaborate
in the appendix, Claude Shannon settled on log(1/p(j)) = − log p(j) to quantify oneʼs surprisal at
observing an event j having assigned it a (subjective) probability p(j). The entropy is then the
expected surprisal when one assigned the correct probabilities (that truly match the data-generating
process). The entropy of the data is then the least surprised that one can ever be (in expectation).

Cross-Entropy Revisited

So if entropy is the level of surprise experienced by someone who knows the true probability, then
you might be wondering, what is cross-entropy? The cross-entropy from p to q, denoted H(p, q),
is the expected surprisal of an observer with subjective probabilities q upon seeing
data that was actually generated according to probabilities p. The lowest possible cross-entropy
is achieved when p = q. In this case, the cross-entropy from p to q is H(p, p) = H(p). Relating
this back to our classification objective, even if we get the best possible predictions, we will
never be perfect if the labels are intrinsically noisy. Our loss is lower-bounded by the entropy given
by the actual conditional distributions P(y | x).

Kullback Leibler Divergence

Perhaps the most common way to measure the distance between two distributions is to calculate
the Kullback Leibler divergence D(p∥q). This is simply the difference between the cross-entropy
55
[Link]



and the entropy, i.e., the additional cross-entropy incurred over the irreducible minimum value it
could take:
D(p∥q) = H(p, q) − H[p] = ∑_j p(j) log(p(j)/q(j)).    (3.4.12)

Note that in classification, we do not know the true p, so we cannot compute the entropy directly.
However, because the entropy is out of our control, minimizing D(p∥q) with respect to q is equiv-
alent to minimizing the cross-entropy loss.
In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing
the likelihood of the observed data; and (ii) as minimizing our surprise (and thus the number of
bits) required to communicate the labels.
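
These quantities are straightforward to compute directly. A short sketch with made-up
distributions p and q (not part of the d2l package):

from mxnet import np, npx
npx.set_np()

p = np.array([0.1, 0.2, 0.7])  # data-generating distribution
q = np.array([0.3, 0.3, 0.4])  # model's subjective probabilities

H_p = -(p * np.log(p)).sum()   # entropy H[p]
H_pq = -(p * np.log(q)).sum()  # cross-entropy H(p, q)
print(H_p, H_pq, H_pq - H_p)   # D(p||q) = H(p, q) - H[p] is nonnegative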

3.4.4 Model Prediction and Evaluation

After training the softmax regression model, given any example features, we can predict the prob-
ability of each output category. Normally, we use the category with the highest predicted proba-
bility as the output category. The prediction is correct if it is consistent with the actual category
(label). In the next part of the experiment, we will use accuracy to evaluate the modelʼs perfor-
mance. This is equal to the ratio between the number of correct predictions and the total number
of predictions.

Summary

• We introduced the softmax operation, which takes a vector and maps it into probabilities.
• Softmax regression applies to classification problems. It uses the probability distribution of
the output category in the softmax operation.
• Cross-entropy is a good measure of the difference between two probability distributions. It
measures the number of bits needed to encode the data given our model.

Exercises

1. Show that the Kullback-Leibler divergence D(p∥q) is nonnegative for all distributions p and
q. Hint: use Jensenʼs inequality, i.e., use the fact that − log x is a convex function.

2. Show that log ∑_j exp(o_j) is a convex function in o.
3. We can explore the connection between exponential families and the softmax in some more
depth
• Compute the second derivative of the cross-entropy loss l(y, ŷ) for the softmax.
• Compute the variance of the distribution given by softmax(o) and show that it matches
the second derivative computed above.
4. Assume that we have three classes which occur with equal probability, i.e., the probability vector
is ( 13 , 13 , 13 ).
• What is the problem if we try to design a binary code for it? Can we match the entropy
lower bound on the number of bits?



• Can you design a better code? Hint: what happens if we try to encode two independent
observations? What if we encode n observations jointly?
5. Softmax is a misnomer for the mapping introduced above (but everyone in deep learning
uses it). The real softmax is defined as RealSoftMax(a, b) = log(exp(a) + exp(b)).
• Prove that RealSoftMax(a, b) > max(a, b).
• Prove that this holds for λ−1 RealSoftMax(λa, λb), provided that λ > 0.
• Show that for λ → ∞ we have λ−1 RealSoftMax(λa, λb) → max(a, b).
• What does the soft-min look like?
• Extend this to more than two numbers.

3.5 The Image Classification Dataset (Fashion-MNIST)

In Section 17.9, we trained a naive Bayes classifier, using the MNIST dataset introduced in 1998
(LeCun et al., 1998). While MNIST had a good run as a benchmark dataset, even simple models by
todayʼs standards achieve classification accuracy over 95%, making it unsuitable for distinguish-
ing between stronger models and weaker ones. Today, MNIST serves more as a sanity check
than as a benchmark. To up the ante just a bit, we will focus our discussion in the coming sections
on the qualitatively similar, but comparatively complex Fashion-MNIST dataset (Xiao et al., 2017),
which was released in 2017.

%matplotlib inline
import d2l
from mxnet import gluon
import sys

d2l.use_svg_display()

3.5.1 Getting the Dataset

Just as with MNIST, Gluon makes it easy to download and load the FashionMNIST dataset into
memory via the FashionMNIST class contained in gluon.data.vision. We briefly work through the
mechanics of loading and exploring the dataset below. Please refer to Section 17.9 for more details
on loading data.

mnist_train = gluon.data.vision.FashionMNIST(train=True)
mnist_test = gluon.data.vision.FashionMNIST(train=False)

FashionMNIST consists of images from 10 categories, each represented by 6k images in the train-
ing set and by 1k in the test set. Consequently the training set and the test set contain 60k and 10k
images, respectively.



len(mnist_train), len(mnist_test)

(60000, 10000)

The images in Fashion-MNIST are associated with the following categories: t-shirt, trousers,
pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. The following function converts
between numeric label indices and their names in text.

# Saved in the d2l package for later use


def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

We can now create a function to visualize these examples.

# Saved in the d2l package for later use


def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
    """Plot a list of images."""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        ax.imshow(img.asnumpy())
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes

Here are the images and their corresponding labels (in text) for the first few examples in the train-
ing dataset.

X, y = mnist_train[:18]
show_images(X.squeeze(axis=-1), 2, 9, titles=get_fashion_mnist_labels(y));

3.5.2 Reading a Minibatch

To make our life easier when reading from the training and test sets, we use a DataLoader rather
than creating one from scratch, as we did in Section 3.2. Recall that at each iteration, a DataLoader



reads a minibatch of data with size batch_size each time.
During training, reading data can be a significant performance bottleneck, especially when our
model is simple or when our computer is fast. A handy feature of Gluonʼs DataLoader is the ability
to use multiple processes to speed up data reading. For instance, we can set aside 4 processes to
read the data (via num_workers). Because this feature is not currently supported on Windows, the
following code checks the platform to make sure that we do not saddle our Windows-using friends
with error messages later on.

# Saved in the d2l package for later use


def get_dataloader_workers(num_workers=4):
    # 0 means no additional process is used to speed up the reading of data
    if sys.platform.startswith('win'):
        return 0
    else:
        return num_workers

Below, we convert the image data from uint8 to 32-bit floating point numbers using the ToTensor
class. Additionally, the transformer will divide all numbers by 255 so that all pixels have values
between 0 and 1. The ToTensor class also moves the image channel from the last dimension to
the first dimension to facilitate the convolutional neural network calculations introduced later.
Through the transform_first function of the dataset, we apply the transformation of ToTensor to
the first element of each instance (image and label).

batch_size = 256
transformer = gluon.data.vision.transforms.ToTensor()
train_iter = gluon.data.DataLoader(mnist_train.transform_first(transformer),
                                   batch_size, shuffle=True,
                                   num_workers=get_dataloader_workers())

Letʼs look at the time it takes to read the training data.

timer = d2l.Timer()
for X, y in train_iter:
    continue
'%.2f sec' % timer.stop()

'1.75 sec'

3.5.3 Putting All Things Together

Now we define the load_data_fashion_mnist function that obtains and reads the Fashion-MNIST
dataset. It returns the data iterators for both the training set and validation set. In addition, it
accepts an optional argument to resize images to another shape.

# Saved in the d2l package for later use


def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and then load into memory."""
    dataset = gluon.data.vision
    trans = [dataset.transforms.Resize(resize)] if resize else []
    trans.append(dataset.transforms.ToTensor())
    trans = dataset.transforms.Compose(trans)
    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
                                  num_workers=get_dataloader_workers()),
            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
                                  num_workers=get_dataloader_workers()))

Below, we verify that image resizing works.

train_iter, test_iter = load_data_fashion_mnist(32, (64, 64))


for X, y in train_iter:
    print(X.shape)
    break

(32, 1, 64, 64)

We are now ready to work with the FashionMNIST dataset in the sections that follow.

Summary

• Fashion-MNIST is an apparel classification dataset consisting of images representing 10 cat-


egories.
• We will use this dataset in subsequent sections and chapters to evaluate various classification
algorithms.
• We store the shape of each image with height h and width w pixels as h × w or (h, w).
• Data iterators are a key component for efficient performance. Rely on well-implemented
iterators that exploit multi-threading to avoid slowing down your training loop.

Exercises

1. Does reducing the batch_size (for instance, to 1) affect read performance?


2. For non-Windows users, try modifying num_workers to see how it affects read performance.
Plot the performance against the number of workers employed.
3. Use the MXNet documentation to see which other datasets are available in
mxnet.gluon.data.vision.
4. Use the MXNet documentation to see which other transformations are available in
mxnet.gluon.data.vision.transforms.



3.6 Implementation of Softmax Regression from Scratch

Just as we implemented linear regression from scratch, we believe that multiclass logistic (soft-
max) regression is similarly fundamental and you ought to know the gory details of how to imple-
ment it yourself. As with linear regression, after doing things by hand we will breeze through an
implementation in Gluon for comparison. To begin, letʼs import the familiar packages.

import d2l
from mxnet import autograd, np, npx, gluon
from IPython import display
npx.set_np()

We will work with the Fashion-MNIST dataset, just introduced in Section 3.5, setting up an iterator
with batch size 256.

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

3.6.1 Initializing Model Parameters

As in our linear regression example, each example here will be represented by a fixed-length vec-
tor. Each example in the raw data is a 28 × 28 image. In this section, we will flatten each image,
treating them as 784 1D vectors. In the future, we will talk about more sophisticated strategies for
exploiting the spatial structure in images, but for now we treat each pixel location as just another
feature.
Recall that in softmax regression, we have as many outputs as there are categories. Because our
dataset has 10 categories, our network will have an output dimension of 10. Consequently, our
weights will constitute a 784 × 10 matrix and the biases will constitute a 1 × 10 vector. As with
linear regression, we will initialize our weights W with Gaussian noise and our biases to take the
initial value 0.

num_inputs = 784
num_outputs = 10

W = np.random.normal(0, 0.01, (num_inputs, num_outputs))
b = np.zeros(num_outputs)

Recall that we need to attach gradients to the model parameters. More literally, we are allocat-
ing memory for future gradients to be stored and notifying MXNet that we will want to calculate
gradients with respect to these parameters in the future.

W.attach_grad()
b.attach_grad()

3.6.2 The Softmax

Before implementing the softmax regression model, letʼs briefly review how operators such as
sum work along specific dimensions in an ndarray. Given a matrix X we can sum over all elements



(default) or only over elements in the same axis, i.e., the same column (axis=0) or the same row (axis=1).
Note that if X is an array with shape (2, 3) and we sum over the columns (X.sum(axis=0)), the
result will be a (1D) vector with shape (3,). If we want to keep the number of axes in the original
array (resulting in a 2D array with shape (1, 3)), rather than collapsing out the dimension that
we summed over, we can specify keepdims=True when invoking sum.

X = np.array([[1, 2, 3], [4, 5, 6]])


print(X.sum(axis=0, keepdims=True), '\n', X.sum(axis=1, keepdims=True))

[[5. 7. 9.]]
[[ 6.]
[15.]]

We are now ready to implement the softmax function. Recall that softmax consists of three steps:
First, we exponentiate each term (using exp). Then, we sum over each row (we have one row per
example in the batch) to get the normalization constants for each example. Finally, we divide each
row by its normalization constant, ensuring that the result sums to 1. Before looking at the code,
letʼs recall what this looks like expressed as an equation:

softmax(X)_ij = exp(X_ij) / ∑_k exp(X_ik).    (3.6.1)

The denominator, or normalization constant, is also sometimes called the partition function
(and its logarithm is called the log-partition function). The origins of that name are in statistical
physics58, where a related equation models the distribution over an ensemble of particles.

def softmax(X):
    X_exp = np.exp(X)
    partition = X_exp.sum(axis=1, keepdims=True)
    return X_exp / partition  # The broadcast mechanism is applied here

As you can see, for any random input, we turn each element into a non-negative number. More-
over, each row sums up to 1, as is required for a probability. Note that while this looks correct
mathematically, we were a bit sloppy in our implementation because we failed to take precautions
against numerical overflow or underflow due to large (or very small) elements of the matrix, as
we did in Section 17.9.

X = np.random.normal(size=(2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(axis=1)

(array([[0.21324193, 0.33961776, 0.1239742 , 0.27106097, 0.05210521],


[0.11462264, 0.3461234 , 0.19401033, 0.29583326, 0.04941036]]),
array([1.0000001, 1. ]))

3.6.3 The Model

Now that we have defined the softmax operation, we can implement the softmax regression model.
The below code defines the forward pass through the network. Note that we flatten each original
58 https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)



image in the batch into a vector with length num_inputs with the reshape function before passing
the data through our model.

def net(X):
    return softmax(np.dot(X.reshape(-1, num_inputs), W) + b)

3.6.4 The Loss Function

Next, we need to implement the cross-entropy loss function, introduced in Section 3.4. This may
be the most common loss function in all of deep learning because, at the moment, classification
problems far outnumber regression problems.
Recall that cross-entropy takes the negative log likelihood of the predicted probability assigned
to the true label − log P (y | x). Rather than iterating over the predictions with a Python for loop
(which tends to be inefficient), we can use integer array indexing, which allows us to easily select the
appropriate terms from the matrix of softmax entries. Below, we illustrate this technique on a
toy example, with 3 categories and 2 examples.

y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])


y_hat[[0, 1], [0, 2]]

array([0.1, 0.5])

Now we can implement the cross-entropy loss function efficiently with just one line of code.

def cross_entropy(y_hat, y):
    return - np.log(y_hat[range(len(y_hat)), y])

3.6.5 Classification Accuracy

Given the predicted probability distribution y_hat, we typically choose the class with highest pre-
dicted probability whenever we must output a hard prediction. Indeed, many applications require
that we make a choice. Gmail must categorize an email into Primary, Social, Updates, or Forums.
It might estimate probabilities internally, but at the end of the day it has to choose one among the
categories.
When predictions are consistent with the actual category y, they are correct. The classification
accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize
accuracy directly (it is not differentiable), it is often the performance metric that we care most
about, and we will nearly always report it when training classifiers.
To compute accuracy we do the following: First, we execute y_hat.argmax(axis=1) to gather the
predicted classes (given by the indices of the largest entries in each row). The result has the same
shape as the variable y. Now we just need to check how frequently the two match. Since the
equality operator == is datatype-sensitive (e.g., an int and a float32 are never equal), we also need
to convert both to the same type (we pick float32). The result is an ndarray containing entries of
0 (false) and 1 (true). Taking the mean yields the desired result.



# Saved in the d2l package for later use
def accuracy(y_hat, y):
    if y_hat.shape[1] > 1:
        return float((y_hat.argmax(axis=1) == y.astype('float32')).sum())
    else:
        return float((y_hat.astype('int32') == y.astype('int32')).sum())

We will continue to use the variables y_hat and y defined above, as the predicted
probability distribution and label, respectively. We can see that the first exampleʼs prediction cat-
egory is 2 (the largest element of the row is 0.6 with an index of 2), which is inconsistent with the
actual label, 0. The second exampleʼs prediction category is 2 (the largest element of the row is
0.5 with an index of 2), which is consistent with the actual label, 2. Therefore, the classification
accuracy rate for these two examples is 0.5.

y = np.array([0, 2])
accuracy(y_hat, y) / len(y)

0.5

Similarly, we can evaluate the accuracy for model net on the dataset (accessed via data_iter).

# Saved in the d2l package for later use


def evaluate_accuracy(net, data_iter):
    metric = Accumulator(2)  # num_corrected_examples, num_examples
    for X, y in data_iter:
        metric.add(accuracy(net(X), y), y.size)
    return metric[0] / metric[1]

Here Accumulator is a utility class to accumulate sums over multiple numbers.

# Saved in the d2l package for later use


class Accumulator(object):
    """Sum a list of numbers over time"""

    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a+b for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0] * len(self.data)

    def __getitem__(self, i):
        return self.data[i]

Because we initialized the net model with random weights, the accuracy of this model should be
close to random guessing, i.e., 0.1 for 10 classes.

evaluate_accuracy(net, test_iter)



0.0925

3.6.6 Model Training

The training loop for softmax regression should look strikingly familiar if you read through our
implementation of linear regression in Section 3.2. Here we refactor the implementation to make
it reusable. First, we define a function to train for one data epoch. Note that updater is a general
function to update the model parameters, which accepts the batch size as an argument. It can be
either a wrapper of the d2l.sgd function or a Gluon trainer.

# Saved in the d2l package for later use


def train_epoch_ch3(net, train_iter, loss, updater):
    metric = Accumulator(3)  # train_loss_sum, train_acc_sum, num_examples
    if isinstance(updater, gluon.Trainer):
        updater = updater.step
    for X, y in train_iter:
        # Compute gradients and update parameters
        with autograd.record():
            y_hat = net(X)
            l = loss(y_hat, y)
        l.backward()
        updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
    # Return training loss and training accuracy
    return metric[0]/metric[2], metric[1]/metric[2]

Before showing the implementation of the training function, we define a utility class that draws
data in animation. Again, it aims to simplify the code in later chapters.

# Saved in the d2l package for later use


class Animator(object):
    def __init__(self, xlabel=None, ylabel=None, legend=[], xlim=None,
                 ylim=None, xscale='linear', yscale='linear', fmts=None,
                 nrows=1, ncols=1, figsize=(3.5, 2.5)):
        """Incrementally plot multiple lines."""
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda to capture arguments
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        """Add multiple data points into the figure."""
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        if not self.fmts:
            self.fmts = ['-'] * n
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)

The training function then runs multiple epochs and visualizes the training progress.

# Saved in the d2l package for later use


def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs],
                        ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch+1, train_metrics+(test_acc,))

Again, we use minibatch stochastic gradient descent to optimize the loss function of the model.
Note that the number of epochs (num_epochs) and the learning rate (lr) are both adjustable hyper-
parameters. By changing their values, we may be able to increase the classification accuracy of
the model. In practice we will want to split our data three ways into training, validation, and test
data, using the validation data to choose the best values of our hyperparameters.

num_epochs, lr = 10, 0.1

def updater(batch_size):
    return d2l.sgd([W, b], lr, batch_size)

train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)



3.6.7 Prediction

Now that training is complete, our model is ready to classify some images. Given a series of im-
ages, we will compare their actual labels (first line of text output) and the model predictions (sec-
ond line of text output).

# Saved in the d2l package for later use


def predict_ch3(net, test_iter, n=6):
    for X, y in test_iter:
        break
    trues = d2l.get_fashion_mnist_labels(y)
    preds = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1))
    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]
    d2l.show_images(X[0:n].reshape(n, 28, 28), 1, n, titles=titles[0:n])

predict_ch3(net, test_iter)

Summary

With softmax regression, we can train models for multi-category classification. The training loop
is very similar to that in linear regression: retrieve and read data, define models and loss functions,
then train models using optimization algorithms. As you will soon find out, most common deep
learning models have similar training procedures.



Exercises

1. In this section, we directly implemented the softmax function based on the mathematical
definition of the softmax operation. What problems might this cause (hint: try to calculate
the size of exp(50))?
2. The function cross_entropy in this section is implemented according to the definition of
the cross-entropy loss function. What could be the problem with this implementation (hint:
consider the domain of the logarithm)?
3. What solutions can you think of to fix the two problems above?
4. Is it always a good idea to return the most likely label? E.g., would you do this for medical
diagnosis?
5. Assume that we want to use softmax regression to predict the next word based on some
features. What are some problems that might arise from a large vocabulary?

3.7 Concise Implementation of Softmax Regression

Just as Gluon made it much easier to implement linear regression in Section 3.3, we will find it
similarly (or possibly more) convenient for implementing classification models. Again, we begin
with our import ritual.

import d2l
from mxnet import gluon, init, npx
from [Link] import nn
npx.set_np()

Letʼs stick with the Fashion-MNIST dataset and keep the batch size at 256 as in the last section.

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

3.7.1 Initializing Model Parameters

As mentioned in Section 3.4, the output layer of softmax regression is a fully-connected (Dense)
layer. Therefore, to implement our model, we just need to add one Dense layer with 10 outputs to
our Sequential. Again, here, the Sequential is not really necessary, but we might as well form the
habit since it will be ubiquitous when implementing deep models. Again, we initialize the weights
at random with zero mean and standard deviation 0.01.

net = nn.Sequential()
net.add(nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))



3.7.2 The Softmax

In the previous example, we calculated our modelʼs output and then ran this output through the
cross-entropy loss. Mathematically, that is a perfectly reasonable thing to do. However, from a
computational perspective, exponentiation can be a source of numerical stability issues (as dis-
cussed in Section 17.9). Recall that the softmax function calculates ŷ_j = exp(z_j) / ∑_{i=1}^{n} exp(z_i),
where ŷ_j is the j-th element of y_hat and z_j is the j-th element of the input y_linear variable,
as computed by the softmax.
If some of the z_i are very large (i.e., very positive), then exp(z_i) might be larger than the largest number
we can have for certain types of float (i.e., overflow). This would make the denominator (and/or
numerator) inf and we wind up encountering either 0, inf, or nan for ŷ_j. In these situations we
do not get a well-defined return value for cross_entropy. One trick to get around this is to first
subtract max(z_i) from all z_i before proceeding with the softmax calculation. You can verify that
this shifting of each z_i by a constant does not change the return value of softmax.
After the subtraction and normalization step, it might be that possible that some zj have large
negative values and thus that the corresponding ezj will take values close to zero. These might
be rounded to zero due to finite precision (i.e underflow), making ŷj zero and giving us -inf for
log(ŷj ). A few steps down the road in backpropagation, we might find ourselves faced with a
screenful of the dreaded not-a-number (nan) results.
Fortunately, we are saved by the fact that even though we are computing exponential functions,
we ultimately intend to take their log (when calculating the cross-entropy loss). By combining
these two operators (softmax and cross_entropy) together, we can escape the numerical stability
issues that might otherwise plague us during backpropagation. As shown in the equation below,
we avoid computing e^{z_j} explicitly and can instead use z_j directly, thanks to the canceling in log(exp(·)):

log(ŷ_j) = log( e^{z_j} / ∑_{i=1}^{n} e^{z_i} )
         = log(e^{z_j}) − log( ∑_{i=1}^{n} e^{z_i} )    (3.7.1)
         = z_j − log( ∑_{i=1}^{n} e^{z_i} ).

We will want to keep the conventional softmax function handy in case we ever want to evaluate
the probabilities output by our model. But instead of passing softmax probabilities into our new
loss function, we will just pass the logits and compute the softmax and its log all at once inside the
softmax_cross_entropy loss function, which does smart things like the log-sum-exp trick (see the
article on Wikipedia).
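
For intuition, here is a minimal sketch (ours, not part of the d2l package; the name
stable_log_softmax is our own) of a numerically stable log-softmax built on exactly this trick,
assuming MXNet's NumPy interface:

from mxnet import np, npx
npx.set_np()

def stable_log_softmax(z):
    # Subtracting the row-wise max makes every exponent non-positive, so exp
    # cannot overflow; the shift does not change the value of the softmax
    shifted = z - z.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

stable_log_softmax(np.array([[50.0, 0.0, -50.0]]))  # finite values, no nan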

loss = gluon.loss.SoftmaxCrossEntropyLoss()

3.7.3 Optimization Algorithm

Here, we use minibatch stochastic gradient descent with a learning rate of 0.1 as the optimiza-
tion algorithm. Note that this is the same as we applied in the linear regression example and it
illustrates the general applicability of the optimizers.


trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

3.7.4 Training

Next we call the training function defined in the last section to train a model.

num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

As before, this algorithm converges to a solution that achieves an accuracy of 83.7%, albeit this
time with fewer lines of code than before. Note that in many cases, Gluon takes additional pre-
cautions beyond these most well-known tricks to ensure numerical stability, saving us from even
more pitfalls that we would encounter if we tried to code all of our models from scratch in practice.

Exercises

1. Try adjusting the hyper-parameters, such as the batch size, number of epochs, and learning rate, to see what
the results are.
2. Why might the test accuracy decrease again after a while? How could we fix this?



4 | Multilayer Perceptrons

In this chapter, we will introduce your first truly deep networks. The simplest deep networks are
called multilayer perceptrons, and they consist of many layers of neurons each fully connected
to those in the layer below (from which they receive input) and those above (which they, in turn,
influence). When we train high-capacity models we run the risk of overfitting. Thus, we will
need to provide you with a first rigorous introduction to the notions of overfitting, underfitting, and
capacity control. To help you combat these problems, we will introduce regularization techniques
such as dropout and weight decay. We will also discuss issues relating to numerical stability and
parameter initialization that are key to successfully training deep networks. Throughout, we focus
on applying models to real data, aiming to give the reader a firm grasp not just of the concepts
but also of the practice of using deep networks. We punt matters relating to the computational
performance, scalability and efficiency of our models to subsequent chapters.

4.1 Multilayer Perceptron

In the previous chapters, we showed how you could implement multiclass logistic regression (also
called softmax regression) for classifying images of clothing into the 10 possible categories. To get
there, we had to learn how to wrangle data, coerce our outputs into a valid probability distribution
(via softmax), how to apply an appropriate loss function, and how to optimize over our parameters.
Now that weʼve covered these preliminaries, we are free to focus our attention on the more exciting
enterprise of designing powerful models using deep neural networks.

4.1.1 Hidden Layers

Letʼs recall linear regression and softmax regression with an example as illustrated in Fig. 4.1.1.
In general, we mapped our inputs directly to our outputs via a single linear transformation:

ô = softmax(Wx + b). (4.1.1)

Fig. 4.1.1: Single layer perceptron with 5 output units.

If our labels really were related to our input data by an approximately linear function, then this
approach would be perfect. But linearity is a strong assumption. Linearity implies that for whatever
target value we are trying to predict, increasing the value of each of our inputs should either drive
the value of the output up or drive it down, irrespective of the value of the other inputs.
Sometimes this makes sense! Say we are trying to predict whether an individual will or will not
repay a loan. We might reasonably imagine that all else being equal, an applicant with a higher
income would be more likely to repay than one with a lower income. In these cases, linear models
might perform well, and they might even be hard to beat.
But what about classifying images in FashionMNIST? Should increasing the intensity of the pixel
at location (13, 17) always increase the likelihood that the image depicts a pocketbook? That seems
ridiculous because we all know that you cannot make sense out of an image without accounting
for the interactions among pixels.

From One to Many

As another case, consider trying to classify images based on whether they depict cats or dogs given
black-and-white images.
If we use a linear model, weʼd basically be saying that for each pixel, increasing its value (making
it more white) must always increase the probability that the image depicts a dog or must always
increase the probability that the image depicts a cat. We would be making the absurd assumption
that the only requirement for differentiating cats vs. dogs is to assess how bright they are. That
approach is doomed to fail in a world that contains both black dogs and black cats, and both white
dogs and white cats.
Teasing out what is depicted in an image generally requires allowing more complex relationships
between our inputs and outputs. Thus we need models capable of discovering patterns that might
be characterized by interactions among the many features. We can overcome these limitations of
linear models and handle a more general class of functions by incorporating one or more hidden
layers. The easiest way to do this is to stack many layers of neurons on top of each other. Each
layer feeds into the layer above it, until we generate an output. This architecture is commonly
called a multilayer perceptron, often abbreviated as MLP. The neural network diagram for an MLP
looks like Fig. 4.1.2.

Fig. 4.1.2: Multilayer perceptron with hidden layers. This example contains a hidden layer with 5
hidden units in it.



The multilayer perceptron above has 4 inputs and 3 outputs, and the hidden layer in the middle
contains 5 hidden units. Since the input layer does not involve any calculations, building this
network would consist of implementing 2 layers of computation. The inputs are fully connected
to the neurons in the hidden layer. Likewise, the neurons in the hidden layer
are fully connected to the neurons in the output layer.

From Linear to Nonlinear

We can write out the calculations that define this one-hidden-layer MLP in mathematical notation
as follows:
h = W1 x + b1 ,
o = W2 h + b2 , (4.1.2)
ŷ = softmax(o).

By adding another layer, we have added two new sets of parameters, but what have we gained in
exchange? In the model defined above, we do not achieve anything for our troubles!
That is because our hidden units are just a linear function of the inputs and the outputs (pre-
softmax) are just a linear function of the hidden units. A linear function of a linear function is
itself a linear function. That means that for any values of the weights, we could just collapse out
the hidden layer yielding an equivalent single-layer model using W = W2 W1 and b = W2 b1 + b2 .

o = W2 h + b2 = W2 (W1 x + b1 ) + b2 = (W2 W1 )x + (W2 b1 + b2 ) = Wx + b. (4.1.3)
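
As a quick numeric sanity check (our sketch, not from the book's code base), we can verify the
collapse with random parameters:

from mxnet import np, npx
npx.set_np()

# Two stacked linear layers (4 -> 5 -> 3) versus the collapsed single layer
W1, b1 = np.random.normal(size=(5, 4)), np.random.normal(size=(5,))
W2, b2 = np.random.normal(size=(3, 5)), np.random.normal(size=(3,))
x = np.random.normal(size=(4,))
two_layer = np.dot(W2, np.dot(W1, x) + b1) + b2
one_layer = np.dot(np.dot(W2, W1), x) + (np.dot(W2, b1) + b2)
np.abs(two_layer - one_layer).max()  # ~0 up to floating point rounding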

In order to get a benefit from multilayer architectures, we need another key ingredient—a non-
linearity σ to be applied to each of the hidden units after each layerʼs linear transformation. The
most popular choice for the nonlinearity these days is the rectified linear unit (ReLU) max(x, 0).
After incorporating these non-linearities it becomes impossible to merge layers.

h = σ(W1 x + b1 ),
o = W2 h + b2 , (4.1.4)
ŷ = softmax(o).

Clearly, we could continue stacking such hidden layers, e.g., h1 = σ(W1 x+b1 ) and h2 = σ(W2 h1 +
b2 ) on top of each other to obtain a true multilayer perceptron.
Multilayer perceptrons can account for complex interactions in the inputs because the hidden
neurons depend on the values of each of the inputs. Itʼs easy to design a hidden node that does ar-
bitrary computation, such as, for instance, logical operations on its inputs. Moreover, for certain
choices of the activation function itʼs widely known that multilayer perceptrons are universal ap-
proximators. That means that even for a single-hidden-layer neural network, with enough nodes,
and the right set of weights, we can model any function at all! Actually learning that function is the
hard part.
Moreover, just because a single-layer network can learn any function does not mean that you
should try to solve all of your problems with single-layer networks. It turns out that we can approx-
imate many functions much more compactly if we use deeper (vs wider) neural networks. Weʼll
get more into the math in a subsequent chapter, but for now letʼs actually build an MLP. In this
example, weʼll implement a multilayer perceptron with two hidden layers and one output layer.



Vectorization and Minibatch

As before, by the matrix X, we denote a minibatch of inputs. The calculations to produce outputs
from an MLP with two hidden layers can thus be expressed:

H1 = σ(W1 X + b1 ),
H2 = σ(W2 H1 + b2 ), (4.1.5)
O = softmax(W3 H2 + b3 ).

With some abuse of notation, we define the nonlinearity σ to apply to its inputs in a row-wise
fashion, i.e., one observation at a time. Note that we are also using the notation for softmax in the
same way to denote a row-wise operation. Often, as in this section, the activation functions that
we apply to hidden layers are not merely row-wise, but component-wise. That means that after
computing the linear portion of the layer, we can calculate each nodeʼs activation without looking
at the values taken by the other hidden units. This is true for most activation functions (the batch
normalization operation, which will be introduced in Section 7.5, is a notable exception to that rule).
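
To make the shapes concrete, here is a minimal forward-pass sketch (ours, not from the d2l
package) of the two-hidden-layer MLP in (4.1.5), using the row-wise XW convention adopted by
the code in this book (the equations above use the Wx convention) and ReLU as σ:

from mxnet import np, npx
npx.set_np()

def mlp_forward(X, W1, b1, W2, b2, W3, b3):
    # Two ReLU hidden layers followed by a softmax output, as in (4.1.5)
    H1 = np.maximum(np.dot(X, W1) + b1, 0)
    H2 = np.maximum(np.dot(H1, W2) + b2, 0)
    return npx.softmax(np.dot(H2, W3) + b3)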

%matplotlib inline
import d2l
from mxnet import autograd, np, npx
npx.set_np()

4.1.2 Activation Functions

Because they are so fundamental to deep learning, before going further, letʼs take a brief look at
some common activation functions.

ReLU Function

As stated above, the most popular choice, due to its simplicity of implementation and its efficacy in
training, is the rectified linear unit (ReLU). ReLUs provide a very simple nonlinear transformation.
Given the element z, the function is defined as the maximum of that element and 0.

ReLU(z) = max(z, 0). (4.1.6)

It can be understood that the ReLU function retains only positive elements and discards negative
elements (setting those nodes to 0). To get a better idea of what it looks like, we can plot it.
Because it is used so commonly, ndarray supports the relu function as a basic native operator.
As you can see, the activation function is piecewise linear.

x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.relu(x)
d2l.set_figsize((4, 2.5))
d2l.plot(x, y, 'x', 'relu(x)')



When the input is negative, the derivative of the ReLU function is 0 and when the input is positive, the
derivative of the ReLU function is 1. Note that the ReLU function is not differentiable when the input
takes a value precisely equal to 0. In these cases, we go with the left-hand-side (LHS) derivative and
say that the derivative is 0 when the input is 0. We can get away with this because the input may
never actually be zero. There is an old adage that if subtle boundary conditions matter, we are
probably doing (real) mathematics, not engineering. That conventional wisdom may apply here.
See the derivative of the ReLU function plotted below.

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu')

Note that there are many variants to the ReLU function, such as the parameterized ReLU (pReLU)
of He et al., 2015. This variation adds a linear term to the ReLU, so some information still gets
through, even when the argument is negative.

pReLU(x) = max(0, x) + α min(0, x). (4.1.7)
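
A one-line sketch of pReLU (ours; in He et al., 2015 the coefficient α is learned during training,
whereas we fix it here purely for illustration):

from mxnet import np, npx
npx.set_np()

def prelu(x, alpha=0.25):
    # Lets a fraction alpha of the signal through for negative arguments
    return np.maximum(0, x) + alpha * np.minimum(0, x)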

The reason for using the ReLU is that its derivatives are particularly well behaved: either they van-
ish or they just let the argument through. This makes optimization better behaved and it reduces
the issue of the vanishing gradient problem (more on this later).


Sigmoid Function

The sigmoid function transforms its inputs, which take values in ℝ, to the interval (0, 1). For that
reason, the sigmoid is often called a squashing function: it squashes any input in the range (−inf,
inf) to some value in the range (0, 1):

sigmoid(x) = 1 / (1 + exp(−x)).    (4.1.8)

In the earliest neural networks, scientists were interested in modeling biological neurons which
either fire or do not fire. Thus the pioneers of this field, going all the way back to McCulloch and
Pitts in the 1940s, were focused on thresholding units. A thresholding function takes either value
0 (if the input is below the threshold) or value 1 (if the input exceeds the threshold).
When attention shifted to gradient based learning, the sigmoid function was a natural choice be-
cause it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still com-
mon as activation functions on the output units, when we want to interpret the outputs as prob-
abilities for binary classification problems (you can think of the sigmoid as a special case of the
softmax) but the sigmoid has mostly been replaced by the simpler and easier to train ReLU for
most use in hidden layers. In the “Recurrent Neural Network” chapter, we will describe how sig-
moid units can be used to control the flow of information in a neural network thanks to its capacity
to transform the value range between 0 and 1.
See the sigmoid function plotted below. When the input is close to 0, the sigmoid function ap-
proaches a linear transformation.

with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)')

The derivative of the sigmoid function is given by the following equation:

d/dx sigmoid(x) = exp(−x) / (1 + exp(−x))² = sigmoid(x) (1 − sigmoid(x)).    (4.1.9)

The derivative of the sigmoid function is plotted below. Note that when the input is 0, the derivative of
the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either direction,
the derivative approaches 0.



y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid')

Tanh Function

Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs,
transforming them into elements on the interval between −1 and 1:

tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x)).    (4.1.10)

We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a
linear transformation. Although the shape of the function is similar to the sigmoid function, the
tanh function exhibits point symmetry about the origin of the coordinate system.

with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)')



The derivative of the tanh function is:

d/dx tanh(x) = 1 − tanh²(x).    (4.1.11)

The derivative of the tanh function is plotted below. As the input nears 0, the derivative of the tanh
function approaches a maximum of 1. And as we saw with the sigmoid function, as the input
moves away from 0 in either direction, the derivative of the tanh function approaches 0.

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh')

In summary, we now know how to incorporate nonlinearities to build expressive multilayer neural
network architectures. As a side note, your knowledge now already puts you in command of the
state of the art in deep learning, circa 1990. In fact, you have an advantage over anyone working in the
1990s, because you can leverage powerful open-source deep learning frameworks to build models
rapidly, using only a few lines of code. Previously, getting these nets training required researchers
to code up thousands of lines of C and Fortran.

Summary

• The multilayer perceptron adds one or multiple fully-connected hidden layers between the
output and input layers and transforms the output of the hidden layer via an activation func-
tion.
• Commonly-used activation functions include the ReLU function, the sigmoid function, and
the tanh function.

Exercises

1. Compute the derivative of the tanh and the pReLU activation function.
2. Show that a multilayer perceptron using only ReLU (or pReLU) constructs a continuous
piecewise linear function.
3. Show that tanh(x) + 1 = 2sigmoid(2x).



4. Assume we have a multilayer perceptron without nonlinearities between the layers. In par-
ticular, assume that we have d input dimensions, d output dimensions and that one of the
layers had only d/2 dimensions. Show that this network is less expressive (powerful) than a
single layer perceptron.
5. Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of
problems do you expect this to cause?

4.2 Implementation of Multilayer Perceptron from Scratch

Now that we know how multilayer perceptrons (MLPs) work in theory, letʼs implement them. First,
we import the required packages.

import d2l
from mxnet import gluon, np, npx
npx.set_np()

To compare against the results we previously achieved with vanilla softmax regression, we con-
tinue to use the Fashion-MNIST image classification dataset.

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

4.2.1 Initializing Model Parameters

Recall that this dataset contains 10 classes and that each image consists of a 28 × 28 = 784 grid of
pixel values. Since we will be discarding the spatial structure (for now), we can just think of this as
a classification dataset with 784 input features and 10 classes. In particular we will implement our
MLP with one hidden layer and 256 hidden units. Note that we can regard both of these choices
as hyperparameters that could be set based on performance on validation data. Typically, we will
choose layer widths as powers of 2 to make everything align nicely in memory.
Again, we will represent our parameters with several ndarrays. Note that we now have one weight
matrix and one bias vector per layer. As always, we must call attach_grad to allocate memory for
the gradients with respect to these parameters.

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens))
b1 = np.zeros(num_hiddens)
W2 = np.random.normal(scale=0.01, size=(num_hiddens, num_outputs))
b2 = np.zeros(num_outputs)
params = [W1, b1, W2, b2]

for param in params:
    param.attach_grad()

4.2.2 Activation Function

To make sure we know how everything works, we will use the maximum function to implement ReLU
ourselves, instead of invoking npx.relu directly.

def relu(X):
    return np.maximum(X, 0)

4.2.3 The Model

As in softmax regression, we will reshape each 2D image into a flat vector of length num_inputs.
Finally, we can implement our model with just a few lines of code.

def net(X):
    X = X.reshape(-1, num_inputs)
    H = relu(np.dot(X, W1) + b1)
    return np.dot(H, W2) + b2

4.2.4 The Loss Function

For better numerical stability and because we already know how to implement softmax regres-
sion completely from scratch in Section 3.6, we will use Gluonʼs integrated function for calcu-
lating the softmax and cross-entropy loss. Recall that we discussed some of these intricacies in
Section 4.1. We encourage the interested reader to examine the source code for
mxnet.gluon.loss.SoftmaxCrossEntropyLoss for more details.

loss = gluon.loss.SoftmaxCrossEntropyLoss()

4.2.5 Training

Steps for training the MLP are no different than for softmax regression. In the d2l package, we
directly call the train_ch3 function, whose implementation was introduced in Section 3.6. We set
the number of epochs to 10 and the learning rate to 0.5.

num_epochs, lr = 10, 0.5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
              lambda batch_size: d2l.sgd(params, lr, batch_size))



To see how well we did, letʼs apply the model to some test data. If you are interested, compare the
result to the corresponding linear model in Section 3.6.

d2l.predict_ch3(net, test_iter)

This looks a bit better than our previous result, a good sign that we are on the right path.

Summary

We saw that implementing a simple MLP is easy, even when done manually. That said, with a large
number of layers, this can get messy (e.g., naming and keeping track of the model parameters,
etc).

Exercises

1. Change the value of the hyperparameter num_hiddens in order to see how this hyperparam-
eter influences your results.
2. Try adding a new hidden layer to see how it affects the results.
3. How does changing the learning rate change the result?
4. What is the best result you can get by optimizing over all the parameters (learning rate, it-
erations, number of hidden layers, number of hidden units per layer)?



4.3 Concise Implementation of Multilayer Perceptron

Now that we have learned how multilayer perceptrons (MLPs) work in theory, letʼs implement them.
We begin, as always, by importing modules.

import d2l
from mxnet import gluon, init, npx
from mxnet.gluon import nn
npx.set_np()

4.3.1 The Model

The only difference from our softmax regression implementation is that we add two Dense (fully-
connected) layers instead of one. The first is our hidden layer, which has 256 hidden units and
uses the ReLU activation function.

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

Again, note that as always, Gluon automatically infers the missing input dimensions to each layer.
Training the model follows the exact same steps as in our softmax regression implementation.

batch_size, num_epochs = 256, 10
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)



Exercises

1. Try adding a few more hidden layers to see how the result changes.
2. Try out different activation functions. Which ones work best?
3. Try out different initializations of the weights.

4.4 Model Selection, Underfitting and Overfitting

As machine learning scientists, our goal is to discover general patterns. Say, for example, that we
wish to learn the pattern that associates genetic markers with the development of dementia in
adulthood. It is easy enough to memorize our training set. Each personʼs genes uniquely identify
them, not just among people represented in our dataset, but among all people on earth!
Given the genetic markers representing some person, we do not want our model to simply rec-
ognize “oh, that is Bob”, and then output the classification, say among {dementia, mild cognitive
impairment, healthy}, that corresponds to Bob. Rather, our goal is to discover patterns that cap-
ture regularities in the underlying population from which our training set was drawn. If we are
successful in this endeavor, then we could successfully assess risk even for individuals that we
have never encountered before. This problem—how to discover patterns that generalize—is the
fundamental problem of machine learning.
The danger is that when we train models, we access just a small sample of data. The largest public
image datasets contain roughly one million images. And more often we have to learn from thou-
sands or tens of thousands. In a large hospital system we might access hundreds of thousands
of medical records. With finite samples, we always run the risk that we might discover apparent
associations that turn out not to hold up when we collect more data.
Letʼs consider an extreme pathological case. Imagine that you want to learn to predict which peo-
ple will repay their loans. A lender hires you as a data scientist to investigate, handing over the



complete files on 100 applicants, 5 of which defaulted on their loans within 3 years. Realistically,
the files might include hundreds of potential features, including income, occupation, credit score,
length of employment etc. Moreover, say that they additionally hand over video footage of each
applicantʼs interview with their lending agent.
Now suppose that after featurizing the data into an enormous design matrix, you discover that of
the 5 applicants who default, all of them were wearing blue shirts during their interviews, while
only 40% of general population wore blue shirts. There is a good chance that if you train a predic-
tive model to predict default, it might rely upon blue-shirt-wearing as an important feature.
Even if in fact defaulters were no more likely to wear blue shirts than people in the general
population, thereʼs a 0.4⁵ ≈ 0.01 probability that we would observe all five defaulters wearing blue shirts.
With just 5 positive examples of defaults and hundreds or thousands of features, we would probably
find a large number of features that appear to be perfectly predictive of our label just due to
random chance. With an unlimited amount of data, we would expect these spurious associations
to eventually disappear. But we seldom have that luxury.
The phenomenon of fitting our training data more closely than we fit the underlying distribution
is called overfitting, and the techniques used to combat overfitting are called regularization. In
the previous sections, you might have observed this effect while experimenting with the Fashion-
MNIST dataset. If you altered the model structure or the hyper-parameters during the experiment,
you might have noticed that with enough nodes, layers, and training epochs, the model can even-
tually reach perfect accuracy on the training set, even as the accuracy on test data deteriorates.

4.4.1 Training Error and Generalization Error

In order to discuss this phenomenon more formally, we need to differentiate between training
error and generalization error. The training error is the error of our model as calculated on the
training dataset, while generalization error is the expectation of our modelʼs error were we to apply
it to an infinite stream of additional data points drawn from the same underlying data distribution
as our original sample.
Problematically, we can never calculate the generalization error exactly. That is because the imagi-
nary stream of infinite data is an imaginary object. In practice, we must estimate the generalization
error by applying our model to an independent test set constituted of a random selection of data
points that were withheld from our training set.
The following three thought experiments will help illustrate this situation better. Consider a col-
lege student trying to prepare for her final exam. A diligent student will strive to practice well and
test her abilities using exams from previous years. Nonetheless, doing well on past exams is no
guarantee that she will excel when it matters. For instance, the student might try to prepare by
rote learning the answers to the exam questions. This requires the student to memorize many
things. She might even remember the answers for past exams perfectly. Another student might
prepare by trying to understand the reasons for giving certain answers. In most cases, the latter
student will do much better.
Likewise, consider a model that simply uses a lookup table to answer questions. If the set of allow-
able inputs is discrete and reasonably small, then perhaps after viewing many training examples,
this approach would perform well. Still this model has no ability to do better than random guess-
ing when faced with examples that it has never seen before. In reality the input spaces are far too
large to memorize the answers corresponding to every conceivable input. For example, consider
the black and white 28 × 28 images. If each pixel can take one among 256 gray scale values, then



there are 256⁷⁸⁴ possible images. That means that there are far more low-res grayscale thumbnail-
sized images than there are atoms in the universe. Even if we could encounter this data, we could
never afford to store the lookup table.
Last, consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class
1: tails) based on some contextual features that might be available. No matter what algorithm we
come up with, the generalization error will always be 1/2. However, for most algorithms,
we should expect our training error to be considerably lower, depending on the luck of the draw,
even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}. Our feature-less algorithm
would have to fall back on always predicting the majority class, which appears from our limited sample
to be 1. In this case, the model that always predicts class 1 will incur an error of 1/3, considerably
better than our generalization error. As we increase the amount of data, the probability that the
fraction of heads will deviate significantly from 1/2 diminishes, and our training error would come
to match the generalization error.
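
A small simulation (ours, in plain Python) makes this concrete; the training error of the
majority-class rule creeps up toward the generalization error of 1/2 as the sample grows:

import random

random.seed(0)
for n in (6, 100, 10000):
    flips = [random.randint(0, 1) for _ in range(n)]
    majority = int(sum(flips) >= n / 2)  # predict the more frequent class
    train_err = sum(f != majority for f in flips) / n
    print(n, round(train_err, 3))  # approaches 0.5 as n grows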

Statistical Learning Theory

Since generalization is the fundamental problem in machine learning, you might not be surprised
to learn that many mathematicians and theorists have dedicated their lives to developing formal
theories to describe this phenomenon. In their eponymous theorem, Glivenko and Cantelli derived
the rate at which the training error converges to the generalization error. In a series of seminal
papers, Vapnik and Chervonenkis extended this theory to more general classes of functions.
This work laid the foundations of Statistical Learning Theory.
In the standard supervised learning setting, which we have addressed up until now and will stick
with throughout most of this book, we assume that both the training data and the test data are drawn
independently from identical distributions (commonly called the i.i.d. assumption). This means
that the process that samples our data has no memory. The 2nd example drawn and the 3rd drawn
are no more correlated than the 2nd and the 2-millionth sample drawn.
Being a good machine learning scientist requires thinking critically, and already you should be
poking holes in this assumption, coming up with common cases where the assumption fails. What
if we train a mortality risk predictor on data collected from patients at UCSF, and apply it on pa-
tients at Massachusetts General Hospital? These distributions are simply not identical. Moreover,
draws might be correlated in time. What if we are classifying the topics of Tweets? The news cycle
would create temporal dependencies in the topics being discussed violating any assumptions of
independence.
Sometimes we can get away with minor violations of the i.i.d. assumption and our models will
continue to work remarkably well. After all, nearly every real-world application involves at least
some minor violation of the i.i.d. assumption, and yet we have useful tools for face recognition,
speech recognition, language translation, etc.
Other violations are sure to cause trouble. Imagine, for example, if we tried to train a face recog-
nition system by training it exclusively on university students and then want to deploy it as a tool
for monitoring geriatrics in a nursing home population. This is unlikely to work well since college
students tend to look considerably different from the elderly.
In subsequent chapters and volumes, we will discuss problems arising from violations of the i.i.d.


assumption. For now, even taking the i.i.d. assumption for granted, understanding generalization
is a formidable problem. Moreover, elucidating the precise theoretical foundations that might
explain why deep neural networks generalize as well as they do continues to vex the greatest
minds in learning theory.
When we train our models, we attempt to search for a function that fits the training data as well
as possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to
the true associations, then it might perform too well without producing a model that generalizes
well to unseen data. This is precisely what we want to avoid (or at least control). Many of the
techniques in deep learning are heuristics and tricks aimed at guarding against overfitting.

Model Complexity

When we have simple models and abundant data, we expect the generalization error to resemble
the training error. When we work with more complex models and fewer examples, we expect the
training error to go down but the generalization gap to grow. What precisely constitutes model
complexity is a complex matter. Many factors govern whether a model will generalize well. For
example a model with more parameters might be considered more complex. A model whose
parameters can take a wider range of values might be more complex. Often with neural networks,
we think of a model that takes more training steps as more complex, and one subject to early
stopping as less complex.
It can be difficult to compare the complexity among members of substantially different model
classes (say a decision tree versus a neural network). For now, a simple rule of thumb is quite use-
ful: A model that can readily explain arbitrary facts is what statisticians view as complex, whereas
one that has only a limited expressive power but still manages to explain the data well is probably
closer to the truth. In philosophy, this is closely related to Popperʼs criterion of falsifiability of a
scientific theory: a theory is good if it fits data and if there are specific tests which can be used to
disprove it. This is important since all statistical estimation is post hoc, i.e., we estimate after we
observe the facts, hence vulnerable to the associated fallacy. For now, we will put the philosophy
aside and stick to more tangible issues.
In this section, to give you some intuition, weʼll focus on a few factors that tend to influence the
generalizability of a model class:
1. The number of tunable parameters. When the number of tunable parameters, sometimes
called the degrees of freedom, is large, models tend to be more susceptible to overfitting.
2. The values taken by the parameters. When weights can take a wider range of values, models
can be more susceptible to overfitting.
3. The number of training examples. Itʼs trivially easy to overfit a dataset containing only one
or two examples even if your model is simple. But overfitting a dataset with millions of
examples requires an extremely flexible model.

4.4.2 Model Selection

In machine learning, we usually select our final model after evaluating several candidate models.
This process is called model selection. Sometimes the models subject to comparison are funda-


mentally different in nature (say, decision trees vs linear models). At other times, we are compar-
ing members of the same class of models that have been trained with different hyperparameter
settings.
With multilayer perceptrons for example, we may wish to compare models with different numbers
of hidden layers, different numbers of hidden units, and various choices of the activation functions
applied to each hidden layer. In order to determine the best among our candidate models, we will
typically employ a validation set.

Validation Dataset

In principle we should not touch our test set until after we have chosen all our hyper-parameters.
Were we to use the test data in the model selection process, there is a risk that we might overfit
the test data. Then we would be in serious trouble. If we overfit our training data, there is always
the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever
know?
Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on
the training data for model selection either because we cannot estimate the generalization error
on the very data that we use to train the model.
The common practice to address this problem is to split our data three ways, incorporating a val-
idation set in addition to the training and test sets.
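
A minimal sketch of such a three-way split (ours; the fractions are arbitrary illustrative choices,
and the helper name is hypothetical):

import random

def train_valid_test_split(data, valid_frac=0.1, test_frac=0.1):
    # Shuffle a copy, then carve off validation and test portions
    data = list(data)
    random.shuffle(data)
    n_test = int(len(data) * test_frac)
    n_valid = int(len(data) * valid_frac)
    return (data[n_valid + n_test:],            # training set
            data[:n_valid],                     # validation set
            data[n_valid:n_valid + n_test])     # test set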
In practical applications, the picture gets muddier. While ideally we would only touch the test
data once, to assess the very best model or to compare a small number of models to each other,
real-world test data is seldom discarded after just one use. We can seldom afford a new test set for
each round of experiments.
The result is a murky practice where the boundaries between validation and test data are worry-
ingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really
working with what should rightly be called training data and validation data, with no true test sets.
Therefore, the accuracy reported in each experiment is really the validation accuracy and not a
true test set accuracy. The good news is that we do not need too much data in the validation set.
The uncertainty in our estimates can be shown to be of the order of O(n^(−1/2)).

K-Fold Cross-Validation

When training data is scarce, we might not even be able to afford to hold out enough data to con-
stitute a proper validation set. One popular solution to this problem is to employ K-fold cross-
validation. Here, the original training data is split into K non-overlapping subsets. Then model
training and validation are executed K times, each time training on K − 1 subsets and validat-
ing on a different subset (the one not used for training in that round). Finally, the training and
validation error rates are estimated by averaging over the results from the K experiments.
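
A minimal index-splitting sketch (ours; k_fold_indices is a hypothetical helper, not part of the
d2l package, and we drop any remainder examples for simplicity):

def k_fold_indices(n, k):
    """Yield (train_indices, valid_indices) for each of k folds over n examples."""
    fold_size = n // k
    for i in range(k):
        valid = list(range(i * fold_size, (i + 1) * fold_size))
        train = list(range(0, i * fold_size)) + list(range((i + 1) * fold_size, n))
        yield train, valid

for train_idx, valid_idx in k_fold_indices(10, 5):
    print(len(train_idx), valid_idx)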

4.4.3 Underfitting or Overfitting?

When we compare the training and validation errors, we want to be mindful of two common situ-
ations: First, we want to watch out for cases when our training error and validation error are both
substantial but there is little gap between them. If the model is unable to reduce the training



error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the
pattern that we are trying to model. Moreover, since the generalization gap between our train-
ing and validation errors is small, we have reason to believe that we could get away with a more
complex model. This phenomenon is known as underfitting.
On the other hand, as we discussed above, we want to watch out for the cases when our train-
ing error is significantly lower than our validation error, indicating severe overfitting. Note that
overfitting is not always a bad thing. With deep learning especially, it is well known that the best
predictive models often perform far better on training data than on holdout data. Ultimately, we
usually care more about the validation error than about the gap between the training and valida-
tion errors.
Whether we overfit or underfit can depend both on the complexity of our model and the size of
the available training datasets, two topics that we discuss below.

Model Complexity

To illustrate some classical intuition about overfitting and model complexity, we give an example
using polynomials. Given training data consisting of a single feature x and a corresponding real-
valued label y, we try to find the polynomial of degree d

ŷ = ∑_{i=0}^{d} x^i w_i    (4.4.1)

to estimate the labels y. This is just a linear regression problem where our features are given by
the powers of x, the w_i are the modelʼs weights, and the bias is given by w_0 since x^0 = 1 for all x.
Since this is just a linear regression problem, we can use the squared error as our loss function.
A higher-order polynomial function is more complex than a lower order polynomial function,
since the higher-order polynomial has more parameters and the model functionʼs selection range
is wider. Fixing the training dataset, higher-order polynomial functions should always achieve
lower (at worst, equal) training error relative to lower degree polynomials. In fact, whenever the
data points each have a distinct value of x, a polynomial function with degree equal to the number
of data points can fit the training set perfectly. We visualize the relationship between polynomial
degree and under- vs over-fitting in Fig. 4.4.1.

Fig. 4.4.1: Influence of Model Complexity on Underfitting and Overfitting



Dataset Size

The other big consideration to bear in mind is the dataset size. Fixing our model, the fewer sam-
ples we have in the training dataset, the more likely (and more severely) we are to encounter over-
fitting. As we increase the amount of training data, the generalization error typically decreases.
Moreover, in general, more data never hurts. For a fixed task and data distribution, there is typi-
cally a relationship between model complexity and dataset size. Given more data, we might prof-
itably attempt to fit a more complex model. Absent sufficient data, simpler models may be difficult
to beat. For many tasks, deep learning only outperforms linear models when many thousands of
training examples are available. In part, the current success of deep learning owes to the current
abundance of massive datasets due to internet companies, cheap storage, connected devices, and
the broad digitization of the economy.

4.4.4 Polynomial Regression

We can now explore these concepts interactively by fitting polynomials to data. To get started we
will import our usual packages.

import d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
npx.set_np()

Generating the Dataset

First we need data. Given x, we will use the following cubic polynomial to generate the labels on
training and test data:

y = 5 + 1.2x − 3.4 x²/2! + 5.6 x³/3! + ϵ where ϵ ∼ N(0, 0.1).    (4.4.2)
The noise term ϵ obeys a normal distribution with a mean of 0 and a standard deviation of 0.1. We
will synthesize 100 samples each for the training set and test set.

maxdegree = 20  # Maximum degree of the polynomial
n_train, n_test = 100, 100  # Training and test dataset sizes
true_w = np.zeros(maxdegree)  # Allocate lots of empty space
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(maxdegree).reshape(1, -1))
poly_features = poly_features / (
    npx.gamma(np.arange(maxdegree) + 1).reshape(1, -1))
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

For optimization, we typically want to avoid very large values of gradients, losses, etc. This is
why the monomials stored in poly_features are rescaled from x^i to x^i/i!. It allows us to avoid
very large values for large exponents i. Factorials are implemented in Gluon using the Gamma
function, where n! = Γ(n + 1).
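
A quick check of that identity with Python's standard library (our illustration):

import math

# n! equals Γ(n + 1); both print 120 for n = 5
print(math.factorial(5), math.gamma(6))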



Take a look at the first 2 samples from the generated dataset. The value 1 is technically a feature,
namely the constant feature corresponding to the bias.

features[:2], poly_features[:2], labels[:2]

(array([[1.5094751],
[1.9676613]]),
array([[1.0000000e+00, 1.5094751e+00, 1.1392574e+00, 5.7322693e-01,
2.1631797e-01, 6.5305315e-02, 1.6429458e-02, 3.5428370e-03,
6.6847802e-04, 1.1211676e-04, 1.6923748e-05, 2.3223611e-06,
2.9212887e-07, 3.3920095e-08, 3.6572534e-09, 3.6803552e-10,
3.4721288e-11, 3.0829944e-12, 2.5853909e-13, 2.0539909e-14],
[1.0000000e+00, 1.9676613e+00, 1.9358451e+00, 1.2696959e+00,
6.2458295e-01, 2.4579351e-01, 8.0606394e-02, 2.2658013e-02,
5.5729118e-03, 1.2184002e-03, 2.3973989e-04, 4.2884261e-05,
7.0318083e-06, 1.0643244e-06, 1.4958786e-07, 1.9622549e-08,
2.4131586e-09, 2.7931044e-10, 3.0532687e-11, 3.1619987e-12]]),
array([6.1804767, 7.7595935]))

Training and Testing Model

Letʼs first implement a function to evaluate the loss on a given dataset.

# Saved in the d2l package for later use
def evaluate_loss(net, data_iter, loss):
    """Evaluate the loss of a model on the given dataset"""
    metric = d2l.Accumulator(2)  # sum_loss, num_examples
    for X, y in data_iter:
        metric.add(loss(net(X), y).sum(), y.size)
    return metric[0] / metric[1]

Now define the training function.

def train(train_features, test_features, train_labels, test_labels,
          num_epochs=1000):
    loss = gluon.loss.L2Loss()
    net = nn.Sequential()
    # Switch off the bias since we already catered for it in the polynomial
    # features
    net.add(nn.Dense(1, use_bias=False))
    net.initialize()
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    test_iter = d2l.load_array((test_features, test_labels), batch_size,
                               is_train=False)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.01})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(1, num_epochs+1):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch % 50 == 0:
            animator.add(epoch, (evaluate_loss(net, train_iter, loss),
                                 evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data().asnumpy())

Third-Order Polynomial Function Fitting (Normal)

We will begin by first using a third-order polynomial function with the same order as the data
generation function. The results show that this modelʼs training and test losses are both low.
The trained model parameters are also close to the true values w = [5, 1.2, −3.4, 5.6].

# Pick the first four dimensions, i.e., 1, x, x^2, x^3 from the polynomial
# features
train(poly_features[:n_train, 0:4], poly_features[n_train:, 0:4],
      labels[:n_train], labels[n_train:])

weight: [[ 5.0161924 1.1759939 -3.4185557 5.6421604]]

Linear Function Fitting (Underfitting)

Letʼs take another look at linear function fitting. After the decline in the early epochs, it becomes
difficult to further decrease this modelʼs training error rate. After the last epoch iteration has been
completed, the training error rate is still high. When used to fit non-linear patterns (like the third-
order polynomial function here) linear models are liable to underfit.

# Pick the first three dimensions, i.e., 1, x, x^2 from the polynomial features
train(poly_features[:n_train, 0:3], poly_features[n_train:, 0:3],
      labels[:n_train], labels[n_train:])

weight: [[ 4.8935127 4.108137 -2.364595 ]]



Insufficient Training (Overfitting)

Now letʼs try to train the model using a polynomial of too high degree. Here, there is insufficient
data to learn that the higher-degree coefficients should have values close to zero. As a result, our
overly-complex model is far too susceptible to being influenced by noise in the training data. Of
course, our training error will now be low (even lower than if we had the right model!) but our
test error will be high.
Try out different model complexities (n_degree) and training set sizes (n_subset) to gain some
intuition of what is happening.

n_subset = 100  # Subset of data to train on
n_degree = 20  # Degree of polynomials
train(poly_features[1:n_subset, 0:n_degree],
      poly_features[n_train:, 0:n_degree], labels[1:n_subset],
      labels[n_train:])

weight: [[ 4.9833336   1.3050598  -3.25709     5.0301933  -0.40508285  1.6137251
   0.1106354   0.30472332  0.03450087  0.04252107 -0.03633049  0.05625657
   0.0635356   0.02527198 -0.0073961  -0.00711018  0.04849717  0.06699993
   0.0279271  -0.05373173]]



In later chapters, we will continue to discuss overfitting problems and methods for dealing with
them, such as weight decay and dropout.

Summary

• Since the generalization error rate cannot be estimated based on the training error rate,
simply minimizing the training error rate will not necessarily mean a reduction in the gen-
eralization error rate. Machine learning models need to safeguard against overfitting
so as to minimize the generalization error.
• A validation set can be used for model selection (provided that it is not used too liberally).
• Underfitting means that the model is not able to reduce the training error rate, while over-
fitting results from the modelʼs training error rate being much lower than its error rate on
the test dataset.
• We should choose an appropriately complex model and avoid using insufficient training
samples.

Exercises

1. Can you solve the polynomial regression problem exactly? Hint: use linear algebra.
2. Model selection for polynomials
• Plot the training error vs. model complexity (degree of the polynomial). What do you
observe?
• Plot the test error in this case.
• Generate the same graph as a function of the amount of data?
3. What happens if you drop the normalization of the polynomial features x^i by 1/i!? Can you
fix this in some other way?
4. What degree of polynomial do you need to reduce the training error to 0?
5. Can you ever expect to see 0 generalization error?

4.5 Weight Decay

Now that we have characterized the problem of overfitting and motivated the need for capacity
control, we can begin discussing some of the popular techniques used to these ends in practice.
Recall that we can always mitigate overfitting by going out and collecting more training data. That,
however, can be costly and time consuming, typically making it impossible in the short run. For now, letʼs
assume that we have already obtained as much high-quality data as our resources permit and focus
on techniques aimed at limiting the capacity of the function classes under consideration.



In our toy example, we saw that we could control the complexity of a polynomial by adjusting its
degree. However, most of machine learning does not consist of polynomial curve fitting. And
moreover, even when we focus on polynomial regression, when we deal with high-dimensional
data, manipulating model capacity by tweaking the degree d is problematic. To see why, note that
for multivariate data we must generalize the concept of polynomials to include monomials, which
are simply products of powers of variables. For example, x₁²x₂ and x₃x₅² are both monomials of
degree 3. The number of such terms with a given degree d blows up as a function of the degree d.
Concretely, for vectors of dimensionality D, the number of monomials of a given degree d is
the binomial coefficient C(D−1+d, D−1). Hence, a small change in degree, even from say 1 to 2 or 2 to 3,
would entail a massive blowup in the complexity of our model. Thus, tweaking the degree is too blunt a hammer.
Instead, we need a more fine-grained tool for adjusting function complexity.
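
To get a sense of scale, here is a short computation (our illustration) of that count for D = 20
input variables:

from math import comb  # Python 3.8+

def num_monomials(D, d):
    # Number of distinct monomials of degree d in D variables: C(D-1+d, D-1)
    return comb(D - 1 + d, D - 1)

print(num_monomials(20, 1), num_monomials(20, 2), num_monomials(20, 3))
# 20, 210, 1540: the count explodes as the degree grows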

4.5.1 Squared Norm Regularization

Weight decay (commonly called L2 regularization) might be the most widely-used technique for
regularizing parametric machine learning models. The basic intuition behind weight decay is the
notion that among all functions f, the function f = 0 is the simplest. Intuitively, we can then
measure functions by their proximity to zero. But how precisely should we measure the distance
between a function and zero? There is no single right answer. In fact, entire branches of
mathematics, e.g., functional analysis and the theory of Banach spaces, are devoted to addressing
this question.
For our present purposes, a very simple interpretation will suffice: We will consider a linear
function f(x) = w⊤x to be simple if its weight vector is small. We can measure this via ∥w∥².
One way of keeping the weight vector small is to add its norm as a penalty term to the problem
of minimizing the loss. Thus we replace our original objective, minimize the prediction error on
the training labels, with a new objective, minimize the sum of the prediction error and the penalty term.
Now, if the weight vector becomes too large, our learning algorithm will find more profit in min-
imizing the norm ∥w∥² versus minimizing the training error. That is exactly what we want. To
illustrate things in code, letʼs revive our previous example from Section 3.1 for linear regression.
There, our loss was given by

l(w, b) = (1/n) ∑_{i=1}^{n} (1/2) (w^⊤ x^(i) + b − y^(i))².    (4.5.1)

Recall that x^(i) are the observations, y^(i) are labels, and (w, b) are the weight and bias
parameters respectively. To arrive at a new loss function that penalizes the size of the weight vector, we
need to add ∥w∥², but how much should we add? To address this, we need to add a new
hyperparameter, that we will call the regularization constant and denote by λ:

l(w, b) + (λ/2) ∥w∥².    (4.5.2)
This non-negative parameter λ ≥ 0 governs the amount of regularization. For λ = 0, we recover
our original loss function, whereas for λ > 0 we ensure that w cannot grow too large. The astute
reader might wonder why we are squaring the norm of the weight vector. We do this for two
reasons. First, we do it for computational convenience. By squaring the L2 norm, we remove the
square root, leaving the sum of squares of each component of the weight vector. This is convenient
because it is easy to compute derivatives of a sum of terms (the sum of derivatives equals the
derivative of the sum).



Moreover, you might ask, why the L2 norm in the first place and not the L1 norm, or some other
distance function. In fact, several other choices are valid and are popular throughout statistics.
While L2-regularized linear models constitute the classic ridge regression algorithm, L1-regularized
linear regression is a similarly fundamental model in statistics, popularly known as lasso regression.
One mathematical reason for working with the L2 norm and not some other norm, is that it pe-
nalizes large components of the weight vector much more than it penalizes small ones. This en-
courages our learning algorithm to discover models which distribute their weight across a larger
number of features, which might make them more robust in practice since they do not depend
precariously on a single feature. The stochastic gradient descent updates for L2-regularized re-
gression are as follows:

w ← (1 − ηλ/|B|) w − (η/|B|) ∑_{i∈B} x^(i) (w^⊤ x^(i) + b − y^(i)),    (4.5.3)

As before, we update w based on the amount by which our estimate differs from the observation.
However, we also shrink the size of w towards 0. That is why the method is sometimes called
“weight decay”: because the penalty term literally causes our optimization algorithm to decay the
magnitude of the weight at each step of training. This is more convenient than having to pick
the number of parameters as we did for polynomials. In particular, we now have a continuous
mechanism for adjusting the complexity of f . Small values of λ correspond to unconstrained w,
whereas large values of λ constrain w considerably. Since we do not want to have large bias terms
either, we often add b² as a penalty, too.
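
Written as a minimal sketch (ours; the names are illustrative, and grad_sum stands for the summed
gradient term ∑_{i∈B} x^(i) (w^⊤ x^(i) + b − y^(i)) in (4.5.3)):

def sgd_l2_step(w, grad_sum, lr, lambd, batch_size):
    """One SGD step per (4.5.3): shrink w toward zero, then step along the gradient."""
    # The (1 - lr * lambd / batch_size) factor is the weight decay itself
    return (1 - lr * lambd / batch_size) * w - (lr / batch_size) * grad_sum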

4.5.2 High-Dimensional Linear Regression

For high-dimensional regression it is difficult to pick the “right” dimensions to omit. Weight-decay
regularization is a much more convenient alternative. We will illustrate this below. First, we will
generate some synthetic data as before:


$$y = 0.05 + \sum_{i=1}^{d} 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01). \tag{4.5.4}$$

representing our label as a linear function of our inputs, corrupted by Gaussian noise with zero
mean and variance 0.01. To observe the effects of overfitting more easily, we can make our prob-
lem high-dimensional, setting the data dimension to d = 200 and working with a relatively small
number of training examples—here we will set the sample size to 20:

%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()

n_train, n_test, num_inputs, batch_size = 20, 100, 200, 1


true_w, true_b = np.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)



4.5.3 Implementation from Scratch

Next, we will show how to implement weight decay from scratch. All we have to do here is to add
the squared ℓ2 penalty as an additional loss term added to the original target function. The squared
norm penalty derives its name from the fact that we are adding the second power ∑ᵢ wᵢ². The ℓ2
norm is just one among an infinite class of norms called p-norms, many of which you might encounter in
the future. In general, for some number p, the ℓp norm is defined as


$$\|\mathbf{w}\|_p^p := \sum_{i=1}^{d} |w_i|^p. \tag{4.5.5}$$
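
As a quick illustration (our own snippet, reusing the np namespace imported above, not part of the bookʼs code), we can compute a few ℓp norms of the same vector:

w = np.array([1.0, -2.0, 3.0])
for p in (1, 2, 4):
    # The lp norm is (sum_i |w_i|^p)^(1/p)
    print('l%d norm:' % p, float((np.abs(w) ** p).sum() ** (1 / p)))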

Initializing Model Parameters

First, we will define a function to randomly initialize our model parameters and run attach_grad
on each to allocate memory for the gradients we will calculate.

def init_params():
    w = np.random.normal(scale=1, size=(num_inputs, 1))
    b = np.zeros(1)
    w.attach_grad()
    b.attach_grad()
    return [w, b]

Defining ℓ2 Norm Penalty

Perhaps the most convenient way to implement this penalty is to square all terms in place and
sum them up. We divide by 2 by convention (when we take the derivative of a quadratic function,
the 2 and 1/2 cancel out, ensuring that the expression for the update looks nice and simple).

def l2_penalty(w):
    return (w**2).sum() / 2
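
As a quick sanity check (our own snippet, not part of the bookʼs code), the penalty of the vector (3, 4) should be (3² + 4²)/2 = 12.5:

w_check = np.array([3.0, 4.0])
print(l2_penalty(w_check))  # 12.5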

Defining the Train and Test Functions

The following code defines how to train and test the model separately on the training dataset and
the test dataset. Unlike in previous sections, here the ℓ2 norm penalty term is added when cal-
culating the final loss function. The linear network and the squared loss have not changed since
the previous chapter, so we will just import them via d2l.linreg and d2l.squared_loss to reduce
clutter.

def train(lambd):
    w, b = init_params()
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    num_epochs, lr = 100, 0.003
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], legend=['train', 'test'])
    for epoch in range(1, num_epochs + 1):
        for X, y in train_iter:
            with autograd.record():
                # The L2 norm penalty term has been added to the loss
                l = loss(net(X), y) + lambd * l2_penalty(w)
            l.backward()
            d2l.sgd([w, b], lr, batch_size)
        if epoch % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('l1 norm of w:', np.abs(w).sum())

Training without Regularization

Next, letʼs train and test the high-dimensional linear regression model. When lambd = 0 we do
not use weight decay. As a result, while the training error decreases, the test error does not. This
is a perfect example of overfitting.

train(lambd=0)

l1 norm of w: 158.55702

Using Weight Decay

The example below shows that even though the training error increased, the error on the test set
decreased. This is precisely the improvement that we expect from using weight decay. While not
perfect, overfitting has been mitigated to some extent. In addition, the ℓ2 norm of the weight w is
smaller than without using weight decay.

train(lambd=3)

l1 norm of w: 0.49734637



4.5.4 Concise Implementation

Because weight decay is ubiquitous in neural network optimization, Gluon makes it especially
convenient, integrating weight decay into the optimization algorithm itself for easy use in combi-
nation with any loss function. Moreover, this integration carries a computational benefit, allowing
implementation tricks to add weight decay to the algorithm without any additional computational
overhead. This is possible because the weight decay portion of the update depends only on the
current value of each parameter, and the optimizer must touch each parameter once anyway.
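
A minimal sketch of that trick, written here for illustration only (this is our own sketch of the idea, not Gluonʼs actual implementation):

def sgd_step_with_wd(param, grad, lr, wd):
    # Fold the decay into the ordinary update, in place. Both terms use only
    # values the optimizer already holds, so no extra pass over the data is
    # needed: w <- (1 - lr * wd) * w - lr * grad
    param *= 1 - lr * wd
    param -= lr * grad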
In the following code, we specify the weight decay hyperparameter directly through the wd param-
eter when instantiating our Trainer. By default, Gluon decays both weights and biases simulta-
neously. Note that we can have different optimizers for different sets of parameters. For instance,
we can have one Trainer with weight decay for the weights w and another without weight decay
to take care of the bias b.

def train_gluon(wd):
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize(init.Normal(sigma=1))
    loss = gluon.loss.L2Loss()
    num_epochs, lr = 100, 0.003
    # The weight parameter is decayed. Weight names generally end with
    # "weight"
    trainer_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd',
                              {'learning_rate': lr, 'wd': wd})
    # The bias parameter is not decayed. Bias names generally end with "bias"
    trainer_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd',
                              {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], legend=['train', 'test'])
    for epoch in range(1, num_epochs + 1):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            # Call the step function on each of the two Trainer instances to
            # update the weight and bias separately
            trainer_w.step(batch_size)
            trainer_b.step(batch_size)
        if epoch % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L1 norm of w:', np.abs(net[0].weight.data()).sum())

The plots look just the same as when we implemented weight decay from scratch, but this version
runs a bit faster and is easier to implement, a benefit that will become more pronounced for large
problems.

train_gluon(0)

L1 norm of w: 152.12502

train_gluon(3)

L1 norm of w: 0.5179619



So far, we have only touched upon one notion of what constitutes a simple linear function. For nonlinear
functions, what constitutes simplicity can be a far more complex question. For instance, there exist
Reproducing Kernel Hilbert Spaces (RKHS)72 which allow one to use many of the tools introduced
for linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms do not always
scale well to massive amounts of data. For the purposes of this book, we limit ourselves to simply
summing over the weights of the different layers, e.g., via ∑ₗ ∥wₗ∥², which is equivalent to weight
decay applied to all layers.
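
For instance, a sketch of such a penalty for a multilayer Gluon network might look as follows (a hypothetical helper of ours, assuming the networkʼs weight parameters match the '.*weight' pattern used earlier; it is not part of d2l):

def all_layers_penalty(net):
    # Sum the squared l2 norms of all weight matrices across layers
    return sum((param.data() ** 2).sum()
               for param in net.collect_params('.*weight').values())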

Summary

• Regularization is a common method for dealing with overfitting. It adds a penalty term to
the loss function on the training set to reduce the complexity of the learned model.
• One particular choice for keeping the model simple is weight decay using an ℓ2 penalty. This
leads to weight decay in the update steps of the learning algorithm.
• Gluon provides automatic weight decay functionality in the optimizer by setting the hyper-
parameter wd.
• You can have different optimizers within the same training loop, e.g., for different sets of
parameters.

Exercises

1. Experiment with the value of λ in the estimation problem on this page. Plot training and test
accuracy as a function of λ. What do you observe?
2. Use a validation set to find the optimal value of λ. Is it really the optimal value? Does this
matter?

3. What would the update equations look like if instead of ∥w∥² we used ∑ᵢ |wᵢ| as our penalty
of choice (this is called ℓ1 regularization)?
4. We know that ∥w∥² = w⊤w. Can you find a similar equation for matrices (mathematicians
call this the Frobenius norm73 )?
5. Review the relationship between training error and generalization error. In addition to
weight decay, increased training, and the use of a model of suitable complexity, what other
ways can you think of to deal with overfitting?
6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via
P (w | x) ∝ P (x | w)P (w). How can you identify P (w) with regularization?

72 [Link]
73 [Link]



4.6 Dropout

Just now, we introduced the classical approach of regularizing statistical models by penalizing the
ℓ2 norm of the weights. In probabilistic terms, we could justify this technique by arguing that we
have assumed a prior belief that weights take values from a Gaussian distribution with mean 0.
More intuitively, we might argue that we encouraged the model to spread out its weights among
many features rather than depending too much on a small number of potentially spurious
associations.

4.6.1 Overfitting Revisited

Given many more features than examples, linear models can overfit. But when there are many
more examples than features, we can generally count on linear models not to overfit. Unfortu-
nately, the reliability with which linear models generalize comes at a cost: Linear models canʼt
take into account interactions among features. For every feature, a linear model must assign ei-
ther a positive or a negative weight. They lack the flexibility to account for context.
In more formal text, youʼll see this fundamental tension between generalizability and flexibility
discussed as the bias-variance tradeoff. Linear models have high bias (they can only represent a
small class of functions), but low variance (they give similar results across different random sam-
ples of the data).
Deep neural networks take us to the opposite end of the bias-variance spectrum. Neural networks
are so flexible because they arenʼt confined to looking at each feature individually. Instead, they
can learn interactions among groups of features. For example, they might infer that “Nigeria”
and “Western Union” appearing together in an email indicates spam but that “Nigeria” without
“Western Union” does not.
Even when we only have a small number of features, deep neural networks are capable of overfit-
ting. In 2017, a group of researchers presented a now well-known demonstration of the incredible
flexibility of neural networks. They presented a neural network with randomly-labeled images
(there was no true pattern linking the inputs to the outputs) and found that the neural network,
optimized by SGD, could label every image in the training set perfectly.
Consider what this means. If the labels are assigned uniformly at random and there are 10 classes,
then no classifier can get better than 10% accuracy on holdout data. Yet even in these situations,
when there is no true pattern to be learned, neural networks can perfectly fit the training labels.

4.6.2 Robustness through Perturbations

Letʼs think briefly about what we expect from a good statistical model. We want it to do well on
unseen test data. One way we can accomplish this is by asking what constitutes a “simple” model.
Simplicity can come in the form of a small number of dimensions, which is what we did when
discussing fitting a model with monomial basis functions. Simplicity can also come in the form
of a small norm for the basis functions. This led us to weight decay (ℓ2 regularization). Yet a third
notion of simplicity that we can impose is that the function should be robust under small changes
in the input. For instance, when we classify images, we would expect that adding some random
noise to the pixels should be mostly harmless.
In 1995, Christopher Bishop formalized a form of this idea when he proved that training with input
noise is equivalent to Tikhonov regularization (Bishop, 1995). In other words, he drew a clear



mathematical connection between the requirement that a function be smooth (and thus simple),
as we discussed in the section on weight decay, and the requirement that it be resilient to
perturbations in the input.
Then in 2014, Srivastava et al. (Srivastava et al., 2014) developed a clever idea for how to apply
Bishopʼs idea to the internal layers of the network, too. Namely, they proposed to inject noise
into each layer of the network before calculating the subsequent layer during training. They real-
ized that when training a deep network with many layers, enforcing smoothness just on the input-
output mapping misses out on what is happening internally in the network. Their proposed idea
is called dropout, and it is now a standard technique that is widely used for training neural net-
works. Throughout training, on each iteration, dropout regularization consists simply of zeroing
out some fraction (typically 50%) of the nodes in each layer before calculating the subsequent
layer.
The key challenge then is how to inject this noise without introducing undue statistical bias. In
other words, we want to perturb the inputs to each layer during training in such a way that the
expected value of the layer is equal to the value it would have taken had we not introduced any
noise at all.
In Bishopʼs case, when we are adding Gaussian noise to a linear model, this is simple: at each
training iteration, we just add noise sampled from a zero-mean distribution, ϵ ∼ N(0, σ²), to the
input x, yielding a perturbed point x′ = x + ϵ. In expectation, E[x′] = x.
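
We can verify this unbiasedness numerically with a short illustrative snippet of ours (plain NumPy, not from the book):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
# Average 10,000 noisy copies of x; the sample mean should be close to x
noisy = x + np.random.normal(0, 0.1, size=(10000, 3))
print(noisy.mean(axis=0))  # approximately [1. 2. 3.]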
In the case of dropout regularization, one can debias each layer by normalizing by the fraction of
nodes that were not dropped out. In other words, dropout with drop probability p is applied as
follows:
$$h' = \begin{cases} 0 & \text{with probability } p \\ \frac{h}{1-p} & \text{otherwise} \end{cases}$$
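
A minimal plain-NumPy sketch of this rule (our own illustration of the definition above, not a library implementation):

import numpy as np

def dropout(h, p):
    # Zero each activation with probability p; scale the survivors by
    # 1 / (1 - p) so that the expected value is unchanged: E[h'] = h
    mask = np.random.uniform(0.0, 1.0, h.shape) > p
    return mask * h / (1 - p)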