Deep Learning Guide: Concepts & Implementation
Aston Zhang
Zachary C. Lipton
Mu Li
Alexander J. Smola
Contents
Preface 1
Installation 9
1 Introduction 13
1.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 The Key Components: Data, Models, and Algorithms . . . . . . . . . . . . . . . . . . 16
1.3 Kinds of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.5 The Road to Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6 Success Stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.8 Dot products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.9 Matrix-vector products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2.10 Matrix-matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.2.11 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2.12 Norms and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2.13 Intermediate linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.3 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.3.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.3.2 Training Mode and Prediction Mode . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.3 Computing the Gradient of Python Control Flow . . . . . . . . . . . . . . . . 64
2.3.4 Head gradients and the chain rule . . . . . . . . . . . . . . . . . . . . . . . . 64
2.4 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.1 Basic probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.4.2 Dealing with multiple random variables . . . . . . . . . . . . . . . . . . . . . 72
2.4.3 Conditional independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.4.5 Scan the QR Code to . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.5 Naive Bayes Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.5.1 Optical Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.6 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.6.1 Finding all the functions and classes in the module . . . . . . . . . . . . . . . 87
2.6.2 Finding the usage of specific functions and classes . . . . . . . . . . . . . . . 88
2.6.3 API Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4 Softmax Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.4.1 Classification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.4.2 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.4.3 Information Theory Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.4.4 Model Prediction and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.5 Image Classification Data (Fashion-MNIST) . . . . . . . . . . . . . . . . . . . . . . . 120
3.5.1 Getting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.5.2 Reading a Minibatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.6 Implementation of Softmax Regression from Scratch . . . . . . . . . . . . . . . . . . . 124
3.6.1 Initialize Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.6.2 The Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.6.3 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.6.4 The Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.6.5 Classification Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.6.6 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.6.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.7 Concise Implementation of Softmax Regression . . . . . . . . . . . . . . . . . . . . . 130
3.7.1 Initialize Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.7.2 The Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.7.3 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.7.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.6.1 Overfitting Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.6.2 Robustness through Perturbations . . . . . . . . . . . . . . . . . . . . . . . . 168
4.6.3 Dropout in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.6.4 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.6.5 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.7 Forward Propagation, Backward Propagation, and Computational Graphs . . . . . . . . 174
4.7.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.7.2 Computational Graph of Forward Propagation . . . . . . . . . . . . . . . . . . 175
4.7.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.7.4 Training a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.8 Numerical Stability and Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.8.1 Vanishing and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . . . 179
4.8.2 Parameter Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.9 Considering the Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.9.1 Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.9.2 A Taxonomy of Learning Problems . . . . . . . . . . . . . . . . . . . . . . . 191
4.9.3 Fairness, Accountability, and Transparency in machine Learning . . . . . . . . 192
4.10 Predicting House Prices on Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.10.1 Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.10.2 Accessing and Reading Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 195
4.10.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.10.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.10.5 k-Fold Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.10.6 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.10.7 Predict and Submit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.5.2 Gluon Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.6 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
5.6.1 Computing Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.6.2 NDArray and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.6.3 Gluon and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.10 Networks with Parallel Concatenations (GoogLeNet) . . . . . . . . . . . . . . . . . . . 285
6.10.1 Inception Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
6.10.2 GoogLeNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
6.10.3 Data Acquisition and Training . . . . . . . . . . . . . . . . . . . . . . . . . . 289
6.11 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
6.11.1 Training Deep Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
6.11.2 Batch Normalization Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
6.11.3 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
6.11.4 Use a Batch Normalization LeNet . . . . . . . . . . . . . . . . . . . . . . . . 295
6.11.5 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
6.12 Residual Networks (ResNet) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
6.12.1 Function Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
6.12.2 Residual Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
6.12.3 ResNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
6.12.4 Data Acquisition and Training . . . . . . . . . . . . . . . . . . . . . . . . . . 305
6.13 Densely Connected Networks (DenseNet) . . . . . . . . . . . . . . . . . . . . . . . . . 306
6.13.1 Function Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
6.13.2 Dense Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
6.13.3 Transition Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
6.13.4 DenseNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
6.13.5 Data Acquisition and Training . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.5.6 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
7.6 Concise Implementation of Recurrent Neural Networks . . . . . . . . . . . . . . . . . 350
7.6.1 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
7.6.2 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
7.7 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
7.7.1 A Simplified Recurrent Network . . . . . . . . . . . . . . . . . . . . . . . . . 355
7.7.2 The Computational Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
7.7.3 BPTT in Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
7.8 Gated Recurrent Units (GRU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
7.8.1 Gating the Hidden State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.8.2 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
7.8.3 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
7.9 Long Short Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
7.9.1 Gated Memory Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
7.9.2 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
7.9.3 Define the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
7.9.4 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
7.10 Deep Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
7.10.1 Functional Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
7.10.2 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
7.10.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
7.11 Bidirectional Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 378
7.11.1 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
7.11.2 Bidirectional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
8.5.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
8.5.3 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
8.5.4 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
8.6 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
8.6.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
8.6.2 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
8.6.3 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
8.7 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
8.7.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
8.7.2 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
8.7.3 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
8.8 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
8.8.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
8.8.2 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
8.8.3 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
10.2 Fine Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
10.2.1 Hot Dog Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
10.3 Object Detection and Bounding Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . 470
10.3.1 Bounding Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
10.4 Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
10.4.1 Generate Multiple Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . . . . 473
10.4.2 Intersection over Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
10.4.3 Labeling Training Set Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . . 476
10.4.4 Output Bounding Boxes for Prediction . . . . . . . . . . . . . . . . . . . . . . 480
10.5 Multiscale Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
10.6 Object Detection Data Set (Pikachu) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.6.1 Download the Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.6.2 Read the Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.6.3 Graphic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
10.7 Single Shot Multibox Detection (SSD) . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.7.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
10.7.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10.8 Region-based CNNs (R-CNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
10.8.1 R-CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
10.8.2 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
10.8.3 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
10.8.4 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
10.9 Semantic Segmentation and Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
10.9.1 Image Segmentation and Instance Segmentation . . . . . . . . . . . . . . . . . 508
10.9.2 Pascal VOC2012 Semantic Segmentation Data Set . . . . . . . . . . . . . . . 509
10.10 Fully Convolutional Networks (FCN) . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
10.10.1 Transposed Convolution Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 514
10.10.2 Construct a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
10.10.3 Initialize the Transposed Convolution Layer . . . . . . . . . . . . . . . . . . . 518
10.10.4 Read the Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
10.10.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
10.10.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
10.11 Neural Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
10.11.1 Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
10.11.2 Read the Content and Style Images . . . . . . . . . . . . . . . . . . . . . . . . 525
10.11.3 Preprocessing and Postprocessing . . . . . . . . . . . . . . . . . . . . . . . . 526
10.11.4 Extract Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
10.11.5 Define the Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
10.11.6 Create and Initialize the Composite Image . . . . . . . . . . . . . . . . . . . . 530
10.11.7 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
10.12 Image Classification (CIFAR-10) on Kaggle . . . . . . . . . . . . . . . . . . . . . . . 533
10.12.1 Obtain and Organize the Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 534
10.12.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
10.12.3 Read the Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
10.12.4 Define the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
10.12.5 Define the Training Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 539
10.12.6 Train and Validate the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
10.12.7 Classify the Testing Set and Submit Results on Kaggle . . . . . . . . . . . . . 540
10.13 Dog Breed Identification (ImageNet Dogs) on Kaggle . . . . . . . . . . . . . . . . . . 541
10.13.1 Obtain and Organize the Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 543
10.13.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
10.13.3 Read the Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
10.13.4 Define the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
10.13.5 Define the Training Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 547
10.13.6 Train and Validate the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
10.13.7 Classify the Testing Set and Submit Results on Kaggle . . . . . . . . . . . . . 548
11.9.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
11.9.3 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
11.10 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
11.10.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
11.10.2 Exhaustive Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
11.10.3 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
11.11 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
11.11.1 Compute the Context Variable . . . . . . . . . . . . . . . . . . . . . . . . . . 599
11.11.2 Update the Hidden State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
11.12 Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
11.12.1 Read and Pre-process Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
11.12.2 Encoder-Decoder with Attention Mechanism . . . . . . . . . . . . . . . . . . 603
11.12.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
11.12.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
11.12.5 Evaluation of Translation Results . . . . . . . . . . . . . . . . . . . . . . . . . 608
12 Appendix 611
12.1 List of Main Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
12.1.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
12.1.2 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
12.1.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
12.1.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
12.1.5 Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
12.1.6 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
12.1.7 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
12.2 Mathematical Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
12.2.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
12.2.2 Differentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
12.2.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
12.3 Using Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
12.3.1 Edit and Run the Code in This Book Locally . . . . . . . . . . . . . . . . . . . 620
12.3.2 Advanced Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
12.4 Using AWS to Run Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
12.4.1 Apply for an Account and Log In . . . . . . . . . . . . . . . . . . . . . . . . . 626
12.4.2 Create and Run an EC2 Instance . . . . . . . . . . . . . . . . . . . . . . . . . 626
12.4.3 Install CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
12.4.4 Acquire the Code for this Book and Install MXNet GPU Version . . . . . . . . 633
12.4.5 Run Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
12.4.6 Close Unused Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
12.5 GPU Purchase Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
12.5.1 Selecting a GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
12.5.2 Machine Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
12.6 How to Contribute to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
12.7 d2l Package Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Preface
Just a few years ago, there were no legions of deep learning scientists developing intelligent products
and services at major companies and startups. When the youngest of us (the authors) entered the field,
machine learning didn’t command headlines in daily newspapers. Our parents had no idea what machine
learning was, let alone why we might prefer it to a career in medicine or law. Machine learning was a
forward-looking academic discipline with a narrow set of real-world applications. And those applica-
tions, e.g. speech recognition and computer vision, required so much domain knowledge that they were
often regarded as separate areas entirely for which machine learning was one small component. Neural
networks, the antecedents of the deep learning models that we focus on in this book, were regarded as
outmoded tools.
In just the past five years, deep learning has taken the world by surprise, driving rapid progress in fields
as diverse as computer vision, natural language processing, automatic speech recognition, reinforcement
learning, and statistical modeling. With these advances in hand, we can now build cars that drive them-
selves (with increasing autonomy), smart reply systems that anticipate mundane replies, helping people
dig out from mountains of email, and software agents that dominate the world’s best humans at board
games like Go, a feat once deemed to be decades away. Already, these tools are exerting a widening
impact, changing the way movies are made and diseases are diagnosed, and playing a growing role in basic sciences, from astrophysics to biology. This book represents our attempt to make deep learning approachable, teaching you the concepts, the context, and the code.
About This Book
For any computing technology to reach its full impact, it must be well-understood, well-documented,
and supported by mature, well-maintained tools. The key ideas should be clearly distilled, minimizing
the onboarding time needed to bring new practitioners up to date. Mature libraries should automate
common tasks, and exemplar code should make it easy for practitioners to modify, apply, and extend
common applications to suit their needs. Take dynamic web applications as an example. Despite a
large number of companies, like Amazon, developing successful database-driven web applications in the
1990s, the full potential of this technology to aid creative entrepreneurs has only been realized over the
past ten years, owing to the development of powerful, well-documented frameworks.
Realizing deep learning presents unique challenges because any single application brings together various
disciplines. Applying deep learning requires simultaneously understanding (i) the motivations for casting
a problem in a particular way, (ii) the mathematics of a given modeling approach, (iii) the optimization
algorithms for fitting the models to data, (iv) and the engineering required to train models efficiently,
navigating the pitfalls of numerical computing and getting the most out of available hardware. Teaching
the critical thinking skills required to formulate problems, the mathematics to solve them, and the
software tools to implement those solutions all in one place presents formidable challenges. Our goal in
this book is to present a unified resource to bring would-be practitioners up to speed.
We started this book project in July 2017 when we needed to explain MXNet’s (then new) Gluon interface
to our users. At the time, there were no resources that were simultaneously (1) up to date, (2) covered the
full breadth of modern machine learning with anything resembling technical depth, and (3) interleaved
the exposition one expects from an engaging textbook with the clean runnable code one seeks in hands-on
tutorials. We found plenty of code examples for how to use a given deep learning framework (e.g. how
to do basic numerical computing with matrices in TensorFlow) or for implementing particular techniques
(e.g. code snippets for LeNet, AlexNet, ResNets, etc) in the form of blog posts or on GitHub. However,
these examples typically focused on how to implement a given approach, but left out the discussion of why
certain algorithmic decisions are made. While sporadic topics have been covered in blog posts, e.g. on
the website Distill or personal blogs, they only covered selected topics in deep learning, and often lacked
associated code. On the other hand, while several textbooks have emerged, most notably Goodfellow,
Bengio and Courville, 2016, which offers an excellent survey of the concepts behind deep learning, these
resources don’t marry the descriptions to realizations of the concepts in code, sometimes leaving readers
clueless as to how to implement them. Moreover, too many resources are hidden behind the paywalls of
commercial course providers.
We set out to create a resource that could (1) be freely available for everyone, (2) offer sufficient technical
depth to provide a starting point on the path to actually becoming an applied machine learning scientist,
(3) include runnable code, showing readers how to solve problems in practice, (4) allow for rapid updates, both by us and by the community at large, and (5) be complemented by a forum for
interactive discussion of technical details and to answer questions.
These goals were often in conflict. Equations, theorems, and citations are best managed and laid out in
LaTeX. Code is best described in Python. And webpages are native in HTML and JavaScript. Further-
more, we want the content to be accessible as executable code, as a physical book, as a downloadable
PDF, and on the internet as a website. At present there exist no tools and no workflow perfectly suited to
these demands, so we had to assemble our own. We describe our approach in detail in the appendix. We
settled on Github to share the source and to allow for edits, Jupyter notebooks for mixing code, equations
and text, Sphinx as a rendering engine to generate multiple outputs, and Discourse for the forum. While
our system is not yet perfect, these choices provide a good compromise among the competing concerns.
We believe that this might be the first book published using such an integrated workflow.
Learning by Doing
Many textbooks teach a series of topics, each in exhaustive detail. For example, Chris Bishop’s excellent
textbook, Pattern Recognition and Machine Learning, teaches each topic so thoroughly that getting to the
chapter on linear regression requires a non-trivial amount of work. While experts love this book precisely
for its thoroughness, for beginners, this property limits its usefulness as an introductory text.
In this book, we’ll teach most concepts just in time. In other words, you’ll learn concepts at the very
moment that they are needed to accomplish some practical end. While we take some time at the outset to
teach fundamental preliminaries, like linear algebra and probability, we want you to taste the satisfaction
of training your first model before worrying about more esoteric probability distributions.
Aside from a few preliminary notebooks that provide a crash course in the basic mathematical back-
ground, each subsequent notebook introduces both a reasonable number of new concepts and provides a
single self-contained working example – using a real dataset. This presents an organizational challenge.
Some models might logically be grouped together in a single notebook. And some ideas might be best
taught by executing several models in succession. On the other hand, there’s a big advantage to adhering
to a policy of 1 working example, 1 notebook: This makes it as easy as possible for you to start your own
research projects by leveraging our code. Just copy a notebook and start modifying it.
We will interleave the runnable code with background material as needed. In general, we will often err on
the side of making tools available before explaining them fully (and we will follow up by explaining the
background later). For instance, we might use stochastic gradient descent before fully explaining why it
is useful or why it works. This helps to give practitioners the necessary ammunition to solve problems
quickly, at the expense of requiring the reader to trust us with some curatorial decisions.
Throughout, we’ll be working with the MXNet library, which has the rare property of being flexible
enough for research while being fast enough for production. This book will teach deep learning concepts
from scratch. Sometimes, we want to delve into fine details about the models that would typically be
hidden from the user by Gluon’s advanced abstractions. This comes up especially in the basic tutorials,
where we want you to understand everything that happens in a given layer or optimizer. In these cases,
we’ll often present two versions of the example: one where we implement everything from scratch,
relying only on NDArray and automatic differentiation, and another, more practical example, where we
write succinct code using Gluon. Once we’ve taught you how some component works, we can just use
the Gluon version in subsequent tutorials.
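For a taste of the two styles, here is a minimal sketch (ours, not one of the book's notebooks), assuming MXNet is installed as described in the Installation section: the same dense layer computed first with raw NDArray operations and then with a Gluon layer.

from mxnet import nd, gluon

# From scratch: a dense layer is just a matrix product plus a bias,
# expressed directly with NDArray operations.
X = nd.random.normal(shape=(4, 3))
W = nd.random.normal(shape=(3, 1))
b = nd.zeros((1,))
y_scratch = nd.dot(X, W) + b

# With Gluon: the same computation via a high-level layer that manages
# its own parameters and initialization.
net = gluon.nn.Dense(1)
net.initialize()
y_gluon = net(X)

print(y_scratch.shape, y_gluon.shape)  # both (4, 1)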
Content and Structure
Fig. 1: Book structure
Code
Most sections of this book feature executable code. We recognize the importance of an interactive learning
experience in deep learning. At present certain intuitions can only be developed through trial and error,
tweaking the code in small ways and observing the results. Ideally, an elegant mathematical theory might
tell us precisely how to tweak our code to achieve a desired result. Unfortunately, at present such elegant
theories elude us. Despite our best attempts, our explanations of various techniques might be lacking,
sometimes on account of our shortcomings, and equally often on account of the nascent state of the
science of deep learning. We are hopeful that as the theory of deep learning progresses, future editions of
this book will be able to provide insights in places the present edition cannot.
Most of the code in this book is based on Apache MXNet. MXNet is an open-source framework for
deep learning and the preferred choice of AWS (Amazon Web Services), as well as many colleges and
companies. All of the code in this book has passed tests under MXNet 1.2.0. However, due to the rapid
development of deep learning, some code in the print edition may not work properly in future versions of
MXNet. We plan, however, to keep the online version up-to-date. In case of such problems, please
consult the section Installation and Running to update the code and runtime environment. At times, to
avoid unnecessary repetition, we encapsulate the frequently-imported and referred-to functions, classes,
etc. in this book in the gluonbook package, version number 1.0.0. We give a detailed overview of these
functions and classes in the appendix (gluonbook package index).
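Since the code was tested under MXNet 1.2.0, it can help to confirm which version is installed. This quick check is our suggestion rather than part of the gluonbook package.

import mxnet
print(mxnet.__version__)  # the book's code was tested under 1.2.0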
Target Audience
This book is for students (undergraduate or graduate), engineers, and researchers, who seek a solid grasp
of the practical techniques of deep learning. Because we explain every concept from scratch, no previous
background in deep learning or machine learning is required. Fully explaining the methods of deep
learning requires some mathematics and programming, but we’ll only assume that you come in with
some basics, including (the very basics of) linear algebra, calculus, probability, and Python programming.
Moreover, this book’s appendix provides a refresher on most of the mathematics covered in this book.
Most of the time, we will prioritize intuition and ideas over mathematical rigor. There are many terrific
books which can lead the interested reader further. For instance Linear Analysis by Bela Bollobas covers
linear algebra and functional analysis in great depth. All of Statistics is a terrific guide to statistics. And
if you have not used Python before, you may want to peruse the Python tutorial.
Forum
Associated with this book, we’ve launched a discussion forum, located at [Link]. When you
have questions on any section of the book, you can find the associated discussion page by scanning the
QR code at the end of the section to participate in its discussions. The authors of this book and the broader
MXNet developer community frequently participate in forum discussions.
Acknowledgments
We are indebted to the hundreds of contributors for both the English and the Chinese drafts. They helped
improve the content and offered valuable feedback. Specifically, we thank every contributor of this En-
glish draft for making it better for everyone. Their GitHub IDs or names are (in no particular order):
alxnorden, avinashingit, bowen0701, brettkoonce, Chaitanya Prakash Bapat, cryptonaut, Davide Fiocco,
edgarroman, gkutiel, John Mitro, Liang Pu, Rahul Agarwal, mohamed-ali, mstewart141, Mike Müller,
NRauschmayr, Prakhar Srivastav, sad-, sfermigier, Sheng Zha, sundeepteki, topecongiro, tpdi, vermicelli,
Vishaal Kapoor, vishwesh5, YaYaB, Yuhong Chen, Evgeniy Smirnov, lgov, Simon Corston-Oliver, Ig-
orDzreyev, trungha-ngx, pmuens, alukovenko, senorcinco, vfdev-5, dsweet, Mohammad Mahdi Rahimi,
Abhishek Gupta, uwsd, DomKM, Lisa Oakley, bowen0701, arush15june, prasanth5reddy. Moreover,
we thank Amazon Web Services, especially Swami Sivasubramanian, Raju Gulabani, Charlie Bell, and
Andrew Jassy for their generous support in writing this book. Without the available time, resources,
discussions with colleagues, and continuous encouragement this book would not have happened.
Summary
• Deep learning has revolutionized pattern recognition, introducing technology that now powers a
wide range of applications, including computer vision, natural language processing, and automatic speech recognition.
• To successfully apply deep learning, you must understand how to cast a problem, the mathematics
of modeling, the algorithms for fitting your models to data, and the engineering techniques to
implement it all.
• This book presents a comprehensive resource, including prose, figures, mathematics, and code, all
in one place.
• To answer questions related to this book, visit our forum at [Link]
• Apache MXNet is a powerful library for coding up deep learning models and running them in
parallel across GPU cores.
• Gluon is a high level library that makes it easy to code up deep learning models using Apache
MXNet.
• Conda is a Python package manager that ensures that all software dependencies are met.
• All notebooks are available for download on GitHub and the conda configurations needed to run
this book’s code are expressed in the [Link] file.
• If you plan to run this code on GPUs, don’t forget to install the necessary drivers and update your
configuration.
Exercises
Installation
To get you up and running with hands-on experience, you'll need to set up a Python environment,
Jupyter’s interactive notebooks, the relevant libraries, and the code needed to run the book.
This book and its code can be downloaded for free. For simplicity, we recommend conda, a popular Python package manager, to install all libraries. Windows users and Linux/macOS users can follow the instruc-
tions below respectively.
Windows Users
If this is your first time running the code of this book, you need to complete the following five steps. Next time, you can jump directly to Step 4 and Step 5.
Step 1 is to download and install Miniconda according to the operating system in use. During the instal-
lation, it is required to choose the option Add Anaconda to the system PATH environment variable.
Step 2 is to download the compressed file containing the code of this book. It is available at https://[Link]/[Link]. After downloading the zip file, create a folder d2l-en and extract the zip file into that folder. In that folder, enter cmd in the address bar of File Explorer to enter command line mode.
Step 3 is to create a virtual (running) environment using conda to install the libraries needed by this book.
Here, [Link] is included in the downloaded zip file. Open the file with a text editor to see the libraries (such as MXNet and the d2lzh package) and the version numbers on which the code of this book depends.
conda env create -f [Link]
Step 4 is to activate the environment created earlier. Activating this environment is a prerequisite
for running the code of this book. To exit the environment, use the command conda deactivate (if
the conda version is lower than 4.4, use the command deactivate).
# If the conda version is lower than 4.4, use the command `activate gluon`
conda activate gluon
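Step 5 is presumably to start Jupyter Notebook from the d2l-en folder; the exact command below is our sketch of that step rather than text from this section.

# Step 5 (presumed): start Jupyter Notebook from the d2l-en folder
jupyter notebook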
At this point, open [Link] (usually opened automatically) in the browser; then you can view
and run the code in each section of the book.
Linux/macOS Users
Step 1 is to download and install Miniconda according to the operating system in use. It is an sh file. Open
the Terminal application and enter the command to execute the sh file, such as
# The file name is subject to change, always use the one downloaded from the
# Miniconda website
sh Miniconda3-latest-Linux-x86_64.sh
The terms of use will be displayed during installation. Press Enter to continue reading, or press Q to exit reading.
After that, answer the following questions:
Do you accept the license terms? [yes|no]
[no] >>> yes
Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/your_name/your_file ? [yes|no]
[no] >>> yes
After the installation is complete, the conda configuration needs to take effect. Linux users need to run source ~/.bashrc or restart the command line application; macOS users need to run source ~/.bash_profile or restart the command line application.
Step 2 is to download the compressed file containing the code of this book, and extract it into the folder.
Run the following commands. Linux users who have not installed unzip can run the command
sudo apt install unzip to install it.
mkdir d2l-en && cd d2l-en
curl [Link] -o [Link]
unzip [Link] && rm [Link]
For Step 3 to Step 5, refer to the corresponding steps for Windows users described earlier. If the conda version
is lower than 4.4, replace the command in Step 4 with source activate gluon and exit the virtual
environment using the command source deactivate.
Since deep learning and MXNet evolve quickly, this open-source book will be updated and released regularly.
To update the open source content of this book (e.g., code) with corresponding running environment (e.g.,
MXNet of a later version), follow the steps below.
Step 1 is to re-download the latest compressed file containing the code of this book. It is available at
[Link] After extracting the zip file, enter the folder d2l-en.
Step 2 is to update the running environment with the command
conda env update -f [Link]
The subsequent steps for activating the environment and running Jupyter are the same as those described
earlier.
GPU Support
By default MXNet is installed without GPU support to ensure that it will run on any computer (includ-
ing most laptops). Parts of this book require or recommend running with a GPU. If your computer has an NVIDIA graphics card and CUDA is installed, you should modify the conda environment to download the CUDA-enabled build.
Step 1 is to uninstall MXNet without GPU support. If you have installed the virtual environment for
running the book, you need to activate this environment and then uninstall MXNet without GPU support:
pip uninstall mxnet
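The remaining steps are presumably to point the environment file at the CUDA-enabled MXNet build and then to update the environment; the details below, including the package name mxnet-cu90, are assumptions and must be matched to your installed CUDA version.

# Step 2 (presumed): edit [Link] so that the MXNet entry requests the
# CUDA-enabled build, e.g. replace "mxnet" with "mxnet-cu90" (match your CUDA version)
# Step 3 (presumed): update the virtual environment to pick up the change
conda env update -f [Link]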
Then we only need to activate the virtual environment to use MXNet with GPU support to run the book.
Note that you need to repeat these 3 steps to use MXNet with GPU support if you download the updated
code later.
Exercises
1. Download the code for the book and install the runtime environment.
1 Introduction
Until recently, nearly all of the computer programs that we interacted with every day were coded by
software developers from first principles. Say that we wanted to write an application to manage an e-
commerce platform. After huddling around a whiteboard for a few hours to ponder the problem, we
would come up with the broad strokes of a working solution that would probably look something like
this: (i) users would interact with the application through an interface running in a web browser or mobile application; (ii) our application would rely on a commercial database engine to keep track of each user's state and maintain records of all historical transactions; (iii) at the heart of our application, running in parallel across many servers, the business logic (you might say, the brains) would map out in methodical detail the appropriate action to take in every conceivable circumstance.
To build the brains of our application, we’d have to step through every possible corner case that we
anticipate encountering, devising appropriate rules. Each time a customer clicks to add an item to their
shopping cart, we add an entry to the shopping cart database table, associating that user’s ID with the
requested product’s ID. While few developers ever get it completely right the first time (it might take some
test runs to work out the kinks), for the most part, we could write such a program from first principles
and confidently launch it before ever seeing a real customer. Our ability to design automated systems
from first principles that drive functioning products and systems, often in novel situations, is a remarkable
cognitive feat. And when you're able to devise solutions that work 100% of the time, you should not be
using machine learning.
Fortunately for the growing community of ML scientists, many problems in automation don't bend so easily to human ingenuity. Imagine huddling around the whiteboard with the smartest minds you know, but this time you are tackling any of the following problems:
• Write a program that predicts tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
• Write a program that takes in a question, expressed in free-form text, and answers it correctly.
• Write a program that, given an image, can identify all the people it contains, drawing outlines around each.
• Write a program that presents users with products that they are likely to enjoy but unlikely, in the natural course of browsing, to encounter.
In each of these cases, even elite programmers are incapable of coding up solutions from scratch. The
reasons for this can vary. Sometimes the program that we are looking for follows a pattern that changes
over time, and we need our programs to adapt. In other cases, the relationship (say between pixels, and
abstract categories) may be too complicated, requiring thousands or millions of computations that are
beyond our conscious understanding (even if our eyes manage the task effortlessly). Machine learning
(ML) is the study of powerful techniques that can learn behavior from experience. As an ML algorithm accumulates more experience, typically in the form of observational data or interactions with an environment, its performance improves. Contrast this with our deterministic e-commerce platform, which
performs according to the same business logic, no matter how much experience accrues, until the devel-
opers themselves learn and decide that it’s time to update the software. In this book, we will teach you the
fundamentals of machine learning, and focus in particular on deep learning, a powerful set of techniques
driving innovations in areas as diverse as computer vision, natural language processing, healthcare, and
genomics.
1.1 A Motivating Example
Before we could begin writing, the authors of this book, like much of the work force, had to become caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out "Hey Siri", awakening the phone's voice recognition system. Then Mu commanded "directions to Blue Bottle coffee shop". The phone quickly displayed the transcription of his command. It also recognized that we were
asking for directions and launched the Maps application to fulfill our request. Once launched, the Maps
app identified a number of routes. Next to each route, the phone displayed a predicted transit time.
While we fabricated this story for pedagogical convenience, it demonstrates that in the span of just a few
seconds, our everyday interactions with a smartphone can engage several machine learning models.
Imagine just writing a program to respond to a wake word like "Alexa", "Okay, Google" or "Siri". Try coding it up in a room by yourself with nothing but a computer and a code editor. How would you write such a program from first principles? Think about it: the problem is hard. Every second, the microphone will
collect roughly 44,000 samples. What rule could map reliably from a snippet of raw audio to confident
predictions {yes, no} on whether the snippet contains the wake word? If you’re stuck, don’t worry.
We don’t know how to write such a program from scratch either. That’s why we use ML.
Here’s the trick. Often, even when we don’t know how to tell a computer explicitly how to map from
inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words,
even if you don't know how to program a computer to recognize the word "Alexa", you yourself are able to recognize the word "Alexa". Armed with this ability, we can collect a huge dataset containing
examples of audio and label those that do and that do not contain the wake word. In the ML approach, we
do not design a system explicitly to recognize wake words. Instead, we define a flexible program whose
behavior is determined by a number of parameters. Then we use the dataset to determine the best possible
set of parameters, those that improve the performance of our program with respect to some measure of
performance on the task of interest.
You can think of the parameters as knobs that we can turn, manipulating the behavior of the program.
Fixing the parameters, we call the program a model. The set of all distinct programs (input-output map-
pings) that we can produce just by manipulating the parameters is called a family of models. And the
meta-program that uses our dataset to choose the parameters is called a learning algorithm.
Before we can go ahead and engage the learning algorithm, we have to define the problem precisely,
pinning down the exact nature of the inputs and outputs, and choosing an appropriate model family. In
this case, our model receives a snippet of audio as input, and it generates a selection among {yes,
no} as output, which, if all goes according to plan, will closely approximate whether (or not) the snippet
contains the wake word.
If we choose the right family of models, then there should exist one setting of the knobs such that the
model fires "yes" every time it hears the word "Alexa". Because the exact choice of the wake word is arbitrary, we'll probably need a model family capable, via another setting of the knobs, of firing "yes" on the word "Apricot". We expect that the same model should apply to "Alexa" recognition and "Apricot"
recognition because these are similar tasks. However, we might need a different family of models entirely
if we want to deal with fundamentally different inputs or outputs, say if we wanted to map from images
to captions, or from English sentences to Chinese sentences.
As you might guess, if we just set all of the knobs randomly, it’s not likely that our model will recognize
Alexa’, Apricot’, or any other English word. In deep learning, the learning is the process by which we
discover the right setting of the knobs coercing the desired behaviour from our model.
The training process usually looks like this (a minimal code sketch follows the list):
1. Start off with a randomly initialized model that can’t do anything useful.
2. Grab some of your labeled data (e.g. audio snippets and corresponding {yes,no} labels)
3. Tweak the knobs so the model sucks less with respect to those examples
4. Repeat until the model is awesome.
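To make these steps concrete, here is a toy sketch in plain Python (our illustration, not code from the book): a single-parameter model fit to made-up data with squared loss, where tweaking the knob means taking a small step against the gradient.

# One knob (parameter w), squared loss, and a hand-derived gradient.
# The data points and learning rate are made up for illustration.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (input, label) pairs
w = 0.0                                        # 1. start with a useless model
learning_rate = 0.05

for step in range(200):                        # 4. repeat until it looks good
    for x, y in data:                          # 2. grab some labeled data
        prediction = w * x
        gradient = 2 * (prediction - y) * x    # slope of (w*x - y)^2 with respect to w
        w -= learning_rate * gradient          # 3. tweak the knob to reduce the loss

print(w)  # close to 2, the slope that fits the toy data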
Trained this way, a detector that distinguishes, say, cats from dogs will eventually learn to emit a very large positive number if it's a cat, a very large negative number if it's a dog, and something closer to zero if it isn't sure; and this barely scratches the surface of what ML can do.
Deep learning is just one among many popular frameworks for solving machine learning problems. While
thus far, we’ve only talked about machine learning broadly and not deep learning, there’s a couple points
worth sneaking in here: First, the problems that we’ve discussed thus far: learning from raw audio
signal, directly from the pixels in images, and mapping between sentences of arbitrary lengths and across
languages are problems where deep learning excels and traditional ML tools faltered. Deep models
are deep in precisely the sense that they learn many layers of computation. It turns out that these many-
layered (or hierarchical) models are capable of addressing low-level perceptual data in a way that previous
tools could not. In bygone days, the crucial part of applying ML to these problems consisted of coming
up with manually engineered ways of transforming the data into some form amenable to shallow models.
One key advantage of deep learning is that it replaces not only the shallow models at the end of traditional
learning pipelines, but also the labor-intensive feature engineering. Secondly, by replacing much of
the domain-specific preprocessing, deep learning has eliminated many of the boundaries that previously
separated computer vision, speech recognition, natural language processing, medical informatics, and
other application areas, offering a unified set of tools for tackling diverse problems.
1.2 The Key Components: Data, Models, and Algorithms
In our wake-word example, we described a dataset consisting of audio snippets and binary labels, and gave a hand-wavy sense of how we might train a model to approximate a mapping from snippets to classifica-
tions. This sort of problem, where we try to predict a designated unknown label given known inputs (also
called features or covariates), and examples of both is called supervised learning, and it’s just one among
many kinds of machine learning problems. In the next section, we’ll take a deep dive into the different
ML problems. First, we’d like to shed more light on some core components that will follow us around,
no matter what kind of ML problem we take on:
1. The data that we can learn from
2. A model of how to transform the data
3. A loss function that quantifies the badness of our model
4. An algorithm to adjust the model’s parameters to minimize the loss
1.2.1 Data
It might go without saying that you cannot do data science without data. We could lose hundreds of
pages pondering the precise nature of data but for now we’ll err on the practical side and focus on the key
properties to be concerned with. Generally we are concerned with a collection of examples (also called
data points, samples, or instances). In order to work with data usefully, we typically need to come up
with a suitable numerical representation. Each example typically consists of a collection of numerical
attributes called features or covariates.
If we were working with image data, each individual photograph might constitute an example, each
represented by an ordered list of numerical values corresponding to the brightness of each pixel. A
200 × 200 color photograph would consist of 200 × 200 × 3 = 120000 numerical values, corresponding
to the brightness of the red, green, and blue channels corresponding to each spatial location. In a more
traditional task, we might try to predict whether or not a patient will survive, given a standard set of
features such as age, vital signs, diagnoses, etc.
When every example is characterized by the same number of numerical values, we say that the data
consists of fixed-length vectors and we describe the (constant) length of the vectors as the dimensionality
of the data. As you might imagine, fixed length can be a convenient property. If we wanted to train a
model to recognize cancer in microscopy images, fixed-length inputs means we have one less thing to
worry about.
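As a small illustration of the arithmetic above (our sketch, using the NDArray API that the book's code relies on), the 200 × 200 color photo becomes a fixed-length vector of 120,000 values:

from mxnet import nd

photo = nd.random.uniform(shape=(200, 200, 3))  # height x width x RGB channels
vector = photo.reshape((-1,))                   # flatten into one long feature vector
print(vector.shape)                             # (120000,)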
However, not all data can easily be represented as fixed length vectors. While we might expect micro-
scope images to come from standard equipment, we can't expect images mined from the internet to all
show up in the same size. While we might imagine cropping images to a standard size, text data resists
fixed-length representations even more stubbornly. Consider the product reviews left on e-commerce sites
like Amazon or TripAdvisor. Some are short: "it stinks!". Others ramble for pages. One major advantage
of deep learning over traditional methods is the comparative grace with which modern models can handle
varying-length data.
Generally, the more data we have, the easier our job becomes. When we have more data, we can train
more powerful models, and rely less heavily on pre-conceived assumptions. The regime change from
(comparatively small) to big data is a major contributor to the success of modern deep learning. To drive
the point home, many of the most exciting models in deep learning simply don't work without large data sets. Others work in the low-data regime, but no better than traditional approaches.
1.2.2 Models
Most machine learning involves transforming the data in some sense. We might want to build a system
that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of sensor read-
ings and predict how normal vs anomalous the readings are. By model, we denote the computational
machinery for ingesting data of one type, and spitting out predictions of a possibly different type. In
particular, we are interested in statistical models that can be estimated from data. While simple models
are perfectly capable of addressing appropriately simple problems, the problems that we focus on in this
book stretch the limits of classical methods. Deep learning is differentiated from classical approaches
principally by the set of powerful models that it focuses on. These models consist of many successive
transformations of the data that are chained together top to bottom, thus the name deep learning. On our
way to discussing deep neural networks, we’ll discuss some more traditional methods.
Earlier, we introduced machine learning as learning behavior from experience. By learning here, we
mean improving at some task over time. But who is to say what constitutes an improvement? You might
imagine that we could propose to update our model, and some people might disagree on whether the
proposed update constituted an improvement or a decline.
In order to develop a formal mathematical system of learning machines, we need to have formal measures
of how good (or bad) our models are. In machine learning, and optimization more generally, we call these
objective functions. By convention, we usually define objective functions so that lower is better. This
is merely a convention. You can take any function f for which higher is better, and turn it into a new
function f ′ that is qualitatively identical but for which lower is better by setting f ′ = −f . Because lower
is better, these functions are sometimes called loss functions or cost functions.
When trying to predict numerical values, the most common objective function is the squared error (y − ŷ)².
For classification, the most common objective is to minimize error rate, i.e., the fraction of instances on
which our predictions disagree with the ground truth. Some objectives (like squared error) are easy to
optimize. Others (like error rate) are difficult to optimize directly, owing to non-differentiability or other
complications. In these cases, it’s common to optimize a surrogate objective.
Typically, the loss function is defined with respect to the model's parameters and depends upon the dataset.
The best values of our model’s parameters are learned by minimizing the loss incurred on a training set
consisting of some number of examples collected for training. However, doing well on the training data
doesn’t guarantee that we will do well on (unseen) test data. So we’ll typically want to split the available
data into two partitions: the training data (for fitting model parameters) and the test data (which is held
out for evaluation), reporting the following two quantities:
• Training Error: The error on that data on which the model was trained. You could think of this
as being like a student’s scores on practice exams used to prepare for some real exam. Even if the
results are encouraging, that does not guarantee success on the final exam.
• Test Error: This is the error incurred on an unseen test set. This can deviate significantly from
the training error. When a model fails to generalize to unseen data, we say that it is overfitting. In
real-life terms, this is like flunking the real exam despite doing well on practice exams.
Once we’ve got some data source and representation, a model, and a well-defined objective function, we
need an algorithm capable of searching for the best possible parameters for minimizing the loss function.
The most popular optimization algorithms for neural networks follow an approach called gradient descent.
In short, at each step, they check to see, for each parameter, which way the training set loss would move
if you perturbed that parameter just a small amount. They then update the parameter in the direction that
reduces the loss.
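To make this concrete, here is a minimal sketch in plain Python of the idea just described: nudge a parameter, estimate which direction reduces a toy squared-error loss, and step that way. The toy data and the finite-difference probe are illustrative only; real frameworks compute exact gradients automatically, as later chapters show.

def loss(w):
    # a toy training set for a one-parameter model y = w * x
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, lr, eps = 0.0, 0.1, 1e-5
for step in range(100):
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)  # estimated slope of the loss
    w -= lr * grad                                      # move the parameter downhill
print(w, loss(w))  # w ends up close to 2, the slope that best fits the toy data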
1.3 Kinds of Machine Learning
In the following sections, we will discuss a few types of machine learning in some more detail. We begin with a list of objectives, i.e. a list of things that machine learning can do. Note that the objectives are complemented with a set of techniques for accomplishing them, i.e. training procedures, types of data, etc. The
list below is really only sufficient to whet the readers’ appetite and to give us a common language when
we talk about problems. We will introduce a larger number of such problems as we go along.
Supervised learning
Supervised learning addresses the task of predicting targets given input data. The targets, also commonly
called labels, are generally denoted y. The input data points, also commonly called examples or instances,
are typically denoted x. The goal is to produce a model fθ that maps an input x to a prediction fθ(x).
To ground this description in a concrete example, if we were working in healthcare, then we might want to predict whether or not a patient would have a heart attack. This observation, heart attack or no heart attack, would be our label y.
Regression
Perhaps the simplest supervised learning task to wrap your head around is Regression. Consider, for
example a set of data harvested from a database of home sales. We might construct a table, where each
row corresponds to a different house, and each column corresponds to some relevant attribute, such as the
square footage of a house, the number of bedrooms, the number of bathrooms, and the number of minutes
(walking) to the center of town. Formally, we call one row in this dataset a feature vector, and the object
(e.g. a house) it’s associated with an example.
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or
Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vector for your
home might look something like: [100, 0, .5, 60]. However, if you live in Pittsburgh, it might look more
like [3000, 4, 3, 10]. Feature vectors like this are essential for all the classic machine learning problems.
We’ll typically denote the feature vector for any one example xi and the set of feature vectors for all our
examples X.
What makes a problem a regression is actually the outputs. Say you're in the market for a new home; you might want to estimate the fair market value of a house, given some features like these. The target
value, the price of sale, is a real number. We denote any individual target yi (corresponding to example
xi ) and the set of all targets y (corresponding to all examples X). When our targets take on arbitrary real
values in some range, we call this a regression problem. The goal of our model is to produce predictions
(guesses of the price, in our example) that closely approximate the actual target values. We denote these
predictions ŷi and if the notation seems unfamiliar, then just ignore it for now. We’ll unpack it more
thoroughly in the subsequent chapters.
Lots of practical problems are well-described regression problems. Predicting the rating that a user will
assign to a movie is a regression problem, and if you designed a great algorithm to accomplish this feat
in 2009, you might have won the $1 million Netflix prize. Predicting the length of stay for patients in
the hospital is also a regression problem. A good rule of thumb is that any How much? or How many?
problem should suggest regression.
• 'How many hours will this surgery take?' - regression
• 'How many dogs are in this photo?' - regression
However, if you can easily pose your problem as 'Is this a _?', then it's likely classification, a different
fundamental problem type that we’ll cover next. Even if you’ve never worked with machine learning
before, you’ve probably worked through a regression problem informally. Imagine, for example, that you
had your drains repaired and that your contractor spent x1 = 3 hours removing gunk from your sewage
pipes. Then she sent you a bill of y1 = $350. Now imagine that your friend hired the same contractor
for x2 = 2 hours and that she received a bill of y2 = $250. If someone then asked you how much to
expect on their upcoming gunk-removal invoice you might make some reasonable assumptions, such as
more hours worked costs more dollars. You might also assume that there’s some base charge and that the
contractor then charges per hour. If these assumptions held true, then given these two data points, you
could already identify the contractor’s pricing structure: $100 per hour plus $50 to show up at your house.
If you followed that much then you already understand the high-level idea behind linear regression (and
you just implicitly designed a linear model with bias).
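A few lines of Python are enough to recover that pricing structure from the two invoices; the numbers below are exactly the ones from the story, and the closed-form solution applies only because two points determine a line.

x1, y1 = 3.0, 350.0   # hours worked, dollars billed
x2, y2 = 2.0, 250.0

rate = (y1 - y2) / (x1 - x2)   # dollars charged per hour
base = y1 - rate * x1          # fixed charge for showing up
print(rate, base)              # 100.0 50.0
print(base + rate * 4)         # predicted bill for a 4-hour job: 450.0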
In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes
that’s not possible, e.g., if some of the variance owes to some factors besides your two features. In these
cases, we'll try to learn models that minimize the distance between our predictions and the observed values, most often measured by either the L1 loss (the sum of absolute errors) or the L2 loss (the sum of squared errors).
As we will see later, the L2 loss corresponds to the assumption that our data was corrupted by Gaussian
noise, whereas the L1 loss corresponds to an assumption of noise from a Laplace distribution.
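As a quick illustration of the difference, here is a small sketch (with made-up predictions) computing both losses; note how the single large mistake dominates the L2 loss.

y_true = [3.0, -0.5, 2.0, 7.0]   # observed target values
y_pred = [2.5,  0.0, 2.0, 9.0]   # a model's (hypothetical) predictions

l1 = sum(abs(y - yhat) for y, yhat in zip(y_true, y_pred))    # sum of absolute errors
l2 = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))  # sum of squared errors
print(l1, l2)  # 3.0 4.5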
Classification
While regression models are great for addressing how many? questions, lots of problems don’t bend
comfortably to this template. For example, a bank wants to add check scanning to their mobile app. This
would involve the customer snapping a photo of a check with their smartphone’s camera and the machine
learning model would need to be able to automatically understand text seen in the image. It would also
need to understand hand-written text to be even more robust. This kind of system is referred to as optical character recognition (OCR), and the kind of problem it solves is called classification. It's treated with a distinct set of algorithms from those used for regression.
In classification, we want to look at a feature vector, like the pixel values in an image, and then predict which category (formally called a class), among some set of options, an example belongs to. For hand-written digits, we might have 10 classes, corresponding to the digits 0 through 9. The simplest form of
classification is when there are only two classes, a problem which we call binary classification. For exam-
ple, our dataset X could consist of images of animals and our labels Y might be the classes {cat, dog}.
While in regression, we sought a regressor to output a real value ŷ, in classification, we seek a classifier,
whose output ŷ is the predicted class assignment.
For reasons that we’ll get into as the book gets more technical, it’s pretty hard to optimize a model that
can only output a hard categorical assignment, e.g. either cat or dog. It’s a lot easier instead to express the
model in the language of probabilities. Given an example x, the model assigns a probability ŷk to each
label k. Because these are probabilities, they need to be positive numbers and add up to 1. This means
that we only need K − 1 numbers to give the probabilities of K categories. This is easy to see for binary
classification. If there’s a 0.6 (60%) probability that an unfair coin comes up heads, then there’s a 0.4
(40%) probability that it comes up tails. Returning to our animal classification example, a classifier might
see an image and output the probability that the image is a cat Pr(y = cat|x) = 0.9. We can interpret
this number by saying that the classifier is 90% sure that the image depicts a cat. The magnitude of the
probability for the predicted class is one notion of confidence. It’s not the only notion of confidence and
we’ll discuss different notions of uncertainty in more advanced chapters.
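One standard way to turn a model's raw scores into numbers that are positive and sum to 1 is the softmax function, which later chapters use heavily. The scores below are made up purely for illustration.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]    # make every score positive
    total = sum(exps)
    return [e / total for e in exps]        # normalize so they sum to 1

scores = {'cat': 2.0, 'dog': -0.2}          # raw, unnormalized model outputs
probs = softmax(list(scores.values()))
print(dict(zip(scores, probs)))             # roughly {'cat': 0.90, 'dog': 0.10}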
When we have more than two possible classes, we call the problem multiclass classification. Common examples include hand-written character recognition [0, 1, 2, 3, ..., 9, a, b, c, ...]. While we attacked regression problems by trying to minimize the L1 or L2 loss functions, the common loss function for classification problems is called cross-entropy. In MXNet Gluon, the corresponding loss function is gluon.loss.SoftmaxCrossEntropyLoss.
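For a single example, cross-entropy boils down to the negative log-probability assigned to the correct class; the Gluon loss above fuses the softmax and this computation for numerical stability. A minimal sketch with made-up probabilities:

import math

probs = [0.7, 0.2, 0.1]        # predicted probabilities over three classes
label = 0                      # index of the true class
loss = -math.log(probs[label])
print(loss)                    # about 0.357; a perfect prediction would give a loss of 0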
Note that the most likely class is not necessarily the one that you’re going to use for your decision.
Assume that you find this beautiful mushroom in your backyard and that a classifier tells you the probability that it is a poisonous death cap is 0.2 (and 0.8 that it is safe to eat). Eating a death cap is fatal, so we assign that outcome an (effectively infinite) loss, whereas discarding an edible mushroom costs only a wasted meal, say a loss of 1. Hence, the expected loss incurred by eating the mushroom is L(a = eat | x) = 0.2 × ∞ + 0.8 × 0 = ∞, whereas the cost of discarding it is L(a = discard | x) = 0.2 × 0 + 0.8 × 1 = 0.8.
Our caution was justified: as any mycologist would tell us, the above mushroom actually is a death cap.
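The same decision rule is easy to write down in code; we stand in for "certain death" with float('inf'), and the 0.2/0.8 probabilities are those of the hypothetical classifier above.

p_death_cap = 0.2
expected_loss = {
    'eat':     p_death_cap * float('inf') + (1 - p_death_cap) * 0.0,
    'discard': p_death_cap * 0.0 + (1 - p_death_cap) * 1.0,
}
print(expected_loss)                              # {'eat': inf, 'discard': 0.8}
print(min(expected_loss, key=expected_loss.get))  # discard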
Classification can get much more complicated than just binary, multiclass, or even multi-label classification. For instance, there are some variants of classification for addressing hierarchies. Hierarchies assume that there exist some relationships among the many classes. So not all errors are equal: we would rather misclassify an example into a related class than into a distant one. Usually, this is referred to as hierarchical classification.
One early example is due to Linnaeus, who organized the animals in a hierarchy.
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, but our
model would pay a huge penalty if it confused a poodle for a dinosaur. Which hierarchy is relevant might
depend on how you plan to use the model. For example, rattlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be deadly.
Tagging
Some classification problems don’t fit neatly into the binary or multiclass classification setups. For ex-
ample, we could train a normal binary classifier to distinguish cats from dogs. Given the current state
of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate
our model gets, we might find ourselves in trouble when the classifier encounters an image of the Town
Musicians of Bremen.
As the fairy tale goes, the picture shows a donkey, a dog, a cat, and a rooster, so no single class label describes it adequately. The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. Auto-tagging problems are typically best described as multi-label classification problems. Think of the tags people might apply to posts on a tech blog, e.g., 'machine learning', 'technology', 'gadgets', 'programming languages', 'linux', 'cloud computing', 'AWS'. A typical article might have 5-10 tags applied, because these concepts are correlated. Posts about 'cloud computing' are likely to mention 'AWS', and posts about 'machine learning' could also deal with 'programming languages'.
We also have to deal with this kind of problem when dealing with the biomedical literature, where cor-
rectly tagging articles is important because it allows researchers to do exhaustive reviews of the literature.
At the National Library of Medicine, a number of professional annotators go over each article that gets
indexed in PubMed to associate it with the relevant terms from MeSH, a collection of roughly 28k tags.
This is a time-consuming process and the annotators typically have a one year lag between archiving
and tagging. Machine learning can be used here to provide provisional tags until each article can have a
proper manual review. Indeed, for several years, the BioASQ organization has hosted a competition to do
precisely this.
Search and ranking
Sometimes we don't just want to assign each example to a bucket or to a real value. In the field of information retrieval, we want to impose a ranking on a set of items. Take web search, for example: the goal is less to determine whether a particular page is relevant for a query than to determine which one of
the plethora of search results should be displayed for the user. We really care about the ordering of the
relevant search results and our learning algorithm needs to produce ordered subsets of elements from
a larger set. In other words, if we are asked to produce the first 5 letters from the alphabet, there is
a difference between returning A B C D E and C A B E D. Even if the result set is the same, the
ordering within the set matters nonetheless.
One possible solution to this problem is to assign every element in the set of possible results a relevance score and then to retrieve the top-rated elements. PageRank is an early example
of such a relevance score. One of the peculiarities is that it didn’t depend on the actual query. Instead, it
simply helped to order the results that contained the query terms. Nowadays search engines use machine
learning and behavioral models to obtain query-dependent relevance scores. There are entire conferences
devoted to this subject.
Recommender systems
Recommender systems are another problem setting that is related to search and ranking. The problems
are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the
emphasis on personalization to specific users in the context of recommender systems. For instance, for
movie recommendations, the results page for a SciFi fan and the results page for a connoisseur of Woody
Allen comedies might differ significantly.
Such problems occur, e.g., for movie, product, or music recommendation. In some cases, customers will provide explicit details about how much they liked the product (e.g. Amazon product reviews). In other cases, they might simply provide implicit feedback when they are dissatisfied with a result, for instance by skipping titles on a playlist.
Sequence Learning
So far we’ve looked at problems where we have some fixed number of inputs and produce a fixed number
of outputs. Before we considered predicting home prices from a fixed set of features: square footage,
number of bedrooms, number of bathrooms, walking time to downtown. We also discussed mapping from
an image (of fixed dimension), to the predicted probabilities that it belongs to each of a fixed number of
classes, or taking a user ID and a product ID, and predicting a star rating. In these cases, once we feed our
fixed-length input into the model to generate an output, the model immediately forgets what it just saw.
This might be fine if our inputs truly all have the same dimensions and if successive inputs truly have
nothing to do with each other. But how would we deal with video snippets? In this case, each snippet
might consist of a different number of frames. And our guess of what’s going on in each frame might
be much stronger if we take into account the previous or succeeding frames. Same goes for language.
One popular deep learning problem is machine translation: the task of ingesting sentences in some source
language and predicting their translation in another language.
These problems also occur in medicine. We might want a model to monitor patients in the intensive care
unit and to fire off alerts if their risk of death in the next 24 hours exceeds some threshold. We definitely
wouldn’t want this model to throw away everything it knows about the patient history each hour, and just
make its predictions based on the most recent measurements.
These problems are among the more exciting applications of machine learning and they are instances of
sequence learning. They require a model to either ingest sequences of inputs or to emit sequences of
outputs (or both!). These latter problems are sometimes referred to as seq2seq problems. Language
translation is a seq2seq problem. Transcribing text from spoken speech is also a seq2seq problem.
While it is impossible to consider all types of sequence transformations, a number of special cases are
worth mentioning:
Tagging and Parsing
This involves annotating a text sequence with attributes. In other words, the number of inputs and outputs
is essentially the same. For instance, we might want to know where the verbs and subjects are. Alterna-
tively, we might want to know which words are the named entities. In general, the goal is to decompose
and annotate text based on structural and grammatical assumptions to get some annotation. This sounds
more complex than it actually is. Below is a very simple example of annotating a sentence with tags indicating which words refer to named entities (Ent) and which do not (-):
Tom  has  dinner  in  Washington  with  Sally.
Ent  -    -       -   Ent         -     Ent
Automatic Speech Recognition
With speech recognition, the input sequence x is the sound of a speaker, and the output y is the textual
transcript of what the speaker said. The challenge is that there are many more audio frames (sound is
typically sampled at 8kHz or 16kHz) than text, i.e. there is no 1:1 correspondence between audio and
text, since thousands of samples correspond to a single spoken word. These are seq2seq problems
where the output is much shorter than the input.
Text to Speech
Text-to-Speech (TTS) is the inverse of speech recognition. In other words, the input x is text and the
output y is an audio file. In this case, the output is much longer than the input. While it is easy for
humans to recognize a bad audio file, this isn’t quite so trivial for computers.
Machine Translation
Unlike the case of speech recognition, where corresponding inputs and outputs occur in the same order
(after alignment), in machine translation, order inversion can be vital. In other words, while we are
still converting one sequence into another, neither the number of inputs and outputs nor the order of
corresponding data points are assumed to be the same. Consider, for instance, the obnoxious tendency of German (Alex writing here) to place verbs at the end of the sentence: a faithful translation must reorder words substantially rather than map them one-to-one in sequence.
A number of related problems exist. For instance, determining the order in which a user reads a webpage
is a two-dimensional layout analysis problem. Likewise, for dialogue problems, we need to take world-
knowledge and prior state into account. This is an active area of research.
Unsupervised learning
All the examples so far were related to Supervised Learning, i.e. situations where we feed the model a
bunch of examples and a bunch of corresponding target values. You could think of supervised learning as
having an extremely specialized job and an extremely anal boss. The boss stands over your shoulder and
tells you exactly what to do in every situation until you learn to map from situations to actions. Working
for such a boss sounds pretty lame. On the other hand, it’s easy to please this boss. You just recognize
the pattern as quickly as possible and imitate their actions.
In a completely opposite way, it could be frustrating to work for a boss who has no idea what they want
you to do. However, if you plan to be a data scientist, you’d better get used to it. The boss might just
hand you a giant dump of data and tell you to do some data science with it! This sounds vague because it
is. We call this class of problems unsupervised learning, and the type and number of questions we could
ask is limited only by our creativity. We will address a number of unsupervised learning techniques in
later chapters. To whet your appetite for now, we describe a few of the questions you might ask:
• Can we find a small number of prototypes that accurately summarize the data? Given a set of
photos, can we group them into landscape photos, pictures of dogs, babies, cats, mountain peaks,
etc.? Likewise, given a collection of users’ browsing activity, can we group them into users with
similar behavior? This problem is typically known as clustering.
• Can we find a small number of parameters that accurately capture the relevant properties of the data? The trajectory of a ball is quite well described by its velocity, diameter, and mass. Tailors have developed a small number of parameters that describe human body shape fairly
accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation
problems. If the dependence is linear, it is called principal component analysis.
• Is there a representation of (arbitrarily structured) objects in Euclidean space (i.e. the space of
vectors in Rn ) such that symbolic properties can be well matched? This is called representation
learning and it is used to describe entities and their relations, such as Rome - Italy + France =
Paris.
• Is there a description of the root causes of much of the data that we observe? For instance, if we
have demographic data about house prices, pollution, crime, location, education, salaries, etc., can
we discover how they are related simply based on empirical data? The field of directed graphical
models and causality deals with this.
• An important and exciting recent development is generative adversarial networks. They are
basically a procedural way of synthesizing data. The underlying statistical mechanisms are tests to
check whether real and fake data are the same. We will devote a few notebooks to them.
So far, we haven’t discussed where data actually comes from, or what actually happens when a machine
learning model generates an output. That’s because supervised learning and unsupervised learning do not
address these issues in a very sophisticated way. In either case, we grab a big pile of data up front, then do
our pattern recognition without ever interacting with the environment again. Because all of the learning
takes place after the algorithm is disconnected from the environment, this is called offline learning. For supervised learning, the process amounts to collecting a large batch of labeled data from the environment up front, fitting a model to it, and only then deploying that model to make predictions.
Reinforcement learning
If you're interested in using machine learning to develop an agent that interacts with an environment
and takes actions, then you’re probably going to wind up focusing on reinforcement learning (RL). This
might include applications to robotics, to dialogue systems, and even to developing AI for video games.
Deep reinforcement learning (DRL), which applies deep neural networks to RL problems, has surged in
popularity. The breakthrough deep Q-network that beat humans at Atari games using only the visual input, and the AlphaGo program that dethroned the world champion at the board game Go, are two prominent examples.
Reinforcement learning gives a very general statement of a problem, in which an agent interacts with an
environment over a series of time steps. At each time step t, the agent receives some observation ot from
the environment, and must choose an action at which is then transmitted back to the environment. Finally,
the agent receives a reward rt from the environment. The agent then receives a subsequent observation,
and chooses a subsequent action, and so on. The behavior of an RL agent is governed by a policy. In
short, a policy is just a function that maps from observations (of the environment) to actions. The goal of
reinforcement learning is to produce a good policy.
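In code, the interaction loop is short. The toy environment and the random policy below are hypothetical stand-ins (not a real RL library); they only illustrate the observation-action-reward cycle and the role of the policy.

import random

class ToyEnv:
    # a stand-in environment: the observation never changes and the reward
    # is 1 for the action 'right' and 0 otherwise
    def reset(self):
        return 0
    def step(self, action):
        reward = 1.0 if action == 'right' else 0.0
        return 0, reward

def policy(observation):
    # a policy maps observations to actions; this one ignores its input
    return random.choice(['left', 'right'])

env = ToyEnv()
observation, total_reward = env.reset(), 0.0
for t in range(10):                      # one short episode
    action = policy(observation)
    observation, reward = env.step(action)
    total_reward += reward
print(total_reward)                      # around 5 on average for this random policy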
It’s hard to overstate the generality of the RL framework. For example, we can cast any supervised
learning problem as an RL problem. Say we had a classification problem. We could create an RL agent
with one action corresponding to each class. We could then create an environment which gave a reward
that was exactly equal to the loss function from the original supervised problem.
That being said, RL can also address many problems that supervised learning cannot. For example, in
supervised learning we always expect that the training input comes associated with the correct label. But
in RL, we don’t assume that for each observation, the environment tells us the optimal action. In general,
we just get some reward. Moreover, the environment may not even tell us which actions led to the reward.
Consider for example the game of chess. The only real reward signal comes at the end of the game when
we either win, which we might assign a reward of 1, or when we lose, which we could assign a reward
of -1. So reinforcement learners must deal with the credit assignment problem. The same goes for an
employee who gets a promotion on October 11. That promotion likely reflects a large number of well-
chosen actions over the previous year. Getting more promotions in the future requires figuring out what
actions along the way led to the promotion.
Reinforcement learners may also have to deal with the problem of partial observability. That is, the
current observation might not tell you everything about your current state. Say a cleaning robot found
itself trapped in one of many identical closets in a house. Inferring the precise location (and thus state) of
the robot might require considering its previous observations before entering the closet.
The full reinforcement learning problem is a very general setting. Actions affect subsequent observa-
tions. Rewards are only observed corresponding to the chosen actions. The environment may be either
fully or partially observed. Accounting for all this complexity at once may ask too much of researchers.
Moreover not every practical problem exhibits all this complexity. As a result, researchers have studied a
number of special cases of reinforcement learning problems.
When the environment is fully observed, we call the RL problem a Markov Decision Process (MDP).
When the state does not depend on the previous actions, we call the problem a contextual bandit problem.
When there is no state, just a set of available actions with initially unknown rewards, this problem is the
classic multi-armed bandit problem.
1.4 Roots
Although deep learning is a recent invention, humans have held the desire to analyze data and to predict
future outcomes for centuries. In fact, much of natural science has its roots in this. For instance, the
Bernoulli distribution is named after Jacob Bernoulli (1655-1705), and the Gaussian distribution was dis-
covered by Carl Friedrich Gauss (1777-1855). He invented, for instance, the method of least squares, which is still used today for a range of problems from insurance calculations to medical diagnostics.
These tools gave rise to an experimental approach in natural sciences - for instance, Ohm’s law relating
current and voltage in a resistor is perfectly described by a linear model.
Even in the Middle Ages, mathematicians had a keen intuition about estimates. For instance, the geometry
book of Jacob Köbel (1460-1533) illustrates averaging the length of 16 adult men’s feet to obtain the
average foot length.
Fig. 1.1: Estimating the length of a foot
Figure 1.1 illustrates how this estimator works. Sixteen adult men were asked to line up in a row when leaving church. The aggregate length of their feet was then divided by 16 to obtain an estimate of what now amounts to one foot. This 'algorithm' was later improved to deal with misshapen feet: the two men with the shortest and longest feet were sent away, and the average was taken only over the remainder. This is one of the earliest examples of the trimmed mean estimate.
Statistics really took off with the collection and availability of data. One of its titans, Ronald Fisher
(1890-1962), contributed significantly to its theory and also its applications in genetics. Many of his
algorithms (such as Linear Discriminant Analysis) and formulae (such as the Fisher Information Matrix)
are still in frequent use today (even the Iris dataset that he analyzed in 1936 is still sometimes used to
illustrate machine learning algorithms).
A second influence for machine learning came from Information Theory (Claude Shannon, 1916-2001)
and the theory of computation via Alan Turing (1912-1954). Turing posed the question 'can machines think?' in his famous paper Computing Machinery and Intelligence (Mind, October 1950) [1]. In what he
described as the Turing test, a machine can be considered intelligent if it is difficult for a human evaluator
to distinguish between the replies from a machine and a human being through textual interactions. To this
day, the development of intelligent machines is changing rapidly and continuously.
Another influence can be found in neuroscience and psychology. After all, humans clearly exhibit intel-
ligent behavior. It is thus only reasonable to ask whether one could explain and possibly reverse engineer
these insights. One of the oldest algorithms to accomplish this was formulated by Donald Hebb (1904-
1985).
In his groundbreaking book The Organization of Behavior (John Wiley & Sons, 1949) he posited that
neurons learn by positive reinforcement. This became known as the Hebbian learning rule. It is the
prototype of Rosenblatt’s perceptron learning algorithm and it laid the foundations of many stochastic
gradient descent algorithms that underpin deep learning today: reinforce desirable behavior and diminish
undesirable behavior to obtain good weights in a neural network.
Biological inspiration is what gave neural networks their name. For over a century (dating back to the models of Alexander Bain, 1873, and Charles Sherrington, 1890), researchers have tried to assemble computational circuits that resemble networks of interacting neurons. Over time the interpretation of biology
became more loose but the name stuck. At its heart lie a few key principles that can be found in most
networks today:
• The alternation of linear and nonlinear processing units, often referred to as 'layers'.
• The use of the chain rule (aka backpropagation) for adjusting parameters in the entire network at
once.
After initial rapid progress, research in Neural Networks languished from around 1995 until 2005. This
was due to a number of reasons. Training a network is computationally very expensive. While RAM
was plentiful at the end of the past century, computational power was scarce. Secondly, datasets were relatively small. In fact, Fisher's Iris dataset from 1936 was a popular tool for testing the efficacy of algorithms. MNIST, with its 60,000 handwritten digits, was considered huge.
Given the scarcity of data and computation, strong statistical tools such as Kernel Methods, Decision
Trees and Graphical Models proved empirically superior. Unlike Neural Networks they did not require
weeks to train and provided predictable results with strong theoretical guarantees.
1.5 The Road to Deep Learning
Much of this changed with the ready availability of large amounts of data, due to the World Wide Web, the advent of companies serving hundreds of millions of users online, the dissemination of cheap, high-quality sensors, cheap data storage (Kryder's law), and cheap computation (Moore's law), in particular
in the form of GPUs, originally engineered for computer gaming. Suddenly algorithms and models that
seemed computationally infeasible became relevant (and vice versa). This is best illustrated in the table
below:
Decade   Dataset                                Memory   Floating point calculations per second
1970     100 (Iris)                             1 KB     100 KF (Intel 8080)
1980     1 K (house prices in Boston)           100 KB   1 MF (Intel 80186)
1990     10 K (optical character recognition)   10 MB    10 MF (Intel 80486)
2000     10 M (web pages)                       100 MB   1 GF (Intel Core)
2010     10 G (advertising)                     1 GB     1 TF (Nvidia C2050)
2020     1 T (social network)                   100 GB   1 PF (Nvidia DGX-2)
It is quite evident that RAM has not kept pace with the growth in data. At the same time, the increase in
computational power has outpaced that of the data available. This means that statistical models needed to
become more memory efficient (this is typically achieved by adding nonlinearities) while simultaneously
being able to spend more time on optimizing these parameters, due to an increased compute budget.
Consequently the sweet spot in machine learning and statistics moved from (generalized) linear models
and kernel methods to deep networks. This is also one of the reasons why many of the mainstays of deep
learning, such as Multilayer Perceptrons (e.g. McCulloch & Pitts, 1943), Convolutional Neural Networks
(Le Cun, 1992), Long Short Term Memory (Hochreiter & Schmidhuber, 1997), Q-Learning (Watkins,
1989), were essentially 'rediscovered' in the past decade, after lying dormant for considerable time.
The recent progress in statistical models, applications, and algorithms, has sometimes been likened to the
Cambrian Explosion: a moment of rapid progress in the evolution of species. Indeed, the state of the art
is not just a mere consequence of available resources, applied to decades old algorithms. Note that the list
below barely scratches the surface of the ideas that have helped researchers achieve tremendous progress
over the past decade.
• Novel methods for capacity control, such as Dropout [3], allowed for the training of relatively large networks without the danger of overfitting, i.e. without the danger of merely memorizing large
parts of the training data. This was achieved by applying noise injection [4] throughout the network,
replacing weights by random variables for training purposes.
• Attention mechanisms solved a second problem that had plagued statistics for over a century: how
to increase the memory and complexity of a system without increasing the number of learnable
parameters. [5] found an elegant solution by using what can only be viewed as a learnable pointer
structure. That is, rather than having to remember an entire sentence, e.g. for machine translation
in a fixed-dimensional representation, all that needed to be stored was a pointer to the intermediate
state of the translation process. This allowed for significantly increased accuracy on long sentences, since the model no longer needed to remember the entire source sentence before beginning to generate the translation.
• Multi-stage designs, e.g. via the Memory Networks [6] and the Neural Programmer-Interpreter [7]
allowed statistical modelers to describe iterative approaches to reasoning. These tools allow for an
internal state of the deep network to be modified repeatedly, thus carrying out subsequent steps in
a chain of reasoning, similar to how a processor can modify memory for a computation.
• Another key development was the invention of Generative Adversarial Networks [8], which made it possible to synthesize realistic data by pitting a generator against a discriminator rather than specifying an explicit probability model.
1.6 Success Stories
Artificial Intelligence has a long history of delivering results that would be difficult to accomplish other-
wise. For instance, mail is sorted using optical character recognition. These systems have been deployed
since the 90s (this is, after all, the source of the famous MNIST and USPS sets of handwritten digits). The
same applies to reading checks for bank deposits and scoring creditworthiness of applicants. Financial
transactions are checked for fraud automatically. This forms the backbone of many e-commerce payment
systems, such as PayPal, Stripe, AliPay, WeChat, Apple, Visa, MasterCard. Computer programs for chess
have been competitive for decades. Machine learning feeds search, recommendation, personalization and
ranking on the internet. In other words, artificial intelligence and machine learning are pervasive, albeit
often hidden from sight.
It is only recently that AI has been in the limelight, mostly due to solutions to problems that were consid-
ered intractable previously.
• Intelligent assistants, such as Apple's Siri, Amazon's Alexa, or Google Assistant, are able to answer spoken questions with a reasonable degree of accuracy. This ranges from menial tasks, such as turning on light switches (a boon to the disabled), to making barber appointments and offering phone support dialog. This is likely the most noticeable sign that AI is affecting our lives.
• A key ingredient in digital assistants is the ability to recognize speech accurately. Gradually the
accuracy of such systems has increased to the point where they reach human parity [14] for certain
applications.
• Object recognition likewise has come a long way. Identifying the object in a picture was a fairly challenging task in 2010. On the ImageNet benchmark, Lin et al. [15] achieved a top-5 error rate of
28%. By 2017 Hu et al. [16] reduced this error rate to 2.25%. Similarly stunning results have been
achieved for identifying birds, or diagnosing skin cancer.
• Games used to be a bastion of human intelligence. Starting from TDGammon [23], a program
for playing Backgammon using temporal difference (TD) reinforcement learning, algorithmic and
computational progress has led to algorithms for a wide range of applications. Unlike Backgam-
mon, chess has a much more complex state space and set of actions. Deep Blue beat Garry Kasparov (Campbell et al. [17]) using massive parallelism, special-purpose hardware, and efficient search through the game tree. Go is more difficult still, due to its huge state space. AlphaGo reached human parity in 2015 (Silver et al. [18]) using deep learning combined with Monte Carlo tree search. The challenge in Poker was that the state space is large and it is not fully observed (we
don’t know the opponents’ cards). Libratus exceeded human performance in Poker using efficiently
structured strategies; Brown and Sandholm [19]. This illustrates the impressive progress in games
and the fact that advanced algorithms played a crucial part in them.
• Another indication of progress in AI is the advent of self-driving cars and trucks. While full au-
tonomy is not quite within reach yet, excellent progress has been made in this direction, with com-
panies such as Momenta, Tesla, NVIDIA, MobilEye and Waymo shipping products that enable at
least partial autonomy. What makes full autonomy so challenging is that proper driving requires
the ability to perceive, to reason and to incorporate rules into a system. At present, Deep Learning
is used primarily in the computer vision aspect of these problems. The rest is heavily tuned by
engineers.
Again, the above list barely scratches the surface of what is considered intelligent and where machine
learning has led to impressive progress in a field. For instance, robotics, logistics, computational biology, particle physics, and astronomy owe some of their most impressive recent advances at least in part to machine learning.
Summary
• Machine learning studies how computer systems can use data to improve performance. It combines
ideas from statistics, data mining, artificial intelligence and optimization. Often it is used as a
means of implementing artificially intelligent solutions.
• As a class of machine learning, representation learning focuses on how to automatically find
the appropriate way to represent data. This is often accomplished by a progression of learned
transformations.
• Much of the recent progress has been triggered by an abundance of data arising from cheap sensors
and internet scale applications, and by significant progress in computation, mostly through GPUs.
• Whole system optimization is a key component in obtaining good performance. The availability of
efficient deep learning frameworks has made design and implementation of this significantly easier.
Exercises
1. Which parts of code that you are currently writing could be 'learned', i.e. improved by learning
and automatically determining design choices that are made in your code? Does your code include
heuristic design choices?
2. Which problems that you encounter have many examples for how to solve them, yet no specific
way to automate them? These may be prime candidates for using Deep Learning.
3. Viewing the development of Artificial Intelligence as a new industrial revolution, what is the rela-
tionship between algorithms and data? Is it similar to steam engines and coal (what is the funda-
mental difference)?
4. Where else can you apply the end-to-end training approach? Physics? Engineering? Econometrics?
References
[1] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433.
[2] Hebb, D. O. (1949). The organization of behavior; a neuropsychological theory. A Wiley Book in
Clinical Psychology. 62-78.
[3] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a
simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research,
15(1), 1929-1958.
[4] Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computa-
tion, 7(1), 108-116.
[5] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align
and translate. arXiv preprint arXiv:1409.0473.
[6] Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. In Advances in neural
information processing systems (pp. 2440-2448).
[7] Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.
[8] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., & Bengio, Y.
(2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-
2680).
[9] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using
cycle-consistent adversarial networks. arXiv preprint.
[10] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved
quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[11] Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral dissertation, Carnegie Mellon University).
2
To get started with deep learning, we will need to develop a few basic skills. All machine learning
is concerned with extracting information from data. So we will begin by learning the practical skills
for storing and manipulating data with Apache MXNet. Moreover machine learning typically requires
working with large datasets, which we can think of as tables, where the rows correspond to examples and
the columns correspond to attributes. Linear algebra gives us a powerful set of techniques for working
with tabular data. We won't go too far into the weeds, but rather focus on the basics of matrix operations and their implementation in Apache MXNet. Additionally, deep learning is all about optimization. We have a model with some parameters and we want to find those that fit our data the best. Determining which way to move each parameter at each step of an algorithm requires a little bit of calculus. Fortunately, Apache MXNet's autograd package handles this for us, and we will cover it next. Beyond that, machine learning
is concerned with making predictions: what is the likely value of some unknown attribute, given the
information that we observe? To reason rigorously under uncertainty we will need to invoke the language
of probability and statistics. To conclude the chapter, we will present your first basic classifier, Naive
Bayes.
2.1 Data Manipulation
It is impossible to get anything done if we cannot manipulate data. Generally, there are two important
things we need to do with data: (i) acquire it and (ii) process it once it is inside the computer. There is
no point in acquiring data if we do not even know how to store it, so let’s get our hands dirty first by
playing with synthetic data. We will start by introducing the NDArray, MXNet’s primary tool for storing
and transforming data. If you have worked with NumPy before, you will notice that NDArrays are, by
design, similar to NumPy’s multi-dimensional array. However, they confer a few key advantages. First,
NDArrays support asynchronous computation on CPU, GPU, and distributed cloud architectures. Second,
they provide support for automatic differentiation. These properties make NDArray indispensable for
deep learning.
Throughout this chapter, we are aiming to get you up and running with the basic functionality. Do not
worry if you do not understand all of the basic math, like element-wise operations or normal distributions.
In the next two chapters we will take another pass at the same material, teaching the material in the context
of practical examples. On the other hand, if you want to go deeper into the mathematical content, see the
Math section in the appendix.
We begin by importing MXNet and the ndarray module from MXNet. Here, nd is short for ndarray.
In [1]: import mxnet as mx
from mxnet import nd
NDArrays represent (possibly multi-dimensional) arrays of numerical values. NDArrays with one axis
correspond (in math-speak) to vectors. NDArrays with two axes correspond to matrices. For arrays with
more than two axes, mathematicians do not have special names; they simply call them tensors.
The simplest object we can create is a vector. To start, we can use arange to create a row vector with
12 consecutive integers.
In [2]: x = nd.arange(12)
x
Out[2]:
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.]
<NDArray 12 @cpu(0)>
When we print x, we can observe the property <NDArray 12 @cpu(0)> listed, which indicates that
x is a one-dimensional array of length 12 and that it resides in CPU main memory. The 0 in @cpu(0)
has no special meaning and does not represent a specific core.
We can get the NDArray instance shape through the shape property.
In [3]: x.shape
Out[3]: (12,)
We can also get the total number of elements in the NDArray instance through the size property. This
is the product of the elements of the shape. Since we are dealing with a vector here, both are identical.
In [4]: x.size
Out[4]: 12
We use the reshape function to change the shape of one (possibly multi-dimensional) array, to another
that contains the same number of elements. For example, we can transform the shape of our line vector x
to (3, 4), which contains the same values but interprets them as a matrix containing 3 rows and 4 columns.
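With x holding the twelve consecutive integers created above, a call along the following lines produces the 3-row, 4-column view just described (the output is what MXNet would print):

x = x.reshape((3, 4))
x
# [[ 0.  1.  2.  3.]
#  [ 4.  5.  6.  7.]
#  [ 8.  9. 10. 11.]]
# <NDArray 3x4 @cpu(0)>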
Reshaping by manually specifying each of the dimensions can get annoying. Once we know one of the dimensions, why should we have to perform the division ourselves to determine the other? For example, above, to get a matrix with 3 rows, we had to specify that it should have 4 columns (to account for the 12 elements). Fortunately, NDArray can automatically work out one dimension given the other. We can invoke this capability by placing -1 for the dimension that we would like NDArray to automatically infer. In our case, instead of x.reshape((3, 4)), we could have equivalently used x.reshape((-1, 4)) or x.reshape((3, -1)).
In [6]: nd.empty((3, 4))
Out[6]:
[[-5.6088627e+35 4.5759401e-41 -5.6088627e+35 4.5759401e-41]
[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]]
<NDArray 3x4 @cpu(0)>
The empty method just grabs some memory and hands us back a matrix without setting the values of any
of its entries. This is very efficient but it means that the entries might take any arbitrary values, including
very big ones! Typically, we’ll want our matrices initialized either with ones, zeros, some known constant
or numbers randomly sampled from a known distribution.
Perhaps most often, we want an array of all zeros. To create an NDArray representing a tensor with all
elements set to 0 and a shape of (2, 3, 4) we can invoke:
In [7]: nd.zeros((2, 3, 4))
Out[7]:
[[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]]
<NDArray 2x3x4 @cpu(0)>
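Similarly, we can create a tensor in which every element is set to 1 with the analogous ones function (a sketch mirroring the call above):

nd.ones((2, 3, 4))
# returns a 2x3x4 NDArray filled entirely with 1.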
We can also specify the value of each element in the desired NDArray by supplying a Python list contain-
ing the numerical values.
In [9]: y = nd.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
y
Out[9]:
[[2. 1. 4. 3.]
[1. 2. 3. 4.]
[4. 3. 2. 1.]]
<NDArray 3x4 @cpu(0)>
In some cases, we will want to randomly sample the values of each element in the NDArray according
to some known probability distribution. This is especially common when we intend to use the array as a
parameter in a neural network. The following snippet creates an NDArray with a shape of (3, 4). Each of its elements is randomly sampled from a normal distribution with zero mean and unit variance.
In [10]: nd.random.normal(0, 1, shape=(3, 4))
Out[10]:
[[ 2.2122064 0.7740038 1.0434405 1.1839255 ]
[ 1.8917114 -1.2347414 -1.771029 -0.45138445]
[ 0.57938355 -1.856082 -1.9768796 -0.20801921]]
<NDArray 3x4 @cpu(0)>
2.1.2 Operations
Oftentimes, we want to apply functions to arrays. Some of the simplest and most useful functions are
the element-wise functions. These operate by performing a single scalar operation on the corresponding
elements of two arrays. We can create an element-wise function from any function that maps from the
scalars to the scalars. In math notations we would denote such a function as f : R → R. Given any two
vectors u and v of the same shape, and the function f, we can produce a vector c = F (u, v) by setting
ci ← f (ui , vi ) for all i. Here, we produced the vector-valued F : Rd → Rd by lifting the scalar function
to an element-wise vector operation. In MXNet, the common standard arithmetic operators (+,-,/,*,**)
have all been lifted to element-wise operations for identically-shaped tensors of arbitrary shape. We can
call element-wise operations on any two tensors of the same shape, including matrices.
In [11]: x = nd.array([1, 2, 4, 8])
y = nd.ones_like(x) * 2
print('x =', x)
print('x + y', x + y)
print('x - y', x - y)
print('x * y', x * y)
print('x / y', x / y)
x =
[1. 2. 4. 8.]
<NDArray 4 @cpu(0)>
In addition to computations by element, we can also perform matrix operations, like matrix multiplication
using the dot function. Next, we will perform matrix multiplication of x and the transpose of y. We
define x as a matrix of 3 rows and 4 columns, and y is transposed into a matrix of 4 rows and 3 columns.
The two matrices are multiplied to obtain a matrix of 3 rows and 3 columns (if you are confused about
what this means, do not worry - we will explain matrix operations in much more detail in the chapter on
linear algebra).
In [13]: x = nd.arange(12).reshape((3,4))
y = nd.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
nd.dot(x, y.T)
Out[13]:
[[ 18. 20. 10.]
[ 58. 60. 50.]
[ 98. 100. 90.]]
<NDArray 3x3 @cpu(0)>
We can also merge multiple NDArrays. For that, we need to tell the system along which dimension to
merge. The example below merges two matrices along dimension 0 (along rows) and dimension 1 (along
columns) respectively.
In [14]: nd.concat(x, y, dim=0)
nd.concat(x, y, dim=1)
Out[14]:
[[ 0. 1. 2. 3. 2. 1. 4. 3.]
[ 4. 5. 6. 7. 1. 2. 3. 4.]
[ 8. 9. 10. 11. 4. 3. 2. 1.]]
<NDArray 3x8 @cpu(0)>
Sometimes, we may want to construct binary NDArrays via logical statements. Take x == y as an
example. If x and y are equal for some entry, the new NDArray has a value of 1 at the same position;
otherwise it is 0.
In [15]: x == y
Summing all the elements in the NDArray yields an NDArray with only one element.
In [16]: x.sum()
Out[16]:
[66.]
<NDArray 1 @cpu(0)>
We can transform the result into a scalar in Python using the asscalar function. In the following
example, the ℓ2 norm of x yields a single element NDArray. The final result is transformed into a scalar.
In [17]: x.norm().asscalar()
Out[17]: 22.494442
For stylistic convenience, we can write y.exp(), x.sum(), x.norm(), etc. also as nd.exp(y), nd.sum(x), nd.norm(x).
In the above section, we saw how to perform operations on two NDArrays of the same shape. When their
shapes differ, a broadcasting mechanism may be triggered analogous to NumPy: first, copy the elements
appropriately so that the two NDArrays have the same shape, and then carry out operations by element.
In [18]: a = nd.arange(3).reshape((3, 1))
b = nd.arange(2).reshape((1, 2))
a, b
Out[18]: (
[[0.]
[1.]
[2.]]
<NDArray 3x1 @cpu(0)>,
[[0. 1.]]
<NDArray 1x2 @cpu(0)>)
Since a and b are (3x1) and (1x2) matrices respectively, their shapes do not match up if we want to add them. NDArray addresses this by 'broadcasting' the entries of both matrices into a larger (3x2) matrix as follows: for matrix a it replicates the columns, and for matrix b it replicates the rows, before adding the two up element-wise.
In [19]: a + b
Out[19]:
[[0. 1.]
[1. 2.]
[2. 3.]]
<NDArray 3x2 @cpu(0)>
Just like in any other Python array, elements in an NDArray can be accessed by their index. In good Python tradition, the first element has index 0 and ranges are specified to include the first but not the last element.
By this logic 1:3 selects the second and third element. Let’s try this out by selecting the respective rows
in a matrix.
In [20]: x[1:3]
Out[20]:
[[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
<NDArray 2x4 @cpu(0)>
If we want to assign multiple elements the same value, we simply index all of them and then assign them
the value. For instance, [0:2, :] accesses the first and second rows. While we discussed indexing for
matrices, this obviously also works for vectors and for tensors of more than 2 dimensions.
In [22]: x[0:2, :] = 12
x
Out[22]:
[[12. 12. 12. 12.]
[12. 12. 12. 12.]
[ 8. 9. 10. 11.]]
<NDArray 3x4 @cpu(0)>
In the previous example, every time we ran an operation, we allocated new memory to host its results.
For example, if we write y = x + y, we will dereference the matrix that y used to point to and instead
point it at the newly allocated memory. In the following example we demonstrate this with Python’s
id() function, which gives us the exact address of the referenced object in memory. After running y =
y + x, we will find that id(y) points to a different location. That is because Python first evaluates y
+ x, allocating new memory for the result and then subsequently redirects y to point at this new location
in memory.
In [23]: before = id(y)
y = y + x
id(y) == before
Out[23]: False
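If we want to write the result into memory that we have already allocated, we can use slice notation instead. A sketch (which also defines the z referred to below; nd.zeros_like creates an array of zeros with the same shape as y):

z = nd.zeros_like(y)   # pre-allocate an array shaped like y
before = id(z)
z[:] = x + y           # write the result into z's existing memory
id(z) == before        # True: z still refers to the same array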
While this looks pretty, x + y here will still allocate a temporary buffer to store the result of x + y before copying it to z[:]. To make even better use of memory, we can directly invoke the underlying ndarray operation, in this case elemwise_add, avoiding temporary buffers. We do this by specifying the out keyword argument, which every ndarray operator supports:
In [25]: before = id(z)
nd.elemwise_add(x, y, out=z)
id(z) == before
Out[25]: True
If the value of x is not reused in subsequent computations, we can also use x[:] = x + y or x +=
y to reduce the memory overhead of the operation.
In [26]: before = id(x)
x += y
id(x) == before
Out[26]: True
Converting MXNet NDArrays to and from NumPy is easy. The converted arrays do not share memory.
This minor inconvenience is actually quite important: when you perform operations on the CPU or on one of the GPUs, you do not want MXNet to have to wait and see whether NumPy might want to do something else with the same chunk of memory. The array and asnumpy functions do the trick.
In [27]: import numpy as np
a = x.asnumpy()
print(type(a))
b = nd.array(a)
print(type(b))
<class 'numpy.ndarray'>
<class 'mxnet.ndarray.ndarray.NDArray'>
1. Run the code in this section. Change the conditional statement x == y in this section to x < y
or x > y, and then see what kind of NDArray you can get.
2. Replace the two NDArrays that operate by element in the broadcast mechanism with other shapes,
e.g. three dimensional tensors. Is the result the same as expected?
3. Assume that we have three matrices a, b and c. Rewrite c = nd.dot(a, b.T) + c in the
most memory efficient manner.
Now that you can store and manipulate data, let’s briefly review the subset of basic linear algebra that you
will need to understand most of the models. We will introduce all the basic concepts, the corresponding
mathematical notation, and their realization in code all in one place. If you are already confident in your
basic linear algebra, feel free to skim through or skip this chapter.
In [1]: from mxnet import nd
2.2.1 Scalars
If you never studied linear algebra or machine learning, you are probably used to working with one
number at a time, and you know how to do basic things like add or multiply them. For
example, in Palo Alto, the temperature is 52 degrees Fahrenheit. Formally, we call these values scalars.
If you wanted to convert this value to Celsius (using the metric system's more sensible unit of temperature
measurement), you would evaluate the expression c = (f − 32) · 5/9, setting f to 52. In this equation,
each of the terms 32, 5, and 9 is a scalar value. The placeholders c and f that we use are called variables
and they represent unknown scalar values.
In mathematical notation, we represent scalars with ordinary lower-cased letters (x, y, z). We also denote
the space of all scalars as R. For expedience, we are going to punt a bit on what precisely a space is, but
for now, remember that if you want to say that x is a scalar, you can simply say x ∈ R. The symbol ∈
can be pronounced in and just denotes membership in a set.
In MXNet, we work with scalars by creating NDArrays with just one element. In the following snippet, we
instantiate two scalars and perform some familiar arithmetic operations with them, namely addition,
multiplication, division, and exponentiation.
In [2]: x = nd.array([3.0])
        y = nd.array([2.0])
        print('x + y = ', x + y)
        print('x * y = ', x * y)
        print('x / y = ', x / y)
        print('x ** y = ', nd.power(x, y))
x + y =
[5.]
<NDArray 1 @cpu(0)>
x * y =
[6.]
<NDArray 1 @cpu(0)>
x / y =
[1.5]
<NDArray 1 @cpu(0)>
x ** y =
[9.]
<NDArray 1 @cpu(0)>
We can convert any NDArray to a Python float by calling its asscalar method. Note that this is
typically a bad idea. While you are doing this, NDArray has to stop doing anything else in order to hand
the result and the process control back to Python. And unfortunately Python is not very good at doing
things in parallel. So avoid sprinkling this operation liberally throughout your code or your networks will
take a long time to train.
In [3]: x.asscalar()
Out[3]: 3.0
2.2.2 Vectors
You can think of a vector as simply a list of numbers, for example [1.0,3.0,4.0,2.0]. Each of the
numbers in the vector consists of a single scalar value. We call these values the entries or components
of the vector. Often, we are interested in vectors whose values hold some real-world significance. For
example, if we are studying the risk that loans default, we might associate each applicant with a vector
whose components correspond to their income, length of employment, number of previous defaults, etc.
If we were studying the risk of heart attacks for hospital patients, we might represent each
patient with a vector whose components capture their most recent vital signs, cholesterol levels, minutes
of exercise per day, etc. In math notation, we will usually denote vectors as bold-faced, lower-cased letters
(u, v, w). In MXNet, we work with vectors via 1D NDArrays with an arbitrary number of components.
In [4]: x = nd.arange(4)
print('x = ', x)
x =
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
Let’s revisit some concepts from the previous section. A vector is just an array of numbers. And just as
every array has a length, so does every vector. In math notation, if we want to say that a vector x consists
of n real-valued scalars, we can express this as x ∈ Rn . The length of a vector is commonly called
its dimension. As with an ordinary Python array, we can access the length of an NDArray by calling
Python’s in-built len() function.
We can also access a vector’s length via its .shape attribute. The shape is a tuple that lists the dimen-
sionality of the NDArray along each of its axes. Because a vector can only be indexed along one axis, its
shape has just one element.
In [6]: x.shape
Out[6]: (4,)
Note that the word dimension is overloaded and this tends to confuse people. Some use the dimensionality
of a vector to refer to its length (the number of components). However some use the word dimensionality
to refer to the number of axes that an array has. In this sense, a scalar would have 0 dimensions and a
vector would have 1 dimension.
To avoid confusion, when we say 2D array or 3D array, we mean an array with 2 or 3 axes respectively.
But if we say n-dimensional vector, we mean a vector of length n.
In [7]: a = 2
        x = nd.array([1,2,3])
        y = nd.array([10,20,30])
        print(a * x)
        print(a * x + y)
[2. 4. 6.]
<NDArray 3 @cpu(0)>
[12. 24. 36.]
<NDArray 3 @cpu(0)>
2.2.4 Matrices
Just as vectors generalize scalars from order 0 to order 1, matrices generalize vectors from 1D to 2D.
Matrices, which we’ll typically denote with capital letters (A, B, C), are represented in code as arrays
with two axes. We can create a matrix with n rows and m columns in MXNet by specifying a shape with two
components (n, m) when calling any of our favorite functions for instantiating an ndarray, such as ones or
zeros.
In [8]: A = nd.arange(20).reshape((5,4))
print(A)
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]]
<NDArray 5x4 @cpu(0)>
Matrices are useful data structures: they allow us to organize data that has different modalities of variation.
For example, rows in our matrix might correspond to different patients, while columns might correspond
to different attributes.
We can access the scalar elements aij of a matrix A by specifying the indices for the row (i) and column
(j) respectively. Leaving them blank via a : takes all elements along the respective dimension (as seen
in the previous section).
We can transpose the matrix through T. That is, if B = AT , then bij = aji for any i and j.
In [9]: print(A.T)
[[ 0. 4. 8. 12. 16.]
[ 1. 5. 9. 13. 17.]
[ 2. 6. 10. 14. 18.]
[ 3. 7. 11. 15. 19.]]
<NDArray 4x5 @cpu(0)>
2.2.5 Tensors
Just as vectors generalize scalars, and matrices generalize vectors, we can actually build data structures
with even more axes. Tensors give us a generic way of discussing arrays with an arbitrary number of
axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Using tensors will become more important when we start working with images, which arrive as 3D data
structures, with axes corresponding to the height, width, and the three (RGB) color channels. But in this
chapter, we’re going to skip this part and make sure you know the basics.
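As a quick illustration (a sketch, not from the original text), a 3D tensor can be created exactly the same way as vectors and matrices:
X = nd.arange(24).reshape((2, 3, 4))
print(X.shape)   # (2, 3, 4)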
Scalars, vectors, matrices, and tensors of any order have some nice properties that we will often rely on.
For example, as you might have noticed from the definition of an element-wise operation, given operands
with the same shape, the result of any element-wise operation is a tensor of that same shape. Another
convenient property is that for all tensors, multiplication by a scalar produces a tensor of the same shape.
In math, given two tensors X and Y with the same shape, αX + Y has the same shape (numerical
mathematicians call this the AXPY operation).
In [11]: a = 2
x = nd.ones(3)
         y = nd.zeros(3)
         print(x.shape)
         print(y.shape)
print((a * x).shape)
print((a * x + y).shape)
(3,)
(3,)
(3,)
(3,)
Shape is not the only property preserved under addition and multiplication by a scalar. These oper-
ations also preserve membership in a vector space. But we will postpone this discussion for the second
half of this chapter because it is not critical to getting your first models up and running.
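The outputs that follow show element sums. In code we can compute the sum of a tensor's elements with nd.sum(); a minimal sketch that would reproduce the three outputs shown below, assuming x = nd.ones(3) from above and the 5×4 matrix A defined earlier:
print(nd.sum(x))   # sum of the elements of the vector x
print(A)
print(nd.sum(A))   # sum of all entries of A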
[3.]
<NDArray 1 @cpu(0)>
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]]
<NDArray 5x4 @cpu(0)>
[190.]
<NDArray 1 @cpu(0)>
A related quantity is the mean, which is also called the average. We calculate the mean by dividing the
sum by the total number of elements. With mathematical notation, we could write the average over a
vector u as $\frac{1}{d}\sum_{i=1}^{d} u_i$ and the average over a matrix A as $\frac{1}{n \cdot m}\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}$. In code, we could just
call nd.mean() on tensors of arbitrary shape:
In [14]: print(nd.mean(A))
         print(nd.sum(A) / A.size)
[9.5]
<NDArray 1 @cpu(0)>
[9.5]
<NDArray 1 @cpu(0)>
So far, we have only performed element-wise operations, sums and averages. And if this was all we could
do, linear algebra probably would not deserve its own chapter. However, one of the most fundamental
operations is the dot product. Given two vectors $\mathbf{u}$ and $\mathbf{v}$, the dot product $\mathbf{u}^\top \mathbf{v}$ is a sum over the products
of the corresponding elements: $\mathbf{u}^\top \mathbf{v} = \sum_{i=1}^{d} u_i \cdot v_i$.
In [15]: x = nd.arange(4)
         y = nd.ones(4)
         print(x, y, nd.dot(x, y))
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
[1. 1. 1. 1.]
<NDArray 4 @cpu(0)>
[6.]
<NDArray 1 @cpu(0)>
Note that we can express the dot product of two vectors nd.dot(x, y) equivalently by performing an
element-wise multiplication and then a sum:
In [16]: nd.sum(x * y)
Out[16]:
[6.]
<NDArray 1 @cpu(0)>
Dot products are useful in a wide range of contexts. For example, given a set of weights w, the weighted
sum of some values u could be expressed as the dot product $\mathbf{u}^\top \mathbf{w}$. When the weights are non-negative
and sum to one ($\sum_{i=1}^{d} w_i = 1$), the dot product expresses a weighted average. When two vectors each
have length one (we will discuss what length means below in the section on norms), dot products can also
capture the cosine of the angle between them.
Now that we know how to calculate dot products we can begin to understand matrix-vector products.
Let’s start off by visualizing a matrix A and a column vector x.
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{pmatrix}, \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$
We can write the matrix in terms of its rows,
$$A = \begin{pmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{pmatrix},$$
where each $\mathbf{a}_i^\top \in \mathbb{R}^m$ is a row vector representing the i-th row of the matrix A.
Then the matrix-vector product $\mathbf{y} = A\mathbf{x}$ is simply a column vector $\mathbf{y} \in \mathbb{R}^n$ where each entry $y_i$ is the dot
product $\mathbf{a}_i^\top \mathbf{x}$:
$$A\mathbf{x} = \begin{pmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{pmatrix} \mathbf{x} = \begin{pmatrix} \mathbf{a}_1^\top \mathbf{x} \\ \mathbf{a}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{a}_n^\top \mathbf{x} \end{pmatrix}$$
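In code, this is again just nd.dot(); a quick sketch with the 5×4 matrix A and the length-4 vector x defined earlier:
print(nd.dot(A, x))   # a length-5 NDArray whose i-th entry is the dot product of row i of A with x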
If you have gotten the hang of dot products and matrix-vector multiplication, then matrix-matrix multi-
plications should be pretty straightforward.
Say we have two matrices, $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{k \times m}$:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{pmatrix}$$
To produce the matrix product $C = AB$, it's easiest to think of A in terms of its row vectors and B in
terms of its column vectors:
$$A = \begin{pmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{pmatrix}, \qquad B = \begin{pmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{pmatrix}.$$
Note here that each row vector $\mathbf{a}_i^\top$ lies in $\mathbb{R}^k$ and that each column vector $\mathbf{b}_j$ also lies in $\mathbb{R}^k$.
Then to produce the matrix product $C \in \mathbb{R}^{n \times m}$ we simply compute each entry $c_{ij}$ as the dot product
$\mathbf{a}_i^\top \mathbf{b}_j$:
$$C = AB = \begin{pmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{pmatrix} \begin{pmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1^\top \mathbf{b}_1 & \mathbf{a}_1^\top \mathbf{b}_2 & \cdots & \mathbf{a}_1^\top \mathbf{b}_m \\ \mathbf{a}_2^\top \mathbf{b}_1 & \mathbf{a}_2^\top \mathbf{b}_2 & \cdots & \mathbf{a}_2^\top \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_n^\top \mathbf{b}_1 & \mathbf{a}_n^\top \mathbf{b}_2 & \cdots & \mathbf{a}_n^\top \mathbf{b}_m \end{pmatrix}$$
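Matrix-matrix multiplication uses nd.dot() as well; a small sketch (the all-ones matrix B is an assumption, chosen only so the shapes are compatible with the A defined earlier):
B = nd.ones(shape=(4, 3))
print(nd.dot(A, B).shape)   # (5, 3)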
2.2.11 Norms
Before we can start implementing models, there is one last concept we are going to introduce. Some of
the most useful operators in linear algebra are norms. Informally, they tell us how big a vector or matrix
is. We represent norms with the notation ∥ · ∥. The · in this expression is just a placeholder. For example,
we would represent the norm of a vector x or matrix A as ∥x∥ or ∥A∥, respectively.
All norms must satisfy a handful of properties:
1. ∥αA∥ = |α|∥A∥
2. ∥A + B∥ ≤ ∥A∥ + ∥B∥
3. ∥A∥ ≥ 0
4. If ∀i, j, aij = 0, then ∥A∥ = 0
To put it in words, the first rule says that if we scale all the components of a matrix or vector by a constant
factor α, its norm also scales by the absolute value of the same constant factor. The second rule is the
familiar triangle inequality. The third rule simply says that the norm must be non-negative. That makes
sense, in most contexts the smallest size for anything is 0. The final rule basically says that the smallest
norm is achieved by a matrix or vector consisting of all zeros. It is possible to define a norm that gives
zero norm to nonzero matrices, but you cannot give nonzero norm to zero matrices. That may seem like
a mouthful, but if you digest it then you have probably grasped the important concepts here.
If you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then non-negativity
and the triangle inequality might ring a bell. You might notice that norms sound a lot like measures of
distance.
In fact, the Euclidean distance $\sqrt{x_1^2 + \cdots + x_n^2}$ is a norm; specifically it is the ℓ2-norm. An analogous
computation, performed over the entries of a matrix, e.g. $\sqrt{\sum_{i,j} a_{ij}^2}$, is called the Frobenius norm. More
often, in machine learning we work with the squared ℓ2 norm (notated ℓ22 ). We also commonly work with
the ℓ1 norm. The ℓ1 norm is simply the sum of the absolute values. It has the convenient property of
placing less emphasis on outliers.
To calculate the ℓ2 norm, we can just call nd.norm(x).
To calculate the ℓ1 norm we can simply take the absolute value of each element and then sum over them:
In [20]: nd.sum(nd.abs(x))
Out[20]:
[6.]
<NDArray 1 @cpu(0)>
While we do not want to get too far ahead of ourselves, we do want you to anticipate why these con-
cepts are useful. In machine learning we are often trying to solve optimization problems: Maximize the
probability assigned to observed data. Minimize the distance between predictions and the ground-truth
observations. Assign vector representations to items (like words, products, or news articles) such that the
distance between similar items is minimized, and the distance between dissimilar items is maximized.
Oftentimes, these objectives, perhaps the most important component of a machine learning algorithm
(besides the data itself), are expressed as norms.
If you have made it this far, and understand everything that we have covered, then honestly, you are ready
to begin modeling. If you are feeling antsy, this is a perfectly reasonable place to move on. You already
know nearly all of the linear algebra required to implement many practically useful models,
and you can always circle back when you want to learn more.
But there is a lot more to linear algebra, even as concerns machine learning. At some point, if you plan
to make a career in machine learning, you will need to know more than what we have covered so far. In
the rest of this chapter, we introduce some useful, more advanced concepts.
Vectors are useful beyond being data structures to carry numbers. In addition to reading and writing
values to the components of a vector, and performing some useful mathematical operations, we can
analyze vectors in some interesting ways.
One important concept is the notion of a vector space. Here are the conditions that make a vector space:
• Additive axioms (we assume that x,y,z are all vectors): x+y = y +x and (x+y)+z = x+(y +z)
and 0 + x = x + 0 = x and (−x) + x = x + (−x) = 0.
• Multiplicative axioms (we assume that x is a vector and a, b are scalars): 0 · x = 0 and 1 · x = x
and (ab)x = a(bx).
Special matrices
There are a number of special matrices that we will use throughout this tutorial. Let’s look at them in a
bit of detail:
• Symmetric Matrix These are matrices where the entries below and above the diagonal are the
same. In other words, we have that M ⊤ = M . An example of such matrices are those that describe
pairwise distances, i.e. Mij = ∥xi − xj ∥. Likewise, the Facebook friendship graph can be written
as a symmetric matrix where Mij = 1 if i and j are friends and Mij = 0 if they are not. Note that
the Twitter graph is asymmetric - Mij = 1, i.e. i following j does not imply that Mji = 1, i.e. j
following i.
• Antisymmetric Matrix These matrices satisfy $M^\top = -M$. Note that any square matrix can
always be decomposed into a symmetric and an antisymmetric matrix via $M = \frac{1}{2}(M + M^\top) + \frac{1}{2}(M - M^\top)$.
• Diagonally Dominant Matrix These are matrices where the off-diagonal elements are small
relative to the main diagonal elements. In particular we have that $M_{ii} \geq \sum_{j \neq i} M_{ij}$ and
$M_{ii} \geq \sum_{j \neq i} M_{ji}$. If a matrix has this property, we can often approximate M by its diagonal.
This is often expressed as diag(M).
• Positive Definite Matrix These are matrices that have the nice property that $\mathbf{x}^\top M \mathbf{x} > 0$ whenever
$\mathbf{x} \neq 0$. Intuitively, they are a generalization of the squared norm of a vector $\|\mathbf{x}\|^2 = \mathbf{x}^\top \mathbf{x}$. It
is easy to check that whenever $M = A^\top A$, this holds, since then $\mathbf{x}^\top M \mathbf{x} = \mathbf{x}^\top A^\top A \mathbf{x} = \|A\mathbf{x}\|^2$.
There is a somewhat more profound theorem which states that all positive definite matrices can be
written in this form.
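As a quick numerical illustration (a sketch that is not part of the original text), we can build a positive definite matrix from a random square matrix Q and check the defining property on a random vector v:
Q = nd.random.normal(shape=(4, 4))
M = nd.dot(Q.T, Q)                      # M = Q^T Q is positive (semi-)definite
v = nd.random.normal(shape=(4, 1))
print(nd.dot(v.T, nd.dot(M, v)))        # v^T M v is non-negative, and positive for almost every v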
Summary
In just a few pages (or one Jupyter notebook) we have taught you all the linear algebra you will need
to understand a good chunk of neural networks. Of course there is a lot more to linear algebra. And a
lot of that math is useful for machine learning. For example, matrices can be decomposed into factors,
and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire
subfields of machine learning that focus on using matrix decompositions and their generalizations to
high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses
on deep learning. And we believe you will be much more inclined to learn more mathematics once you
have gotten your hands dirty deploying useful machine learning models on real datasets. So while we
reserve the right to introduce more math much later on, we will wrap up this chapter here.
If you are eager to learn more about linear algebra, here are some of our favorite resources on the topic
• For a solid primer on basics, check out Gilbert Strang’s book Introduction to Linear Algebra
• Zico Kolter’s Linear Algebra Review and Reference
In machine learning, we train models, updating them successively so that they get better and better as they
see more and more data. Usually, getting better means minimizing a loss function, a score that answers
the question how bad is our model? With neural networks, we typically choose loss functions that are
differentiable with respect to our parameters. Put simply, this means that for each of the model’s parame-
ters, we can determine how much increasing or decreasing it might affect the loss. While the calculations
for taking these derivatives are straightforward, requiring only some basic calculus, for complex models,
working out the updates by hand can be a pain (and often error-prone).
The autograd package expedites this work by automatically calculating derivatives. And while many other
libraries require that we compile a symbolic graph to take automatic derivatives, autograd allows us
to take derivatives while writing ordinary imperative code. Every time we pass data through our model,
autograd builds a graph on the fly, tracking which data combined through which operations to produce
the output. This graph enables autograd to subsequently backpropagate gradients on command. Here
backpropagate simply means to trace through the compute graph, filling in the partial derivatives with
respect to each parameter. If you are unfamiliar with some of the math, e.g. gradients, please refer to the
Mathematical Basics section in the appendix.
In [1]: from mxnet import autograd, nd
As a toy example, say that we are interested in differentiating the mapping y = 2x⊤ x with respect to the
column vector x. To start, let’s create the variable x and assign it an initial value.
In [2]: x = nd.arange(4).reshape((4, 1))
print(x)
[[0.]
[1.]
[2.]
[3.]]
<NDArray 4x1 @cpu(0)>
Once we compute the gradient of y with respect to x, we will need a place to store it. We can tell an
NDArray that we plan to store a gradient by invoking its attach_grad() method:
In [3]: x.attach_grad()
Now we are going to compute y and MXNet will generate a computation graph on the fly. It is as if
MXNet turned on a recording device and captured the exact path by which each variable was generated.
Note that building the computation graph requires a nontrivial amount of computation. So MXNet
will only build the graph when explicitly told to do so. This happens by placing code inside a with
autograd.record(): block.
In [4]: with autograd.record():
            y = 2 * nd.dot(x.T, x)
print(y)
[[28.]]
<NDArray 1x1 @cpu(0)>
Since the shape of x is (4, 1), y is a scalar. Next, we can automatically find the gradient by calling the
backward function. It should be noted that if y is not a scalar, MXNet will first sum the elements in y
to get the new variable by default, and then find the gradient of the variable with respect to x.
In [5]: y.backward()
The gradient of the function y = 2x⊤ x with respect to x should be 4x. Now let’s verify that the gradient
produced is correct.
In [6]: print((x.grad - 4 * x).norm().asscalar() == 0)
        print(x.grad)
True
[[ 0.]
[ 4.]
[ 8.]
[12.]]
<NDArray 4x1 @cpu(0)>
As you can see from the above, after calling the record function, MXNet will record and calculate the
gradient. In addition, autograd will also change the running mode from the prediction mode to the
training mode by default. This can be viewed by calling the is_training function.
In [7]: print(autograd.is_training())
        with autograd.record():
            print(autograd.is_training())
False
True
In some cases, the same model behaves differently in the training and prediction modes (e.g. when using
neural techniques such as dropout and batch normalization). In other cases, some models may store more
auxiliary variables to make computing gradients easier. We will cover these differences in detail in later
chapters. For now, you do not need to worry about them.
One benefit of using automatic differentiation is that even if the computational graph of the function
contains Python control flow (such as conditionals and loops), we may still be able to find the
gradient of a variable. Consider the following program, in which the number of iterations of the while
loop and the branch taken by the if statement depend on the value of the input a:
In [8]: def f(a):
b = a * 2
while b.norm().asscalar() < 1000:
b = b * 2
if b.sum().asscalar() > 0:
c = b
else:
c = 100 * b
return c
To compute the gradient, we again need to record the calculation and then call the backward function.
In [9]: a = nd.random.normal(shape=1)
        a.attach_grad()
        with autograd.record():
            d = f(a)
        d.backward()
Let's analyze the f function defined above. As you can see, it is piecewise linear in its input a. In other
words, for any a there exists some constant g such that f(a) = g * a over a range around a. Consequently,
d / a allows us to verify that the gradient is correct:
In [10]: print(a.grad == (d / a))
[1.]
<NDArray 1 @cpu(0)>
Caution: This part is tricky and not necessary to understanding subsequent sections. That said, it is
needed if you want to build new layers from scratch. You can skip this on a first read.
Sometimes when we call the backward method, e.g. y.backward(), where y is a function of x, we are
just interested in the derivative of y with respect to x. Mathematicians write this as $\frac{dy(x)}{dx}$. At other times,
we may be interested in the gradient of z with respect to x, where z is a function of y, which in turn is a
function of x. That is, we are interested in $\frac{d}{dx} z(y(x))$. Recall that by the chain rule
$$\frac{d}{dx} z(y(x)) = \frac{dz(y)}{dy} \, \frac{dy(x)}{dx}.$$
So, when y is part of a larger function z and we want x.grad to store $\frac{dz}{dx}$, we can pass in the head gradient $\frac{dz}{dy}$ as an argument to backward().
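A minimal sketch of how such a head gradient can be passed (the particular values 10, 1, 0.1, 0.01 are an assumption chosen to match the output printed below; x is the column vector from above):
with autograd.record():
    y = x * 2
    z = y * x                                  # z = 2 * x**2 elementwise, so dz/dx = 4x
head_gradient = nd.array([[10], [1.], [.1], [.01]])
z.backward(head_gradient)                      # weights dz/dx elementwise by the head gradient
print(x.grad)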
[[0. ]
[4. ]
[0.8 ]
[0.12]]
<NDArray 4x1 @cpu(0)>
Summary
Exercises
1. In the control flow example where we calculate the derivative of d with respect to a, what would
happen if we changed the variable a to a random vector or matrix. At this point, the result of the
calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?
2. Redesign an example of finding the gradient of the control flow. Run and analyze the result.
3. In a second-price auction (such as in eBay or in computational advertising), the winning bidder
pays the second-highest price. Compute the gradient of the final price with respect to the winning
bidder’s bid using autograd. What does the result tell you about the mechanism? If you are
curious to learn more about second-price auctions, check out this paper by Edelman, Ostrovski and
Schwartz, 2005.
4. Why is the second derivative much more expensive to compute than the first derivative?
5. Derive the head gradient relationship for the chain rule. If you get stuck, use the Chain Rule article
on Wikipedia.
6. Assume f(x) = sin(x). Plot f(x) and $\frac{df(x)}{dx}$ on a graph, where you compute the latter without
any symbolic calculations, i.e. without exploiting that f′(x) = cos(x).
In some form or another, machine learning is all about making predictions. We might want to predict the
probability of a patient suffering a heart attack in the next year, given their clinical history. In anomaly
detection, we might want to assess how likely a set of readings from an airplane’s jet engine would
be, were it operating normally. In reinforcement learning, we want an agent to act intelligently in an
environment. This means we need to think about the probability of getting a high reward under each of
the available actions. And when we build recommender systems we also need to think about probability.
For example, say hypothetically that we worked for a large online bookseller. We might want to estimate
the probability that a particular user would buy a particular book. For this we need to use the language
of probability and statistics. Entire courses, majors, theses, careers, and even departments, are devoted
to probability. So naturally, our goal in this section isn’t to teach the whole subject. Instead we hope to
get you off the ground, to teach you just enough that you can start building your first machine learning
models, and to give you enough of a flavor for the subject that you can begin to explore it on your own if
you wish.
We’ve already invoked probabilities in previous sections without articulating what precisely they are or
giving a concrete example. Let’s get more serious now by considering the problem of distinguishing cats
and dogs based on photographs. This might sound simple but it’s actually a formidable challenge. To
start with, the difficulty of the problem may depend on the resolution of the image.
Say that we cast a die and want to know what the chance is of seeing a 1 rather than another digit. If the
die is fair, all six outcomes X = {1, . . . , 6} are equally likely to occur, and thus we would see a 1 in 1
out of 6 cases. Formally we state that 1 occurs with probability 1/6.
For a real die that we receive from a factory, we might not know those proportions and we would need to
check whether it is tainted. The only way to investigate the die is by casting it many times and recording
the outcomes. For each cast of the die, we’ll observe a value {1, 2, . . . , 6}. Given these outcomes, we
want to investigate the probability of observing each outcome.
One natural approach for each value is to take the individual count for that value and to divide it by the
total number of tosses. This gives us an estimate of the probability of a given event. The law of large
numbers tells us that as the number of tosses grows this estimate will draw closer and closer to the true
underlying probability. Before going into the details of what's going on here, let's try it out.
To start, let's import the necessary packages:
In [1]: from mxnet import nd
Next, we’ll want to be able to cast the die. In statistics we call this process of drawing examples from
probability distributions sampling. The distribution which assigns probabilities to a number of discrete
choices is called the multinomial distribution. We’ll give a more formal definition of distribution later, but
at a high level, think of it as just an assignment of probabilities to events. In MXNet, we can sample from
the multinomial distribution via the aptly named nd.random.multinomial function. The function
can be called in many ways, but we’ll focus on the simplest. To draw a single sample, we simply pass in
a vector of probabilities.
In [2]: probabilities = nd.ones(6) / 6
        nd.random.multinomial(probabilities)
Out[2]:
[3]
<NDArray 1 @cpu(0)>
If you run the sampler a bunch of times, you’ll find that you get out random values each time. As with
estimating the fairness of a die, we often want to generate many samples from the same distribution. It
would be unbearably slow to do this with a Python for loop, so nd.random.multinomial supports
drawing multiple samples at once, returning an array of independent samples in any shape we might
desire.
In [3]: print(nd.random.multinomial(probabilities, shape=(10)))
        print(nd.random.multinomial(probabilities, shape=(5,10)))
[3 4 5 3 5 3 5 2 3 3]
<NDArray 10 @cpu(0)>
[[2 2 1 5 0 5 1 2 2 4]
[4 3 2 3 2 5 5 0 2 0]
[3 0 2 4 5 4 0 5 5 5]
[2 4 4 2 3 4 4 0 4 3]
[3 0 3 5 4 3 0 2 2 1]]
<NDArray 5x10 @cpu(0)>
Now that we know how to sample rolls of a die, we can simulate 1000 rolls. We can then go through and
count, after each of the 1000 rolls, how many times each number was rolled.
In [4]: rolls = nd.random.multinomial(probabilities, shape=(1000))
        counts = nd.zeros((6,1000))
        totals = nd.zeros(6)
        for i, roll in enumerate(rolls):
            totals[int(roll.asscalar())] += 1
counts[:, i] = totals
To start, we can inspect the final tally at the end of 1000 rolls.
In [5]: totals / 1000
As you can see, the lowest estimated probability for any of the numbers is about .15 and the highest
estimated probability is 0.188. Because we generated the data from a fair die, we know that each number
actually has probability of 1/6, roughly .167, so these estimates are pretty good. We can also visualize
how these probabilities converge over time towards reasonable estimates.
To start let’s take a look at the counts array which has shape (6, 1000). For each time step (out of
1000), counts says how many times each of the numbers has shown up. So we can normalize each j-th
column of the counts vector by the number of tosses to give the current estimated probabilities at that
time. The counts object looks like this:
In [6]: counts
Out[6]:
[[ 0. 0. 0. ... 165. 166. 167.]
[ 1. 1. 1. ... 168. 168. 168.]
[ 0. 0. 0. ... 175. 175. 175.]
[ 0. 0. 0. ... 159. 159. 159.]
[ 0. 1. 2. ... 158. 158. 158.]
[ 0. 0. 0. ... 173. 173. 173.]]
<NDArray 6x1000 @cpu(0)>
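The estimates used in the plot further below can be obtained by normalizing each column of counts by the number of tosses observed up to that point; a minimal sketch (the name estimates is chosen to match the plotting code that follows):
x = nd.arange(1000).reshape((1, 1000)) + 1   # number of tosses after each roll: 1, 2, ..., 1000
estimates = counts / x                        # broadcasting divides each column by its toss count
print(estimates[:, 0])                        # the estimate after the very first toss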
[0. 1. 0. 0. 0. 0.]
<NDArray 6 @cpu(0)>
As you can see, after the first toss of the die, we get the extreme estimate that one of the numbers will be
rolled with probability 1.0 and that the others have probability 0. After 100 rolls, things already look a
bit more reasonable. We can visualize this convergence by using the plotting package matplotlib. If
you don’t have it installed, now would be a good time to install it.
In [8]: %matplotlib inline
from matplotlib import pyplot as plt
from IPython import display
display.set_matplotlib_formats('svg')
plt.figure(figsize=(8, 6))
for i in range(6):
    plt.plot(estimates[i, :].asnumpy(), label=("P(die=" + str(i) + ")"))
# Dashed line at the true probability 1/6 and a legend (lines added here for completeness)
plt.axhline(y=0.167, color='black', linestyle='dashed')
plt.legend()
plt.show()
Each solid curve corresponds to one of the six values of the die and gives our estimated probability that
the die turns up that value as assessed after each of the 1000 turns. The dashed black line gives the true
underlying probability. As we get more data, the solid curves converge towards the true answer.
In our example of casting a die, we introduced the notion of a random variable. A random variable,
which we denote here as X can be pretty much any quantity and is not deterministic. Random variables
could take one value among a set of possibilities. We denote sets with brackets, e.g., {cat, dog, rabbit}.
The items contained in the set are called elements, and we can say that an element x is in the set S, by
writing x ∈ S. The symbol ∈ is read as in and denotes membership. For instance, we could truthfully
say dog ∈ {cat, dog, rabbit}. When dealing with the rolls of a die, we are concerned with a variable
X ∈ {1, 2, 3, 4, 5, 6}.
Note that there is a subtle difference between discrete random variables, like the sides of a die, and
continuous ones, like the weight and the height of a person. There's little point in asking whether two
people have exactly the same height. If we take precise enough measurements, you'll find that no two
people on the planet have exactly the same height.
• For any two mutually exclusive events Z = z and X = x, the probability that either happens
is equal to the sum of their individual probabilities, that is Pr(Z = z ∪ X = x) = Pr(Z =
z) + Pr(X = x).
Very often, we’ll want to consider more than one random variable at a time. For instance, we may want
to model the relationship between diseases and symptoms. Given a disease and a symptom, say 'flu' and
'cough', either may or may not occur in a patient with some probability. While we hope that the probability
of both would be close to zero, we may want to estimate these probabilities and their relationships to each
other so that we may apply our inferences to effect better medical care.
As a more complicated example, images contain millions of pixels, thus millions of random variables.
And in many cases images will come with a label, identifying objects in the image. We can also think of
the label as a random variable. We can even think of all the metadata as random variables such as location,
time, aperture, focal length, ISO, focus distance, camera type, etc. All of these are random variables that
occur jointly. When we deal with multiple random variables, there are several quantities of interest. The
first is called the joint distribution Pr(A, B). Given any elements a and b, the joint distribution lets us
answer, what is the probability that A = a and B = b simultaneously? Note that for any values a and b,
Pr(A = a, B = b) ≤ Pr(A = a).
This has to be the case, since for A and B to happen, A has to happen and B also has to happen (and vice
versa). Thus A, B cannot be more likely than A or B individually. This brings us to an interesting ratio:
$0 \leq \frac{\Pr(A,B)}{\Pr(A)} \leq 1$. We call this a conditional probability and denote it by Pr(B|A), the probability that
B happens, provided that A has happened.
Using the definition of conditional probabilities, we can derive one of the most useful and celebrated
equations in statistics: Bayes' theorem. It goes as follows: By construction, we have that Pr(A, B) =
Pr(B|A) Pr(A). By symmetry, this also holds for Pr(A, B) = Pr(A|B) Pr(B). Solving for one of the
conditional terms we get
$$\Pr(A|B) = \frac{\Pr(B|A)\,\Pr(A)}{\Pr(B)}$$
This is very useful if we want to infer one thing from another, say cause and effect but we only know
the properties in the reverse direction. One important operation that we need, to make this work, is
marginalization, i.e., the operation of determining Pr(A) and Pr(B) from Pr(A, B). We can see that
the probability of seeing A amounts to accounting for all possible choices of B and aggregating the joint
probabilities over all of them, i.e.
$$\Pr(A) = \sum_{B'} \Pr(A, B') \quad \text{and} \quad \Pr(B) = \sum_{A'} \Pr(A', B)$$
Another useful property to check for is dependence vs. independence. Independence is when the oc-
currence of one event does not reveal any information about the occurrence of the other. In this case
Pr(B|A) = Pr(B). Statisticians typically express this as A ⊥⊥ B. From Bayes' Theorem, it follows
immediately that also Pr(A|B) = Pr(A). In all other cases we call A and B dependent. For instance,
two successive rolls of a die are independent. On the other hand, the position of a light switch and the
brightness in the room are not (they are not perfectly deterministic, though, since we could always have a
broken lightbulb, power failure, or a broken switch).
Let’s put our skills to the test. Assume that a doctor administers an AIDS test to a patient. This test is
fairly accurate and it fails only with 1% probability if the patient is healthy by reporting him as diseased.
Moreover, it never fails to detect HIV if the patient actually has it. We use D to indicate the diagnosis
and H to denote the HIV status. Written as a table, the conditional probabilities Pr(D|H) look as follows:
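(The table below is filled in from the error rates just stated: a 1% false positive rate for healthy patients and no missed detections.)
outcome                 HIV positive (H = 1)   HIV negative (H = 0)
Test positive (D = 1)            1                     0.01
Test negative (D = 0)            0                     0.99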
Note that the column sums are all one (but the row sums aren’t), since the conditional probability needs
to sum up to 1, just like the probability. Let us work out the probability of the patient having AIDS if the
test comes back positive. Obviously this is going to depend on how common the disease is, since it affects
the number of false alarms. Assume that the population is quite healthy, e.g. Pr(HIV positive) = 0.0015.
To apply Bayes' Theorem, we need to determine Pr(D = 1), the probability that the test comes back positive for a randomly drawn patient:
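(A short worked computation, filled in from the numbers above; it is where the 13.1% figure below comes from.)
$$\Pr(D = 1) = \Pr(D = 1|H = 0)\Pr(H = 0) + \Pr(D = 1|H = 1)\Pr(H = 1) = 0.01 \cdot 0.9985 + 1 \cdot 0.0015 = 0.011485$$
$$\Pr(H = 1|D = 1) = \frac{\Pr(D = 1|H = 1)\Pr(H = 1)}{\Pr(D = 1)} = \frac{0.0015}{0.011485} \approx 0.131$$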
In other words, there’s only a 13.1% chance that the patient actually has AIDS, despite using a test that is
99% accurate. As we can see, statistics can be quite counterintuitive.
What should a patient do upon receiving such terrifying news? Likely, he/she would ask the physician to
administer another test to get clarity. The second test has different characteristics: it is not as good as the
first one. As used in the calculation below, it detects HIV with probability 0.98 but also produces a false positive with probability 0.03.
Unfortunately, the second test comes back positive, too. Let us work out the requisite probabilities to
invoke Bayes’ Theorem.
• Pr(D1 = 1 and D2 = 1|H = 0) = 0.01 · 0.03 = 0.0003
• Pr(D1 = 1 and D2 = 1|H = 1) = 1 · 0.98 = 0.98
• Pr(D1 = 1 and D2 = 1) = 0.0003 · 0.9985 + 0.98 · 0.0015 = 0.00176955
• Pr(H = 1|D1 = 1 and D2 = 1) = (0.98 · 0.0015) / 0.00176955 ≈ 0.831
That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second
test being considerably less accurate than the first one, it still improved our estimate quite a bit. You
might ask, why couldn’t we just run the first test a second time? After all, the first test was more accurate.
The reason is that we needed a second test whose result is independent of the first test (given the true
diagnosis). In other words, we made the tacit assumption that Pr(D1 , D2 |H) = Pr(D1 |H) Pr(D2 |H).
Statisticians call such random variables conditionally independent. This is expressed as D1 ⊥⊥ D2 |H.
Often, when working with probabilistic models, we’ll want not just to estimate distributions from data,
but also to generate data by sampling from distributions. One of the simplest ways to sample random
numbers is to invoke the random method from Python’s random package.
In [9]: import random
for i in range(10):
print(random.random())
0.5625854181184194
0.7626588898103906
0.9147573048106804
0.21380539256166875
0.5118202320519344
0.45451055424315767
0.8633881425828397
0.19244054951990786
0.20263817255224714
0.747705094936477
Uniform Distribution
These numbers likely appear random. Note that their range is between 0 and 1 and they are evenly
distributed. Because these numbers are generated by default from the uniform distribution, there should
be no two sub-intervals of [0, 1] of equal size where numbers are more likely to lie in one interval than
the other. In other words, the chances of any of these numbers to fall into the interval [0.2, 0.3) are the
same as in the interval [.593264, .693264). In fact, these numbers are pseudo-random, and the computer
generates them by first producing a random integer and then dividing it by its maximum range. To sample
random integers directly, we can run the following snippet, which generates integers in the range between
1 and 100.
In [10]: for i in range(10):
print(random.randint(1, 100))
88
72
21
98
53
5
10
76
68
3
How might we check that randint is really uniform? Intuitively, the best strategy would be to run the
sampler many times, say 1 million, and then count the number of times it generates each value to ensure
that the results are approximately uniform.
In [11]: import math
counts = np.zeros(100)
         fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharex=True)
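The body of the cell that fills counts and draws the histograms is not shown in this extraction; a minimal sketch of how it could look (the sampling range 0–99 and the checkpoints at powers of ten are assumptions chosen to match the description that follows):
axes = axes.reshape(6)                       # index the 2x3 grid of subplots linearly
for i in range(1, 1000001):
    counts[random.randint(0, 99)] += 1
    if i in [10, 100, 1000, 10000, 100000, 1000000]:
        axes[int(math.log10(i)) - 1].bar(np.arange(1, 101), counts)
plt.show()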
We can see from these figures that the initial number of counts looks strikingly uneven. If we sample
fewer than 100 draws from a distribution over 100 outcomes this should be expected. But even for 1000
samples there is a significant variability between the draws. What we are really aiming for is a situation
where the probability of drawing a number x is given by p(x).
Drawing from a uniform distribution over a set of 100 outcomes is simple. But what if we have nonuni-
form probabilities? Let’s start with a simple case, a biased coin which comes up heads with probability
0.35 and tails with probability 0.65. A simple way to sample from that is to generate a uniform random
variable over [0, 1] and if the number is less than 0.35, we output heads and otherwise we generate tails.
Let’s try this out.
In [12]: # Number of samples
         n = 1000000
         y = np.random.uniform(0, 1, n)
         x = np.arange(1, n+1)
         # Count the number of occurrences and divide by the number of total draws
         p0 = np.cumsum(y < 0.35) / x
         p1 = np.cumsum(y >= 0.35) / x   # (p1 filled in so the second plot call below runs)
         plt.figure(figsize=(15, 8))
         plt.semilogx(x, p0)
         plt.semilogx(x, p1)
         plt.show()
As we can see, on average, this sampler will generate 35% zeros and 65% ones. Now what if we have
more than two possible outcomes? We can simply generalize this idea as follows. Given any probability
distribution, e.g. p = [0.1, 0.2, 0.05, 0.3, 0.25, 0.1], we can compute its cumulative distribution (NumPy's
cumsum will do this for you) F = [0.1, 0.3, 0.35, 0.65, 0.9, 1]. Once we have this, we draw a random
variable x from the uniform distribution U[0, 1] and then find the interval where F[i − 1] ≤ x < F[i]. We
then return i as the sample. By construction, the chance of hitting the interval [F[i − 1], F[i]) is exactly
p(i).
Note that there are many more efficient algorithms for sampling than the one above. For instance, binary
search over F will run in O(log n) time for n random variables. There are even more clever algorithms,
such as the Alias Method to sample in constant time, after O(n) preprocessing.
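A short sketch of the cumulative-distribution sampling just described, using NumPy's searchsorted for the interval lookup (this is the binary-search variant mentioned above); the distribution p is the example from the text:
p = np.array([0.1, 0.2, 0.05, 0.3, 0.25, 0.1])
F = np.cumsum(p)                                  # [0.1, 0.3, 0.35, 0.65, 0.9, 1.0]
xi = np.random.uniform(0, 1, size=10000)          # uniform draws on [0, 1)
samples = np.searchsorted(F, xi, side='right')    # index i such that F[i-1] <= xi < F[i]
print(np.bincount(samples, minlength=6) / len(samples))   # should be close to p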
The standard Normal distribution (aka the standard Gaussian distribution) is given by
$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$. Let's plot it to get a feel for it.
Sampling from this distribution is less trivial. First off, the support is infinite, that is, for any x the density
p(x) is positive. Secondly, the density is nonuniform. There are many tricks for sampling from it - the
key idea in all algorithms is to stratify p(x) in such a way as to map it to the uniform distribution U [0, 1].
One way to do this is with the probability integral transform.
Denote by $F(x) = \int_{-\infty}^{x} p(z)\,dz$ the cumulative distribution function (CDF) of p. This is in a way the
continuous version of the cumulative sum that we used previously. In the same way we can now define
the inverse map $F^{-1}(\xi)$, where ξ is drawn uniformly. Unlike previously, where we needed to find the
correct interval for the vector F (i.e. for the piecewise constant function), we now invert the function
F(x).
In practice, this is slightly more tricky since inverting the CDF is hard in the case of a Gaussian. It
turns out that the corresponding two-dimensional integral is much easier to deal with, thus yielding two normal
random variables rather than one, albeit at the price of two uniformly distributed ones. For now, suffice it to say that
there are built-in algorithms to address this.
The normal distribution has yet another desirable property. In a way all distributions converge to it, if we
only average over a sufficiently large number of draws from any other distribution. To understand this in
a bit more detail, we need to introduce three important things: expected values, means and variances.
• The expected value $\mathbf{E}_{x \sim p(x)}[f(x)]$ of a function f under a distribution p is given by the integral
$\int_x p(x) f(x)\,dx$. That is, we average over all possible outcomes, as given by p.
• A particularly important expected value is that for the function f(x) = x, i.e. $\mu := \mathbf{E}_{x \sim p(x)}[x]$. It
provides us with some idea about the typical values of x.
• Another important quantity is the variance, i.e. the typical deviation from the mean,
$\sigma^2 := \mathbf{E}_{x \sim p(x)}[(x - \mu)^2] = \mathbf{E}_{x \sim p(x)}[x^2] - \mu^2$.
Shifting and scaling the standard Normal distribution lets us obtain a Gaussian with any mean µ and variance σ². Note that if we stretch the
distribution by σ, we need to lower its height by 1/σ to retain the same probability mass (i.e. the weight under the
distribution always needs to integrate out to 1).
Now we are ready to state one of the most fundamental theorems in statistics, the Central Limit Theorem.
It states that for sufficiently well-behaved random variables, in particular random variables with well-
defined mean and variance, the sum tends toward a normal distribution. To get some idea, let’s repeat the
experiment described in the beginning, but now using random variables with integer values of {0, 1, 2}.
In [14]: # Generate 10 random sequences of 10,000 uniformly distributed random variables
         tmp = np.random.uniform(size=(10000, 10))
         x = 1.0 * (tmp > 0.3) + 1.0 * (tmp > 0.8)
         mean = 1 * 0.5 + 2 * 0.2
         variance = 1 * 0.5 + 4 * 0.2 - mean**2
         print('mean {}, variance {}'.format(mean, variance))
         # Cumulative averages of each sequence (y and z are filled in here so that
         # the plotting loop below runs)
         y = np.arange(1, 10001).reshape(10000, 1)
         z = np.cumsum(x, axis=0) / y
         plt.figure(figsize=(10, 5))
         for i in range(10):
             plt.semilogx(y, z[:, i])
         plt.show()
More distributions
Many more useful distributions exist. If you’re interested in going deeper, we recommend consulting a
dedicated book on statistics or looking up some common distributions on Wikipedia for further detail.
Some important distributions to be aware of include:
• Binomial Distribution It is used to describe the distribution over multiple draws from the same
distribution, e.g. the number of heads when tossing a biased coin (i.e. a coin with probability π ∈
[0, 1] of returning heads) 10 times. The binomial probability is given by $p(x) = \binom{n}{x} \pi^x (1 - \pi)^{n-x}$.
• Multinomial Distribution Often, we are concerned with more than two outcomes, e.g. when
rolling a die multiple times. In this case, the distribution is given by $p(x) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} \pi_i^{x_i}$.
• Poisson Distribution This distribution models the occurrence of point events that happen with a
given rate, e.g. the number of raindrops arriving within a given amount of time in an area (weird fact:
the number of Prussian soldiers being killed by horses kicking them followed that distribution).
Given a rate λ, the number of occurrences is given by $p(x) = \frac{1}{x!} \lambda^x e^{-\lambda}$.
Summary
So far, we covered probabilities, independence, conditional independence, and how to use this to draw
some basic conclusions. We also introduced some fundamental probability distributions and demon-
strated how to sample from them using Apache MXNet. This is already a powerful bit of knowledge,
and by itself a sufficient set of tools for developing some classic machine learning models. In the next
section, we will see how to operationalize this knowledge to build your first machine learning model: the
Naive Bayes classifier.
Exercises
1. Given two events with probability Pr(A) and Pr(B), compute upper and lower bounds on Pr(A ∪
B) and Pr(A ∩ B). Hint - display the situation using a Venn Diagram.
2. Assume that we have a sequence of events, say A, B and C, where B only depends on A and C
only on B, can you simplify the joint probability? Hint - this is a Markov Chain.
Before we worry about complex optimization algorithms or GPUs, we can already deploy our first clas-
sifier, relying only on simple statistical estimators and our understanding of conditional independence.
Learning is all about making assumptions. If we want to classify a new data point that we’ve never seen
before we have to make some assumptions about which data points are similar to each other.
One popular (and remarkably simple) algorithm is the Naive Bayes Classifier. Note that one natural way
to express the classification task is via the probabilistic question: what is the most likely label given the
features? This gives the predictor
$$\hat{y} = \operatorname{argmax}_y \; p(y|\mathbf{x})$$
Unfortunately, this requires that we estimate p(y|x) for every value of x = x1 , ..., xd . Imagine that each
feature could take one of 2 values. For example, the feature x1 = 1 might signify that the word apple
appears in a given document and x1 = 0 would signify that it does not. If we had 30 such binary features,
that would mean that we need to be prepared to classify any of 2^30 (over 1 billion!) possible values of the
input vector x.
Moreover, where is the learning? If we need to see every single possible example in order to predict the
corresponding label then we’re not really learning a pattern but just memorizing the dataset. Fortunately,
by making some assumptions about conditional independence, we can introduce some inductive bias and
build a model capable of generalizing from a comparatively modest selection of training examples.
To begin, let's use Bayes' Theorem to express the classifier as
$$\hat{y} = \operatorname{argmax}_y \; \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})}$$
Note that the denominator is the normalizing term p(x), which does not depend on the value of the label
y. As a result, we only need to worry about comparing the numerator across different values of y. Even
if calculating the denominator turned out to be intractable, we could get away with ignoring it, so long
as we could evaluate the numerator. Fortunately, even if we wanted to recover the normalizing
constant, we could: since we know that $\sum_y p(y|\mathbf{x}) = 1$, we can always recover the normalization
term. Now, using the chain rule of probability, we can express the term p(x|y) as
$$p(x_1|y) \cdot p(x_2|x_1, y) \cdots p(x_d|x_1, \ldots, x_{d-1}, y)$$
By itself, this expression doesn't get us any further. We still must estimate roughly $2^d$ parameters. However,
if we assume that the features are conditionally independent of each other, given the label, then
suddenly we're in much better shape, as this term simplifies to $\prod_i p(x_i|y)$, giving us the predictor
$$\hat{y} = \operatorname{argmax}_y \; \prod_{i} p(x_i|y) \, p(y)$$
Estimating each term in $\prod_i p(x_i|y)$ amounts to estimating just one parameter. So our assumption of
conditional independence has taken the complexity of our model (in terms of the number of parameters)
from an exponential dependence on the number of features to a linear dependence. Moreover, we can
now make predictions for examples that we’ve never seen before, because we just need to estimate the
terms p(xi |y), which can be estimated based on a number of different documents.
Let's take a closer look at the key assumption that the attributes are all independent of each other, given
the labels, i.e., $p(\mathbf{x}|y) = \prod_i p(x_i|y)$. Consider classifying emails into spam and ham. It's fair to say that
the occurrence of the words Nigeria, prince, money, rich are all likely indicators that the e-mail
might be spam, whereas theorem, network, Bayes or statistics are good indicators that the
exchange is less likely to be part of an orchestrated attempt to wheedle out your bank account numbers.
Since images are much easier to deal with, we will illustrate the workings of a Naive Bayes classifier
for distinguishing digits on the MNIST dataset. The problem is that we don’t actually know p(y) and
p(xi |y). So we need to estimate it given some training data first. This is what is called training the
model. Estimating p(y) is not too hard. Since we are only dealing with 10 classes, this is pretty easy
- simply count the number of occurrences ny for each of the digits and divide it by the total amount of
data n. For instance, if digit 8 occurs n8 = 5, 800 times and we have a total of n = 60, 000 images, the
probability estimate is p(y = 8) = 0.0967.
Now on to slightly more difficult things: p(xi |y). Since we picked black and white images, p(xi |y) denotes
the probability that pixel i is switched on for class y. Just like before we can go and count the number
of times niy such that an event occurs and divide it by the total number of occurrences of y, i.e. ny . But
there’s something slightly troubling: certain pixels may never be black (e.g. for very well cropped images
the corner pixels might always be white). A convenient way for statisticians to deal with this problem is
to add pseudo counts to all occurrences. Hence, rather than niy we use niy + 1 and instead of ny we use
ny + 1. This is also called Laplace Smoothing.
In [1]: %matplotlib inline
from matplotlib import pyplot as plt
from IPython import display
display.set_matplotlib_formats('svg')
import mxnet as mx
from mxnet import nd
import numpy as np
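The counting step itself is not reproduced in this extraction, but the variables xcount, py, px and mnist_test are used below. A minimal sketch of how they could be computed, assuming the Gluon MNIST dataset with pixels binarized at a threshold of 128 and Laplace smoothing via the initial ones:
def transform(data, label):
    # Binarize the pixels (values 0/1) and cast to float32
    return (nd.floor(data / 128)).astype(np.float32), label.astype(np.float32)

mnist_train = mx.gluon.data.vision.MNIST(train=True, transform=transform)
mnist_test = mx.gluon.data.vision.MNIST(train=False, transform=transform)

# Initialize the counters with ones; these act as the pseudo counts
# for the Laplace smoothing described above
xcount = nd.ones((784, 10))
ycount = nd.ones((10))

for data, label in mnist_train:
    y = int(label)
    ycount[y] += 1
    xcount[:, y] += data.reshape((784,))

# Class priors p(y) and per-pixel conditional probabilities p(x_i = 1 | y)
py = ycount / ycount.sum()
px = xcount / ycount.reshape((1, 10))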
Now that we computed per-pixel counts of occurrence for all pixels, it’s time to see how our model
behaves. Time to plot it. This is where it is so much more convenient to work with images. Visualizing
28x28x10 probabilities (for each pixel for each class) would typically be an exercise in futility. However,
by plotting them as images we get a quick overview. The astute reader probably noticed by now that these
are some mean looking digits
In [3]: import matplotlib.pyplot as plt
        fig, figarr = plt.subplots(1, 10, figsize=(10, 10))
for i in range(10):
figarr[i].imshow(xcount[:, i].reshape((28, 28)).asnumpy(), cmap='hot')
figarr[i].axes.get_xaxis().set_visible(False)
figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
print('Class probabilities', py)
Class probabilities
[0.09871688 0.11236461 0.09930012 0.10218297 0.09736711 0.09035161
0.09863356 0.10441593 0.09751708 0.09915014]
<NDArray 10 @cpu(0)>
Now we can compute the likelihoods of an image, given the model. This is statistician speak for p(x|y),
i.e. how likely it is to see a particular image under certain conditions (such as the label). Our Naive Bayes
model, which assumed that all pixels are independent, tells us that
$$p(\mathbf{x}|y) = \prod_{i} p(x_i|y)$$
and by Bayes' rule the posterior over labels is obtained via
$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\,p(y)}{\sum_{y'} p(\mathbf{x}|y')\,p(y')}$$
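The computation whose failure is described next is simply evaluating this product directly for a test image; a minimal sketch, assuming px, py and mnist_test from the sketch above:
# Grab one test image as a binarized 784x1 pixel vector
data, label = mnist_test[0]
x = data.reshape((784, 1))

# Per-pixel likelihood p(x_i|y): px if the pixel is on, (1 - px) if it is off
xprob = px * x + (1 - px) * (1 - x)
# Naive product over all 784 pixels, times the prior
likelihood = xprob.prod(axis=0) * py
print('Unnormalized probabilities', likelihood)
print('Normalized probabilities', likelihood / likelihood.sum())   # underflow -> division by zero -> nan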
This went horribly wrong! To find out why, let’s look at the per pixel probabilities. They’re typically
numbers between 0.001 and 1. We are multiplying 784 of them. At this point it is worth mentioning that
we are calculating these numbers on a computer, hence with a fixed range for the exponent. What happens
is that we experience numerical underflow, i.e. multiplying all the small numbers leads to something even
smaller until it is rounded down to zero. At that point we get division by zero with nan as a result.
To fix this we use the fact that log ab = log a + log b, i.e. we switch to summing logarithms. This will
get us unnormalized probabilities in log-space. To normalize terms we use the fact that
$$\frac{\exp(a)}{\exp(a) + \exp(b)} = \frac{\exp(a + c)}{\exp(a + c) + \exp(b + c)}$$
In particular, we can pick c = −max(a, b), which ensures that at least one of the terms in the denominator
is 1.
In [5]: logpx = nd.log(px)
        logpxneg = nd.log(1-px)
        logpy = nd.log(py)
        def bayespost(data):
            # We need to incorporate the prior probability p(y) since p(y|x) is
            # proportional to p(x|y) p(y)
            logpost = logpy.copy()
            logpost += (logpx * data + logpxneg * (1-data)).sum(0)
            # Normalize to prevent overflow or underflow by subtracting the largest
            # value
            logpost -= nd.max(logpost)
            # Exponentiate and renormalize (a softmax over the log-posteriors)
            post = nd.exp(logpost).asnumpy()
            post /= np.sum(post)
            return post
        # Show 10 images
        ctr = 0
        fig, figarr = plt.subplots(2, 10, figsize=(10, 3))
        for data, label in mnist_test:
            x = data.reshape((784, 1))
            y = int(label)
            post = bayespost(x)
            # Bar chart of the posterior and image of the digit (plotting lines filled in here)
            figarr[1, ctr].bar(range(10), post)
            figarr[1, ctr].axes.get_yaxis().set_visible(False)
            figarr[0, ctr].imshow(x.reshape((28, 28)).asnumpy(), cmap='hot')
            figarr[0, ctr].axes.get_xaxis().set_visible(False)
            figarr[0, ctr].axes.get_yaxis().set_visible(False)
            ctr += 1
            if ctr == 10:
                break
        plt.show()
As we can see, this classifier works pretty well in many cases. However, the second-to-last digit shows that
it can be both incompetent and overly confident of its incorrect estimates. That is, even if it is horribly
wrong, it generates probabilities close to 1 or 0. It is not a classifier we should use nowadays.
To see how well it performs overall, let's compute the overall accuracy of the classifier.
In [6]: # Initialize the counters (loop and final print filled in to make the cell complete)
        ctr = 0
        err = 0
        for data, label in mnist_test:
            ctr += 1
            x = data.reshape((784, 1))
            y = int(label)
            post = bayespost(x)
            if (post[y] < post.max()):
                err += 1
        print('Naive Bayes has an error rate of', err / ctr)
Modern deep networks achieve error rates of less than 0.01. While Naive Bayes classifiers used to be
popular in the 80s and 90s, e.g. for spam filtering, their heyday is over. The poor performance is due to
the incorrect statistical assumptions that we made in our model: we assumed that each and every pixel is
independently generated, depending only on the label. This is clearly not how humans write digits, and
this wrong assumption led to the downfall of our overly naive (Bayes) classifier. Time to start building
deep networks.
Exercises
1. Design a Naive Bayes regression estimator where p(xi |y) is a normal distribution.
2. Under which situations does Naive Bayes work?
3. An eyewitness is sure that he could recognize the perpetrator with 90% accuracy, if he were to
encounter him again.
• Is this a useful statement if there are only 5 suspects?
• Is it still useful if there are 50?
2.6 Documentation
Due to constraints on the length of this book, we cannot possibly introduce every single MXNet function
and class (and you probably would not want us to). The API documentation and additional tutorials
and examples provide plenty of documentation beyond the book. In this section we provide you with some
guidance for exploring the MXNet API.
In order to know which functions and classes can be called in a module, we invoke the dir function. For
instance, we can query all properties in the nd.random module as follows:
In [1]: from mxnet import nd
print(dir(nd.random))
['NDArray', '_Null', '__all__', '__builtins__', '__cached__', '__doc__', '__file__',
,→ '__loader__', '__name__', '__package__', '__spec__', '_internal',
,→ '_random_helper', 'current_context', 'exponential', 'exponential_like', 'gamma',
,→ 'gamma_like', 'generalized_negative_binomial',
,→ 'generalized_negative_binomial_like', 'multinomial', 'negative_binomial',
,→ 'negative_binomial_like', 'normal', 'normal_like', 'numeric_types', 'poisson',
,→ 'poisson_like', 'randint', 'randn', 'shuffle', 'uniform', 'uniform_like']
Generally, we can ignore functions that start and end with __ (special objects in Python) or functions
that start with a single _ (usually internal functions). Based on the remaining function/attribute names,
we might hazard a guess that this module offers various methods for generating random numbers, in-
cluding sampling from the uniform distribution (uniform), normal distribution (normal), and Poisson
distribution (poisson).
For more specific instructions on how to use a given function or class, we can invoke the help function.
As an example, let’s explore the usage instructions for NDArray’s ones_like function.
In [2]: help(nd.ones_like)
Help on function ones_like:
Examples::
Parameters
----------
data : NDArray
The input
Returns
-------
out : NDArray or list of NDArrays
The output of this function.
In the Jupyter notebook, we can use ? to display the document in another window. For example,
nd.random.uniform? will create content that is almost identical to help(nd.random.uniform),
displaying it in a new browser window. In addition, if we use two question marks, e.g.
nd.random.uniform??, the code implementing the function will also be displayed.
For further details on the API, check the MXNet website at https://mxnet.apache.org/. You can find
the details under the appropriate headings (also for programming languages other than Python).
Exercise
3
Before we get into the details of deep neural networks, we need to cover the basics of neural network
training. In this chapter, we will cover the entire training process, including defining simple neural net-
work architectures, handling data, specifying a loss function, and training the model. In order to make
things easier to grasp, we begin with the simplest concepts. Fortunately, classic statistical learning tech-
niques such as linear and logistic regression can be cast as shallow neural networks. Starting from these
classic algorithms, we'll introduce you to the basics, providing the basis for more complex techniques
such as softmax regression (introduced at the end of this chapter) and multilayer perceptrons (introduced
in the next chapter).
To start off, we will introduce the problem of regression. This is the task of predicting a real valued target
y given a data point x. Regression problems are common in practice, arising whenever we want to predict
a continuous numerical value. Some examples of regression problems include predicting house prices,
stock prices, length of stay (for patients in the hospital), tomorrow’s temperature, demand forecasting
(for retail sales), and many more. Note that not every prediction problem is a regression problem. In
subsequent sections we will discuss classification problems, where our predictions are discrete categories.
3.1.1 Basic Elements of Linear Regression
Linear regression, which dates to Gauss and Legendre, is perhaps the simplest, and by far the most popular
approach to solving regression problems. What makes linear regression linear is that we assume that the
output truly can be expressed as a linear combination of the input features.
Linear Model
To keep things simple, we will start with a running example in which we consider the problem of estimating
the price of a house (e.g. in dollars) based on area (e.g. in square feet) and age (e.g. in years). More
formally, the assumption of linearity suggests that our model can be expressed in the following form:
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
In economics papers, it is common for authors to write out linear models in this format with a gigantic
equation that spans multiple lines containing terms for every single feature. For the high-dimensional
data that we often address in machine learning, writing out the entire model can be tedious. In these
cases, we will find it more convenient to use linear algebra notation. In the case of d variables, we could
express our prediction ŷ as follows:
ŷ = w1 · x1 + ... + wd · xd + b
or alternatively, collecting all features into a single vector x and all parameters into a vector w, we can
express our linear model as ŷ = wT x + b.
Above, the vector x corresponds to a single data point. Commonly, we will want notation to refer to the
entire dataset of all input data points. This matrix, often denoted using a capital letter X, is called the
design matrix and contains one row for every example, and one column for every feature.
Given a collection of data points X and a vector containing the corresponding target values y, the goal
of linear regression is to find the weight vector w and bias term b (also called an offset or intercept) that
associates each data point xi with an approximation ŷi of its corresponding label yi .
Expressed in terms of a single data point, this gives us the expression (same as above) ŷ = w⊤ x + b.
Finally, for a collection of data points X, the predictions ŷ can be expressed via the matrix-vector product:
ŷ = Xw + b
Even if we believe that the best model to relate x and y is linear, it’s unlikely that we’d find data where
y lines up exactly as a linear function of x. For example, both the target values y and the features X
might be subject to some amount of measurement error. Thus even when we believe that the linearity
assumption holds, we will typically incorporate a noise term to account for such errors.
Before we can go about solving for the best setting of the parameters w and b, we will need two more
things: (i) some way to measure the quality of the current model and (ii) some way to manipulate the
model to improve its quality.
The first thing that we need is training data. Sticking with our running example, we’ll need some col-
lection of examples for which we know both the actual selling price of each house as well as their cor-
responding area and age. Our goal is to identify model parameters that minimize the error between the
predicted price and the real price. In the terminology of machine learning, the data set is called 'training
data' or a 'training set', a house (often a house and its price) here comprises one 'sample', and its actual
selling price is called a 'label'. The two factors used to predict the label are called 'features' or 'covariates'.
Typically, we will use n to denote the number of samples in our dataset. We index the samples by i,
denoting each input data point as $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]$ and the corresponding label as $y^{(i)}$.
Loss Function
In model training, we need to measure the error between the predicted value and the real value of the
price. Usually, we will choose a non-negative number as the error. The smaller the value, the smaller the
error. A common choice is the square function. For given parameters w and b, we can express the error
of our prediction on a given sample as follows:
$$l^{(i)}(w, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$
The constant 1/2 is just for mathematical convenience, ensuring that after we take the derivative of the
loss, the constant coefficient will be 1. The smaller the error, the closer the predicted price is to the actual
price, and when the two are equal, the error will be zero.
Since the training dataset is given to us, and thus out of our control, the error is only a function of the
model parameters. In machine learning, we call the function that measures the error the 'loss function'.
The squared error function used here is commonly referred to as 'square loss'.
To make things a bit more concrete, consider the example below where we plot a regression problem for
a one-dimensional case, e.g. for a model where house prices depend only on area.
As you can see, large differences between estimates ŷ (i) and observations y (i) lead to even larger contri-
butions in terms of the loss, due to the quadratic dependence. To measure the quality of a model on the
entire dataset, we can simply average the losses on the training set.
$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(w, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(w^\top x^{(i)} + b - y^{(i)}\right)^2.$$
When training the model, we want to find parameters $(w^*, b^*)$ that minimize the average loss across all
training samples:
$$w^*, b^* = \operatorname*{argmin}_{w, b} L(w, b).$$
Analytic Solution
Linear regression happens to be an unusually simple optimization problem. Unlike nearly every other
model that we will encounter in this book, linear regression can be solved easily with a simple formula,
yielding a global optimum. To start we can subsume the bias b into the parameter w by appending a
column to the design matrix consisting of all 1s. Then our prediction problem is to minimize ||y − Xw||.
Because this expression has a quadratic form it is clearly convex, and so long as the problem is not
degenerate (our features are linearly independent), it is strictly convex.
Thus there is just one global critical point on the loss surface corresponding to the global minimum.
w∗ = (X T X)−1 X T y
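As a small illustration of this formula (not part of the book's running example; the data below is made up), we can solve the normal equations directly with NumPy:
import numpy as np

# Made-up design matrix (area, age) and prices for four houses
X = np.array([[100.0, 5.0], [150.0, 10.0], [200.0, 2.0], [120.0, 8.0]])
y = np.array([300.0, 420.0, 640.0, 350.0])
# Append a column of 1s so that the bias b is folded into w
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
# Solve (X^T X) w = X^T y rather than forming the inverse explicitly
w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(w_star)   # the last entry plays the role of the bias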
While simple problems like linear regression may admit analytic solutions, you should not get used to
such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of
an analytic solution confines one to a restrictive set of models that would exclude all of deep learning.
Gradient descent
Even in cases where we cannot solve the models analytically, and even when the loss surfaces are high-
dimensional and nonconvex, it turns out that we can still make progress. Moreover, when those difficult-
to-optimize models are sufficiently superior for the task at hand, figuring out how to train them is well
worth the trouble.
The key trick behind nearly all of deep learning, and one that we will use repeatedly throughout this book, is to
reduce the error gradually by iteratively updating the parameters, each step moving the parameters in
the direction that incrementally lowers the loss function. This algorithm is called gradient descent. On
convex loss surfaces it will eventually converge to a global minimum, and while the same can’t be said
for nonconvex surfaces, it will at least lead towards a (hopefully good) local minimum.
The most naive application of gradient descent consists of taking the derivative of the true loss, which
is an average of the losses computed on every single example in the dataset. In practice, this can be
extremely slow. We must pass over the entire dataset before making a single update. Thus, we'll often
settle for sampling a random mini-batch of examples every time we need to compute the update, a variant
called stochastic gradient descent.
In each iteration, we first randomly and uniformly sample a mini-batch B consisting of a fixed number of
training data examples. We then compute the derivative (gradient) of the average loss on the mini-batch
with regard to the model parameters. Finally, the product of this result and a predetermined step size
η > 0 is used to update the parameters in the direction that lowers the loss.
We can express the update mathematically as follows (∂ denotes the partial derivative):
$$(w, b) \leftarrow (w, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(w,b)}\, l^{(i)}(w, b)$$
To summarize, the steps of the algorithm are the following: (i) we initialize the values of the model param-
eters, typically at random; (ii) we iterate over the data many times, updating the parameters in each iteration
by moving them in the direction of the negative gradient, as calculated on a random minibatch of
data.
For quadratic losses and linear functions we can write this out explicitly as follows. Note that w and x
are vectors. Here the more elegant vector notation makes the math much more readable than expressing
things in terms of individual coefficients:
$$w \leftarrow w - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} x^{(i)}\left(w^\top x^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}}\left(w^\top x^{(i)} + b - y^{(i)}\right).$$
In the above equations |B| represents the number of samples (batch size) in each mini-batch, and η is referred
to as the 'learning rate' and takes a positive value. It should be emphasized that the values of the batch
size and learning rate are set somewhat manually and are typically not learned through model training.
Therefore, they are referred to as hyper-parameters. What we usually call tuning hyper-parameters refers
to the adjustment of these terms. In the worst case this is performed through repeated trial and error until
the appropriate hyper-parameters are found. A better approach is to learn these as part of model training.
This is an advanced topic and we do not cover it here for the sake of simplicity.
Model Prediction
After completing the training process, we record the estimated model parameters, denoted ŵ, b̂ (in gen-
eral the hat symbol denotes estimates). Note that the parameters that we learn via gradient descent are
not exactly equal to the true minimizers of the loss on the training set; that's because gradient descent
converges slowly towards a local minimum but does not reach it exactly. Moreover, if the problem has mul-
tiple local minima, we may not necessarily reach the lowest one. Fortunately, for deep neural
networks, finding parameters that minimize the loss on training data is seldom a significant problem. The
more formidable task is to find parameters that will achieve low loss on data that we have not seen before,
a challenge called generalization. We return to these topics throughout the book.
Given the learned linear regression model ŵ⊤ x + b̂, we can now estimate the price of any house
outside the training data set with area (square feet) as x1 and house age (year) as x2 . Here, estimation is
also referred to as 'model prediction' or 'model inference'.
Note that calling this step 'inference' is a misnomer, but has become standard jargon in deep learning.
In statistics, 'inference' means estimating parameters and outcomes based on other data. This misuse of
terminology in deep learning can be a source of confusion when talking to statisticians.
So far we only talked about linear functions. While neural networks cover a much richer family of
models, we can begin thinking of the linear model as a neural network by expressing it in the language of
neural networks. To begin, let's start by rewriting things in a 'layer' notation.
Commonly, deep learning practitioners represent models visually using neural network diagrams. In Fig-
ure 3.1, we represent linear regression with a neural network diagram. The diagram shows the connectiv-
ity among the inputs and output, but does not depict the weights or biases (which are given implicitly).
In the above network, the inputs are x1 , x2 , . . . xd . Sometimes the number of inputs is referred to as the
feature dimension. For linear regression models, we act upon d inputs and output 1 value. Because there
is just a single computed neuron (node) in the graph, we can think of linear models as neural networks
consisting of just a single neuron. Since all inputs are connected to all outputs (in this case it’s just one),
this layer can also be regarded as an instance of a fully-connected layer, also commonly called a dense
layer.
Biology
Neural networks derive their name from their inspirations in neuroscience. Although linear regression
predates computational neuroscience, many of the models we subsequently discuss truly owe to neural
inspiration. To understand the neural inspiration for artificial neural networks it is worthwhile considering
the basic structure of a neuron. For the purpose of the analogy it is sufficient to consider the dendrites
(input terminals), the nucleus (CPU), the axon (output wire), and the axon terminals (output terminals)
which connect to other neurons via synapses.
Information xi arriving from other neurons (or environmental sensors such as the retina) is received in
the dendrites. In particular, that information is weighted by synaptic weights wi which determine how
to respond to the inputs (e.g. activation or inhibition via $x_i w_i$). All this is aggregated in the nucleus as
$y = \sum_i x_i w_i + b$, and this information is then sent for further processing in the axon y, typically after
some nonlinear processing via σ(y). From there it either reaches its destination (e.g. a muscle) or is fed
into another neuron via its dendrites.
Brain structures vary significantly. Some look (to us) rather arbitrary whereas others have a regular
structure. For example, the visual system of many insects is consistent across members of a species.
The analysis of such structures has often inspired neuroscientists to propose new architectures, and in
some cases, this has been successful. However, much research in artificial neural networks has little to do
with any direct inspiration in neuroscience, just as, although airplanes are inspired by birds, the study of
ornithology hasn't been the primary driver of aeronautics innovation in the last century. These days, just as much
inspiration comes from mathematics, statistics, and computer science.
In model training or prediction, we often use vector calculations and process multiple observations at the
same time. To illustrate why this matters, consider two methods of adding vectors. We begin by creating
two 10,000-dimensional vectors of all ones.
In [1]: from mxnet import nd
from time import time
a = nd.ones(shape=10000)
b = nd.ones(shape=10000)
One way to add vectors is to add them one coordinate at a time using a for loop.
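The timing cells themselves are not reproduced here; the sketch below shows the comparison we have in mind, using the vectors a and b created above — first a Python for loop over coordinates, then a single vectorized call:
# Scalar approach: add the two vectors one coordinate at a time
start = time()
c = nd.zeros(shape=10000)
for i in range(10000):
    c[i] = a[i] + b[i]
print('loop: %.5f sec' % (time() - start))

# Vectorized approach: let NDArray perform the elementwise sum in one call
start = time()
d = a + b
print('vectorized: %.5f sec' % (time() - start))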
Obviously, the latter is vastly faster than the former. Vectorizing code is a good way of getting order-of-
magnitude speedups. Likewise, as we saw above, it also greatly simplifies the mathematics and with it
reduces the potential for errors in the notation.
The following is optional and can be skipped but it will greatly help with understanding some of the design
choices in building deep learning models. As we saw above, using the squared loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$
has many nice properties, such as having a particularly simple derivative $\partial_{\hat{y}}\, l(y, \hat{y}) = (\hat{y} - y)$. That is,
the gradient is given by the difference between estimate and observation. You might reasonably point out
that linear regression is a classical statistical model. Legendre first developed the method of least squares
regression in 1805, which was shortly thereafter rediscovered by Gauss in 1809. To understand this a bit
better, recall the normal distribution with mean µ and variance σ 2 .
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$
import numpy as np
x = np.arange(-7, 7, 0.01)
# Mean and variance pairs
parameters = [(0, 1), (0, 2), (3, 1)]
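The plotting cell is not shown above; a minimal sketch that produces the figure described next (one density curve per (mean, variance) pair) could look as follows:
from matplotlib import pyplot as plt

plt.figure(figsize=(10, 6))
for (mu, var) in parameters:
    # Evaluate the normal density with mean mu and variance var on the grid x
    p = 1 / np.sqrt(2 * np.pi * var) * np.exp(-(x - mu) ** 2 / (2 * var))
    plt.plot(x, p, label='mean %g, variance %g' % (mu, var))
plt.legend()
plt.show()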
As can be seen in the figure above, changing the mean shifts the function, increasing the variance makes
it more spread-out with a lower peak. The key assumption in linear regression with least mean squares
loss is that the observations arise from noisy measurements, where noise is added to the data, e.g.
as part of the observation process.
y = w⊤ x + b + ϵ where ϵ ∼ N (0, σ 2 )
This allows us to write out the likelihood of seeing a particular y for a given x via
$$p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - w^\top x - b\right)^2\right)$$
A good way of finding the most likely values of b and w is to maximize the likelihood of the entire dataset
$$p(Y|X) = \prod_{i=1}^{n} p(y^{(i)}|x^{(i)})$$
The notion of maximizing the likelihood of the data subject to the parameters is well known as the
Maximum Likelihood Principle and its estimators are usually called Maximum Likelihood Estimators
(MLE). Unfortunately, maximizing the product of many exponential functions is pretty awkward, both in
terms of implementation and when writing it out on paper. It is much more convenient to work with the
negative log-likelihood instead, which we minimize:
$$-\log P(Y|X) = \sum_{i=1}^{n} \left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y^{(i)} - w^\top x^{(i)} - b\right)^2\right]$$
A closer inspection reveals that for the purpose of minimizing $-\log P(Y|X)$ we can skip the first term
since it doesn't depend on w, b, or even the data. The second term is identical to the objective we initially
introduced, but for the multiplicative constant $\frac{1}{\sigma^2}$. Again, this can be skipped if we just want to get the
most likely solution. It follows that maximum likelihood in a linear model with additive Gaussian noise
is equivalent to linear regression with squared loss.
Summary
• Key ingredients in a machine learning model are training data, a loss function, an optimization
algorithm, and quite obviously, the model itself.
• Vectorizing makes everything better (mostly math) and faster (mostly code).
• Minimizing an objective function and performing maximum likelihood can mean the same thing.
• Linear models are neural networks, too.
Exercises
1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant b such that
$\sum_i (x_i - b)^2$ is minimized.
• Find the optimal closed form solution.
• What does this mean in terms of the Normal distribution?
2. Assume that we want to solve the optimization problem for linear regression with quadratic loss
explicitly in closed form. To keep things simple, you can omit the bias b from the problem.
• Rewrite the problem in matrix and vector notation (hint - treat all the data as a single matrix).
• Compute the gradient of the optimization problem with respect to w.
• Find the closed form solution by solving a matrix equation.
• When might this be better than using stochastic gradient descent (i.e. the incremental opti-
mization approach that we discussed above)? When will this break (hint - what happens for
high-dimensional x, what if many observations are very similar)?
3. Assume that the noise model governing the additive noise ϵ is the exponential distribution. That is,
$p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
• Write out the negative log-likelihood of the data under the model − log p(Y |X).
Now that you have some background on the ideas behind linear regression, we are ready to step through a
hands-on implementation. In this section, and similar ones that follow, we are going to implement all parts
of linear regression: the data pipeline, the model, the loss function, and the gradient descent optimizer,
from scratch. Not surprisingly, today’s deep learning frameworks can automate nearly all of this work, but
if you never learn to implement things from scratch, then you may never truly understand how the model
works. Moreover, when it comes time to customize models, defining our own layers, loss functions,
etc., knowing how things work under the hood will come in handy. Thus, we start off describing how
to implement linear regression relying only on the primitives in the NDArray and autograd packages.
In the section immediately following, we will present the compact implementation, using all of Gluon’s
bells and whistles, but this is where we dive into the details.
To start off, we import the packages required to run this section's experiments: we'll be using
matplotlib for plotting, setting it to embed plots in the notebook.
In [1]: %matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random
For this demonstration, we will construct a simple artificial dataset so that we can easily visualize the
data and compare the true pattern to the learned parameters. We will set the number of examples in our
training set to be 1000 and the number of features (or covariates) to 2. Thus our synthetic dataset will be
y = Xw + b + ϵ
Following standard assumptions, we choose a noise term ϵ that obeys a normal distribution with mean
0; in this example, we'll set its standard deviation to 0.01. The following code generates our
synthetic dataset:
In [2]: num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
Note that each row in features consists of a 2-dimensional data point and that each row in labels
consists of a 1-dimensional target value (a scalar).
In [3]: features[0], labels[0]
Out[3]: (
[2.2122064 0.7740038]
<NDArray 2 @cpu(0)>,
[6.000587]
<NDArray 1 @cpu(0)>)
By generating a scatter plot using the second feature, features[:, 1], and labels, we can clearly observe
the linear correlation between the two.
In [4]: def use_svg_display():
    # Display in vector graphics
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the size of the figure to be plotted
    plt.rcParams['figure.figsize'] = figsize

set_figsize(figsize=(10, 6))
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);
Recall that training models consists of making multiple passes over the dataset, grabbing one mini-batch
of examples at a time and using them to update our model. Since this process is so fundamental to training
machine learning algorithms, we need a utility for shuffling the data and accessing it in mini-batches.
In the following code, we define a data_iter function to demonstrate one possible implementation of
this functionality. The function takes a batch size, a design matrix containing the features, and a vector
of labels, yielding minibatches of size batch_size, each consisting of a tuple of features and labels.
In [5]: # This function has been saved in the d2l package for future use
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        # The take function returns the corresponding elements based
        # on the indices
        yield features.take(j), labels.take(j)
Let's read and print the first mini-batch of examples.
In [6]: batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, y)
    break
[[-1.1040542 0.859058 ]
[ 0.82249576 0.68154025]
[-0.169078 0.03178489]
[ 0.6558425 -0.50490594]
[-1.2315675 1.8952172 ]
[-0.42438623 -0.20454668]
[ 0.5591865 -1.8126351 ]
[-0.08319951 0.21010068]
[ 0.31943294 -0.7632253 ]
[ 1.0532789 0.24552767]]
<NDArray 10x2 @cpu(0)>
[-0.9210661 3.5357232 3.7491038 7.2224727 -4.7079477 4.0555987
11.479114 3.3371363 7.4097967 5.468723 ]
<NDArray 10 @cpu(0)>
It should be no surprise that as we run the iterator, we will obtain distinct minibatches each time until
all the data has been exhausted (try this). While the iterator implemented above is good for didactic
purposes, it is inefficient in ways that might get us in trouble on real problems. For example, it requires
that we load all data in memory and that we perform a lot of random memory access. The built-in iterators
implemented in Apache MXNet are considerably more efficient and they can deal both with data stored in files
and data fed via a data stream.
Before we can begin optimizing our model’s parameters by gradient descent, we need to have some
parameters in the first place. In the following code, we initialize weights by sampling random numbers
from a normal distribution with mean 0 and a standard deviation of 0.01, setting the bias b to 0.
In [7]: w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
Now that we have initialized our parameters, our next task is to update them until they fit our data suf-
ficiently well. Each update will require taking the gradient (a multi-dimensional derivative) of our loss
function with respect to the parameters. Given this gradient, we will update each parameter in the direc-
tion that reduces the loss.
Next, we must define our model, relating its inputs and parameters to its outputs. Recall that to calculate
the output of the linear model, we simply take the matrix-vector dot product of the examples X and the
model's weights w, and add the offset b to each example. Note that below nd.dot(X, w) is a vector
and b is a scalar. Recall that when we add a vector and a scalar, the scalar is added to each component of
the vector.
In [9]: # This function has been saved in the d2l package for future use
def linreg(X, w, b):
    return nd.dot(X, w) + b
Since updating our model requires taking the gradient of our loss function, we ought to define the loss
function first. Here we will use the squared loss function as described in the previous section. In the
implementation, we need to reshape the true values y to match the shape of the predictions y_hat. The
result returned by the following function will also have the same shape as y_hat.
In [10]: # This function has been saved in the d2l package for future use
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
As we discussed in the previous section, linear regression has a closed-form solution. However, this isn't
a book about linear regression, it's a book about deep learning. Since none of the other models that this
book introduces can be solved analytically, we will take this opportunity to introduce your first working
example of stochastic gradient descent (SGD).
At each step, using one batch randomly drawn from our dataset, we’ll estimate the gradient of the loss with
respect to our parameters. Then, we’ll update our parameters a small amount in the direction that reduces
the loss. Assuming that the gradient has already been calculated, each parameter (param) already has its
gradient stored in param.grad. The following code applies the SGD update, given a set of parameters,
a learning rate, and a batch size. The size of the update step is determined by the learning rate lr.
Because our loss is calculated as a sum over the batch of examples, we normalize our step size by the
batch size (batch_size), so that the magnitude of a typical step does not depend on our choice of batch size.
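The cell defining this update is not reproduced above; a sketch consistent with the description (and with the call sgd([w, b], lr, batch_size) used in the training loop below) is:
# Sketch of the SGD update helper described above
def sgd(params, lr, batch_size):
    for param in params:
        # In-place update so that the attached gradient buffers are preserved
        param[:] = param - lr * param.grad / batch_size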
3.2.7 Training
Now that we have all of the parts in place, we are ready to implement the main training loop. It is crucial
that you understand this code because you will see training loops that are nearly identical to this one over
and over again throughout your career in deep learning.
In each iteration, we will grab minibatches of training examples, first passing them through our model to obtain a set
of predictions. After calculating the loss, we will call the backward function to backpropagate through
the network, storing the gradients with respect to each parameter in its corresponding .grad attribute.
Finally, we will call the optimization algorithm sgd to update the model parameters. Since we previously
set the batch size batch_size to 10, the loss shape l for each small batch is (10, 1).
In summary, we’ll execute the following loop:
• Initialize parameters (w, b)
• Repeat until done
– Compute gradient $g \leftarrow \partial_{(w,b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(x^{(i)}, y^{(i)}, w, b)$
– Update parameters (w, b) ← (w, b) − ηg
In the code below, l is a vector of the losses for each example in the minibatch. Because l is not a scalar
variable, running l.backward() first sums the elements in l to obtain a scalar and then
calculates the gradient of that scalar with respect to the model parameters.
In each epoch (a pass through the data), we will iterate through the entire dataset (using the data_iter
function) once, passing through every example in the training dataset (assuming the number of examples
is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both
hyper-parameters, which we set here to 3 and 0.03, respectively. Unfortunately, setting hyper-parameters
is tricky and requires some adjustment by trial and error. We elide these details for now but revise them
later in the chapter on Optimization Algorithms.
In [12]: lr = 0.03 # Learning rate
num_epochs = 3 # Number of passes over the data (epochs)
net = linreg # Our fancy linear model
loss = squared_loss # 0.5 (y-y')^2
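The loop itself is not reproduced above; a sketch that follows the recipe just described (and uses the data_iter, squared_loss, and sgd helpers defined earlier) is:
for epoch in range(num_epochs):
    # One pass over the dataset, one minibatch at a time
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)   # minibatch loss
        l.backward()                    # gradients with respect to [w, b]
        sgd([w, b], lr, batch_size)     # update the parameters
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))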
In this case, because we used synthetic data (that we synthesized ourselves!), we know precisely what the
true parameters are. Thus, we can evaluate our success in training by comparing the true parameters with
those that we learned through our training loop. Indeed they turn out to be very close to each other.
In [13]: print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)
Error in estimating w
[ 0.0004195 -0.00037026]
<NDArray 2 @cpu(0)>
Error in estimating b
[0.00063467]
<NDArray 1 @cpu(0)>
Note that we should not take it for granted that we are able to recover the parameters accurately. This only
happens for a special category of problems: strongly convex optimization problems with 'enough' data to
ensure that the noisy samples allow us to recover the underlying dependency. In most cases this is not the
case. In fact, the parameters of a deep network are rarely the same (or even close) between two different
runs, unless all conditions are identical, including the order in which the data is traversed. However, in
machine learning we are typically less concerned with recovering true underlying parameters, and more
concerned with parameters that lead to accurate prediction. Fortunately, even on difficult optimization
problems, stochastic gradient descent can often lead to remarkably good solutions, due in part to the
fact that for the models we will be working with, there exist many sets of parameters that work well.
Summary
We saw how a deep network can be implemented and optimized from scratch, using just NDArray and
autograd, without any need for defining layers, fancy optimizers, etc. This only scratches the surface
of what is possible. In the following sections, we will describe additional models based on the concepts
that we have just introduced and learn how to implement them more concisely.
Exercises
1. What would happen if we were to initialize the weights as w = 0? Would the algorithm still work?
2. Assume that you're Georg Simon Ohm trying to come up with a model between voltage and current.
Can you use autograd to learn the parameters of your model?
3. Can you use Planck's Law to determine the temperature of an object using spectral energy density?
The surge of deep learning has inspired the development of a variety of mature software frameworks
that automate much of the repetitive work of implementing deep learning models. In the previous section
we relied only on NDArray for data storage and linear algebra and the auto-differentiation capabilities in
the autograd package. In practice, because many of the more abstract operations, e.g. data iterators,
loss functions, model architectures, and optimizers, are so common, deep learning libraries will give us
library functions for these as well.
In this section, we will introduce Gluon, MXNet’s high-level interface for implementing neural networks
and show how we can implement the linear regression model from the previous section much more
concisely.
To start, we will generate the same data set as that used in the previous section.
In [1]: from mxnet import autograd, nd
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
Rather than rolling our own iterator, we can call upon Gluon’s data module to read data. Since data
is often used as a variable name, we will replace it with the pseudonym gdata (adding the first letter
of Gluon) to differentiate the imported data module from a variable we might define. The first step
will be to instantiate an ArrayDataset, which takes in one or more NDArrays as arguments. Here,
we pass in features and labels as arguments. Next, we will use the ArrayDataset to instantiate a
DataLoader, which also requires that we specify a batch_size and specify a Boolean value shuffle
indicating whether or not we want the DataLoader to shuffle the data on each epoch (pass through the
dataset).
In [2]: from mxnet.gluon import data as gdata

batch_size = 10
# Combine the features and labels of the training data
dataset = gdata.ArrayDataset(features, labels)
# Randomly reading mini-batches
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)
Now we can use data_iter in much the same way as we called the data_iter function in the
previous section. To verify that it’s working, we can read and print the first mini-batch of instances.
In [3]: for X, y in data_iter:
    print(X, y)
    break
[[-0.3365844 -0.3710372 ]
[ 1.6168119 -0.38360843]
[ 0.45584002 0.04933728]
[ 2.6766868 0.6325003 ]
[ 0.34202248 0.46252233]
[-0.00381326 0.9484227 ]
[ 0.05124917 -1.3801315 ]
[-0.09838037 -0.08484158]
[-1.5530336 0.926422 ]
[-1.1014541 -0.41594216]]
<NDArray 10x2 @cpu(0)>
[ 4.791426 8.728727 4.9507527 7.3999124 3.3229208 0.9791935
8.99153 4.2787137 -2.0571606 3.4144146]
<NDArray 10 @cpu(0)>
When we implemented linear regression from scratch in the previous section, we had to define the model
parameters and explicitly write out the calculation to produce output using basic linear algebra operations.
You should know how to do this. But once your models get more complex, even qualitatively simple
changes to the model might result in many low-level changes.
For standard operations, we can use Gluon’s predefined layers, which allow us to focus especially on the
layers used to construct the model rather than having to focus on the implementation.
Recall the architecture of a single layer network. The layer is fully connected since it connects all inputs
with all outputs by means of a matrix-vector multiplication. In Gluon, the fully-connected layer is defined
in the Dense class. Since we only want to generate a single scalar output, we set that number to 1.
In [5]: from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(1))
It is worth noting that, for convenience, Gluon does not require us to specify the input shape for each
layer. So here, we don’t need to tell Gluon how many inputs go into this linear layer. When we first try
to pass data through our model, e.g., when we execute net(X) later, Gluon will automatically infer
the number of inputs to each layer. We will describe how this works in more detail in the chapter Deep
Learning Computation.
Before using net, we need to initialize the model parameters, such as the weights and biases in the lin-
ear regression model. We will import the initializer module from MXNet. This module provides
various methods for model parameter initialization. Gluon makes init available as a shortcut (abbrevi-
ation) to access the initializer package. By calling [Link](sigma=0.01), we specify
that each weight parameter should be randomly sampled from a normal distribution with mean 0 and
standard deviation 0.01. The bias parameter will be initialized to zero by default.
In [6]: from mxnet import init
net.initialize(init.Normal(sigma=0.01))
The code above looks straightforward but in reality something quite strange is happening here. We are
initializing parameters for a network even though we haven't yet told Gluon how many dimensions the
input will have. Gluon gets away with this because, behind the scenes, the initialization is deferred until the
first time we pass data through the network.
In Gluon, the loss module defines various loss functions. We will replace the imported module loss
with the pseudonym gloss, and directly use its implementation of squared loss (L2Loss).
In [7]: from mxnet.gluon import loss as gloss
loss = gloss.L2Loss() # The squared loss is also known as the L2 norm loss
Not surprisingly, we aren't the first people to implement mini-batch stochastic gradient descent, and thus
Gluon supports SGD alongside a number of variations on this algorithm through its Trainer class.
When we instantiate the Trainer, we'll specify the parameters to optimize over (obtainable from our
net via net.collect_params()), the optimization algorithm we wish to use (sgd), and a dictionary
of hyper-parameters required by our optimization algorithm. SGD just requires that we set the value of
learning_rate (here we set it to 0.03).
In [8]: from mxnet import gluon
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
3.3.7 Training
You might have noticed that expressing our model through Gluon requires comparatively few lines of
code. We didn’t have to individually allocate parameters, define our loss function, or implement stochastic
gradient descent. Once we start working with much more complex models, the benefits of relying on
Gluon’s abstractions will grow considerably. But once we have all the basic pieces in place, the training
loop itself is strikingly similar to what we did when implementing everything from scratch.
To refresh your memory: for some number of epochs, we’ll make a complete pass over the dataset
(train_data), grabbing one mini-batch of inputs and corresponding ground-truth labels at a time. For each
batch, we’ll go through the following ritual:
• Generate predictions by calling net(X) and calculate the loss l (the forward pass).
• Calculate gradients by calling l.backward() (the backward pass).
• Update the model parameters by invoking our SGD optimizer (note that trainer already knows
which parameters to optimize over, so we just need to pass in the batch size).
For good measure, we compute the loss after each epoch and print it to monitor progress.
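The training cell is not reproduced above; a sketch that matches this recipe is:
num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)    # forward pass: predictions and loss
        l.backward()               # backward pass: compute gradients
        trainer.step(batch_size)   # parameter update, scaled by the batch size
    l = loss(net(features), labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))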
The model parameters we have learned and the actual model parameters are compared as below. We get
the layer we need from the net and access its weight (weight) and bias (bias). The parameters we
have learned and the actual parameters are very close.
In [10]: w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)
b = net[0].bias.data()
print('Error in estimating b', true_b - b)
Error in estimating w
[[ 0.00034404 -0.00010157]]
<NDArray 1x2 @cpu(0)>
Error in estimating b
[0.00047684]
<NDArray 1 @cpu(0)>
Summary
Exercises
In the last two sections, we worked through implementations of linear regression, building everything from
scratch and then again using Gluon to automate the most repetitive work.
Regression is the hammer we reach for when we want to answer how much? or how many? questions. If
you want to predict the number of dollars (the price) at which a house will be sold, or the number of wins
a baseball team might have, or the number of days that a patient will remain hospitalized before being
discharged, then you’re probably looking for a regression model.
In practice, we’re more often interested in classification: asking not how much but which one.
• Does this email belong in the spam folder or the inbox?
• Is this customer more likely to sign up or not to sign up for a subscription service?
• Does this image depict a donkey, a dog, a cat, or a rooster?
• Which movie is a user most likely to watch next?
Colloquially, we use the word classification to describe two subtly different problems: (i) those where we
are interested only in hard assignments of examples to categories, and (ii) those where we wish to make
soft assignments, i.e., to assess the probability that each category applies. One reason why the distinction
between these tasks gets blurred is because most often, even when we only care about hard assignments,
we still use models that make soft assignments.
To get our feet wet, let’s start off with a somewhat contrived image classification problem. Here, each in-
put will be a grayscale 2-by-2 image. We can represent each pixel location as a single scalar, representing
each image with four features x1 , x2 , x3 , x4 . Further, let’s assume that each image belongs to one among
the categories cat, chicken and dog.
First, we have to choose how to represent the labels. We have two obvious choices. Perhaps the most
natural impulse would be to choose y ∈ {1, 2, 3}, where the integers represent {dog, cat, chicken}
respectively. This is a great way of storing such information on a computer. If the categories had some
natural ordering among them, say if we were trying to predict {baby, child, adolescent, adult}, then it
might even make sense to cast this problem as a regression and keep the labels in this format.
General classification problems, however, do not come with a natural ordering among the categories. In such cases
it is far more common to represent the label with a one-hot encoding: a vector with one component per category,
in which the component corresponding to the instance's category is 1 and all other components are 0.
In our case, y would be a three-dimensional vector, with (1, 0, 0) corresponding to cat, (0, 1, 0) to chicken
and (0, 0, 1) to dog.
Network Architecture
In order to estimate multiple classes, we need a model with multiple outputs, one per category. This is
one of the main differences between classification and regression models. To address classification with
linear models, we will need as many linear functions as we have outputs. Each output will correspond to
its own linear function. In our case, since we have 4 features and 3 possible output categories, we will
need 12 scalars to represent the weights (w with subscripts) and 3 scalars to represent the biases (b with
subscripts). We compute these three outputs, o1 , o2 , and o3 , for each input:
$$\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}$$
We can depict this calculation with the neural network diagram below. Just as in linear regression, softmax
regression is also a single-layer neural network. And since the calculation of each output, o1 , o2 , and o3 ,
depends on all inputs, x1 , x2 , x3 , and x4 , the output layer of softmax regression can also be described as a
fully connected layer.
Softmax Operation
To express the model more compactly, we can use linear algebra notation. In vector form, we arrive at
o = Wx + b, a form better suited both for mathematics and for writing code. Note that we have gathered
all of our weights into a 3 × 4 matrix W and all of the biases into the vector b. To interpret the outputs as
probabilities, we must guarantee that they are nonnegative and sum up to 1. The softmax function does
exactly this:
$$\hat{y} = \mathrm{softmax}(o) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$
It is easy to see ŷ1 + ŷ2 + ŷ3 = 1 with 0 ≤ ŷi ≤ 1 for all i. Thus, ŷ is a proper probability distribution
and the values of o now assume an easily quantifiable meaning. Note that we can still find the most likely
class simply by taking the argmax, since the softmax does not change the ordering of its arguments:
$$\operatorname*{argmax}_i \hat{y}_i = \operatorname*{argmax}_i o_i.$$
In short, the softmax operation preserves the ordering of its inputs, and thus does not alter the predicted
category relative to simply taking the argmax of o. However, it gives the outputs o proper meaning: they are the
pre-softmax values determining the probabilities assigned to each category. Summarizing it all in vector
notation we get o(i) = Wx(i) + b where ŷ(i) = softmax(o(i) ).
Again, to improve computational efficiency and take advantage of GPUs, we will typically carry out
vector calculations for mini-batches of data. Assume that we are given a mini-batch X of n examples with
feature dimensionality d, and that we have c output categories, so that $X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times c}$ and $b \in \mathbb{R}^{1 \times c}$. Then:
O = XW + b
Ŷ = softmax(O)
This accelerates the dominant operation into a matrix-matrix product XW, compared with the matrix-vector products
we would be executing if we processed one example at a time. The softmax itself can be computed by
exponentiating all entries in O and then normalizing each row by its sum.
Now that we have some mechanism for outputting probabilities, we need to transform this into a measure
of how accurate things are, i.e. we need a loss function. For this, we use the same concept that we already
encountered in linear regression, namely likelihood maximization.
Log-Likelihood
The softmax function maps o into a vector of probabilities corresponding to various outcomes, such as
p(y = cat|x). This allows us to compare the estimates with reality, simply by checking how well it
predicted what we observe.
$$p(Y|X) = \prod_{i=1}^{n} p(y^{(i)}|x^{(i)}) \quad \text{and thus} \quad -\log p(Y|X) = \sum_{i=1}^{n} -\log p(y^{(i)}|x^{(i)})$$
Maximizing p(Y |X) (and thus equivalently minimizing − log p(Y |X)) corresponds to predicting the labels well. This
yields the loss function (we dropped the superscript (i) to avoid notation clutter):
$$l = -\log p(y|x) = -\sum_{j} y_j \log \hat{y}_j$$
Here we used that by construction ŷ = softmax(o) and moreover, that the vector y consists of all zeroes
but for the correct label, such as (1, 0, 0). Hence the sum over all coordinates j vanishes for all but
one term. Since all ŷj are probabilities, their logarithm is never larger than 0. Consequently, the loss
function is minimized if we correctly predict y with certainty, i.e. if p(y|x) = 1 for the correct label.
Since the softmax and the corresponding loss are so common, it is worthwhile understanding a bit better
how they are computed. Plugging o into the definition of the loss l and using the definition of the softmax we
obtain
$$l = -\sum_j y_j \log \hat{y}_j = \sum_j y_j \log \sum_k \exp(o_k) - \sum_j y_j o_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j,$$
where the last step uses the fact that the entries of y sum to 1.
To understand a bit better what is going on, consider the derivative with respect to o. We get
$$\partial_{o_j} l = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(o)_j - y_j = p(y = j|x) - y_j$$
In other words, the gradient is the difference between the probability assigned to the true class by our
model, as expressed by the probability p(y|x), and what actually happened, as expressed by y. In this
sense, it is very similar to what we saw in regression, where the gradient was the difference between the
observation y and estimate ŷ. This is no coincidence. In any exponential family model, the gradients of
the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.
Cross-Entropy Loss
Now consider the case where we don’t just observe a single outcome but maybe, an entire distribution
over outcomes. We can use the same representation as before for y. The only difference is that rather
than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say
(0.1, 0.2, 0.7). The math that we used previously to define the loss l still works out fine, just that the
interpretation is slightly more general. It is the expected value of the loss for a distribution over labels.
$$l(y, \hat{y}) = -\sum_{j} y_j \log \hat{y}_j$$
This loss is called the cross-entropy loss and it is one of the most commonly used losses for multiclass
classification. To demystify its name we need some information theory. The following section can be
skipped if needed.
Information theory deals with the problem of encoding, decoding, transmitting and manipulating infor-
mation (aka data), ideally in as concise a form as possible.
Entropy
A key concept is how many bits of information (or randomness) are contained in data. It can be measured
as the entropy of a distribution p via
$$H[p] = -\sum_{j} p(j) \log p(j)$$
One way of measuring the difference between two distributions arises directly from the entropy. Since
H[p] is the minimum number of bits that we need to encode data drawn from p, we could ask how well
the data is encoded if we pick the 'wrong' distribution q. The number of extra bits that we need when encoding
data from p with a code optimized for q gives us some idea of how different these two distributions are. Let us
compute this directly - recall that to encode j using an optimal code for q would cost − log q(j) nats, and we
need to use this in p(j) of all cases. Hence we have
$$D(p\|q) = -\sum_{j} p(j) \log q(j) - H[p] = \sum_{j} p(j) \log \frac{p(j)}{q(j)}$$
Note that minimizing D(p∥q) with respect to q is equivalent to minimizing the cross-entropy loss. This
can be seen directly by dropping H[p], which doesn't depend on q. We thus showed that softmax regres-
sion tries to minimize the surprise (and thus the number of bits) we experience when seeing the true
label y rather than our prediction ŷ.
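To make the relationship between entropy, cross-entropy, and D(p∥q) concrete, here is a tiny numerical sketch with made-up distributions:
import numpy as np

p = np.array([0.1, 0.2, 0.7])          # 'true' distribution
q = np.array([1 / 3, 1 / 3, 1 / 3])    # model distribution
entropy = -np.sum(p * np.log(p))           # H[p]
cross_entropy = -np.sum(p * np.log(q))     # expected code length under q
kl = cross_entropy - entropy               # D(p||q), always nonnegative
print(entropy, cross_entropy, kl)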
After training the softmax regression model, given any example features, we can predict the probability of
each output category. Normally, we use the category with the highest predicted probability as the output
category. The prediction is correct if it is consistent with the actual category (label). In the next part of the
experiment, we will use accuracy to evaluate the model’s performance. This is equal to the ratio between
the number of correct predictions and the total number of predictions.
Summary
• We introduced the softmax operation, which takes a vector and maps it into probabilities.
• Softmax regression applies to classification problems. It uses the softmax operation to produce a probability
distribution over the output categories.
• Cross entropy is a good measure of the difference between two probability distributions. It mea-
sures the number of bits needed to encode the data given our model.
Exercises
1. Show that the Kullback-Leibler divergence D(p∥q) is nonnegative for all distributions p and q.
Hint - use Jensen’s inequality, i.e. use the fact that − log x is a convex function.
2. Show that $\log \sum_j \exp(o_j)$ is a convex function in o.
3. We can explore the connection between exponential families and the softmax in some more depth
• Compute the second derivative of the cross entropy loss l(y, ŷ) for the softmax.
• Compute the variance of the distribution given by softmax(o) and show that it matches the
second derivative computed above.
4. Assume that we have three classes which occur with equal probability, i.e. the probability vector is
$(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
• What is the problem if we try to design a binary code for it? Can we match the entropy lower
bound on the number of bits?
• Can you design a better code. Hint - what happens if we try to encode two independent
observations? What if we encode n observations jointly?
5. Softmax is a misnomer for the mapping introduced above (but everyone in deep learning uses it).
The real softmax is defined as RealSoftMax(a, b) = log(exp(a) + exp(b)).
• Prove that RealSoftMax(a, b) > max(a, b).
• Prove that this holds for λ−1 RealSoftMax(λa, λb), provided that λ > 0.
• Show that for λ → ∞ we have λ−1 RealSoftMax(λa, λb) → max(a, b).
• What does the soft-min look like?
• Extend this to more than two numbers.
Before we implement softmax regression ourselves, let’s pick a real dataset to work with. To make
things visually compelling, we will pick an image classification dataset. The most commonly used image
classification dataset is MNIST, a collection of handwritten digits; because it is too simple to pose a real
challenge to modern models, we will instead work with the similar but somewhat harder Fashion-MNIST dataset.
In [1]: %matplotlib inline
import d2l
from mxnet.gluon import data as gdata
import sys
import time
Conveniently, Gluon’s data package provides easy access to a number of benchmark datasets for testing
our models. The first time we invoke gdata.vision.FashionMNIST(train=True) to collect the
training data, Gluon will automatically retrieve the dataset via our Internet connection. Subsequently,
Gluon will use the already-downloaded local copy. We specify whether we are requesting the training set
or the test set by setting the value of the parameter train to True or False, respectively. Recall that
we will only be using the training data for training, holding out the test set for a final evaluation of our
model.
In [2]: mnist_train = gdata.vision.FashionMNIST(train=True)
mnist_test = gdata.vision.FashionMNIST(train=False)
The number of images for each category in the training set and the testing set is 6,000 and 1,000, re-
spectively. Since there are 10 categories, the numbers of examples in the training set and the test set are
60,000 and 10,000, respectively.
In [3]: len(mnist_train), len(mnist_test)
Out[3]: (60000, 10000)
We can access any example by indexing into the dataset using square brackets []. In the following code,
we access the image and label corresponding to the first example.
In [4]: feature, label = mnist_train[0]
Our example, stored here in the variable feature corresponds to an image with a height and width of
28 pixels. Each pixel is an 8-bit unsigned integer (uint8) with values between 0 and 255. It is stored in a
3D NDArray. Its last dimension is the number of channels. Since the data set is a grayscale image, the
number of channels is 1. When we encounter color images, we'll have 3 channels for red, green, and
blue. To keep things simple, we will record the shape of the image with the height and width of h and w
pixels, respectively, as h × w or (h, w).
In [5]: feature.shape, feature.dtype
The label of each image is represented as a scalar in NumPy. Its type is a 32-bit integer.
In [6]: label, type(label), [Link]
Out[6]: (2, numpy.int32, dtype('int32'))
There are 10 categories in Fashion-MNIST: t-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker,
bag and ankle boot. The following function can convert a numeric label into a corresponding text label.
In [7]: # This function has been saved in the d2l package for future use
def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]
The following defines a function that can draw multiple images and corresponding labels in a single line.
In [8]: # This function has been saved in the d2l package for future use
def show_fashion_mnist(images, labels):
    d2l.use_svg_display()
    # Here _ means that we ignore (not use) variables
    _, figs = d2l.plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.reshape((28, 28)).asnumpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
Next, let’s take a look at the image contents and text labels for the first nine examples in the training data
set.
In [9]: X, y = mnist_train[0:9]
show_fashion_mnist(X, get_fashion_mnist_labels(y))
To make our life easier when reading from the training and test sets we use a DataLoader rather than
creating one from scratch, as we did in the section on Linear Regression Implementation Starting from
Scratch. Recall that a data loader reads a mini-batch of data of size batch_size
each time.
In practice, reading data can often be a significant performance bottleneck for training, especially when
the model is simple or when the computer is fast. A handy feature of Gluon’s DataLoader is the ability
to use multiple processes to speed up data reading (not currently supported on Windows). For instance,
we can set aside 4 processes to read the data (via num_workers).
In [10]: batch_size = 256
transformer = gdata.vision.transforms.ToTensor()
# Use 4 worker processes to read the data, except on Windows where
# multi-process data loading is not supported
num_workers = 0 if sys.platform.startswith('win') else 4

train_iter = gdata.DataLoader(mnist_train.transform_first(transformer),
                              batch_size, shuffle=True,
                              num_workers=num_workers)
test_iter = gdata.DataLoader(mnist_test.transform_first(transformer),
                             batch_size, shuffle=False,
                             num_workers=num_workers)
The logic that we will use to obtain and read the Fashion-MNIST data set is encapsulated in the
d2l.load_data_fashion_mnist function, which we will use in later chapters. This function will return
two variables, train_iter and test_iter. As the content of this book continues to deepen, we will
further improve this function. Its full implementation will be described in the section Deep Convolutional
Neural Networks (AlexNet).
Let’s look at the time it takes to read the training data.
In [11]: start = time.time()
for X, y in train_iter:
    continue
'%.2f sec' % (time.time() - start)
Out[11]: '1.31 sec'
Summary
• Fashion-MNIST is an apparel classification data set containing 10 categories, which we will use to
test the performance of different algorithms in later chapters.
• We store the shape of image using height and width of h and w pixels, respectively, as h × w or
(h, w).
• Data iterators are a key component for efficient performance. Use existing ones if available.
Exercises
Just as we implemented linear regression from scratch, we believe that multiclass logistic (softmax) re-
gression is similarly fundamental and you ought to know the gory details of how to implement it from
scratch. As with linear regression, after doing things by hand we will breeze through an implementation
in Gluon for comparison. To begin, let’s import our packages (only autograd, nd are needed here
because we will be doing the heavy lifting ourselves.)
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, nd
We will work with the Fashion-MNIST dataset just introduced, cuing up an iterator with batch size 256.
In [2]: batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
Just as in linear regression, we represent each example as a vector. Since each example is a 28 × 28
image, we can flatten each example, treating them as 784 dimensional vectors. In the future, we’ll talk
about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat
each pixel location as just another feature.
Recall that in softmax regression, we have as many outputs as there are categories. Because our dataset has 10 categories, our network will have an output dimension of 10. Consequently, our weights will constitute a 784 × 10 matrix and the biases a 1 × 10 vector.
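The cell that allocates these parameters does not appear in this excerpt. A minimal sketch, assuming the usual initialization with small random weights and zero biases, would be:
In [3]: num_inputs = 784
        num_outputs = 10

        # Weights: one column per output category; biases: one entry per category
        W = nd.random.normal(scale=0.01, shape=(num_inputs, num_outputs))
        b = nd.zeros(num_outputs)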
Recall that we need to attach gradients to the model parameters. More literally, we are allocating memory
for future gradients to be stored and notifying MXNet that we want gradients to be calculated with respect
to these parameters in the first place.
In [4]: W.attach_grad()
b.attach_grad()
Before implementing the softmax regression model, let’s briefly review how operators such as sum work
along specific dimensions in an NDArray. Given a matrix X we can sum over all elements (default) or
only over elements in the same column (axis=0) or the same row (axis=1). Note that if X is an array
with shape (2, 3)
and we sum over the columns (X.sum(axis=0)), the result will be a (1D) vector with shape (3,). If
we want to keep the number of axes in the original array (resulting in a 2D array with shape (1,3)),
rather than collapsing out the dimension that we summed over we can specify keepdims=True when
invoking sum.
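The corresponding cells are not reproduced here; a minimal sketch of the behavior just described (the matrix values are arbitrary) is:
In [5]: X = nd.array([[1, 2, 3], [4, 5, 6]])
        # Shapes of the results: (3,), (2,), and (2, 1) respectively
        X.sum(axis=0), X.sum(axis=1), X.sum(axis=1, keepdims=True)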
We are now ready to implement the softmax function. Recall that softmax consists of three steps: First, we exponentiate each term (using exp). Then, we sum over each row (we have one row per example in the batch) to get the normalization constant for each example. Finally, we divide each row by its normalization constant, ensuring that the result sums to 1. Before looking at the code, let's recall what this looks like expressed as an equation:

softmax(X)_ij = exp(X_ij) / Σ_k exp(X_ik)
The denominator, or normalization constant, is also sometimes called the partition function (and its logarithm the log-partition function). The origins of that name are in statistical physics, where a related equation models the distribution over an ensemble of particles.
As you can see, for any random input, we turn each element into a non-negative number. Moreover, each
row sums up to 1, as is required for a probability. Note that while this looks correct mathematically, we were a bit sloppy in our implementation: we failed to take precautions against numerical overflow or underflow due to large (or very small) elements of the matrix, as we did in Naive Bayes.
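The definition of softmax itself is missing from the excerpt. The cell below is a minimal sketch following the three steps just described; as noted, it deliberately takes no precautions against overflow or underflow.
In [6]: def softmax(X):
            X_exp = X.exp()
            # One normalization constant per row, i.e. per example in the batch
            partition = X_exp.sum(axis=1, keepdims=True)
            return X_exp / partition  # Broadcasting divides each row by its constant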
In [7]: X = nd.random.normal(shape=(2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(axis=1)
Out[7]: (
[[0.21324193 0.33961776 0.1239742 0.27106097 0.05210521]
[0.11462264 0.3461234 0.19401033 0.29583326 0.04941036]]
<NDArray 2x5 @cpu(0)>,
[1.0000001 1. ]
<NDArray 2 @cpu(0)>)
Now that we have defined the softmax operation, we can implement the softmax regression model. The
below code defines the forward pass through the network. Note that we flatten each original image in
the batch into a vector with length num_inputs with the reshape function before passing the data
through our model.
In [8]: def net(X):
            return softmax(nd.dot(X.reshape((-1, num_inputs)), W) + b)
Next, we need to implement the cross entropy loss function, introduced in the last section. This may be
the most common loss function in all of deep learning because, at the moment, classification problems
far outnumber regression problems.
Recall that cross entropy takes the negative log likelihood of the predicted probability assigned to the true
label − log p(y|x). Rather than iterating over the predictions with a Python for loop (which tends to be
inefficient), we can use the pick function which allows us to select the appropriate terms from the matrix
of softmax entries easily. Below, we illustrate the pick function on a toy example, with 3 categories and
2 examples.
In [9]: y_hat = nd.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
        y = nd.array([0, 2], dtype='int32')
        nd.pick(y_hat, y)
Out[9]:
[0.1 0.5]
<NDArray 2 @cpu(0)>
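The loss definition itself is not shown in the excerpt. Given the pick-based selection above, a minimal sketch (the name cross_entropy is assumed to be what the training call expects later) is:
In [10]: def cross_entropy(y_hat, y):
             return -nd.pick(y_hat, y).log()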
Given the predicted probability distribution y_hat, we typically choose the class with highest predicted
probability whenever we must output a hard prediction. Indeed, many applications require that we make
a choice. Gmail must categorize an email into Primary, Social, Updates, or Forums. It might estimate
probabilities internally, but at the end of the day it has to choose one among the categories.
When predictions are consistent with the actual category y, they are correct. The classification accuracy
is the fraction of all predictions that are correct. Although we cannot optimize accuracy directly (it is
not differentiable), it’s often the performance metric that we care most about, and we will nearly always
report it when training classifiers.
To compute accuracy we do the following: First, we execute y_hat.argmax(axis=1) to gather the predicted classes (given by the indices of the largest entries in each row). The result has the same shape
as the variable y. Now we just need to check how frequently the two match. Since the equality operator
== is datatype-sensitive (e.g. an int and a float32 are never equal), we also need to convert both to
the same type (we pick float32). The result is an NDArray containing entries of 0 (false) and 1 (true).
Taking the mean yields the desired result.
In [11]: def accuracy(y_hat, y):
             return (y_hat.argmax(axis=1) == y.astype('float32')).mean().asscalar()
We will continue to use the variables y_hat and y defined in the pick function, as the predicted proba-
bility distribution and label, respectively. We can see that the first example’s prediction category is 2 (the
largest element of the row is 0.6 with an index of 2), which is inconsistent with the actual label, 0. The
second example’s prediction category is 2 (the largest element of the row is 0.5 with an index of 2), which
is consistent with the actual label, 2. Therefore, the classification accuracy rate for these two examples is
0.5.
In [12]: accuracy(y_hat, y)
Out[12]: 0.5
Similarly, we can evaluate the accuracy for model net on the data set (accessed via data_iter).
In [13]: # The function will be gradually improved: the complete implementation will be
# discussed in the "Image Augmentation" section
         def evaluate_accuracy(data_iter, net):
             acc_sum, n = 0.0, 0
             for X, y in data_iter:
                 y = y.astype('float32')
                 acc_sum += (net(X).argmax(axis=1) == y).sum().asscalar()
                 n += y.size
             return acc_sum / n
Because we initialized the net model with random weights, the accuracy of this model should be close
to random guessing, i.e. 0.1 for 10 classes.
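We can check this claim directly; since the weights are random, the exact value will vary from run to run, but it should land near 0.1.
In [14]: evaluate_accuracy(test_iter, net)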
The training loop for softmax regression should look strikingly familiar if you read through our imple-
mentation of linear regression earlier in this chapter. Again, we use the mini-batch stochastic gradient
descent to optimize the loss function of the model. Note that the number of epochs (num_epochs), and
learning rate (lr) are both adjustable hyper-parameters. By changing their values, we may be able to
increase the classification accuracy of the model. In practice we’ll want to split our data three ways into
training, validation, and test data, using the validation data to choose the best values of our hyperparam-
eters.
In [15]: num_epochs, lr = 5, 0.1

         # This function has been saved in the d2l package for future use
         def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
                       params=None, lr=None, trainer=None):
             for epoch in range(num_epochs):
                 train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
                 for X, y in train_iter:
                     with autograd.record():
                         y_hat = net(X)
                         l = loss(y_hat, y).sum()
                     l.backward()
                     if trainer is None:
                         d2l.sgd(params, lr, batch_size)
                     else:
                         # This will be illustrated in the next section
                         trainer.step(batch_size)
                     y = y.astype('float32')
                     train_l_sum += l.asscalar()
                     train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()
                     n += y.size
                 test_acc = evaluate_accuracy(test_iter, net)
                 print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
                       % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))
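The call that actually launches training is not shown above. Assuming the cross_entropy loss sketched earlier and the parameters W and b, it would look like this (as part of the same cell or the next one):
         train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs,
                   batch_size, [W, b], lr)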
Now that training is complete, our model is ready to classify some images. Given a series of images, we
will compare their actual labels (first line of text output) and the model predictions (second line of text
output).
In [16]: for X, y in test_iter:
break
         true_labels = d2l.get_fashion_mnist_labels(y.asnumpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1).asnumpy())
titles = [truelabel + '\n' + predlabel
for truelabel, predlabel in zip(true_labels, pred_labels)]
d2l.show_fashion_mnist(X[0:9], titles[0:9])
Summary
With softmax regression, we can train models for multi-category classification. The training loop is very
similar to that in linear regression: retrieve and read data, define models and loss functions, then train
models using optimization algorithms. As you’ll soon find out, most common deep learning models have
similar training procedures.
Exercises
1. In this section, we directly implemented the softmax function based on the mathematical defini-
tion of the softmax operation. What problems might this cause (hint - try to calculate the size of
exp(50))?
2. The function cross_entropy in this section is implemented according to the definition of the
cross-entropy loss function. What could be the problem with this implementation (hint - consider
the domain of the logarithm)?
3. What solutions you can think of to fix the two problems above?
4. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis?
5. Assume that we want to use softmax regression to predict the next word based on some features.
What are some problems that might arise from a large vocabulary?
Just as Gluon made it much easier to implement linear regression, we’ll find it similarly (or possibly
more) convenient for implementing classification models. Again, we begin with our import ritual.
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import gluon, init
        from mxnet.gluon import loss as gloss, nn
Let’s stick with the Fashion-MNIST dataset and keep the batch size at 256 as in the last section.
In [2]: batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
As mentioned previously, the output layer of softmax regression is a fully connected (Dense) layer.
Therefore, to implement our model, we just need to add one Dense layer with 10 outputs to our
Sequential. Again, here, the Sequential isn’t really necessary, but we might as well form the
habit since it will be ubiquitous when implementing deep models. Again, we initialize the weights at
random with zero mean and standard deviation 0.01.
In [3]: net = nn.Sequential()
        net.add(nn.Dense(10))
        net.initialize(init.Normal(sigma=0.01))
In the previous example, we calculated our model's output and then ran this output through the cross-entropy loss. At its heart it uses -nd.pick(y_hat, y).log(). Mathematically, that's a perfectly
reasonable thing to do. However, computationally, things can get hairy when dealing with exponentiation
due to numerical stability issues, a matter we've already discussed a few times (e.g. when covering Naive Bayes and in the problem set of the previous chapter). Recall that the softmax function calculates
ŷ_j = exp(z_j) / Σ_{i=1..n} exp(z_i),

where z_j is the j-th element of the input to the softmax (the j-th logit). If some of the z_j are very large, exp(z_j) can overflow. Combining the softmax with the logarithm taken by the cross-entropy loss avoids this, since algebraically

log ŷ_j = z_j − log(Σ_{i=1..n} exp(z_i))
We’ll want to keep the conventional softmax function handy in case we ever want to evaluate the prob-
abilities output by our model. But instead of passing softmax probabilities into our new loss function,
we’ll just pass ŷ and compute the softmax and its log all at once inside the softmax_cross_entropy loss
function, which does smart things like the log-sum-exp trick (see on Wikipedia).
In [4]: loss = gloss.SoftmaxCrossEntropyLoss()
We use mini-batch stochastic gradient descent with a learning rate of 0.1 as the optimization algorithm.
Note that this is the same choice as for linear regression and it illustrates the general applicability of the
optimizers.
In [5]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
Next, we use the training functions defined in the last section to train a model.
In [6]: num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
None, trainer)
epoch 1, loss 0.7894, train acc 0.746, test acc 0.805
epoch 2, loss 0.5727, train acc 0.812, test acc 0.817
epoch 3, loss 0.5300, train acc 0.823, test acc 0.825
epoch 4, loss 0.5045, train acc 0.830, test acc 0.834
epoch 5, loss 0.4895, train acc 0.834, test acc 0.839
Just as before, this algorithm converges to a solution with a test accuracy just shy of 84%, albeit this time with far fewer lines of code than before. Note that in many cases, Gluon takes specific precautions in
addition to the most well-known tricks for ensuring numerical stability. This saves us from many common
pitfalls that might befall us if we were to code all of our models from scratch.
Exercises
1. Try adjusting the hyper-parameters, such as the batch size, the number of epochs, and the learning rate, to see what the results are.
2. Why might the test accuracy decrease again after a while? How could we fix this?
Multilayer Perceptrons
In this chapter, we will introduce your first truly deep networks. The simplest deep networks are called
multilayer perceptrons, and they consist of many layers of neurons each fully connected to those in the
layer below (from which they receive input) and those above (which they, in turn, influence). When
we train high-capacity models we run the risk of overfitting. Thus, we will need to provide you with your first rigorous introduction to the notions of overfitting, underfitting, and capacity control. To help you combat
these problems, we will introduce regularization techniques such as dropout and weight decay. We will
also discuss issues relating to numerical stability and parameter initialization that are key to successfully
training deep networks. Throughout, we focus on applying models to real data, aiming to give the reader a
firm grasp not just of the concepts but also of the practice of using deep networks. We punt matters relating
to the computational performance, scalability and efficiency of our models to subsequent chapters.
In the previous chapters, we showed how you could implement multiclass logistic regression (also called
softmax regression) for classifying images of clothing into the 10 possible categories. To get there, we
had to learn how to wrangle data, coerce our outputs into a valid probability distribution (via softmax),
how to apply an appropriate loss function, and how to optimize over our parameters. Now that we’ve
covered these preliminaries, we are free to focus our attention on the more exciting enterprise of designing
powerful models using deep neural networks.
133
4.1.1 Hidden Layers
Recall that for linear regression and softmax regression, we mapped our inputs directly to our outputs via
a single linear transformation:
ô = softmax(Wx + b)
If our labels really were related to our input data by an approximately linear function, then this approach
would be perfect. But linearity is a strong assumption. Linearity implies that for whatever target value we
are trying to predict, increasing the value of each of our inputs should either drive the value of the output
up or drive it down, irrespective of the value of the other inputs.
Sometimes this makes sense! Say we are trying to predict whether an individual will or will not repay a
loan. We might reasonably imagine that all else being equal, an applicant with a higher income would be
more likely to repay than one with a lower income. In these cases, linear models might perform well, and
they might even be hard to beat.
But what about classifying images in FashionMNIST? Should increasing the intensity of the pixel at lo-
cation (13,17) always increase the likelihood that the image depicts a pocketbook? That seems ridiculous
because we all know that you cannot make sense out of an image without accounting for the interactions
among pixels.
As another case, consider trying to classify images based on whether they depict cats or dogs given
black-and-white images.
If we use a linear model, we'd basically be saying that for each pixel, increasing its value (making it more white) must always increase the probability that the image depicts a dog or must always increase the probability that the image depicts a cat, irrespective of the values of the surrounding pixels. That is an absurd assumption for telling cats and dogs apart.
Teasing out what is depicted in an image generally requires allowing more complex relationships between
our inputs and outputs. Thus we need models capable of discovering patterns that might be characterized
by interactions among the many features. We can over come these limitations of linear models and handle
a more general class of functions by incorporating one or more hidden layers. The easiest way to do this
is to stack many layers of neurons on top of each other. Each layer feeds into the layer above it, until we
generate an output. This architecture is commonly called a multilayer perceptron, often abbreviated as MLP. The neural network diagram for an MLP looks like this:
Fig. 4.2: Multilayer perceptron with hidden layers. This example contains a hidden layer with 5 hidden
units in it.
The multilayer perceptron above has 4 inputs and 3 outputs, and the hidden layer in the middle contains 5
hidden units. Since the input layer does not involve any calculations, building this network would consist
of implementing 2 layers of computation. The inputs are fully connected to the neurons in the hidden layer. Likewise, the neurons in the hidden layer are fully connected to the neurons in the output layer.
We can write out the calculations that define this one-hidden-layer MLP in mathematical notation as
follows:
h = W1 x + b1
o = W2 h + b2
ŷ = softmax(o)
In order to get a benefit from multilayer architectures, we need another key ingredient: a nonlinearity σ to be applied to each of the hidden units after each layer's linear transformation. The most popular choice for the nonlinearity these days is the rectified linear unit (ReLU), max(x, 0). After incorporating these non-linearities, it becomes impossible to merge the layers into a single linear transformation.
h = σ(W1 x + b1 )
o = W2 h + b2
ŷ = softmax(o)
Clearly, we could continue stacking such hidden layers, e.g. h1 = σ(W1 x + b1 ) and h2 = σ(W2 h1 +
b2 ) on top of each other to obtain a true multilayer perceptron.
Multilayer perceptrons can account for complex interactions in the inputs because the hidden neurons
depend on the values of each of the inputs. It's easy to design a hidden node that does arbitrary computation, such as, for instance, logical operations on its inputs. Moreover, for certain choices of
the activation function it’s widely known that multilayer perceptrons are universal approximators. That
means that even for a single-hidden-layer neural network, with enough nodes, and the right set of weights,
we can model any function at all! Actually learning that function is the hard part.
Moreover, just because a single-layer network can learn any function doesn't mean that you should try to solve all of your problems with single-layer networks. It turns out that we can approximate many
functions much more compactly if we use deeper (vs wider) neural networks. We’ll get more into the
math in a subsequent chapter, but for now let’s actually build an MLP. In this example, we’ll implement
a multilayer perceptron with two hidden layers and one output layer.
As before, by the matrix X, we denote a mini-batch of inputs. The calculations to produce outputs from
an MLP with two hidden layers can thus be expressed:
H1 = σ(W1 X + b1 )
H2 = σ(W2 H1 + b2 )
O = softmax(W3 H2 + b3 )
Because they are so fundamental to deep learning, before going further, let's take a brief look at some common activation functions.
ReLU Function
As stated above, the most popular choice, due to its simplicity of implementation and its efficacy in
training is the rectified linear unit (ReLU). ReLUs provide a very simple nonlinear transformation. Given
the element z, the function is defined as the maximum of that element and 0.
It can be understood that the ReLU function retains only positive elements and discards negative elements
(setting those nodes to 0). To get a better idea of what it looks like, we can plot it. For convenience, we
define a plotting function xyplot to take care of the gruntwork.
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, nd
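The xyplot helper promised above is not reproduced in this excerpt. A minimal sketch consistent with how it is called below, reusing the d2l plotting utilities already seen in this chapter, is:
        # This function has been saved in the d2l package for future use
        def xyplot(x_vals, y_vals, name):
            d2l.set_figsize(figsize=(5, 2.5))
            d2l.plt.plot(x_vals.asnumpy(), y_vals.asnumpy())
            d2l.plt.xlabel('x')
            d2l.plt.ylabel(name + '(x)')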
Because it is used so commonly, NDArray supports the relu function as a basic native operator. As you can see below, the activation function is piecewise linear.
In [2]: x = nd.arange(-8.0, 8.0, 0.1)
        x.attach_grad()
        with autograd.record():
            y = x.relu()
        xyplot(x, y, 'relu')
Note that there are many variants to the ReLU function, such as the parameterized ReLU (pReLU) of He et al., 2015. This variation adds a linear term to the ReLU, so some information still gets through even when the argument is negative.
The reason for using the ReLU is that its derivatives are particularly well behaved - either they vanish or
they just let the argument through. This makes optimization better behaved and it reduces the issue of the
vanishing gradient problem (more on this later).
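We can see this directly by plotting the derivative, reusing the x and y from the cell above (the value of the derivative exactly at 0 is a matter of convention):
In [3]: y.backward()
        xyplot(x, x.grad, 'grad of relu')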
Sigmoid Function
The sigmoid function transforms its inputs, which take values in ℝ, to the interval (0, 1). For that reason, the sigmoid is often called a squashing function: it squashes any input in the range (−∞, ∞) to some value in the range (0, 1):

sigmoid(x) = 1 / (1 + exp(−x)).
In the earliest neural networks, scientists were interested in modeling biological neurons which either fire
or don’t fire. Thus the pioneers of this field, going all the way back to McCulloch and Pitts in the 1940s,
were focused on thresholding units. A thresholding function takes value 0 (if the input is below the threshold) or value 1 (if the input exceeds the threshold).
When attention shifted to gradient based learning, the sigmoid function was a natural choice because
it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still common as acti-
vation functions on the output units, when we want to interpret the outputs as probabilities for binary
classification problems (you can think of the sigmoid as a special case of the softmax) but the sigmoid
has mostly been replaced by the simpler and easier to train ReLU for most use in hidden layers. In the
Recurrent Neural Network chapter, we will describe how sigmoid units can be used to control the flow of
information in a neural network thanks to its capacity to transform the value range between 0 and 1.
See the sigmoid function plotted below. When the input is close to 0, the sigmoid function approaches a
linear transformation.
In [4]: with autograd.record():
            y = x.sigmoid()
        xyplot(x, y, 'sigmoid')
The derivative of the sigmoid function is

d/dx sigmoid(x) = exp(−x) / (1 + exp(−x))² = sigmoid(x)(1 − sigmoid(x)).
This derivative is plotted below. Note that when the input is 0, the derivative of
the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either direction, the
derivative approaches 0.
In [5]: y.backward()
        xyplot(x, x.grad, 'grad of sigmoid')
Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, transforming them into elements on the interval between −1 and 1:

tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x)).
We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a linear
transformation. Although the shape of the function is similar to the sigmoid function, the tanh function
exhibits point symmetry about the origin of the coordinate system.
In [6]: with autograd.record():
            y = x.tanh()
        xyplot(x, y, 'tanh')
Summary
• The multilayer perceptron adds one or multiple fully-connected hidden layers between the output
and input layers and transforms the output of the hidden layer via an activation function.
• Commonly-used activation functions include the ReLU function, the sigmoid function, and the tanh
function.
Exercises
1. Compute the derivative of the tanh and the pReLU activation function.
2. Show that a multilayer perceptron using only ReLU (or pReLU) constructs a continuous piecewise
linear function.
3. Show that tanh(x) + 1 = 2sigmoid(2x).
4. Assume we have a multilayer perceptron without nonlinearities between the layers. In particular,
assume that we have d input dimensions, d output dimensions and that one of the layers had only
d/2 dimensions. Show that this network is less expressive (powerful) than a single layer perceptron.
Now that we know how multilayer perceptrons (MLPs) work in theory, let’s implement them. First, we
import the required packages.
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import nd
        from mxnet.gluon import loss as gloss
To compare against the results we previously achieved with vanilla softmax regression, we continue to
use the Fashion-MNIST image classification dataset.
In [2]: batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
Recall that this dataset contains 10 classes and that each image consists of a 28 × 28 = 784 grid of pixel
values. Since we'll be discarding the spatial structure (for now), we can just think of this as a classification dataset with 784 input features and 10 classes. In particular we will implement our MLP with one hidden
layer and 256 hidden units. Note that we can regard both of these choices as hyperparameters that could
be set based on performance on validation data. Typically, we’ll choose layer widths as powers of 2 to
make everything align nicely in memory.
Again, we will allocate several NDArrays to represent our parameters. Note that we now have one weight
matrix and one bias vector per layer. As always, we must call attach_grad to allocate memory for
the gradients with respect to these parameters.
In [3]: num_inputs, num_outputs, num_hiddens = 784, 10, 256
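The rest of this cell, which allocates the weight matrices and bias vectors and attaches gradients, is not shown in the excerpt. A sketch consistent with the net function and the train_ch3 call below (in particular, it must define params) is:
        W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens))
        b1 = nd.zeros(num_hiddens)
        W2 = nd.random.normal(scale=0.01, shape=(num_hiddens, num_outputs))
        b2 = nd.zeros(num_outputs)
        params = [W1, b1, W2, b2]

        # Allocate memory for the gradient of every parameter
        for param in params:
            param.attach_grad()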
To make sure we know how everything works, we will use the maximum function to implement ReLU ourselves, instead of invoking nd.relu directly.
In [4]: def relu(X):
            return nd.maximum(X, 0)
As in softmax regression, we will reshape each 2D image into a flat vector of length num_inputs. Finally, we can implement our model with just a few lines of code.
In [5]: def net(X):
            X = X.reshape((-1, num_inputs))
            H = relu(nd.dot(X, W1) + b1)
            return nd.dot(H, W2) + b2
For better numerical stability and because we already know how to implement softmax regres-
sion completely from scratch, we will use Gluon’s integrated function for calculating the softmax
and cross-entropy loss. Recall that we discussed some of these intricacies in the previous sec-
tion. We encourage the interested reader to examine the source code for mxnet.gluon.loss.SoftmaxCrossEntropyLoss for more details.
In [6]: loss = gloss.SoftmaxCrossEntropyLoss()
4.2.5 Training
Steps for training the MLP are no different than for softmax regression. In the d2l package, we directly
call the train_ch3 function, whose implementation was introduced here. We set the number of epochs
to 10 and the learning rate to 0.5.
In [7]: num_epochs, lr = 10, 0.5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
params, lr)
To see how well we did, let's apply the model to some test data. If you're interested, compare the result to that of the corresponding linear model.
In [8]: for X, y in test_iter:
break
        true_labels = d2l.get_fashion_mnist_labels(y.asnumpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1).asnumpy())
titles = [truelabel + '\n' + predlabel
for truelabel, predlabel in zip(true_labels, pred_labels)]
d2l.show_fashion_mnist(X[0:9], titles[0:9])
This looks a bit better than our previous result, a good sign that we’re on the right path.
Summary
We saw that implementing a simple MLP is easy, even when done manually. That said, with a large
number of layers, this can get messy (e.g. naming and keeping track of the model parameters, etc).
Exercises
1. Change the value of the hyper-parameter num_hiddens in order to see how this hyperparameter
influences your results.
2. Try adding a new hidden layer to see how it affects the results.
3. How does changing the learning rate change the result?
4. What is the best result you can get by optimizing over all the parameters (learning rate, iterations,
number of hidden layers, number of hidden units per layer)?
Now that we have learned how multilayer perceptrons (MLPs) work in theory, let's implement them, this time using Gluon. We begin, as always, by importing modules.
In [1]: import sys
        sys.path.insert(0, '..')
import d2l
from mxnet import gluon, init
        from mxnet.gluon import loss as gloss, nn
The only difference from our softmax regression implementation is that we add two Dense
(fully-connected) layers instead of one.
The first is our hidden layer, which has 256 hidden units and uses the ReLU activation function.
Note that, as above, we can invoke net.add() multiple times in succession, but we can also invoke it a single time, passing in multiple layers to be added to the network. Thus, we could have equivalently written net.add(nn.Dense(256, activation='relu'), nn.Dense(10)). Again, note that as always, Gluon automatically infers the missing input dimensions to each layer.
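The cell that actually builds the network is missing here. Based on the description above (and assuming the same Normal(sigma=0.01) weight initialization used for softmax regression), it would read:
In [2]: net = nn.Sequential()
        net.add(nn.Dense(256, activation='relu'),
                nn.Dense(10))
        net.initialize(init.Normal(sigma=0.01))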
Training the model follows the exact same steps as in our softmax regression implementation.
In [3]: batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        loss = gloss.SoftmaxCrossEntropyLoss()
        trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None,
None, trainer)
Exercises
1. Try adding a few more hidden layers to see how the result changes.
2. Try out different activation functions. Which ones work best?
3. Try out different initializations of the weights.
As machine learning scientists, our goal is to discover general patterns. Say, for example, that we wish to
learn the pattern that associates genetic markers with the development of dementia in adulthood. It’s easy
enough to memorize our training set. Each person’s genes uniquely identify them, not just among people
represented in our dataset, but among all people on earth!
Given the genetic markers representing some person, we don't want our model to simply recognize oh,
that’s Bob, and then output the classification, say among {dementia, mild cognitive impairment, healthy},
that corresponds to Bob. Rather, our goal is to discover patterns that capture regularities in the underlying
population from which our training set was drawn. If we are successful in this endeavor, then we could successfully assess risk even for individuals that we have never encountered before. This problem, how to discover patterns that generalize, is the fundamental problem of machine learning.
The danger is that when we train models, we access just a small sample of data. The largest public image
datasets contain roughly one million images, and more often we have to learn from only thousands or tens of thousands of examples. In a large hospital system, we might access hundreds of thousands of medical records. With such finite samples, we run the risk of discovering apparent associations that turn out not to hold up when we collect more data.
In order to discuss this phenomenon more formally, we need to differentiate between training error and
generalization error. The training error is the error of our model as calculated on the training data set,
while generalization error is the expectation of our model’s error were we to apply it to an infinite stream
of additional data points drawn from the same underlying data distribution as our original sample.
Problematically, we can never calculate the generalization error exactly, because that infinite stream of additional data is an imaginary object. In practice, we must estimate the generalization error by
applying our model to an independent test set constituted of a random selection of data points that were
withheld from our training set.
The following three thought experiments will help illustrate this situation better. Consider a college student trying to prepare for her final exam. A diligent student will strive to practice well and test her abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that she will excel when it matters. For instance, the student might try to prepare by rote learning the answers to the exam questions. This requires the student to memorize many things. She might even remember the answers to past exam questions perfectly, yet still falter when the real exam asks questions she has not seen before.
Since generalization is the fundamental problem in machine learning, you might not be surprised to
learn that many mathematicians and theorists have dedicated their lives to developing formal theories
to describe this phenomenon. In their eponymous theorem, Glivenko and Cantelli derived the rate at
which the training error converges to the generalization error. In a series of seminal papers, Vapnik and
Chervonenkis extended this theory to more general classes of functions. This work laid the foundations
of Statistical Learning Theory.
In the standard supervised learning setting, which we have addressed up until now and will stick with throughout most of this book, we assume that both the training data and the test data are drawn independently from identical distributions (commonly called the i.i.d. assumption). This means that whatever process samples our data has no memory: the 2nd example drawn and the 3rd drawn are no more correlated than the 2nd and the 2-millionth sample drawn.
Being a good machine learning scientist requires thinking critically, and already you should be poking
holes in this assumption, coming up with common cases where the assumption fails. What if we train a
mortality risk predictor on data collected from patients at UCSF, and apply it on patients at Massachusetts
General Hospital? These distributions are simply not identical. Moreover, draws might be correlated in time. What if we are classifying the topics of Tweets? The news cycle would create temporal dependencies in the topics being discussed, violating any assumptions of independence.
Sometimes we can get away with minor violations of the i.i.d. assumption and our models will continue to work remarkably well. After all, nearly every real-world application involves at least some minor violation of the i.i.d. assumption.
Model Complexity
When we have simple models and abundant data, we expect the generalization error to resemble the
training error. When we work with more complex models and fewer examples, we expect the training
error to go down but the generalization gap to grow. What precisely constitutes model complexity is a
complex matter. Many factors govern whether a model will generalize well. For example a model with
more parameters might be considered more complex. A model whose parameters can take a wider range
of values might be more complex. Often with neural networks, we think of a model that takes more
training steps as more complex, and one subject to early stopping as less complex.
It can be difficult to compare the complexity among members of substantially different model classes
(say a decision tree versus a neural network). For now, a simple rule of thumb is quite useful: A model
that can readily explain arbitrary facts is what statisticians view as complex, whereas one that has only
a limited expressive power but still manages to explain the data well is probably closer to the truth. In
philosophy, this is closely related to Popper’s criterion of falsifiability of a scientific theory: a theory is
good if it fits data and if there are specific tests which can be used to disprove it. This is important since
all statistical estimation is post hoc, i.e. we estimate after we observe the facts, hence vulnerable to the
associated fallacy. For now, we’ll put the philosophy aside and stick to more tangible issues.
In this chapter, to give you some intuition, we’ll focus on a few factors that tend to influence the general-
izability of a model class:
1. The number of tunable parameters. When the number of tunable parameters, sometimes called the
degrees of freedom, is large, models tend to be more susceptible to overfitting.
2. The values taken by the parameters. When weights can take a wider range of values, models can be more susceptible to overfitting.
In machine learning, we usually select our final model after evaluating several candidate models.
This process is called model selection. Sometimes the models subject to comparison are fundamentally
different in nature (say, decision trees vs linear models). At other times, we are comparing members of
the same class of models that have been trained with different hyperparameter settings.
With multilayer perceptrons for example, we may wish to compare models with different numbers of
hidden layers, different numbers of hidden units, and various choices of the activation functions applied
to each hidden layer.
In order to determine the best among our candidate models, we will typically employ a validation set.
In principle we should not touch our test set until after we have chosen all our hyper-parameters.
Were we to use the test data in the model selection process, there’s a risk that we might overfit the test
data. Then we would be in serious trouble. If we overfit our training data, there's always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know?
Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on the
training data for model selection either because
we cannot estimate the generalization error on the very data that we use to train the model.
The common practice to address this problem is to split our data three ways, incorporating a validation
set in addition to the training and test sets.
In practical applications, the picture gets muddier. While ideally we would only touch the test data once,
to assess the very best model or to compare a small number of models to each other, real-world test data
is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments.
The result is a murky practice where the boundaries between validation and test data are worryingly
ambiguous.
Unless explicitly stated otherwise, in the experiments in this book we are really working with what
should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy
reported in each experiment is really the validation accuracy and not a true test set accuracy. The good news is that we do not need very much data in the validation set for this purpose.
K-Fold Cross-Validation
When training data is scarce, we might not even be able to afford to hold out enough data to constitute a proper validation set. One popular solution to this problem is to employ K-fold cross-validation.
Here, the original training data is split into K non-overlapping subsets. Then model training and valida-
tion are executed K times, each time training on K − 1 subsets and validating on a different subset (the
one not used for training in that round). Finally, the training and validation error rates are estimated by
averaging over the results from the K experiments.
When we compare the training and validation errors, we want to be mindful of two common situations:
First, we want to watch out for cases when our training error and validation error are both substantial but there is little gap between them. If the model is unable to reduce the training error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the pattern that we are trying to model. Moreover, since the generalization gap between our training and validation errors is small, we have reason to believe that we could get away with a more complex model. This phenomenon is known as underfitting.
On the other hand, as we discussed above, we want to watch out for cases when our training error is significantly lower than our validation error, indicating severe overfitting.
Note that overfitting is not always a bad thing. With deep learning especially, it’s well known that the
best predictive models often perform far better on training data than on holdout data. Ultimately, we
usually care more about the validation error than about the gap between the training and validation
errors.
Whether we overfit or underfit can depend both on the complexity of our model and the size of the available training dataset, two topics that we discuss below.
Model Complexity
To illustrate some classical intuition about overfitting and model complexity, we give an example using
polynomials. Given training data consisting of a single feature x and a corresponding real-valued label y,
we try to find the polynomial of degree d
ŷ = Σ_{i=0..d} x^i w_i
A higher-order polynomial function is more complex than a lower order polynomial function, since the
higher-order polynomial has more parameters and the model function’s selection range is wider.
Fixing the training data set, higher-order polynomial functions should always achieve lower (at worst,
equal) training error relative to lower degree polynomials. In fact, whenever the data points each have a
distinct value of x, a polynomial function with degree equal to the number of data points can fit the
training set perfectly. We visualize the relationship between polynomial degree and under- vs
over-fitting below.
The other big consideration to bear in mind is dataset size. Fixing our model, the fewer samples we have in the training dataset, the more likely (and more severely) we are to encounter overfitting. As we increase the amount of training data, the generalization error typically decreases. Moreover, in general,
more data never hurts. For a fixed task and data distribution, there is typically a relationship between
model complexity and dataset size. Given more data, we might profitably attempt to fit a more complex
model. Absent sufficient data, simpler models may be difficult to beat. For many tasks, deep learning only
outperforms linear models when many thousands of training examples are available. In part, the current success of deep learning owes to the abundance of massive datasets that has accompanied the broad digitization of the economy.
We can now explore these concepts interactively by fitting polynomials to data. To get started we’ll
import our usual packages.
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, gluon, nd
        from mxnet.gluon import data as gdata, loss as gloss, nn
First we need data. Given x, we will use the following cubic polynomial to generate the labels on training
and test data:
y = 5 + 1.2x − 3.4·x²/2! + 5.6·x³/3! + ε where ε ∼ N(0, 0.1)
The noise term ϵ obeys a normal distribution with a mean of 0 and a standard deviation of 0.1.
We’ll synthesize 100 samples each for the training set and test set.
In [2]: maxdegree = 20 # Maximum degree of the polynomial
n_train, n_test = 100, 100 # Training and test data set sizes
        true_w = nd.zeros(maxdegree)  # Allocate lots of empty space
        true_w[0:4] = nd.array([5, 1.2, -3.4, 5.6])
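The remainder of this cell, which generates features, poly_features, and labels, is not shown. A sketch consistent with the rescaling discussed next (each monomial x^i is divided by i!, computed via the Gamma function) is:
        features = nd.random.normal(shape=(n_train + n_test, 1))
        poly_features = nd.power(features, nd.arange(maxdegree).reshape((1, -1)))
        poly_features = poly_features / (
            nd.gamma(nd.arange(maxdegree) + 1).reshape((1, -1)))
        # labels has shape (n_train + n_test,)
        labels = nd.dot(poly_features, true_w)
        labels += nd.random.normal(scale=0.1, shape=labels.shape)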
For optimization, we typically want to avoid very large values of gradients, losses, etc. This is why the monomials stored in poly_features are rescaled from x^i to x^i/i!. It allows us to avoid very large values for large exponents i. Factorials are implemented in Gluon using the Gamma function, where n! = Γ(n + 1).
Take a look at the first 2 samples from the generated data set. The value 1 is technically a feature, namely
the constant feature corresponding to the bias.
In [3]: features[:2], poly_features[:2], labels[:2]
We first define the plotting function semilogy, where the y-axis uses the logarithmic scale.
In [4]: # This function has been saved in the d2l package for future use
        def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
                     legend=None, figsize=(3.5, 2.5)):
            d2l.set_figsize(figsize)
            d2l.plt.xlabel(x_label)
            d2l.plt.ylabel(y_label)
            d2l.plt.semilogy(x_vals, y_vals)
            if x2_vals and y2_vals:
                d2l.plt.semilogy(x2_vals, y2_vals, linestyle=':')
                d2l.plt.legend(legend)
Since we will be attempting to fit the generated dataset using models of varying complexity, we insert the
model definition into the fit_and_plot function. The training and testing steps involved in polyno-
mial function fitting are similar to those previously described in softmax regression.
In [5]: num_epochs, loss = 200, gloss.L2Loss()
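The fit_and_plot function itself is not reproduced in this excerpt. The sketch below matches how it is called later: it trains a single Dense layer in Gluon on the supplied polynomial features, plots train and test loss with semilogy, and prints the learned weights. The batch size and learning rate here are assumptions.
        def fit_and_plot(train_features, test_features, train_labels,
                         test_labels):
            net = nn.Sequential()
            net.add(nn.Dense(1))
            net.initialize()
            batch_size = min(10, train_labels.shape[0])
            train_iter = gdata.DataLoader(gdata.ArrayDataset(
                train_features, train_labels), batch_size, shuffle=True)
            trainer = gluon.Trainer(net.collect_params(), 'sgd',
                                    {'learning_rate': 0.01})
            train_ls, test_ls = [], []
            for _ in range(num_epochs):
                for X, y in train_iter:
                    with autograd.record():
                        l = loss(net(X), y)
                    l.backward()
                    trainer.step(batch_size)
                train_ls.append(loss(net(train_features),
                                     train_labels).mean().asscalar())
                test_ls.append(loss(net(test_features),
                                    test_labels).mean().asscalar())
            print('final epoch: train loss', train_ls[-1],
                  'test loss', test_ls[-1])
            semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                     range(1, num_epochs + 1), test_ls, ['train', 'test'])
            print('weight:', net[0].weight.data().asnumpy())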
We will begin by first using a third-order polynomial function, the same order as the data generating function. The results show that this model's training and test losses are both low. The trained model parameters are also close to the true values w = [5, 1.2, −3.4, 5.6].
In [6]: num_epochs = 1000
# Pick the first four dimensions, i.e. 1, x, x^2, x^3 from the polynomial
# features
fit_and_plot(poly_features[:n_train, 0:4], poly_features[n_train:, 0:4],
labels[:n_train], labels[n_train:])
final epoch: train loss 0.003981923 test loss 0.003887663
weight: [[ 5.016072 1.1759472 -3.4187133 5.642139 ]]
Let's take another look at linear function fitting. After the decline in the early epochs, it becomes difficult to further decrease this model's training error. After the last epoch iteration has been completed, the training error is still high. When used to fit non-linear patterns (like the third-order polynomial function here), linear models are liable to underfit.
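The corresponding cell is not shown; fitting only the first two polynomial dimensions (the constant feature and x) would look like:
In [7]: # Pick the first two dimensions, i.e. 1 and x, from the polynomial features
        fit_and_plot(poly_features[:n_train, 0:2], poly_features[n_train:, 0:2],
                     labels[:n_train], labels[n_train:])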
Now let’s try to train the model using a polynomial of too high degree. Here, there is insufficient data to
learn that the higher-degree coefficients should have values close to zero. As a result, our
overly-complex model is far too susceptible to being influenced by noise in the training data.
Of course, our training error will now be low (even lower than if we had the right model!) but our test
error will be high.
Try out different model complexities (n_degree) and training set sizes (n_subset) to gain some
intuition of what is happening.
In [8]: num_epochs = 1000
n_subset = 100 # Subset of data to train on
n_degree = 20 # Degree of polynomials
fit_and_plot(poly_features[1:n_subset, 0:n_degree],
poly_features[n_train:, 0:n_degree], labels[1:n_subset],
labels[n_train:])
final epoch: train loss 0.0063669207 test loss 0.011351531
weight: [[ 4.9834304 1.3025314 -3.259729 5.038 -0.38282588 1.6101354
0.04275072 0.22787853 -0.03212921 0.03362036 -0.01792445 0.01421338
0.00740858 0.01047226 -0.06528634 0.02145072 0.06565462 0.02129446
-0.02506039 -0.00960142]]
Summary
• Since the generalization error rate cannot be estimated based on the training error rate, simply minimizing the training error rate will not necessarily mean a reduction in the generalization error rate. Machine learning models need to safeguard against overfitting so as to minimize the generalization error.
• A validation set can be used for model selection (provided that it isn’t used too liberally).
• Underfitting means that the model is not able to reduce the training error, while overfitting occurs when the training error is much lower than the error on the held-out data.
• We should choose an appropriately complex model and avoid using insufficient training samples.
Exercises
1. Can you solve the polynomial regression problem exactly? Hint - use linear algebra.
2. Model selection for polynomials
• Plot the training error vs. model complexity (degree of the polynomial). What do you ob-
serve?
• Plot the test error in this case.
• Generate the same graph as a function of the amount of data?
Now that we have characterized the problem of overfitting and motivated the need for capacity control,
we can begin discussing some of the popular techniques used to these ends in practice. Recall that we can always mitigate overfitting by going out and collecting more training data, but that can be costly and time consuming, typically making it impossible in the short run. For now, let's assume that we have already obtained as much high-quality data as our resources permit and focus on techniques aimed at limiting the capacity of the function classes under consideration.
In our toy example, we saw that we could control the complexity of a polynomial by adjusting its degree.
However, most of machine learning does not consist of polynomial curve fitting. And moreover, even
when we focus on polynomial regression, when we deal with high-dimensional data, manipulating model
capacity by tweaking the degree d is problematic. To see why, note that for multivariate data we must
generalize the concept of polynomials to include monomials, which are simply products of powers of
variables. For example, x₁²x₂ and x₃x₅² are both monomials of degree 3. The number of such terms with a given degree d blows up as a function of the degree d.
Concretely, for vectors of dimensionality D, the number of monomials of a given degree d is the binomial coefficient (D−1+d choose D−1).
Hence, a small change in degree, even from say 1 to 2 or 2 to 3 would entail a massive blowup in the
complexity of our model. Thus, tweaking the degree is too blunt a hammer. Instead, we need a more
fine-grained tool for adjusting function complexity.
Weight decay (commonly called L2 regularization), might be the most widely-used technique for regu-
larizing parametric machine learning models. The basic intuition behind weight decay is the notion that
among all functions f , the function f = 0 is the simplest. Intuitively, we can then measure functions by
their proximity to zero. But how precisely should we measure the distance between a function and zero?
There is no single right answer, but for linear functions f(x) = w⊤x it is natural to measure complexity by the norm of the weight vector, e.g. ‖w‖². Recall that our linear regression loss was

l(w, b) = (1/n) Σ_{i=1..n} (1/2)(w⊤x^(i) + b − y^(i))².
Recall that x^(i) are the observations, y^(i) are the labels, and (w, b) are the weight and bias parameters respectively. To arrive at a new loss function that penalizes the size of the weight vector, we need to add ‖w‖², but how much should we add? To address this, we introduce a new hyperparameter, which we will call the regularization constant and denote by λ:
l(w, b) + (λ/2)‖w‖²
This non-negative parameter λ ≥ 0 governs the amount of regularization. For λ = 0, we recover our
original loss function, whereas for λ > 0 we ensure that w cannot grow too large. The astute reader
might wonder why we are squaring the norm of the weight vector. We do this for two reasons. First, we
do it for computational convenience. By squaring the L2 norm, we remove the square root, leaving the
sum of squares of each component of the weight vector. This is convenient because it is easy to compute
derivatives of a sum of terms (the sum of derivatives equals the derivative of the sum).
Moreover, you might ask, why the L2 norm in the first place and not the L1 norm or some other distance function? In fact, several other choices are valid and are popular throughout statistics. While L2-regularized linear models constitute the classic ridge regression algorithm, L1-regularized linear regression is a similarly fundamental model in statistics, popularly known as lasso regression.
One mathematical reason for working with the L2 norm and not some other norm, is that it penalizes
large components of the weight vector much more than it penalizes small ones. This encourages our
learning algorithm to discover models which distribute their weight across a larger number of features,
which might make them more robust in practice since they do not depend precariously on a single feature.
The stochastic gradient descent updates for ℓ2-regularized regression are as follows:

w ← (1 − ηλ/|B|) w − (η/|B|) Σ_{i∈B} x^(i) (w⊤x^(i) + b − y^(i)),
As before, we update w based on the amount by which our estimate differs from the observation. However, we also shrink the size of w towards 0. That's why the method is sometimes called weight decay: the penalty term literally causes our optimization algorithm to decay the magnitude of the weights at each step of training.
For high-dimensional regression it is difficult to pick the 'right' dimensions to omit. Weight-decay regularization is a much more convenient alternative. We will illustrate this below. First, we will generate some synthetic data as before:
y = 0.05 + Σ_{i=1..d} 0.01·x_i + ε where ε ∼ N(0, 0.01)
representing our label as a linear function of our inputs, corrupted by Gaussian noise with zero mean
and variance 0.01. To observe the effects of overfitting more easily, we can make our problem high-
dimensional, setting the data dimension to d = 200 and working with a relatively small number of training examples; here we'll set the sample size to 20:
In [1]: import sys
        sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, nd
        from mxnet.gluon import data as gdata, loss as gloss, nn
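The data-generating cell is not included in this excerpt. A sketch consistent with the setup just described (d = 200 input dimensions, 20 training and 100 test examples, Gaussian noise with scale 0.01) is:
        n_train, n_test, num_inputs = 20, 100, 200
        true_w, true_b = nd.ones((num_inputs, 1)) * 0.01, 0.05

        features = nd.random.normal(shape=(n_train + n_test, num_inputs))
        labels = nd.dot(features, true_w) + true_b
        labels += nd.random.normal(scale=0.01, shape=labels.shape)
        train_features, test_features = features[:n_train, :], features[n_train:, :]
        train_labels, test_labels = labels[:n_train], labels[n_train:]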
Next, we will show how to implement weight decay from scratch. All we have to do here is to add the squared ℓ2 penalty as an additional loss term to the original objective function. The squared norm penalty derives its name from the fact that we are adding the second power, Σ_i w_i². The ℓ2 norm is just one among an infinite class of norms called p-norms, many of which you might encounter in the future. In general, for some number p, the ℓp norm is defined as

‖w‖_p^p := Σ_{i=1..d} |w_i|^p
First, we’ll define a function to randomly initialize our model parameters and run attach_grad on
each to allocate memory for the gradients we will calculate.
In [2]: def init_params():
            w = nd.random.normal(scale=1, shape=(num_inputs, 1))
            b = nd.zeros(shape=(1,))
w.attach_grad()
b.attach_grad()
return [w, b]
Perhaps the most convenient way to implement this penalty is to square all terms in place and sum them up. We divide by 2 by convention (when we take the derivative of a quadratic function, the 2 and 1/2 cancel out, ensuring that the expression for the update looks nice and simple).
In [3]: def l2_penalty(w):
return (w**2).sum() / 2
The following code defines how to train and test the model separately on the training data set and the test
data set. Unlike the previous sections, here, the ℓ2 norm penalty term is added when calculating the final
loss function. The linear network and the squared loss haven’t changed since the previous chapter, so
we’ll just import them via [Link] and d2l.squared_loss to reduce clutter.
In [4]: batch_size, num_epochs, lr = 1, 100, 0.003
        net, loss = d2l.linreg, d2l.squared_loss
        train_iter = gdata.DataLoader(gdata.ArrayDataset(
            train_features, train_labels), batch_size, shuffle=True)
        def fit_and_plot(lambd):
            w, b = init_params()
            train_ls, test_ls = [], []
            for _ in range(num_epochs):
                for X, y in train_iter:
                    with autograd.record():
                        # The L2 norm penalty term has been added
                        l = loss(net(X, w, b), y) + lambd * l2_penalty(w)
                    l.backward()
                    d2l.sgd([w, b], lr, batch_size)
                train_ls.append(loss(net(train_features, w, b),
                                     train_labels).mean().asscalar())
                test_ls.append(loss(net(test_features, w, b),
                                    test_labels).mean().asscalar())
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                         range(1, num_epochs + 1), test_ls, ['train', 'test'])
            print('l2 norm of w:', w.norm().asscalar())
Next, let’s train and test the high-dimensional linear regression model. When lambd = 0 we do not
use weight decay. As a result, while the training error decreases, the test error does not. This is a perfect
example of overfitting.
In [5]: fit_and_plot(lambd=0)
l2 norm of w: 11.611942
The example below shows that even though the training error increased, the error on the test set decreased.
This is precisely the improvement that we expect from using weight decay. While not perfect, overfitting
has been mitigated to some extent. In addition, the ℓ2 norm of the weight w is smaller than without using
weight decay.
In [6]: fit_and_plot(lambd=3)
Because weight decay is ubiquitous in neural network optimization, Gluon makes it especially convenient,
integrating weight decay into the optimization algorithm itself for easy use in combination with any loss
function. Moreover, this integration serves a computational benefit, allowing implementation tricks to add weight decay to the algorithm without any additional computational overhead: the weight decay portion of the update depends only on the current value of each parameter, which the optimizer must touch once anyway.
In the following code, we specify the weight decay hyper-parameter directly through the wd parameter
when instantiating our Trainer. By default, Gluon decays both weights and biases simultaneously.
Note that we can have different optimizers for different sets of parameters. For instance, we can have one
Trainer with weight decay for the weights w and another without weight decay to take care of the bias
b.
In [7]: def fit_and_plot_gluon(wd):
            net = nn.Sequential()
            net.add(nn.Dense(1))
            net.initialize(init.Normal(sigma=1))
            # The weight parameter is decayed. Weight names generally end with
            # "weight"
            trainer_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd',
                                      {'learning_rate': lr, 'wd': wd})
            # The bias parameter is not decayed. Bias names generally end with
            # "bias"
            trainer_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd',
                                      {'learning_rate': lr})
            train_ls, test_ls = [], []
            for _ in range(num_epochs):
                for X, y in train_iter:
                    with autograd.record():
                        l = loss(net(X), y)
                    l.backward()
                    # The rest of the loop mirrors fit_and_plot above, except
                    # that the two Trainer instances update the weights and the
                    # bias separately
                    trainer_w.step(batch_size)
                    trainer_b.step(batch_size)
                train_ls.append(loss(net(train_features),
                                     train_labels).mean().asscalar())
                test_ls.append(loss(net(test_features),
                                    test_labels).mean().asscalar())
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                         range(1, num_epochs + 1), test_ls, ['train', 'test'])
            print('L2 norm of w:', net[0].weight.data().norm().asscalar())
The plots look just the same as when we implemented weight decay from scratch, but this version runs a
bit faster and is easier to implement, a benefit that will become more pronounced for large problems.
In [8]: fit_and_plot_gluon(0)
L2 norm of w: 13.311795
In [9]: fit_and_plot_gluon(3)
So far, we only touched upon one notion of what constitutes a simple linear function. For nonlinear
functions, what constitutes simplicity can be a far more complex question. For instance, there exist
Reproducing Kernel Hilbert Spaces (RKHS) which allow one to use many of the tools introduced for
linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms do not always scale well
to massive amounts of data. For the purposes of this book, we limit ourselves to simply summing over the
weights for different layers, e.g. via ∑_l ∥w_l∥², which is equivalent to weight decay applied to all layers.
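As a small sketch of what such a multi-layer penalty might look like in code (l2_penalty_all is our own hypothetical helper, not part of the d2l package):

def l2_penalty_all(params):
    # Sum of squared l2 norms over all weight tensors, divided by 2 as before
    return sum((w ** 2).sum() for w in params) / 2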
Summary
• Regularization is a common method for dealing with overfitting. It adds a penalty term to the loss
function on the training set to reduce the complexity of the learned model.
• One particular choice for keeping the model simple is weight decay using an ℓ2 penalty. This leads
to weight decay in the update steps of the learning algorithm.
• Gluon provides automatic weight decay functionality in the optimizer by setting the hyperparameter
wd.
• You can have different optimizers within the same training loop, e.g. for different sets of parame-
ters.
Exercises
1. Experiment with the value of λ in the estimation problem on this page. Plot training and test
accuracy as a function of λ. What do you observe?
2. Use a validation set to find the optimal value of λ. Is it really the optimal value? Does this matter?
4.6 Dropout
Just now, we introduced the classical approach of regularizing statistical models by penalizing the ℓ2
norm of the weights. In probabilistic terms, we could justify this technique by arguing that we have
assumed a prior belief that weights take values from a Gaussian distribution with mean 0. More intuitively,
we might argue that we encouraged the model to spread out its weights across many features rather than
depending too much on a small number of potentially spurious associations.
Given many more features than examples, linear models can overfit. But when there are many more ex-
amples than features, we can generally count on linear models not to overfit. Unfortunately, the reliability
with which linear models generalize comes at a cost: Linear models can’t take into account interactions
among features. For every feature, a linear model must assign either a positive or a negative weight. They
lack the flexibility to account for context.
In more formal texts, you’ll see this fundamental tension between generalizability and flexibility dis-
cussed as the bias-variance tradeoff. Linear models have high bias (they can only represent a small class
of functions), but low variance (they give similar results across different random samples of the data).
Deep neural networks take us to the opposite end of the bias-variance spectrum. Neural networks are
so flexible because they aren’t confined to looking at each feature individually. Instead, they can learn
interactions among groups of features.
Let’s think briefly about what we expect from a good statistical model. We want it to do well on unseen
test data. One way we can accomplish this is by asking what constitutes a ‘simple’ model. Simplicity
can come in the form of a small number of dimensions, which is what we did when discussing fitting
a model with monomial basis functions. Simplicity can also come in the form of a small norm for the
basis functions. This led us to weight decay (ℓ2 regularization). Yet a third notion of simplicity that we
can impose is that the function should be robust under small changes in the input. For instance, when we
classify images, we would expect that adding some random noise to the pixels should be mostly harmless.
In 1995, Christopher Bishop formalized a form of this idea when he proved that training with input
noise is equivalent to Tikhonov regularization. In other words, he drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple), as we discussed in the section on
weight decay, and the requirement that it be resilient to perturbations in the input.
Then in 2014, Srivastava et al. (2014) developed a clever idea for how to apply Bishop’s idea to the internal
layers of the network, too. Namely, they proposed to inject noise into each layer of the network before
calculating the subsequent layer during training. They realized that when training deep networks with
many layers, enforcing smoothness just on the input-output mapping misses out on what is happening
internally in the network. Their proposed idea is called dropout, and it is now a standard technique that
is widely used for training neural networks. Throughout training, on each iteration, dropout regularization
consists simply of zeroing out some fraction (typically 50%) of the nodes in each layer before calculating
the subsequent layer.
The key challenge then is how to inject this noise without introducing undue statistical bias. In other
words, we want to perturb the inputs to each layer during training in such a way that the expected value
of the layer is equal to the value it would have taken had we not introduced any noise at all.
In Bishop’s case, when we are adding Gaussian noise to a linear model, this is simple: At each training
iteration, just add noise sampled from a distribution with mean zero, ϵ ∼ N(0, σ²), to the input x, yielding
a perturbed point x′ = x + ϵ. In expectation, E[x′] = x.
In the case of dropout regularization, one can debias each layer by normalizing by the fraction of nodes
that were retained rather than dropped: each intermediate activation h is replaced by a random variable
h′ that equals 0 with probability p and h/(1 − p) otherwise. By design, the expectation remains unchanged,
i.e., E[h′] = h. The name ‘dropout’ arises from the notion that some neurons ‘drop out’ of the computation
for the purpose of computing the final result. During training, we replace intermediate activations with
random variables.
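As a quick empirical sanity check (a small sketch of our own, not from the d2l package), we can verify that rescaling the surviving activations by 1/(1 − p) leaves the mean unchanged:

from mxnet import nd

p = 0.5
h = nd.ones(shape=(1000000,)) * 2.0          # pretend these are activations
mask = nd.random.uniform(0, 1, h.shape) > p  # keep each element with probability 1 - p
h_prime = mask * h / (1 - p)                 # inverted dropout rescaling
print(h.mean(), h_prime.mean())              # both means should be close to 2.0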
Recall the multilayer perceptron with a hidden layer and 5 hidden units. Its architecture is given by
h = σ(W1 x + b1)
o = W2 h + b2
ŷ = softmax(o)
When we apply dropout to the hidden layer, we are essentially removing each hidden unit with probability
p (i.e., setting its output to 0). We can view the result as a network containing only a subset of the
original neurons. In the image below, h2 and h5 are removed. Consequently, the calculation of y no
longer depends on h2 and h5 and their respective gradients also vanish when performing backprop. In
this way, the calculation of the output layer cannot be overly dependent on any one element of h1, . . . , h5.
Intuitively, deep learning researchers often explain the intuition as follows: we do not want the network’s
output to depend too precariously on the exact activation pathway through the network. The original
authors of the dropout technique described their intuition as an effort to prevent the co-adaptation of
feature detectors.
To implement the dropout function for a single layer, we must draw as many samples from a Bernoulli
(binary) random variable as our layer has dimensions, where the random variable takes value 1 (keep)
with probability 1 − p and 0 (drop) with probability p. One easy way to implement this is to first draw
samples from the uniform distribution U[0, 1]. Then we can keep those nodes for which the corresponding
sample is greater than p, dropping the rest.
In the following code, we implement a dropout function that drops out the elements in the NDArray
input X with probability drop_prob, rescaling the remainder as described above (dividing the survivors
by 1.0-drop_prob).
In [1]: import sys
        sys.path.insert(0, '..')

        import d2l
        from mxnet import autograd, gluon, init, nd
        from mxnet.gluon import loss as gloss, nn
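A minimal sketch of such a dropout function, following the recipe just described (drop each element with probability drop_prob and rescale the survivors by 1/(1 − drop_prob)):

def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    # In this corner case, every element is dropped
    if drop_prob == 1:
        return X.zeros_like()
    # Keep an element when its uniform sample exceeds drop_prob
    mask = nd.random.uniform(0, 1, X.shape) > drop_prob
    # Rescale the survivors so that the expectation is unchanged
    return mask * X / (1.0 - drop_prob)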
We can test out the dropout function on a few examples. In the following lines of code, we pass our
input X through the dropout operation, with probabilities 0, 0.5, and 1, respectively.
In [2]: X = nd.arange(16).reshape((2, 8))
        print(dropout(X, 0))
        print(dropout(X, 0.5))
        print(dropout(X, 1))
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>
Again, we can use the Fashion-MNIST dataset, introduced in the section Softmax Regression - Starting
From Scratch. We will define a multilayer perceptron with two hidden layers. The two hidden layers both
have 256 outputs.
In [3]: num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
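The model below refers to weight matrices W1, W2, W3, bias vectors b1, b2, b3, and a parameter list params. A minimal sketch of how these could be initialized (the scale of 0.01 is our own assumption; any small random initialization would do):

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    # Allocate memory for the gradient of each parameter
    param.attach_grad()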
The model defined below chains fully-connected layers and ReLU activations, applying dropout to the
output of each activation function. We can set the dropout probability of each layer
separately. It is generally recommended to set a lower dropout probability closer to the input layer. Below
we set it to 0.2 and 0.5 for the first and second hidden layer respectively. By using the is_training
function described in the Autograd section, we can ensure that dropout is only active during training.
In [4]: drop_prob1, drop_prob2 = 0.2, 0.5

        def net(X):
            X = X.reshape((-1, num_inputs))
            H1 = (nd.dot(X, W1) + b1).relu()
            # Use dropout only when training the model
            if autograd.is_training():
                # Add a dropout layer after the first fully connected layer
                H1 = dropout(H1, drop_prob1)
            H2 = (nd.dot(H1, W2) + b2).relu()
            if autograd.is_training():
                # Add a dropout layer after the second fully connected layer
                H2 = dropout(H2, drop_prob2)
            return nd.dot(H2, W3) + b3
This is similar to the training and testing of multilayer perceptrons described previously.
In [5]: num_epochs, lr, batch_size = 10, 0.5, 256
        loss = gloss.SoftmaxCrossEntropyLoss()
        train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
                      params, lr)
Using Gluon, all we need to do is add a Dropout layer (also in the nn package) after each fully-
connected layer, passing in the dropout probability as the only argument to its constructor. During train-
ing, the Dropout layer will randomly drop out outputs of the previous layer (or equivalently, the inputs
to the subsequent layer) according to the specified dropout probability. When MXNet is not in training
mode (i.e., during testing), the Dropout layer simply passes the data through.
In [6]: net = nn.Sequential()
        net.add(nn.Dense(256, activation="relu"),
                # Add a dropout layer after the first fully connected layer
                nn.Dropout(drop_prob1),
                nn.Dense(256, activation="relu"),
                # Add a dropout layer after the second fully connected layer
                nn.Dropout(drop_prob2),
                nn.Dense(10))
        net.initialize(init.Normal(sigma=0.01))
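To train this Gluon model, one could pair it with a Trainer and reuse the same training utility as in the from-scratch version. The following is only a sketch, under the assumption that d2l.train_ch3 accepts a Gluon Trainer in place of explicit parameters and a learning rate:

# Assumed usage: pass a Trainer instead of params and lr
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              None, None, trainer)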
Summary
• Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet
another tool to avoid overfitting. Often all three are used jointly.
Exercises
1. Try out what happens if you change the dropout probabilities for layers 1 and 2. In particular, what
happens if you switch the ones for both layers?
2. Increase the number of epochs and compare the results obtained when using dropout with those
when not using it.
3. Compute the variance of the activation random variables after applying dropout.
4. Why is dropout typically not used at test time?
5. If changes are made to the model to make it more complex, such as adding hidden layer units, will
the effect of using dropout to cope with overfitting be more obvious?
6. Using the model in this section as an example, compare the effects of using dropout and weight
decay. What if dropout and weight decay are used at the same time?
7. What happens if we apply dropout to the individual weights of the weight matrix rather than the
activations?
8. Replace the dropout activation with a random variable that takes on values of [0, γ/2, γ]. Can you
design something that works better than the binary dropout function? Why might you want to use
it? Why not?
References
[1] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
In the previous sections, we used mini-batch stochastic gradient descent to train our models. When we
implemented the algorithm, we only worried about the calculations involved in forward propagation
through the model. In other words, we implemented the calculations required for the model to generate
output corresponding to some given input, but when it came time to calculate the gradients of each of our
parameters, we invoked the backward function, relying on the autograd module to figure out what
to do.
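As a brief reminder of that workflow, here is a minimal sketch (with a toy function of our own choosing) of recording a computation and invoking backward:

from mxnet import autograd, nd

x = nd.array([1.0, 2.0, 3.0])
x.attach_grad()              # allocate memory for the gradient
with autograd.record():      # record the computation graph
    y = (x ** 2).sum()
y.backward()                 # populate x.grad
print(x.grad)                # gradient of sum(x^2) is 2x -> [2. 4. 6.]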
The automatic calculation of gradients profoundly simplifies the implementation of deep learning al-
gorithms. Before automatic differentiation, even small changes to complicated models would require
recalculating lots of derivatives by hand. Even academic papers would too often have to allocate lots of
page real estate to deriving update rules.
While we plan to continue relying on autograd, and we have already come a long way without ever
discussing how these gradients are calculated efficiently under the hood, it’s important that you know how
updates are actually calculated if you want to go beyond a shallow