Deep Learning Fundamentals and Techniques
Installation xxxiv
Notation xxxvii
1 Introduction 1
1.1 A Motivating Example 2
1.2 Key Components 4
1.3 Kinds of Machine Learning Problems 7
1.4 Roots 20
1.5 The Road to Deep Learning 22
1.6 Success Stories 25
1.7 The Essence of Deep Learning 27
1.8 Summary 29
1.9 Exercises 29
2 Preliminaries 30
2.1 Data Manipulation 30
2.1.1 Getting Started 30
2.1.2 Indexing and Slicing 33
2.1.3 Operations 34
2.1.4 Broadcasting 35
2.1.5 Saving Memory 36
2.1.6 Conversion to Other Python Objects 37
2.1.7 Summary 37
2.1.8 Exercises 38
2.2 Data Preprocessing 38
2.2.1 Reading the Dataset 38
2.2.2 Data Preparation 39
2.2.3 Conversion to the Tensor Format 40
2.2.4 Discussion 40
2.2.5 Exercises 40
2.3 Linear Algebra 41
2.3.1 Scalars 41
2.3.2 Vectors 42
2.3.3 Matrices 43
2.3.4 Tensors 44
2.3.5 Basic Properties of Tensor Arithmetic 45
2.3.6 Reduction 46
2.3.7 Non-Reduction Sum 47
2.3.8 Dot Products 48
2.3.9 Matrix–Vector Products 48
2.3.10 Matrix–Matrix Multiplication 49
2.3.11 Norms 50
2.3.12 Discussion 52
2.3.13 Exercises 53
2.4 Calculus 54
2.4.1 Derivatives and Differentiation 54
2.4.2 Visualization Utilities 56
2.4.3 Partial Derivatives and Gradients 58
2.4.4 Chain Rule 58
2.4.5 Discussion 59
2.4.6 Exercises 59
2.5 Automatic Differentiation 60
2.5.1 A Simple Function 60
2.5.2 Backward for Non-Scalar Variables 61
2.5.3 Detaching Computation 62
2.5.4 Gradients and Python Control Flow 63
2.5.5 Discussion 64
2.5.6 Exercises 64
2.6 Probability and Statistics 65
2.6.1 A Simple Example: Tossing Coins 66
2.6.2 A More Formal Treatment 68
2.6.3 Random Variables 69
2.6.4 Multiple Random Variables 70
2.6.5 An Example 73
2.6.6 Expectations 74
2.6.7 Discussion 76
2.6.8 Exercises 77
2.7 Documentation 78
2.7.1 Functions and Classes in a Module 78
2.7.2 Specific Functions and Classes 79
References 1089
Preface
Just a few years ago, there were no legions of deep learning scientists developing intelli-
gent products and services at major companies and startups. When we entered the field,
machine learning did not command headlines in daily newspapers. Our parents had no idea
what machine learning was, let alone why we might prefer it to a career in medicine or law.
Machine learning was a blue skies academic discipline whose industrial significance was
limited to a narrow set of real-world applications, including speech recognition and com-
puter vision. Moreover, many of these applications required so much domain knowledge
that they were often regarded as entirely separate areas for which machine learning was
one small component. At that time, neural networks—the predecessors of the deep learn-
ing methods that we focus on in this book—were generally regarded as outmoded.
Yet in just a few years, deep learning has taken the world by surprise, driving rapid progress
in such diverse fields as computer vision, natural language processing, automatic speech
recognition, reinforcement learning, and biomedical informatics. Moreover, the success
of deep learning in so many tasks of practical interest has even catalyzed developments in
theoretical machine learning and statistics. With these advances in hand, we can now build
cars that drive themselves with more autonomy than ever before (though less autonomy
than some companies might have you believe), dialogue systems that debug code by asking
clarifying questions, and software agents beating the best human players in the world at
board games such as Go, a feat once thought to be decades away. Already, these tools exert
ever-wider influence on industry and society, changing the way movies are made and diseases
are diagnosed, and playing a growing role in basic sciences, from astrophysics to climate
modeling, weather prediction, and biomedicine.
This book represents our attempt to make deep learning approachable, teaching you the
concepts, the context, and the code.
One Medium Combining Code, Math, and HTML
For any computing technology to reach its full impact, it must be well understood, well
documented, and supported by mature, well-maintained tools. The key ideas should be
clearly distilled, minimizing the onboarding time needed to bring new practitioners up to
date. Mature libraries should automate common tasks, and exemplar code should make
it easy for practitioners to modify, apply, and extend common applications to suit their
needs.
Testing the potential of deep learning presents unique challenges because any single appli-
cation brings together various disciplines. Applying deep learning requires simultaneously
understanding (i) the motivations for casting a problem in a particular way; (ii) the math-
ematical form of a given model; (iii) the optimization algorithms for fitting the models to
data; (iv) the statistical principles that tell us when we should expect our models to general-
ize to unseen data and practical methods for certifying that they have, in fact, generalized;
and (v) the engineering techniques required to train models efficiently, navigating the pit-
falls of numerical computing and getting the most out of available hardware. Teaching the
critical thinking skills required to formulate problems, the mathematics to solve them, and
the software tools to implement those solutions all in one place presents formidable chal-
lenges. Our goal in this book is to present a unified resource to bring would-be practitioners
up to speed.
When we started this book project, there were no resources that simultaneously (i) remained
up to date; (ii) covered the breadth of modern machine learning practices with sufficient
technical depth; and (iii) interleaved exposition of the quality one expects of a textbook
with the clean runnable code that one expects of a hands-on tutorial. We found plenty of
code examples illustrating how to use a given deep learning framework (e.g., how to do
basic numerical computing with matrices in TensorFlow) or for implementing particular
techniques (e.g., code snippets for LeNet, AlexNet, ResNet, etc.) scattered across various
blog posts and GitHub repositories. However, these examples typically focused on how to
implement a given approach, but left out the discussion of why certain algorithmic deci-
sions are made. While some interactive resources have popped up sporadically to address a
particular topic, e.g., the engaging blog posts published on the website Distill, or personal
blogs, they only covered selected topics in deep learning, and often lacked associated code.
On the other hand, while several deep learning textbooks have emerged—e.g., Goodfellow
et al. (2016), which offers a comprehensive survey on the basics of deep learning—these
resources do not marry the descriptions to realizations of the concepts in code, sometimes
leaving readers clueless as to how to implement them. Moreover, too many resources are
hidden behind the paywalls of commercial course providers.
We set out to create a resource that could (i) be freely available for everyone; (ii) offer suffi-
cient technical depth to provide a starting point on the path to actually becoming an applied
machine learning scientist; (iii) include runnable code, showing readers how to solve prob-
lems in practice; (iv) allow for rapid updates, both by us and also by the community at large;
and (v) be complemented by a forum for interactive discussion of technical details and to
answer questions.
These goals were often in conflict. Equations, theorems, and citations are best managed and
laid out in LaTeX. Code is best described in Python. And webpages are native in HTML
and JavaScript. Furthermore, we want the content to be accessible as executable code,
as a physical book, as a downloadable PDF, and on the Internet as a website. No workflows
seemed suited to these demands, so we decided to assemble our own (Section B.6). We
settled on GitHub to share the source and to facilitate community contributions; Jupyter
notebooks for mixing code, equations and text; Sphinx as a rendering engine; and Discourse
as a discussion platform. While our system is not perfect, these choices strike a compromise
among the competing concerns. We believe that Dive into Deep Learning might be the first
book published using such an integrated workflow.
Learning by Doing
Many textbooks present concepts in succession, covering each in exhaustive detail. For
example, the excellent textbook of Bishop (2006) teaches each topic so thoroughly that
getting to the chapter on linear regression requires a nontrivial amount of work. While
experts love this book precisely for its thoroughness, for true beginners, this property limits
its usefulness as an introductory text.
In this book, we teach most concepts just in time. In other words, you will learn concepts
at the very moment that they are needed to accomplish some practical end. While we
take some time at the outset to teach fundamental preliminaries, like linear algebra and
probability, we want you to taste the satisfaction of training your first model before worrying
about more esoteric concepts.
Aside from a few preliminary notebooks that provide a crash course in the basic mathe-
matical background, each subsequent chapter both introduces a reasonable number of new
concepts and provides several self-contained working examples, using real datasets. This
presented an organizational challenge. Some models might logically be grouped together
in a single notebook. And some ideas might be best taught by executing several models
in succession. By contrast, there is a big advantage to adhering to a policy of one working
example, one notebook: This makes it as easy as possible for you to start your own research
projects by leveraging our code. Just copy a notebook and start modifying it.
Throughout, we interleave the runnable code with background material as needed. In gen-
eral, we err on the side of making tools available before explaining them fully (often filling
in the background later). For instance, we might use stochastic gradient descent before
explaining why it is useful or offering some intuition for why it works. This helps to give
practitioners the necessary ammunition to solve problems quickly, at the expense of requir-
ing the reader to trust us with some curatorial decisions.
This book teaches deep learning concepts from scratch. Sometimes, we delve into fine
details about models that would typically be hidden from users by modern deep learning
frameworks. This comes up especially in the basic tutorials, where we want you to un-
derstand everything that happens in a given layer or optimizer. In these cases, we often
present two versions of the example: one where we implement everything from scratch,
relying only on NumPy-like functionality and automatic differentiation, and a more practical
example, where we write succinct code using the high-level APIs of deep learning
frameworks. After explaining how some component works, we rely on the high-level API
in subsequent tutorials.
Fig. 1 Book structure.
Code
Most sections of this book feature executable code. We believe that some intuitions are best
developed via trial and error, tweaking the code in small ways and observing the results.
Ideally, an elegant mathematical theory might tell us precisely how to tweak our code to
achieve a desired result. However, deep learning practitioners today must often tread where
no solid theory provides guidance. Despite our best attempts, formal explanations for the
efficacy of various techniques are still lacking, for a variety of reasons: the mathematics to
characterize these models can be so difficult; the explanation likely depends on properties
of the data that currently lack clear definitions; and serious inquiry on these topics has
only recently kicked into high gear. We are hopeful that as the theory of deep learning
progresses, each future edition of this book will provide insights that eclipse those presently
available.
To avoid unnecessary repetition, we capture some of our most frequently imported and used
functions and classes in the d2l package. Throughout, we mark blocks of code (such as
functions, classes, or collections of import statements) with #@save to indicate that they will
be accessed later via the d2l package. We offer a detailed overview of these classes and
functions in Section B.8. The d2l package is lightweight and only requires the following
dependencies:
#@save
import collections
import hashlib
import inspect
import math
import os
import random
import re
import shutil
import sys
import tarfile
import time
import zipfile
from collections import defaultdict
import pandas as pd
import requests
from IPython import display
from matplotlib import pyplot as plt
from matplotlib_inline import backend_inline
d2l = sys.modules[__name__]
Most of the code in this book is based on PyTorch, a popular open-source framework that
has been enthusiastically embraced by the deep learning research community. All of the
code in this book has passed tests under the latest stable version of PyTorch. However, due
to the rapid development of deep learning, some code in the print edition may not work
properly in future versions of PyTorch. We plan to keep the online version up to date.
In case you encounter any problems, please consult Installation (page xxxiv) to update
your code and runtime environment. The dependencies of our PyTorch implementation
are listed below.
#@save
import numpy as np
import torch
import torchvision
from PIL import Image
from scipy.spatial import distance_matrix
from torch import nn
from torch.nn import functional as F
from torchvision import transforms
Target Audience
This book is for students (undergraduate or graduate), engineers, and researchers who seek
a solid grasp of the practical techniques of deep learning. Because we explain every concept
from scratch, no previous background in deep learning or machine learning is required.
Fully explaining the methods of deep learning requires some mathematics and programming,
but we will only assume that you enter with some basics, including modest amounts
of linear algebra, calculus, probability, and Python programming. Just in case you have
forgotten anything, the online Appendix provides a refresher on most of the mathematics
you will find in this book. Usually, we will prioritize intuition and ideas over mathematical
rigor. If you would like to extend these foundations beyond the prerequisites to understand
our book, we happily recommend some other terrific resources: Linear Analysis by Bollobás
(1999) covers linear algebra and functional analysis in great depth. All of Statistics
(Wasserman, 2013) provides a marvelous introduction to statistics. Joe Blitzstein’s books
and courses on probability and inference are pedagogical gems. And if you have not used
Python before, you may want to peruse this Python tutorial.
Notebooks, Website, GitHub, and Forum
All of our notebooks are available for download on the D2L.ai website and on GitHub.
Associated with this book, we have launched a discussion forum, located at discuss.d2l.ai.
Whenever you have questions on any section of the book, you can find a link to the
associated discussion page at the end of each notebook.
Acknowledgments
We are indebted to the hundreds of contributors for both the English and the Chinese drafts.
They helped improve the content and offered valuable feedback. This book was originally
implemented with MXNet as the primary framework. We thank Anirudh Dagar and Yuan
Tang for adapting most of the earlier MXNet code into PyTorch and TensorFlow implementations,
respectively. Since July 2021, we have redesigned and reimplemented this
book in PyTorch, MXNet, and TensorFlow, choosing PyTorch as the primary framework.
We thank Anirudh Dagar for adapting most of the more recent PyTorch code into
JAX implementations. We thank Gaosheng Wu, Liujun Hu, Ge Zhang, and Jiehang Xie
from Baidu for adapting most of the more recent PyTorch code into PaddlePaddle
implementations in the Chinese draft. We thank Shuai Zhang for integrating the LaTeX
style from the press into the PDF build.
On GitHub, we thank every contributor of this English draft for making it better for ev-
eryone. Their GitHub IDs or names are (in no particular order): alxnorden, avinashingit,
bowen0701, brettkoonce, Chaitanya Prakash Bapat, cryptonaut, Davide Fiocco, edgarro-
man, gkutiel, John Mitro, Liang Pu, Rahul Agarwal, Mohamed Ali Jamaoui, Michael (Stu)
Stewart, Mike Müller, NRauschmayr, Prakhar Srivastav, sad-, sfermigier, Sheng Zha, sun-
deepteki, topecongiro, tpdi, vermicelli, Vishaal Kapoor, Vishwesh Ravi Shrimali, YaYaB,
Yuhong Chen, Evgeniy Smirnov, lgov, Simon Corston-Oliver, Igor Dzreyev, Ha Nguyen,
pmuens, Andrei Lukovenko, senorcinco, vfdev-5, dsweet, Mohammad Mahdi Rahimi, Ab-
hishek Gupta, uwsd, DomKM, Lisa Oakley, Bowen Li, Aarush Ahuja, Prasanth Bud-
dareddygari, brianhendee, mani2106, mtn, lkevinzc, caojilin, Lakshya, Fiete Lüer, Surbhi
Vijayvargeeya, Muhyun Kim, dennismalmgren, adursun, Anirudh Dagar, liqingnz, Pe-
dro Larroy, lgov, ati-ozgur, Jun Wu, Matthias Blume, Lin Yuan, geogunow, Josh Gard-
ner, Maximilian Böther, Rakib Islam, Leonard Lausen, Abhinav Upadhyay, rongruosong,
Steve Sedlmeyer, Ruslan Baratov, Rafael Schlatter, liusy182, Giannis Pappas, ati-ozgur,
qbaza, dchoi77, Adam Gerson, Phuc Le, Mark Atwood, christabella, vn09, Haibin Lin,
jjangga0214, RichyChen, noelo, hansent, Giel Dops, dvincent1337, WhiteD3vil, Peter
Kulits, codypenta, joseppinilla, ahmaurya, karolszk, heytitle, Peter Goetz, rigtorp, Tiep
Vu, sfilip, mlxd, Kale-ab Tessera, Sanjar Adilov, MatteoFerrara, hsneto, Katarzyna Biesial-
ska, Gregory Bruss, Duy–Thanh Doan, paulaurel, graytowne, Duc Pham, sl7423, Jaedong
Hwang, Yida Wang, cys4, clhm, Jean Kaddour, austinmw, trebeljahr, tbaums, Cuong V.
Nguyen, pavelkomarov, vzlamal, NotAnotherSystem, J-Arun-Mani, jancio, eldarkurtic,
the-great-shazbot, doctorcolossus, gducharme, cclauss, Daniel-Mietchen, hoonose, bia-
giom, abhinavsp0730, jonathanhrandall, ysraell, Nodar Okroshiashvili, UgurKap, Jiyang
Kang, StevenJokes, Tomer Kaftan, liweiwp, netyster, ypandya, NishantTharani, heiligerl,
SportsTHU, Hoa Nguyen, manuel-arno-korfmann-webentwicklung, aterzis-personal, nxby,
Xiaoting He, Josiah Yoder, mathresearch, mzz2017, jroberayalas, iluu, ghejc, BSharmi,
vkramdev, simonwardjones, LakshKD, TalNeoran, djliden, Nikhil95, Oren Barkan, guoweis,
haozhu233, pratikhack, Yue Ying, tayfununal, steinsag, charleybeller, Andrew Lumsdaine,
Jiekui Zhang, Deepak Pathak, Florian Donhauser, Tim Gates, Adriaan Tijsseling, Ron
Medina, Gaurav Saha, Murat Semerci, Lei Mao, Levi McClenny, Joshua Broyde, jake221,
jonbally, zyhazwraith, Brian Pulfer, Nick Tomasino, Lefan Zhang, Hongshen Yang, Vin-
ney Cavallo, yuntai, Yuanxiang Zhu, amarazov, pasricha, Ben Greenawald, Shivam Upad-
hyay, Quanshangze Du, Biswajit Sahoo, Parthe Pandit, Ishan Kumar, HomunculusK, Lane
Schwartz, varadgunjal, Jason Wiener, Armin Gholampoor, Shreshtha13, eigen-arnav, Hyeong-
gyu Kim, EmilyOng, Bálint Mucsányi, Chase DuBois, Juntian Tao, Wenxiang Xu, Lifu
Huang, filevich, quake2005, nils-werner, Yiming Li, Marsel Khisamutdinov, Francesco
“Fuma” Fumagalli, Peilin Sun, Vincent Gurgul, qingfengtommy, Janmey Shukla, Mo Shan,
Kaan Sancak, regob, AlexSauer, Gopalakrishna Ramachandra, Tobias Uelwer, Chao Wang,
Tian Cao, Nicolas Corthorn, akash5474, kxxt, zxydi1992, Jacob Britton, Shuangchi He, zh-
mou, krahets, Jie-Han Chen, Atishay Garg, Marcel Flygare, adtygan, Nik Vaessen, bolded,
Louis Schlessinger, Balaji Varatharajan, atgctg, Kaixin Li, Victor Barbaros, Riccardo Musto,
Elizabeth Ho, azimjonn, Guilherme Miotto, Alessandro Finamore, Joji Joseph, Anthony
Biel, Zeming Zhao, shjustinbaek, gab-chen, nantekoto, Yutaro Nishiyama, Oren Amsalem,
Tian-MaoMao, Amin Allahyar, Gijs van Tulder, Mikhail Berkov, iamorphen, Matthew
Caseres, Andrew Walsh, pggPL, RohanKarthikeyan, Ryan Choi, and Likun Lei.
We thank Amazon Web Services, especially Wen-Ming Ye, George Karypis, Swami Siva-
subramanian, Peter DeSantis, Adam Selipsky, and Andrew Jassy for their generous support
in writing this book. Without the available time, resources, discussions with colleagues,
and continuous encouragement, this book would not have happened. During the prepara-
tion of the book for publication, Cambridge University Press has offered excellent support.
We thank our commissioning editor David Tranah for his help and professionalism.
Summary
Deep learning has revolutionized pattern recognition, introducing technology that now
powers a wide range of applications in such diverse fields as computer vision, natural
language processing, and automatic speech recognition. To successfully apply deep learn-
ing, you must understand how to cast a problem, the basic mathematics of modeling, the
algorithms for fitting your models to data, and the engineering techniques to implement it
all. This book presents a comprehensive resource, including prose, figures, mathematics,
and code, all in one place.
Exercises
1. Register an account on the discussion forum of this book, discuss.d2l.ai.
2. Install Python on your computer.
3. Follow the links at the bottom of the section to the forum, where you will be able to
seek out help and discuss the book and find answers to your questions by engaging the
authors and broader community.
Discussions.
Installation
In order to get up and running, we will need an environment for running Python, the Jupyter
Notebook, the relevant libraries, and the code needed to run the book itself.
Installing Miniconda
Your simplest option is to install Miniconda. Note that the Python 3.x version is required.
You can skip the following steps if your machine already has conda installed.
Visit the Miniconda website and determine the appropriate version for your system based
on your Python 3.x version and machine architecture. Suppose that your Python version is
3.9 (our tested version). If you are using macOS, you would download the bash script whose
name contains the strings “MacOSX”, navigate to the download location, and execute the
installation as follows (taking Intel Macs as an example):
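For example (a sketch; replace the placeholder with the exact name of the installer file you downloaded):

sh Miniconda3-<version>-MacOSX-x86_64.sh -b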
A Linux user would download the file whose name contains the strings “Linux” and execute
the following at the download location:
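Again a sketch with a placeholder filename:

sh Miniconda3-<version>-Linux-x86_64.sh -b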
A Windows user would download and install Miniconda by following its online instructions.
On Windows, you may search for cmd to open the Command Prompt (command-line
interpreter) for running commands.
Next, initialize the shell so we can run conda directly:

~/miniconda3/bin/conda init
Then close and reopen your current shell. You should be able to create a new environment
as follows:
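For example, to create an environment named d2l (the name assumed by the activation command later in this section) with our tested Python version:

conda create --name d2l python=3.9 -y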
Before installing any deep learning framework, please first check whether or not you have
proper GPUs on your machine (the GPUs that power the display on a standard laptop are
not relevant for our purposes). For example, if your computer has NVIDIA GPUs and has
installed CUDA, then you are all set. If your machine does not house any GPU, there
is no need to worry just yet. Your CPU provides more than enough horsepower to get you
through the first few chapters. Just remember that you will want to access GPUs before
running larger models.
You can install PyTorch (the specified versions are tested at the time of writing) with either
CPU or GPU support as follows:
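A minimal sketch (unpinned; the exact tested version pins can be found on the book's website):

pip install torch torchvision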
Our next step is to install the d2l package that we developed in order to encapsulate fre-
quently used functions and classes found throughout this book:
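For example (unpinned):

pip install d2l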
Next, you will want to download the notebooks so that you can run each of the book’s
code blocks. Simply click on the “Notebooks” tab at the top of any HTML page on the
D2L.ai website to download the code and then unzip it. Alternatively, you can fetch the
notebooks from the command line as follows:
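A sketch, assuming the notebook archive is hosted at d2l.ai:

mkdir d2l-en && cd d2l-en
curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
unzip d2l-en.zip && rm d2l-en.zip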
If you do not already have unzip installed, first run sudo apt-get install unzip. Now
we can start the Jupyter Notebook server by running:
jupyter notebook
At this point, you can open http://localhost:8888 (it may have already opened automatically)
in your web browser. Then we can run the code for each section of the book. Whenever
you open a new command line window, you will need to execute conda activate d2l
to activate the runtime environment before running the D2L notebooks, or updating your
packages (either the deep learning framework or the d2l package). To exit the environment,
run conda deactivate.
Discussions.
Notation
Throughout this book, we adhere to the following notational conventions. Note that some
of these symbols are placeholders, while others refer to specific objects. As a general rule
of thumb, the indefinite article “a” often indicates that the symbol is a placeholder and that
similarly formatted symbols can denote other objects of the same type. For example, “𝑥: a
scalar” means that lowercased letters generally represent scalar values, but “Z: the set of
integers” refers specifically to the symbol Z.
Numerical Objects
• 𝑥: a scalar
• x: a vector
• X: a matrix
• X: a general tensor
• I: the identity matrix (of some given dimension), i.e., a square matrix with 1 on all
diagonal entries and 0 on all off-diagonals
• 𝑥𝑖, [x]𝑖: the 𝑖th element of vector x
• 𝑥𝑖𝑗, 𝑥𝑖,𝑗, [X]𝑖𝑗, [X]𝑖,𝑗: the element of matrix X at row 𝑖 and column 𝑗
Set Theory
• X: a set
• Z: the set of integers
• Z+ : the set of positive integers
• R: the set of real numbers
• R𝑛 : the set of 𝑛-dimensional vectors of real numbers
xxxvii
xxxviii Notation
• R𝑎×𝑏: the set of matrices of real numbers with 𝑎 rows and 𝑏 columns
• 𝑓 (·): a function
• 1(·): the indicator function; evaluates to 1 if the boolean argument is true, and 0 otherwise
• ‖·‖𝑝: the ℓ𝑝 norm
• ‖·‖: the ℓ2 norm
Calculus
• 𝑑𝑦/𝑑𝑥: derivative of 𝑦 with respect to 𝑥
• 𝜕𝑦/𝜕𝑥: partial derivative of 𝑦 with respect to 𝑥
• ∇x 𝑦: gradient of 𝑦 with respect to x
• ∫ₐᵇ 𝑓 (𝑥) 𝑑𝑥: definite integral of 𝑓 from 𝑎 to 𝑏 with respect to 𝑥
• ∫ 𝑓 (𝑥) 𝑑𝑥: indefinite integral of 𝑓 with respect to 𝑥
Probability and Information Theory
• 𝑋: a random variable
• 𝑃: a probability distribution
• 𝑋 ∼ 𝑃: the random variable 𝑋 follows distribution 𝑃
• 𝑃(𝑋 = 𝑥): the probability assigned to the event where random variable 𝑋 takes value 𝑥
• 𝑃(𝑋 | 𝑌 ): the conditional probability distribution of 𝑋 given 𝑌
• 𝑝(·): a probability density function (PDF) associated with distribution 𝑃
• 𝐸 [𝑋]: expectation of a random variable 𝑋
• 𝑋 ⊥ 𝑌 : random variables 𝑋 and 𝑌 are independent
• 𝑋 ⊥ 𝑌 | 𝑍: random variables 𝑋 and 𝑌 are conditionally independent given 𝑍
• 𝜎𝑋 : standard deviation of random variable 𝑋
• Var(𝑋): variance of random variable 𝑋, equal to 𝜎𝑋²
• Cov(𝑋, 𝑌 ): covariance of random variables 𝑋 and 𝑌
• 𝜌(𝑋, 𝑌 ): the Pearson correlation coefficient between 𝑋 and 𝑌 , equal to Cov(𝑋, 𝑌 )/(𝜎𝑋 𝜎𝑌 )
1 Introduction
Until recently, nearly every computer program that you might have interacted with during an
ordinary day was coded up as a rigid set of rules specifying precisely how it should behave.
Say that we wanted to write an application to manage an e-commerce platform. After
huddling around a whiteboard for a few hours to ponder the problem, we might settle on
the broad strokes of a working solution, for example: (i) users interact with the application
through an interface running in a web browser or mobile application; (ii) our application
interacts with a commercial-grade database engine to keep track of each user’s state and
maintain records of historical transactions; and (iii) at the heart of our application, the
business logic (you might say, the brains) of our application spells out a set of rules that
map every conceivable circumstance to the corresponding action that our program should
take.
To build the brains of our application, we might enumerate all the common events that our
program should handle. For example, whenever a customer clicks to add an item to their
shopping cart, our program should add an entry to the shopping cart database table, associ-
ating that user’s ID with the requested product’s ID. We might then attempt to step through
every possible corner case, testing the appropriateness of our rules and making any neces-
sary modifications. What happens if a user initiates a purchase with an empty cart? While
few developers ever get it completely right the first time (it might take some test runs to
work out the kinks), for the most part we can write such programs and confidently launch
them before ever seeing a real customer. Our ability to manually design automated sys-
tems that drive functioning products and systems, often in novel situations, is a remarkable
cognitive feat. And when you are able to devise solutions that work 100% of the time, you
typically should not be worrying about machine learning.
Fortunately for the growing community of machine learning scientists, many tasks that we
would like to automate do not bend so easily to human ingenuity. Imagine huddling around
the whiteboard with the smartest minds you know, but this time you are tackling one of the
following problems:
• Write a program that predicts tomorrow’s weather given geographic information, satellite
images, and a trailing window of past weather.
• Write a program that takes in a factoid question, expressed in free-form text, and answers
it correctly.
• Write a program that, given an image, identifies every person depicted in it and draws
outlines around each.
• Write a program that presents users with products that they are likely to enjoy but un-
likely, in the natural course of browsing, to encounter.
For these problems, even elite programmers would struggle to code up solutions from
scratch. The reasons can vary. Sometimes the program that we are looking for follows
a pattern that changes over time, so there is no fixed right answer! In such cases, any
successful solution must adapt gracefully to a changing world. At other times, the rela-
tionship (say between pixels, and abstract categories) may be too complicated, requiring
thousands or millions of computations and following unknown principles. In the case of
image recognition, the precise steps required to perform the task lie beyond our conscious
understanding, even though our subconscious cognitive processes execute the task effort-
lessly.
Machine learning is the study of algorithms that can learn from experience. As a machine
learning algorithm accumulates more experience, typically in the form of observational
data or interactions with an environment, its performance improves. Contrast this with
our deterministic e-commerce platform, which follows the same business logic, no matter
how much experience accrues, until the developers themselves learn and decide that it is
time to update the software. In this book, we will teach you the fundamentals of machine
learning, focusing in particular on deep learning, a powerful set of techniques driving in-
novations in areas as diverse as computer vision, natural language processing, healthcare,
and genomics.
1.1 A Motivating Example
Before beginning writing, the authors of this book, like much of the work force, had to
become caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called
out “Hey Siri”, awakening the phone’s voice recognition system. Then Mu commanded
“directions to Blue Bottle coffee shop”. The phone quickly displayed the transcription of
his command. It also recognized that we were asking for directions and launched the Maps
application (app) to fulfill our request. Once launched, the Maps app identified a number
of routes. Next to each route, the phone displayed a predicted transit time. While this story
was fabricated for pedagogical convenience, it demonstrates that in the span of just a few
seconds, our everyday interactions with a smart phone can engage several machine learning
models.
Imagine just writing a program to respond to a wake word such as “Alexa”, “OK Google”,
or “Hey Siri”. Try coding it up in a room by yourself with nothing but a computer and
a code editor, as illustrated in Fig. 1.1.1. How would you write such a program from first
principles? Think about it… the problem is hard. Every second, the microphone will col-
lect roughly 44,000 samples. Each sample is a measurement of the amplitude of the sound
wave. What rule could map reliably from a snippet of raw audio to confident predictions
{yes, no} about whether the snippet contains the wake word? If you are stuck, do not worry.
We do not know how to write such a program from scratch either. That is why we use
machine learning.
Fig. 1.1.1 Identify a wake word.
Here is the trick. Often, even when we do not know how to tell a computer explicitly how
to map from inputs to outputs, we are nonetheless capable of performing the cognitive feat
ourselves. In other words, even if you do not know how to program a computer to rec-
ognize the word “Alexa”, you yourself are able to recognize it. Armed with this ability,
we can collect a huge dataset containing examples of audio snippets and associated labels,
indicating which snippets contain the wake word. In the currently dominant approach to
machine learning, we do not attempt to design a system explicitly to recognize wake words.
Instead, we define a flexible program whose behavior is determined by a number of pa-
rameters. Then we use the dataset to determine the best possible parameter values, i.e.,
those that improve the performance of our program with respect to a chosen performance
measure.
You can think of the parameters as knobs that we can turn, manipulating the behavior of
the program. Once the parameters are fixed, we call the program a model. The set of all
distinct programs (input–output mappings) that we can produce just by manipulating the
parameters is called a family of models. And the “meta-program” that uses our dataset to
choose the parameters is called a learning algorithm.
Before we can go ahead and engage the learning algorithm, we have to define the problem
precisely, pinning down the exact nature of the inputs and outputs, and choosing an ap-
propriate model family. In this case, our model receives a snippet of audio as input, and
the model generates a selection among {yes, no} as output. If all goes according to plan
the model’s guesses will typically be correct as to whether the snippet contains the wake
word.
If we choose the right family of models, there should exist one setting of the knobs such
that the model fires “yes” every time it hears the word “Alexa”. Because the exact choice of
the wake word is arbitrary, we will probably need a model family sufficiently rich that, via
another setting of the knobs, it could fire “yes” only upon hearing the word “Apricot”. We
expect that the same model family should be suitable for “Alexa” recognition and “Apricot”
recognition because they seem, intuitively, to be similar tasks. However, we might need a
different family of models entirely if we want to deal with fundamentally different inputs
or outputs, say if we wanted to map from images to captions, or from English sentences to
Chinese sentences.
As you might guess, if we just set all of the knobs randomly, it is unlikely that our model
will recognize “Alexa”, “Apricot”, or any other English word. In machine learning, the
learning is the process by which we discover the right setting of the knobs for coercing the
desired behavior from our model. In other words, we train our model with data. As shown
in Fig. 1.1.2, the training process usually looks like the following:
1. Start off with a randomly initialized model that cannot do anything useful.
2. Grab some of your data (e.g., audio snippets and corresponding {yes, no} labels).
3. Tweak the knobs to make the model perform better as assessed on those examples.
4. Repeat Steps 2 and 3 until the model is awesome.
Fig. 1.1.2 A typical training process.
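As a skeletal sketch of this loop (PyTorch; the features, labels, layer sizes, and learning rate below are random stand-ins for illustration, not a real wake-word system):

import torch
from torch import nn

X = torch.randn(256, 40)                 # stand-in features for audio snippets
y = torch.randint(0, 2, (256,)).float()  # stand-in {yes, no} labels

model = nn.Sequential(nn.Linear(40, 16), nn.ReLU(), nn.Linear(16, 1))  # step 1: random init
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):  # step 4: repeat
    # step 2: assess performance on some data
    loss = nn.functional.binary_cross_entropy_with_logits(model(X).squeeze(1), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()     # step 3: tweak the knobs to perform better on these examples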
To summarize, rather than code up a wake word recognizer, we code up a program that can
learn to recognize wake words, if presented with a large labeled dataset. You can think of
this act of determining a program’s behavior by presenting it with a dataset as programming
with data. That is to say, we can “program” a cat detector by providing our machine learning
system with many examples of cats and dogs. This way the detector will eventually learn
to emit a very large positive number if it is a cat, a very large negative number if it is a
dog, and something closer to zero if it is not sure. This barely scratches the surface of what
machine learning can do. Deep learning, which we will explain in greater detail later, is
just one among many popular methods for solving machine learning problems.
1.2 Key Components
In our wake word example, we described a dataset consisting of audio snippets and binary
labels, and we gave a hand-wavy sense of how we might train a model to approximate a
mapping from snippets to classifications. This sort of problem, where we try to predict a
designated unknown label based on known inputs given a dataset consisting of examples
for which the labels are known, is called supervised learning. This is just one among many
kinds of machine learning problems. Before we explore other varieties, we would like to
shed more light on some core components that will follow us around, no matter what kind
of machine learning problem we tackle:
1. The data that we can learn from.
2. A model of how to transform the data.
3. An objective function that quantifies how well (or badly) the model is doing.
4. An algorithm to adjust the model’s parameters to optimize the objective function.
1.2.1 Data
It might go without saying that you cannot do data science without data. We could lose
hundreds of pages pondering what precisely data is, but for now, we will focus on the key
properties of the datasets that we will be concerned with. Generally, we are concerned with
a collection of examples. In order to work with data usefully, we typically need to come
up with a suitable numerical representation. Each example (or data point, data instance,
sample) typically consists of a set of attributes called features (sometimes called covariates
or inputs), based on which the model must make its predictions. In supervised learning
problems, our goal is to predict the value of a special attribute, called the label (or target),
that is not part of the model’s input.
If we were working with image data, each example might consist of an individual photo-
graph (the features) and a number indicating the category to which the photograph belongs
(the label). The photograph would be represented numerically as three grids of numerical
values representing the brightness of red, green, and blue light at each pixel location. For
example, a 200 × 200 pixel color photograph would consist of 200 × 200 × 3 = 120000
numerical values.
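As a minimal illustration in PyTorch (random values standing in for a real photograph):

import torch

photo = torch.rand(3, 200, 200)  # three 200 x 200 grids: red, green, and blue brightness
print(photo.numel())             # 3 * 200 * 200 = 120000 numerical values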
Alternatively, we might work with electronic health record data and tackle the task of pre-
dicting the likelihood that a given patient will survive the next 30 days. Here, our features
might consist of a collection of readily available attributes and frequently recorded mea-
surements, including age, vital signs, comorbidities, current medications, and recent pro-
cedures. The label available for training would be a binary value indicating whether each
patient in the historical data survived within the 30-day window.
In such cases, when every example is characterized by the same number of numerical fea-
tures, we say that the inputs are fixed-length vectors and we call the (constant) length of
the vectors the dimensionality of the data. As you might imagine, fixed-length inputs can
be convenient, giving us one less complication to worry about. However, not all data can
easily be represented as fixed-length vectors. While we might expect microscope images to
come from standard equipment, we cannot expect images mined from the Internet all to have
the same resolution or shape. For images, we might consider cropping them to a standard
size, but that strategy only gets us so far. We risk losing information in the cropped-out
portions. Moreover, text data resists fixed-length representations even more stubbornly.
Consider the customer reviews left on e-commerce sites such as Amazon, IMDb, and Tri-
pAdvisor. Some are short: “it stinks!”. Others ramble for pages. One major advantage of
deep learning over traditional methods is the comparative grace with which modern models
can handle varying-length data.
Generally, the more data we have, the easier our job becomes. When we have more data, we
can train more powerful models and rely less heavily on preconceived assumptions. The
regime change from (comparatively) small to big data is a major contributor to the success
of modern deep learning. To drive the point home, many of the most exciting models in
deep learning do not work without large datasets. Some others might work in the small
data regime, but are no better than traditional approaches.
Finally, it is not enough to have lots of data and to process it cleverly. We need the right
data. If the data is full of mistakes, or if the chosen features are not predictive of the target
quantity of interest, learning is going to fail. The situation is captured well by the cliché:
garbage in, garbage out. Moreover, poor predictive performance is not the only poten-
tial consequence. In sensitive applications of machine learning, like predictive policing,
resume screening, and risk models used for lending, we must be especially alert to the con-
sequences of garbage data. One commonly occurring failure mode concerns datasets where
some groups of people are unrepresented in the training data. Imagine applying a skin can-
cer recognition system that had never seen black skin before. Failure can also occur when
the data does not only under-represent some groups but reflects societal prejudices. For ex-
ample, if past hiring decisions are used to train a predictive model that will be used to screen
resumes then machine learning models could inadvertently capture and automate historical
injustices. Note that this can all happen without the data scientist actively conspiring, or
even being aware.
1.2.2 Models
Most machine learning involves transforming the data in some sense. We might want to
build a system that ingests photos and predicts smiley-ness. Alternatively, we might want to
ingest a set of sensor readings and predict how normal vs. anomalous the readings are. By
model, we denote the computational machinery for ingesting data of one type, and spitting
out predictions of a possibly different type. In particular, we are interested in statistical
models that can be estimated from data. While simple models are perfectly capable of ad-
dressing appropriately simple problems, the problems that we focus on in this book stretch
the limits of classical methods. Deep learning is differentiated from classical approaches
principally by the set of powerful models that it focuses on. These models consist of many
successive transformations of the data that are chained together top to bottom, thus the
name deep learning. On our way to discussing deep models, we will also discuss some
more traditional methods.
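As a sketch of such chained transformations (the layer sizes here are arbitrary illustrative choices):

import torch
from torch import nn

# a "deep" model: several successive transformations of the data
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
print(net(torch.rand(4, 20)).shape)  # torch.Size([4, 1])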
1.2.3 Objective Functions
To assess how well (or badly) our models are doing, we need formal measures of their
performance, called objective functions; by convention we define them so that lower is
better, and so they are often called loss functions. When predicting numerical values, the
most common loss is squared error, i.e., the square of the difference between the prediction
and the ground truth. For classification, the most common objective is to minimize error
rate, i.e., the fraction of examples on which our predictions disagree with the ground
truth. Some objectives (e.g., squared error) are easy to optimize, while others (e.g., error
rate) are difficult to optimize
directly, owing to non-differentiability or other complications. In these cases, it is common
instead to optimize a surrogate objective.
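A minimal sketch of this distinction, with made-up model outputs: the error rate is piecewise constant in the parameters, so we instead minimize a differentiable surrogate such as cross-entropy.

import torch
from torch.nn import functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # scores for two examples, two classes
labels = torch.tensor([0, 0])                     # ground-truth classes

error_rate = (logits.argmax(dim=1) != labels).float().mean()  # hard to optimize directly
surrogate = F.cross_entropy(logits, labels)                   # differentiable surrogate
print(error_rate.item(), surrogate.item())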
During optimization, we think of the loss as a function of the model’s parameters, and treat
the training dataset as a constant. We learn the best values of our model’s parameters by
minimizing the loss incurred on a set consisting of some number of examples collected for
training. However, doing well on the training data does not guarantee that we will do well
on unseen data. So we will typically want to split the available data into two partitions:
the training dataset (or training set), for learning model parameters; and the test dataset
(or test set), which is held out for evaluation. At the end of the day, we typically report
how our models perform on both partitions. You could think of training performance as
analogous to the scores that a student achieves on the practice exams used to prepare for
some real final exam. Even if the results are encouraging, that does not guarantee success
on the final exam. Over the course of studying, the student might begin to memorize the
practice questions, appearing to master the topic but faltering when faced with previously
unseen questions on the actual final exam. When a model performs well on the training set
but fails to generalize to unseen data, we say that it is overfitting to the training data.
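A sketch of such a split (random stand-in data; the 80/20 proportion is a common but arbitrary choice):

import torch

features, labels = torch.rand(1000, 20), torch.rand(1000)
idx = torch.randperm(1000)                  # shuffle the example indices
train_idx, test_idx = idx[:800], idx[800:]  # hold out 20% for evaluation
train_features, train_labels = features[train_idx], labels[train_idx]
test_features, test_labels = features[test_idx], labels[test_idx]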
1.3 Kinds of Machine Learning Problems
The wake word problem in our motivating example is just one among many that machine
learning can tackle. To motivate the reader further and provide us with some common
language that will follow us throughout the book, we now provide a broad overview of the
landscape of machine learning problems.
1.3.1 Supervised Learning
Supervised learning describes tasks where we are given a dataset containing both features
and labels and are tasked with producing a model that predicts the labels for inputs whose
labels are unknown. The supervision comes into play because, for choosing the parameters,
we (the supervisors) provide the model with a dataset consisting of labeled examples.
In probabilistic terms, we typically are interested in estimating the conditional probability
of a label given input features. While it is just one among several paradigms, supervised
learning accounts for the majority of successful applications of machine learning in indus-
try. Partly that is because many important tasks can be described crisply as estimating the
probability of something unknown given a particular set of available data:
• Predict the price of a stock next month based on this month’s financial reporting data.
While all supervised learning problems are captured by the simple description “predicting
the labels given input features”, supervised learning itself can take diverse forms and require
tons of modeling decisions, depending on (among other considerations) the type, size, and
quantity of the inputs and outputs. For example, we use different models for processing
sequences of arbitrary lengths and fixed-length vector representations. We will visit many
of these problems in depth throughout this book.
Informally, the learning process looks something like the following. First, grab a big col-
lection of examples for which the features are known and select from them a random subset,
acquiring the ground truth labels for each. Sometimes these labels might be available data
that have already been collected (e.g., did a patient die within the following year?) and
other times we might need to employ human annotators to label the data, (e.g., assigning
images to categories). Together, these inputs and corresponding labels comprise the train-
ing set. We feed the training dataset into a supervised learning algorithm, a function that
takes as input a dataset and outputs another function: the learned model. Finally, we can
feed previously unseen inputs to the learned model, using its outputs as predictions of the
corresponding label. The full process is drawn in Fig. 1.3.1.
Fig. 1.3.1 Supervised learning.
Regression
Perhaps the simplest supervised learning task to wrap your head around is regression. Con-
sider, for example, a set of data harvested from a database of home sales. We might con-
struct a table, in which each row corresponds to a different house, and each column cor-
responds to some relevant attribute, such as the square footage of a house, the number of
bedrooms, the number of bathrooms, and the number of minutes (walking) to the center
of town. In this dataset, each example would be a specific house, and the corresponding
feature vector would be one row in the table. If you live in New York or San Francisco, and
you are not the CEO of Amazon, Google, Microsoft, or Facebook, the (sq. footage, no. of
bedrooms, no. of bathrooms, walking distance) feature vector for your home might look
something like: [600, 1, 1, 60]. However, if you live in Pittsburgh, it might look more like
[3000, 4, 3, 10]. Fixed-length feature vectors like this are essential for most classic machine
learning algorithms.
What makes a problem a regression is actually the form of the target. Say that you are in the
market for a new home. You might want to estimate the fair market value of a house, given
some features such as above. The data here might consist of historical home listings and the
labels might be the observed sales prices. When labels take on arbitrary numerical values
(even within some interval), we call this a regression problem. The goal is to produce a
model whose predictions closely approximate the actual label values.
Lots of practical problems are easily described as regression problems. Predicting the rating
that a user will assign to a movie can be thought of as a regression problem and if you
designed a great algorithm to accomplish this feat in 2009, you might have won the
1-million-dollar Netflix prize. Predicting the length of stay for patients in the hospital is
also a regression problem. A good rule of thumb is that any how much? or how many?
problem is likely to be regression. For example:
• How many hours will this surgery take?
• How much rainfall will this town have in the next six hours?
Even if you have never worked with machine learning before, you have probably worked
through a regression problem informally. Imagine, for example, that you had your drains re-
paired and that your contractor spent 3 hours removing gunk from your sewage pipes. Then
they sent you a bill of 350 dollars. Now imagine that your friend hired the same contractor
for 2 hours and received a bill of 250 dollars. If someone then asked you how much to
expect on their upcoming gunk-removal invoice you might make some reasonable assump-
tions, such as more hours worked costs more dollars. You might also assume that there is
some base charge and that the contractor then charges per hour. If these assumptions held
true, then given these two data examples, you could already identify the contractor’s pricing
structure: 100 dollars per hour plus 50 dollars to show up at your house. If you followed
that much, then you already understand the high-level idea behind linear regression.
In this case, we could produce the parameters that exactly matched the contractor’s prices.
Sometimes this is not possible, e.g., if some of the variation arises from factors beyond
your two features. In these cases, we will try to learn models that minimize the distance
between our predictions and the observed values. In most of our chapters, we will focus on
minimizing the squared error loss function. As we will see later, this loss corresponds to
the assumption that our data were corrupted by Gaussian noise.
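A minimal sketch that recovers this pricing structure from the two bills above by minimizing squared error with plain gradient descent (the learning rate and step count are illustrative):

import torch

hours = torch.tensor([[3.0], [2.0]])
bills = torch.tensor([[350.0], [250.0]])

w = torch.zeros(1, requires_grad=True)  # dollars per hour
b = torch.zeros(1, requires_grad=True)  # base charge

for _ in range(2000):
    loss = ((hours * w + b - bills) ** 2).mean()  # squared error loss
    loss.backward()
    with torch.no_grad():
        for p in (w, b):
            p -= 0.1 * p.grad
            p.grad.zero_()
print(w.item(), b.item())  # approaches 100 (per hour) and 50 (base charge)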
Classification
While regression models are great for addressing how many? questions, lots of problems do
not fit comfortably in this template. Consider, for example, a bank that wants to develop a
check scanning feature for its mobile app. Ideally, the customer would simply snap a photo
of a check and the app would automatically recognize the text from the image. Assuming
that we had some ability to segment out image patches corresponding to each handwritten
character, then the primary remaining task would be to determine which character among
some known set is depicted in each image patch. These kinds of which one? problems
are called classification and require a different set of tools from those used for regression,
although many techniques will carry over.
In classification, we want our model to look at features, e.g., the pixel values in an image,
and then predict to which category (sometimes called a class) among some discrete set
of options, an example belongs. For handwritten digits, we might have ten classes, corre-
sponding to the digits 0 through 9. The simplest form of classification is when there are
only two classes, a problem which we call binary classification. For example, our dataset
could consist of images of animals and our labels might be the classes {cat, dog}. Whereas
in regression we sought a regressor to output a numerical value, in classification we seek a
classifier, whose output is the predicted class assignment.
For reasons that we will get into as the book gets more technical, it can be difficult to opti-
mize a model that can only output a firm categorical assignment, e.g., either “cat” or “dog”.
In these cases, it is usually much easier to express our model in the language of probabili-
ties. Given features of an example, our model assigns a probability to each possible class.
Returning to our animal classification example where the classes are {cat, dog}, a classi-
fier might see an image and output the probability that the image is a cat as 0.9. We can
interpret this number by saying that the classifier is 90% sure that the image depicts a cat.
The magnitude of the probability for the predicted class conveys a notion of uncertainty.
It is not the only one available and we will discuss others in chapters dealing with more
advanced topics.
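As a minimal sketch of such probabilistic output (the raw scores are made up):

import torch
from torch.nn import functional as F

logits = torch.tensor([2.2, 0.0])  # raw scores for the classes {cat, dog}
probs = F.softmax(logits, dim=0)   # normalized into a probability distribution
print(probs)  # roughly tensor([0.9002, 0.0998]): "90% sure the image depicts a cat"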
When we have more than two possible classes, we call the problem multiclass classification.
Common examples include handwritten character recognition {0, 1, 2, ... 9, a, b, c, ...}. While
we attacked regression problems by trying to minimize the squared error loss function, the
common loss function for classification problems is called cross-entropy, whose name will
be demystified when we introduce information theory in later chapters.
Note that the most likely class is not necessarily the one that you are going to use for your
decision. Assume that you find a beautiful mushroom in your backyard as shown in Fig. 1.3.2.
Fig. 1.3.2 Death cap - do not eat!
Now, assume that you built a classifier and trained it to predict whether a mushroom is poi-
sonous based on a photograph. Say our poison-detection classifier outputs that the proba-
bility that Fig. 1.3.2 shows a death cap is 0.2. In other words, the classifier is 80% sure that
our mushroom is not a death cap. Still, you would have to be a fool to eat it. That is because
the certain benefit of a delicious dinner is not worth a 20% risk of dying from it. In other
words, the effect of the uncertain risk outweighs the benefit by far. Thus, in order to make
a decision about whether to eat the mushroom, we need to compute the expected detriment
associated with each action which depends both on the likely outcomes and the benefits or
harms associated with each. In this case, the detriment incurred by eating the mushroom
might be 0.2 × ∞ + 0.8 × 0 = ∞, whereas the loss of discarding it is 0.2 × 0 + 0.8 × 1 = 0.8.
Our caution was justified: as any mycologist would tell us, the mushroom in Fig. 1.3.2 is
actually a death cap.
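A minimal sketch of this expected-detriment computation, using the probabilities and stylized losses from the text:

import math

p_death_cap = 0.2
detriment_eat = p_death_cap * math.inf + (1 - p_death_cap) * 0  # = inf
detriment_discard = p_death_cap * 0 + (1 - p_death_cap) * 1     # = 0.8
print(detriment_eat, detriment_discard)  # discarding minimizes expected detriment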
Classification can get much more complicated than just binary or multiclass classification.
For instance, there are some variants of classification addressing hierarchically structured
classes. In such cases not all errors are equal—if we must err, we might prefer to misclassify
to a related class rather than a distant class. Usually, this is referred to as hierarchical
classification. For inspiration, you might think of Linnaeus, who organized fauna in a
hierarchy.
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer,
but our model would pay a huge penalty if it confused a poodle with a dinosaur. Which
hierarchy is relevant might depend on how you plan to use the model. For example, rat-
tlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler
for a garter could have fatal consequences.
Tagging
Some classification problems fit neatly into the binary or multiclass classification setups.
For example, we could train a normal binary classifier to distinguish cats from dogs. Given
the current state of computer vision, we can do this easily, with off-the-shelf tools. Nonethe-
less, no matter how accurate our model gets, we might find ourselves in trouble when the
classifier encounters an image of the Town Musicians of Bremen, a popular German fairy
tale featuring four animals (Fig. 1.3.3).
As you can see, the photo features a cat, a rooster, a dog, and a donkey, with some trees in
the background. If we anticipate encountering such images, multiclass classification might
not be the right problem formulation. Instead, we might want to give the model the option
of saying the image depicts a cat, a dog, a donkey, and a rooster.
Fig. 1.3.3 A donkey, a dog, a cat, and a rooster.
The problem of learning to predict classes that are not mutually exclusive is called multi-
label classification. Auto-tagging problems are typically best described in terms of multi-
label classification. Think of the tags people might apply to posts on a technical blog, e.g.,
“machine learning”, “technology”, “gadgets”, “programming languages”, “Linux”, “cloud
computing”, “AWS”. A typical article might have 5–10 tags applied. Typically, tags will
exhibit some correlation structure. Posts about “cloud computing” are likely to mention
“AWS” and posts about “machine learning” are likely to mention “GPUs”.
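A common way to handle multi-label problems is to predict one probability per tag with independent sigmoids and then apply a threshold. A minimal sketch (the tags, logits, and threshold are illustrative):

import torch

tags = ['machine learning', 'technology', 'cloud computing', 'AWS', 'Linux']
logits = torch.tensor([2.1, -0.3, -1.5, 1.2, -0.7])  # hypothetical model outputs
probs = torch.sigmoid(logits)  # one independent probability per tag
predicted = [t for t, p in zip(tags, probs) if p > 0.5]
print(predicted)  # ['machine learning', 'AWS']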
Sometimes such tagging problems draw on enormous label sets. The National Library of
Medicine employs many professional annotators who associate each article to be indexed in
PubMed with a set of tags drawn from the Medical Subject Headings (MeSH) ontology, a
collection of roughly 28,000 tags. Correctly tagging articles is important because it allows
researchers to conduct exhaustive reviews of the literature. This is a time-consuming pro-
cess and typically there is a one-year lag between archiving and tagging. Machine learning
can provide provisional tags until each article has a proper manual review. Indeed, for
several years, the BioASQ organization has hosted competitions for this task.
Search
In the field of information retrieval, we often impose ranks on sets of items. Take web
search for example. The goal is less to determine whether a particular page is relevant for a
query, but rather which, among a set of relevant results, should be shown most prominently
to a particular user. One way of doing this might be to first assign a score to every element
in the set and then to retrieve the top-rated elements. PageRank, the original secret sauce behind the Google search engine, was an early example of such a scoring system.
Weirdly, the scoring provided by PageRank did not depend on the actual query. Instead, early engines relied on a simple relevance filter to identify the set of relevant candidates and then
used PageRank to prioritize the more authoritative pages. Nowadays, search engines use
machine learning and behavioral models to obtain query-dependent relevance scores. There
are entire academic conferences devoted to this subject.
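To give a flavor of such scoring, here is a minimal sketch of the PageRank power iteration on a tiny made-up link graph (the damping factor 0.85 is the conventional choice):

import torch

# Column-stochastic link matrix for a 3-page web: entry [i, j] is the
# probability of following a link from page j to page i.
M = torch.tensor([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
d, n = 0.85, 3
rank = torch.full((n,), 1 / n)  # start from a uniform distribution
for _ in range(50):             # power iteration
    rank = (1 - d) / n + d * (M @ rank)
print(rank)  # stationary scores: higher means more authoritative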
Recommender Systems
Recommender systems are another problem setting that is related to search and ranking.
The problems are similar insofar as the goal is to display a set of items relevant to the user.
The main difference is the emphasis on personalization to specific users in the context of
recommender systems. For instance, for movie recommendations, the results page for a
science fiction fan and the results page for a connoisseur of Peter Sellers comedies might
differ significantly. Similar problems pop up in other recommendation settings, e.g., for
retail products, music, and news recommendation.
In some cases, customers provide explicit feedback, communicating how much they liked a
particular product (e.g., the product ratings and reviews on Amazon, IMDb, or Goodreads).
In other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which
might indicate dissatisfaction or maybe just indicate that the song was inappropriate in
context. In the simplest formulations, these systems are trained to estimate some score,
such as an expected star rating or the probability that a given user will purchase a particular
item.
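In the simplest matrix-factorization formulation, such a score is just an inner product of learned user and item embeddings; a minimal sketch with randomly initialized (untrained) embeddings:

import torch

embed_dim = 4
user_vec = torch.randn(embed_dim)       # embedding for one user
item_vecs = torch.randn(10, embed_dim)  # embeddings for 10 items
scores = item_vecs @ user_vec           # one predicted score per item
top3 = scores.topk(3).indices           # indices of the highest-scoring items
print(top3)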
Given such a model, for any given user, we could retrieve the set of objects with the largest
scores, which could then be recommended to the user. Production systems are consider-
ably more advanced and take detailed user activity and item characteristics into account
when computing such scores. Fig. 1.3.4 displays the deep learning books recommended by
Amazon based on personalization algorithms tuned to capture Aston’s preferences.
Despite their tremendous economic value, recommender systems naively built on top of
predictive models suffer some serious conceptual flaws. To start, we only observe censored
feedback: users preferentially rate movies that they feel strongly about. For example, on
a five-point scale, you might notice that items receive many one- and five-star ratings but
that there are conspicuously few three-star ratings. Moreover, current purchase habits are
often a result of the recommendation algorithm currently in place, but learning algorithms
do not always take this detail into account. Thus it is possible for feedback loops to form
where a recommender system preferentially pushes an item that is then taken to be better
(due to greater purchases) and in turn is recommended even more frequently.
Fig. 1.3.4 Deep learning books recommended by Amazon.
Many of these problems—about how to deal with censoring, incentives, and feedback loops—are
important open research questions.
Sequence Learning
So far, we have looked at problems where we have some fixed number of inputs and produce
a fixed number of outputs. For example, we considered predicting house prices given a
fixed set of features: square footage, number of bedrooms, number of bathrooms, and the
transit time to downtown. We also discussed mapping from an image (of fixed dimension)
to the predicted probabilities that it belongs to each among a fixed number of classes and
predicting star ratings associated with purchases based on the user ID and product ID alone.
In these cases, once our model is trained, each test example fed into it is immediately forgotten. We assumed that successive observations were independent and
thus there was no need to hold on to this context.
But how should we deal with video snippets? In this case, each snippet might consist of
a different number of frames. And our guess of what is going on in each frame might be
much stronger if we take into account the previous or succeeding frames. The same goes for
language. For example, one popular deep learning problem is machine translation: the task
of ingesting sentences in some source language and predicting their translations in another
language.
Such problems also occur in medicine. We might want a model to monitor patients in the
intensive care unit and to fire off alerts whenever their risk of dying in the next 24 hours
exceeds some threshold. Here, we would not throw away everything that we know about
the patient history every hour, because we might not want to make predictions based only
on the most recent measurements.
Questions like these are among the most exciting applications of machine learning and
they are instances of sequence learning. They require a model either to ingest sequences
of inputs or to emit sequences of outputs (or both). Specifically, sequence-to-sequence
learning considers problems where both inputs and outputs consist of variable-length se-
quences. Examples include machine translation and speech-to-text transcription. While it
is impossible to consider all types of sequence transformations, the following special cases
are worth mentioning.
Tagging and Parsing. This involves annotating a text sequence with attributes. Here,
the inputs and outputs are aligned, i.e., they are of the same number and occur in a corre-
sponding order. For instance, in part-of-speech (PoS) tagging, we annotate every word in
a sentence with the corresponding part of speech, e.g., “noun” or “direct object”. Alterna-
tively, we might want to know which groups of contiguous words refer to named entities,
like people, places, or organizations. In the cartoonishly simple example below, we might
just want to indicate whether or not any word in the sentence is part of a named entity
(tagged as “Ent”).
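For instance, a purely illustrative tagging might look as follows:
Tom has dinner in Washington with Sally
Ent  -    -     -      Ent     -    Ent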
Automatic Speech Recognition. With speech recognition, the input sequence is an audio
recording of a speaker (Fig. 1.3.5), and the output is a transcript of what the speaker said.
The challenge is that there are many more audio frames (sound is typically sampled at
8kHz or 16kHz) than text, i.e., there is no 1:1 correspondence between audio and text,
since thousands of samples may correspond to a single spoken word. These are sequence-
to-sequence learning problems, where the output is much shorter than the input. While
humans are remarkably good at recognizing speech, even from low-quality audio, getting
computers to perform the same feat is a formidable challenge.
Fig. 1.3.5 -D-e-e-p- L-ea-r-ni-ng- in an audio recording.
Text to Speech. This is the inverse of automatic speech recognition. Here, the input is text
and the output is an audio file. In this case, the output is much longer than the input.
Machine Translation. Unlike the case of speech recognition, where corresponding inputs
and outputs occur in the same order, in machine translation, unaligned data poses a new
challenge. Here the input and output sequences can have different lengths, and the corre-
sponding regions of the respective sequences may appear in a different order. Consider the
following illustrative example of the peculiar tendency of Germans to place the verbs at the
end of sentences:
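An illustrative example (the word-by-word gloss shows the original order):
German: Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
English: Did you already check out this excellent tutorial?
Wrong alignment: Did you yourself already this excellent tutorial looked-at?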
Many related problems pop up in other learning tasks. For instance, determining the order
in which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue
problems exhibit all kinds of additional complications, where determining what to say next
requires taking into account real-world knowledge and the prior state of the conversation
across long temporal distances. Such topics are active areas of research.
• Is there a description of the root causes of much of the data that we observe? For instance,
if we have demographic data about house prices, pollution, crime, location, education,
and salaries, can we discover how they are related simply based on empirical data?
The fields concerned with causality and probabilistic graphical models tackle such
questions.
• Another important and exciting recent development in unsupervised learning is the ad-
vent of deep generative models. These models estimate the density of the data, either
explicitly or implicitly. Once trained, we can use a generative model either to score
examples according to how likely they are, or to sample synthetic examples from the
learned distribution. Early deep learning breakthroughs in generative modeling came
with the invention of variational autoencoders (Kingma and Welling, 2014, Rezende
et al., 2014) and continued with the development of generative adversarial networks
(Goodfellow et al., 2014). More recent advances include normalizing flows (Dinh et
al., 2014, Dinh et al., 2017) and diffusion models (Ho et al., 2020, Sohl-Dickstein et
al., 2015, Song and Ermon, 2019, Song et al., 2021).
A further development in unsupervised learning has been the rise of self-supervised learn-
ing, techniques that leverage some aspect of the unlabeled data to provide supervision. For
text, we can train models to “fill in the blanks” by predicting randomly masked words us-
ing their surrounding words (contexts) in big corpora without any labeling effort (Devlin
et al., 2018)! For images, we may train models to tell the relative position between two
cropped regions of the same image (Doersch et al., 2015), to predict an occluded part of an
image based on the remaining portions of the image, or to predict whether two examples
are perturbed versions of the same underlying image. Self-supervised models often learn
representations that are subsequently leveraged by fine-tuning on some downstream task of interest.
Fig. 1.3.6 Collecting data for supervised learning from an environment.
When an agent interacts with an environment, we must account for the way its actions might impact the agent's future observations, and so offline learning is inappropriate.
Considering the interaction with an environment opens a whole set of new modeling ques-
tions. The following are just a few examples.
• Does the environment want to help us, e.g., a user reading text into a speech recognizer?
• Does the environment want to beat us, e.g., spammers adapting their emails to evade
spam filters?
• Does the environment have shifting dynamics? For example, would future data always
resemble the past or would the patterns change over time, either naturally or in re-
sponse to our automated tools?
These questions raise the problem of distribution shift, where training and test data are
different. An example that many of us may have encountered is taking an exam written by a lecturer after preparing with homework composed by their teaching assistants. Next, we
briefly describe reinforcement learning, a rich framework for posing learning problems in
which an agent interacts with an environment.
Reinforcement learning gives a very general statement of a problem in which an agent inter-
acts with an environment over a series of time steps. At each time step, the agent receives
some observation from the environment and must choose an action that is subsequently
transmitted back to the environment via some mechanism (sometimes called an actuator). After each such loop, the agent receives a reward from the environment. This process is
illustrated in Fig. 1.3.7. The agent then receives a subsequent observation, and chooses a
subsequent action, and so on. The behavior of a reinforcement learning agent is governed
by a policy. In brief, a policy is just a function that maps from observations of the environ-
ment to actions. The goal of reinforcement learning is to produce good policies.
Fig. 1.3.7 The interaction between reinforcement learning and an environment.
It is hard to overstate the generality of the reinforcement learning framework. For example,
supervised learning can be recast as reinforcement learning. Say we had a classification
problem. We could create a reinforcement learning agent with one action corresponding
to each class. We could then create an environment which gave a reward equal to the negative of the loss function from the original supervised learning problem.
Further, reinforcement learning can also address many problems that supervised learning
cannot. For example, in supervised learning, we always expect that the training input comes
associated with the correct label. But in reinforcement learning, we do not assume that,
for each observation the environment tells us the optimal action. In general, we just get
some reward. Moreover, the environment may not even tell us which actions led to the
reward.
Consider the game of chess. The only real reward signal comes at the end of the game when
we either win, earning a reward of, say, 1, or when we lose, receiving a reward of, say,
−1. So reinforcement learners must deal with the credit assignment problem: determining
which actions to credit or blame for an outcome. The same goes for an employee who gets
a promotion on October 11. That promotion likely reflects a number of well-chosen actions
over the previous year. Getting promoted in the future requires figuring out which actions
along the way led to the earlier promotions.
Reinforcement learners may also have to deal with the problem of partial observability.
That is, the current observation might not tell you everything about your current state. Say
your cleaning robot found itself trapped in one of many identical closets in your house.
Rescuing the robot involves inferring its precise location which might require considering
earlier observations prior to it entering the closet.
Finally, at any given point, reinforcement learners might know of one good policy, but
there might be many other better policies that the agent has never tried. The reinforcement
learner must constantly choose whether to exploit the best (currently) known strategy as a
policy, or to explore the space of strategies, potentially giving up some short-term reward
in exchange for knowledge.
The general reinforcement learning problem has a very broad setting. Actions affect sub-
sequent observations. Rewards are only observed when they correspond to the chosen ac-
tions. The environment may be either fully or partially observed. Accounting for all this
complexity at once may be asking too much. Moreover, not every practical problem ex-
hibits all this complexity. As a result, researchers have studied a number of special cases
of reinforcement learning problems.
When the environment is fully observed, we call the reinforcement learning problem a
Markov decision process. When the state does not depend on the previous actions, we call
it a contextual bandit problem. When there is no state, just a set of available actions with
initially unknown rewards, we have the classic multi-armed bandit problem.
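To make the explore-exploit trade-off concrete, here is a minimal sketch of an ε-greedy agent for a three-armed bandit (the arm probabilities and ε are illustrative):

import torch

true_means = torch.tensor([0.3, 0.5, 0.7])  # unknown to the agent
counts = torch.zeros(3)
estimates = torch.zeros(3)
eps = 0.1

for step in range(1000):
    if torch.rand(1).item() < eps:
        arm = int(torch.randint(3, (1,)))  # explore: pick a random arm
    else:
        arm = int(estimates.argmax())      # exploit: pick the best estimate
    reward = float(torch.rand(1).item() < true_means[arm])  # Bernoulli reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # approaches the true means, sampling the best arm most often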
1.4 Roots
We have just reviewed a small subset of problems that machine learning can address. For
a diverse set of machine learning problems, deep learning provides powerful tools for their
solution. Although many deep learning methods are recent inventions, the core ideas be-
hind learning from data have been studied for centuries. In fact, humans have held the
desire to analyze data and to predict future outcomes for ages, and it is this desire that is
at the root of much of natural science and mathematics. Two examples are the Bernoulli
distribution, named after Jacob Bernoulli (1655–1705), and the Gaussian distribution discovered by Carl Friedrich Gauss (1777–1855). Gauss invented, for instance, the least mean squares algorithm, which is still used today for a multitude of problems from insurance calculations to medical diagnostics. Such tools enhanced the experimental approach in the natural sciences—for instance, Ohm’s law relating current and voltage in a resistor
is perfectly described by a linear model.
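As a minimal sketch of what this looks like in practice, we can recover the resistance in Ohm's law from noisy synthetic measurements with a one-line least-squares estimate:

import torch

current = torch.linspace(0.1, 1.0, 20)  # synthetic currents (amperes)
voltage = 5.0 * current + 0.05 * torch.randn(20)  # noisy voltages, R = 5 ohms
# Least-squares estimate for a line through the origin: R = sum(IV) / sum(I^2)
r_hat = (current * voltage).sum() / (current ** 2).sum()
print(r_hat)  # close to 5.0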
Even in the middle ages, mathematicians had a keen intuition of estimates. For instance,
the geometry book of Jacob Köbel (1460–1533) illustrates averaging the length of 16 adult men’s feet to estimate the typical foot length in the population (Fig. 1.4.1).
As a group of individuals exited a church, 16 adult men were asked to line up in a row
and have their feet measured. The sum of these measurements was then divided by 16 to
obtain an estimate for what now is called one foot. This “algorithm” was later improved to deal with misshapen feet: the two men with the shortest and longest feet were sent away, averaging only over the remainder. This is among the earliest examples of a trimmed mean
estimate.
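In code, Köbel's improved procedure amounts to a trimmed mean; a minimal sketch with made-up measurements:

import torch

feet = torch.tensor([27., 28., 29., 30., 30., 31., 31., 32.,
                     32., 33., 33., 34., 34., 35., 36., 40.])  # cm, illustrative
sorted_feet, _ = feet.sort()
trimmed = sorted_feet[1:-1]  # send away the shortest and the longest
print(feet.mean(), trimmed.mean())  # plain mean vs. trimmed mean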
Statistics really took off with the availability and collection of data. One of its pioneers,
Ronald Fisher (1890–1962), contributed significantly to its theory and also its applications in genetics. Many of his algorithms (such as linear discriminant analysis) and con-
cepts (such as the Fisher information matrix) still hold a prominent place in the founda-
tions of modern statistics. Even his data resources had a lasting impact. The Iris dataset
that Fisher released in 1936 is still sometimes used to demonstrate machine learning algo-
rithms. Fisher was also a proponent of eugenics, which should remind us that the morally
dubious use of data science has as long and enduring a history as its productive use in industry and the natural sciences.
Fig. 1.4.1 Estimating the length of a foot.
Other influences for machine learning came from the information theory of Claude Shannon (1916–2001) and the theory of computation proposed by Alan Turing (1912–1954). Turing posed the question “can machines think?” in his famous paper Computing Machinery and Intelligence (Turing, 1950). Describing what is now known as the Turing test, he proposed that a machine can be considered intelligent if it is difficult for a human evaluator to distinguish between the replies from a machine and those of a human, based purely on textual interactions.
Further influences came from neuroscience and psychology. After all, humans clearly ex-
hibit intelligent behavior. Many scholars have asked whether one could explain and pos-
sibly reverse engineer this capacity. One of the first biologically inspired algorithms was
formulated by Donald Hebb (1904–1985). In his groundbreaking book The Organization of Behavior (Hebb, 1949), he posited that neurons learn by positive reinforcement.
This became known as the Hebbian learning rule. These ideas inspired later work, such
as Rosenblatt’s perceptron learning algorithm, and laid the foundations of many stochastic
gradient descent algorithms that underpin deep learning today: reinforce desirable behav-
ior and diminish undesirable behavior to obtain good settings of the parameters in a neural
network.
Biological inspiration is what gave neural networks their name. For over a century (dating
back to the models of Alexander Bain, 1873, and Charles Sherrington, 1890), researchers
have tried to assemble computational circuits that resemble networks of interacting neurons.
Over time, the interpretation of biology has become less literal, but the name stuck. At its
heart lie a few key principles that can be found in most networks today:
• The alternation of linear and nonlinear processing units, often referred to as layers.
• The use of the chain rule (also known as backpropagation) for adjusting parameters in
the entire network at once.
After initial rapid progress, research in neural networks languished from around 1995 until
2005. This was mainly due to two reasons. First, training a network is computationally
very expensive. While random-access memory was plentiful at the end of the past century,
computational power was scarce. Second, datasets were relatively small. In fact, Fisher’s
Iris dataset from 1936 was still a popular tool for testing the efficacy of algorithms. The
MNIST dataset with its 60,000 handwritten digits was considered huge.
Given the scarcity of data and computation, strong statistical tools such as kernel methods,
decision trees, and graphical models proved empirically superior in many applications.
Moreover, unlike neural networks, they did not require weeks to train and provided pre-
dictable results with strong theoretical guarantees.
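1.5 The Road to Deep Learning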
Much of this changed with the availability of massive amounts of data, thanks to the World
Wide Web, the advent of companies serving hundreds of millions of users online, a dis-
semination of low-cost, high-quality sensors, inexpensive data storage (Kryder’s law), and
cheap computation (Moore’s law). In particular, the landscape of computation in deep
learning was revolutionized by advances in GPUs that were originally engineered for com-
puter gaming. Suddenly algorithms and models that seemed computationally infeasible
were within reach. This is best illustrated in tab_intro_decade.
Note that random-access memory has not kept pace with the growth in data. At the same
time, increases in computational power have outpaced the growth in datasets. This means
that statistical models need to become more memory efficient, and that they are free to spend more computer cycles optimizing parameters, thanks to the increased compute budget.
Consequently, the sweet spot in machine learning and statistics moved from (generalized)
linear models and kernel methods to deep neural networks. This is also one of the rea-
sons why many of the mainstays of deep learning, such as multilayer perceptrons (McCul-
loch and Pitts, 1943), convolutional neural networks (LeCun et al., 1998), long short-term
memory (Hochreiter and Schmidhuber, 1997), and Q-Learning (Watkins and Dayan, 1992),
were essentially “rediscovered” in the past decade, after lying comparatively dormant for
considerable time.
The recent progress in statistical models, applications, and algorithms has sometimes been
likened to the Cambrian explosion: a moment of rapid progress in the evolution of species.
Indeed, the state of the art is not just a mere consequence of available resources applied
to decades-old algorithms. Note that the list of ideas below barely scratches the surface of
what has helped researchers achieve tremendous progress over the past decade.
• Novel methods for capacity control, such as dropout (Srivastava et al., 2014), have helped
to mitigate overfitting. Here, noise is injected (Bishop, 1995) throughout the neural
network during training.
• Attention mechanisms solved a second problem that had plagued statistics for over a
century: how to increase the memory and complexity of a system without increasing
the number of learnable parameters. Researchers found an elegant solution by using
what can only be viewed as a learnable pointer structure (Bahdanau et al., 2014).
Rather than having to remember an entire text sequence, e.g., for machine translation
in a fixed-dimensional representation, all that needed to be stored was a pointer to the
intermediate state of the translation process. This allowed for significantly increased
accuracy for long sequences, since the model no longer needed to remember the entire
sequence before commencing the generation of a new one.
• Built solely on attention mechanisms, the Transformer architecture (Vaswani et al., 2017) has since become the dominant architecture for sequence modeling at scale (a minimal sketch of the underlying attention computation appears after this list).
• Modeling probabilities of text sequences, language models can predict text given other
text. Scaling up the data, model, and compute has unlocked a growing number of
capabilities of language models to perform desired tasks via human-like text genera-
tion based on input text (Anil et al., 2023, Brown et al., 2020, Chowdhery et al., 2022,
Hoffmann et al., 2022, OpenAI, 2023, Rae et al., 2021, Touvron et al., 2023a, Touvron
et al., 2023b). For instance, aligning language models with human intent (Ouyang et
al., 2022), OpenAI’s ChatGPT 30 allows users to interact with it in a conversational
30 way to solve problems, such as code debugging and creative writing.
• Multi-stage designs, e.g., via the memory networks (Sukhbaatar et al., 2015) and the neu-
ral programmer-interpreter (Reed and De Freitas, 2015) permitted statistical modelers
to describe iterative approaches to reasoning. These tools allow for an internal state of
the deep neural network to be modified repeatedly, thus carrying out subsequent steps
in a chain of reasoning, just as a processor can modify memory for a computation.
• A key development in deep generative modeling was the invention of generative adver-
sarial networks (Goodfellow et al., 2014). Traditionally, statistical methods for density
estimation and generative models focused on finding proper probability distributions
and (often approximate) algorithms for sampling from them. As a result, these algo-
rithms were largely limited by the lack of flexibility inherent in the statistical models.
The crucial innovation in generative adversarial networks was to replace the sampler
by an arbitrary algorithm with differentiable parameters. These are then adjusted in
such a way that the discriminator (effectively a two-sample test) cannot distinguish
fake from real data. Through the ability to use arbitrary algorithms to generate data,
density estimation was opened up to a wide variety of techniques. Examples of gal-
loping zebras (Zhu et al., 2017) and of fake celebrity faces (Karras et al., 2017) are
each testimony to this progress. Even amateur doodlers can produce photorealistic
images just based on sketches describing the layout of a scene (Park et al., 2019).
• Furthermore, while the diffusion process gradually adds random noise to data samples,
diffusion models (Ho et al., 2020, Sohl-Dickstein et al., 2015) learn the denoising pro-
cess to gradually construct data samples from random noise, reversing the diffusion
process. They have started to replace generative adversarial networks in more recent
deep generative models, such as in DALL-E 2 (Ramesh et al., 2022) and Imagen (Sa-
haria et al., 2022) for creative art and image generation based on text descriptions.
• In many cases, a single GPU is insufficient for processing the large amounts of data
available for training. Over the past decade the ability to build parallel and distributed
training algorithms has improved significantly. One of the key challenges in designing
scalable algorithms is that the workhorse of deep learning optimization, stochastic
gradient descent, relies on relatively small minibatches of data to be processed. At
the same time, small batches limit the efficiency of GPUs. Hence, training on 1,024
GPUs with a minibatch size of, say, 32 images per batch amounts to an aggregate
minibatch of about 32,000 images. Work, first by Li (2017) and subsequently by You et al. (2017) and Jia et al. (2018), pushed the size up to 64,000 observations, reducing training time for the ResNet-50 model on the ImageNet dataset to less than 7 minutes. By comparison, training times were initially of the order of days.
• The ability to parallelize computation has also contributed to progress in reinforcement
learning. This has led to significant progress in computers achieving superhuman
performance on tasks like Go, Atari games, Starcraft, and in physics simulations (e.g.,
using MuJoCo) where environment simulators are available. See, e.g., Silver et al. (2016) for a description of such achievements in AlphaGo. In a nutshell, reinforcement learning works best if plenty of (state, action, reward) tuples are available. Simulation provides such an avenue.
• Deep learning frameworks have played a crucial role in disseminating ideas. The first generation of open-source frameworks for neural network modeling consisted of Caffe, Torch, and Theano. Many seminal papers were written using these tools. These have now been superseded by TensorFlow (often used via its high-level API Keras), CNTK, Caffe 2, and Apache MXNet. The third generation of frameworks consists of so-called imperative tools for deep learning, a trend that was arguably ignited by Chainer, which used a syntax similar to Python NumPy to describe models. This idea was adopted by both PyTorch, the Gluon API of MXNet, and JAX.
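As promised above, here is a minimal sketch of the scaled dot-product attention underlying the Transformer (shapes are illustrative; this is not a full implementation):

import math
import torch

seq_len, d = 5, 8
queries = torch.randn(seq_len, d)
keys = torch.randn(seq_len, d)
values = torch.randn(seq_len, d)

# Similarity of each query to every key, scaled by sqrt(d) and
# normalized with softmax to give attention weights.
scores = queries @ keys.T / math.sqrt(d)
weights = torch.softmax(scores, dim=-1)

# Each output is an input-dependent mixture of the values: in effect,
# a learnable, differentiable pointer into the sequence.
output = weights @ values
print(output.shape)  # torch.Size([5, 8])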
The division of labor between system researchers building better tools and statistical modelers building better neural networks has greatly simplified things. For instance, training a linear logistic regression model used to be a nontrivial homework problem, worthy to give to new machine learning Ph.D. students at Carnegie Mellon University in 2014. By now, this task can be accomplished with under 10 lines of code, putting it firmly within the reach of any programmer.
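Indeed, a minimal sketch of such a model in PyTorch, trained on synthetic data for illustration, fits comfortably under ten lines:

import torch

X = torch.randn(100, 2)              # synthetic features
y = (X[:, 0] + X[:, 1] > 0).float()  # synthetic binary labels
model = torch.nn.Linear(2, 1)        # logistic regression: linear layer + sigmoid loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(100):
    loss = loss_fn(model(X).squeeze(-1), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()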
1.6 Success Stories
Artificial intelligence has a long history of delivering results that would be difficult to ac-
complish otherwise. For instance, mail sorting systems using optical character recognition
have been deployed since the 1990s. This is, after all, the source of the famous MNIST
dataset of handwritten digits. The same applies to reading checks for bank deposits and
scoring creditworthiness of applicants. Financial transactions are checked for fraud auto-
matically. This forms the backbone of many e-commerce payment systems, such as PayPal,
Stripe, AliPay, WeChat, Apple, Visa, and MasterCard. Computer programs for chess have
been competitive for decades. Machine learning feeds search, recommendation, personal-
ization, and ranking on the Internet. In other words, machine learning is pervasive, albeit
often hidden from sight.
It is only recently that AI has been in the limelight, mostly due to solutions to problems that
were considered intractable previously and that are directly related to consumers. Many of
such advances are attributed to deep learning.
• Intelligent assistants, such as Apple’s Siri, Amazon’s Alexa, and Google’s assistant, are
able to respond to spoken requests with a reasonable degree of accuracy. This in-
cludes menial jobs, like turning on light switches, and more complex tasks, such as
arranging barbers’ appointments and offering phone support. This is likely the
most noticeable sign that AI is affecting our lives.
• A key ingredient in digital assistants is their ability to recognize speech accurately. The
accuracy of such systems has gradually increased to the point of achieving parity with
humans for certain applications (Xiong et al., 2018).
• Object recognition has likewise come a long way. Identifying the object in a picture was
a fairly challenging task in 2010. On the ImageNet benchmark researchers from NEC
Labs and University of Illinois at Urbana-Champaign achieved a top-five error rate
of 28% (Lin et al., 2010). By 2017, this error rate was reduced to 2.25% (Hu et al.,
2018). Similarly, stunning results have been achieved for identifying birdsong and for
diagnosing skin cancer.
• Prowess in games used to provide a measuring stick for human ability. Starting from
TD-Gammon, a program for playing backgammon using temporal difference rein-
forcement learning, algorithmic and computational progress has led to algorithms for
a wide range of applications. Compared with backgammon, chess has a much more
complex state space and set of actions. Deep Blue beat Garry Kasparov using mas-
sive parallelism, special-purpose hardware and efficient search through the game tree
(Campbell et al., 2002). Go is more difficult still, due to its huge state space. AlphaGo
reached human parity in 2015, using deep learning combined with Monte Carlo tree
sampling (Silver et al., 2016). The challenge in Poker was that the state space is large
and only partially observed (we do not know the opponents’ cards). Libratus exceeded
human performance in Poker using efficiently structured strategies (Brown and Sand-
holm, 2017).
This barely scratches the surface of significant applications of machine learning. For in-
stance, robotics, logistics, computational biology, particle physics, and astronomy owe
some of their most impressive recent advances at least in parts to machine learning, which
is thus becoming a ubiquitous tool for engineers and scientists.
Fortunately, we are far from a sentient AI system that could deliberately manipulate its
human creators. First, AI systems are engineered, trained, and deployed in a specific, goal-
oriented manner. While their behavior might give the illusion of general intelligence, it is a
combination of rules, heuristics and statistical models that underlie the design. Second, at
present, there are simply no tools for artificial general intelligence that are able to improve
themselves, reason about themselves, and that are able to modify, extend, and improve their
own architecture while trying to solve general tasks.
A much more pressing concern is how AI is being used in our daily lives. It is likely that
many routine tasks, currently fulfilled by humans, can and will be automated. Farm robots
will likely reduce the costs for organic farmers but they will also automate harvesting op-
erations. This phase of the industrial revolution may have profound consequences for large
swaths of society, since menial jobs provide much employment in many countries. Fur-
thermore, statistical models, when applied without care, can lead to racial, gender, or age
bias and raise reasonable concerns about procedural fairness if automated to drive conse-
quential decisions. It is important to ensure that these algorithms are used with care. With
what we know today, this strikes us as a much more pressing concern than the potential of
malevolent superintelligence for destroying humanity.
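1.7 The Essence of Deep Learning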
Thus far, we have talked in broad terms about machine learning. Deep learning is the subset
of machine learning concerned with models based on many-layered neural networks. It is
deep in precisely the sense that its models learn many layers of transformations. While this
might sound narrow, deep learning has given rise to a dizzying array of models, techniques,
problem formulations, and applications. Many intuitions have been developed to explain
the benefits of depth. Arguably, all machine learning has many layers of computation, the
first consisting of feature processing steps. What differentiates deep learning is that the
operations at each of the many layers of representation are learned jointly from data.
The problems that we have discussed so far, such as learning from the raw audio signal, the
raw pixel values of images, or mapping between sentences of arbitrary lengths and their
counterparts in foreign languages, are those where deep learning excels and traditional
methods falter. It turns out that these many-layered models are capable of addressing low-
level perceptual data in a way that previous tools could not. Arguably the most significant
commonality in deep learning methods is end-to-end training. That is, rather than assem-
bling a system from components that are individually tuned, one builds the whole system and then tunes its components’ performance jointly. For instance, in computer vision, scientists used to
separate the process of feature engineering from the process of building machine learn-
ing models. The Canny edge detector (Canny, 1987) and Lowe’s SIFT feature extractor
(Lowe, 2004) reigned supreme for over a decade as algorithms for mapping images into
feature vectors. In bygone days, the crucial part of applying machine learning to these
problems consisted of coming up with manually-engineered ways of transforming the data
into some form amenable to shallow models. Unfortunately, there is only so much that
humans can accomplish by ingenuity in comparison with a consistent evaluation over mil-
lions of choices carried out automatically by an algorithm. When deep learning took over,
these feature extractors were replaced by automatically tuned filters that yielded superior
accuracy.
Thus, one key advantage of deep learning is that it replaces not only the shallow models at
the end of traditional learning pipelines, but also the labor-intensive process of feature engi-
neering. Moreover, by replacing much of the domain-specific preprocessing, deep learning
has eliminated many of the boundaries that previously separated computer vision, speech
recognition, natural language processing, medical informatics, and other application areas,
thereby offering a unified set of tools for tackling diverse problems.
Another difference from previous work is the acceptance of suboptimal solutions, dealing
with nonconvex nonlinear optimization problems, and the willingness to try things before
proving them. This new-found empiricism in dealing with statistical problems, combined
with a rapid influx of talent has led to rapid progress in the development of practical algo-
rithms, albeit in many cases at the expense of modifying and re-inventing tools that existed
for decades.
In the end, the deep learning community prides itself on sharing tools across academic and
corporate boundaries, releasing many excellent libraries, statistical models, and trained
networks as open source. It is in this spirit that the notebooks forming this book are freely
available for distribution and use. We have worked hard to lower the barriers of access for
anyone wishing to learn about deep learning and we hope that our readers will benefit from
this.
1.8 Summary
Machine learning studies how computer systems can leverage experience (often data) to
improve performance at specific tasks. It combines ideas from statistics, data mining, and
optimization. Often, it is used as a means of implementing AI solutions. As a class of
machine learning, representation learning focuses on how to automatically find the ap-
propriate way to represent data. Considered as multi-level representation learning through
learning many layers of transformations, deep learning replaces not only the shallow mod-
els at the end of traditional machine learning pipelines, but also the labor-intensive process
of feature engineering. Much of the recent progress in deep learning has been triggered
by an abundance of data arising from cheap sensors and Internet-scale applications, and
by significant progress in computation, mostly through GPUs. Furthermore, the availabil-
ity of efficient deep learning frameworks has made design and implementation of whole
system optimization significantly easier, and this is a key component in obtaining high
performance.
1.9 Exercises
1. Which parts of code that you are currently writing could be “learned”, i.e., improved
by learning and automatically determining design choices that are made in your code?
Does your code include heuristic design choices? What data might you need to learn
the desired behavior?
2. Which problems that you encounter have many examples for their solution, yet no spe-
cific way for automating them? These may be prime candidates for using deep learning.
3. Describe the relationships between algorithms, data, and computation. How do char-
acteristics of the data and the current available computational resources influence the
appropriateness of various algorithms?
4. Name some settings where end-to-end training is not currently the default approach but
where it might be useful.
Discussions.
2 Preliminaries
To prepare for your dive into deep learning, you will need a few survival skills: (i) tech-
niques for storing and manipulating data; (ii) libraries for ingesting and preprocessing data
from a variety of sources; (iii) knowledge of the basic linear algebraic operations that we
apply to high-dimensional data elements; (iv) just enough calculus to determine which di-
rection to adjust each parameter in order to decrease the loss function; (v) the ability to
automatically compute derivatives so that you can forget much of the calculus you just
learned; (vi) some basic fluency in probability, our primary language for reasoning under
uncertainty; and (vii) some aptitude for finding answers in the official documentation when
you get stuck.
In short, this chapter provides a rapid introduction to the basics that you will need to follow
most of the technical content in this book.
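2.1 Data Manipulation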
In order to get anything done, we need some way to store and manipulate data. Generally,
there are two important things we need to do with data: (i) acquire them; and (ii) process
them once they are inside the computer. There is no point in acquiring data without some
way to store it, so to start, let’s get our hands dirty with 𝑛-dimensional arrays, which we
also call tensors. If you already know the NumPy scientific computing package, this will be
a breeze. For all modern deep learning frameworks, the tensor class (ndarray in MXNet,
Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer fea-
tures added. First, the tensor class supports automatic differentiation. Second, it leverages
GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These
properties make neural networks both easy to code and fast to run.
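2.1.1 Getting Started
To start, we import the PyTorch library. Note that the package name is torch.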
import torch
A tensor represents a (possibly multidimensional) array of numerical values. With one axis, a tensor corresponds to a vector. With two axes, a tensor is called a matrix. With 𝑘 > 2 axes, we drop the specialized names and just refer to the object as a 𝑘 th -order tensor.
PyTorch provides a variety of functions for creating new tensors prepopulated with values.
For example, by invoking arange(n), we can create a vector of evenly spaced values, start-
ing at 0 (included) and ending at n (not included). By default, the interval size is 1. Unless
otherwise specified, new tensors are stored in main memory and designated for CPU-based
computation.
x = torch.arange(12, dtype=torch.float32)
x
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])
Each of these values is called an element of the tensor. The tensor x contains 12 elements.
We can inspect the total number of elements in a tensor via its numel method.
x.numel()
12
We can access a tensor’s shape (the length along each axis) by inspecting its shape attribute.
Because we are dealing with a vector here, the shape contains just a single element and is
identical to the size.
x.shape
torch.Size([12])
We can change the shape of a tensor without altering its size or values, by invoking reshape.
For example, we can transform our vector x whose shape is (12,) to a matrix X with shape
(3, 4). This new tensor retains all elements but reconfigures them into a matrix. Notice that
the elements of our vector are laid out one row at a time and thus x[3] == X[0, 3].
X = x.reshape(3, 4)
X
Note that specifying every shape component to reshape is redundant. Because we already
know our tensor’s size, we can work out one component of the shape given the rest. For
example, given a tensor of size 𝑛 and target shape (ℎ, 𝑤), we know that 𝑤 = 𝑛/ℎ. To
automatically infer one component of the shape, we can place a -1 for the shape component
that should be inferred automatically. In our case, instead of calling x.reshape(3, 4), we
could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1).
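For instance, we can verify directly that the placeholder yields the same result:

(x.reshape(3, 4) == x.reshape(-1, 4)).all(), (x.reshape(3, 4) == x.reshape(3, -1)).all()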
Practitioners often need to work with tensors initialized to contain all 0s or 1s. We can
construct a tensor with all elements set to 0 and a shape of (2, 3, 4) via the zeros func-
tion.
torch.zeros((2, 3, 4))
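Similarly, we can create a tensor with all elements set to 1 by invoking ones.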
torch.ones((2, 3, 4))
We often wish to sample each element randomly (and independently) from a given prob-
ability distribution. For example, the parameters of neural networks are often initialized
randomly. The following snippet creates a tensor with elements drawn from a standard
Gaussian (normal) distribution with mean 0 and standard deviation 1.
torch.randn(3, 4)
Finally, we can construct tensors by supplying the exact values for each element via (possibly nested) Python lists containing numerical literals. Here, we construct a matrix from a list of lists, where the outermost list corresponds to axis 0 and the inner lists correspond to axis 1.
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
tensor([[2, 1, 4, 3],
[1, 2, 3, 4],
[4, 3, 2, 1]])
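2.1.2 Indexing and Slicing
As with Python lists, we can access tensor elements by indexing, starting with 0; negative indices count from the end. We can also access ranges via slicing (e.g., X[start:stop]), where the returned value includes the first index but not the last. Below, [-1] selects the last row and [1:3] selects the second and third rows.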
X[-1], X[1:3]
Beyond reading them, we can also write elements of a matrix by specifying indices.
X[1, 2] = 17
X
If we want to assign multiple elements the same value, we apply the indexing on the left-
hand side of the assignment operation. For instance, [:2, :] accesses the first and second
rows, where : takes all the elements along axis 1 (columns). While we discussed indexing
for matrices, this also works for vectors and for tensors of more than two dimensions.
X[:2, :] = 12
X
2.1.3 Operations
Now that we know how to construct tensors and how to read from and write to their ele-
ments, we can begin to manipulate them with various mathematical operations. Among the
most useful of these are the elementwise operations. These apply a standard scalar opera-
tion to each element of a tensor. For functions that take two tensors as inputs, elementwise
operations apply some standard binary operator on each pair of corresponding elements.
We can create an elementwise function from any function that maps from a scalar to a
scalar.
In mathematical notation, we denote such unary scalar operators (taking one input) by the
signature 𝑓 : R → R. This just means that the function maps from any real number onto
some other real number. Most standard operators, including unary ones like 𝑒 𝑥 , can be
applied elementwise.
torch.exp(x)
Likewise, we denote binary scalar operators, which map pairs of real numbers to a (single)
real number via the signature 𝑓 : R, R → R. Given any two vectors u and v of the
same shape, and a binary operator 𝑓 , we can produce a vector c = 𝐹 (u, v) by setting
𝑐 𝑖 ← 𝑓 (𝑢 𝑖 , 𝑣 𝑖 ) for all 𝑖, where 𝑐 𝑖 , 𝑢 𝑖 , and 𝑣 𝑖 are the 𝑖 th elements of vectors c, u, and v.
Here, we produced the vector-valued 𝐹 : R𝑑 , R𝑑 → R𝑑 by lifting the scalar function to an
elementwise vector operation. The common standard arithmetic operators for addition (+),
subtraction (-), multiplication (*), division (/), and exponentiation (**) have all been lifted
to elementwise operations on tensors of arbitrary shape, so long as the two operands share the same shape.
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y
We can also concatenate multiple tensors, stacking them end-to-end to form a larger one.
We just need to provide a list of tensors and tell the system along which axis to concatenate.
The example below shows what happens when we concatenate two matrices along rows
(axis 0) instead of columns (axis 1). We can see that the first output’s axis-0 length (6) is
the sum of the two input tensors’ axis-0 lengths (3 + 3); while the second output’s axis-1
length (8) is the sum of the two input tensors’ axis-1 lengths (4 + 4).
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
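Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example: for each position, if X and Y are equal at that position, the corresponding entry in the result takes value True, and otherwise False.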
X == Y
Summing all the elements in the tensor yields a tensor with only one element.
X.sum()
tensor(66.)
2.1.4 Broadcasting
By now, you know how to perform elementwise binary operations on two tensors of the
same shape. Under certain conditions, even when shapes differ, we can still perform ele-
mentwise binary operations by invoking the broadcasting mechanism. Broadcasting works
according to the following two-step procedure: (i) expand one or both arrays by copying
elements along axes with length 1 so that after this transformation, the two tensors have the
same shape; (ii) perform an elementwise operation on the resulting arrays.
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b
(tensor([[0],
[1],
[2]]),
tensor([[0, 1]]))
Since a and b are 3 × 1 and 1 × 2 matrices, respectively, their shapes do not match up.
Broadcasting produces a larger 3 × 2 matrix by replicating matrix a along the columns and
matrix b along the rows before adding them elementwise.
a + b
tensor([[0, 1],
[1, 2],
[2, 3]])
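2.1.5 Saving Memory
Running operations can cause new memory to be allocated to host results. For example, if we write Y = Y + X, we dereference the tensor that Y used to point to and instead point Y at the newly allocated memory. We can demonstrate this with Python's id() function, which gives us the address of the referenced object in memory: after we run Y = Y + X, id(Y) points to a different location.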
before = id(Y)
Y = Y + X
id(Y) == before
False
This might be undesirable for two reasons. First, we do not want to run around allocat-
ing memory unnecessarily all the time. In machine learning, we often have hundreds of
megabytes of parameters and update all of them multiple times per second. Whenever
possible, we want to perform these updates in place. Second, we might point at the same
parameters from multiple variables. If we do not update in place, we must be careful to
update all of these references, lest we spring a memory leak or inadvertently refer to stale
parameters.
Fortunately, performing in-place operations is easy. We can assign the result of an oper-
ation to a previously allocated array Y by using slice notation: Y[:] = <expression>.
To illustrate this concept, we overwrite the values of tensor Z, after initializing it, using
zeros_like, to have the same shape as Y.
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))
id(Z): 140381179266448
id(Z): 140381179266448
If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y
or X += Y to reduce the memory overhead of the operation.
before = id(X)
X += Y
id(X) == before
True
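2.1.6 Conversion to Other Python Objects
Converting to a NumPy tensor (ndarray), or vice versa, is easy. The torch tensor and NumPy array will share their underlying memory, and changing one through an in-place operation will also change the other.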
A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)
(numpy.ndarray, torch.Tensor)
To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python’s
built-in functions.
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)
2.1.7 Summary
The tensor class is the main interface for storing and manipulating data in deep learning li-
braries. Tensors provide a variety of functionalities including construction routines; index-
ing and slicing; basic mathematics operations; broadcasting; memory-efficient assignment;
and conversion to and from other Python objects.
2.1.8 Exercises
1. Run the code in this section. Change the conditional statement X == Y to X < Y or X >
Y, and then see what kind of tensor you can get.
2. Replace the two tensors that operate by element in the broadcasting mechanism with
other shapes, e.g., 3-dimensional tensors. Is the result the same as expected?
Discussions.
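2.2 Data Preprocessing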
So far, we have been working with synthetic data that arrived in ready-made tensors. How-
ever, to apply deep learning in the wild we must extract messy data stored in arbitrary
formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will give you a crash course on some of the most common routines.
2.2.1 Reading the Dataset
Comma-separated values (CSV) files are ubiquitous for the storing of tabular (spreadsheet-
like) data. In them, each line corresponds to one record and consists of several (comma-
separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field
of gravitational physics”. To demonstrate how to load CSV files with pandas, we create a
CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where
each row corresponds to a distinct home and the columns correspond to the number of
rooms (NumRooms), the roof type (RoofType), and the price (Price).
import os
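For example, we can write such a file as follows (the specific rows are illustrative):

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')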
Now let’s import pandas and load the dataset with read_csv.
import pandas as pd
data = pd.read_csv(data_file)
print(data)
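2.2.2 Data Preparation
In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out the columns corresponding to input versus target values; we can select columns either by name or via integer-location based indexing (iloc). For categorical input fields, we can treat NaN as a category of its own. A minimal sketch, assuming the house_tiny.csv layout above:

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)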
For missing numerical values, one common heuristic is to replace the NaN entries with the
mean value of the corresponding column.
inputs = inputs.fillna(inputs.mean())
print(inputs)
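2.2.3 Conversion to the Tensor Format
Now that all the entries in inputs and targets are numerical, we can load them into a tensor (recall Section 2.1).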
import torch
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y
2.2.4 Discussion
You now know how to partition data columns, impute missing variables, and load pan-
das data into tensors. In Section 5.7, you will pick up some more data processing skills.
While this crash course kept things simple, data processing can get hairy. For example,
rather than arriving in a single CSV file, our dataset might be spread across multiple files
extracted from a relational database. For instance, in an e-commerce application, customer
addresses might live in one table and purchase data in another. Moreover, practitioners face
myriad data types beyond categorical and numeric, for example, text strings, images, audio
data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required
in order to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing. Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model. Data visualization tools such as seaborn, Bokeh, or matplotlib can help you to manually inspect the data and develop intuitions about the type of problems you may need to address.
2.2.5 Exercises
1. Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository, and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
2. Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.
3. How large a dataset do you think you could load this way? What might be the limita-
tions? Hint: consider the time to read the data, representation, processing, and memory
footprint. Try this out on your laptop. What happens if you try it out on a server?
4. How would you deal with data that has a very large number of categories? What if the
category labels are all unique? Should you include the latter?
5. What alternatives to pandas can you think of? How about loading NumPy tensors from
a file? Check out Pillow, the Python Imaging Library.
Discussions.
2.3 Linear Algebra
By now, we can load datasets into tensors and manipulate these tensors with basic math-
ematical operations. To start building sophisticated models, we will also need a few tools
from linear algebra. This section offers a gentle introduction to the most essential concepts,
starting from scalar arithmetic and ramping up to matrix multiplication.
import torch
2.3.1 Scalars
Most everyday mathematics consists of manipulating numbers one at a time. Formally, we
call these values scalars. For example, the temperature in Palo Alto is a balmy 72 degrees
Fahrenheit. If you wanted to convert the temperature to Celsius you would evaluate the
expression 𝑐 = (5/9)( 𝑓 − 32), setting 𝑓 to 72. In this equation, the values 5, 9, and 32 are
constant scalars. The variables 𝑐 and 𝑓 in general represent unknown scalars.
We denote scalars by ordinary lower-cased letters (e.g., 𝑥, 𝑦, and 𝑧) and the space of all
(continuous) real-valued scalars by R. For expedience, we will skip past rigorous defini-
tions of spaces: just remember that the expression 𝑥 ∈ R is a formal way to say that 𝑥 is
a real-valued scalar. The symbol ∈ (pronounced “in”) denotes membership in a set. For
example, 𝑥, 𝑦 ∈ {0, 1} indicates that 𝑥 and 𝑦 are variables that can only take values 0 or
1.
Scalars are implemented as tensors that contain only one element. Below, we assign two
scalars and perform the familiar addition, multiplication, division, and exponentiation op-
erations.
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))
2.3.2 Vectors
For current purposes, you can think of a vector as a fixed-length array of scalars. As with
their code counterparts, we call these scalars the elements of the vector (synonyms include
entries and components). When vectors represent examples from real-world datasets, their
values hold some real-world significance. For example, if we were training a model to
predict the risk of a loan defaulting, we might associate each applicant with a vector whose
components correspond to quantities like their income, length of employment, or number of
previous defaults. If we were studying the risk of heart attack, each vector might represent
a patient and its components might correspond to their most recent vital signs, cholesterol
levels, minutes of exercise per day, etc. We denote vectors by bold lowercase letters (e.g.,
x, y, and z).
Vectors are implemented as 1st -order tensors. In general, such tensors can have arbitrary
lengths, subject to memory limitations. Caution: in Python, as in most programming lan-
guages, vector indices start at 0, also known as zero-based indexing, whereas in linear
algebra subscripts begin at 1 (one-based indexing).
x = torch.arange(3)
x
tensor([0, 1, 2])
We can refer to an element of a vector by using a subscript. For example, 𝑥2 denotes the
second element of x. Since 𝑥2 is a scalar, we do not bold it. By default, we visualize vectors
by stacking their elements vertically.
$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \tag{2.3.1}$$

where $x_1, \ldots, x_n$ are elements of the vector. Later on, we will distinguish between such
column vectors and row vectors whose elements are stacked horizontally. Recall that we
access a tensor’s elements via indexing.
x[2]
tensor(2)
In code, we access a vector's number of elements via Python's built-in len function.

len(x)

3
We can also access the length via the shape attribute. The shape is a tuple that indicates
a tensor’s length along each axis. Tensors with just one axis have shapes with just one
element.
x.shape
torch.Size([3])
Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the
length along a particular axis. To avoid this confusion, we use order to refer to the number
of axes and dimensionality exclusively to refer to the number of components.
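To make the distinction concrete, here is a small check of our own (an illustrative snippet; the ndim attribute reports the order, i.e., the number of axes):

# x is a 1st-order tensor with dimensionality 3; M is a 2nd-order tensor
x = torch.arange(3)
M = torch.zeros(2, 3)
x.ndim, x.shape[0], M.ndim  # (1, 3, 2)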
2.3.3 Matrices
Just as scalars are 0th -order tensors and vectors are 1st -order tensors, matrices are 2nd -order
tensors. We denote matrices by bold capital letters (e.g., X, Y, and Z), and represent them
in code by tensors with two axes. The expression A ∈ R𝑚×𝑛 indicates that a matrix A
contains 𝑚 × 𝑛 real-valued scalars, arranged as 𝑚 rows and 𝑛 columns. When 𝑚 = 𝑛, we
say that a matrix is square. Visually, we can illustrate any matrix as a table. To refer to an
individual element, we subscript both the row and column indices, e.g., 𝑎 𝑖 𝑗 is the value that
belongs to A’s 𝑖 th row and 𝑗 th column:
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.2}$$
In code, we represent a matrix A ∈ R𝑚×𝑛 by a 2nd -order tensor with shape (𝑚, 𝑛). We can
convert any appropriately sized 𝑚 × 𝑛 tensor into an 𝑚 × 𝑛 matrix by passing the desired
shape to reshape:
A = torch.arange(6).reshape(3, 2)
A
tensor([[0, 1],
[2, 3],
[4, 5]])
Sometimes we want to flip the axes. When we exchange a matrix’s rows and columns, the
result is called its transpose. Formally, we signify a matrix A's transpose by A⊤; if B = A⊤,
then 𝑏ᵢⱼ = 𝑎ⱼᵢ for all 𝑖 and 𝑗. In code, we can access a matrix's transpose via the T attribute:
A.T
tensor([[0, 2, 4],
[1, 3, 5]])
Symmetric matrices are the subset of square matrices that are equal to their own transposes:
A = A⊤. The following matrix is symmetric:
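Here is one such matrix, constructed for illustration:

# A small symmetric matrix: S equals its own transpose
S = torch.tensor([[1, 2, 3],
                  [2, 0, 4],
                  [3, 4, 5]])
S == S.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])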
Matrices are useful for representing datasets. Typically, rows correspond to individual
records and columns correspond to distinct attributes.
2.3.4 Tensors
While you can go far in your machine learning journey with only scalars, vectors, and
matrices, eventually you may need to work with higher-order tensors. Tensors give us
a generic way of describing extensions to 𝑛th -order arrays. We call software objects of
the tensor class “tensors” precisely because they too can have arbitrary numbers of axes.
While it may be confusing to use the word tensor for both the mathematical object and its
realization in code, our meaning should usually be clear from context. We denote general
tensors by capital letters with a special font face (e.g., X, Y, and Z) and their indexing
mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) follows naturally from that of matrices.
Tensors will become more important when we start working with images. Each image
arrives as a 3rd -order tensor with axes corresponding to the height, width, and channel. At
each spatial location, the intensities of each color (red, green, and blue) are stacked along the
channel. Furthermore, a collection of images is represented in code by a 4th -order tensor,
where distinct images are indexed along the first axis. Higher-order tensors are constructed,
as were vectors and matrices, by growing the number of shape components.
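As a quick illustration (the sizes here are arbitrary), a mini-batch of sixteen 64×64 color images stored with a trailing channel axis is a 4th-order tensor:

# A hypothetical batch of 16 images, each 64 pixels high and wide, with 3 color channels
images = torch.zeros(16, 64, 64, 3)
images.ndim, images.shape  # (4, torch.Size([16, 64, 64, 3]))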
torch.arange(24).reshape(2, 3, 4)
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

2.3.5 Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For
example, elementwise operations produce outputs that have the same shape as their operands.

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))
The elementwise product of two matrices is called their Hadamard product (denoted ⊙).
We can spell out the entries of the Hadamard product of two matrices A, B ∈ R^{𝑚×𝑛}:

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix}. \tag{2.3.4}$$
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])
Adding or multiplying a scalar and a tensor produces a result with the same shape as
the original tensor. Here, each element of the tensor is added to (or multiplied by) the
scalar.
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],

         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))
2.3.6 Reduction
Often, we wish to calculate the sum of a tensor's elements. To express the sum of the
elements in a vector x of length 𝑛, we write $\sum_{i=1}^{n} x_i$. There is a simple function for it:

x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))
To express sums over the elements of tensors of arbitrary shape, we simply sum over all
its axes. For example, the sum of the elements of an 𝑚 × 𝑛 matrix A could be written
$\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
A.shape, A.sum()
By default, invoking the sum function reduces a tensor along all of its axes, eventually
producing a scalar. Our libraries also allow us to specify the axes along which the tensor
should be reduced. To sum over all elements along the rows (axis 0), we specify axis=0 in
sum. Since the input matrix reduces along axis 0 to generate the output vector, this axis is
missing from the shape of the output.
A.shape, A.sum(axis=0).shape
Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of
all the columns.
A.shape, A.sum(axis=1).shape
Reducing a matrix along both rows and columns via summation is equivalent to summing
up all the elements of the matrix.

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

tensor(True)
A related quantity is the mean, also called the average. We calculate the mean by dividing
the sum by the total number of elements. Because computing the mean is so common, it
gets a dedicated library function that works analogously to sum.

A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))
Likewise, the function for calculating the mean can also reduce a tensor along specific
axes.

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

2.3.7 Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the
function for calculating the sum or mean. This matters when we want to use the broadcast
mechanism.

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))
For instance, since sum_A keeps its two axes after summing each row, we can divide A by
sum_A with broadcasting to create a matrix where each row sums up to 1.
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
If we want to calculate the cumulative sum of elements of A along some axis, say axis=0
(row by row), we can call the cumsum function. By design, this function does not reduce
the input tensor along any axis.
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])
2.3.8 Dot Products

One of the most fundamental operations is the dot product. Given two vectors x, y ∈ Rⁿ,
their dot product x⊤y (also known as the inner product, ⟨x, y⟩) is a sum over the products
of the elements at the same position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.

y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

Equivalently, we can calculate the dot product of two vectors by performing an elementwise
multiplication followed by a sum:

torch.sum(x * y)

tensor(3.)
Dot products are useful in a wide range of contexts. For example, given some set of val-
ues, denoted by a vector x ∈ R𝑛 , and a set of weights, denoted by w ∈ R𝑛 , the weighted
sum of the values in x according to the weights w could be expressed as the dot product
x⊤w. When the weights are nonnegative and sum to 1, i.e., $\sum_{i=1}^{n} w_i = 1$, the dot
product expresses a weighted average. After normalizing two vectors to have unit length, the
dot products express the cosine of the angle between them. Later in this section, we will
formally introduce this notion of length.
2.3.9 Matrix–Vector Products

Now that we know how to calculate dot products, we can begin to understand the product
between an 𝑚 × 𝑛 matrix A and an 𝑛-dimensional vector x. Visualizing A in terms of its
rows, where $\mathbf{a}_i^\top$ denotes the 𝑖-th row, the matrix–vector product Ax is simply a column
vector of length 𝑚 whose 𝑖-th element is the dot product $\mathbf{a}_i^\top \mathbf{x}$:

$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_m^\top \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{a}_1^\top \mathbf{x} \\ \mathbf{a}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{a}_m^\top \mathbf{x} \end{bmatrix}. \tag{2.3.6}$$
To express a matrix–vector product in code, we use the mv function. Note that the column
dimension of A (its length along axis 1) must be the same as the dimension of x (its length).
Python has a convenience operator @ that can execute both matrix–vector and matrix–matrix
products (depending on its arguments). Thus we can write A@x.
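As a minimal check (a sketch reusing the running example, where A is the 2×3 float matrix and x = [0., 1., 2.]):

# Matrix–vector product: A is 2x3 and x has 3 elements, so Ax has 2 elements
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)
torch.mv(A, x), A @ x  # Both yield tensor([ 5., 14.])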
2.3.10 Matrix–Matrix Multiplication

Once you have the hang of dot products and matrix–vector products, matrix–matrix
multiplication should be straightforward. Say that we have two matrices A ∈ R^{𝑛×𝑘} and
B ∈ R^{𝑘×𝑚}:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{bmatrix}. \tag{2.3.7}$$
Let $\mathbf{a}_i^\top \in \mathbb{R}^k$ denote the row vector representing the $i$-th row of the matrix A and let
$\mathbf{b}_j \in \mathbb{R}^k$ denote the column vector representing the $j$-th column of the matrix B:
$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix}. \tag{2.3.8}$$
To form the matrix product C ∈ R^{𝑛×𝑚}, we simply compute each element 𝑐ᵢⱼ as the dot
product between the 𝑖-th row of A and the 𝑗-th column of B, i.e., $\mathbf{a}_i^\top \mathbf{b}_j$:
B = torch.ones(3, 4)
torch.mm(A, B), A@B

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))
2.3.11 Norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a
vector tells us how big it is. For instance, the ℓ2 norm measures the (Euclidean) length of a
vector. Here, we are employing a notion of size that concerns the magnitude of a vector’s
components (not its dimensionality).
A norm is a function ‖ · ‖ that maps a vector to a scalar and satisfies the following three
properties:

1. Given any vector x, if we scale (all elements of) the vector by a scalar 𝛼 ∈ R, its norm
scales accordingly: ‖𝛼x‖ = |𝛼| ‖x‖.

2. For any vectors x and y, norms satisfy the triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖.

3. The norm of a vector is nonnegative and it only vanishes if the vector is zero: ‖x‖ > 0
for all x ≠ 0.
Many functions are valid norms and different norms encode different notions of size. The
Euclidean norm that we all learned in elementary school geometry when calculating the
hypotenuse of a right triangle is the square root of the sum of squares of a vector’s elements.
Formally, this is called the ℓ2 norm and expressed as
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}. \tag{2.3.13}$$
u = torch.tensor([3.0, -4.0])
torch.norm(u)
tensor(5.)
The ℓ1 norm is also common and the associated measure is called the Manhattan distance.
By definition, the ℓ1 norm sums the absolute values of a vector’s elements:
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|. \tag{2.3.14}$$
torch.abs(u).sum()
tensor(7.)
Both the ℓ2 and ℓ1 norms are special cases of the more general ℓ 𝑝 norms:
$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \tag{2.3.15}$$
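Equation (2.3.15) is easy to evaluate directly from the definition. As a small sketch of our own (not a dedicated library routine), here is the ℓ3 norm of the vector u from above:

# l_p norm computed from the definition, here with p = 3
p = 3
(torch.abs(u) ** p).sum() ** (1 / p)  # (27 + 64)^(1/3), roughly 4.4979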
In the case of matrices, matters are more complicated. After all, matrices can be viewed
both as collections of individual entries and as objects that operate on vectors and transform
them into other vectors. For instance, we can ask by how much longer the matrix–vector
product Xv could be relative to v. This line of thought leads to what is called the spectral
norm. For now, we introduce the Frobenius norm, which is much easier to compute and
defined as the square root of the sum of the squares of a matrix’s elements:
$$\|\mathbf{X}\|_\text{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}. \tag{2.3.16}$$
torch.norm(torch.ones((4, 9)))
tensor(6.)
While we do not want to get too far ahead of ourselves, we can already build some intuition
about why these concepts are useful. In deep learning, we are often trying to solve
optimization problems: maximize the probability assigned to observed data; maximize the
revenue associated with a recommender model; minimize the distance between predictions
and the ground truth observations; minimize the distance between representations of photos
of the same person while maximizing the distance between representations of photos of dif-
ferent people. These distances, which constitute the objectives of deep learning algorithms,
are often expressed as norms.
2.3.12 Discussion
In this section, we have reviewed all the linear algebra that you will need to understand a
significant chunk of modern deep learning. There is a lot more to linear algebra, though,
and much of it is useful for machine learning. For example, matrices can be decomposed
into factors, and these decompositions can reveal low-dimensional structure in real-world
datasets. There are entire subfields of machine learning that focus on using matrix decom-
positions and their generalizations to high-order tensors to discover structure in datasets
and solve prediction problems. But this book focuses on deep learning. And we believe
you will be more inclined to learn more mathematics once you have gotten your hands dirty
applying machine learning to real datasets. So while we reserve the right to introduce more
mathematics later on, we wrap up this section here.
If you are eager to learn more linear algebra, there are many excellent books and online
resources. For a more advanced crash course, consider checking out Strang (1993), Kolter
(2008), and Petersen and Pedersen (2008).
To recap:
• Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear
algebra and have zero, one, two, and an arbitrary number of axes, respectively.
• Tensors can be sliced or reduced along specified axes via indexing, or operations such
as sum and mean, respectively.
53 Linear Algebra
• Elementwise products are called Hadamard products. By contrast, dot products, matrix–
vector products, and matrix–matrix products are not elementwise operations and in
general return objects having shapes that are different from those of the operands.
• Norms capture various notions of the magnitude of a vector (or matrix), and are com-
monly applied to the difference of two vectors to measure their distance apart.
• Common vector norms include the ℓ1 and ℓ2 norms, and common matrix norms include
the spectral and Frobenius norms.
2.3.13 Exercises
1. Prove that the transpose of the transpose of a matrix is the matrix itself: (A⊤)⊤ = A.
2. Given two matrices A and B, show that sum and transposition commute: A⊤ + B⊤ =
(A + B)⊤.
3. Given any square matrix A, is A + A⊤ always symmetric? Can you prove the result by
using only the results of the previous two exercises?
4. We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)?
Write your answer without implementing any code, then check your answer using code.
5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a
certain axis of X? What is that axis?
6. Run A / A.sum(axis=1) and see what happens. Can you analyze the results?
7. When traveling between two points in downtown Manhattan, what is the distance that
you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can
you travel diagonally?
8. Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs
along axes 0, 1, and 2?
9. Feed a tensor with three or more axes to the linalg.norm function and observe its
output. What does this function compute for tensors of arbitrary shape?
10. Consider three large matrices A, B, and C whose shapes make the product ABC well
defined, initialized with Gaussian random variables. Is there any difference in memory
footprint and speed, depending on whether you compute (AB)C or A(BC)? Why?

11. Consider matrices A, B, and C with shapes such that both AB and AC⊤ are defined. Is
there any difference in speed depending on whether you compute AB or AC⊤? Why?
What changes if you initialize C = B⊤ without cloning memory? Why?
12. Consider three matrices, say A, B, C ∈ R^{100×200}. Construct a tensor with three axes by
stacking [A, B, C]. What is the dimensionality? Slice out the second coordinate of the
third axis to recover B. Check that your answer is correct.
Discussions.
2.4 Calculus
For a long time, how to calculate the area of a circle remained a mystery. Then, in Ancient
Greece, the mathematician Archimedes came up with the clever idea to inscribe a series of
polygons with increasing numbers of vertices on the inside of a circle (Fig. 2.4.1). For a
polygon with 𝑛 vertices, we obtain 𝑛 triangles. The height of each triangle approaches the
radius 𝑟 as we partition the circle more finely. At the same time, its base approaches 2𝜋𝑟/𝑛,
since the ratio between arc and secant approaches 1 for a large number of vertices. Thus,
the area of the polygon approaches $n \cdot \frac{1}{2} \cdot r \cdot (2\pi r / n) = \pi r^2$.
Fig. 2.4.1 Finding the area of a circle as a limit procedure.
This limiting procedure is at the root of both differential calculus and integral calculus. The
former can tell us how to increase or decrease a function’s value by manipulating its argu-
ments. This comes in handy for the optimization problems that we face in deep learning,
where we repeatedly update our parameters in order to decrease the loss function. Opti-
mization addresses how to fit our models to training data, and calculus is its key prerequisite.
However, do not forget that our ultimate goal is to perform well on previously unseen data.
That problem is called generalization and will be a key focus of other chapters.
%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l
2.4.1 Derivatives and Differentiation

Put simply, a derivative is the rate of change in a function with respect to changes in its
arguments. Given a function 𝑓 : R → R, the derivative of 𝑓 at a point 𝑥 is defined as

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \tag{2.4.1}$$

The term on the right-hand side is called a limit and it tells us what happens to the value of
an expression as a specified variable approaches a particular value. This limit tells us what
the ratio between a perturbation ℎ and the change in the function value 𝑓 (𝑥 + ℎ) − 𝑓 (𝑥)
converges to as we shrink its size to zero.
When 𝑓 ′(𝑥) exists, 𝑓 is said to be differentiable at 𝑥; and when 𝑓 ′(𝑥) exists for all 𝑥 on a
set, e.g., the interval [𝑎, 𝑏], we say that 𝑓 is differentiable on this set. Not all functions are
differentiable, including many that we wish to optimize, such as accuracy and the area under
the receiver operating characteristic (AUC). However, because computing the derivative
of the loss is a crucial step in nearly all algorithms for training deep neural networks, we
often optimize a differentiable surrogate instead.
We can interpret the derivative 𝑓 ′(𝑥) as the instantaneous rate of change of 𝑓 (𝑥) with
respect to 𝑥. Let's develop some intuition with an example. Define 𝑢 = 𝑓 (𝑥) = 3𝑥² − 4𝑥.
def f(x):
    return 3 * x ** 2 - 4 * x
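Setting 𝑥 = 1, we can watch the difference quotient in (2.4.1) approach 2 as ℎ shrinks (a small numerical sketch of our own):

# Difference quotient (f(x+h) - f(x)) / h at x = 1 for shrinking h
for h in 10.0 ** np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1 + h) - f(1)) / h:.5f}')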
There are several equivalent notational conventions for derivatives. Given 𝑦 = 𝑓 (𝑥), the
following expressions are equivalent:
$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = D f(x) = D_x f(x), \tag{2.4.2}$$
where the symbols $\frac{d}{dx}$ and $D$ are differentiation operators. Below, we present the
derivatives of some common functions:
$$\begin{aligned} \frac{d}{dx} C &= 0 && \text{for any constant } C \\ \frac{d}{dx} x^n &= n x^{n-1} && \text{for } n \neq 0 \\ \frac{d}{dx} e^x &= e^x \\ \frac{d}{dx} \ln x &= x^{-1}. \end{aligned} \tag{2.4.3}$$
Functions composed from differentiable functions are often themselves differentiable. The
following rules come in handy for working with compositions of any differentiable func-
tions 𝑓 and 𝑔, and constant 𝐶.
$$\begin{aligned} \frac{d}{dx} [C f(x)] &= C \frac{d}{dx} f(x) && \text{Constant multiple rule} \\ \frac{d}{dx} [f(x) + g(x)] &= \frac{d}{dx} f(x) + \frac{d}{dx} g(x) && \text{Sum rule} \\ \frac{d}{dx} [f(x) g(x)] &= f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) && \text{Product rule} \\ \frac{d}{dx} \frac{f(x)}{g(x)} &= \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} && \text{Quotient rule} \end{aligned} \tag{2.4.4}$$
Using this, we can apply the rules to find the derivative of 3𝑥² − 4𝑥 via

$$\frac{d}{dx} [3x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4. \tag{2.4.5}$$
Plugging in 𝑥 = 1 shows that, indeed, the derivative equals 2 at this location. Note that
derivatives tell us the slope of a function at a particular location.
2.4.2 Visualization Utilities

Conveniently, we can set figure sizes with set_figsize. Since the import statement from
matplotlib import pyplot as plt was marked via #@save in the d2l package, we can
call d2l.plt.
The set_axes function can associate axes with properties, including labels, ranges, and
scales.
#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim), axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()
With these three functions, we can define a plot function to overlay multiple curves. Much
of the code here is just ensuring that the sizes and shapes of inputs match.
#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if X (tensor or list) has one axis
        return (hasattr(X, "ndim") and X.ndim == 1 or
                isinstance(X, list) and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    set_figsize(figsize)
    if axes is None:
        axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
Now we can plot the function 𝑢 = 𝑓 (𝑥) and its tangent line 𝑦 = 2𝑥 − 3 at 𝑥 = 1, where the
coefficient 2 is the slope of the tangent line.
x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
2.4.3 Partial Derivatives and Gradients

When there is no ambiguity, ∇x 𝑓 (x) is typically replaced by ∇ 𝑓 (x). The following rules
come in handy for differentiating multivariate functions:

• For all A ∈ R^{𝑚×𝑛} we have ∇x Ax = A⊤ and ∇x x⊤A = A.

• For square matrices A ∈ R^{𝑛×𝑛} we have that ∇x x⊤Ax = (A + A⊤)x and in particular
∇x ‖x‖² = ∇x x⊤x = 2x.

Similarly, for any matrix X, we have ∇X ‖X‖²_F = 2X.
2.4.4 Chain Rule

When 𝑦 depends on x only via an intermediate vector of variables u = 𝑔(x), the chain rule
yields ∇x 𝑦 = A ∇u 𝑦, where A ∈ R^{𝑛×𝑚} is a matrix that contains the derivative of vector u
with respect to vector x. Thus, evaluating the gradient requires computing a vector–matrix
product. This is one of the key reasons why linear algebra is such an integral building block
in building deep learning systems.
2.4.5 Discussion
While we have just scratched the surface of a deep topic, a number of concepts already come
into focus: first, the composition rules for differentiation can be applied routinely, enabling
us to compute gradients automatically. This task requires no creativity and thus we can
focus our cognitive powers elsewhere. Second, computing the derivatives of vector-valued
functions requires us to multiply matrices as we trace the dependency graph of variables
from output to input. In particular, this graph is traversed in a forward direction when
we evaluate a function and in a backwards direction when we compute gradients. Later
chapters will formally introduce backpropagation, a computational procedure for applying
the chain rule.
From the viewpoint of optimization, gradients allow us to determine how to move the pa-
rameters of a model in order to lower the loss, and each step of the optimization algorithms
used throughout this book will require calculating the gradient.
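As a toy illustration (our own sketch, not a full training loop), a single gradient descent step on 𝑓 (𝑥) = 3𝑥² − 4𝑥 uses the derivative 6𝑥 − 4 computed above:

# One gradient descent step on f(x) = 3x^2 - 4x, using its derivative 6x - 4
x, lr = 1.0, 0.1
grad = 6 * x - 4                        # f'(x) at the current point
x_new = x - lr * grad                   # Move against the gradient
print(x_new, 3 * x_new**2 - 4 * x_new)  # 0.8 -1.28: the value decreased from -1.0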
2.4.6 Exercises
1. So far we took the rules for derivatives for granted. Using the definition and limits prove
the properties for (i) 𝑓 (𝑥) = 𝑐, (ii) 𝑓 (𝑥) = 𝑥 𝑛 , (iii) 𝑓 (𝑥) = 𝑒 𝑥 and (iv) 𝑓 (𝑥) = log 𝑥.
2. In the same vein, prove the product, sum, and quotient rule from first principles.
3. Prove that the constant multiple rule follows as a special case of the product rule.
5. What does it mean that 𝑓 0 (𝑥) = 0 for some 𝑥? Give an example of a function 𝑓 and a
location 𝑥 for which this might hold.
8. What is the gradient of the function 𝑓 (x) = ‖x‖₂? What happens for x = 0?
9. Can you write out the chain rule for the case where 𝑢 = 𝑓 (𝑥, 𝑦, 𝑧) and 𝑥 = 𝑥(𝑎, 𝑏),
𝑦 = 𝑦(𝑎, 𝑏), and 𝑧 = 𝑧(𝑎, 𝑏)?
10. Given a function 𝑓 (𝑥) that is invertible, compute the derivative of its inverse 𝑓 ⁻¹(𝑥).
Here we have that 𝑓 ⁻¹( 𝑓 (𝑥)) = 𝑥 and conversely 𝑓 ( 𝑓 ⁻¹(𝑦)) = 𝑦. Hint: use these
properties in your derivation.
Discussions.

2.5 Automatic Differentiation

Recall from Section 2.4 that calculating derivatives is the crucial step in all the optimization
algorithms that we will use to train deep networks. While the calculations are straightfor-
ward, working them out by hand can be tedious and error-prone, and these issues only grow
as our models become more complex.
Fortunately all modern deep learning frameworks take this work off our plates by offering
automatic differentiation (often shortened to autograd). As we pass data through each
successive function, the framework builds a computational graph that tracks how each value
depends on others. To calculate derivatives, automatic differentiation works backwards
through this graph applying the chain rule. The computational algorithm for applying the
chain rule in this fashion is called backpropagation.
While autograd libraries have become a hot topic over the past decade, they have a
long history. In fact the earliest references to autograd date back over half of a century
(Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis
from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank,
1989). While backpropagation has become the default method for computing gradients,
it is not the only option. For instance, the Julia programming language employs forward
propagation (Revels et al., 2016). Before exploring methods, let’s first master the autograd
package.
import torch
2.5.1 A Simple Function

Let's assume that we are interested in differentiating the function 𝑦 = 2x⊤x with respect
to the column vector x. To start, we assign x an initial value.

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])
Before we calculate the gradient of 𝑦 with respect to x, we need a place to store it. In
general, we avoid allocating new memory every time we take a derivative because deep
learning requires successively computing derivatives with respect to the same parameters
a great many times, and we might risk running out of memory. Note that the gradient of
a scalar-valued function with respect to a vector x is vector-valued with the same shape as
x.
x.requires_grad_(True)  # Same as x = torch.arange(4.0, requires_grad=True)
x.grad                  # The gradient is None by default

We now calculate our function of x and assign the result to y.
y = 2 * torch.dot(x, x)
y
tensor(28., grad_fn=<MulBackward0>)
We can now take the gradient of y with respect to x by calling its backward method. Next,
we can access the gradient via x’s grad attribute.
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])
We already know that the gradient of the function 𝑦 = 2x> x with respect to x should be
4x. We can now verify that the automatic gradient computation and the expected result are
identical.
x.grad == 4 * x

tensor([True, True, True, True])
Now let’s calculate another function of x and take its gradient. Note that PyTorch does not
automatically reset the gradient buffer when we record a new gradient. Instead, the new
gradient is added to the already-stored gradient. This behavior comes in handy when we
want to optimize the sum of multiple objective functions. To reset the gradient buffer, we
can call x.grad.zero_() as follows:

x.grad.zero_()  # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])

2.5.2 Backward for Non-Scalar Variables

When y is a vector, the most natural representation of the derivative of y with respect to
a vector x is a matrix called the Jacobian that contains the partial derivatives of each
component of y with respect to each component of x. While Jacobians do show up in some
advanced machine learning techniques, more commonly we want to sum up the gradients
of each component of y with respect to the full
vector x, yielding a vector of the same shape as x. For example, we often have a vector
representing the value of our loss function calculated separately for each example among a
batch of training examples. Here, we just want to sum up the gradients computed individ-
ually for each example.
Because deep learning frameworks vary in how they interpret gradients of non-scalar ten-
sors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar
elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally,
we need to provide some vector v such that backward will compute v⊤∂x y rather than
∂x y. This next part may be confusing, but for reasons that will become clear later, this
argument (representing v) is named gradient. For a more detailed description, see Yang
Zhang's Medium post.
x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
2.5.3 Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational
graph. For example, say that we use the input to create some auxiliary intermediate terms
for which we do not want to compute a gradient. In this case, we need to detach the
respective computational graph from the final result. Here, detaching y yields a variable u
with the same value as y, but which discards any information about how y was computed.
Thus, when we compute the gradient of z = u * x below, u is treated as a constant:

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
Note that while this procedure detaches y’s ancestors from the graph leading to z, the com-
putational graph leading to y persists and thus we can calculate the gradient of y with
respect to x.
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])
2.5.4 Gradients and Python Control Flow

One benefit of using automatic differentiation is that even if building the computational
graph of a function required passing through a maze of Python control flow (e.g., conditionals,
loops, and arbitrary function calls), we can still calculate the gradient of the resulting
variable. In the following snippet, note that both the number of iterations of the while loop
and the evaluation of the if statement depend on the value of the input a.

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
Below, we call this function, passing in a random value as input. Since the input is a
random variable, we do not know what form the computational graph will take. However,
whenever we execute f(a) on a specific input, we realize a specific computational graph
and can subsequently run backward.
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
Even though our function f is, for demonstration purposes, a bit contrived, its dependence
on the input is quite simple: it is a linear function of a with piecewise defined scale. As
64 Preliminaries
such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the
gradient of f(a) with respect to a.
a.grad == d / a
tensor(True)
Dynamic control flow is very common in deep learning. For instance, when processing
text, the computational graph depends on the length of the input. In these cases, automatic
differentiation becomes vital for statistical modeling since it is impossible to compute the
gradient a priori.
2.5.5 Discussion
You have now gotten a taste of the power of automatic differentiation. The development of
libraries for calculating derivatives both automatically and efficiently has been a massive
productivity booster for deep learning practitioners, liberating them so they can focus on
less menial tasks. Moreover, autograd lets us design massive models for which pen and paper
gradient computations would be prohibitively time consuming. Interestingly, while we use
autograd to optimize models (in a statistical sense) the optimization of autograd libraries
themselves (in a computational sense) is a rich subject of vital interest to framework design-
ers. Here, tools from compilers and graph manipulation are leveraged to compute results
in the most expedient and memory-efficient manner.
For now, try to remember these basics: (i) attach gradients to those variables with respect
to which we desire derivatives; (ii) record the computation of the target value; (iii) execute
the backpropagation function; and (iv) access the resulting gradient.
2.5.6 Exercises
1. Why is the second derivative much more expensive to compute than the first derivative?
2. After running the function for backpropagation, immediately run it again and see what
happens. Investigate.
3. In the control flow example where we calculate the derivative of d with respect to a,
what would happen if we changed the variable a to a random vector or a matrix? At
this point, the result of the calculation f(a) is no longer a scalar. What happens to the
result? How do we analyze this?
4. Let 𝑓 (𝑥) = sin(𝑥). Plot the graph of 𝑓 and of its derivative 𝑓 ′. Do not exploit the fact
that 𝑓 ′(𝑥) = cos(𝑥) but rather use automatic differentiation to get the result.
5. Let 𝑓 (𝑥) = ((log 𝑥 2 ) · sin 𝑥) + 𝑥 −1 . Write out a dependency graph tracing results from
𝑥 to 𝑓 (𝑥).
6. Use the chain rule to compute the derivative 𝑑𝑓/𝑑𝑥 of the aforementioned function, placing
each term on the dependency graph that you constructed previously.
7. Given the graph and the intermediate derivative results, you have a number of options
when computing the gradient. Evaluate the result once starting from 𝑥 to 𝑓 and once
from 𝑓 tracing back to 𝑥. The path from 𝑥 to 𝑓 is commonly known as forward differ-
entiation, whereas the path from 𝑓 to 𝑥 is known as backward differentiation.
8. When might you want to use forward, and when backward, differentiation? Hint: con-
sider the amount of intermediate data needed, the ability to parallelize steps, and the
size of matrices and vectors involved.
Discussions.

2.6 Probability and Statistics

One way or another, machine learning is all about uncertainty. In supervised learning, we
want to predict something unknown (the target) given something known (the features). De-
pending on our objective, we might attempt to predict the most likely value of the target.
Or we might predict the value with the smallest expected distance from the target. And
sometimes we wish not only to predict a specific value but to quantify our uncertainty. For
example, given some features describing a patient, we might want to know how likely they
are to suffer a heart attack in the next year. In unsupervised learning, we often care about
uncertainty. To determine whether a set of measurements are anomalous, it helps to know
how likely one is to observe values in a population of interest. Furthermore, in reinforce-
ment learning, we wish to develop agents that act intelligently in various environments.
This requires reasoning about how an environment might be expected to change and what
rewards one might expect to encounter in response to each of the available actions.
Probability is the mathematical field concerned with reasoning under uncertainty. Given a
probabilistic model of some process, we can reason about the likelihood of various events.
The use of probabilities to describe the frequencies of repeatable events (like coin tosses) is
fairly uncontroversial. In fact, frequentist scholars adhere to an interpretation of probability
that applies only to such repeatable events. By contrast Bayesian scholars use the language
of probability more broadly to formalize reasoning under uncertainty. Bayesian probability
is characterized by two unique features: (i) assigning degrees of belief to non-repeatable
events, e.g., what is the probability that a dam will collapse?; and (ii) subjectivity. While
Bayesian probability provides unambiguous rules for how one should update their beliefs in
light of new evidence, it allows for different individuals to start off with different prior be-
liefs. Statistics helps us to reason backwards, starting off with collection and organization
of data and backing out to what inferences we might draw about the process that generated
the data. Whenever we analyze a dataset, hunting for patterns that we hope might charac-
terize a broader population, we are employing statistical thinking. Many courses, majors,
theses, careers, departments, companies, and institutions have been devoted to the study of
probability and statistics. While this section only scratches the surface, we will provide the
foundation that you need to begin building models.
%matplotlib inline
import random
import torch
from torch.distributions.multinomial import Multinomial
from d2l import torch as d2l
2.6.1 A Simple Example: Tossing Coins

Imagine that we plan to toss a coin and want to quantify how likely we are to see heads
(rather than tails). If the coin is fair, both outcomes are equally likely. If we tossed the coin
𝑛 times, we would expect the fraction of heads to be close to 1/2; we denote the observed
counts of heads and tails by 𝑛h and 𝑛t, respectively.

Formally, the quantity 1/2 is called a probability and here it captures the certainty with
which any given toss will come up heads. Probabilities assign scores between 0 and 1 to
outcomes of interest, called events. Here the event of interest is heads and we denote the
corresponding probability 𝑃(heads). A probability of 1 indicates absolute certainty (imag-
ine a trick coin where both sides were heads) and a probability of 0 indicates impossibility
(e.g., if both sides were tails). The frequencies 𝑛h/𝑛 and 𝑛t/𝑛 are not probabilities but rather
statistics. Probabilities are theoretical quantities that underlie the data generating process.
Here, the probability 1/2 is a property of the coin itself. By contrast, statistics are empirical
quantities that are computed as functions of the observed data. Our interests in probabilis-
tic and statistical quantities are inextricably intertwined. We often design special statistics
called estimators that, given a dataset, produce estimates of model parameters such as prob-
abilities. Moreover, when those estimators satisfy a nice property called consistency, our
estimates will converge to the corresponding probability. In turn, these inferred probabili-
ties tell about the likely statistical properties of data from the same population that we might
encounter in the future.
Suppose that we stumbled upon a real coin for which we did not know the true 𝑃(heads).
To investigate this quantity with statistical methods, we need to (i) collect some data; and
(ii) design an estimator. Data acquisition here is easy; we can toss the coin many times
and record all the outcomes. Formally, drawing realizations from some underlying random
process is called sampling. As you might have guessed, one natural estimator is the ratio
of the number of observed heads to the total number of tosses.
Now, suppose that the coin was in fact fair, i.e., 𝑃(heads) = 0.5. To simulate tosses of a
fair coin, we can invoke any random number generator. There are some easy ways to draw
samples of an event with probability 0.5. For example Python’s random.random yields
numbers in the interval [0, 1] where the probability of lying in any sub-interval [𝑎, 𝑏] ⊂
67 Probability and Statistics
[0, 1] is equal to 𝑏 − 𝑎. Thus we can get out 0 and 1 with probability 0.5 each by testing
whether the returned float number is greater than 0.5:
num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])
More generally, we can simulate multiple draws from any variable with a finite number
of possible outcomes (like the toss of a coin or roll of a die) by calling the multinomial
function, setting the first argument to the number of draws and the second as a list of prob-
abilities associated with each of the possible outcomes. To simulate 100 tosses of a fair coin,
we assign probability vector [0.5, 0.5], interpreting index 0 as heads and index 1 as tails.
The function returns a vector with length equal to the number of possible outcomes (here,
2), where the first component tells us the number of occurrences of heads and the second
component tells us the number of occurrences of tails.

fair_probs = torch.tensor([0.5, 0.5])
Multinomial(100, fair_probs).sample()

tensor([50., 50.])
Each time you run this sampling process, you will receive a new random value that may
differ from the previous outcome. Dividing by the number of tosses gives us the frequency
of each outcome in our data. Note that these frequencies, just like the probabilities that they
are intended to estimate, sum to 1.
Multinomial(100, fair_probs).sample() / 100

tensor([0.4800, 0.5200])
Here, even though our simulated coin is fair (we ourselves set the probabilities [0.5, 0.
5]), the counts of heads and tails may not be identical. That is because we only drew a
relatively small number of samples. If we did not implement the simulation ourselves, and
only saw the outcome, how would we know if the coin were slightly unfair or if the possible
deviation from 1/2 was just an artifact of the small sample size? Let’s see what happens
when we simulate 10,000 tosses.
Multinomial(10000, fair_probs).sample() / 10000

tensor([0.4966, 0.5034])
In general, for averages of repeated events (like coin tosses), as the number of repetitions
grows, our estimates are guaranteed to converge to the true underlying probabilities. The
mathematical formulation of this phenomenon is called the law of large numbers and the
central limit theorem tells us that in many situations, as the sample size 𝑛 grows, these
√
errors should go down at a rate of (1/ 𝑛). Let’s get some more intuition by studying how
our estimate evolves as we grow the number of tosses from 1 to 10,000.
counts = Multinomial(1, fair_probs).sample((10000,))
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
estimates = estimates.numpy()

d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();
Each solid curve corresponds to one of the two values of the coin and gives our estimated
probability that the coin turns up that value after each group of experiments. The dashed
black line gives the true underlying probability. As we get more data by conducting more
experiments, the curves converge towards the true probability. You might already begin to
see the shape of some of the more advanced questions that preoccupy statisticians: How
quickly does this convergence happen? If we had already tested many coins manufactured
at the same plant, how might we incorporate this information?
So far, we have managed to get by with simulations and simple summary statistics
(checking the deviation). However, to go much further, we will need to be more precise.

2.6.2 A More Formal Treatment
When dealing with randomness, we denote the set of possible outcomes S and call it the
sample space or outcome space. Here, each element is a distinct possible outcome. In
the case of tossing a single coin, S = {heads, tails}. For a single die, S = {1, 2, 3, 4, 5, 6}.
When flipping two coins, possible outcomes are {(heads, heads), (heads, tails), (tails, heads), (tails, tails)}.
Events are subsets of the sample space. For instance, the event “the first coin toss comes
up heads” corresponds to the set {(heads, heads), (heads, tails)}. Whenever the outcome
𝑧 of a random experiment satisfies 𝑧 ∈ A, then event A has occurred. For a single roll
of a die, we could define the events “seeing a 5” (A = {5}) and “seeing an odd number”
(B = {1, 3, 5}). In this case, if the die came up 5, we would say that both A and B occurred.
On the other hand, if 𝑧 = 3, then A did not occur but B did.
A probability function maps events onto real values, 𝑃 : A ⊆ S → [0, 1]. The probability,
denoted 𝑃(A), of an event A in the given sample space S, has the following properties:

• The probability of any event A is nonnegative, i.e., 𝑃(A) ≥ 0;

• The probability of the entire sample space is 1, i.e., 𝑃(S) = 1;

• For any countable sequence of events A₁, A₂, . . . that are mutually exclusive (i.e., Aᵢ ∩
Aⱼ = ∅ for all 𝑖 ≠ 𝑗), the probability that any of them happens is equal to the sum of
their individual probabilities, i.e., $P\left(\bigcup_{i=1}^{\infty} \mathcal{A}_i\right) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$.
2.6.3 Random Variables

Random variables map outcomes of the underlying random process onto values of interest.
For example, whether a house alarm goes off can be viewed as a random variable whose
value depends on, among other things, whether a burglary took place. Knowing that the
alarm went off, we might suspect that the house was likely burgled.
Every value taken by a random variable corresponds to a subset of the underlying sample
space. Thus the occurrence where the random variable 𝑋 takes value 𝑣, denoted by 𝑋 = 𝑣,
is an event and 𝑃(𝑋 = 𝑣) denotes its probability. Sometimes this notation can get clunky,
and we can abuse notation when the context is clear. For example, we might use 𝑃(𝑋) to
refer broadly to the distribution of 𝑋, i.e., the function that tells us the probability that 𝑋
takes any given value. Other times we write expressions like 𝑃(𝑋, 𝑌 ) = 𝑃(𝑋)𝑃(𝑌 ), as a
shorthand to express a statement that is true for all of the values that the random variables
𝑋 and 𝑌 can take, i.e., for all 𝑖, 𝑗 it holds that 𝑃(𝑋 = 𝑖 and 𝑌 = 𝑗) = 𝑃(𝑋 = 𝑖)𝑃(𝑌 = 𝑗).
Other times, we abuse notation by writing 𝑃(𝑣) when the random variable is clear from the
context. Since an event in probability theory is a set of outcomes from the sample space,
we can specify a range of values for a random variable to take. For example, 𝑃(1 ≤ 𝑋 ≤ 3)
denotes the probability of the event {1 ≤ 𝑋 ≤ 3}.
Note that there is a subtle difference between discrete random variables, like flips of a coin
or tosses of a die, and continuous ones, like the weight and the height of a person sampled
at random from the population. In this case we seldom really care about someone’s exact
height. Moreover, if we took precise enough measurements, we would find that no two
people on the planet have the exact same height. In fact, with fine enough measurements,
you would never have the same height when you wake up and when you go to sleep. There
is little point in asking about the exact probability that someone is 1.801392782910287192
meters tall. Instead, we typically care more about being able to say whether someone’s
height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we work
with probability densities. The height of exactly 1.80 meters has no probability, but nonzero
density. To work out the probability assigned to an interval, we must take an integral of the
density over that interval.
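For instance (a quick numerical sketch with an assumed model; the mean of 1.75 m and standard deviation of 0.07 m are made-up numbers), we can approximate the probability that a height falls between 1.79 and 1.81 meters by integrating a Gaussian density over that interval:

from torch.distributions.normal import Normal

# Assumed height model: Normal(mean=1.75 m, std=0.07 m); numbers are illustrative
height = Normal(loc=1.75, scale=0.07)
grid = torch.linspace(1.79, 1.81, 1000)
density = height.log_prob(grid).exp()
# Riemann-sum approximation of the integral of the density over [1.79, 1.81]
(density * (grid[1] - grid[0])).sum()  # Roughly 0.09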
2.6.4 Multiple Random Variables

Very often, we will want to consider more than one random variable at a time and to reason
about every combination of values that the variables can jointly take. The probability function
that assigns probabilities to each of these combinations (e.g., 𝐴 = 𝑎 and 𝐵 = 𝑏) is called the
joint probability function and simply returns the probability assigned to the intersection
of the corresponding subsets of the sample space. The joint probability assigned to the
event where random variables 𝐴 and 𝐵 take values 𝑎 and 𝑏, respectively, is denoted 𝑃(𝐴 =
𝑎, 𝐵 = 𝑏), where the comma indicates “and”. Note that for any values 𝑎 and 𝑏, it follows
that 𝑃(𝐴 = 𝑎, 𝐵 = 𝑏) ≤ min(𝑃(𝐴 = 𝑎), 𝑃(𝐵 = 𝑏)), since for the joint event to occur, both
𝐴 = 𝑎 and 𝐵 = 𝑏 must happen.
The ratio $P(B = b \mid A = a) = \frac{P(A = a, B = b)}{P(A = a)}$ is called the conditional probability of
𝐵 = 𝑏 given 𝐴 = 𝑎. It tells us the new probability associated with the event 𝐵 = 𝑏, once we condition on the
fact 𝐴 = 𝑎 took place. We can think of this conditional probability as restricting attention
only to the subset of the sample space associated with 𝐴 = 𝑎 and then renormalizing so that
all probabilities sum to 1. Conditional probabilities are in fact just ordinary probabilities
and thus respect all of the axioms, as long as we condition all terms on the same event and
thus restrict attention to the same sample space. For instance, for disjoint events B and B′,
we have that 𝑃(B ∪ B′ | 𝐴 = 𝑎) = 𝑃(B | 𝐴 = 𝑎) + 𝑃(B′ | 𝐴 = 𝑎).
Using the definition of conditional probabilities, we can derive the famous result called
Bayes’ theorem. By construction, we have that 𝑃( 𝐴, 𝐵) = 𝑃(𝐵 | 𝐴)𝑃( 𝐴) and 𝑃( 𝐴, 𝐵) =
𝑃( 𝐴 | 𝐵)𝑃(𝐵). Combining both equations yields 𝑃(𝐵 | 𝐴)𝑃( 𝐴) = 𝑃( 𝐴 | 𝐵)𝑃(𝐵) and
hence
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \tag{2.6.3}$$
This simple equation has profound implications because it allows us to reverse the order of
conditioning. If we know how to estimate 𝑃(𝐵 | 𝐴), 𝑃( 𝐴), and 𝑃(𝐵), then we can estimate
𝑃( 𝐴 | 𝐵). We often find it easier to estimate one term directly but not the other and Bayes’
theorem can come to the rescue here. For instance, if we know the prevalence of symptoms
for a given disease, and the overall prevalences of the disease and symptoms, respectively,
we can determine how likely someone is to have the disease based on their symptoms. In
some cases we might not have direct access to 𝑃(𝐵), such as the prevalence of symptoms.
In this case a simplified version of Bayes’ theorem comes in handy:
Since we know that 𝑃(𝐴 | 𝐵) must be normalized to 1, i.e., $\sum_a P(A = a \mid B) = 1$, we can
use it to compute

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_a P(B \mid A = a) P(A = a)}. \tag{2.6.5}$$
In Bayesian statistics, we think of an observer as possessing some (subjective) prior be-
liefs about the plausibility of the available hypotheses encoded in the prior 𝑃(𝐻), and a
likelihood function that says how likely one is to observe any value of the collected evi-
dence for each of the hypotheses in the class 𝑃(𝐸 | 𝐻). Bayes’ theorem is then interpreted
as telling us how to update the initial prior 𝑃(𝐻) in light of the available evidence 𝐸 to
produce posterior beliefs $P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$. Informally, this can be stated as “pos-
terior equals prior times likelihood, divided by the evidence”. Now, because the evidence
𝑃(𝐸) is the same for all hypotheses, we can get away with simply normalizing over the
hypotheses.
Note that $\sum_a P(A = a \mid B) = 1$ also allows us to marginalize over random variables.
That is, we can drop variables from a joint distribution such as 𝑃(𝐴, 𝐵). After all, we have
that

$$\sum_a P(B \mid A = a) P(A = a) = \sum_a P(B, A = a) = P(B). \tag{2.6.6}$$
Independence is another fundamentally important concept that forms the backbone of many
important ideas in statistics. In short, two variables are independent if conditioning on the
value of 𝐴 does not cause any change to the probability distribution associated with 𝐵 and
vice versa. More formally, independence, denoted 𝐴 ⊥ 𝐵, requires that 𝑃( 𝐴 | 𝐵) = 𝑃( 𝐴)
and, consequently, that 𝑃( 𝐴, 𝐵) = 𝑃( 𝐴 | 𝐵)𝑃(𝐵) = 𝑃( 𝐴)𝑃(𝐵). Independence is often
an appropriate assumption. For example, if the random variable 𝐴 represents the outcome
from tossing one fair coin and the random variable 𝐵 represents the outcome from tossing
another, then knowing whether 𝐴 came up heads should not influence the probability of 𝐵
coming up heads.
Independence is especially useful when it holds among the successive draws of our data
from some underlying distribution (allowing us to make strong statistical conclusions) or
when it holds among various variables in our data, allowing us to work with simpler models
that encode this independence structure. On the other hand, estimating the dependencies
among random variables is often the very aim of learning. We care to estimate the probabil-
ity of disease given symptoms specifically because we believe that diseases and symptoms
are not independent.
Note that because conditional probabilities are proper probabilities, the concepts of inde-
pendence and dependence also apply to them. Two random variables 𝐴 and 𝐵 are condition-
ally independent given a third variable 𝐶 if and only if 𝑃( 𝐴, 𝐵 | 𝐶) = 𝑃( 𝐴 | 𝐶)𝑃(𝐵 | 𝐶).
Interestingly, two variables can be independent in general but become dependent when
conditioning on a third. This often occurs when the two random variables 𝐴 and 𝐵 cor-
respond to causes of some third variable 𝐶. For example, broken bones and lung cancer
might be independent in the general population but if we condition on being in the hospital
then we might find that broken bones are negatively correlated with lung cancer. That is
73 Probability and Statistics
because the broken bone explains away why some person is in the hospital and thus lowers
the probability that they are hospitalized because of having lung cancer.
And conversely, two dependent random variables can become independent upon condition-
ing on a third. This often happens when two otherwise unrelated events have a common
cause. Shoe size and reading level are highly correlated among elementary school students,
but this correlation disappears if we condition on age.
2.6.5 An Example
Let’s put our skills to the test. Assume that a doctor administers an HIV test to a patient.
This test is fairly accurate and fails only with 1% probability if the patient is healthy but
reported as diseased, i.e., healthy patients test positive in 1% of cases. Moreover, it never
fails to detect HIV if the patient actually has it. We use 𝐷 1 ∈ {0, 1} to indicate the diagnosis
(0 if negative and 1 if positive) and 𝐻 ∈ {0, 1} to denote the HIV status.
These assumptions correspond to the following conditional probability table:

Conditional probability    𝐻 = 1    𝐻 = 0
𝑃(𝐷₁ = 1 | 𝐻)              1        0.01
𝑃(𝐷₁ = 0 | 𝐻)              0        0.99

Note that the column sums are all 1 (but the row sums do not), since they are conditional
probabilities. Let’s compute the probability of the patient having HIV if the test comes
back positive, i.e., 𝑃(𝐻 = 1 | 𝐷 1 = 1). Intuitively this is going to depend on how common
the disease is, since it affects the number of false alarms. Assume that the population is
fairly free of the disease, e.g., 𝑃(𝐻 = 1) = 0.0015. To apply Bayes’ theorem, we need to
apply marginalization to determine

$$P(D_1 = 1) = P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1) = 0.01 \cdot 0.9985 + 1 \cdot 0.0015 = 0.011485. \tag{2.6.7}$$

This leads us to
$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)} = 0.1306. \tag{2.6.8}$$
In other words, there is only a 13.06% chance that the patient actually has HIV, despite the
test being pretty accurate. As we can see, probability can be counterintuitive. What should a
patient do upon receiving such terrifying news? Likely, the patient would ask the physician
to administer another test to get clarity. The second test has different characteristics: it
detects HIV with probability 0.98 if the patient has it, but healthy patients test positive in
3% of cases, so it is not as good as the first one:

Conditional probability    𝐻 = 1    𝐻 = 0
𝑃(𝐷₂ = 1 | 𝐻)              0.98     0.03
𝑃(𝐷₂ = 0 | 𝐻)              0.02     0.97
Unfortunately, the second test comes back positive, too. Let’s calculate the requisite prob-
abilities to invoke Bayes’ theorem by assuming conditional independence:
𝑃(𝐷 1 = 1, 𝐷 2 = 1 | 𝐻 = 0) = 𝑃(𝐷 1 = 1 | 𝐻 = 0)𝑃(𝐷 2 = 1 | 𝐻 = 0) = 0.0003,
𝑃(𝐷 1 = 1, 𝐷 2 = 1 | 𝐻 = 1) = 𝑃(𝐷 1 = 1 | 𝐻 = 1)𝑃(𝐷 2 = 1 | 𝐻 = 1) = 0.98.
(2.6.9)
Now we can apply marginalization to obtain the probability that both tests come back pos-
itive:
𝑃(𝐷 1 = 1, 𝐷 2 = 1)
= 𝑃(𝐷 1 = 1, 𝐷 2 = 1, 𝐻 = 0) + 𝑃(𝐷 1 = 1, 𝐷 2 = 1, 𝐻 = 1)
= 𝑃(𝐷 1 = 1, 𝐷 2 = 1 | 𝐻 = 0)𝑃(𝐻 = 0) + 𝑃(𝐷 1 = 1, 𝐷 2 = 1 | 𝐻 = 1)𝑃(𝐻 = 1)
= 0.00176955.
(2.6.10)
Finally, the probability of the patient having HIV given that both tests are positive is

$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)} = 0.8307. \tag{2.6.11}$$
That is, the second test allowed us to gain much higher confidence that not all is well. De-
spite the second test being considerably less accurate than the first one, it still significantly
improved our estimate. The assumption of both tests being conditionally independent of
each other was crucial for our ability to generate a more accurate estimate. Take the ex-
treme case where we run the same test twice. In this situation we would expect the same
outcome both times, hence no additional insight is gained from running the same test again.
The astute reader might have noticed that the diagnosis behaved like a classifier hiding in
plain sight where our ability to decide whether a patient is healthy increases as we obtain
more features (test outcomes).
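These numbers are easy to verify with a few lines of arithmetic (a small sketch reproducing the calculation above):

# Reproducing the two-test calculation: prior, likelihoods, marginalization, posterior
p_h = 0.0015                     # P(H = 1), the prevalence
p_d1_h1, p_d1_h0 = 1.0, 0.01     # P(D1 = 1 | H)
p_d2_h1, p_d2_h0 = 0.98, 0.03    # P(D2 = 1 | H)

p_both_h1 = p_d1_h1 * p_d2_h1    # Conditional independence given H = 1
p_both_h0 = p_d1_h0 * p_d2_h0    # Conditional independence given H = 0
p_both = p_both_h1 * p_h + p_both_h0 * (1 - p_h)  # Marginalize over H
posterior = p_both_h1 * p_h / p_both
p_both, posterior                # (0.00176955, 0.8307...)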
2.6.6 Expectations
Often, making decisions requires not just looking at the probabilities assigned to individ-
ual events but composing them together into useful aggregates that can provide us with
guidance. For example, when random variables take continuous scalar values, we often
care about knowing what value to expect on average. This quantity is formally called an
expectation. If we are making investments, the first quantity of interest might be the return
we can expect, averaging over all the possible outcomes (and weighting by the appropriate
probabilities). For instance, say that with 50% probability, an investment might fail
altogether, with 40% probability it might provide a 2× return, and with 10% probability
it might provide a 10× return. To calculate the expected return, we sum over all returns,
multiplying each by the probability that they will occur. This yields the expectation
0.5 · 0 + 0.4 · 2 + 0.1 · 10 = 1.8. Hence the expected return is 1.8×.
Formally, the expectation of a random variable 𝑋 is defined as $E[X] = \sum_x x P(X = x)$.
Likewise, for densities we obtain $E[X] = \int x \, dp(x)$. Sometimes we are interested in the
expected value of some function of 𝑥. We can calculate these expectations as
$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \quad \text{and} \quad E_{x \sim P}[f(x)] = \int f(x) p(x) \, dx \tag{2.6.13}$$
for discrete probabilities and densities, respectively. Returning to the investment exam-
ple from above, 𝑓 might be the utility (happiness) associated with the return. Behavioral
economists have long noted that people associate greater disutility with losing money than
the utility gained from earning one dollar relative to their baseline. Moreover, the value
of money tends to be sub-linear. Possessing 100k dollars versus zero dollars can make the
difference between paying the rent, eating well, and enjoying quality healthcare versus suf-
fering through homelessness. On the other hand, the gains due to possessing 200k versus
100k are less dramatic. Reasoning like this motivates the cliché that “the utility of money
is logarithmic”.
If the utility associated with a total loss were −1, and the utilities associated with returns of
1, 2, and 10 were 1, 2 and 4, respectively, then the expected happiness of investing would
be 0.5 · (−1) + 0.4 · 2 + 0.1 · 4 = 0.7 (an expected loss of utility of 30%). If indeed this were
your utility function, you might be best off keeping the money in the bank.
For financial decisions, we might also want to measure how risky an investment is. Here, we
care not just about the expected value but how much the actual values tend to vary relative
to this value. Note that we cannot just take the expectation of the difference between the
actual and expected values. This is because the expectation of a difference is the difference
of the expectations, i.e., 𝐸 [𝑋 − 𝐸 [𝑋]] = 𝐸 [𝑋] − 𝐸 [𝐸 [𝑋]] = 0. However, we can look at
the expectation of any non-negative function of this difference. The variance of a random
variable is calculated by looking at the expected value of the squared differences:
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2. \tag{2.6.14}$$
Here the equality follows by expanding (𝑋 − 𝐸 [𝑋]) 2 = 𝑋 2 − 2𝑋 𝐸 [𝑋] + 𝐸 [𝑋] 2 and taking
expectations for each term. The square root of the variance is another useful quantity called
the standard deviation. While this and the variance convey the same information (either can
be calculated from the other), the standard deviation has the nice property that it is expressed
in the same units as the original quantity represented by the random variable.
Lastly, the variance of a function of a random variable is defined analogously as
$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2. \tag{2.6.15}$
Returning to our investment example, we can now compute the variance of the investment.
It is given by 0.5 · 0 + 0.4 · 22 + 0.1 · 102 − 1.82 = 8.36. For all intents and purposes this
is a risky investment. Note that by mathematical convention mean and variance are often
referenced as 𝜇 and 𝜎 2 . This is particularly the case whenever we use it to parametrize a
Gaussian distribution.
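These numbers are simple to check in code (a small sketch of our own):

# Expected return and variance of the investment example
returns = torch.tensor([0.0, 2.0, 10.0])
probs = torch.tensor([0.5, 0.4, 0.1])
mean = (probs * returns).sum()
var = (probs * returns**2).sum() - mean**2
mean, var  # (tensor(1.8000), tensor(8.3600))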
In the same way as we introduced expectations and variance for scalar random variables,
we can do so for vector-valued ones. Expectations are easy, since we can apply them
elementwise. For instance, $\boldsymbol{\mu} \stackrel{\text{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$ has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$. Covariances
are more complicated. We define them by taking expectations of the outer product of the
difference between random variables and their mean:
$$\mathbf{\Sigma} \stackrel{\text{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right]. \tag{2.6.16}$$
This matrix 𝚺 is referred to as the covariance matrix. An easy way to see its effect is to
consider some vector v of the same size as x. It follows that
$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}]. \quad (2.6.17)$$
As such, 𝚺 allows us to compute the variance for any linear function of x by a simple
matrix multiplication. The off-diagonal elements tell us how correlated the coordinates
are: a value of 0 means no correlation, whereas a larger positive value means that they are
more strongly correlated.
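As a concrete check (a minimal sketch of our own; the shapes and numbers are illustrative), we can estimate $\boldsymbol{\Sigma}$ from samples and verify (2.6.17) empirically:

import torch

# Draw many samples of a 2-dimensional random vector with correlated coordinates
n = 10000
z = torch.randn(n, 2)
A = torch.tensor([[1.0, 0.0], [0.5, 1.0]])
x = z @ A.T                          # x has covariance A @ A.T

mu = x.mean(dim=0)                   # elementwise expectation
Sigma = (x - mu).T @ (x - mu) / n    # empirical covariance matrix

v = torch.tensor([1.0, -2.0])
lhs = v @ Sigma @ v                  # v^T Sigma v
rhs = (x @ v).var(unbiased=False)    # Var[v^T x]
print(lhs, rhs)                      # the two values should nearly agree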
2.6.7 Discussion
In machine learning, there are many things to be uncertain about! We can be uncertain
about the value of a label given an input. We can be uncertain about the estimated value of
a parameter. We can even be uncertain about whether data arriving at deployment is even
from the same distribution as the training data.
By aleatoric uncertainty, we mean uncertainty that is intrinsic to the problem, and due to
genuine randomness unaccounted for by the observed variables. By epistemic uncertainty,
we mean uncertainty over a model’s parameters, the sort of uncertainty that we can hope
to reduce by collecting more data. We might have epistemic uncertainty concerning the
probability that a coin turns up heads, but even once we know this probability, we are left
with aleatoric uncertainty about the outcome of any future toss. No matter how long we
watch someone tossing a fair coin, we will never be more or less than 50% certain that
the next toss will come up heads. These terms come from mechanical modeling (see, e.g.,
Der Kiureghian and Ditlevsen (2009) for a review on this aspect of uncertainty quantification).
It is worth noting, however, that these terms constitute a slight abuse of language:
the term epistemic refers to anything concerning knowledge and thus, in the philosophical
sense, all uncertainty is epistemic.
We saw that sampling data from some unknown probability distribution can provide us with
information that can be used to estimate the parameters of the data generating distribution.
That said, the rate at which this is possible can be quite slow. In our coin tossing example
(and many others) we can do no better than to design estimators that converge at a rate of
$1/\sqrt{n}$, where $n$ is the sample size (e.g., the number of tosses). This means that by going
from 10 to 1000 observations (usually a very achievable task) we see a tenfold reduction of
uncertainty, whereas the next 1000 observations help comparatively little, offering only a
1.41 times reduction. This is a persistent feature of machine learning: while there are often
easy gains, it takes a very large amount of data, and often with it an enormous amount of
computation, to make further gains. For an empirical review of this fact for large-scale
language models see Kaplan et al. (2020).
We also sharpened our language and tools for statistical modeling. In the process
we learned about conditional probabilities and about one of the most important equations
in statistics—Bayes’ theorem. It is an effective tool for decoupling information conveyed
by data through a likelihood term 𝑃(𝐵 | 𝐴) that addresses how well observations 𝐵 match
a choice of parameters 𝐴, and a prior probability 𝑃( 𝐴) which governs how plausible a par-
ticular choice of 𝐴 was in the first place. In particular, we saw how this rule can be applied
to assign probabilities to diagnoses, based on the efficacy of the test and the prevalence of
the disease itself (i.e., our prior).
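To see the rule in action numerically, here is a minimal sketch (the prevalence and error rates are illustrative numbers, not taken from the text):

# P(A): prior plausibility of the diagnosis; P(B | A): likelihood of the observation
prior = 0.0015                # prevalence of the disease
likelihood = 0.99             # P(test positive | infected)
false_positive = 0.1          # P(test positive | healthy)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence
print(f'P(infected | positive) = {posterior:.4f}')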
Lastly, we introduced a first set of nontrivial questions about the effect of a specific proba-
bility distribution, namely expectations and variances. While there are many more than just
linear and quadratic expectations for a probability distribution, these two already provide
a good deal of knowledge about the possible behavior of the distribution. For instance,
Chebyshev's inequality states that $P(|X - \mu| \geq k\sigma) \leq 1/k^2$, where $\mu$ is the expectation,
$\sigma^2$ is the variance of the distribution, and $k > 1$ is a confidence parameter of our
choosing. Taking $k = \sqrt{2}$, it tells us that draws from a distribution lie with at least 50% probability
within a $[\mu - \sqrt{2}\sigma, \mu + \sqrt{2}\sigma]$ interval centered on the expectation.
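A quick empirical check of the bound (our own sketch, not from the text):

import torch

x = torch.randn(100000)       # standard normal, so mu = 0 and sigma = 1
k = 2.0
frac = ((x - x.mean()).abs() >= k * x.std()).float().mean()
print(float(frac), 1 / k**2)  # roughly 0.046, comfortably below the bound 0.25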
2.6.8 Exercises
1. Give an example where observing more data can reduce the amount of uncertainty about
the outcome to an arbitrarily low level.
2. Give an example where observing more data will only reduce the amount of uncertainty
up to a point and then no further. Explain why this is the case and where you expect this
point to occur.
3. We empirically demonstrated convergence to the mean for the toss of a coin. Calculate
the variance of the estimate of the probability that we see a head after drawing 𝑛 samples.
1. How does the variance scale with the number of observations?
2. Use Chebyshev’s inequality to bound the deviation from the expectation.
3. How does it relate to the central limit theorem?
4. Assume that we draw 𝑚 samples 𝑥 𝑖 from a probability distribution with zero mean and
unit variance. Compute the averages $z_m \stackrel{\mathrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshev's
inequality for every 𝑧 𝑚 independently? Why not?
5. Given two events with probability 𝑃(A) and 𝑃(B), compute upper and lower bounds
on $P(\mathcal{A} \cup \mathcal{B})$ and $P(\mathcal{A} \cap \mathcal{B})$. Hint: graph the situation using a Venn diagram.
6. Assume that we have a sequence of random variables, say 𝐴, 𝐵, and 𝐶, where 𝐵 only de-
pends on 𝐴, and 𝐶 only depends on 𝐵, can you simplify the joint probability 𝑃( 𝐴, 𝐵, 𝐶)?
Hint: this is a Markov chain.
7. In Section 2.6.5, assume that the outcomes of the two tests are not independent. In
particular assume that either test on its own has a false positive rate of 10% and a false
negative rate of 1%. That is, assume that 𝑃(𝐷 = 1 | 𝐻 = 0) = 0.1 and that 𝑃(𝐷 =
0 | 𝐻 = 1) = 0.01. Moreover, assume that for 𝐻 = 1 (infected) the test outcomes are
2.7 Documentation
While we cannot possibly introduce every single PyTorch function and class (and the information
might become outdated quickly), the API documentation and additional tutorials
and examples provide such documentation. This section provides some guidance for
how to explore the PyTorch API.

2.7.1 Functions and Classes in a Module

To know which functions and classes can be called in a module, we invoke the dir function.
For instance, we can query all the members of the module for generating random variables:

import torch

print(dir(torch.distributions))
Generally, we can ignore functions that start and end with __ (special objects in Python) or
functions that start with a single _(usually internal functions). Based on the remaining func-
tion or attribute names, we might hazard a guess that this module offers various methods for
generating random numbers, including sampling from the uniform distribution (uniform),
normal distribution (normal), and multinomial distribution (multinomial).
2.7.2 Specific Functions and Classes

For specific instructions on how to use a given function or class, we can invoke the help
function. As an example, let's explore the usage instructions for the ones function.

help(torch.ones)
ones(...)
    ones(*size, *, out=None, dtype=None, layout=torch.strided, device=None,
    requires_grad=False) -> Tensor

    Returns a tensor filled with the scalar value 1, with the shape defined
    by the variable argument size.

    Args:
        size (int...): a sequence of integers defining the shape of the output tensor.

    Keyword arguments:
        out (Tensor, optional): the output tensor.
        dtype (torch.dtype, optional): the desired data type of returned tensor.
        layout (torch.layout, optional): the desired layout of returned Tensor.
            Default: torch.strided.
        device (torch.device, optional): the desired device of returned tensor.
            Default: if None, uses the current device for the default tensor type.

    Example::

        >>> torch.ones(2, 3)
        tensor([[ 1., 1., 1.],
                [ 1., 1., 1.]])

        >>> torch.ones(5)
        tensor([ 1., 1., 1., 1., 1.])
From the documentation, we can see that the ones function creates a new tensor with the
specified shape and sets all the elements to the value of 1. Whenever possible, you should
run a quick test to confirm your interpretation:
torch.ones(4)
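tensor([1., 1., 1., 1.])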
In the Jupyter notebook, we can use ? to display the document in another window. For
example, list? will create content that is almost identical to help(list), displaying it
in a new browser window. In addition, if we use two question marks, such as list??, the
Python code implementing the function will also be displayed.
The official documentation provides plenty of descriptions and examples that go beyond
this book. We emphasize important use cases that will get you started quickly with prac-
tical problems, rather than completeness of coverage. We also encourage you to study the
source code of the libraries to see examples of high-quality implementations of production
code. By doing this you will become a better engineer in addition to becoming a better
scientist.
Discussions.
3 Linear Neural Networks for Regression
Before we worry about making our neural networks deep, it will be helpful to implement
some shallow ones, for which the inputs connect directly to the outputs. This will prove im-
portant for a few reasons. First, rather than getting distracted by complicated architectures,
we can focus on the basics of neural network training, including parametrizing the output
layer, handling data, specifying a loss function, and training the model. Second, this class
of shallow networks happens to comprise the set of linear models, which subsumes many
classical methods of statistical prediction, including linear and softmax regression. Un-
derstanding these classical tools is pivotal because they are widely used in many contexts
and we will often need to use them as baselines when justifying the use of fancier archi-
tectures. This chapter will focus narrowly on linear regression and the next one will extend
our modeling repertoire by developing linear neural networks for classification.
3.1 Linear Regression

Regression problems pop up whenever we want to predict a numerical value. Common ex-
amples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for
patients in the hospital), forecasting demand (for retail sales), among numerous others. Not
every prediction problem is one of classical regression. Later on, we will introduce classifi-
cation problems, where the goal is to predict membership among a set of categories.
As a running example, suppose that we wish to estimate the prices of houses (in dollars)
based on their area (in square feet) and age (in years). To develop a model for predicting
house prices, we need to get our hands on data, including the sales price, area, and age for
each home. In the terminology of machine learning, the dataset is called a training dataset
or training set, and each row (containing the data corresponding to one sale) is called an
example (or data point, instance, sample). The thing we are trying to predict (price) is
called a label (or target). The variables (age and area) upon which the predictions are
based are called features (or covariates).
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l
3.1.1 Basics
Linear regression is both the simplest and most popular among the standard tools for tack-
ling regression problems. Dating back to the dawn of the 19th century (Gauss, 1809, Leg-
endre, 1805), linear regression flows from a few simple assumptions. First, we assume that
the relationship between features x and target 𝑦 is approximately linear, i.e., that the con-
ditional mean 𝐸 [𝑌 | 𝑋 = x] can be expressed as a weighted sum of the features x. This
setup allows that the target value may still deviate from its expected value on account of
observation noise. Next, we can impose the assumption that any such noise is well behaved,
following a Gaussian distribution. Typically, we will use 𝑛 to denote the number of exam-
ples in our dataset. We use superscripts to enumerate samples and targets, and subscripts
to index coordinates. More concretely, $\mathbf{x}^{(i)}$ denotes the $i$-th sample and $x_j^{(i)}$ denotes its $j$-th coordinate.
Model
At the heart of every solution is a model that describes how features can be transformed
into an estimate of the target. The assumption of linearity means that the expected value of
the target (price) can be expressed as a weighted sum of the features (area and age):

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b. \quad (3.1.1)$$

Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (or offset or intercept). The
weights determine the influence of each feature on our prediction. The bias determines the
value of the estimate when all features are zero. Even though we will never see any newly-
built homes with precisely zero area, we still need the bias because it allows us to express
all linear functions of our features (rather than restricting us to lines that pass through the
origin). Strictly speaking, (3.1.1) is an affine transformation of input features, which is
characterized by a linear transformation of features via a weighted sum, combined with a
translation via the added bias. Given a dataset, our goal is to choose the weights w and
the bias 𝑏 that, on average, make our model’s predictions fit the true prices observed in the
data as closely as possible.
In disciplines where it is common to focus on datasets with just a few features, explicitly
expressing models long-form, as in (3.1.1), is common. In machine learning, we usually
work with high-dimensional datasets, where it is more convenient to employ compact lin-
ear algebra notation. When our inputs consist of 𝑑 features, we can assign each an index
(between 1 and 𝑑) and express our prediction 𝑦ˆ (in general the “hat” symbol denotes an
estimate) as
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b. \quad (3.1.2)$$
Collecting all features into a vector x ∈ R𝑑 and all weights into a vector w ∈ R𝑑 , we can
express our model compactly via the dot product between w and x:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b. \quad (3.1.3)$$
In (3.1.3), the vector x corresponds to the features of a single example. We will often
find it convenient to refer to features of our entire dataset of 𝑛 examples via the design
matrix X ∈ R𝑛×𝑑 . Here, X contains one row for every example and one column for every
feature. For a collection of features X, the predictions ŷ ∈ R𝑛 can be expressed via the
matrix–vector product:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b, \quad (3.1.4)$$
where broadcasting (Section 2.1.4) is applied during the summation. Given features of a
training dataset X and corresponding (known) labels y, the goal of linear regression is to
find the weight vector w and the bias term 𝑏 such that, given features of a new data example
sampled from the same distribution as X, the new example’s label will (in expectation) be
predicted with the smallest error.
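As a small illustration (a sketch with made-up numbers, not from the text), the prediction in (3.1.4) is a single matrix–vector product plus a broadcast addition:

import torch

X = torch.tensor([[1200.0, 5.0],     # area (sq ft) and age (years) of three houses
                  [1500.0, 10.0],
                  [900.0, 2.0]])
w = torch.tensor([100.0, -1000.0])   # hypothetical weights
b = 50000.0                          # hypothetical bias

y_hat = X @ w + b                    # b is broadcast and added to every example
print(y_hat)                         # one predicted price per row of X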
Even if we believe that the best model for predicting 𝑦 given x is linear, we would not
expect to find a real-world dataset of 𝑛 examples where 𝑦 (𝑖) exactly equals w> x (𝑖) + 𝑏
for all 1 ≤ 𝑖 ≤ 𝑛. For example, whatever instruments we use to observe the features X
and labels y, there might be a small amount of measurement error. Thus, even when we
are confident that the underlying relationship is linear, we will incorporate a noise term to
account for such errors.
Before we can go about searching for the best parameters (or model parameters) w and 𝑏,
we will need two more things: (i) a measure of the quality of some given model; and (ii) a
procedure for updating the model to improve its quality.
Loss Function
Naturally, fitting our model to the data requires that we agree on some measure of fitness
(or, equivalently, of unfitness). Loss functions quantify the distance between the real and
predicted values of the target. The loss will usually be a nonnegative number where smaller
values are better and perfect predictions incur a loss of 0. For regression problems, the most
common loss function is the squared error. When our prediction for an example 𝑖 is 𝑦ˆ (𝑖)
and the corresponding true label is 𝑦 (𝑖) , the squared error is given by:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2. \quad (3.1.5)$$
The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it
cancels out when we take the derivative of the loss. Because the training dataset is given
to us, and thus is out of our control, the empirical error is only a function of the model
parameters. In Fig. 3.1.1, we visualize the fit of a linear regression model in a problem
with one-dimensional inputs.
Fig. 3.1.1 Fitting a linear regression model to one-dimensional data.

Note that large differences between estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$ lead to even larger contributions
to the loss, due to its quadratic form (this quadraticity can be a double-edged sword;
while it encourages the model to avoid large errors it can also lead to excessive sensitivity
to anomalous data). To measure the quality of a model on the entire dataset of 𝑛 examples,
we simply average (or equivalently, sum) the losses on the training set:
$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \quad (3.1.6)$$
When training the model, we seek parameters (w∗ , 𝑏 ∗ ) that minimize the total loss across
all training examples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b). \quad (3.1.7)$$
Analytic Solution
Unlike most of the models that we will cover, linear regression presents us with a surpris-
ingly easy optimization problem. In particular, we can find the optimal parameters (as
assessed on the training data) analytically by applying a simple formula as follows. First,
we can subsume the bias 𝑏 into the parameter w by appending a column to the design ma-
trix consisting of all 1s. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. As long
as the design matrix X has full rank (no feature is linearly dependent on the others), then
there will be just one critical point on the loss surface and it corresponds to the minimum
of the loss over the entire domain. Taking the derivative of the loss with respect to w and
setting it equal to zero yields:

$$\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = 2\mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) = 0 \quad \text{and hence} \quad \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X}\mathbf{w}. \quad (3.1.8)$$

Solving for $\mathbf{w}$ provides us with the optimal solution for the optimization problem:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \quad (3.1.9)$$

Note that this solution will only be unique when the matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, i.e., when the
columns of the design matrix are linearly independent (Golub and Van Loan, 1996).
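As a sanity check (our own illustration, not code from the text), we can verify the solution numerically on synthetic data; for stability we use a linear solve rather than an explicit matrix inverse:

import torch

torch.manual_seed(0)
n, d = 100, 2
X = torch.randn(n, d)
w_true, b_true = torch.tensor([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * torch.randn(n)

# Subsume the bias by appending a column of ones to the design matrix
X1 = torch.cat([X, torch.ones(n, 1)], dim=1)
# Normal equations: solve (X^T X) w = X^T y
w_hat = torch.linalg.solve(X1.T @ X1, X1.T @ y)
print(w_hat)  # approximately [2.0, -3.4, 4.2]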
While simple problems like linear regression may admit analytic solutions, you should
not get used to such good fortune. Although analytic solutions allow for nice mathematical
analysis, the requirement of an analytic solution is so restrictive that it would exclude almost
all exciting aspects of deep learning.
Minibatch Stochastic Gradient Descent
The key technique for optimizing nearly every deep learning model, and which we will
call upon throughout this book, consists of iteratively reducing the error by updating the
parameters in the direction that incrementally lowers the loss function. This algorithm is
called gradient descent.
The most naive application of gradient descent consists of taking the derivative of the loss
function, which is an average of the losses computed on every single example in the dataset.
In practice, this can be extremely slow: we must pass over the entire dataset before making
a single update, even if the update steps might be very powerful (Liu and Nocedal, 1989).
Even worse, if there is a lot of redundancy in the training data, the benefit of a full update
is limited.
The other extreme is to consider only a single example at a time and to take update steps
based on one observation at a time. The resulting algorithm, stochastic gradient descent
(SGD), can be an effective strategy (Bottou, 2010), even for large datasets. Unfortunately,
SGD has drawbacks, both computational and statistical. One problem arises from the fact
that processors are a lot faster multiplying and adding numbers than they are at moving data
from main memory to processor cache. It is up to an order of magnitude more efficient
to perform a matrix–vector multiplication than a corresponding number of vector–vector
operations. This means that it can take a lot longer to process one sample at a time compared
to a full batch. A second problem is that some of the layers, such as batch normalization
(to be described in Section 8.5), only work well when we have access to more than one
observation at a time.
The solution to both problems is to pick an intermediate strategy: rather than taking a full
batch or only a single sample at a time, we take a minibatch of observations (Li et al., 2014).
The specific choice of the size of the said minibatch depends on many factors, such as the
amount of memory, the number of accelerators, the choice of layers, and the total dataset
size. Despite all that, a number between 32 and 256, preferably a multiple of a large power
of 2, is a good start. This leads us to minibatch stochastic gradient descent.
In its most basic form, in each iteration 𝑡, we first randomly sample a minibatch B𝑡 consist-
ing of a fixed number |B| of training examples. We then compute the derivative (gradient)
of the average loss on the minibatch with respect to the model parameters. Finally, we mul-
tiply the gradient by a predetermined small positive value 𝜂, called the learning rate, and
subtract the resulting term from the current parameter values. We can express the update
as follows:
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b). \quad (3.1.10)$$
In summary, minibatch SGD proceeds as follows: (i) initialize the values of the model
parameters, typically at random; (ii) iteratively sample random minibatches from the data,
updating the parameters in the direction of the negative gradient. For quadratic losses and
affine transformations, this has a closed-form expansion:
$$\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned} \quad (3.1.11)$$
Since we pick a minibatch B we need to normalize by its size |B|. Frequently minibatch
size and learning rate are user-defined. Such tunable parameters that are not updated in the
training loop are called hyperparameters. They can be tuned automatically by a number
of techniques, such as Bayesian optimization (Frazier, 2018). In the end, the quality of the
solution is typically assessed on a separate validation dataset (or validation set).
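To make the update equations concrete, here is a minimal sketch of one minibatch SGD step for the squared loss, written directly from (3.1.11) (the dataset and hyperparameters are made up for illustration):

import torch

torch.manual_seed(1)
X = torch.randn(256, 2)                    # a toy dataset
y = X @ torch.tensor([2.0, -3.4]) + 4.2

w, b = torch.zeros(2), 0.0
lr, batch_size = 0.1, 32

idx = torch.randperm(len(X))[:batch_size]  # sample a random minibatch B_t
Xb, yb = X[idx], y[idx]
err = Xb @ w + b - yb                      # residuals w^T x + b - y on the minibatch
w = w - lr / batch_size * Xb.T @ err       # update for the weights
b = b - lr / batch_size * err.sum()        # update for the bias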
After training for some predetermined number of iterations (or until some other stopping
criterion is met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if
our function is truly linear and noiseless, these parameters will not be the exact minimizers
of the loss, nor even deterministic. Although the algorithm converges slowly towards the
minimizers it typically will not find them exactly in a finite number of steps. Moreover,
the minibatches B used for updating the parameters are chosen at random. This breaks
determinism.
Predictions
Given the model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now make predictions for a new example, e.g., pre-
dicting the sales price of a previously unseen house given its area 𝑥1 and age 𝑥2 . Deep
learning practitioners have taken to calling the prediction phase inference but this is a bit of
a misnomer—inference refers broadly to any conclusion reached on the basis of evidence,
including both the values of the parameters and the likely label for an unseen instance. If
anything, in the statistics literature inference more often denotes parameter inference and
this overloading of terminology creates unnecessary confusion when deep learning prac-
titioners talk to statisticians. In the following we will stick to prediction whenever possi-
ble.
3.1.2 Vectorization for Speed

When training our models, we typically want to process whole minibatches of examples si-
multaneously. Doing this efficiently requires that we vectorize the calculations and leverage
fast linear algebra libraries rather than writing costly for-loops in Python.
To see why this matters so much, let’s consider two methods for adding vectors. To start, we
instantiate two 10,000-dimensional vectors containing all 1s. In the first method, we loop
over the vectors with a Python for-loop. In the second, we rely on a single call to +.
n = 10000
a = torch.ones(n)
b = torch.ones(n)
Now we can benchmark the workloads. First, we add them, one coordinate at a time, using
a for-loop.
c = torch.zeros(n)
t = time.time()
for i in range(n):
c[i] = a[i] + b[i]
f'{time.time() - t:.5f} sec'
'0.17802 sec'
t = time.time()
d = a + b
f'{time.time() - t:.5f} sec'
'0.00036 sec'
The second method is dramatically faster than the first. Vectorizing code often yields order-
of-magnitude speedups. Moreover, we push more of the mathematics to the library so we
do not have to write as many calculations ourselves, reducing the potential for errors and
increasing portability of the code.
3.1.3 The Normal Distribution and Squared Loss

As it turns out, the normal distribution and linear regression with squared loss share a deeper
connection than common parentage.
To begin, recall that a normal distribution with mean 𝜇 and variance 𝜎 2 (standard deviation
𝜎) is given as
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right). \quad (3.1.12)$$
Below we define a function to compute the normal distribution.
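A minimal version (consistent with the math and numpy imports above) is:

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)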
Note that changing the mean corresponds to a shift along the 𝑥-axis, and increasing the
variance spreads the distribution out, lowering its peak.
One way to motivate linear regression with squared loss is to assume that observations arise
from noisy measurements, where the noise $\epsilon$ follows the normal distribution $\mathcal{N}(0, \sigma^2)$:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2). \quad (3.1.13)$$
Thus, we can now write out the likelihood of seeing a particular 𝑦 for a given x via
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right). \quad (3.1.14)$$
As such, the likelihood factorizes. According to the principle of maximum likelihood, the
best values of parameters w and 𝑏 are those that maximize the likelihood of the entire
dataset:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^n p(y^{(i)} \mid \mathbf{x}^{(i)}). \quad (3.1.15)$$
The equality follows since all pairs (x (𝑖) , 𝑦 (𝑖) ) were drawn independently of each other. Es-
timators chosen according to the principle of maximum likelihood are called maximum like-
lihood estimators. While maximizing the product of many exponential functions might
look difficult, we can simplify things significantly, without changing the objective, by max-
imizing the logarithm of the likelihood instead. For historical reasons, optimizations are
more often expressed as minimization rather than maximization. So, without changing any-
thing, we can minimize the negative log-likelihood, which we can express as follows:
$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^n \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2. \quad (3.1.16)$$
If we assume that 𝜎 is fixed, we can ignore the first term, because it does not depend on w
or 𝑏. The second term is identical to the squared error loss introduced earlier, except for
the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$ either. It
follows that minimizing the mean squared error is equivalent to the maximum likelihood
estimation of a linear model under the assumption of additive Gaussian noise.
3.1.4 Linear Regression as a Neural Network

Fig. 3.1.2 depicts linear regression as a neural network. The diagram highlights the con-
nectivity pattern, such as how each input is connected to the output, but not the specific
values taken by the weights or biases.
Fig. 3.1.2 Linear regression is a single-layer neural network.
The inputs are 𝑥1 , . . . , 𝑥 𝑑 . We refer to 𝑑 as the number of inputs or the feature dimensional-
ity in the input layer. The output of the network is 𝑜 1 . Because we are just trying to predict
a single numerical value, we have only one output neuron. Note that the input values are all
given. There is just a single computed neuron. In summary, we can think of linear regres-
sion as a single-layer fully connected neural network. We will encounter networks with far
more layers in later chapters.
Biology
Because linear regression predates computational neuroscience, it might seem anachro-
nistic to describe linear regression in terms of neural networks. Nonetheless, they were a
natural place to start when the cyberneticists and neurophysiologists Warren McCulloch
and Walter Pitts began to develop models of artificial neurons. Consider the cartoonish
picture of a biological neuron in Fig. 3.1.3, consisting of dendrites (input terminals), the
nucleus (CPU), the axon (output wire), and the axon terminals (output terminals), enabling
connections to other neurons via synapses.
Fig. 3.1.3 The real neuron (source: “Anatomy and Physiology” by the US National Cancer
Institute’s Surveillance, Epidemiology and End Results (SEER) Program).
Information 𝑥𝑖 arriving from other neurons (or environmental sensors) is received in the
dendrites. In particular, that information is weighted by synaptic weights 𝑤 𝑖 , determining
the effect of the inputs, e.g., activation or inhibition via the product 𝑥 𝑖 𝑤 𝑖 . The weighted
inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum
$y = \sum_i x_i w_i + b$, possibly subject to some nonlinear postprocessing via a function $\sigma(y)$. This
information is then sent via the axon to the axon terminals, where it reaches its destination
(e.g., an actuator such as a muscle) or it is fed into another neuron via its dendrites.
Certainly, the high-level idea that many such units could be combined, provided they have
the correct connectivity and learning algorithm, to produce far more interesting and com-
plex behavior than any one neuron alone could express arises from our study of real bi-
ological neural systems. At the same time, most research in deep learning today draws
inspiration from a much wider source. We invoke Russell and Norvig (2016) who pointed
out that although airplanes might have been inspired by birds, ornithology has not been
the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in
deep learning these days comes in equal or greater measure from mathematics, linguistics,
psychology, statistics, computer science, and many other fields.
3.1.5 Summary
In this section, we introduced traditional linear regression, where the parameters of a linear
function are chosen to minimize squared loss on the training set. We also motivated this
choice of objective both via some practical considerations and through an interpretation
of linear regression as maximum likelihood estimation under an assumption of linearity
and Gaussian noise. After discussing both computational considerations and connections to
statistics, we showed how such linear models could be expressed as simple neural networks
where the inputs are directly wired to the output(s). While we will soon move past linear
models altogether, they are sufficient to introduce most of the components that all of our
models require: parametric forms, differentiable objectives, optimization via minibatch
stochastic gradient descent, and ultimately, evaluation on previously unseen data.
3.1.6 Exercises
1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such
that $\sum_i (x_i - b)^2$ is minimized.
1. Find an analytic solution for the optimal value of 𝑏.
2. How does this problem and its solution relate to the normal distribution?
3. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i - b|$? Can you find the optimal
solution for $b$?
2. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear
functions on $(\mathbf{x}, 1)$.
3. Assume that you want to find quadratic functions of $\mathbf{x}$, i.e., $f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_i x_j$.
How would you formulate this in a deep network?
4. Recall that one of the conditions for the linear regression problem to be solvable was
that the design matrix $\mathbf{X}^\top \mathbf{X}$ has full rank.
1. What happens if this is not the case?
2. How could you fix it? What happens if you add a small amount of coordinate-wise
independent Gaussian noise to all entries of X?
3. What is the expected value of the design matrix $\mathbf{X}^\top \mathbf{X}$ in this case?
4. What happens with stochastic gradient descent when $\mathbf{X}^\top \mathbf{X}$ does not have full rank?
5. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution.
That is, $p(\epsilon) = \frac{1}{2}\exp(-|\epsilon|)$.
1. Write out the negative log-likelihood of the data under the model − log 𝑃(y | X).
2. Can you find a closed form solution?
3. Suggest a minibatch stochastic gradient descent algorithm to solve this problem.
What could possibly go wrong (hint: what happens near the stationary point as we
keep on updating the parameters)? Can you fix this?
6. Assume that we want to design a neural network with two layers by composing two
linear layers. That is, the output of the first layer becomes the input of the second layer.
Why would such a naive composition not work?
7. What happens if you want to use regression for realistic price estimation of houses or
stock prices?
1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we
have negative prices? What about fluctuations?
2. Why would regression to the logarithm of the price be much better, i.e., $y = \log(\mathrm{price})$?
3. What do you need to worry about when dealing with penny stocks, i.e., stocks with very
low prices? Hint: can you trade at all possible prices? Why is this a bigger problem
for cheap stocks? For more information review the celebrated Black–Scholes model
for option pricing (Black and Scholes, 1973).
8. Suppose we want to use regression to estimate the number of apples sold in a grocery
store.
1. What are the problems with a Gaussian additive noise model? Hint: you are selling
apples, not oil.
3.2 Object-Oriented Design for Implementation

import time
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
3.2.1 Utilities
We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One
of the challenges is that class definitions tend to be fairly long blocks of code. Notebook
readability demands short code fragments, interspersed with explanations, a requirement
incompatible with the style of programming common for Python libraries. The first utility
function allows us to register functions as methods in a class after the class has been created.
In fact, we can do so even after we have created instances of the class! It allows us to split
the implementation of a class into multiple code blocks.
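A minimal implementation (a sketch matching how add_to_class is used below) registers the decorated function as an attribute of the class:

def add_to_class(Class):  #@save
    """Register functions as methods in a created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper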
Let’s have a quick look at how to use it. We plan to implement a class A with a method do.
Instead of having code for both A and do in the same code block, we can first declare the
class A and create an instance a.
class A:
def __init__(self):
self.b = 1
a = A()
Next we define the method do as we normally would, but not in class A’s scope. Instead,
we decorate this method by add_to_class with class A as its argument. In doing so, the
method is able to access the member variables of A just as we would expect had it been
included as part of A’s definition. Let’s see what happens when we invoke it for the instance
a.
@add_to_class(A)
def do(self):
print('Class attribute "b" is', self.b)
a.do()
The second one is a utility class that saves all arguments in a class’s __init__ method
as class attributes. This allows us to extend constructor call signatures implicitly without
additional code.
We defer its implementation into Section B.7. To use it, we define our class that inherits
from HyperParameters and calls save_hyperparameters in the __init__ method.
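For example (a sketch whose class name and arguments are illustrative, chosen to match the output below):

class B(d2l.HyperParameters):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])
        print('self.a =', self.a, 'self.b =', self.b)
        print('There is no self.c =', not hasattr(self, 'c'))

b = B(a=1, b=2, c=True)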
self.a = 1 self.b = 2
There is no self.c = True
The final utility allows us to plot experiment progress interactively while it is going on.
In deference to the much more powerful (and complex) TensorBoard we name it
ProgressBoard. The implementation is deferred to Section B.7. For now, let's simply see it
in action.
The draw method plots a point (x, y) in the figure, with label specified in the legend.
The optional every_n smooths the line by only showing 1/𝑛 points in the figure. Their
values are averaged from the 𝑛 neighbor points in the original figure.
In the following example, we draw sin and cos with a different smoothness. If you run this
code block, you will see the lines grow in animation.
board = d2l.ProgressBoard('x')
for x in np.arange(0, 10, 0.1):
board.draw(x, np.sin(x), 'sin', every_n=2)
board.draw(x, np.cos(x), 'cos', every_n=10)
3.2.2 Models
The Module class is the base class of all models we will implement. At the very least
we need three methods. The first, __init__, stores the learnable parameters; the
training_step method accepts a data batch and returns the loss value; and finally, configure_optimizers
returns the optimization method, or a list of them, that is used to update the learnable
parameters. Optionally we can define validation_step to report the evaluation measures.
Sometimes we put the code for computing the output into a separate forward method to
make it more reusable.
def configure_optimizers(self):
raise NotImplementedError
You may notice that Module is a subclass of nn.Module, the base class of neural networks
in PyTorch. It provides convenient features for handling neural networks. For example, if
we define a forward method, such as forward(self, X), then for an instance a we can
invoke this method by a(X). This works since it calls the forward method in the built-in
__call__ method. You can find more details and examples about nn.Module in Section
6.1.
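For instance (a toy class of our own, not from the text):

class Doubler(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return 2 * X

a = Doubler()
print(a(torch.ones(3)))  # tensor([2., 2., 2.]); __call__ dispatches to forward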
3.2.3 Data
The DataModule class is the base class for data. Quite frequently the __init__ method is
used to prepare the data. This includes downloading and preprocessing if needed. The
train_dataloader returns the data loader for the training dataset. A data loader is a
(Python) generator that yields a data batch each time it is used. This batch is then fed
into the training_step method of Module to compute the loss. There is an optional
val_dataloader to return the validation dataset loader. It behaves in the same manner,
except that it yields data batches for the validation_step method in Module.
def train_dataloader(self):
return self.get_dataloader(train=True)
def val_dataloader(self):
return self.get_dataloader(train=False)
3.2.4 Training
The Trainer class trains the learnable parameters in the Module class with data specified
in DataModule. The key method is fit, which accepts two arguments: model, an instance
of Module, and data, an instance of DataModule. It then iterates over the entire dataset
max_epochs times to train the model. As before, we will defer the implementation of this
method to later chapters.
def fit_epoch(self):
raise NotImplementedError
3.2.5 Summary
To highlight the object-oriented design for our future deep learning implementation, the
above classes simply show how their objects store data and interact with each other. We
will keep enriching implementations of these classes, such as via @add_to_class, in the
rest of the book. Moreover, these fully implemented classes are saved in the D2L library,
a lightweight toolkit that makes structured modeling for deep learning easy. In particular,
it facilitates reusing many components between projects without changing much at all. For
instance, we can replace just the optimizer, just the model, just the dataset, etc.; this degree
of modularity pays dividends throughout the book in terms of conciseness and simplicity
(this is why we added it) and it can do the same for your own projects.
3.2.6 Exercises
1. Locate the full implementations of the above classes that are saved in the D2L library.
We strongly recommend that you look at the implementation in detail once you have
gained some more familiarity with deep learning modeling.
2. Remove the save_hyperparameters statement in the B class. Can you still print self.a
and self.b? Optional: if you have dived into the full implementation of the HyperPa-
rameters class, can you explain why?
Discussions.
3.3 Synthetic Regression Data

Machine learning is all about extracting information from data. So you might wonder,
what could we possibly learn from synthetic data? While we might not care intrinsically
about the patterns that we ourselves baked into an artificial data generating model, such
datasets are nevertheless useful for didactic purposes, helping us to evaluate the properties
of our learning algorithms and to confirm that our implementations work as expected. For
example, if we create data for which the correct parameters are known a priori, then we
can check that our model can in fact recover them.
%matplotlib inline
import random
import torch
from d2l import torch as d2l
3.3.1 Generating the Dataset

We generate each label by applying a ground truth linear function to the features and
corrupting the result with additive noise $\boldsymbol{\epsilon}$, sampled independently for each example:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}. \quad (3.3.1)$$
For convenience we assume that 𝝐 is drawn from a normal distribution with mean 𝜇 = 0
and standard deviation 𝜎 = 0.01. Note that for object-oriented design we add the code to
the __init__ method of a subclass of d2l.DataModule (introduced in Section 3.2.3). It is
good practice to allow the setting of any additional hyperparameters. We accomplish this
with save_hyperparameters(). The batch_size will be determined later.
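A minimal sketch of this class (mirroring the d2l version; defaults such as noise=0.01 follow the surrounding text, the remaining defaults are illustrative) is:

class SyntheticRegressionData(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise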
Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$. Later, we can check our
estimated parameters against these ground truth values.
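In code:

data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)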
Each row in features consists of a vector in R2 and each row in labels is a scalar. Let’s
have a look at the first entry.
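For instance:

print('features:', data.X[0], '\nlabel:', data.y[0])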
3.3.2 Reading the Dataset

To read the data we implement the get_dataloader method below, registering it in the
SyntheticRegressionData class via add_to_class (introduced in Section 3.2.1).

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
if train:
indices = list(range(0, self.num_train))
# The examples are read in random order
random.shuffle(indices)
else:
indices = list(range(self.num_train, self.num_train+self.num_val))
for i in range(0, len(indices), self.batch_size):
batch_indices = torch.tensor(indices[i: i+self.batch_size])
yield self.X[batch_indices], self.y[batch_indices]
To build some intuition, let’s inspect the first minibatch of data. Each minibatch of fea-
tures provides us with both its size and the dimensionality of input features. Likewise, our
minibatch of labels will have a matching shape given by batch_size.
101 Synthetic Regression Data
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
Throughout the iteration we obtain distinct minibatches until the entire dataset has been
exhausted (try this). While the iteration implemented above is good for didactic purposes,
it is inefficient in ways that might get us into trouble with real problems. For example, it
requires that we load all the data in memory and that we perform lots of random memory
access. The built-in iterators implemented in a deep learning framework are considerably
more efficient and they can deal with sources such as data stored in files, data received via
a stream, and data generated or processed on the fly. Next let’s try to implement the same
method using built-in iterators.
@d2l.add_to_class(d2l.DataModule) #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
tensors = tuple(a[indices] for a in tensors)
dataset = torch.utils.data.TensorDataset(*tensors)
return torch.utils.data.DataLoader(dataset, self.batch_size,
shuffle=train)
@d2l.add_to_class(SyntheticRegressionData) #@save
def get_dataloader(self, train):
i = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader((self.X, self.y), train, i)
The new data loader behaves just like the previous one, except that it is more efficient and
has some added functionality.
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
102 Linear Neural Networks for Regression
For instance, the data loader provided by the framework API supports the built-in __len__
method, so we can query its length, i.e., the number of batches.
len(data.train_dataloader())
32
3.3.4 Summary
Data loaders are a convenient way of abstracting out the process of loading and manipu-
lating data. This way the same machine learning algorithm is capable of processing many
different types and sources of data without the need for modification. One of the nice things
about data loaders is that they can be composed. For instance, we might be loading images
and then have a postprocessing filter that crops them or modifies them in other ways. As
such, data loaders can be used to describe an entire data processing pipeline.
As for the model itself, the two-dimensional linear model is about the simplest we might
encounter. It lets us test out the accuracy of regression models without worrying about
having insufficient amounts of data or an underdetermined system of equations. We will
put this to good use in the next section.
3.3.5 Exercises
1. What will happen if the number of examples cannot be divided by the batch size? How
would you change this behavior by specifying a different argument using the framework's
API?
2. Suppose that we want to generate a huge dataset, where both the size of the parameter
vector w and the number of examples num_examples are large.
2. How would you shuffle the data if it is held on disk? Your task is to design an efficient
algorithm that does not require too many random reads or writes. Hint: pseudorandom
permutation generators allow you to design a reshuffle without the need to
store the permutation table explicitly (Naor and Reingold, 1999).
3. Implement a data generator that produces new data on the fly, every time the iterator is
called.
4. How would you design a random data generator that generates the same data each time
it is called?
Discussions.
3.4 Linear Regression Implementation from Scratch
We are now ready to work through a fully functioning implementation of linear regression.
In this section, we will implement the entire method from scratch, including (i) the model;
(ii) the loss function; (iii) a minibatch stochastic gradient descent optimizer; and (iv) the
training function that stitches all of these pieces together. Finally, we will run our synthetic
data generator from Section 3.3 and apply our model on the resulting dataset. While modern
deep learning frameworks can automate nearly all of this work, implementing things from
scratch is the only way to make sure that you really know what you are doing. Moreover,
when it is time to customize models, defining our own layers or loss functions, understand-
ing how things work under the hood will prove handy. In this section, we will rely only
on tensors and automatic differentiation. Later, we will introduce a more concise imple-
mentation, taking advantage of the bells and whistles of deep learning frameworks while
retaining the structure of what follows below.
%matplotlib inline
import torch
from d2l import torch as d2l
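Before relating inputs to outputs we need the model's learnable parameters. A minimal sketch of the class holding them (mirroring the d2l version; weights are drawn from a normal distribution with standard deviation sigma=0.01 and the bias is set to zero) is:

class LinearRegressionScratch(d2l.Module):  #@save
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = torch.zeros(1, requires_grad=True)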
Next we must define our model, relating its input and parameters to its output. Using the
same notation as (3.1.4) for our linear model we simply take the matrix–vector product of
the input features X and the model weights w, and add the offset 𝑏 to each example. The
product Xw is a vector and 𝑏 is a scalar. Because of the broadcasting mechanism (see
Section 2.1.4), when we add a vector and a scalar, the scalar is added to each component of
the vector. The resulting forward method is registered in the LinearRegressionScratch
class via add_to_class (introduced in Section 3.2.1).
@d2l.add_to_class(LinearRegressionScratch) #@save
def forward(self, X):
return torch.matmul(X, self.w) + self.b
@d2l.add_to_class(LinearRegressionScratch) #@save
def loss(self, y_hat, y):
l = (y_hat - y) ** 2 / 2
return l.mean()
Next we define the optimizer. Minibatch SGD updates each parameter in the direction of
its negative gradient and can reset all gradients to zero before each step:

class SGD(d2l.HyperParameters):  #@save
    """Minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.save_hyperparameters()

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()
We next define the configure_optimizers method, which returns an instance of the SGD
class.
@d2l.add_to_class(LinearRegressionScratch) #@save
def configure_optimizers(self):
return SGD([self.w, self.b], self.lr)
3.4.4 Training
Now that we have all of the parts in place (parameters, loss function, model, and optimizer),
we are ready to implement the main training loop. It is crucial that you understand this
code fully since you will employ similar training loops for every other deep learning model
covered in this book. In each epoch, we iterate through the entire training dataset, passing
once through every example (assuming that the number of examples is divisible by the
batch size). In each iteration, we grab a minibatch of training examples, and compute its
loss through the model’s training_step method. Then we compute the gradients with
respect to each parameter. Finally, we will call the optimization algorithm to update the
model parameters. In summary, we will execute the following loop:
• Initialize parameters (w, 𝑏)
• Repeat until done
– Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
– Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
Recall that the synthetic regression dataset that we generated in Section 3.3 does not provide
a validation dataset. In most cases, however, we will want a validation dataset to measure
our model quality. Here we pass the validation dataloader once in each epoch to mea-
sure the model performance. Following our object-oriented design, the prepare_batch
and fit_epoch methods are registered in the d2l.Trainer class (introduced in Section
3.2.4).
@d2l.add_to_class(d2l.Trainer) #@save
def prepare_batch(self, batch):
return batch
@d2l.add_to_class(d2l.Trainer) #@save
def fit_epoch(self):
self.model.train()
for batch in self.train_dataloader:
loss = self.model.training_step(self.prepare_batch(batch))
self.optim.zero_grad()
with torch.no_grad():
loss.backward()
if self.gradient_clip_val > 0: # To be discussed later
self.clip_gradients(self.gradient_clip_val, self.model)
self.optim.step()
self.train_batch_idx += 1
if self.val_dataloader is None:
return
self.model.eval()
for batch in self.val_dataloader:
with torch.no_grad():
self.model.validation_step(self.prepare_batch(batch))
self.val_batch_idx += 1
We are almost ready to train the model, but first we need some training data. Here we use
the SyntheticRegressionData class and pass in some ground truth parameters. Then
we train our model with the learning rate lr=0.03 and set max_epochs=3. Note that in
general, both the number of epochs and the learning rate are hyperparameters. In general,
setting hyperparameters is tricky and we will usually want to use a three-way split, one
set for training, a second for hyperparameter selection, and the third reserved for the final
evaluation. We elide these details for now but will revise them later.
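Concretely (a sketch consistent with the values just stated):

model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)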
Because we synthesized the dataset ourselves, we know precisely what the true parameters
are. Thus, we can evaluate our success in training by comparing the true parameters with
those that we learned through our training loop. Indeed they turn out to be very close to
each other.
with torch.no_grad():
print(f'error in estimating w: {data.w - model.w.reshape(data.w.shape)}')
print(f'error in estimating b: {data.b - model.b}')
We should not take the ability to exactly recover the ground truth parameters for granted.
In general, for deep models unique solutions for the parameters do not exist, and even
for linear models, exactly recovering the parameters is only possible when no feature is
linearly dependent on the others. However, in machine learning, we are often less concerned
with recovering true underlying parameters, but rather with parameters that lead to highly
accurate prediction (Vapnik, 1992). Fortunately, even on difficult optimization problems,
stochastic gradient descent can often find remarkably good solutions, owing partly to the
fact that, for deep networks, there exist many configurations of the parameters that lead to
highly accurate prediction.
3.4.5 Summary
In this section, we took a significant step towards designing deep learning systems by im-
plementing a fully functional neural network model and training loop. In this process, we
built a data loader, a model, a loss function, an optimization procedure, and a visualization
and monitoring tool. We did this by composing a Python object that contains all relevant
components for training a model. While this is not yet a professional-grade implementation
it is perfectly functional and code like this could already help you to solve small problems
quickly. In the coming sections, we will see how to do this both more concisely (avoiding
boilerplate code) and more efficiently (using our GPUs to their full potential).
3.4.6 Exercises
1. What would happen if we were to initialize the weights to zero. Would the algorithm
still work? What if we initialized the parameters with variance 1000 rather than 0.01?
2. Assume that you are Georg Simon Ohm trying to come up with a model for resistance
that relates voltage and current. Can you use automatic differentiation to learn the
parameters of your model?
3. Can you use Planck's Law to determine the temperature of an object using spectral
energy density? For reference, the spectral density $B$ of radiation emanating from a
black body is $B(\lambda, T) = \frac{2hc^2}{\lambda^5} \cdot \left(\exp\left(\frac{hc}{\lambda k T}\right) - 1\right)^{-1}$. Here $\lambda$ is the wavelength, $T$ is the
temperature, $c$ is the speed of light, $h$ is Planck's constant, and $k$ is the Boltzmann
constant. You measure the energy for different wavelengths $\lambda$ and you now need to fit
the spectral density curve to Planck's law.
4. What are the problems you might encounter if you wanted to compute the second deriva-
tives of the loss? How would you fix them?
5. Why is the reshape method needed in the loss function?
6. Experiment using different learning rates to find out how quickly the loss function value
drops. Can you reduce the error by increasing the number of epochs of training?
7. If the number of examples cannot be divided by the batch size, what happens to data_iter
at the end of an epoch?
8. Try implementing a different loss function, such as the absolute value loss (y_hat -
d2l.reshape(y, y_hat.shape)).abs().sum().
2. Check whether there is a difference in behavior if you actively perturb some entries,
such as 𝑦 5 = 10000, of y.
3. Can you think of a cheap solution for combining the best aspects of squared loss and
absolute value loss? Hint: how can you avoid really large gradient values?
9. Why do we need to reshuffle the dataset? Can you design a case where a maliciously
constructed dataset would break the optimization algorithm otherwise?
Discussions.
3.5 Concise Implementation of Linear Regression

Deep learning has witnessed a sort of Cambrian explosion over the past decade. The sheer
number of techniques, applications and algorithms by far surpasses the progress of pre-
vious decades. This is due to a fortuitous combination of multiple factors, one of which
is the powerful free tools offered by a number of open-source deep learning frameworks.
Theano (Bergstra et al., 2010), DistBelief (Dean et al., 2012), and Caffe (Jia et al., 2014)
arguably represent the first generation of such models that found widespread adoption.
In contrast to earlier (seminal) works like SN2 (Simulateur Neuristique) (Bottou and Le
Cun, 1988), which provided a Lisp-like programming experience, modern frameworks of-
fer automatic differentiation and the convenience of Python. These frameworks allow us
to automate and modularize the repetitive work of implementing gradient-based learning
algorithms.
In Section 3.4, we relied only on (i) tensors for data storage and linear algebra; and (ii)
automatic differentiation for calculating gradients. In practice, because data iterators, loss
functions, optimizers, and neural network layers are so common, modern libraries imple-
ment these components for us as well. In this section, we will show you how to implement
the linear regression model from Section 3.4 concisely by using high-level APIs of deep
learning frameworks.
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
3.5.1 Defining the Model
For standard operations, we can use a framework’s predefined layers, which allow us to
focus on the layers used to construct the model rather than worrying about their implemen-
tation. Recall the architecture of a single-layer network as described in Fig. 3.1.2. The
layer is called fully connected, since each of its inputs is connected to each of its outputs
by means of a matrix–vector multiplication.
In PyTorch, the fully connected layer is defined in Linear and LazyLinear classes (avail-
able since version 1.8.0). The latter allows users to specify merely the output dimension,
while the former additionally asks for how many inputs go into this layer. Specifying input
shapes is inconvenient and may require nontrivial calculations (such as in convolutional
layers). Thus, for simplicity, we will use such “lazy” layers whenever we can.
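A minimal sketch of the model class (mirroring the d2l version; the parameters are initialized in place with methods ending in _, as discussed in the summary below) is:

class LinearRegression(d2l.Module):  #@save
    """The linear regression model implemented with high-level APIs."""
    def __init__(self, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.LazyLinear(1)
        self.net.weight.data.normal_(0, 0.01)
        self.net.bias.data.fill_(0)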
In the forward method we just invoke the built-in __call__ method of the predefined
layers to compute the outputs.
@d2l.add_to_class(LinearRegression) #@save
def forward(self, X):
return self.net(X)
3.5.2 Defining the Loss Function

The MSELoss class computes the mean squared error (without the $\frac{1}{2}$ factor in (3.1.5));
by default it returns the average loss over the examples.

@d2l.add_to_class(LinearRegression) #@save
def loss(self, y_hat, y):
fn = nn.MSELoss()
return fn(y_hat, y)
3.5.3 Defining the Optimization Algorithm

Minibatch SGD is a standard tool for optimizing neural networks, and PyTorch supports it
alongside a number of variations on this algorithm in the optim module.

@d2l.add_to_class(LinearRegression) #@save
def configure_optimizers(self):
return torch.optim.SGD(self.parameters(), self.lr)
3.5.4 Training
You might have noticed that expressing our model through high-level APIs of a deep learn-
ing framework requires fewer lines of code. We did not have to allocate parameters indi-
vidually, define our loss function, or implement minibatch SGD. Once we start working
with much more complex models, the advantages of the high-level API will grow consid-
erably.
Now that we have all the basic pieces in place, the training loop itself is the same as the
one we implemented from scratch. So we just call the fit method (introduced in Section
3.2.4), which relies on the implementation of the fit_epoch method in Section 3.4, to train
our model.
model = LinearRegression(lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)
Below, we compare the model parameters learned by training on finite data and the actual parameters that generated our dataset. To access the parameters, we read out the weight and bias of the layer we need. As in our implementation from scratch, note that our estimated parameters are close to their true counterparts.
@d2l.add_to_class(LinearRegression)  #@save
def get_w_b(self):
    return (self.net.weight.data, self.net.bias.data)
w, b = model.get_w_b()
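To make the comparison concrete, we can subtract the estimates from the generating parameters. The snippet below is a sketch (the exact printout is not shown in this excerpt), assuming, as in the d2l library, that the synthetic data object exposes the generating w and b as attributes:

print(f'error in estimating w: {data.w - w.reshape(data.w.shape)}')
print(f'error in estimating b: {data.b - b}')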
3.5.5 Summary
This section contains the first implementation of a deep network (in this book) to tap into
the conveniences afforded by modern deep learning frameworks, such as MXNet (Chen
et al., 2015), JAX (Frostig et al., 2018), PyTorch (Paszke et al., 2019), and TensorFlow
(Abadi et al., 2016). We used framework defaults for loading data, defining a layer, a loss
function, an optimizer and a training loop. Whenever the framework provides all necessary
features, it is generally a good idea to use them, since the library implementations of these
components tend to be heavily optimized for performance and properly tested for reliability.
At the same time, try not to forget that these modules can be implemented directly. This is
especially important for aspiring researchers who wish to live on the leading edge of model
development, where you will be inventing new components that cannot possibly exist in
any current library.
In PyTorch, the data module provides tools for data processing, and the nn module defines a large number of neural network layers and common loss functions. We can initialize parameters in place using methods whose names end with an underscore (_), such as normal_ and fill_. Note that we need to specify the input dimensions of the network. While this is trivial for now, it can have significant knock-on effects when we want to design complex networks with many layers. Careful consideration of how to parametrize these networks is needed to allow portability.
3.5.6 Exercises
1. How would you need to change the learning rate if you replace the aggregate loss over
the minibatch with an average over the loss on the minibatch?
2. Review the framework documentation to see which loss functions are provided. In par-
ticular, replace the squared loss with Huber’s robust loss function. That is, use the loss
function
$$l(y, y') = \begin{cases} |y - y'| - \frac{\sigma}{2} & \text{if } |y - y'| > \sigma, \\ \frac{1}{2\sigma} (y - y')^2 & \text{otherwise.} \end{cases} \qquad (3.5.1)$$
4. What is the effect on the solution if you change the learning rate and the number of
epochs? Does it keep on improving?
5. How does the solution change as you vary the amount of data generated?
   1. Plot the estimation error for ŵ − w and b̂ − 𝑏 as a function of the amount of data.
      Hint: increase the amount of data logarithmically rather than linearly, i.e., 5, 10, 20,
      50, …, 10,000 rather than 1000, 2000, …, 10,000.
   2. Why is the suggestion in the hint appropriate?
Discussions.
3.6 Generalization
Consider two college students diligently preparing for their final exam. Commonly, this
preparation will consist of practicing and testing their abilities by taking exams adminis-
tered in previous years. Nonetheless, doing well on past exams is no guarantee that they will
excel when it matters. For instance, imagine a student, Extraordinary Ellie, whose prepara-
tion consisted entirely of memorizing the answers to previous years’ exam questions. Even
if Ellie were endowed with an extraordinary memory, and thus could perfectly recall the an-
swer to any previously seen question, she might nevertheless freeze when faced with a new
(previously unseen) question. By comparison, imagine another student, Inductive Irene,
with comparably poor memorization skills, but a knack for picking up patterns. Note that
if the exam truly consisted of recycled questions from a previous year, Ellie would handily
outperform Irene. Even if Irene’s inferred patterns yielded 90% accurate predictions, they
could never compete with Ellie’s 100% recall. However, even if the exam consisted entirely
of fresh questions, Irene might maintain her 90% average.
As machine learning scientists, our goal is to discover patterns. But how can we be sure that
we have truly discovered a general pattern and not simply memorized our data? Most of the
time, our predictions are only useful if our model discovers such a pattern. We do not want
to predict yesterday’s stock prices, but tomorrow’s. We do not need to recognize already
diagnosed diseases for previously seen patients, but rather previously undiagnosed ailments
in previously unseen patients. This problem—how to discover patterns that generalize—is
the fundamental problem of machine learning, and arguably of all of statistics. We might
cast this problem as just one slice of a far grander question that engulfs all of science:
when are we ever justified in making the leap from particular observations to more general
statements?
In real life, we must fit our models using a finite collection of data. The typical scales
of that data vary wildly across domains. For many important medical problems, we can
only access a few thousand data points. When studying rare diseases, we might be lucky to
access hundreds. By contrast, the largest public datasets consisting of labeled photographs,
e.g., ImageNet (Deng et al., 2009), contain millions of images. And some unlabeled image
collections such as the Flickr YFCC100M dataset can be even larger, containing over 100
million images (Thomee et al., 2016). However, even at this extreme scale, the number of
available data points remains infinitesimally small compared to the space of all possible
images at a megapixel resolution. Whenever we work with finite samples, we must keep in
mind the risk that we might fit our training data well, only to find that we have failed to discover a generalizable pattern.
The phenomenon of fitting closer to our training data than to the underlying distribution is
called overfitting, and techniques for combatting overfitting are often called regularization
methods. While it is no substitute for a proper introduction to statistical learning theory
(see Boucheron et al. (2005), Vapnik (1998)), we will give you just enough intuition to get
going. We will revisit generalization in many chapters throughout the book, exploring both
what is known about the principles underlying generalization in various models, and also
heuristic techniques that have been found (empirically) to yield improved generalization on
tasks of practical interest.
To begin with, we need to differentiate between the training error 𝑅emp , which is a statistic
calculated on the training dataset, and the generalization error 𝑅, which is an expectation
taken with respect to the underlying distribution. You can think of the generalization error
as what you would see if you applied your model to an infinite stream of additional data
examples drawn from the same underlying data distribution. Formally the training error is
expressed as a sum (with the same notation as Section 3.1):

$$R_\text{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^{n} l(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})), \qquad (3.6.1)$$

while the generalization error is an expectation over the underlying density 𝑝(x, 𝑦):

$$R[p, f] = E_{(\mathbf{x}, y) \sim P}\left[ l(\mathbf{x}, y, f(\mathbf{x})) \right] = \int \int l(\mathbf{x}, y, f(\mathbf{x}))\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy. \qquad (3.6.2)$$
Problematically, we can never calculate the generalization error 𝑅 exactly. Nobody ever
tells us the precise form of the density function 𝑝(x, 𝑦). Moreover, we cannot sample
an infinite stream of data points. Thus, in practice, we must estimate the generalization error by applying our model to an independent test set constituted of a random selection of examples X′ and labels y′ that were withheld from our training set. This consists of applying the same formula that was used for calculating the empirical training error, but to the test set X′, y′.
Crucially, when we evaluate our classifier on the test set, we are working with a fixed classi-
fier (it does not depend on the sample of the test set), and thus estimating its error is simply
the problem of mean estimation. However, the same cannot be said for the training set. Note
that the model we wind up with depends explicitly on the selection of the training set and
thus the training error will in general be a biased estimate of the true error on the underly-
ing population. The central question of generalization is then: when should we expect our training error to be close to the population error (and thus the generalization error)?
Model Complexity
In classical theory, when we have simple models and abundant data, the training and gen-
eralization errors tend to be close. However, when we work with more complex models
and/or fewer examples, we expect the training error to go down but the generalization gap
to grow. This should not be surprising. Imagine a model class so expressive that for any
dataset of 𝑛 examples, we can find a set of parameters that can perfectly fit arbitrary labels,
even if randomly assigned. In this case, even if we fit our training data perfectly, how can
we conclude anything about the generalization error? For all we know, our generalization
error might be no better than random guessing.
In general, absent any restriction on our model class, we cannot conclude, based on fitting
the training data alone, that our model has discovered any generalizable pattern (Vapnik et
al., 1994). On the other hand, if our model class was not capable of fitting arbitrary labels,
then it must have discovered a pattern. Learning-theoretic ideas about model complexity
derived some inspiration from the ideas of Karl Popper, an influential philosopher of sci-
ence, who formalized the criterion of falsifiability. According to Popper, a theory that can
explain any and all observations is not a scientific theory at all! After all, what has it told us
about the world if it has not ruled out any possibility? In short, what we want is a hypothesis
that could not explain any observations we might conceivably make and yet nevertheless
happens to be compatible with those observations that we in fact make.
Now what precisely constitutes an appropriate notion of model complexity is a complex
matter. Often, models with more parameters are able to fit a greater number of arbitrarily
assigned labels. However, this is not necessarily true. For instance, kernel methods operate
in spaces with infinite numbers of parameters, yet their complexity is controlled by other
means (Schölkopf and Smola, 2002). One notion of complexity that often proves useful
is the range of values that the parameters can take. Here, a model whose parameters are
permitted to take arbitrary values would be more complex. We will revisit this idea in the
next section, when we introduce weight decay, your first practical regularization technique.
Notably, it can be difficult to compare complexity among members of substantially different
model classes (say, decision trees vs. neural networks).
At this point, we must stress another important point that we will revisit when introducing
deep neural networks. When a model is capable of fitting arbitrary labels, low training
error does not necessarily imply low generalization error. However, it does not necessarily imply high generalization error either! All we can say with confidence is that low training
error alone is not enough to certify low generalization error. Deep neural networks turn
out to be just such models: while they generalize well in practice, they are too powerful
to allow us to conclude much on the basis of training error alone. In these cases we must
rely more heavily on our holdout data to certify generalization after the fact. Error on the
holdout data, i.e., validation set, is called the validation error.
3.6.2 Underfitting or Overfitting?

Given a scalar feature 𝑥, consider fitting a polynomial of degree 𝑑,

$$\hat{y} = \sum_{i=0}^{d} x^i w_i,$$

for estimating the label 𝑦. This is just a linear regression problem where our features are given by the powers of 𝑥, the model’s weights are given by 𝑤ᵢ, and the bias is given by 𝑤₀ since 𝑥⁰ = 1 for all 𝑥. We can therefore use the squared error as our loss function.
A higher-order polynomial function is more complex than a lower-order polynomial function, since the higher-order polynomial has more parameters and can select from a wider range of functions. Fixing the training dataset, higher-order polynomial functions should always achieve lower (at worst, equal) training error relative to lower-degree polynomials.
In fact, whenever each data example has a distinct value of 𝑥, a polynomial function with degree equal to the number of data examples can fit the training set perfectly. We compare the relationship between polynomial degree (model complexity) and both underfitting and overfitting in Fig. 3.6.1.
Fig. 3.6.1 Influence of model complexity on underfitting and overfitting.
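To make the reduction to linear regression concrete, here is a minimal sketch (the helper and the degree are illustrative, not from the text) that builds the powers of 𝑥 as features and fits the weights by ordinary least squares:

import torch

def poly_features(x, degree):
    # Column i holds x**i; the constant column x**0 == 1 plays the role of the bias w_0.
    return torch.stack([x ** i for i in range(degree + 1)], dim=1)

x = torch.linspace(-1, 1, 20)
y = 1.2 - 3.4 * x + 5.6 * x ** 2 + 0.1 * torch.randn(20)  # noisy quadratic
X = poly_features(x, degree=3)
w = torch.linalg.lstsq(X, y.unsqueeze(1)).solution  # squared-error fit of the weights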
Dataset Size
As the above bound already indicates, another big consideration to bear in mind is dataset
size. Fixing our model, the fewer samples we have in the training dataset, the more likely
(and more severely) we are to encounter overfitting. As we increase the amount of training
data, the generalization error typically decreases. Moreover, in general, more data never
hurts. For a fixed task and data distribution, model complexity should not increase more
rapidly than the amount of data. Given more data, we might attempt to fit a more complex
model. Absent sufficient data, simpler models may be more difficult to beat. For many
tasks, deep learning only outperforms linear models when many thousands of training ex-
amples are available. In part, the current success of deep learning owes considerably to the
abundance of massive datasets arising from Internet companies, cheap storage, connected
devices, and the broad digitization of the economy.
3.6.3 Model Selection

While ideally we would only touch the test data once, to assess the very best model or to compare a small number of models with each other, real-world test data is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments. In fact, recycling benchmark data for decades can have a significant impact on the development of algorithms, e.g., for image classification and optical character recognition.
The common practice for addressing the problem of training on the test set is to split our data three ways, incorporating a validation set in addition to the training and test datasets. The result is a murky business where the boundaries between validation and test data are worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really working with what should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy reported in each experiment of the book is really the validation accuracy and not a true test set accuracy.
Cross-Validation
When training data is scarce, we might not even be able to afford to hold out enough data to
constitute a proper validation set. One popular solution to this problem is to employ 𝐾-fold
cross-validation. Here, the original training data is split into 𝐾 non-overlapping subsets.
Then model training and validation are executed 𝐾 times, each time training on 𝐾 − 1
subsets and validating on a different subset (the one not used for training in that round).
Finally, the training and validation errors are estimated by averaging over the results from
the 𝐾 experiments.
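As a minimal sketch (the helpers below are illustrative, not part of the d2l library), 𝐾-fold cross-validation can be expressed in a few lines:

def k_fold_indices(n, k):
    # Split indices 0..n-1 into k non-overlapping folds of (roughly) equal size.
    fold_size = n // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def cross_validate(fit_fn, eval_fn, n, k=5):
    # fit_fn(train_idx) returns a trained model; eval_fn(model, val_idx) returns its error.
    folds = k_fold_indices(n, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        errors.append(eval_fn(fit_fn(train_idx), val_idx))
    return sum(errors) / k  # average validation error over the K rounds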
3.6.4 Summary
This section explored some of the underpinnings of generalization in machine learning.
Some of these ideas become complicated and counterintuitive when we get to deeper mod-
els; here, models are capable of overfitting data badly, and the relevant notions of complex-
ity can be both implicit and counterintuitive (e.g., larger architectures with more parameters
generalizing better). We leave you with a few rules of thumb:
1. Use validation sets (or 𝐾-fold cross-validation) for model selection;
2. More complex models often require more data;
3. Relevant notions of complexity include both the number of parameters and the range of
values that they are allowed to take;
4. Keeping all else equal, more data almost always leads to better generalization;
5. This entire talk of generalization is all predicated on the IID assumption. If we relax
this assumption, allowing for distributions to shift between the training and testing peri-
ods, then we cannot say anything about generalization absent a further (perhaps milder)
assumption.
3.6.5 Exercises
1. When can you solve the problem of polynomial regression exactly?
2. Give at least five examples where dependent random variables make treating the problem
as IID data inadvisable.
3. Can you ever expect to see zero training error? Under which circumstances would you
see zero generalization error?
6. The VC dimension is defined as the maximum number of points that can be classified
with arbitrary labels {±1} by a function of a class of functions. Why might this not be
a good idea for measuring how complex the class of functions is? Hint: consider the
magnitude of the functions.
7. Your manager gives you a difficult dataset on which your current algorithm does not
perform so well. How would you justify to him that you need more data? Hint: you
cannot increase the data but you can decrease it.
Discussions.
3.7 Weight Decay

Now that we have characterized the problem of overfitting, we can introduce our first reg-
ularization technique. Recall that we can always mitigate overfitting by collecting more
training data. However, that can be costly, time consuming, or entirely out of our control,
making it impossible in the short run. For now, we can assume that we already have as much high-quality data as our resources permit and focus on the tools at our disposal, taking the dataset as a given.
Recall that in our polynomial regression example (Section 3.6.2) we could limit our model’s
capacity by tweaking the degree of the fitted polynomial. Indeed, limiting the number of
features is a popular technique for mitigating overfitting. However, simply tossing aside
features can be too blunt an instrument. Sticking with the polynomial regression example,
consider what might happen with high-dimensional input. The natural extensions of poly-
nomials to multivariate data are called monomials, which are simply products of powers
of variables. The degree of a monomial is the sum of the powers. For example, 𝑥₁²𝑥₂ and 𝑥₃𝑥₅² are both monomials of degree 3.
Note that the number of terms with degree 𝑑 blows up rapidly as 𝑑 grows larger. Given 𝑘 variables, the number of monomials of degree 𝑑 is $\binom{k-1+d}{k-1}$. Even small changes in degree, say from 2 to 3, dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity.
%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l
Recall that x⁽ⁱ⁾ are the features, 𝑦⁽ⁱ⁾ is the label for any data example 𝑖, and (w, 𝑏) are the weight and bias parameters, respectively. To penalize the size of the weight vector, we must somehow add ‖w‖² to the loss function, but how should the model trade off the standard loss for this new additive penalty? In practice, we characterize this trade-off via the regularization constant 𝜆, a nonnegative hyperparameter that we fit using validation data:

$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2. \qquad (3.7.2)$$
For 𝜆 = 0, we recover our original loss function. For 𝜆 > 0, we restrict the size of ‖w‖.
We divide by 2 by convention: when we take the derivative of a quadratic function, the
2 and 1/2 cancel out, ensuring that the expression for the update looks nice and simple.
The astute reader might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring the ℓ2 norm, we remove the square root, leaving the sum of squares of each component of the weight vector. This makes the derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.
Moreover, you might ask why we work with the ℓ2 norm in the first place and not, say,
the ℓ1 norm. In fact, other choices are valid and popular throughout statistics. While ℓ2 -
regularized linear models constitute the classic ridge regression algorithm, ℓ1 -regularized
linear regression is a similarly fundamental method in statistics, popularly known as lasso
regression. One reason to work with the ℓ2 norm is that it places an outsize penalty on large
components of the weight vector. This biases our learning algorithm towards models that
distribute weight evenly across a larger number of features. In practice, this might make
them more robust to measurement error in a single variable. By contrast, ℓ1 penalties lead
to models that concentrate weights on a small set of features by clearing the other weights
to zero. This gives us an effective method for feature selection, which may be desirable for
other reasons. For example, if our model only relies on a few features, then we may not
need to collect, store, or transmit data for the other (dropped) features.
Using the same notation as in (3.1.11), the minibatch stochastic gradient descent update for ℓ2-regularized regression is:

$$\mathbf{w} \leftarrow (1 - \eta\lambda)\,\mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right). \qquad (3.7.3)$$
As before, we update w based on the amount by which our estimate differs from the ob-
servation. However, we also shrink the size of w towards zero. That is why the method is
sometimes called “weight decay”: given the penalty term alone, our optimization algorithm
decays the weight at each step of training. In contrast to feature selection, weight decay
offers us a mechanism for continuously adjusting the complexity of a function. Smaller
values of 𝜆 correspond to less constrained w, whereas larger values of 𝜆 constrain w more
considerably. Whether we include a corresponding bias penalty 𝑏² can vary across imple-
mentations, and may vary across layers of a neural network. Often, we do not regularize
the bias term. Besides, although ℓ2 regularization may not be equivalent to weight decay
for other optimization algorithms, the idea of regularization through shrinking the size of
weights still holds true.
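To see the decay directly, here is a minimal one-step sketch (all names and values are illustrative); the penalty's only effect on the update is the multiplicative shrinkage by (1 − 𝜂𝜆):

import torch

eta, lambd = 0.1, 0.01            # learning rate and regularization constant
w = torch.randn(5, 1)             # current weights
grad_loss = torch.randn(5, 1)     # gradient of the unpenalized loss (placeholder)
w = (1 - eta * lambd) * w - eta * grad_loss  # shrink the weights, then step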
To illustrate the benefits of weight decay empirically, we next generate a synthetic dataset in which the label is given by an underlying linear function of our inputs, corrupted by Gaussian noise with zero mean and standard deviation 0.01. For illustrative purposes, we make the effects of overfitting pronounced by increasing the dimensionality of our problem to 𝑑 = 200 and working with a small training set of only 20 examples.
class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, num_inputs)
        noise = torch.randn(n, 1) * 0.01
        # True parameters: every weight is 0.01 and the bias is 0.05.
        w, b = torch.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = torch.matmul(self.X, w) + b + noise
def l2_penalty(w):
    return (w ** 2).sum() / 2
class WeightDecayScratch(d2l.LinearRegressionScratch):
    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
        super().__init__(num_inputs, lr, sigma)
        self.save_hyperparameters()

    def loss(self, y_hat, y):
        # Add the lambda-weighted L2 penalty to the base squared loss;
        # without this override the saved lambd would go unused.
        return super().loss(y_hat, y) + self.lambd * l2_penalty(self.w)
The following code fits our model on the training set with 20 examples and evaluates it on the validation set with 100 examples. (The data and trainer instances are not shown in this excerpt; the snippet below includes an assumed setup consistent with these sizes.)
# Assumed setup (not shown in this excerpt); batch size and epoch count are guesses.
data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
trainer = d2l.Trainer(max_epochs=10)

def train_scratch(lambd):
    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
    model.board.yscale = 'log'
    trainer.fit(model, data)
    print('L2 norm of w:', float(l2_penalty(model.w)))
train_scratch(0)
L2 norm of w: 0.009948714636266232
train_scratch(3)
L2 norm of w: 0.0017270983662456274
3.7.4 Concise Implementation

Because weight decay is ubiquitous in neural network optimization, deep learning frameworks make it especially convenient, integrating weight decay into the optimization algorithm itself for easy use in combination with any loss function. Moreover, this integration serves a computational benefit, allowing implementation tricks to add weight decay to the algorithm without any additional computational overhead. Since the weight decay portion of the update depends only on the current value of each parameter, the optimizer must touch each parameter once anyway.
Below, we specify the weight decay hyperparameter directly through weight_decay when
instantiating our optimizer. By default, PyTorch decays both weights and biases simulta-
neously, but we can configure the optimizer to handle different parameters according to
different policies. Here, we only set weight_decay for the weights (the net.weight pa-
rameters), hence the bias (the net.bias parameter) will not decay.
class WeightDecay(d2l.LinearRegression):
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.save_hyperparameters()
        self.wd = wd

    def configure_optimizers(self):
        # Decay only the weights; the bias parameter group omits weight_decay.
        return torch.optim.SGD([
            {'params': self.net.weight, 'weight_decay': self.wd},
            {'params': self.net.bias}], lr=self.lr)
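The training call itself is not shown in this excerpt; a sketch consistent with the output below (the value wd=3 mirrors the lambd=3 run above and is an assumption) is:

model = WeightDecay(wd=3, lr=0.01)
model.board.yscale = 'log'
trainer.fit(model, data)
print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))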
L2 norm of w: 0.013779522851109505

The plot looks similar to that when we implemented weight decay from scratch. However, this version runs faster and is easier to implement, benefits that will become more pronounced as you address larger problems and this work becomes more routine.
So far, we have touched upon one notion of what constitutes a simple linear function. However, even for simple nonlinear functions, the situation can be much more complex. To see this, the concept of a reproducing kernel Hilbert space (RKHS) allows one to apply tools introduced for linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale poorly to large, high-dimensional data.
3.7.5 Summary
Regularization is a common method for dealing with overfitting. Classical regularization
techniques add a penalty term to the loss function (when training) to reduce the complexity
of the learned model. One particular choice for keeping the model simple is using an ℓ2
penalty. This leads to weight decay in the update steps of the minibatch stochastic gradient
descent algorithm. In practice, the weight decay functionality is provided in optimizers
from deep learning frameworks. Different sets of parameters can have different update
behaviors within the same training loop.
3.7.6 Exercises
1. Experiment with the value of 𝜆 in the estimation problem in this section. Plot training
and validation accuracy as a function of 𝜆. What do you observe?
2. Use a validation set to find the optimal value of 𝜆. Is it really the optimal value? Does
this matter?
3. What would the update equations look like if instead of ‖w‖² we used Σᵢ |𝑤ᵢ| as our penalty of choice (ℓ1 regularization)?
4. We know that ‖w‖² = w⊤w. Can you find a similar equation for matrices (see the
Frobenius norm in Section 2.3.11)?
5. Review the relationship between training error and generalization error. In addition to
weight decay, increased training, and the use of a model of suitable complexity, what
other ways might help us deal with overfitting?
6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior
via 𝑃(𝑤 | 𝑥) ∝ 𝑃(𝑥 | 𝑤)𝑃(𝑤). How can you identify 𝑃(𝑤) with regularization?
Discussions.
4 Linear Neural Networks for Classification
Now that you have worked through all of the mechanics you are ready to apply the skills
you have learned to broader kinds of tasks. Even as we pivot towards classification, most
of the plumbing remains the same: loading the data, passing it through the model, generat-
ing output, calculating the loss, taking gradients with respect to weights, and updating the
model. However, the precise form of the targets, the parametrization of the output layer,
and the choice of loss function will adapt to suit the classification setting.
4.1 Softmax Regression

Regression is the hammer we reach for when we want to answer how much? or how many?
questions. If you want to predict the number of dollars (price) at which a house will be sold,
or the number of wins a baseball team might have, or the number of days that a patient will
remain hospitalized before being discharged, then you are probably looking for a regression
model. However, even within regression models, there are important distinctions. For
instance, the price of a house will never be negative and changes might often be relative
to its baseline price. As such, it might be more effective to regress on the logarithm of the
price. Likewise, the number of days a patient spends in hospital is a discrete nonnegative
random variable. As such, least mean squares might not be an ideal approach either. This
sort of time-to-event modeling comes with a host of other complications that are dealt with
in a specialized subfield called survival modeling.
The point here is not to overwhelm you but just to let you know that there is a lot more
to estimation than simply minimizing squared errors. And more broadly, there is a lot
more to supervised learning than regression. In this section, we focus on classification
problems where we put aside how much? questions and instead focus on which category?
questions.
• Is this customer more likely to sign up or not to sign up for a subscription service?
Even more, there are cases where more than one label might be true. For instance, a news
article might simultaneously cover the topics of entertainment, business, and space flight,
but not the topics of medicine or sports. Thus, categorizing it into just one of the above categories would not be very useful. This problem is commonly known as multi-label classification. See Tsoumakas and Katakis (2007) for an overview and Huang et al. (2015) for an effective algorithm when tagging images.
4.1.1 Classification
To get our feet wet, let’s start with a simple image classification problem. Here, each input
consists of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar,
giving us four features 𝑥 1 , 𝑥 2 , 𝑥 3 , 𝑥 4 . Further, let’s assume that each image belongs to one
among the categories “cat”, “chicken”, and “dog”.
Next, we have to choose how to represent the labels. We have two obvious choices. Per-
haps the most natural impulse would be to choose 𝑦 ∈ {1, 2, 3}, where the integers represent
{dog, cat, chicken} respectively. This is a great way of storing such information on a com-
puter. If the categories had some natural ordering among them, say if we were trying to
predict {baby, toddler, adolescent, young adult, adult, geriatric}, then it might even make
sense to cast this as an ordinal regression problem and keep the labels in this format.
See Moon et al. (2010) for an overview of different types of ranking loss functions and
Beutel et al. (2014) for a Bayesian approach that addresses responses with more than one
mode.
In general, classification problems do not come with natural orderings among the classes.
Fortunately, statisticians long ago invented a simple way to represent categorical data: the
one-hot encoding. A one-hot encoding is a vector with as many components as we have
categories. The component corresponding to a particular instance’s category is set to 1 and
all other components are set to 0. In our case, a label 𝑦 would be a three-dimensional vector, with (1, 0, 0) corresponding to “cat”, (0, 1, 0) to “chicken”, and (0, 0, 1) to “dog”:

$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. \qquad (4.1.1)$$
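As a quick illustration (the label-to-index assignment below is our own choice for the example), such vectors are easy to produce with torch.nn.functional.one_hot:

import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])          # e.g., cat=0, chicken=1, dog=2
print(F.one_hot(labels, num_classes=3))
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])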
Linear Model
In order to estimate the conditional probabilities associated with all the possible classes,
we need a model with multiple outputs, one per class. To address classification with lin-
ear models, we will need as many affine functions as we have outputs. Strictly speaking,
we only need one fewer, since the final category has to be the difference between 1 and
the sum of the other categories, but for reasons of symmetry we use a slightly redundant
parametrization. Each output corresponds to its own affine function. In our case, since
we have 4 features and 3 possible output categories, we need 12 scalars to represent the
weights (𝑤 with subscripts), and 3 scalars to represent the biases (𝑏 with subscripts). This
yields:
$$\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1, \\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2, \\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned} \qquad (4.1.2)$$
The corresponding neural network diagram is shown in Fig. 4.1.1. Just as in linear regres-
sion, we use a single-layer neural network. And since the calculation of each output, 𝑜1 , 𝑜 2 ,
and 𝑜3 , depends on every input, 𝑥1 , 𝑥 2 , 𝑥3 , and 𝑥4 , the output layer can also be described as
a fully connected layer.
Fig. 4.1.1 Softmax regression is a single-layer neural network.
For a more concise notation we use vectors and matrices: o = Wx + b is much better suited for mathematics and code. Note that we have gathered all of our weights into a 3 × 4 matrix W and all biases into a vector b ∈ ℝ³.
The Softmax
Assuming a suitable loss function, we could try, directly, to minimize the difference be-
tween o and the labels y. While it turns out that treating classification as a vector-valued
regression problem works surprisingly well, it is nonetheless unsatisfactory in the following
ways:
• There is no guarantee that the outputs 𝑜𝑖 sum up to 1 in the way we expect probabilities
to behave.
• There is no guarantee that the outputs 𝑜ᵢ are even nonnegative (even if they sum up to 1), or that they do not exceed 1.
Both aspects render the estimation problem difficult to solve and the solution very brittle to outliers. For instance, if we assume that there is a positive linear dependency between the number of bedrooms and the likelihood that someone will buy a house, the probability might exceed 1 when it comes to buying a mansion! As such, we need a mechanism to “squish” the outputs.
There are many ways we might accomplish this goal. For instance, we could assume that
the outputs o are corrupted versions of y, where the corruption occurs by means of adding
noise 𝝐 drawn from a normal distribution. In other words, y = o + 𝝐, where 𝜖𝑖 ∼ N (0, 𝜎 2 ).
This is the so-called probit model, first introduced by Fechner (1860). While appealing, it does not work quite as well nor lead to a particularly nice optimization problem, when
compared to the softmax.
Another way to accomplish this goal (and to ensure nonnegativity) is to use an exponential
function 𝑃(𝑦 = 𝑖) ∝ exp 𝑜𝑖 . This does indeed satisfy the requirement that the conditional
class probability increases with increasing 𝑜𝑖 , it is monotonic, and all probabilities are
nonnegative. We can then transform these values so that they add up to 1 by dividing each
by their sum. This process is called normalization. Putting these two pieces together gives
us the softmax function:
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}. \qquad (4.1.3)$$
Note that the largest coordinate of o corresponds to the most likely class according to ŷ.
Moreover, because the softmax operation preserves the ordering among its arguments, we
do not need to compute the softmax to determine which class has been assigned the highest
probability. Thus,

$$\operatorname*{argmax}_j \hat{y}_j = \operatorname*{argmax}_j o_j. \qquad (4.1.4)$$
The idea of a softmax dates back to Gibbs (1902), who adapted ideas from physics. Dating
even further back, Boltzmann, the father of modern statistical physics, used this trick to
model a distribution over energy states in gas molecules. In particular, he discovered that
the prevalence of a state of energy in a thermodynamic ensemble, such as the molecules in a
gas, is proportional to exp(−𝐸/𝑘𝑇). Here, 𝐸 is the energy of a state, 𝑇 is the temperature,
and 𝑘 is the Boltzmann constant. When statisticians talk about increasing or decreasing
the “temperature” of a statistical system, they refer to changing 𝑇 in order to favor lower
or higher energy states. Following Gibbs’ idea, energy equates to error. Energy-based
models (Ranzato et al., 2007) use this point of view when describing problems in deep
learning.
Vectorization
To improve computational efficiency, we vectorize calculations in minibatches of data. As-
sume that we are given a minibatch X ∈ R𝑛×𝑑 of 𝑛 examples with dimensionality (number
of inputs) 𝑑. Moreover, assume that we have 𝑞 categories in the output. Then the weights
satisfy W ∈ R𝑑×𝑞 and the bias satisfies b ∈ R1×𝑞 .
$$\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O}). \qquad (4.1.5)$$
This accelerates the dominant operation into a matrix–matrix product XW. Moreover,
since each row in X represents a data example, the softmax operation itself can be computed
rowwise: for each row of O, exponentiate all entries and then normalize them by the sum.
Note, though, that care must be taken to avoid exponentiating and taking logarithms of large
numbers, since this can cause numerical overflow or underflow. Deep learning frameworks
take care of this automatically.
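As a minimal sketch (not the framework's internal implementation), the rowwise computation with the standard stabilization trick of subtracting each row's maximum before exponentiating looks like this:

import torch

def softmax(O):
    # Subtracting the rowwise max leaves the result unchanged (the factor
    # cancels in the ratio) but prevents exp() from overflowing.
    O = O - O.max(dim=1, keepdim=True).values
    expO = torch.exp(O)
    return expO / expO.sum(dim=1, keepdim=True)  # normalize each row

O = torch.tensor([[100.0, 101.0, 102.0]])
print(softmax(O))  # tensor([[0.0900, 0.2447, 0.6652]])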
Log-Likelihood
The softmax function gives us a vector ŷ, which we can interpret as the (estimated) con-
ditional probabilities of each class, given any input x, such as 𝑦ˆ 1 = 𝑃(𝑦 = cat | x). In the
following we assume that for a dataset with features X the labels Y are represented using
a one-hot encoding label vector. We can compare the estimates with reality by checking
how probable the actual classes are according to our model, given the features:
$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^{n} P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \qquad (4.1.6)$$
We are allowed to use the factorization since we assume that each label is drawn indepen-
dently from its respective distribution 𝑃(y | x (𝑖) ). Since maximizing the product of terms
is awkward, we take the negative logarithm to obtain the equivalent problem of minimizing
the negative log-likelihood:
$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}), \qquad (4.1.7)$$
where for any pair of label y and model prediction ŷ over 𝑞 classes, the loss function 𝑙
is
$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j. \qquad (4.1.8)$$
For reasons explained later on, the loss function in (4.1.8) is commonly called the cross-
entropy loss. Since y is a one-hot vector of length 𝑞, the sum over all its coordinates 𝑗 van-
ishes for all but one term. Note that the loss 𝑙 (y, ŷ) is bounded from below by 0 whenever ŷ
is a probability vector: no single entry is larger than 1, hence their negative logarithm can-
not be lower than 0; 𝑙 (y, ŷ) = 0 only if we predict the actual label with certainty. This can
never happen for any finite setting of the weights because taking a softmax output towards
1 requires taking the corresponding input 𝑜𝑖 to infinity (or all other outputs 𝑜 𝑗 for 𝑗 ≠ 𝑖
to negative infinity). Even if our model could assign an output probability of 0, any error
made when assigning such high confidence would incur infinite loss (− log 0 = ∞).
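As a sketch (this is not yet the book's implementation), when y is one-hot the sum in (4.1.8) keeps only the term for the true class, so the loss reduces to picking out the log-probability assigned to the correct label:

import torch

def cross_entropy(y_hat, y):
    # y_hat: (n, q) predicted probabilities; y: (n,) integer class labels.
    return -torch.log(y_hat[torch.arange(len(y_hat)), y])

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([2, 0])
print(cross_entropy(y_hat, y))  # tensor([0.5108, 1.2040])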