
Omar Hijab*

Math for Data Science


Copyright ©2022 — 2024 Omar Hijab. All Rights Reserved.
Compiled 2024-10-24 10:51:19-04:00
Preface

This text is a presentation of the mathematics underlying Data Science, and assumes the math background typical of an Electrical Engineering undergraduate. In particular, Chapter 4, Calculus, assumes the reader has prior calculus exposure.
By contrast, because we outsource computations to Python, and focus
on conceptual understanding, Chapter 2, Linear Geometry, is developed in
depth.
Depending on the emphasis and supplementary material, the text is ap-
propriate for a course in the following programs
• Applied Mathematics,
• Business Analytics,
• Computer Science,
• Data Science,
• Engineering.
The level and pace of the text varies from gentle, at the start, to advanced,
at the end. Depending on the depth of coverage, the text is appropriate for
a one-semester course, or a two-semester course.
Chapters 1-3, together with some of the appendices, form the basis for
a leisurely one-semester course, and Chapters 4-7 form the basis for an ad-
vanced one-semester course. The chapter ordering is chosen to allow for the
two semesters being taught simultaneously. The text was written while being
repeatedly taught as a two-semester course with the two semesters taught
simultaneously.
The culmination of the text is Chapter 7, Machine Learning. Much of
the mathematics developed in prior chapters is used here. While only an
introduction to the subject, the material in Chapter 7 is carefully and logically
built up from first principles.
As a consequence, the presentation and some results are new: The proofs of heavy ball convergence and Nesterov convergence for accelerated gradient descent are simplifications of the proofs in [36], and the connection between properness and trainability seems to be new to the literature.

Important principles or results are displayed in these boxes.

The ideas presented in the text are made concrete by interpreting them
in Python code. The standard Python data science packages are used, and a
Python index lists the functions used in the text.
Because Python is used to highlight concepts, the code examples are pur-
posely simple to follow. This should be helpful to the reader new to Python.

Python code is displayed in these boxes.

Because SQL is usually part of a data scientist’s toolkit, an introduction to using SQL, from within Python, is included in an appendix. Also, in case the instructor wishes to de-emphasize it, integration is presented separately in an appendix. Other appendices cover combinations and permutations, the binomial theorem, the exponential function, complex numbers, asymptotics, and minimizers, to be used according to the instructor’s emphasis and preferences.
The bibliography at the end is a listing of the references accessed while writing the text. Throughout, we use iff to mean if and only if, and we use ≈ for asymptotic equality (§A.6). Apart from a few exceptions, all figures in the text were created by the author using Python or TikZ. The text is typeset using TeX. The boxes above are created using tcolorbox.
To help navigate the text, in each section, to indicate a break, a new idea, or a change in direction, we use a ship’s wheel symbol.
Sections and figures are numbered sequentially within each chapter, and
equations and exercises are numbered sequentially within each section, so §3.4
is the fourth section in the third chapter, Figure 4.14 is the fourteenth figure in
the fourth chapter, (3.2.1) is the first equation in the second section of the
third chapter, and Exercise 1.2.3 is the third exercise in the second section
of the first chapter. Also, [1] cites the first entry in the references.
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Averages and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6 High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2 Linear Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Span and Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.5 Zero Variance Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.6 Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.7 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.8 Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.9 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

3 Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


3.1 Geometry of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.2 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.4 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.5 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
3.6 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.1 Single-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.2 Entropy and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.3 Multi-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


4.5 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
5.1 Binomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
5.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
5.5 Chi-squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
5.6 Multinomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

6 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
6.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
6.2 Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
6.3 T -test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
6.4 Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378

7 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389


7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
7.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
7.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
7.4 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
7.5 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
7.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
7.7 Regression Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
7.8 Strict Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.9 Accelerated Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444

Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
A.1 Permutations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . 451
A.2 The Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
A.3 The Exponential Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
A.4 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
A.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
A.6 Asymptotics and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
A.7 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
A.8 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509

Python Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
List of Figures

1.1 Iris dataset [28]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


1.2 Images in the MNIST dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 A portion of the MNIST dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 6
1.5 The MNIST dataset (3d projection). . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 A crude copy of the image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 HTML colors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 The vector v joining the points m and x. . . . . . . . . . . . . . . . . . . . 12
1.9 Datasets of points versus datasets of vectors. . . . . . . . . . . . . . . . . 13
1.10 A dataset with its mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.11 Vectorization of samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.12 A vector v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.13 Vectors v1 and v2 and their shadows in the plane. . . . . . . . . . . . 19
1.14 Adding v1 and v2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15 Scaling with t = 2 and t = −2/3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.16 The polar representation of v = (x, y). . . . . . . . . . . . . . . . . . . . . . 22
1.17 v and its antipode −v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.18 Two vectors v1 and v2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.19 Pythagoras for general triangles. . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.20 Proof of Pythagoras for general triangles. . . . . . . . . . . . . . . . . . . . 26
1.21 P and P ⊥ and v and v ⊥ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.22 MSD for the mean (green) versus MSD for a random point
(red). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.23 Projecting a vector b onto the line through u. . . . . . . . . . . . . . . . 41
1.24 Unit variance ellipses (blue) and unit inverse variance ellipses
(red) with µ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.25 Variance ellipses (blue) and inverse variance ellipses (red) for
a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.26 Unit variance ellipse and unit inverse variance ellipse with
standardized Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


1.27 Positively and negatively correlated datasets (unit inverse ellipses). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.28 Ellipsoid and axes in 3d. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.29 Disks inside the square. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.30 Balls inside the cube. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.31 Suspensions of interval [a, b] and disk D. . . . . . . . . . . . . . . . . . . . 57

2.1 Numpy column space array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


2.2 The points 0, x, Ax, and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ . . . . . . 105
2.4 Projecting onto a line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.5 Projecting onto a plane, P b = ru + sv. . . . . . . . . . . . . . . . . . . . . . 115
2.6 Dataset, reduced dataset, and projected dataset, n < d. . . . . . . 119
2.7 Relations between vector classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.8 First defect for MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.9 The dimension staircase with defects. . . . . . . . . . . . . . . . . . . . . . . 126
2.10 The dimension staircase for the MNIST dataset. . . . . . . . . . . . . . 127
2.11 A 5 × 3 matrix A is a linear transformation from R3 to R5 . . . 129

3.1 Image of unit circle with σ1 = 1.5 and σ2 = .75. . . . . . . . . . . . . . 137


3.2 SVD decomposition A = U SV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.3 Relations between matrix classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.4 Inverse variance ellipse and centered dataset. . . . . . . . . . . . . . . . . 149
3.5 S = span(v1 ) and T = S ⊥ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.6 Three springs at rest and perturbed. . . . . . . . . . . . . . . . . . . . . . . . 156
3.7 Six springs at rest and perturbed. . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.8 Two springs along a circle leading to Q(2). . . . . . . . . . . . . . . . . . 158
3.9 Five springs along a circle leading to Q(5). . . . . . . . . . . . . . . . . . 158
3.10 Plot of eigenvalues of Q(50). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.11 Density of eigenvalues of Q(d) for d large. . . . . . . . . . . . . . . . . . . 162
3.12 Trace of pseudo-inverse (§2.3) of Q(d). . . . . . . . . . . . . . . . . . . . . . 163
3.13 Directed and undirected graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.14 A weighted directed graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.15 A double edge and a loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.16 The complete graph K6 and the cycle graph C6 . . . . . . . . . . . . . . 166
3.17 The triangle K3 = C3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.18 Non-isomorphic graphs with degree sequence (3, 2, 2, 1, 1, 1). . . 174
3.19 Complete bipartite graph K53 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
3.20 MNIST eigenvalues as a percentage of the total variance. . . . . . 185
3.21 MNIST eigenvalue percentage plot. . . . . . . . . . . . . . . . . . . . . . . . . 186
3.22 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 190
3.23 The full MNIST dataset (2d projection). . . . . . . . . . . . . . . . . . . . 191
3.24 The Iris dataset (2d projection). . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

4.1 f ′ (a) is the slope of the tangent line at a. . . . . . . . . . . . . . . . . . . . 198



4.2 Composition of two functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200


4.3 The logarithm function log x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.4 Increasing or decreasing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.5 Increasing or decreasing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0. . . . . 211
4.7 The sine function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.8 The sine function with π/2 tick marks. . . . . . . . . . . . . . . . . . . . . . 215
4.9 Angle θ in the plane, P = (x, y). . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.10 The absolute entropy function H(p). . . . . . . . . . . . . . . . . . . . . . . . 219
4.11 The absolute information I(p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.12 The relative information I(p, q) with q = .7. . . . . . . . . . . . . . . . . 223
4.13 Surface plot of I(p, q) over the square 0 ≤ p ≤ 1, 0 ≤ q ≤ 1. . . . 224
4.14 Composition of multiple functions. . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.15 Composition of three functions in a chain. . . . . . . . . . . . . . . . . . . 233
4.16 A network composition [33]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
4.17 The function g = max(y, z). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.18 Forward and backward propagation [33]. . . . . . . . . . . . . . . . . . . . 238
4.19 Level sets and sublevel sets in two dimensions. . . . . . . . . . . . . . . 243
4.20 Contour lines in two dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . 244
4.21 Line segment [x0 , x1 ]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
4.22 Convex: The line segment lies above the graph. . . . . . . . . . . . . . 246
4.23 Convex hull of x1 , x2 , x3 , x4 , x5 , x6 , x7 . . . . . . . . . . . . . . . . . . . . 248
4.24 A convex hull with one facet highlighted. . . . . . . . . . . . . . . . . . . . 248
4.25 Convex set in three dimensions with supporting hyperplane. . . 250
4.26 Hyperplanes in two and three dimensions. . . . . . . . . . . . . . . . . . . 251
4.27 Separating hyperplane I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.28 Separating hyperplane II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

5.1 Asymptotics of binomial coefficients. . . . . . . . . . . . . . . . . . . . . . . . 270


5.2 The distribution of p given 7 heads in 10 tosses. . . . . . . . . . . . . . 274
5.3 The logistic function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.4 The logistic function takes real numbers to probabilities. . . . . . 277
5.5 Decision boundary (1d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
5.6 Decision boundary (3d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
5.7 Joint distribution of boys and girls [30]. . . . . . . . . . . . . . . . . . . . . 283
5.8 100,000 sessions, with 5, 15, 50, and 500 tosses per session. . . . 284
5.9 The histogram of Iris petal lengths. . . . . . . . . . . . . . . . . . . . . . . . . 286
5.10 Iris petal lengths sampled 100,000 times. . . . . . . . . . . . . . . . . . . . 287
5.11 Iris petal lengths batch means sampled 100,000 times, batch
sizes 3, 5, 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.12 When we sample X, we get x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.13 Probability mass function p(x) of a Bernoulli random variable. 296
5.14 Cumulative distribution function F (x) of a Bernoulli random
variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
5.15 Confidence that X lies in interval [a, b]. . . . . . . . . . . . . . . . . . . . . 305

5.16 Uniform probability density function (pdf). . . . . . . . . . . . . . . . . . 306


5.17 Densities versus distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
5.18 Continuous cumulative distribution function. . . . . . . . . . . . . . . . . 308
5.19 When we sample X1 , X2 , . . . , Xn , we get x1 , x2 , . . . , xn . . . . . 311
5.20 The pdf of the standard normal distribution. . . . . . . . . . . . . . . . 315
5.21 The binomial cdf and its CLT normal approximation. . . . . . . . 319
5.22 z = Z.ppf(p) and p = Z.cdf(z). . . . . . . . . . . . . . . . . . . . . . . . . . 322
5.23 Confidence (green) or significance (red) (lower-tail, two-tail,
upper-tail). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
5.24 68%, 95%, 99% confidence cutoffs for standard normal. . . . . . . . 323
5.25 Cutoffs, confidence levels, p-values. . . . . . . . . . . . . . . . . . . . . . . . . 324
5.26 p-values at 5% and at 1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
5.27 68%, 95%, 99% cutoffs for non-standard normal. . . . . . . . . . . . . 325
5.28 (X, Y ) inside the square and inside the disk. . . . . . . . . . . . . . . . . 331
5.29 Chi-squared distribution with different degrees. . . . . . . . . . . . . . 333
5.30 With degree d ≥ 2, the chi-squared density peaks at d − 2. . . . . 334
5.31 Normal probability density on R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 338
5.32 The softmax function takes vectors to probability vectors. . . . 345
5.33 The third row is the sum of the first and second rows, and
the H column is the negative of the I column. . . . . . . . . . . . . . . 354

6.1 Statistics flowchart: p-value p and significance α. . . . . . . . . . . . . 358


6.2 Histogram of sampling n = 25 students, repeated N = 1000
times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
6.3 The error matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
6.4 T -distribution, against normal (dashed). . . . . . . . . . . . . . . . . . . . . 373
6.5 2 × 3 = d × N contingency table [30]. . . . . . . . . . . . . . . . . . . . . . . 382
6.6 Earthquake counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

7.1 A perceptron with activation function f . . . . . . . . . . . . . . . . . . . . 392


7.2 Perceptrons in parallel (R in the figure is the retina) [22]. . . . . 393
7.3 Network of neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
7.4 Incoming and Outgoing signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
7.5 Forward and back propagation between two neurons. . . . . . . . . . 395
7.6 Downstream, local, and upstream derivatives at node i. . . . . . . 398
7.7 A shallow dense layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
7.8 Layered neural network [10]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
7.9 Double well newton descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
7.10 Double well cost function and sublevel sets at w0 and at w1 . . . 408
7.11 Double well gradient descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
7.12 Cost trajectory and number of iterations as learning rate varies. . . 413
7.13 Linear regression neural network with no bias inputs. . . . . . . . . 416
7.14 Logistic regression neural network without bias inputs. . . . . . . . 420
7.15 Population versus employed: Linear Regression. . . . . . . . . . . . . . 430
7.16 Longley Economic Data [19]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

7.17 Polynomial regression: Degrees 2, 4, 6, 8, 10, 12. . . . . . . . . . . . . . 434


7.18 Hours studied and outcomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
7.19 Exam dataset: x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
7.20 Exam dataset: (x, p) [35]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
7.21 Exam dataset: (x, x0 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
7.22 Hours studied and one-hot encoded outcomes. . . . . . . . . . . . . . . . 438
7.23 Neural network for student exam outcomes. . . . . . . . . . . . . . . . . . 438
7.24 Equivalent neural network for student exam outcomes. . . . . . . . 439
7.25 Exam dataset: (x, x0 , p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
7.26 Convex hulls of Iris classes in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.27 Convex hulls of MNIST classes in R2 . . . . . . . . . . . . . . . . . . . . . . . 440

A.1 6 = 3! permutations of 3 balls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452


A.2 Pascal’s triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
A.3 The exponential function exp x. . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
A.4 Convexity of the exponential function. . . . . . . . . . . . . . . . . . . . . . 472
A.5 Multiplying and dividing points on the unit circle. . . . . . . . . . . . 473
A.6 Complex numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
A.7 The second, third, and fourth roots of unity . . . . . . . . . . . . . . . . 478
A.8 The fifth, sixth, and fifteenth roots of unity . . . . . . . . . . . . . . . . . 479
A.9 Areas under the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
A.10 Area under the parabola. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
A.11 The graph and area under sin x. . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
A.12 Integral of sin x/x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
A.13 Dataframe from list-of-dicts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
A.14 Menu dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . . . . . 500
A.15 Rawa restaurant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
A.16 OrdersIn dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . . . 504
A.17 OrdersOut dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . 505
Chapter 1
Datasets

In this chapter we explore examples of datasets and some simple Python code. We also review the geometry of vectors in the plane and properties of 2 × 2 matrices, introduce the mean and variance of a dataset, then present a first taste of what higher dimensions might look like.

1.1 Introduction

Geometrically, a dataset is a sample of N points x1, x2, . . . , xN in d-dimensional space Rd. When manipulating datasets as vectors, they are usually arranged into d × N arrays. When displaying datasets, as in spreadsheets or SQL tables, they are usually arranged into N × d arrays.
Practically speaking, as we shall see, the following are all representations
of datasets:
matrix = CSV file = spreadsheet = SQL table = array = dataframe
Each point x = (t1 , t2 , . . . , td ) in the dataset is a sample or an example,
and the components t1 , t2 , . . . , td of a sample point x are its features or
attributes. As such, d-dimensional space Rd is feature space.
Sometimes one of the features is separated out as the label or target. In
this case, the dataset is a labeled dataset.

The Iris dataset contains N = 150 examples of d = 4 features of Iris flowers, and there are three classes of Irises, Setosa, Versicolor, and Virginica, with 50 samples from each class. For each example, the class is the label
with 50 samples from each class. For each example, the class is the label
corresponding to that example, so the Iris dataset is labeled.
The four features are sepal length and width, and petal length and width.
In Figure 1.1, the dataset is displayed as an N × d array.


Fig. 1.1 Iris dataset [28].

The Iris dataset is downloaded using the code

from sklearn import datasets

iris = datasets.load_iris()
iris["feature_names"]

This returns
['sepal length','sepal width','petal length','petal width'].
To return the data and the classes, the code is

dataset = iris["data"]
labels = iris["target"]

dataset, labels

The above code returns dataset as an N × d array. To return a d × N array, take the transpose dataset = iris["data"].T.
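
As a quick check of the two layouts (a minimal sketch, assuming the dataset has been loaded as above):

from sklearn import datasets

iris = datasets.load_iris()
dataset = iris["data"]

# N x d layout versus d x N layout
dataset.shape, dataset.T.shape

This returns ((150, 4), (4, 150)).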

The MNIST dataset consists of images of hand-written digits (Figure 1.2). There are 10 classes of images, corresponding to each digit 0, 1, . . . , 9. We seek to compress the images while preserving as much as possible of the images’ characteristics.
Each image is a grayscale 28 × 28 pixel image. Since 28² = 784, each image
is a point in d = 784 dimensions. Here there are N = 60000 samples and
d = 784 features.

Fig. 1.2 Images in the MNIST dataset.

This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points

x1 , x2 , . . . , xN

in d-dimensional feature space. We seek to find a lower-dimensional feature space U ⊂ Rd so that the projections of these points onto U retain as much information as possible about the data.
In other words, we are looking for an n-dimensional subspace U for some
n < d. Among all n-dimensional subspaces, which one should we pick? The
answer is to select U among all n-dimensional subspaces to maximize vari-
ability in the data.
Another issue is the choice of n, which is an integer satisfying 0 ≤ n ≤ d.
On the one hand, we want n to be as small as possible, to maximize data
compression. On the other hand, we want n to be big enough to capture most
of the features of the data. At one extreme, if we pick n = d, then we have
no compression and complete information. At the other extreme, if we pick
n = 0, then we have full compression and no information.
Projecting the data from Rd to a lower-dimensional space U is dimensional
reduction. The best alignment, the best-fit, or the best choice of U is principal
component analysis. These issues will be taken up in §3.5.

If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus), and
Python (numpy, pandas, scipy, sympy, matplotlib). It may help to read the

code examples and the important math principles first, then dive
into details as needed.
To illustrate and make concrete concepts as they are introduced, we use
Python code throughout. We run Python code in a jupyter notebook.
jupyter is an IDE, an integrated development environment. jupyter
supports many languages, including Python, Sage, Julia, and R. A useful
jupyter feature is the ability to measure the execution time of a
jupyter cell by including at the start of the cell

%%time
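
For example, a cell such as the following reports its run time when executed (the computation here is only an illustration):

%%time
total = sum(i*i for i in range(10**6))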

It’s simplest to first install Python, then jupyter. If your laptop is not a
recent model, to minimize overhead, it’s best to install Python directly and
avoid extra packages or frameworks. If Python is installed from
https://www.python.org/downloads/,
then the Python package installer pip is also installed.
From within a shell, check the latest version of pip is installed using the
command
pip --version.
The versions of Python and pip used in this edition of the text are 3.12.*
and 24.*. The first step is to ensure updated versions of Python and pip are
on your laptop.
After this, from within a shell, use pip to install your first package:
pip install jupyter
After installing jupyter, all other packages are installed from within
jupyter. For example, for this text, from within a jupyter cell, we ran

pip install numpy


pip install sympy
pip install scipy
pip install scikit-learn
pip install pandas
pip install matplotlib
pip install ipympl
pip install sqlalchemy
pip install pymysql

After installing these packages, restart jupyter to activate the packages.


The above is a complete listing of the packages used in this text.
Because one often has to repeatedly install different versions of Python,
it’s best to isolate your installations from whatever Python your laptop’s
OS uses. This is achieved by carrying out the above steps within a venv, a virtual environment. Then several venvs may be set up side-by-side, and, at any time, any venv may be deleted without impacting any others, or the OS.
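
As a sketch, assuming a macOS or Linux shell (the environment name ds-venv is just an illustration), the steps are:

python -m venv ds-venv           # create a virtual environment named ds-venv
source ds-venv/bin/activate      # activate it; on Windows use ds-venv\Scripts\activate
pip install jupyter              # packages now install inside ds-venv only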

Exercises

Exercise 1.1.1 What is dataset.shape and labels.shape?

Exercise 1.1.2 What does sum(dataset[0]) return and why?

Exercise 1.1.3 What does sum(dataset) return and why?

Exercise 1.1.4 Let a be a list. What does list(enumerate(a)) return?


What does the code below return?

def uniq(a):
return [x for i, x in enumerate(a) if x not in a[:i] ]

1.2 The MNIST Dataset

Fig. 1.3 A portion of the MNIST dataset.



The MNIST1 dataset consists of 60,000 training images. Since this dataset is
for demonstration purposes, these images are coarse.
Each image consists of 28 × 28 = 784 pixels, and each pixel shading is a
byte, an integer between 0 and 255 inclusive. Therefore each image is a point
x in Rd = R784 . Attached to each image is its label, a digit 0, 1, . . . , 9.
We assume the dataset has been downloaded to your laptop as a CSV file
mnist.csv. Then each row in the file consists of the pixels for a single image.
Since the image’s label is also included in the row, each row consists of 785
integers. There are many sources and formats online for this dataset.
The code

from pandas import *


from numpy import *

mnist = read_csv("mnist.csv").to_numpy()

# separate rows into data and labels


# first column is the labels
labels = mnist[:,0]
# all other columns are the pixels
dataset = mnist[:,1:]

mnist.shape,dataset.shape,labels.shape

returns

(60000, 785), (60000, 784), (60000,)

Here the dataset is arranged into an N × d array.

Fig. 1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.

1 The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.

To compress the image means to reduce the number of dimensions in the point x while keeping maximum information. We can think of a single image
as a dataset itself, and compress the image, or we can design a compression
algorithm based on a collection of images. It is then reasonable to expect that
the procedure applies well to any image that is similar to the images in the
collection.
For the second image in Figure 1.2, reducing dimension from d = 784 to n = 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.5.

Fig. 1.5 The MNIST dataset (3d projection).

Here is an exercise. The top left image in Figure 1.4 is given by a 784-
dimensional point which is imported as an array pixels.

pixels = dataset[1].reshape((28,28))

Then pixels is an array of shape (28,28).

1. In Jupyter, return a two-dimensional plot of the point (2, 3) at size 50 using the code

from matplotlib.pyplot import *

grid()
scatter(2,3,s = 50)
show()

2. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.

Fig. 1.6 A crude copy of the image.

Here is one possible code, returning Figure 1.6.

from matplotlib.pyplot import *


from numpy import *

pixels = dataset[1].reshape((28,28))

grid()
for i in range(28):
for j in range(28):
scatter(i,j, s = pixels[i,j])

show()

The top left image in Figure 1.4 is returned by the code



from matplotlib.pyplot import *

imshow(pixels, cmap="gray_r")

In recent versions of numpy, floats are displayed as follows

np.float64(5.843333333333335)

To display floats without their type, as follows,

5.843333333333335

insert this code

from numpy import *

set_printoptions(legacy="1.25")

at the top of your jupyter notebook or in your jupyter configuration file.

We end the section by discussing the Python import command. The last
code snippet can be rewritten

import matplotlib.pyplot as plt

plt.imshow(pixels, cmap="gray_r")

or as

from matplotlib.pyplot import imshow

imshow(pixels, cmap="gray_r")

So we have three versions of this code snippet.


In the second version, it is explicit that imshow is imported from the module pyplot of the package matplotlib. Moreover, the module matplotlib.pyplot is referenced by a short nickname plt.
In the first version, which uses import *, many commands, though perhaps not all, are imported from the module matplotlib.pyplot.

In the third version, only the command imshow is imported. Which import
style is used depends on the situation.
In this text, we usually use the first style, as it is visually lightest. To help
with online searches, in the Python index, Python commands are listed under
their full package path.

Exercises

Exercise 1.2.1 Run the code in this section on your laptop (all code is run
within jupyter).
Exercise 1.2.2 The first image in the MNIST dataset is an image of the
digit 5. What is the 43,120th image?
Exercise 1.2.3 Figure 1.6 is not oriented the same way as the top-left image
in Figure 1.4. Modify the code returning Figure 1.6 to match the top-left
image in Figure 1.4.

1.3 Averages and Vector Spaces

Suppose we have a population of things (people, tables, numbers, vectors, images, etc.) and we have a sample of size N from this population:

L = [x_1,x_2,...,x_N].

The total population is the population or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population

L_1 = [3.95, 3.20, 3.10, 5.55, 6.93].

Or, the sample space consists of all integers and we take N = 5 samples from
this population

L_2 = [35, -32, -8, 45, -8].

Or, the sample space consists of all rational numbers and we take N = 5
samples from this population

L_3 = [13/31, 8/9, 7/8, 41/22, 32/27].



Or, the sample space consists of all Python strings and we take N = 5 samples
from this population

L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']

Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population

Fig. 1.7 HTML colors.

Here’s the code generating the colors

# HTML color codes are #rrggbb (6 hexes)


from matplotlib.pyplot import *
from random import choice

def hexcolor():
chars = '0123456789abcdef'
return "#" + ''.join([choice(chars) for _ in range(6)])

for i in range(5): scatter(i,0, c=hexcolor())


show()

Let L be a list as above. The goal is to compute the sample average or mean of the list, which is

µ = (x1 + x2 + · · · + xN)/N.    (1.3.1)
In the first example, for real numbers, the average is

(3.95 + 3.20 + 3.10 + 5.55 + 6.93)/5 = 4.546.

In the second case, for integers, the average is 32/5. In the third case, the average is 385373/368280. In the fourth case, while we can add strings, we can’t divide them by 5, so the average is undefined. Similarly for colors: the average is undefined.
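
As a check of the rational case, Python’s fractions module computes the average exactly (a minimal sketch):

from fractions import Fraction

L_3 = [Fraction(13,31), Fraction(8,9), Fraction(7,8), Fraction(41,22), Fraction(32,27)]
mu = sum(L_3)/5
mu

This returns Fraction(385373, 368280).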
This leads to an important definition. A sample space or population V is
called a vector space if, roughly speaking, one can compute means or averages in V. In this case, we call the members of the population “vectors”, even though the members may be anything, as long as they satisfy the basic rules of a vector space.
In a vector space V , the rules are:
1. vectors can be added, with the sum v + w back in V ,
2. vector addition is commutative v + w = w + v,
3. vector addition is associative u + (v + w) = (u + v) + w,
4. there is a zero vector 0,
5. vectors v have negatives −v,
6. vectors v can be scaled to rv by real numbers r, with rv back in V,
7. scaling is distributive over addition (r + s)v = rv + sv and r(u + v) =
ru + rv
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
As mentioned before, real numbers are called scalars because they often
serve to scale vectors.

Let x1, x2, . . . , xN be a dataset. Is the dataset a collection of points, or is the dataset a collection of vectors? In other words, what geometric picture
of datasets should we have in our heads? Here’s how it works.
A vector is an arrow joining two points (Figure 1.8). Given two points
µ = (a, b) and x = (c, d), the vector joining them is

v = x − µ = (c − a, d − b).

Then µ is the tail of v, and x is the head of v. For example, the vector joining
µ = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean µ of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − µ, k = 1, 2, . . . , N .

Fig. 1.8 The vector v joining the points µ and x.



The dataset v1, v2, . . . , vN is centered: its mean is zero,

(v1 + v2 + · · · + vN)/N = 0.
So datasets can be points x1 , x2 , . . . , xN with mean µ, or vectors v1 , v2 , . . . ,
vN with mean zero (Figure 1.9). This distinction makes a difference when
measuring the dimension of a dataset (§2.8).

Centered Versus Non-Centered


If x1 , x2 , . . . , xN is a dataset of points with mean µ and

v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,

then v1 , v2 , . . . , vN is a centered dataset of vectors.


Fig. 1.9 Datasets of points versus datasets of vectors.
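
Here is a minimal numpy sketch of centering (the numbers are illustrative):

from numpy import *

# 2 x N dataset of points, one point per column
X = array([[1.,3.,-2.,0.],[2.,4.,11.,66.]])
mu = mean(X, axis=1)
# centered dataset of vectors
V = X - mu.reshape(2,1)
mean(V, axis=1)

This returns (0, 0), confirming the centered dataset has mean zero.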

Let us go back to vector spaces. When we work with vector spaces, numbers
are referred to as scalars, because 2v, 3v, −v, . . . are scaled versions of v.
When we multiply a vector v by a scalar r to get the scaled vector rv, we call
this vector scaling. This is to distinguish this multiplication from the inner
and outer products we see below.
For example, the samples in the list L1 form a vector space, the set of all
real numbers R. Even though one can add integers, the set Z of all integers
does not form a vector space because multiplying an integer by 1/2 does
not result in an integer. The set Q of all rational numbers (fractions) is a
vector space, so L3 is a sampling from a vector space. The set of strings is
not a vector space because even though one can add strings, addition is not
commutative:

'alpha' + 'romeo' == 'romeo' + 'alpha'

returns False.

For the scalar dataset

x1 = 1.23, x2 = 4.29, x3 = −3.3, x4 = 555,

the average is

µ = (1.23 + 4.29 − 3.3 + 555)/4 = 139.305.
In Python, averages are computed using numpy.mean. For a scalar dataset,
the code

from numpy import *

dataset = array([1.23,4.29,-3.3,555])
mu = mean(dataset)
mu

returns the average.


For the two-dimensional dataset

x1 = (1, 2), x2 = (3, 4), x3 = (−2, 11), x4 = (0, 66),

the average is

µ = ((1, 2) + (3, 4) + (−2, 11) + (0, 66))/4 = (0.5, 20.75).
Note the x-components are summed, and the y-components are summed,
leading to a two-dimensional mean. (This is vector addition, taken up in
§1.4.)
In Python, a dataset of four points in R2 is assembled as 2 × 4 array

from numpy import *

dataset = array([[1,3,-2,0],[2,4,11,66]])

Here the x-components of the four points are the first row, and the y-
components are the second row. With this, the code

mu = mean(dataset, axis=1)
mu

returns the mean (0.5, 20.75).


To explain what axis=1 does, we use matrix terminology. After arranging
dataset into an array of two rows and four columns, to compute the mean,
we sum over the column index.
This means summing the entries of the first row, then summing the entries
of the second row, resulting in a mean with two components.
In Python, the default is to consider the row index i as index zero, and to
consider the column index j as index one.
Summing over index=0 is equivalent to thinking of the dataset as two
points in R4 , so

mean(dataset, axis=0)

returns (1.5, 3.5, 4.5, 33).

Fig. 1.10 A dataset with its mean.

Here is a more involved example of a dataset of random points and their mean:

from numpy import *


from numpy.random import random
from matplotlib.pyplot import scatter, grid, show

N = 20
def row(N): return array([random() for _ in range(N) ])

# 2xN array
dataset = array([ row(N), row(N) ])
mu = mean(dataset,axis=1)

grid()
scatter(*mu)
scatter(*dataset)
show()

This returns Figure 1.10.


In this code, scatter expects two positional arguments, the x and the y
components, arranged as two scalars (for a single point), or two arrays of x
and y components separately (for several points). The unpacking operator *
unpacks mu from one pair into its separate x and y components *mu. So mu
is one Python object, and *mu is two Python objects. Similarly, plot(x,y)
expects the x and y components as two separate arrays.
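
As a small check of the unpacking behavior:

from numpy import *

mu = array([0.5, 20.75])
print(mu)     # one object: [ 0.5  20.75]
print(*mu)    # two objects: 0.5 20.75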

Sometimes, a population is not a vector space, so we can’t take sample means from it. Instead, we take the sample mean of a scalar or vector com-
puted from the samples. This computed quantity is a statistic associated to
the population.
A statistic is an assignment of a scalar or vector f (x) to each sample x
from the population, and the sample mean is then

(f(x1) + f(x2) + · · · + f(xN))/N.    (1.3.2)
Since scalars and vectors do form vector spaces, this mean is well-defined. For
example, a population of cats is not a vector space (they can’t be added),
but their heights is a vector space (heights can be added). This process is
vectorization of the samples.
Vectorization is frequently used to count proportions: Samples are drawn
from finitely many categories, and we wish to count the proportion of samples
belonging to a particular category.
If we toss a coin N times, we obtain a list of heads and tails,

H, H, T, T, T, H, T, . . .

To count the proportion of heads, we define

f(x) = (1, 0) if x is heads,   f(x) = (0, 1) if x is tails.

If we add the vectorized samples f (x) using vector addition in the plane
(§1.4), the first component of the mean (1.3.2) is an average of ones and zeroes, with ones matching heads, resulting in the proportion p̂ of heads. Similarly, the second component is the proportion of tails. Hence (1.3.2) is
the pair (p̂, 1 − p̂), where p̂ is the proportion of heads in N tosses.
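
Here is a minimal sketch of this computation, with the tosses simulated:

from numpy import *
from random import choice

N = 100
tosses = [ choice("HT") for _ in range(N) ]

def f(x): return array([1,0]) if x == "H" else array([0,1])

mean(array([ f(x) for x in tosses ]), axis=0)

This returns the pair (p̂, 1 − p̂).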
More generally, if the label of a sample falls into d categories, we may let
f (x) be a vector with d components consisting of zeros and ones, according
to the category of the sample. This is one-hot encoding (see §2.4 and §7.6).
For example, suppose we take a sampling of size N from the Iris dataset,
and we look at the classes of the resulting samples. Since there are three
classes, in this case, we can define f (x) to equal

(1, 0, 0), (0, 1, 0), (0, 0, 1),

according to which class x belongs to. Then the mean (1.3.2) is a triple
p̂ = (p̂1 , p̂2 , p̂3 ) of proportions of each class in the sampling. Of course, p̂1 +
p̂2 + p̂3 = 1, so p̂ is a probability vector (§5.6).
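
A minimal sketch of this for the Iris classes, using one-hot encoding:

from numpy import *
from sklearn import datasets

labels = datasets.load_iris()["target"]
# one row (1,0,0), (0,1,0), or (0,0,1) per sample
onehot = eye(3)[labels]
mean(onehot, axis=0)

This returns (0.333..., 0.333..., 0.333...), since there are 50 samples from each class.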


Fig. 1.11 Vectorization of samples.

When there are only two possibilities, two classes, it’s simpler to encode
the classes as follows:

f(x) = 1 if x is heads,   f(x) = 0 if x is tails.

Then the mean (1.3.2) is the proportion p̂ of heads.

Even when the samples are already scalars or vectors, we may still want
to vectorize them. For example, suppose x1 , x2 , . . . , xN are the prices of a
sample of printers from across the country. Then the average price (1.3.1) is
well-defined. Nevertheless, we may set
f(x) = 1 if x is greater than $100,   f(x) = 0 if x is ≤ $100.

Then the mean (1.3.2) is the sample proportion p̂ of printers that cost more
than $100.
In §6.4, we use vectorization to derive the chi-squared tests.

Exercises

Exercise 1.3.1 For the dataset = array([[1,3,-2,0],[2,4,11,66]]), the commands mean(dataset,axis=1) and mean(dataset,axis=0) return means in R2 and in R4. What does mean(dataset) return and why?

Exercise 1.3.2 What is the average petal length in the Iris dataset?

Exercise 1.3.3 What is the average shading of the pixels in the first image
in the MNIST dataset?

Exercise 1.3.4 What’s the difference between plot and scatter in

from numpy import *


from matplotlib.pyplot import scatter