MACHINE LEARNING WITH PYTHON
Theory and Applications
G. R. Liu
University of Cincinnati, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data
Names: Liu, G. R. (Gui-Rong), author.
Title: Machine learning with Python : theory and applications / G.R. Liu, University of Cincinnati, USA.
Description: Singapore ; Hackensack, NJ : World Scientific Publishing Co. Pte. Ltd., [2023] |
Includes bibliographical references and index.
Identifiers: LCCN 2022001048 | ISBN 9789811254178 (hardcover) |
ISBN 9789811254185 (ebook for institutions) | ISBN 9789811254192 (ebook for individuals)
Subjects: LCSH: Machine learning. | Python (Computer program language)
Classification: LCC Q325.5 .L58 2023 | DDC 006.3/1--dc23/eng20220328
LC record available at https://lccn.loc.gov/2022001048
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or
mechanical, including photocopying, recording or any information storage and retrieval system now known or to
be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center,
Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from
the publisher.
For any available supplementary material, please visit
https://www.worldscientific.com/worldscibooks/10.1142/12774#t=suppl
Desk Editors: Jayanthi Muthuswamy/Steven Patt
Typeset by Stallion Press
Email: [email protected]
Printed in Singapore
About the Author
G. R. Liu received his Ph.D. from Tohoku University, Japan, in 1991. He was a post-doctoral fellow at Northwestern University, USA, from 1991 to 1993. He was a Professor at the National University of Singapore until 2010 and is currently a Professor at the University of Cincinnati, USA. He is the founder of the Association for Computational Mechanics (Singapore) (SACM) and served as the President of SACM until 2010. He served as the President of the Asia-Pacific Association for Computational Mechanics (APACM) (2010–2013) and as an Executive Council Member of the International Association for Computational Mechanics (IACM) (2005–2010; 2020–2026). He has authored a large number of journal papers and books, including two bestsellers: Mesh Free Methods: Moving Beyond the Finite Element Method and Smoothed Particle Hydrodynamics: A Meshfree Particle Method. He is the Editor-in-Chief of the International Journal of Computational Methods and served as an Associate Editor for IPSE and MANO. He is the recipient of numerous awards, including the Singapore Defence Technology Prize, NUS Outstanding University Researcher Award and Best Teacher Award, APACM Computational Mechanics Awards, JSME Computational Mechanics Awards, ASME Ted Belytschko Applied Mechanics Award, the Zienkiewicz Medal from APACM, the AJCM Computational Mechanics Award, and the Humboldt Research Award. He has been listed as a world top 1% most influential scientist (Highly Cited Researchers) by Thomson Reuters in 2014–2016, 2018, and 2019. His ISI citations by others number about 22,000, with an ISI H-index of about 85 and a Google Scholar H-index of 110.
Contents
About the Author v
1 Introduction 1
1.1 Naturally Learned Ability for Problem Solving . . . . . . . 1
1.2 Physics-Law-based Models . . . . . . . . . . . . . . . . . . 1
1.3 Machine Learning Models, Data-based . . . . . . . . . . . 3
1.4 General Steps for Training Machine Learning Models . . . 4
1.5 Some Mathematical Concepts, Variables, and Spaces . . . 5
1.5.1 Toy examples . . . . . . . . . . . . . . . . . . . . . 5
1.5.2 Feature space . . . . . . . . . . . . . . . . . . . . . 6
1.5.3 Affine space . . . . . . . . . . . . . . . . . . . . . 7
1.5.4 Label space . . . . . . . . . . . . . . . . . . . . . . 8
1.5.5 Hypothesis space . . . . . . . . . . . . . . . . . . . 9
1.5.6 Definition of a typical machine learning model,
a mathematical view . . . . . . . . . . . . . . . . . 10
1.6 Requirements for Creating Machine Learning Models . . . 11
1.7 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Relation Between Physics-Law-based and Data-based
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.10 Who May Read This Book . . . . . . . . . . . . . . . . . . 14
1.11 Codes Used in This Book . . . . . . . . . . . . . . . . . . . 14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Basics of Python 19
2.1 An Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Briefing on Python . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Variable Types . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Underscore placeholder . . . . . . . . . . . . . . . 28
2.3.3 Strings . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.4 Conversion between types of variables . . . . . . . 36
2.3.5 Variable formatting . . . . . . . . . . . . . . . . . 38
2.4 Arithmetic Operators . . . . . . . . . . . . . . . . . . . . . 39
2.4.1 Addition, subtraction, multiplication, division,
and power . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Built-in functions . . . . . . . . . . . . . . . . . . 40
2.5 Boolean Values and Operators . . . . . . . . . . . . . . . . 41
2.6 Lists: A diversified variable type container . . . . . . . . . 42
2.6.1 List creation, appending, concatenation,
and updating . . . . . . . . . . . . . . . . . . . . . 42
2.6.2 Element-wise addition of lists . . . . . . . . . . . . 44
2.6.3 Slicing strings and lists . . . . . . . . . . . . . . . 46
2.6.4 Underscore placeholders for lists . . . . . . . . . . 49
2.6.5 Nested list (lists in lists in lists) . . . . . . . . . . 49
2.7 Tuples: Value preserved . . . . . . . . . . . . . . . . . . . . 50
2.8 Dictionaries: Indexable via keys . . . . . . . . . . . . . . . 51
2.8.1 Assigning data to a dictionary . . . . . . . . . . . 51
2.8.2 Iterating over a dictionary . . . . . . . . . . . . . 52
2.8.3 Removing a value . . . . . . . . . . . . . . . . . . 53
2.8.4 Merging two dictionaries . . . . . . . . . . . . . . 54
2.9 Numpy Arrays: Handy for scientific computation . . . . . . 55
2.9.1 Lists vs. Numpy arrays . . . . . . . . . . . . . . . 55
2.9.2 Structure of a numpy array . . . . . . . . . . . . . 55
2.9.3 Axis of a numpy array . . . . . . . . . . . . . . . . 60
2.9.4 Element-wise computations . . . . . . . . . . . . . 61
2.9.5 Handy ways to generate multi-dimensional
arrays . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9.6 Use of external package: MXNet . . . . . . . . . . 63
2.9.7 In-place operations . . . . . . . . . . . . . . . . . 66
2.9.8 Slicing from a multi-dimensional array . . . . . . . 67
2.9.9 Broadcasting . . . . . . . . . . . . . . . . . . . . . 67
2.9.10 Converting between MXNet NDArray
and NumPy . . . . . . . . . . . . . . . . . . . . . 70
2.9.11 Subsetting in Numpy . . . . . . . . . . . . . . . . 71
2.9.12 Numpy and universal functions (ufunc) . . . . . . 71
2.9.13 Numpy array and vector/matrix . . . . . . . . . . 72
2.10 Sets: No Duplication . . . . . . . . . . . . . . . . . . . . . 75
2.10.1 Intersection of two sets . . . . . . . . . . . . . . . 75
2.10.2 Difference of two sets . . . . . . . . . . . . . . . . 75
2.11 List Comprehensions . . . . . . . . . . . . . . . . . . . . . 76
2.12 Conditions, “if” Statements, “for” and “while” Loops . . . 77
2.12.1 Comparison operators . . . . . . . . . . . . . . . . 77
2.12.2 The “in” operator . . . . . . . . . . . . . . . . . . 78
2.12.3 The “is” operator . . . . . . . . . . . . . . . . . . 78
2.12.4 The “not” operator . . . . . . . . . . . . . . . . 80
2.12.5 The “if” statements . . . . . . . . . . . . . . . . . 80
2.12.6 The “for” loops . . . . . . . . . . . . . . . . . . . 81
2.12.7 The “while” loops . . . . . . . . . . . . . . . . . . 82
2.12.8 Ternary conditionals . . . . . . . . . . . . . . . . . 84
2.13 Functions (Methods) . . . . . . . . . . . . . . . . . . . . . 84
2.13.1 Block structure for function definition . . . . . . . 84
2.13.2 Function with arguments . . . . . . . . . . . . . . 84
2.13.3 Lambda functions (Anonymous functions) . . . . 86
2.14 Classes and Objects . . . . . . . . . . . . . . . . . . . . . . 86
2.14.1 A simplest class . . . . . . . . . . . . . . . . . . . 86
2.14.2 A class for scientific computation . . . . . . . . . 89
2.14.3 Subclass (class inheritance) . . . . . . . . . . . . . 90
2.15 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.16 Generation of Plots . . . . . . . . . . . . . . . . . . . . . . 92
2.17 Code Performance Assessment . . . . . . . . . . . . . . . . 93
2.18 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3 Basic Mathematical Computations 95
3.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.1.1 Scalar numbers . . . . . . . . . . . . . . . . . . . . 96
3.1.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . 96
3.1.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . 98
3.1.4 Tensors . . . . . . . . . . . . . . . . . . . . . . . . 100
3.1.5 Sum and mean of a tensor . . . . . . . . . . . . . 101
3.1.6 Dot-product of two vectors . . . . . . . . . . . . . 102
3.1.7 Outer product of two vectors . . . . . . . . . . . . 105
3.1.8 Matrix-vector product . . . . . . . . . . . . . . . . 106
3.1.9 Matrix-matrix multiplication . . . . . . . . . . . . 106
3.1.10 Norms . . . . . . . . . . . . . . . . . . . . . . . . 108
3.1.11 Solving algebraic system equations . . . . . . . . . 109
3.1.12 Matrix inversion . . . . . . . . . . . . . . . . . . . 111
3.1.13 Eigenvalue decomposition of a matrix . . . . . . . 113
3.1.14 Condition number of a matrix . . . . . . . . . . . 116
3.1.15 Rank of a matrix . . . . . . . . . . . . . . . . . . 118
3.2 Rotation Matrix . . . . . . . . . . . . . . . . . . . . . . . . 119
3.3 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.3.1 1-D piecewise linear interpolation using
numpy.interp . . . . . . . . . . . . . . . . . . . . . 121
3.3.2 1-D least-squares solution approximation . . . . . 122
3.3.3 1-D interpolation using interp1d . . . . . . . . . . 124
3.3.4 2-D spline representation
using bisplrep . . . . . . . . . . . . . . . . . . . . 124
3.3.5 Radial basis functions for smoothing and
interpolation . . . . . . . . . . . . . . . . . . . . . 126
3.4 Singular Value Decomposition . . . . . . . . . . . . . . . . 129
3.4.1 SVD formulation . . . . . . . . . . . . . . . . . . . 129
3.4.2 Algorithms for SVD . . . . . . . . . . . . . . . . . 130
3.4.3 Numerical examples . . . . . . . . . . . . . . . . . 131
3.4.4 SVD for data compression . . . . . . . . . . . . . 133
3.5 Principal Component Analysis . . . . . . . . . . . . . . . . 135
3.5.1 PCA formulation . . . . . . . . . . . . . . . . . . 135
3.5.2 Numerical examples . . . . . . . . . . . . . . . . . 137
3.6 Numerical Root Finding . . . . . . . . . . . . . . . . . . . 143
3.7 Numerical Integration . . . . . . . . . . . . . . . . . . . . . 145
3.7.1 Trapezoid rule . . . . . . . . . . . . . . . . . . . . 145
3.7.2 Gauss integration . . . . . . . . . . . . . . . . . . 147
3.8 Initial data treatment . . . . . . . . . . . . . . . . . . . . . 148
3.8.1 Min-max scaling . . . . . . . . . . . . . . . . . . . 149
3.8.2 “One-hot” encoding . . . . . . . . . . . . . . . . . 152
3.8.3 Standard scaling . . . . . . . . . . . . . . . . . . . 153
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4 Statistics and Probability-based Learning Model 157
4.1 Analysis of Probability of an Event . . . . . . . . . . . . . 158
4.1.1 Random sampling, controlled random
sampling . . . . . . . . . . . . . . . . . . . . . . . 158
4.1.2 Probability . . . . . . . . . . . . . . . . . . . . . . 160
4.2 Random Distributions . . . . . . . . . . . . . . . . . . . . . 164
4.2.1 Uniform distribution . . . . . . . . . . . . . . . . . 165
4.2.2 Normal distribution (Gaussian distribution) . . . 165
4.3 Entropy of Probability . . . . . . . . . . . . . . . . . . . . 167
4.3.1 Example 1: Probability and its entropy . . . . . . 169
4.3.2 Example 2: Variation of entropy . . . . . . . . . . 170
4.3.3 Example 3: Entropy for events with a variable
that takes different numbers of values of uniform
distribution . . . . . . . . . . . . . . . . . . . . . . 172
4.4 Cross-Entropy: Predicted and True Probability . . . . . . 173
4.4.1 Example 1: Cross-entropy of a quality
prediction . . . . . . . . . . . . . . . . . . . . . . . 174
4.4.2 Example 2: Cross-entropy of a poor
prediction . . . . . . . . . . . . . . . . . . . . . . . 175
4.5 KL-Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.5.1 Example 1: KL-divergence of a distribution
of quality prediction . . . . . . . . . . . . . . . . . 176
4.5.2 Example 2: KL-divergence of a poorly
predicted distribution . . . . . . . . . . . . . . . . 176
4.6 Binary Cross-Entropy . . . . . . . . . . . . . . . . . . . . . 177
4.6.1 Example 1: Binary cross-entropy for a distribution
of quality prediction . . . . . . . . . . . . . . . . . 178
4.6.2 Example 2: Binary cross-entropy for a poorly
predicted distribution . . . . . . . . . . . . . . . . 178
4.6.3 Example 3: Binary cross-entropy for more uniform
true distribution: A quality prediction . . . . . . . 179
4.6.4 Example 4: Binary cross-entropy for more uniform
true distribution: A poor prediction . . . . . . . . 180
4.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . 180
4.8 Naive Bayes Classification: Statistics-based Learning . . . 181
4.8.1 Formulation . . . . . . . . . . . . . . . . . . . . . 181
4.8.2 Case study: Handwritten digits recognition . . . . 181
4.8.3 Algorithm for the Naive Bayes classification . . . 182
4.8.4 Testing the Naive Bayes model . . . . . . . . . . . 185
4.8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . 187
5 Prediction Function and Universal Prediction Theory 189
5.1 Linear Prediction Function and Affine Transformation . . . 190
5.1.1 Linear prediction function: A basic
hypothesis . . . . . . . . . . . . . . . . . . . . . . 191
5.1.2 Predictability for constants, the role
of the bias . . . . . . . . . . . . . . . . . . . . . . 192
5.1.3 Predictability for linear functions:
The role of the weights . . . . . . . . . . . . . . . 192
5.1.4 Prediction of linear functions: A machine
learning procedure . . . . . . . . . . . . . . . . . . 193
5.1.5 Affine transformation . . . . . . . . . . . . . . . . 194
5.2 Affine Transformation Unit (ATU), A Simplest
Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.3 Typical Data Structures . . . . . . . . . . . . . . . . . . . 198
5.4 Demonstration Examples of Affine Transformation . . . . . 199
5.4.1 An edge, a rectangle under affine
transformation . . . . . . . . . . . . . . . . . . . . 202
5.4.2 A circle under affine transformation . . . . . . . . 204
5.4.3 A spiral under affine transformation . . . . . . . . 205
5.4.4 Fern leaf under affine transformation . . . . . . . 205
5.4.5 On linear prediction function with affine
transformation . . . . . . . . . . . . . . . . . . . . 206
5.4.6 Affine transformation wrapped with activation
function . . . . . . . . . . . . . . . . . . . . . . . . 206
5.5 Parameter Encoding and the Essential Mechanism
of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.5.1 The x to ŵ encoding, a data-parameter
converter unit . . . . . . . . . . . . . . . . . . . . 210
5.5.2 Uniqueness of the encoding . . . . . . . . . . . . . 211
5.5.3 Uniqueness of the encoding: Not affected
by activation function . . . . . . . . . . . . . . . . 212
5.6 The Gradient of the Prediction Function . . . . . . . . . . 213
5.7 Affine Transformation Array (ATA) . . . . . . . . . . . . . 213
5.8 Predictability of High-Order Functions of a Deepnet . . . . 214
5.8.1 A role of activation functions . . . . . . . . . . . . 214
5.8.2 Formation of a deepnet by chaining ATA . . . . . 215
5.8.3 Example: A 1 → 1 → 1 network . . . . . . . . . . 217
5.9 Universal Prediction Theory . . . . . . . . . . . . . . . . . 218
5.10 Nonlinear Affine Transformations . . . . . . . . . . . . . . 219
5.11 Feature Functions in Physics-Law-based Models . . . . . . 220
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6 The Perceptron and SVM 223
6.1 Linearly Separable Classification Problems . . . . . . . . . 224
6.2 A Python Code for the Perceptron . . . . . . . . . . . . . 226
6.3 The Perceptron Convergence Theorem . . . . . . . . . . . 233
6.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . 237
6.4.1 Problem statement . . . . . . . . . . . . . . . . . 237
6.4.2 Formulation of objective function
and constraints . . . . . . . . . . . . . . . . . . . . 238
6.4.3 Modified objective function with constraints:
Multipliers method . . . . . . . . . . . . . . . . . 242
6.4.4 Converting to a standard quadratic programming
problem . . . . . . . . . . . . . . . . . . . . . . . . 245
6.4.5 Prediction in SVM . . . . . . . . . . . . . . . . . . 249
6.4.6 Example: A Python code for SVM . . . . . . . . . 250
6.4.7 Confusion matrix . . . . . . . . . . . . . . . . . . 254
6.4.8 Example: A Scikit-learn class for SVM . . . . . . 254
6.4.9 SVM for datasets not separable with
hyperplanes . . . . . . . . . . . . . . . . . . . . . 256
6.4.10 Kernel trick . . . . . . . . . . . . . . . . . . . . . 257
6.4.11 Example: SVM classification with curves . . . . . 258
6.4.12 Multiclass classification via SVM . . . . . . . . . . 260
6.4.13 Example: Use of SVM classifiers for
iris dataset . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7 Activation Functions and Universal
Approximation Theory 265
7.1 Sigmoid Function (σ(z)) . . . . . . . . . . . . . . . . . . . 266
7.2 Sigmoid Function of an Affine Transformation
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
7.3 Neural-Pulse-Unit (NPU) . . . . . . . . . . . . . . . . . . 269
7.4 Universal Approximation Theorem . . . . . . . . . . . . . 274
7.4.1 Function approximation using NPUs . . . . . . . . 274
7.4.2 Function approximations using neuron
basis functions . . . . . . . . . . . . . . . . . . . . 275
7.4.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . 281
7.5 Hyperbolic Tangent Function (tanh) . . . . . . . . . . . . . 282
7.6 Relu Functions . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.7 Softplus Function . . . . . . . . . . . . . . . . . . . . . . . 286
7.8 Conditions for activation functions . . . . . . . . . . . . . . 288
7.9 Novel activation functions . . . . . . . . . . . . . . . . . . 288
7.9.1 Rational activation function . . . . . . . . . . . . 288
7.9.2 Power function . . . . . . . . . . . . . . . . . . . . 292
7.9.3 Power-linear function . . . . . . . . . . . . . . . . 294
7.9.4 Power-quadratic function . . . . . . . . . . . . . . 297
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8 Automatic Differentiation and Autograd 303
8.1 General Issues on Optimization and Minimization . . . . . 303
8.2 Analytic Differentiation . . . . . . . . . . . . . . . . . . . . 304
8.3 Numerical Differentiation . . . . . . . . . . . . . . . . . . . 305
8.4 Automatic Differentiation . . . . . . . . . . . . . . . . . . . 305
8.4.1 The concept of automatic or algorithmic
differentiation . . . . . . . . . . . . . . . . . . . . 305
8.4.2 Differentiation of a function with respect
to a vector and matrix . . . . . . . . . . . . . . . 306
8.5 Autograd Implemented in Numpy . . . . . . . . . . . . . . 308
8.6 Autograd Implemented in the MXNet . . . . . . . . . . . . 310
8.6.1 Gradients of scalar functions with simple
variable . . . . . . . . . . . . . . . . . . . . . . . . 311
8.6.2 Gradients of scalar functions in high
dimensions . . . . . . . . . . . . . . . . . . . . . . 313
8.6.3 Gradients of scalar functions with quadratic
variables in high dimensions . . . . . . . . . . . . 318
8.6.4 Gradient of scalar function with a matrix of
variables in high dimensions . . . . . . . . . . . . 319
8.6.5 Head gradient . . . . . . . . . . . . . . . . . . . . 320
8.7 Gradients for Functions with Conditions . . . . . . . . . . 322
8.8 Example: Gradients of an L2 Loss Function for
a Single Neuron . . . . . . . . . . . . . . . . . . . . . . . . 323
8.9 Examples: Differences Between Analytical, Autograd,
and Numerical Differentiation . . . . . . . . . . . . . . . . 327
8.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9 Solution Existence Theory and
Optimization Techniques 331
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 331
9.2 Analytic Optimization Methods: Ideal Cases . . . . . . . . 332
9.2.1 Least square formulation . . . . . . . . . . . . . . 332
9.2.2 L2 loss function . . . . . . . . . . . . . . . . . . . 333
9.2.3 Normal equation . . . . . . . . . . . . . . . . . . . 334
9.2.4 Solution existence analysis . . . . . . . . . . . . . 334
9.2.5 Solution existence theory . . . . . . . . . . . . . . 336
9.2.6 Effects of parallel data-points . . . . . . . . . . . . 337
9.2.7 Predictability of the solution against
the label . . . . . . . . . . . . . . . . . . . . . . . 337
9.3 Considerations in Optimization for Complex
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.3.1 Local minima . . . . . . . . . . . . . . . . . . . . 339
9.3.2 Saddle points . . . . . . . . . . . . . . . . . . . . . 340
9.3.3 Convex functions . . . . . . . . . . . . . . . . . . 343
9.4 Gradient Descent (GD) Method for Optimization . . . . . 344
9.4.1 Gradient descent in one dimension . . . . . . . . . 345
9.4.2 Remarks . . . . . . . . . . . . . . . . . . . . . . . 346
9.4.3 Gradient descent in hyper-dimensions . . . . . . . 347
9.4.4 Property of a convex function . . . . . . . . . . . 348
9.4.5 The convergence theorem for the Gradient
Descent algorithm . . . . . . . . . . . . . . . . 349
9.4.6 Setting of the learning rates . . . . . . . . . . 351
9.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 353
9.5.1 Numerical experiment . . . . . . . . . . . . . . . . 354
9.6 Gradient Descent with Momentum . . . . . . . . . . . . . 363
9.6.1 The most critical problem with GD methods . . . 363
9.6.2 Formulation . . . . . . . . . . . . . . . . . . . . . 365
9.6.3 Numerical experiment . . . . . . . . . . . . . . . . 368
9.7 Nesterov Accelerated Gradient . . . . . . . . . . . . . . . . 370
9.7.1 Formulation . . . . . . . . . . . . . . . . . . . . . 370
9.8 AdaGrad Gradient Algorithm . . . . . . . . . . . . . . . . 371
9.8.1 Formulation . . . . . . . . . . . . . . . . . . . . . 371
9.8.2 Numerical experiment . . . . . . . . . . . . . . . . 372
9.9 RMSProp Gradient Algorithm . . . . . . . . . . . . . . . . 374
9.9.1 Formulation . . . . . . . . . . . . . . . . . . . . . 375
9.9.2 Numerical experiment . . . . . . . . . . . . . . . . 375
9.10 AdaDelta Gradient Algorithm . . . . . . . . . . . . . . . . 378
9.10.1 The idea . . . . . . . . . . . . . . . . . . . . . . . 378
9.10.2 Numerical experiment . . . . . . . . . . . . . . . . 378
9.11 Adam Gradient Algorithm . . . . . . . . . . . . . . . . . . 381
9.11.1 Formulation . . . . . . . . . . . . . . . . . . . . . 381
9.11.2 Numerical experiment . . . . . . . . . . . . . . . . 382
9.12 A Case Study: Compare Minimization Techniques
Used in MLPClassifier . . . . . . . . . . . . . . . . . . . . 385
9.13 Other Algorithms . . . . . . . . . . . . . . . . . . . . . . . 386
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10 Loss Functions for Regression 389
10.1 Formulations for Linear Regression . . . . . . . . . . . . . 390
10.1.1 Mathematical model . . . . . . . . . . . . . . . . . 390
10.1.2 Neural network configuration . . . . . . . . . . . . 390
10.1.3 The xw formulation . . . . . . . . . . . . . . . . . 391
10.2 Loss Functions for Linear Regression . . . . . . . . . . . . 391
10.2.1 Mean squared error loss or L2 loss function . . . . 392
10.2.2 Absolute error loss or L1 loss function . . . . . . . 393
10.2.3 Huber loss function . . . . . . . . . . . . . . . . . 394
10.2.4 Log-cosh loss function . . . . . . . . . . . . . . . . 394
10.2.5 Comparison between these loss functions . . . . . 395
10.2.6 Python codes for these loss functions . . . . . . . 396
10.3 Python Codes for Regression . . . . . . . . . . . . . . . . . 398
10.3.1 Linear regression using high-order polynomial
and other feature functions . . . . . . . . . . . . . 401
10.3.2 Linear regression using Gaussian basis
functions . . . . . . . . . . . . . . . . . . . . . . . 404
10.4 Neural Network Model for Linear Regressions
with Big Datasets . . . . . . . . . . . . . . . . . . . . . . . 406
10.4.1 Setting up neural network models . . . . . . . . . 406
10.4.2 Create data iterators . . . . . . . . . . . . . . . . 409
10.4.3 Training parameters . . . . . . . . . . . . . . . . . 411
10.4.4 Define the neural network . . . . . . . . . . . . . . 412
10.4.5 Define the loss function . . . . . . . . . . . . . . . 412
10.4.6 Use of optimizer . . . . . . . . . . . . . . . . . . . 412
10.4.7 Execute the training . . . . . . . . . . . . . . . . . 412
10.4.8 Examining training progress . . . . . . . . . . . . 413
10.5 Neural Network Model for Nonlinear Regression . . . . . . 415
10.5.1 Train models on the Boston housing price
dataset . . . . . . . . . . . . . . . . . . . . . . . . 416
10.5.2 Plotting partial dependence for two features . . . 416
10.5.3 Plot curves on top of each other . . . . . . . . . . 418
10.6 On Nonlinear Regressions . . . . . . . . . . . . . . . . . . . 418
10.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
11 Loss Functions and Models for Classification 421
11.1 Prediction Functions . . . . . . . . . . . . . . . . . . . . . 421
11.1.1 Linear function . . . . . . . . . . . . . . . . . . . . 422
11.1.2 Logistic prediction function . . . . . . . . . . . . . 422
11.1.3 The tanh prediction function . . . . . . . . . . . . 423
11.2 Loss Functions for Classification Problems . . . . . . . . . 423
11.2.1 The margin concept . . . . . . . . . . . . . . . . . 423
11.2.2 0–1 loss . . . . . . . . . . . . . . . . . . . . . . . . 424
11.2.3 Hinge loss . . . . . . . . . . . . . . . . . . . . . . 425
11.2.4 Logistic loss . . . . . . . . . . . . . . . . . . . . . 426
11.2.5 Exponential loss . . . . . . . . . . . . . . . . . . . 427
11.2.6 Square loss . . . . . . . . . . . . . . . . . . . . . . 427
11.2.7 Binary cross-entropy loss . . . . . . . . . . . . . . 429
11.2.8 Remarks . . . . . . . . . . . . . . . . . . . . . . . 432
11.3 A Simple Neural Network for Classification . . . . . . . . . 432
11.4 Example of Binary Classification Using Neural
Network with mxnet . . . . . . . . . . . . . . . . . . . . . 433
11.4.1 Dataset for binary classification . . . . . . . . . . 433
11.4.2 Define loss functions . . . . . . . . . . . . . . . . . 435
11.4.3 Plot the convergence curve of the
loss function . . . . . . . . . . . . . . . . . . . . . 437
11.4.4 Computing the accuracy of the
trained model . . . . . . . . . . . . . . . . . . . . 437
11.5 Example of Binary Classification Using Sklearn . . . . . . 438
11.6 Regression with Decision Tree, AdaBoost,
and Gradient Boosting . . . . . . . . . . . . . . . . . . . . 443
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
12 Multiclass Classification 445
12.1 Softmax Activation Neural Networks for
k-Classifications . . . . . . . . . . . . . . . . . . . . . . . . 445
12.2 Cross-Entropy Loss Function for k-Classifications . . . . . 447
12.3 Case Study 1: Handwritten Digit Classification
with 1-Layer NN . . . . . . . . . . . . . . . . . . . . . . . . 448
12.3.1 Set contexts according to computer hardware . . . 448
12.3.2 Loading the MNIST dataset . . . . . . . . . . . . 448
12.3.3 Set model parameters . . . . . . . . . . . . . . . . 451
12.3.4 Multiclass logistic regression . . . . . . . . . . . . 451
12.3.5 Defining a neural network model . . . . . . . . . . 452
12.3.6 Defining the cross-entropy loss function . . . . . . 452
12.3.7 Optimization method . . . . . . . . . . . . . . . . 453
12.3.8 Accuracy evaluation . . . . . . . . . . . . . . . . . 453
12.3.9 Initiation of the model and training execution . . 453
12.3.10 Prediction with the trained model . . . . . . . . . 455
12.4 Case Study 2: Handwritten Digit Classification with
Sklearn Random Forest Multi-Classifier . . . . . . . . . . . 456
12.5 Case Study 3: Comparison of Random Forest,
Extra-Forest, and Gradient Boosting for
Multi-Classifier . . . . . . . . . . . . . . . . . . . . . . . . 460
12.6 Multi-Classification via TensorFlow . . . . . . . . . . . . . 464
12.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
13 Multilayer Perceptron (MLP) for Regression
and Classification 467
13.1 The General Architecture and Formulations of MLP . . . . 467
13.1.1 The general architecture . . . . . . . . . . . . . . 467
13.1.2 The xw+b formulation . . . . . . . . . . . . . . . 469
13.1.3 The xw formulation, use of affine transformation
weight matrix . . . . . . . . . . . . . . . . . . . . 471
13.1.4 MLP configuration with affine transformation
weight matrix . . . . . . . . . . . . . . . . . . . . 473
13.1.5 Space evolution process in MLP . . . . . . . . . . 474
13.2 Neurons-Samples Theory . . . . . . . . . . . . . . . . . . . 474
13.2.1 Affine spaces and the training parameters
used in an MLP . . . . . . . . . . . . . . . . . . . 475
13.2.2 Neurons-Samples Theory for MLPs . . . . . . . . 476
13.3 Nonlinear Activation Functions for the Hidden Layers . . . 478
13.4 General Rule for Estimating Learning Parameters
in an MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
13.5 Key Techniques for MLP and Its Capability . . . . . . . . 479
13.6 A Case Study on Handwritten Digits Using MXNet . . . . 481
13.6.1 Import necessary libraries and load data . . . . . 481
13.6.2 Set neural network model parameters . . . . . . . 482
13.6.3 Softmax cross entropy loss function . . . . . . . . 482
13.6.4 Define a neural network model . . . . . . . . . . . 483
13.6.5 Optimization method . . . . . . . . . . . . . . . . 484
13.6.6 Model accuracy evaluation . . . . . . . . . . . . . 484
13.6.7 Training the neural network and timing
the training . . . . . . . . . . . . . . . . . . . . . . 484
13.6.8 Prediction with the model trained . . . . . . . . . 486
13.6.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . 487
13.7 Visualization of MLP Weights Using Sklearn . . . . . . . . 488
13.7.1 Import necessary Sklearn module . . . . . . . . . 488
13.7.2 Load MNIST dataset . . . . . . . . . . . . . . . . 488
13.7.3 Set an MLP model . . . . . . . . . . . . . . . . . . 489
13.7.4 Training the MLP model and time the
training . . . . . . . . . . . . . . . . . . . . . . . . 489
13.7.5 Performance analysis . . . . . . . . . . . . . . . . 489
13.7.6 Viewing the weight matrix as images . . . . . . . 490
13.8 MLP for Nonlinear Regression . . . . . . . . . . . . . . . . 490
13.8.1 California housing data and preprocessing . . . . . 492
13.8.2 Configure, train, and test the MLP . . . . . . . . 493
13.8.3 Compute and plot the partial dependence . . . . . 494
13.8.4 Comparison studies on different regressors . . . . 495
13.8.5 Gradient boosting regressor . . . . . . . . . . . . . 495
13.8.6 Decision tree regressor . . . . . . . . . . . . . . . . 498
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
14 Overfitting and Regularization 501
14.1 Why Regularization . . . . . . . . . . . . . . . . . . . . . . 501
14.2 Tikhonov Regularization . . . . . . . . . . . . . . . . . . . 504
14.2.1 Demonstration examples: One data-point . . . . . 508
14.2.2 Demonstration examples: Two data-points . . . . 517
14.2.3 Demonstration examples: Three data-points . . . 521
14.2.4 Summary of the case studies . . . . . . . . . . . . 525
14.3 A Case Study on Regularization Effects using MXNet . . . 526
14.3.1 Load the MNIST dataset . . . . . . . . . . . . . . 527
14.3.2 Define a neural network model . . . . . . . . . . . 527
14.3.3 Define loss function and optimizer . . . . . . . . . 527
14.3.4 Define a function to evaluate the accuracy . . . . 528
14.3.5 Define a utility function plotting
convergence curve . . . . . . . . . . . . . . . . . . 528
14.3.6 Train the neural network model . . . . . . . . . . 529
14.3.7 Evaluation of the trained model: A typical case
of overfitting . . . . . . . . . . . . . . . . . . . . . 531
14.3.8 Application of L2 regularization . . . . . . . . . . 531
14.3.9 Re-initializing the parameters . . . . . . . . . . . 531
14.3.10 Training the L2-regularized neural
network model . . . . . . . . . . . . . . . . . . . . 531
14.3.11 Effect of the L2 regularization . . . . . . . . . . . 533
14.4 A Case Study on Regularization Parameters
Using Sklearn . . . . . . . . . . . . . . . . . . . . . . . . . 534
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
15 Convolutional Neural Network (CNN)
for Classification and Object Detection 539
15.1 Filter and Convolution . . . . . . . . . . . . . . . . . . . . 539
15.2 Affine Transformation Unit in CNNs . . . . . . . . . . . . 542
15.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
15.4 Up Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 545
15.5 Configuration of a Typical CNN . . . . . . . . . . . . . . . 545
15.6 Some Landmark CNNs . . . . . . . . . . . . . . . . . . . . 546
15.6.1 LeNet-5 . . . . . . . . . . . . . . . . . . . . . . . . 547
15.6.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . 548
15.6.3 VGG-16 . . . . . . . . . . . . . . . . . . . . . . . . 549
15.6.4 ResNet . . . . . . . . . . . . . . . . . . . . . . . . 549
15.6.5 Inception . . . . . . . . . . . . . . . . . . . . . . . 551
15.6.6 YOLO: A CONV net for object detection . . . . . 551
15.7 An Example of Convolutional Neural Network . . . . . . . 552
15.7.1 Import TensorFlow . . . . . . . . . . . . . . . . . 553
15.7.2 Download and preparation of a CIFAR10
dataset . . . . . . . . . . . . . . . . . . . . . . . . 553
15.7.3 Verification of the data . . . . . . . . . . . . . . . 553
15.7.4 Creation of Conv2D layers . . . . . . . . . . . . . 554
15.7.5 Add Dense layers to the Conv2D layers . . . . . . 556
15.7.6 Compile and train the CNN model . . . . . . . . . 557
15.7.7 Evaluation of the trained CNN model . . . . . . . 557
15.8 Applications of YOLO for Object Detection . . . . . . . . 558
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
16 Recurrent Neural Network (RNN) and Sequence
Feature Models 563
16.1 A Typical Structure of LSTMs . . . . . . . . . . . . . . . . 564
16.2 Formulation of LSTMs . . . . . . . . . . . . . . . . . . . . 565
16.2.1 General formulation . . . . . . . . . . . . . . . . . 565
16.2.2 LSTM layer and standard neural layer . . . . . . . 566
16.2.3 Reduced LSTM . . . . . . . . . . . . . . . . . . . 566
16.3 Peephole LSTM . . . . . . . . . . . . . . . . . . . . . . . . 567
16.4 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . 568
16.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
16.5.1 A simple reduced LSTM with a standard NN layer
for regression . . . . . . . . . . . . . . . . . . . . . 569
16.5.2 LSTM class in tensorflow.keras . . . . . . . . . . . 574
16.5.3 Using LSTM for handwritten digit recognition . . 575
16.5.4 Using LSTM for predicting dynamics of
moving vectors . . . . . . . . . . . . . . . . . . . . 578
16.6 Examples of LSTM for Speech Recognition . . . . . . . . . 584
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
17 Unsupervised Learning Techniques 585
17.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 585
17.2 K-means for Clustering . . . . . . . . . . . . . . . . . . . . 585
17.2.1 Initialization of means . . . . . . . . . . . . . . . . 586
17.2.2 Assignment of data-points to clusters . . . . . . . 587
17.2.3 Update of means . . . . . . . . . . . . . . . . . . . 588
17.2.4 Example 1: Case studies on comparison of
initiation methods for K-means clustering . . . . 590
17.2.5 Example 2: K-means clustering on the
handwritten digit dataset . . . . . . . . . . . . . . 601
17.3 Mean-Shift for Clustering Without Pre-Specifying k . . . . 605
17.4 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . 609
17.4.1 Basic structure of autoencoders . . . . . . . . . . 610
17.4.2 Example 1: Image compression and denoising . . . 611
17.4.3 Example 2: Image segmentation . . . . . . . . . . 611
17.5 Autoencoder vs. PCA . . . . . . . . . . . . . . . . . . . . . 615
17.6 Variational Autoencoder (VAE) . . . . . . . . . . . . . . . 617
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
18 Reinforcement Learning (RL) 625
18.1 Basic Underlying Concept . . . . . . . . . . . . . . . . . . 625
18.1.1 Problem statement . . . . . . . . . . . . . . . . . 625
18.1.2 Applications in sciences, engineering,
and business . . . . . . . . . . . . . . . . . . . . . 626
18.1.3 Reinforcement learning approach . . . . . . . . . . 627
18.1.4 Actions in discrete time: Solution strategy . . . . 628
18.2 Markov Decision Process . . . . . . . . . . . . . . . . . . . 629
18.3 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
18.4 Value Functions . . . . . . . . . . . . . . . . . . . . . . . . 630
18.5 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . 631
18.6 Q-learning Algorithm . . . . . . . . . . . . . . . . . . . . . 633
18.6.1 Example 1: A robot explores a room with
unknown obstacles with Q-learning algorithm . . . 633
18.6.2 OpenAI Gym . . . . . . . . . . . . . . . . . . . . . 635
18.6.3 Define utility functions . . . . . . . . . . . . . . . 636
18.6.4 A simple Q-learning algorithm . . . . . . . . . . . 636
18.6.5 Hyper-parameters and convergence . . . . . . . . 640
18.7 Q-Network Learning . . . . . . . . . . . . . . . . . . . . . . 641
18.7.1 Example 2: A robot explores a room with
unknown obstacles with Q-Network . . . . . . . . 641
18.7.2 Building TensorFlow graph . . . . . . . . . . . . . 642
18.7.3 Results from the Q-Network . . . . . . . . . . . . 644
18.8 Policy gradient methods . . . . . . . . . . . . . . . . . . . 646
18.8.1 PPO with NN policy . . . . . . . . . . . . . . . . 646
18.8.2 Strategy used in policy gradient methods
and PPO . . . . . . . . . . . . . . . . . . . . . . . 647
18.8.3 Ratio policy . . . . . . . . . . . . . . . . . . . . . 649
18.8.4 PPO: Controlling a pole staying upright . . . . . . 650
18.8.5 Save and reload the learned model . . . . . . . . . 654
18.8.6 Evaluate and view the trained model . . . . . . . 654
18.8.7 PPO: Self-driving car . . . . . . . . . . . . . . . . 657
18.8.8 View samples of the racing car before training . . 658
18.8.9 Train the racing car using the CNN policy . . . . 659
18.8.10 Evaluate and view the learned model . . . . . . . 660
18.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
Index 663
Chapter 1
Introduction
1.1 Naturally Learned Ability for Problem Solving
We deal with all kinds of problems every day and want to solve them in time to make decisions and take actions. We may notice that for many daily-life problems, our decisions are made spontaneously and swiftly, without much conscious effort. This is because we have been learning to solve such problems ever since we were born, and the solutions are already encoded in the neurons of our brain. When we face a similar problem, our decision is spontaneous.
For many complicated problems, especially in science and engineering, one needs to think harder, and often to conduct extensive research and study of the related issues, before a solution can be provided. What if we want to give spontaneous, reliable solutions to these types of problems as well? Some scientists and engineers may be able to do this for some problems, but not many; they have been intensively trained or educated in specially designed courses for dealing with complicated problems.
What if a layman would also like to be able to solve these challenging types of problems? One way is to go through a special learning process. The alternative may be machine learning: developing a special computer model with a mechanism that can be trained to extract features from experience or data, so that it provides a reliable and instantaneous solution for a given type of problem.
1.2 Physics-Law-based Models
Problems in science and engineering are usually much more difficult to solve. This is because we humans can only experience or observe the phenomena associated with a problem, and many phenomena are not easily observable and have very complicated underlying logic. Scientists have been trying to unveil this underlying logic by developing theories (or laws or principles) that best describe the phenomena. These theories are then formulated as algebraic, differential, or integral system equations that govern the key variables involved in the phenomena. The next step is to find a method that can solve these equations for the variables varying in space and time. The final step is to validate the theory by observations and/or experiments that measure the values of these variables. The validated theory is then used to build models to solve problems that exhibit the same phenomena. This type of model is called a physics-law-based model.
The above-mentioned process is essentially what humans have been doing in trying to understand nature, and we have made tremendous progress so far. In this process, we have established a huge number of areas of study, such as physics, mathematics, and biology, which are now referred to as the sciences.

Understanding nature is only a part of the story. Humans want to invent and build new things. A good understanding of various phenomena enables us to do so, and we have built practically everything around us: buildings, bridges, airplanes, space stations, cars, ships, computers, cell phones, the internet, communication systems, and energy systems; the list is endless. In this process, we have established a huge number of areas of development, which are now referred to as engineering.
Understanding biology has helped us discover medicines, treatments for illnesses of humans and animals, treatments for plants and the environment, as well as proper measures and policies for the relationships between humans, animals, plants, and the environment. In this process, we have established a huge number of areas of study, including medicine, agriculture, and ecology.
In this relentless quest throughout history, countless theories, laws, techniques, and methods have been developed in various areas of science, engineering, and biology. For example, in the small area of computational mechanics for designing structural systems, we have developed the finite element method (FEM) [1], the smoothed finite element method (S-FEM) [2], meshfree methods [3, 4], and inverse techniques [5], just to name a few that the author has been working on. It is neither possible nor necessary to list all such methods and techniques here. Our discussion is intended only to provide an overall view of how a problem can be solved based on physics laws.
Note that for many problems in nature, engineering, and society, it is difficult to find proper physics laws that describe them and solve them accurately and effectively. Alternative means are thus needed.
1.3 Machine Learning Models, Data-based
There is a large class of complicated problems (in science, engineering, biology, and daily life) that do not yet have known governing physics laws, or for which the solutions to the governing equations are too expensive to obtain. For this type of problem, on the other hand, we often have data obtained and accumulated through observations, measurements, or historical records. When the data are sufficiently large in volume and of good quality, it is possible to develop computer models that learn from these data. Such a model can then be used to find a solution for this type of problem. This kind of computer model is referred to in this book as a data-based model or machine learning model.
Different types of effective artificial Neural Networks (NNs) with various configurations have been developed and widely used for practical problems in science and engineering, including the multilayer perceptron (MLP) [6–9], Convolutional Neural Networks (CNNs) [10–14], and Recurrent Neural Networks (RNNs) [15–17]. TrumpetNets [8] and TubeNets [9, 18–20] were also recently proposed by the author for creating two-way deepnets using physics-law-based models, such as the FEM [1] and S-FEM [2], as trainers. The unique feature of TrumpetNets and TubeNets is their effectiveness for both forward and inverse problems [5], owing to their unique net architecture. Most importantly, solutions to inverse problems can, for the first time, be derived analytically in explicit formulae. This implies that when a data-based model is built properly, one can find solutions very efficiently.
Machine learning essentially mimics the natural learning process occurring in biological brains, which can have huge numbers of neurons. In terms of the usage of data, there are three major categories:
1. Supervised Learning, using data with true labels (teachers).
2. Unsupervised Learning, using data without labels.
3. Reinforcement Learning, using a predefined environment.
In terms of problems to solve, there are the following:
1. Binary classification problems, answered by a probability of yes or no.
2. k-classification problems, answered by probabilities over k classes.
3. k-clustering problems, answered by k clusters of data-points.
4. Regression (linear or nonlinear), answered by predictions of continuous functions.
5. Feature extraction, answered by the key features in the dataset.
6. Abnormality detection, answered by the abnormal data.
7. Inverse analysis, answered by predictions of features from known responses.
In terms of learning methodology or algorithms, we have the following (a short sketch after this list shows where many of them are available in the Scikit-learn library):
1. Linear and logistic regression, supervised.
2. Decision Tree, supervised.
3. Support Vector Machine (SVM), supervised.
4. Naive Bayes, supervised.
5. Multi-Layer Perceptron (MLP) or artificial Neural Networks (NNs),
supervised.
6. k-Nearest Neighbors (kNN), supervised.
7. Random Forest, supervised.
8. Gradient Boosting types of algorithms, supervised.
9. Principal Components Analysis (PCA), unsupervised.
10. K-means, Mean-Shift, unsupervised.
11. Autoencoders, unsupervised.
12. Markov Decision Process, reinforcement learning.
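As a hedged illustration (not from the original text), the sketch below shows where many of these algorithms are available as ready-made estimators in the Scikit-learn library, which is also used in later chapters; the module paths are standard Scikit-learn ones.

```python
# Where several of the listed algorithms live in Scikit-learn (supervised unless noted).
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC                         # Support Vector Machine
from sklearn.naive_bayes import GaussianNB          # Naive Bayes
from sklearn.neural_network import MLPClassifier    # Multi-Layer Perceptron
from sklearn.neighbors import KNeighborsClassifier  # k-Nearest Neighbors
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.decomposition import PCA               # unsupervised
from sklearn.cluster import KMeans, MeanShift       # unsupervised

# All supervised estimators share one interface: fit on (X, y) and then predict,
# where X is an (m, p) array of m samples with p features and y holds the labels:
#   model = RandomForestClassifier().fit(X_train, y_train)
#   y_pred = model.predict(X_test)
```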
This book will cover most of these algorithms, but our focus will be more on neural-network-based models, because rigorous theory and predictive models can be established for them.
Machine learning is a very active area of research and development. New models, including so-called cognitive machine learning models, are being studied, and there are also techniques for manipulating various ML models. This book, however, will not cover those topics.
1.4 General Steps for Training Machine Learning Models
General steps for training machine learning models are summarized as follows (a minimal code sketch after the list illustrates them):
1. Obtain the dataset for the problem, by your own means of data generation, by importing it from existing sources, or by computer synthesis.
2. Clean up the dataset if it contains objectively known defects.
3. Determine the type of hypothesis for the model.
4. Develop or import a proper module for the algorithm needed for the problem. The learning ability (number of learning parameters) of the model and the size of the dataset shall be properly balanced, if possible. Otherwise, consider the use of regularization techniques.
5. Randomly initialize the learning parameters, or import known pre-trained learning parameters.
6. Perform the training with proper optimization techniques and monitoring measures.
7. Test the trained model using an independent test dataset. This can also be done during the training.
8. Deploy the trained and tested model to the same type of problems from which the training and testing datasets were collected or generated.
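The sketch below is a minimal, hedged illustration of steps 1–7 on a synthetic regression dataset; it assumes the scikit-learn utilities make_regression, train_test_split, StandardScaler, and MLPRegressor, and is not the specific workflow used later in the book.

```python
# A minimal sketch of the general training steps, assuming scikit-learn is available.
from sklearn.datasets import make_regression          # step 1: obtain (here, synthesize) a dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler      # step 2: basic cleanup/scaling
from sklearn.neural_network import MLPRegressor       # steps 3-4: choose hypothesis and module

X, y = make_regression(n_samples=8000, n_features=3, n_targets=2, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 5 (random initialization) happens inside the model; step 6 is the training itself.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
model.fit(X_train, y_train)

# Step 7: test on an independent dataset before deployment (step 8).
print("Test R^2 score:", model.score(X_test, y_test))
```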
1.5 Some Mathematical Concepts, Variables, and Spaces
We shall define the variables and spaces often used in this book for ease of discussion. We first state that this book deals only with real numbers, unless otherwise specified where geometrically closed operations are required. Let us first introduce two toy examples.
1.5.1 Toy examples
Toy Example-1, Regression: Assume we are to build a machine learning model to predict the quality of fruits. Based on three features, size, weight, and roundness (which can easily be observed and measured), we aim to establish a machine learning regression model that predicts the values of two characteristics, sweetness and vitamin-C content (which are difficult to quantify nondestructively), for any given fruit. To build such a model, we make 8,000 measurements on randomly selected fruits from the market and create a dataset with 8,000 paired data-points. Each data-point records the values of the three features and is paired with the values of the two characteristics. The values of these two characteristics are called the labels (ground truth) of the data-point. Such a dataset is called a labeled dataset and can be used systematically to train a machine learning model.
Toy Example-2, Classification: Assume we are to build a machine learning model to classify the type of fruit based on the same three features (size, weight, and roundness). In this case, we want a machine to predict whether any given fruit is an apple or an orange, so that the fruits can be packaged separately in an automatic manner. To achieve this, we make 8,000 measurements on randomly selected fruits of these two types from the market, and create a dataset with 8,000 paired data-points. Each data-point records the values of the three features and is paired with two yes-or-no labels (ground truth), one for apple and one for orange. This dataset is also a labeled dataset for model training.
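As a concrete (hypothetical) illustration of how such labeled datasets are typically held in memory, the arrays below use randomly generated numbers in place of real measurements; only the shapes matter here.

```python
import numpy as np

m, p = 8000, 3                      # m data-points (samples), p features each

# Toy Example-1 (regression): features paired with two continuous labels.
X_reg = np.random.rand(m, p)        # size, weight, roundness for each fruit
y_reg = np.random.rand(m, 2)        # sweetness, vitamin-C content (ground truth)

# Toy Example-2 (classification): features paired with two yes-or-no labels.
X_cls = np.random.rand(m, p)
y_cls = np.zeros((m, 2))            # [1, 0] for apple, [0, 1] for orange
y_cls[: m // 2, 0] = 1.0
y_cls[m // 2 :, 1] = 1.0

print(X_reg.shape, y_reg.shape)     # (8000, 3) (8000, 2)
```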
With an understanding of these two typical types of examples, it should
be easy to extend this to many other types of problems for which a machine
learning model can be effective.
1.5.2 Feature space
Feature space $\mathbb{X}^p$: Machine learning uses datasets that contain $p$ observed or measured real-valued variables, often called features. In our two toy examples, $p = 3$. We may define a $p$-dimensional feature space $\mathbb{X}^p$, which is a vector space (https://en.wikipedia.org/wiki/Vector_space) over the real numbers $\mathbb{R}$ with an inner product defined. A vector in $\mathbb{X}^p$ for an arbitrary point $(x_1, x_2, \ldots, x_p)$ is written as
$$\mathbf{x} = [x_1, x_2, \ldots, x_p], \quad \mathbf{x} \in \mathbb{X}^p \tag{1.1}$$
The origin of $\mathbb{X}^p$ is at $\mathbf{x} = [0, 0, \ldots, 0]$, following the standard for all vector spaces. Note that we use italic for scalar variables, boldface for vectors and matrices, and blackboard bold for spaces (or sets, or objects of that nature); this convention is followed throughout this book. Also, we define all vectors as row vectors by default, as is usual in Python programming. A column vector is treated as a special case of a 2D array (matrix) with only one column.

It is clear that the feature space $\mathbb{X}^p$ is a special case (with vector operations defined) of the real space $\mathbb{R}^p$. Thus, $\mathbb{X}^p \in \mathbb{R}^p$.

Also, $x_i$ $(i = 1, 2, \ldots, p)$ are called linear basis functions (not to be confused with the basis vectors), because a linear combination of the $x_i$ gives a new $\mathbf{x}$ that is still in $\mathbb{X}^p$. A two-dimensional (2D) feature space $\mathbb{X}^2$ is the black plane $x_1$–$x_2$ shown in Fig. 1.1.

An observed data-point with $p$ features is a discrete point in the space, and the corresponding vector $\mathbf{x}_i$ is expressed as
$$\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}], \quad \mathbf{x}_i \in \mathbb{X}^p, \quad \forall\, i = 1, 2, \ldots, m \tag{1.2}$$
where $m$ is the number of measurements, observations, or data-points in the dataset, often also referred to as the number of samples. For the two toy examples, $m = 8{,}000$. For the example shown in Fig. 1.1, the 4 blue vectors represent four data-points in the space $\mathbb{X}^2$, and $m = 4$.
Figure 1.1: Data-points in a 2D feature space $\mathbb{X}^2$ with blue vectors $\mathbf{x}_i = [x_{i1}, x_{i2}]$, and the same data-points in the augmented feature space $\bar{\mathbb{X}}^2$, called the affine space, with red vectors $\mathbf{x}_i = [1, x_{i1}, x_{i2}]$; $i = 1, 2, 3, 4$.
These data-points $\mathbf{x}_i$ $(i = 1, 2, \ldots, m)$ can be stacked to form a dataset matrix denoted $\mathbf{X} \in \mathbb{X}^p$. This is for convenience in formulation; we do not usually form such a matrix in computation, because it can be very large for big datasets with large $m$.
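A small (hypothetical) numerical illustration of this stacking, for the four 2D data-points of Fig. 1.1, might look as follows; the coordinate values are made up for the example.

```python
import numpy as np

# Four data-points in a 2D feature space, each a row vector x_i = [x_i1, x_i2].
x1 = np.array([0.5, 1.0])
x2 = np.array([1.5, 0.5])
x3 = np.array([2.0, 2.0])
x4 = np.array([0.8, 1.7])

# Stacking the m = 4 data-points row by row gives the dataset matrix X of shape (m, p).
X = np.vstack([x1, x2, x3, x4])
print(X.shape)   # (4, 2)
```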
1.5.3 Affine space
Affine space $\bar{\mathbb{X}}^p$: This is an augmented feature space. It is the red plane shown in Fig. 1.1. It has a "complete" set of linear bases (or basis functions):
$$\mathbf{x} = [1, x_1, x_2, \ldots, x_p] \tag{1.3}$$
By complete linear bases, we mean all bases up to the 1st order of all the variables, including the 0th order. The 0th-order basis is the constant basis 1, which provides the augmentation. An affine space is not a vector space, because $\mathbf{0} \notin \bar{\mathbb{X}}^p$ and $(\mathbf{x}_i + \mathbf{x}_j) \notin \bar{\mathbb{X}}^p$, where $i, j = 1, 2, 3,$ or $4$ in Fig. 1.1. This special and fundamentally useful space always has a constant 1 as a component, and thus it has no origin by definition. An operation that acts on an affine space and stays in an affine space is called an affine transformation. It is the most essential operation in the major machine learning models, and the fundamental reason such models are predictive.

An observed data-point with $p$ features can also be represented as an augmented discrete point in the $\bar{\mathbb{X}}^p$ space and can be expressed as
$$\mathbf{x}_i = [1, x_{i1}, x_{i2}, \ldots, x_{ip}], \quad \mathbf{x}_i \in \bar{\mathbb{X}}^p, \quad \forall\, i = 1, 2, \ldots, m \tag{1.4}$$
p
A X space can be created by first spanning Xp by one dimension to Xp+1
via introduction of a new variable x0 as
[x0 , x1 , x2 , . . . , xp ] (1.5)
and then set x0 = 1. These 4 red vectors shown in Fig. 1.1 live in an affine
2
space X .
p
Note that the affine space X is neither Xp+1 nor Xp , and is quite
p
special. A vector in a X is in Xp+1 , but the tip of the vector is confined
in “hyperplane” of x0 = 1. For convenience of discussion in this book, we
say that an affine space has a pseudo-dimension that is p + 1. Its true
dimension is p, but it is a hyperplane in a Xp+1 space.
In terms of function approximation, the linear bases given in Eq. (1.3)
can be used to construct any arbitrary linear function in the feature
space. A proper linear combination of these complete linear bases is still
in the affine space. Such a combination can be used to perform an affine
transformation, which will be discussed in detail in Chapter 5.
These data-points xi (i = 1, 2, . . . , m) are stacked to form an augmented dataset X ∈ X̄p, which is the well-known moment matrix in function approximation theory [1–4]. Again, this is for convenience in formulation. We may not form such a matrix in computation.
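As a minimal sketch (continuing the made-up NumPy example above), the augmentation can be performed by prepending a column of ones, which plays the role of the constant basis x0 = 1:

import numpy as np

X = np.array([[0.5, 1.0],
              [1.5, 0.5],
              [1.0, 2.0],
              [2.0, 1.5]])                  # m = 4 data-points, p = 2 features
ones = np.ones((X.shape[0], 1))             # the 0th-order (constant) basis
X_bar = np.hstack([ones, X])                # augmented dataset, shape (m, p + 1)
print(X_bar)                                # each row is [1, xi1, xi2]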
1.5.4 Label space
Label space Yk: Consider a labeled dataset for supervised machine learning model creation. We shall introduce variables (y1, y2, . . . , yk) of real numbers in R. For toy example-1, k = 2. We may define a label space Yk over the real numbers. It is a vector space. A vector in the space Yk can be written as
y = [y1, y2, . . . , yk], y ∈ Yk ⊆ Rk    (1.6)
A label in a dataset is paired with a data-point. The label for data-point xi
which is denoted as yi can be expressed as
yi = [yi1 , yi2 , . . . , yik ], yi ∈ Yk , ∀i = 1, 2, . . . , m (1.7)
For the toy example-1, yij (i = 1, 2, . . . , 8000; j = 1, 2) are 8,000 real numbers
in the 2D space Y2. For the toy example-2, each label, yi1 or yi2, has a value of 0 or 1 (or −1 or 1), but the labels can still be viewed as living in Y2.
These labels yi (i = 1, 2, . . . , m) can be stacked to form a label set Y ∈ Yk ,
although we may not really do so in computation.
Typically, affine transformations end at the output layer in a neural network and produce a vector in a label space, so that a loss function can be constructed there for “terminal control”.
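As a minimal sketch (with made-up values), labels can likewise be stored as an array with one row per data-point and k columns:

import numpy as np

# Real-valued labels, in the spirit of toy example-1 (k = 2):
Y_reg = np.array([[0.2, 1.3],
                  [0.8, 0.4],
                  [1.1, 2.2],
                  [1.9, 1.0]])
# 0/1 labels, in the spirit of toy example-2 (k = 2):
Y_cls = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1]])
print(Y_reg.shape, Y_cls.shape)    # (4, 2) (4, 2)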
1.5.5 Hypothesis space
The learning parameters ŵ in a machine learning model are continuous variables that live in a hypothesis space denoted as WP over the real numbers. Learning parameters are also called training or trainable parameters; we use these terms interchangeably. The learning parameters include the weights and biases in each and every layer. The hat above w implies that it is a collection of all weights and biases, so that we have a single vector notation for all learning parameters. Its dimension P depends on the type of hypothesis used, including the configuration of the neural networks or ML models. These parameters always work with feature vectors, producing intermediate feature vectors in a new feature space or in a label space, through a properly designed architecture.
These parameters need to be updated, which involves vector operations. To ensure convergence, we need the vector of all learning parameters to obey important vector properties, such as inner products, norms, and the Cauchy-Schwarz inequality. We will carry out such proofs multiple times in this book. Therefore, we require WP to be a vector space, so that each update to the current learning parameters results in new parameters that are still in the same vector space, until they converge.
Note that the learning parameters, in general, are in the form of matrices or column vectors (which can be viewed as a special case of matrices). In a typical machine learning model, there can be multiple matrices of different sizes. These matrices form affine transformation matrices that operate on features in affine spaces. A component in a “vector” of the hypothesis space can in fact be a matrix in general, and thus it is not easy to comprehend intuitively. The easiest (and valid) way is to “flatten” all the matrices and then “concatenate” them together to form a tall vector, which is then treated as a usual vector. We do this kind of flattening and concatenation all the time in Python. Such a flattened tall vector ŵ in the hypothesis space WP can be written generally as
ŵ = [W0, W1, . . . , WP] ∈ WP    (1.8)
We will discuss the details of WP for various models, including estimation of the dimension P, in later chapters.
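As a minimal sketch (the layer sizes here are made up), the flattening and concatenation can be done with NumPy as follows:

import numpy as np

# Hypothetical weight matrices and bias vectors of a small two-layer model.
W1, b1 = np.ones((3, 2)), np.zeros(3)        # first layer: 3x2 weights, 3 biases
W2, b2 = np.ones((1, 3)), np.zeros(1)        # second layer: 1x3 weights, 1 bias

w_hat = np.concatenate([W1.ravel(), b1.ravel(),
                        W2.ravel(), b2.ravel()])   # one tall vector in WP
print(w_hat.shape)                           # (13,), so P = 13 for this toy model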
1.5.6 Definition of a typical machine learning model,
a mathematical view
Finally, we can mathematically define ML models for prediction as a mapping operator:
M(ŵ ∈ WP; X ∈ X̄p, Y ∈ Yk) : Xp → Yk    (1.9)
It reads that the ML model M uses a given dataset X with Y to train its learning parameters ŵ, and produces a map (or giant functions) that makes a prediction in the label space for any point in the feature space.
The ML model shown in Eq. (1.9) is in fact a data-parameter converter: it converts a given dataset to learning parameters during training, and then converts the parameters back when making a prediction for a given set of feature variables. It can also be mathematically viewed as a giant function with k components over the feature space Xp, controlled (parameterized) by the training parameters in WP. When the parameters are tuned, one gets a set of k giant functions over the feature space.
On the other hand, this set of k giant functions can also be viewed as continuous (differentiable) functions of these parameters for any given data-point in the dataset, which can be used to form a loss function that is also differentiable. Such a loss function can be the error between these k giant functions and the corresponding k labels given in the dataset. It can be viewed as a functional of the prediction functions, which in turn are functions of ŵ in the vector space WP. The training is to minimize such a loss function over all the data-points in the dataset, by updating the training parameters until they become minimizers. This overview picture will be made explicit in a formula in later chapters. The success factors for building a quality ML model include (1) the type of hypothesis, (2) the number of learning parameters in WP, (3) the quality of the dataset in Xp (its representativeness of the underlying problem to be modeled, including correctness, size, data-point distribution over the feature space, and noise level), and (4) the techniques used to find the minimizing learning parameters that best reproduce the labels in the dataset. We will discuss this in detail in later chapters for different machine learning models.
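As a minimal sketch of this picture (a plain linear hypothesis on made-up data, with a squared-error loss minimized by simple gradient descent; actual models and training algorithms are discussed in later chapters):

import numpy as np

np.random.seed(0)
X = np.random.normal(size=(100, 2))             # made-up dataset: m = 100, p = 2
y = X @ np.array([2.0, -1.0]) + 0.5             # made-up labels, k = 1

X_bar = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented feature matrix
w_hat = np.zeros(3)                             # learning parameters [b, w1, w2]

for _ in range(200):                            # minimize the mean squared error
    residual = X_bar @ w_hat - y                # prediction error for all data-points
    grad = 2.0 * X_bar.T @ residual / len(y)    # gradient of the loss w.r.t. w_hat
    w_hat -= 0.1 * grad                         # update in the hypothesis space WP

print(w_hat)                                    # approaches [0.5, 2.0, -1.0]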
Concepts of spaces are helpful in our later analysis of the predictive properties of machine learning models. Readers may find it difficult to comprehend these concepts at this stage; they are advised to form just a rough idea for now and to revisit this section when reading the relevant chapters. Readers may also jump to Section 13.1.5 and take a look at Eq. (13.13) there for a quick glance at how the spaces evolve in a deepnet.
Note also that there are ML models for discontinuous feature variables,
and the learning parameters may not need to be continuous. Such methods
are often developed based on proper intuitive rules and techniques, and
we will discuss some of those. The concepts on spaces may not be directly
applicable but can often help.
1.6 Requirements for Creating Machine Learning Models
To train a machine learning model, one would need the following:
1. A dataset, which may be obtained via observations, experiments, or physics-law-based models. The dataset is usually divided (in a random manner) into two mutually independent subsets, a training dataset and a testing dataset, typically at a ratio of 75:25 (a minimal sketch of such a split is given after this list). The independence of the testing dataset is critical, because ML models are determined largely by the training dataset, and hence their reliability depends on objective testing.
2. Labels with the dataset, if possible.
3. Prior information on the dataset if possible, such as the quality of the
data and key features of the data. This can be useful in choosing a proper
algorithm for the problem, and in application of regularization techniques
in the training.
4. Proper computer software modules and/or effective algorithms.
5. A computer, preferably connected to the internet.
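As a minimal sketch of item 1 (using the widely used scikit-learn utility train_test_split on made-up data; any equivalent random split would do):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)       # a made-up dataset with m = 20 data-points
y = np.arange(20)                      # and their labels

# Random 75:25 split into mutually independent training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)     # (15, 2) (5, 2)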
1.7 Types of Data
Data are the key to any data-based model. There are many types of data available for different types of problems that one may make use of, as follows:
• Images: photos from cameras (more often now cellphones), images obtained from open websites, computed tomography (CT), X-ray, ultrasound, magnetic resonance imaging (MRI), etc.
• Computer-generated data: data from proven physics-law-based mod-
els, other surrogate models, other reliable trained machine learning
models, etc.
• Text: unclassified text documents, books, emails, webpages, social media
records, etc.
• Audio and video: audio and video recordings.
Note that the quality and the sampling domain of the dataset play important roles in training reliable machine learning models. Use of a trained model beyond the data sampling domain requires special caution, because it can go wrong unexpectedly, and hence be very dangerous.
1.8 Relation Between Physics-Law-based and
Data-based Models
Machine learning models are in general slow learners but fast predictors, while physics-law-based models do not need to learn (they use existing laws) but are slow in prediction. This is because the strategies for physics-law-based models and those for data-based models are quite different. ML models
use datasets to train the parameters, but physics-law-based models use laws
to determine the parameters.
However, at the detailed computational methodology level, many tech-
niques used in both models are in fact the same or quite similar. For example,
when we express a variable as a function of other variables, both models
use basis functions (polynomial, radial basis functions (RBFs), or both). In constructing objective functions, the least-squares error formulation is
used in both. In addition, the regularization methods used are also quite
similar. Therefore, one should not study these models in total isolation. The
ideas and techniques may be deeply connected and mutually adaptable. This
realization can be useful in better understanding and further development
of more effective methods for both models, by exchanging the ideas and
techniques from one to another. In general, for physics-law-based computa-
tional methods, such as the general form of meshfree methods, we understand
reasonably well why and how a method works in theory [3]. Therefore, we are
quite confident about what we are going to obtain when a method is used
for a problem. For data-based methods, however, this is not always true.
Therefore, it is of importance to develop fundamental theories for data-based
methods. The author made some attempts [21] to reveal the relationship
between physics-law-based and data-based models, and to establish some
theoretical foundation for data-based models. In this book, we will try to
discuss the similarities and differences, when a computational method is
used in both models.
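As a minimal sketch of this shared machinery (a least-squares fit with the linear bases [1, x] on made-up measurements; the same error formulation appears in both families of models):

import numpy as np

x = np.linspace(0.0, 1.0, 20)                    # made-up sampling points
y = 3.0 * x + 1.0 + 0.05 * np.sin(20.0 * x)      # made-up "measurements" with a small perturbation

X_bar = np.column_stack([np.ones_like(x), x])    # moment matrix built from the bases [1, x]
w, *_ = np.linalg.lstsq(X_bar, y, rcond=None)    # least-squares solution for the parameters
print(w)                                         # close to [1.0, 3.0]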
1.9 This Book
This book offers an introduction to general topics on machine learning. Our
focus will be on the basic concepts, fundamental theories, and essential
computational techniques related to creation of various machine learn-
ing models. We decided not to provide a comprehensive document for
all the machine learning techniques, models, and algorithms. This is
because the topic of machine learning is very extensive and it is not possible
to be comprehensive in content. Also, it is really not possible for many read-
ers to learn all the content. In addition, there are in fact plenty of documents
and codes available publicly online. There is no lack of material, and there is
no need to simply reproduce these materials. In the opinion of the author, the
best learning approach is to learn the most essential basics and build a strong
foundation, which is sufficient to learn other related topics, methods, and
algorithms. Most importantly, readers with strong fundamentals can even
develop innovative and more effective machine learning models for their problems. Based on this philosophy, the highlights of the book that cannot be found easily, or in complete form, in the open literature are listed as follows; many of them are the outcomes of the author's studies in the past years:
1. Detailed discussion on and demonstration of predictability for arbitrary
linear functions of the basic hypothesis used in major ML models.
2. Affine transformation properties and their demonstrations, affine space,
affine transformation unit, array, chained arrays, roles of the weights and
biases, and roles of activation functions for deepnet construction.
3. Examination of predictability of high-order functions and a Universal
Prediction Theory for deepnets.
4. A concept of data-parameter converter, parameter encoding, and unique-
ness of the encoding.
5. Role of affine transformation in SVM, complete description of SVM
formulation, and the kernel trick.
6. Detailed discussion on and demonstration of activation functions,
Neural-Pulse-Unit (NPU), leading to the Universal Approximation
Theorem for wide-nets.
7. Differentiation of a function with respect to a vector and matrix, leading
to automatic differentiation and Autograd.
8. Solution Existence Theory, effects of parallel data-points, and pre-
dictability of the solution against the label.
9. Neurons-Samples Theory gives, for the first time, a general rule of thumb
on the relationship between the number of data-points and the number of neurons in a neural network (or the total pseudo-dimensions of the affine
spaces involved).
10. Detailed discussion on and demonstration of Tikhonov regularization
effects.
The author has made substantial effort to write Python codes to demonstrate the essential and difficult concepts and formulations, which allows readers to comprehend each chapter more readily. Based on the learning experience of the author, this can make the learning more effective.
The chapters of this book are written, in principle, to be readable independently, which requires allowing some duplication. The necessary cross-references provided between chapters are kept to a minimum.
1.10 Who May Read This Book
The book is written for beginners interested in learning the basics of machine learning, including university students who have completed their first
year, graduate students, researchers, and professionals in engineering and
sciences. Engineers and practitioners who want to learn to build machine
learning models may also find the book useful. Basic knowledge of college
mathematics is helpful in reading this book smoothly.
This book may be used as a textbook for undergraduates (3rd year or
senior) and graduate students. If this book is adopted as a textbook, the
instructor may contact the author ([email protected]) directly for some
homework and course projects and solutions.
Machine learning is still a fast-developing area of research. There still exist
many challenging problems, which offer ample opportunities for research to
develop new methods and algorithms. Currently, it is a hot topic of research
and applications. Different techniques are being developed every day, and
new businesses are formed constantly. It is the hope of the author that this
book can be helpful in studying existing and developing machine learning
models.
1.11 Codes Used in This Book
The book has been written using Jupyter Notebook with codes.
Readers who purchased the book may contact the author directly
(mailto:[email protected]) to request a softcopy of the book with codes
(which may be updated), free for academic use after registration. The
conditions for use of the book and codes developed by the author, in both
hardcopy and softcopy, are as follows:
1. Users are entirely at their own risk when using any part of the codes and techniques.
2. The book and codes are only for your own use. You are not allowed to further distribute them without permission from the author of the code.
3. There will be no user support.
4. Proper reference and acknowledgment must be given for the use of the
book, codes, ideas, and techniques.
Note that the handcrafted codes provided in the book are mainly for
studying and better understanding the theory and formulation of ML
methods. For production runs, well-established and well-tested packages
should be used, and there are plenty out there, including but not limited
to scikit-learn, PyTorch, TensorFlow, and Keras. Also, the codes provided
are often run with various packages/modules. Therefore, care is needed when
using these codes, because the behavior of the codes often depends on the
versions of Python and all these packages/modules. When the codes do not
run as expected, version mismatch could be one of the problems. When this
book was written, the versions of Python and some of the packages/modules
were as follows:
• Python 3.6.13 :: Anaconda, Inc.
• Jupyter Notebook (web-based) 6.3.0
• TensorFlow 2.4.1
• keras 2.4.3
• gym 0.18.0
When issues are encountered in running a code, readers may need to
check the versions of the packages/modules used. If Anaconda Navigator
is used, the versions of all the packages/modules installed with the Python environment are listed when that environment is highlighted. You can also check the version of a package in a code cell of the Jupyter Notebook. For example, to check the version of the current Python environment, one may use
!python -V # ! is used to execute an external command
Python 3.6.13 :: Anaconda, Inc.
To check the version of a package/module, one may use
• import package name
• print(‘package name version’,package name)
For example,
import keras
print('keras version',keras.__version__)
import tensorflow as tf
print('tensorflow version',tf.version.VERSION)
keras version 2.4.3
tensorflow version 2.4.1
If the version is indeed an issue, one would need to either modify the code to fit the version or install the correct version in your system, perhaps by creating an alternative environment. It is very useful to search the web using the error message; solutions or leads can often be found. This is the approach the author often takes when encountering an issue in running a code. Finally, this book has used materials and information available on the web, with links. These links may change over time, because of the nature of the web. The most effective way (and the one often used by the author) of dealing with this matter is to search online using keywords, if the link is lost.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course,
Butterworth-Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor
and Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space
Theory and Weakened Weak Forms, World Scientific, New Jersey, 2013.
[5] G.R. Liu and X. Han, Computational Inverse Techniques in Nondestructive Evalua-
tion, Taylor and Francis Group, New York, 2003.
[6] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms, New York, 1962. https://books.google.com/books?id=7FhRAA
AAMAAJ.
[7] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations
by Error Propagation, 1986.
[8] G.R. Liu, FEA-AI and AI-AI: Two-way deepnets for real-time computations for both
forward and inverse mechanics problems, International Journal of Computational
Methods, 16(08), 1950045, 2019.
[9] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., TubeNet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[10] Fukushima Kunihiko, Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position, Biological Cyber-
netics, 36(4), 193–202, Apr 1980. https://doi.org/10.1007%2Fbf00344251.
[11] D. Ciregan, U. Meier and J. Schmidhuber, Multi-column deep neural networks
for image classification, 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[12] M.V. Valueva, N.N. Nagornov, P.A. Lyakhov et al., Application of the residue number
system to reduce hardware costs of the convolutional neural network implementation,
Mathematics and Computers in Simulation, 177, 232–243, 2020.
[13] Duan Shuyong, Ma Honglei, G.R. Liu et al., Development of an automatic lawnmower
with real-time computer vision for obstacle avoidance, International Journal of
Computational Methods, Accepted, 2021.
[14] Duan Shuyong, Lu Ningning, Lyu Zhongwei et al., An anchor box setting technique
based on differences between categories for object detection, International Journal of
Intelligent Robotics and Applications, 6, 38–51, 2021.
[15] M. Warren and P. Walter, A logical calculus of ideas immanent in nervous activity,
Bulletin of Mathematical Biophysics, 5, 127–147, 1943.
[16] J. Schmidhuber, Habilitation Thesis: An Ancient Experiment with Credit Assignment
Across 1200 Time Steps or Virtual Layers and Unsupervised Pre-training for a
Stack of Recurrent NNs, 1993, TUM. https://people.idsia.ch//∼juergen/habilitation/
node114.html.
[17] Yu Yong, Si Xiaosheng, Hu Changhua et al., A review of recurrent neural networks:
LSTM cells and network architectures, Neural Computation, 31(7), 1235–1270,
2019. https://direct.mit.edu/neco/article/31/7/1235/8500/A-Review-of-Recurrent-
Neural-Networks-LSTM-Cells.
[18] L. Shi, F. Wang, S. Duan et al., Two-way TubeNets uncertain inverse methods for
improving positioning accuracy of robots based on interval, The 11th International
Conference on Computational Methods (ICCM2020), 2020.
[19] Duan Shuyong, Shi Lutong, G.R. Liu et al., An uncertainty inversion technique using
two-way neural network for parameter identification of robot arms, Inverse Problems
in Science & Engineering, 29, 3279–3304, 2021.
[20] Duan Shuyong, Wang Li, G.R. Liu et al., A technique for inversely identifying joint-
stiffnesses of robot arms via two-way TubeNets, Inverse Problems in Science &
Engineering, 13, 3041–3061, 2021.
[21] G.R. Liu, A neural element method, International Journal of Computational Methods,
17(07), 2050021, 2020.
Chapter 2
Basics of Python
This chapter discusses the basics of the Python language for coding machine learning models. Python is a very powerful high-level programming language without the need for compiling, but with some level of the efficiency of machine-level languages. It has become the most popular tool for the development of tools and applications in the general area of machine learning. It has rich libraries for open access, and new libraries are constantly being developed. The language itself is powerful in terms of functionality. It is an excellent tool for effective and productive coding and programming. It is also fast, and the structure of the language is well suited to making use of bulky data, which is often the case in machine learning.
This chapter is not formal training on Python; it is just to help readers have a smoother start in learning and practicing the materials in the later chapters. Our focus will be on some useful simple tricks that are familiar to the author, and on some behavioral subtleties that often affect our coding in ML.
ML. Readers familiar with Python may simply skip this chapter. We will
use the Jupyter Notebook as the platform for the discussions, so that the
documentation and demonstration can be all in a single file.
You may go online and have the Jupyter Notebook installed from, for example, https://www.anaconda.com/distribution/, where you can have Jupyter Notebook and Python installed at the same time, perhaps along with another useful Python IDE (Integrated Development Environment) called Spyder. On my laptop, I have all three pieces ready to use.
A Jupyter Notebook consists of “cells” of different types: cells for code and cells for text called “markdown” cells. Each cell is framed with colored borders, and the color shows up when the cell is clicked on. A green border indicates that the cell is in the input mode, and one can type and edit the contents. Pressing “Ctrl + Enter” within the cell, the green border changes
to blue color, indicating that this cell is formatted or executed, and may
produce an outcome. Double clicking on the blue framed cell sets it back to
the input mode. The right vertical border is made thicker for better viewing.
This should be sufficient for us to get going. One will get more skills (such
as adding cells, deleting cells, and converting cell types) by playing and
navigating among the menu bars on the top of the Notebook window.
Googling the open online sources is excellent for getting help when one
has a question. The author does this all the time. Sources of the reference
materials include the following:
• https://docs.python.org/3.7/
• https://docs.scipy.org/doc/numpy/reference/?v=20191112052936
• https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
• https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
• https://www.python.org/about/
• https://www.learnpython.org/
• https://en.wikipedia.org/wiki/Python (programming language)
• https://www.learnpython.org/en/Basic Operators+
• https://www.python.org/+
• https://jupyter.org/
• https://www.youtube.com/watch?v=HW29067qVWkb
• https://pynative.com/
The following lists the details of the versions of modules related to Jupyter Notebook in the current installation on the author's laptop (use “!” to execute an external command):
!jupyter --version
jupyter core : 4.7.1
jupyter-notebook : 6.3.0
qtconsole : not installed
ipython : 7.16.1
ipykernel : 5.3.4
jupyter client : 6.1.12
jupyter lab : not installed
nbconvert : 6.0.7
ipywidgets : 7.6.3
nbformat : 5.1.3
traitlets : 4.3.3
2.1 An Exercise
Let us have a less conventional introduction here. Different from other books
on computer languages, we start the discussion on how to make use of our
own codes that we may develop during the course of study.
First, we “import” the Python system library or module from external or
internal sources, so that functions (also called methods) there can be used
in our code for interaction with your computer system. The most important
environment setting is the path.
import sys # import an external module "sys" which
# provides tools for accessing the computer system.
sys.path.append('grbin')
# I made a code in folder grbin in the current
# working directory, and want to use it later.
#print(sys.path)   # check the current paths.
# To execute this or any cell, use Ctrl-Enter (hold Ctrl and press Enter).
Note that “#” in a code cell starts a comment line. It can be put anywhere
in a line in a code. Everything in the line behind # becomes comments, and
is not treated as a part of the code.
One may remove “#” in front of print(sys.path), execute it, and a number
of paths will be printed. Many of them were set during the installations
of the system and various codes, including the Anaconda and Python.
“grbin” in the current working directory has just been added using sys.path.append().
When multiple lines of comments are needed, we use “doc-strings” as
follows:
'''Inside here are all comments with multiple lines. It is \
a good way to convey information to users, co-programmers. \
Use a backslash to break a line.'''
'Inside here are all comments with multiple lines. It is a
good way to convey information to users, co-programmers.
Use a backslash to break a line.'
Just for demonstration purposes, we now import our own “module” (a
Python file named as grcodes.py) “grcodes”, and then give it an alias “gr” for
easy reference later, when accessing the attributes, functions, classes, etc.,
inside the module.
import grcodes as gr # a Python code grcodes.py in 'grbin'.
The following cell contains the Python code “grcodes.py”. Readers may
create the “grcodes.py” file and put it in the folder “grbin” (or any other
folder), so that the cell above can be executed and “gr.printx()” can be used.
from __future__ import print_function  # import external module
import sys

# Define a function
def printx(name):
    """ This prints out both the name and its value together.
        usage: name = 88.0; printx('name') """
    frame = sys._getframe(1)
    print(name, '=', repr(eval(name, frame.f_globals, frame.f_locals)))
Let us try to use a function in the imported module grcodes, using its
alias gr.
x = 1.0 # Assign x a value.
print(x) # The Python built-in print() function prints
# the value of the given argument x.
gr.printx('x') # a function from the gr module. It prints the
# argument name, and its value at the same time.
1.0
x = 1.0
help(gr.printx) # Find out the usage of the gr.printx function
Help on function printx in module grcodes:
printx(name)
This prints out both the name and its value together.
usage: name = 88.0; printx('name')
Nice. I have actually completed a simple task of printing out “x” using
Python, and in two ways. The gr.printx function is equivalent to doing the
following:
print('x=',x) # you must type the same x twice
x= 1.0
Notice in this case that you must type the same x twice, which gives room for error. A good code should have as little repetition as possible, allowing
easy maintenance. When a change is needed, the programmer (or others
using or maintaining the code) shall just need to do it once.
One can also import functions from a module in the following manner:
from grcodes import printx # you may import more functions
# by adding the function names separated with ",".
#from grcodes import * # Import everything from grcodes
# This is not a very good practice, because it can
# lead to problems when some function names in
# grcodes happened to be the same as those in the code.
In this case, we can now use the imported functions as if they were written in the current file (notebook).
gr.printx('x')
printx('x') # Notice that "gr." is no longer needed.
x = 1.0
x = 1.0
2.2 Briefing on Python
Now, what is Python? Python was created by Guido van Rossum and first
released in 1991. Python’s design philosophy emphasizes code “readability”.
It uses an object-oriented approach aiming to help programmers to write
clear, less repetitive, logical codes for small- and large-scale projects that
may have teams of people working together.
Python is open source, and its interpreters are available for many
operating systems. A global community of programmers develops and
maintains CPython, an open-source reference implementation. A non-
profit organization, the Python Software Foundation, manages and directs
resources for Python and CPython development.
The language’s core philosophy is summarized in the document The Zen
of Python (PEP 20), which includes aphorisms such as the following:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.
Guido van Rossum manages Python development projects together with
a steering council. There are two major Python versions, Python 2 and
Python 3, and they are quite different. Python 3.0, released 2008, was a
major revision. It is not completely backward-compatible with Python 2.
Due to the number of codes written for Python 2, support for Python 2.7
(the last release in the 2.x series) was extended to 2020. At the time of
writing this book, Python 3.9 had already been released. This tutorial uses
Python 3.6 because it supports more existing libraries and modules.
There are a huge number (probably in the order of hundreds) of computer
programming languages developed so far. The author’s first experience with
computer programming languages was in the 1970s, when learning BASIC
for programming. He used ALGOL60 later and then FORTRAN for a long
time from the 1970s till today, along with limited use of Matlab, C, C++,
and now Python. Any programming language has a complicated syntax and
deeply organized logic. For a user like the author, the best approach to learn a
computer programming language is via examples and practice, while paying
attention to the syntax, properties, and behavior. For a beginner, following
examples is probably the best approach to get started. This will be the
guidance in writing this section of the book. For rigorous syntax, readers may
read the relevant documentations that are readily available online. We will
give a lot of examples, with explanations in the form of comments (as a
programmer often does). All these examples may be directly executed within
this book while reading, so that readers can have a real-time feeling for
easy observation of the behavior and hence comprehension. Readers may
also make use of it via a simple copy and paste to form his/her notebook.
Because of this example-based approach, the discussions on different topics
may jump a little here and there.
To write and execute a Python code, one often uses an IDE. Jupyter
Notebook, PyCharm, and Spyder are among the popular IDEs. In this
book, we use Jupyter Notebook (https://jupyter.org/) via the distributor
Anaconda (https://www.anaconda.com/). Jupyter Notebook can be used
not only as an IDE but also as a nice document generator allowing
description text (markdown cells) and code cells to be edited together in
one document. The lecture notes used in the author’s machine learning
course have also been mostly developed using Jupyter Notebook. The
documents created using Jupyter Notebook can be exported in various types
of documents, including ascii doc, html, latex, markdown, pdf, Python(.py),
and slides(.slides.html). Readers and students may use this notebook as a
template for your documents (project reports, homeworks, etc.), if so desired.
If one needs to use a spelling check when typing in the markdown cells in
a Jupyter Notebook, the following commands should be executed in the
Anaconda Prompt:
• %pip install jupyter_contrib_nbextensions
• %jupyter contrib nbextension install --user
• %jupyter nbextension enable spellchecker/main
This would mark the misspelled words for you (but will not provide
suggestions). Other necessary modules with add-on functions may also be
installed in a similar manner.
This book covers in a brief manner a tiny portion of Python.
2.3 Variable Types
Python is said to be object oriented. Every variable in Python is an object. It is “dynamically typed”: the type of a variable is determined at the point where a value is assigned to it. You do not need to declare variables before using them, or declare their types. It has some basic types of variables: Numbers and Strings. These
variables can stand alone, or form a Tuple, List, Dictionary, Set, Numpy
Arrays, etc. They all can be subjected to various operations (arithmetic,
boolean, logical, formatting, etc.) in a code. Note that the variable is loosely
defined, meaning it could be a Tuple, List, Dictionary, etc. For example, a
List can be in a List, a Tuple in a List, or a List in a Tuple.
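For instance, the nesting just described can be seen in a small made-up example:

nested = [1, [2, 3], (4, 5), "six"]    # a list holding an integer, a list, a tuple, and a string
print(nested[1], nested[1][0])         # [2, 3] 2
print(type(nested[2]))                 # <class 'tuple'>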
2.3.1 Numbers
Python supports three types of numbers — integers, floating point
numbers, and complex numbers.
To define an integer variable, one can simply assign it with an integer.
my_int = 48 # by this assignment, my_int becomes an integer
print(my_int); printx('my_int')
48
my_int = 48
type(my_int) # Check the type of the variable. print() is not
# needed, if it is the last line in the cell
int
my_int = 5.0 # by this my_int becomes now a float
type(my_int)
float
my_complex=8.0+5.0j # by this my_complex is a complex number
print(my_complex)
printx('my_complex')
(8+5j)
my_complex = (8+5j)
my_int, my_float, my_string = 20, 10.0, "Hello!"
if my_string == "Hello!":   # comparison operators: ==, !=, <, <=, >, >=
    print("A string: %s" % my_string)   # Indented 4 spaces
if isinstance(my_float, float) and my_float == 10.0:
    # isinstance(): returns True if an object is an instance of a class
    print("This is a float, and it is %f" % my_float)
if isinstance(my_int, int) and my_int == 20:
    print("This is an integer, and it is: %d" % my_int)
A string: Hello!
This is a float, and it is 10.000000
This is an integer, and it is: 20
To list all variables, functions, modules currently in the memory, try this:
#%whos # you may remove "#" and try this
The type of a variable can be converted:
my_float = 5.0 # by this assignment, my_float becomes a float
print(my_float)
my_float = float(6)
# create a float, by converting an integer, using float()
print(my_float)
print(int(7.0)) # float is converted to integer.
printx('int(7.0)')
5.0
6.0
7
int(7.0) = 7
To check the memory address of a variable, use
a = 1.0
print('a=',a, 'at memory address:',id(a))
a= 1.0 at memory address: 1847993237600
b = a
print('b=',b, 'at memory address: ',id(b))
b= 1.0 at memory address: 1847993237600
Notice that ‘b’ has the same address of ‘a’.
a, b = 2.0, 3.0
print('a=',a, 'at memory address: ',id(a))
print('b=',b, 'at memory address: ',id(b))
a= 2.0 at memory address: 1847974064064
b= 3.0 at memory address: 1847974063944
Notice the change in address when the value of a variable changes.
2.3.2 Underscore placeholder
n1=100000000000
n_1=100_000_000_000 # for easy reading
print('Yes, n1 is same as n_1') if n1==n_1 else print('No')
# Ternary if Statement
n2=1_000_000_000
print('Total=',n1+n2)
print('Total=',f'{n1+n2:,}') # f-string (Python3.6 or later)
total=n1+n2
print('Total=',f'{total:_}')
Yes, n1 is same as n_1
Total= 101000000000
Total= 101,000,000,000
Total= 101_000_000_000
2.3.3 Strings
Strings are bits of text, which are very useful in coding in generating labels
and file names for outputs. Strings can be defined with anything between
quotes. The quote can be either a pair of single quotes or a pair of double
quotes.
my_string = "How are you?"
# a string is defined; the characters in it can be indexed
print(my_string, my_string[0:3],my_string[5],my_string[10:])
my_string = 'Hello,' + " hello!" + " I am here."
# note "+" operator for strings is concatenation
print(my_string)
How are you? How r u?
Hello, hello! I am here.
Although both single and double quotes can be used, when there are apostrophes in a string one should use double quotes; otherwise, the apostrophes would terminate the string if single quotes are used, and vice versa. For example,
my_string = "Do not worry, just use double quotes to 'escape'."
print(my_string)
Do not worry, just use double quotes to 'escape'.
One may exchange the role of these two types of quotes:
my_string = 'Do not worry about "double quotes".'
print(my_string)
Do not worry about "double quotes".
One should refer to the Python documentation when needing to include things such as carriage returns, backslashes, and Unicode characters. Below are some more handy and clean operations applied to numbers and strings. You may try them out and get some experience.
one, two, three = 1, 2, 3 # Assign values to variables.
summation = one + two + three
print('summation=',summation) # printx('summation')
summation= 6
one, two, three = 1, 2, 3.0 # variable type can be mixed!
summation = one + two + three
print('Summation=',summation)
Summation= 6.0
one3 = two3 = three3 = 3 # Assign a same value to variables
print(one3, two3, three3)
3 3 3
More handy operations:
hello, world = "Hello,", "world!"
helloworld = hello + " " + world + "!!" # concatenate strings
print(helloworld, ' ', hello + " " + world)
lhw=len(helloworld) # length of the string, counting the space
# and the punctuations.
print('The length of the "helloword" is',lhw)
Hello, world!!! Hello, world!
The length of the "helloword" is 15
You can split the string to a list of strings, each of which is a word.
the_words = helloworld.split(" ") # creates a list of strings
# Similar operations on Lists later
print("Split the words of the string: %s" % the_words)
print('Joined together again with a space as separator:',
      ' '.join(the_words))
Split the words of the string: ['Hello,', 'world!!!']
Joined together again with a space as separator: Hello, world!!!
To find a letter (character) in a string, try this:
my_string = "Hello world!"
print('"o" is right after the',my_string.index("o"),\
'th letter.') # "\" is used to break a line
print('The first letter "l" is right after the', \
my_string.index("l"), 'nd letter.')
"o" is right after the 4 th letter.
The first letter "l" is right after the 2 nd letter.
Do not like the white-spaces between “4” and “th”, and between “2” and “nd”? Use string concatenation:
print('The position of the letter "o" is right after the ' +
str(my_string.index("o")) + 'th letter.')
# "+" concatenate
print('The 1st letter "l" is right after the ' +
str(my_string.index("l")) + 'nd letter.')
The position of the letter "o" is right after the 4th letter.
The 1st letter "l" is right after the 2nd letter.
You may need to find the frequency of each element in a list.
from collections import Counter # import Counter module.
my_list = ['a','a','b','b','b','c','d','d','d','d','d']
count = Counter(my_list) # Counter object is a dictionary
print(count) # of frequencies of each element in the list
# See also Dictionary later
Counter({'d': 5, 'b': 3, 'a': 2, 'c': 1})
print('The frequency of "b" is', count['b'])
# frequency of an element indexed by its key
The frequency of "b" is 3
Note Python (and many other programming languages) starts counting at 0
instead of 1.
We list below more operations that can be useful.
Conversion between uppercase and lowercase of a string
my_string = "Hello world!"
print(my_string.upper(),my_string.lower(),my_string.title())
# convert to uppercase and lowercase, respectively.
HELLO WORLD! hello world! Hello World!
• Reversion of a string using slicing (also see section on Lists).
my_string = "ABCDEFG"
reversed_string = my_string[::-1]
print(reversed_string)
GFEDCBA
The title() function of string class
my_string = "my name is professor g r liu"
new_string = my_string.title()
print(new_string)
My Name Is Professor G R Liu
Use of repetitions
n = 8
my_list = [0]*n
print(my_list)
[0, 0, 0, 0, 0, 0, 0, 0]
my_string = "abcdefg "
print(my_string*2) #concatenated n times and then print out
abcdefg abcdefg
lotsofhellos = "Hello " * 5 #concatenate 5 times
print(lotsofhellos)
Hello Hello Hello Hello Hello
my_list = [1,2,3,4,5]
print(my_list*2) #concatenate 2 times and then print out
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Length of a given argument using len()
print('The length of "ABCD" is', len('ABCD'))
The length of "ABCD" is 4
print('The length of',my_string,'is', len(my_string))
The length of abcdefg is 8
even_numbers, odd_numbers= [2,4,6,8], [1,3,5,7]
length = len(even_numbers) #get the length using len()
all_numbers = odd_numbers + even_numbers #concatenation
print(all_numbers,' The original length is',
length, '. The new length is',len(all_numbers))
[1, 3, 5, 7, 2, 4, 6, 8] The original length is 4 . The new
length is 8
Sort the elements using sorted()
print(sorted('BACD'),sorted('ABCD',reverse=True))
print(sorted(all_numbers), sorted(all_numbers,reverse=True))
['A', 'B', 'C', 'D'] ['D', 'C', 'B', 'A']
[1, 2, 3, 4, 5, 6, 7, 8] [8, 7, 6, 5, 4, 3, 2, 1]
Multiplying each element in a list by a same number
original_list, n = [1,2,3,4], 2
new_list = [n*x for x in original_list]
# list comprehension for element-wise operations
print(new_list)
[2, 4, 6, 8]
Generating index for a list using enumerate()
my_list = ['a', 'b', 'c']
for index, value in enumerate(my_list):
    print('{0}: {1}'.format(index+1, value))
1: a
2: b
3: c
for index, value in enumerate(my_list):   # generate indices
    print(f'{index+1}: {value}')           # f-string
1: a
2: b
3: c
Error exception tracks code while avoiding stop execution
a, b = 1, 2
try:
    print(a/b)                 # exception raised when b=0
except ZeroDivisionError:
    print("division by zero")
else:
    print("no exceptions raised")
finally:
    print("Regardless of what happened, run this always")
0.5
no exceptions raised
Regardless of what happened, run this always
Get the memory size in bytes
import sys #import sys module
num = "AAA"
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of string
The memory size is 52 bytes
num = 21099
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of integer
The memory size is 28 bytes
num = 21099.0
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of float
The memory size is 24 bytes
Check whether the string starts with or ends with something
astring = "Hello world!"
print(astring.startswith("Hello"),astring.endswith("asdf"),\
astring.endswith("!"))
True False True
The first one printed True, as the string starts with “Hello”. The second one
printed False, as the string certainly does not end with “asdf”. The third
printed True, as the string ends with “!”. Their boolean values are useful
when creating conditions. More such functions:
my_string="Hello World!"
my_string1="HelloWorld"
my_string2="HELLO WORLD!"
print (my_string.isalnum()) #check if all char are numbers
print (my_string1.isalpha()) #check if all char are alphabetic
print (my_string2.isupper()) #test if string is upper case
False
True
True
my_string3, my_string4, my_string5="hello world!"," ", "8888a"
print (my_string3.istitle()) #test if contains title words
print (my_string3.islower()) #test if string is lower case
print (my_string4.isspace()) #test if string is spaces
print (my_string5.isdigit()) #test if string is digits
False
True
True
False
Checking the type and other attributes of variables
n, x, s = 8888,8.0, 'string'
print (type(n), type(x), type(s)) # check the type of an object
print (len(s),len(str(n)),len(str(x)))
<class 'int'> <class 'float'> <class 'str'>
6 4 3
2.3.4 Conversion between types of variables
When one of the variables in an operation with integers is a floating point number, the result becomes a floating point number.
a = 2
print('a=',a, ' type of a:',type(a))
b = 3.0; a = a + b
print('a=',a, ' type of a:',type(a))
print('b=',b, ' type of b:',type(b))
a= 2 type of a: <class 'int'>
a= 5.0 type of a: <class 'float'>
b= 3.0 type of b: <class 'float'>
The type of a variable can be converted to other types.
n, x, s = 8888,8.5, 'string'
sfn = str(n) #integer to string
print(sfn,type(sfn))
sfx = str(x) #float to string
print(sfx,type(sfx))
8888 <class 'str'>
8.5 <class 'str'>
xfn = float(n) #integer to float
print(xfn,type(xfn))
nfx = int(x) #float to integer
print(nfx,type(nfx))
8888.0 <class 'float'>
8 <class 'int'>
#a = int('Hello') # string to integer: produces ValueError
#a = int('8.5') # string to integer: produces ValueError
a = int('85') # works,'85' is converted to an integer
print(a,type(a))
85 <class 'int'>
a = float('85') # how about this one?
print(a,type(a))
85.0 <class 'float'>
8.0 + float("8.0") #try this
16.0
a = int(False) # check this out
print(a,type(a))
0 <class 'int'>
However, operators with mixed numbers and strings are not permitted, and
it triggers a TypeError:
my_mix = my_float + my_string
----------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-67-8c4f7138852e> in <module>
----> 1 my_mix = my_float + my_string
TypeError: unsupported operand type(s) for +: 'float' and 'str'
2.3.5 Variable formatting
Formatting is very useful in printing out variables. Python uses C-style string
formatting to create new, formatted strings. The “%” operator is used to
format a set of variables enclosed in a “tuple”, which is a fixed size list (to
be discussed later). It produces a normal text at a location given by one
of the “argument specifiers” like “%s”, “%d”, “%f”, etc. The best way to
understand this is through examples.
name = "John"
print("Hello, %s! how are you?" % name) # %s is for string
# for two or more argument specifiers, use a tuple:
name, age = "Kevin", 23
print("%s is %d years old." % (name,age)) # %d is for digit
Hello, John! how are you?
Kevin is 23 years old.
print(f"{name} is {age} years old.")
# f-string Python 3.6 or later.
Kevin is 23 years old.
Any object that is not a string (a list for example) can also be formatted
using the %s operator. The %s operator formats the object as a string using
the “str” method and returns it. For example:
list1,list2,x=[1,2,3],['John','Ian'],21.5 # multi- assignment
print("List1:%s; List2:%s\n x=%s,x=%f,x=%.3f,x=%e" \
% (list1, list2, x, x, x, x))
List1:[1, 2, 3]; List2:['John', 'Ian']
x=21.5,x=21.500000,x=21.500,x=2.150000e+01
print(f"List1:{list1};List2:{list2};x={x},x={x:.2f}, x={x:.3e}")
# powerful f-string
List1:[1, 2, 3];List2:['John', 'Ian'];x=21.5,x=21.50, x=2.150e+01
Often used formatting argument specifiers (if not using f-string):
• %s - String (or any object with a string representation, like numbers).
• %d - Integers.
• %f - Floating point numbers.
• %.f - Floating point numbers with a fixed number-of-digits to the right of
the dot.
• %e - scientific notation: a float multiplied by the specified power of 10.
• %x/%X - Integers in hex representation (lowercase/uppercase).
2.4 Arithmetic Operators
Addition, subtraction, multiplication, and division operators can be used
with numbers.
2.4.1 Addition, subtraction, multiplication, division, and power
+   −   ∗   /   // (floor division)   % (remainder of integer division, or modulo)   ∗∗ (power)
number = 1 + 2 * 3 / 4.0
print(number)
2.5
The modulo (%) operator returns the integer remainder of the division:
dividend % divisor = remainder.
numerator, denominator = 11, 2
floor = numerator // denominator #floor division
print(str(numerator)+'//'+str(denominator)+ '=', floor)
remainder = numerator % denominator
print(str(numerator) +'%'+ str(denominator) +'=', remainder)
print(floor*denominator + remainder)
11//2= 5
11%2= 1
11
Using two multiplication symbols makes a power relationship.
squared, cubed = 7 ** 2, 2 ** 3
print('7 ** 2 =', squared, ', and 2 ** 3 =',cubed)
7 ** 2 = 49 , and 2 ** 3 = 8
bwlg_XOR = 7^2
print(bwlg_XOR) # ^ is the XOR (bitwise logic gate) operator, not power!
5
Python allows simple swap operation between two variables.
a, b = 100, 200
print('a=',a,'b=',b)
a, b = b, a # swapping without using a "mid-man"
print('a=',a,'b=',b)
a= 100 b= 200
a= 200 b= 100
2.4.2 Built-in functions
Python provides a number of built-in functions and types that are always available. For a quick glance, see the short example below, or find more details at https://docs.python.org/3/library/functions.html.
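For example (a few of these built-in functions in action, on made-up values):

numbers = [3, 1, 4, 1, 5, 9]
print(len(numbers), max(numbers), min(numbers), sum(numbers))   # 6 9 1 23
print(abs(-2.5), round(3.14159, 2), sorted(numbers))            # 2.5 3.14 [1, 1, 3, 4, 5, 9]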
#help(all) # to find out what a builtin function does
2.5 Boolean Values and Operators
Boolean values are two constant objects: True and False. When used as an argument to an arithmetic operator, they behave like the integers 1 and 0, respectively. The built-in function bool() can be used to cast any value to a Boolean. The definitions are as follows:
print(bool(5),bool(-5),bool(0.2),bool(-0.1),bool(str('a')),
bool(str('0')))
# True True True True True True
print(bool(0),bool(-0.0)) # These are all zero
# False False
print(bool(''),bool([]),bool({}),bool(())) # all empty (0)
# False False False False
True True True True True True
False False
False False False False
bool() returns False only if the value is zero or the container is empty; otherwise, it returns True. Note that str('0') is neither zero nor empty.
Boolean operators include “and” and “or”.
print(True and True, False or True, True or True,)
# True True True
print(False and False, False and True)
# False False
True True True
False False
2.6 Lists: A diversified variable type container
We have already seen lists a few times. This section gives more details. A list is a collection of variables, and it is very similar to an array (see the Numpy Array
section for more details). A list may contain any type of variables, and as
many variables as one likes. These variables are held in a pair of square
brackets [ ]. Lists can be iterated over for operations when needed. It is
one of the “iterables”. Let us look at the following examples.
2.6.1 List creation, appending, concatenation, and updating
x_list = [] # Use [] to define a placeholder for x_list.
# It is empty but with an address assigned.
print('x_list=',x_list)
print(hex(id(x_list))) # memory address in hexadecimal
x_list= []
0x1ae44e9f548
x_list.append(1) # 1 is appended as the 0th member in this list
x_list.append(2) # 2 is appended as the 1st member
x_list.append(3.) # Variable type changed!
print(x_list[0]) # prints 1, the 0th element ...
print(x_list[1]) # prints 2
print(x_list[2]) # prints 3
print(x_list) # print all in the list
1
2
3.0
[1, 2, 3.0]
for x in x_list:     # prints out 1,2,3.0 in an iteration
    print(x, end=',')
print('\n')
x_list2 = x_list*2
# concatenation of 2 x_list (not element-wise multiplication!);
# this creates an independent new x_list2
print(x_list2)
1,2,3.0,
[1, 2, 3.0, 1, 2, 3.0]
print(id(x_list),id(x_list2)) # addresses are different
1847992120648 1847993139592
id(x_list[1]) # Again, print() function is not needed
# because this is the last line in the cell
1594536160
x_list3 = x_list # assignment is a "pointer" to x_list3
print(x_list,' ',x_list3,)
[1, 2, 3.0] [1, 2, 3.0]
print(id(x_list),id(x_list3)) # They share the same address
1847992120648 1847992120648
x_list4 = x_list.copy() # copy() function creates x_list4
# it is a new independent list
print(x_list,' ',x_list4)
[1, 2, 3.0] [1, 2, 3.0]
print(id(x_list),id(x_list4)) # x_list4 has its own address
1847992120648 1847993186760
x_list[0] = 4.0 # Assign the 0th element a new value
print(x_list)
[4.0, 2, 3.0]
print(x_list3,' ',x_list4) # x_list3 is changed with x_list,
# because assignment creates a "pointer". x_list4 is not
# changed, because it was created using copy() function.
[4.0, 2, 3.0] [1, 2, 3.0]
print(x_list2) # Changes to x_list have no effect
[1, 2, 3.0, 1, 2, 3.0]
Creating a list by unpacking a string of digits:
num = 19345678
list_of_digits=list(map(int, str(num))) #list iterable
print(list_of_digits)
list_of_digits=[int(x) for x in str(num)] #list comprehension
print(list_of_digits)
[1, 9, 3, 4, 5, 6, 7, 8]
[1, 9, 3, 4, 5, 6, 7, 8]
2.6.2 Element-wise addition of lists
Element-wise addition of lists needs a little trick. The best ways, including
the use of numpy arrays, will be discussed in the list comprehension section.
Here, we use a primitive method to achieve this.
list1, list2= [20, 30, 40], [5, 6, 8]
print (list1, ' ', list2, ' ', list1+list2)
print ("Original list 1: " + str(list1))
print ("Original list 2: " + str(list2))
print('"+" is not addition, it is concatenation:', list1+list2)
[20, 30, 40] [5, 6, 8] [20, 30, 40, 5, 6, 8]
Original list 1: [20, 30, 40]
Original list 2: [5, 6, 8]
"+" is not addition, it is concatenation: [20, 30, 40, 5, 6, 8]
# We shall use a for-loop to achieve element-wise addition:
add_list = []
for i in range(0, len(list1)):   # for-loop to add up one-by-one!
    add_list.append(list1[i] + list2[i])
print("Element-wise addition of 2 lists: " + str(add_list))
Element-wise addition of 2 lists: [25, 36, 48]
id(add_list[0]) # check the address of the list
1594536896
id(list1[0])
1594536736
add_list = []
for i1, i2 in zip(list1, list2):   # for-loop and zip() to add it up
    add_list.append(i1 + i2)
print("The element-wise addition of 2 lists: ", add_list)
The element-wise addition of 2 lists: [25, 36, 48]
2.6.3 Slicing strings and lists
Slicing is a useful and efficient operation to manipulate parts of a string, list,
or array (to be discussed later). Our discussion starts from slicing strings,
and then lists.
my_string = "Hello world!"
#123456789TET # conventional order 1-12 (T = ten, E = eleven, T = twelve)
print('0123456789TE') # Python ordering 0-11 (T = ten, E = eleven)
print (my_string)
print('5th=',my_string[4]) # take the 5th character
print('7-11th=',my_string[6:11]) # 7th to 11th
0123456789TE
Hello world!
5th= o
7-11th= world
print('[6:-1]=',my_string[6:-1])
# "-1" for the last slice from the 6th to (last-1)th
print('[:]=',my_string[:]) # all characters in the string
print('[6:]=',my_string[6:]) # slice from 7th to the end
print('[:-1]=',my_string[:-1]) # to (last-1)th
[6:-1]= world
[:]= Hello world!
[6:]= world!
[:-1]= Hello world
my_string = "Hello world!"
#123456789TET # conventional order
print('[3:9:2]=',my_string[3:9:2]) # 4th to 9th step 2
# Syntax:[start:stop:step]
my_string = "Hello world!"
[3:9:2]= l o
Using a negative step, we can easily reverse a string, as we have seen earlier:
my_string = "Hello world!"
print('string:',my_string)
print('[::-1]=',my_string[::-1]) # all but from the last
string: Hello world!
[::-1]= !dlrow olleH
In summary, if we have just one number in the brackets, the slicing takes the
character at the (number+1)th position, because Python counts from zero. A colon
stands for all available: if used alone, the slice is the entire string; with a number
on its left, the slice runs from that position to the right end; with a number on its
right, it runs from the start up to (but excluding) that position. A negative number
counts from the right end: −3 means “the 3rd character from the right end”. One
can also use the step option for skipping.
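As a quick check of these rules, the following minimal sketch (not from the text) exercises negative indices and the step option on the same string:

my_string = "Hello world!"
print(my_string[-3]) # 'l', the 3rd character from the right end
print(my_string[-6:]) # 'world!', the last 6 characters
print(my_string[::2]) # 'Hlowrd', every 2nd character from the start
print(my_string[-1::-2]) # '!lo le', every 2nd character, starting from the last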
Note that when accessing a string with an index which does not exist, it
generates an exception of IndexError.
print('[14]=',my_string[14]) # index out of range error
-----------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-102-024f17c69a4f> in <module>
----> 1 print('[14]=',my_string[14]) # gives an index out of range error
IndexError: string index out of range
print('[14:]=',my_string[14:]) # This will not give an
# error, but gives nothing: nothing can be sliced
[14:]=
The very similar rules detailed above for strings apply also to a list, by
treating a variable in the list as a character.
# Create my_list that contains mixed type variables:
list2=[]
my_list=[0, 1, 2, 3,'4E', 5, 6, 7,[8,8], 9]
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 -> indices
# 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 -> Python indices
#-10 -9,-8,-7,-6, -5,-4,-3, -2, -1 -> Python reverse indices
print(my_list[0:10:1]) # [start:stop:step] the stop index is exclusive
print(my_list[:]) # A colon alone stands for all elements
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
print(my_list[0:]) # from (0+1)st to the end
print(my_list[1:]) # from (1+1)th to the end
print(my_list[8:]) # from the (8+1)th to the end
print(my_list[8:9]) # Gives a list in list
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[[8, 8], 9]
[[8, 8]]
print(my_list[:]) # A colon alone stands for all elements
print(my_list[:3]) # 0,1,2,
print(my_list[:1]) # from 1st to 1st: 0
print(my_list[:0]) # from 1st to 0th: empty []
print(my_list[-1]) # reads out the last: 9
print(my_list[-1:]) # Slices out the last:
print(my_list[::-1]) # reverse the list
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[0, 1, 2]
[0]
[]
9
[9]
[9, [8, 8], 7, 6, 5, '4E', 3, 2, 1, 0]
When accessing a list with an index that does not exist, it generates an
exception of IndexError.
#print(my_list[11]) # will give an index out of range error
print(my_list[10:]) # from the (10+1)th to the end: no more
# there, not out of range, but an empty list
[]
2.6.4 Underscore placeholders for lists
nlist = [10, 20, 30, 40, 50, 6.0, '7H'] # Mixed variables
_, _, n3,_,*nn = nlist # when only the 3rd is needed,
# skip one, and then the rest
print ('n3=',n3, 'the rest numbers',*nn)
nlist = [10, 20, 30, 40, 50, 60, 70]
_, _, n3, *nn, nlast = nlist # The 3rd, last and the rest
# in between are needed
print('n3=',n3,', the last=', nlast,', and all the rest numbers',*nn)
n3= 30 the rest numbers 50 6.0 7H
n3= 30 , the last= 70 , and all the rest numbers 40 50 60
2.6.5 Nested list (lists in lists in lists)
nested_list = [[11, 12], ['2B',22], [31, [32,3.2]]]
# A nested list of mixed types of variables
printx('nested_list')
print(len(nested_list)) #number of sub-lists in nested_list
nested_list = [[11, 12], ['2B', 22], [31, [32, 3.2]]]
3
print(nested_list[0]) #1st sub-list in the nested_list
print(nested_list[1]) #2nd sub-list in the nested_list
print(nested_list[2]) #3rd sub-list in the nested_list
print(nested_list) #print all for easy viewing
[11, 12]
['2B', 22]
[31, [32, 3.2]]
[[11, 12], ['2B', 22], [31, [32, 3.2]]]
print(nested_list[0][0]) #1st element in 1st sub-list
print(nested_list[0][1]) #2nd element in 1st sub-list
11
12
print(nested_list)
print(nested_list[1][0]) # Try this: what would this be?
print(nested_list[2][1]) #?
print(nested_list[2][1][0]) #?
[[11, 12], ['2B', 22], [31, [32, 3.2]]]
2B
[32, 3.2]
32
2.7 Tuples: Value preserved
After the discussion about the List, discussing Tuples becomes straightforward,
because they are essentially the same. The major differences are as follows:
• A Tuple is usually enclosed with (), but a List is with [].
• A Tuple is immutable, but a List is mutable. This means that tuples
cannot be changed after they are created. Values in Tuples are preserved.
Because a Tuple is immutable, it is used to store data that must be preserved,
such as constants that should not be accidentally changed; its use is therefore more
limited than that of a List. Operating on Tuples is also somewhat faster.
Apart from these differences, a Tuple behaves like a List. It can be accessed
via index, iterated over, and assigned to other variables. Below are some
examples.
ttuple = (10, 20, 30, 40, 50, 6.0, '7H') # create a Tuple
gr.printx('ttuple') # print(ttuple)
aa = ttuple[0]
print('aa=',aa)
print(ttuple[1], ' ',ttuple[6],' ',ttuple[-1])
ttuple = (10, 20, 30, 40, 50, 6.0, '7H')
aa= 10
20 7H 7H
for i, data in enumerate(ttuple):
# use enumerate function to get both index and content
if i < 3:
print(i, ':', data)
0 : 10
1 : 20
2 : 30
# ttuple[2] = 300 # this gives an error
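If one does attempt such an assignment, Python raises a TypeError. The following minimal sketch (not from the text) catches the error explicitly:

try:
    ttuple[2] = 300 # attempt to modify an immutable Tuple
except TypeError as err:
    print('TypeError:', err) # 'tuple' object does not support item assignment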
The above may be all we need to know about Tuples. We now discuss
another useful data structure in Python.
2.8 Dictionaries: Indexable via keys
A dictionary is a data type similar to a list. It contains paired keys and
values. The key is typically a string (any immutable object can serve as a key) and
is used for indexing. The value can be any type of object: a string, a number, a
list, etc. Because keys and values are paired, each value stored in a dictionary can
be accessed using the corresponding key. A dictionary does not contain any
duplicated keys, but the values may be duplicated.
2.8.1 Assigning data to a dictionary
For example, phone numbers can be assigned to a dictionary in the following
format:
phonebook1 = {} # placeholder for a dictionary
phonebook1["Kevin"] = 513476565
phonebook1["Richard"] = 513387234
phonebook1["Jim"] = 513682762
phonebook1["Mark"] = 513387234 # A duplicated value
gr.printx('phonebook1')
print(phonebook1)
phonebook1 = {'Kevin': 513476565, 'Richard': 513387234,
'Jim': 513682762, 'Mark': 513387234}
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513387234}
A dictionary can also be initialized in the following way:
phonebook2 = {'Joanne': 656477456, 'Yao': 656377243,
'Das': 656662798}
print(phonebook2)
{'Joanne': 656477456, 'Yao': 656377243, 'Das': 656662798}
phonebook0 = {
"John" : [788567837,788347278],
# this is a list, John has 2 phonenumbers
'Mark': 513683222,
'Joanne': 656477456
}
print(phonebook0)
{'John': [788567837, 788347278], 'Mark': 513683222, 'Joanne':
656477456}
2.8.2 Iterating over a dictionary
Like lists, dictionaries can be iterated over. Because keys and values are
recorded in pairs, we may use for-loop to access them.
for name, number in phonebook1.items():
print("Phone number of %s is %d" % (name, number))
Phone number of Kevin is 513476565
Phone number of Richard is 513387234
Phone number of Jim is 513682762
Phone number of Mark is 513387234
for key, value in phonebook1.items():
print(key, value)
Kevin 513476565
Richard 513387234
Jim 513682762
Mark 513387234
for key in phonebook1.keys():
print(key)
Kevin
Richard
Jim
Mark
for value in phonebook1.values():
print(value)
513476565
513387234
513682762
513387234
2.8.3 Removing a value
To delete a key-value pair, we use the del statement or the pop() method, together
with the key.
phonebook2 = {'Joanne': 656477456, 'Yao': 656377243,
'Das': 656662798}
del phonebook2["Yao"]
print(phonebook2)
value = phonebook2.pop("Das")
print(phonebook2)
print('value for popped out key',value)
{'Joanne': 656477456, 'Das': 656662798}
{'Joanne': 656477456}
value for popped out key 656662798
2.8.4 Merging two dictionaries
First, use the update() method.
phonebook1.update(phonebook2) # phonebook1 is updated
print(phonebook1)
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513387234, 'Joanne': 656477456}
Next, use a simpler means: double-star (**) unpacking. This allows one to create a
third, new dictionary that is a combination of the two dictionaries, without affecting
the original two.
phonebook3 = {**phonebook1, **phonebook0}
print(phonebook3)
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513683222,
'Joanne': 656477456, 'John': [788567837, 788347278]}
If the two dictionaries share a key, the new dictionary keeps a single entry, whose value comes from the dictionary unpacked later:
dict_1 = {'Apple': 7, 'Banana': 5}
dict_2 = {'Banana': 3, 'Orange': 4} #'Banana' is in dict_1
combined_dict = {**dict_1, **dict_2}
#'Banana' in dict_1 will be replaced in the new dictionary
print(combined_dict)
{'Apple': 7, 'Banana': 3, 'Orange': 4}
2.9 Numpy Arrays: Handy for scientific computation
Numpy arrays are similar to Lists, and much easier to work with for scientific
computations. Operations on Numpy arrays are usually much faster for bulky
data.
2.9.1 Lists vs. Numpy arrays
(1) Similarities:
• Both are mutable (the elements there can be added and removed after
the creation. A mutating operation is also called “destructive”, because it
modifies the list/array in place instead of returning a new one).
• Both can be indexed.
• Both can be sliced.
(2) Differences:
• To use arrays, one needs to import the Numpy module, but lists are built-in.
• Arrays support element-wise operations, but lists do not (some extra coding is
needed).
• Data types in an array must be the same, but a list can hold different types of
data (part of the reason why element-wise operations are not generally supported
for lists).
• Numpy arrays can be multi-dimensional.
• Operations with arrays are, in general, much faster than those on lists.
• Storing arrays uses less memory than storing lists.
• Numpy arrays are more convenient to use for mathematical operations
and machine learning algorithms.
2.9.2 Structure of a numpy array
We first briefly describe the structure of a numpy array in comparison with the lists
discussed earlier. To start the discussion, we import the numpy package.
import numpy as np # Import numpy & give it an alias np
#dir(np) #try this (remove #)
x1 = np.array([28, 3, 28, 0]) # a one-dimensional (1D) numpy array
print('x1=',x1) # A numpy array looks like a list.
gr.printx('x1') # This specifies that it is an array.
x1 = [28 3 28 0]
x1 = array([28, 3, 28, 0])
As shown above, a numpy array is “framed” in a pair of square brackets
(same as a list).
x2 = np.array([[51,22.0],[0,0],(18+9j,3.)]) # mixed types
print('x2=',x2) # All become complex-valued
x2= [[51.+0.j 22.+0.j]
[ 0.+0.j 0.+0.j]
[18.+9.j 3.+0.j]]
This is a 2D numpy array. It is framed in a double pair of square brackets.
A list does not have multi-dimensionality, except in the form of nesting: lists
in lists.
We can also create numpy arrays from lists. In the following, we first create
two lists, and then create Numpy arrays from them:
list_w = [57.5, 64.3, 71.6, 68.2] # list, peoples' weights (Kg)
list_h = [1.5, 1.6, 1.7, 1.65] # list heights (m)
print('list_w=',list_w, '; list_h= ',list_h)
list w= [57.5, 64.3, 71.6, 68.2] ; list h= [1.5, 1.6, 1.7, 1.65]
narray_w = np.array(list_w) # convert list to numpy array
narray_h = np.array(list_h)
print('narray_w=',narray_w, '; narray_h= ',narray_h)
narray w= [57.5 64.3 71.6 68.2] ; narray h= [1.5 1.6 1.7 1.65]
Let us create a function that prints out the information of a given numpy array.
def getArrayInfo(a):
'''Get the information about a given array:
getArrayInfo(array)'''
print('elements of the first axis of the array:',a[0])
print('type:',type(a))
print('number of dimensions, a.ndim:', a.ndim)
print('a.shape:', a.shape)
print('number of elements, a.size:', a.size)
print('a.dtype:', a.dtype)
print('memory address',a.data)
help(getArrayInfo) # may try this
Help on function getArrayInfo in module __main__:
getArrayInfo(a)
Get the information about a given array: getArrayInfo(array)
We see here that the docstring given in ''' ''' is useful for providing a simple instruction on the use of a created
function. Let us now use it to get the information for narray_w.
getArrayInfo(narray_w)
elements of the first axis of the array: 57.5
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 1
a.shape: (4,)
number of elements, a.size: 4
a.dtype: float64
memory address <memory at 0x000001AE7D549408>
We note that narray_w is a 1D array and has a shape of (4,), meaning it
has four entries. The shape of a numpy array is given as a tuple.
Slicing also works for a numpy array, in a similar way as for lists. Let us take
a slice of an array.
print(list_w[1:3]) # a slice between the 2nd and 3rd elements
[64.3, 71.6]
print(narray_w[1:3]) # a slice between the 2nd and 3rd elements
[64.3 71.6]
Let us now append an element to both the list and the numpy array.
# For lists, we use:
list_w.append(59.8)
print(list_w)
# For numpy array we shall use:
print(np.append(narray_w,59.8))
[57.5, 64.3, 71.6, 68.2, 59.8]
[57.5 64.3 71.6 68.2 59.8]
print(list_w,' ',narray_w)
print(type(list_w),' ',type(narray_w))
print(len(list_w),' ', narray_w.ndim) # Use len() to get the length
[57.5, 64.3, 71.6, 68.2, 59.8] [57.5 64.3 71.6 68.2]
<class 'list'> <class 'numpy.ndarray'>
5 1
nwh = (narray_w,narray_h) # This forms a tuple of np arrays
print(nwh)
(array([57.5, 64.3, 71.6, 68.2]), array([1.5 , 1.6 , 1.7 , 1.65]))
To form a multi-dimensional array, we may use the following (more on this later):
arr = np.array([narray_w,narray_h])
arr
array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
getArrayInfo(arr)
elements of the first axis of the array: [57.5 64.3 71.6 68.2]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (2, 4)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
We note that arr is of dimension 2, and has a shape of (2, 4), meaning it has
two entries along axis 0 and 4 entries along axis 1. We see again that the shape
of a numpy array is given in a tuple. A multi-dimensional numpy array can be
transposed:
arrT = arr.T
print(arrT)
getArrayInfo(arrT) # see the change in shape from (2,4) to (4,2)
[[57.5 1.5 ]
[64.3 1.6 ]
[71.6 1.7 ]
[68.2 1.65]]
elements of the first axis of the array: [57.5 1.5]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (4, 2)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
It is seen that the dimension remains 2, but the shape is changed from (2,4) to
(4,2). The value of an entry in a numpy array can be changed.
arr = np.array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrb = arr
printx('arrb')
arr[0,0]= 888.0 # change is done to arr only
printx('arrb')
arrb = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrb = array([[888. , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
Notice the behavior of an array created via assignment: changes to one array
affect the other. The same behavior was observed for lists. To create an independent
array, use the copy() function.
arr = np.array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrc = arr.copy() # This is expensive. Do it only when it is necessary
printx('arrc')
arr[0,0]= 77.0
printx('arr')
printx('arrc')
arrc = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arr = array([[77. , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrc = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
2.9.3 Axis of a numpy array
Axis is an important concept for numpy array operations. A 1D array has axis 0, a
2D array has two axes, 0 and 1, and so on. The definition is given in Fig. 2.1.
Multidimensional numpy array structure and axes:
We can now use an axis to stack up arrays to form new arrays, as follows:
arr=np.stack([narray_w,narray_h],axis=0)
#stack up 1D arrays along axis 0
print(arr)
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
We can use np.ravel to flatten an array.
print(arr)
rarr=np.ravel(arr)
print(rarr)
getArrayInfo(rarr)
Figure 2.1: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
[57.5 64.3 71.6 68.2 1.5 1.6 1.7 1.65]
elements of the first axis of the array: 57.5
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 1
a.shape: (8,)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE7D5494C8>
It is seen that the dimension is changed from 2 to 1, and the shape is changed
from (2,4) to (8,).
In machine learning computations, we often perform summation of entries of an
array along an axis of the array. This can be done easily using the np.sum function.
print(arr)
print('Column-sum:',np.sum(arr,axis=0),np.sum(arr,axis=0).shape)
print('row-sum:',np.sum(arr,axis=1),np.sum(arr,axis=1).shape)
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
Column-sum: [59. 65.9 73.3 69.85] (4,)
row-sum: [261.6 6.45] (2,)
Notice that the dimension of the summed array is reduced.
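If the collapsed axis needs to be kept (for example, to keep shapes compatible for later broadcasting), np.sum accepts a keepdims argument. A minimal sketch, not from the text:

print(np.sum(arr,axis=0,keepdims=True).shape) # (1, 4): the summed axis is kept
print(np.sum(arr,axis=1,keepdims=True).shape) # (2, 1)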
2.9.4 Element-wise computations
Element-wise computation using Numpy arrays is very handy, and quite different
from the situation for lists.
print('listwh=',list_w+list_h) # + is a concatenation for lists.
listwh= [57.5, 64.3, 71.6, 68.2, 59.8, 1.5, 1.6, 1.7, 1.65]
print('narraywh=',narray_w+narray_h)
# + is element-wise addition for numpy arrays.
narraywh= [59. 65.9 73.3 69.85]
Let us compute the weights in pounds, using 1kg = 2.20462 lbs.
print(narray_w * 2.20462) # element-wise multiplication
[126.76565 141.757066 157.850792 150.355084]
Let us compute the Body Mass Index or BMI using these narrays.
bmi = narray_w / narray_h ** 2 # formula to compute the BMI
print(bmi)
[25.55555556 25.1171875 24.77508651 25.05050505]
This includes element-wise power operation and division as well.
# lbmi = list_w / list_h ** 2 # would this work? Try it!
We discussed element-wise operations for lists earlier, using for-loops and special
functions such as zip() (list comprehensions are discussed in Section 2.11). The alternative is the
“numpy-way”: first convert the lists to numpy arrays, then
perform the operations in numpy on these arrays, and finally convert the
results back to a list. When the lists are large in size, this numpy-way can be much
faster, because all these operations can be performed in bulk in numpy, without
element-by-element memory access.
import numpy as np
list1 = [20, 30, 40, 50, 60]
list2 = [4, 5, 6, 2, 8]
(np.array(list1) + np.array(list2)).tolist()
[24, 35, 46, 52, 68]
The results are the same as those we obtained before using special list element-
wise operations.
2.9.5 Handy ways to generate multi-dimensional arrays
In machine learning and mathematical computations in general, multi-dimensional
arrays are frequently used, because one has to deal with big data frequently. Numpy
supports the necessary functions (tools) to generate, manipulate, and operate multi-
dimensional arrays.
np.arange(2, 8, 0.5, dtype=np.float64) # equally spaced values
array([2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. , 7.5])
np.linspace(1., 4., 6)
# an array with a specified number of equally spaced elements
array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])
a = np.array([1.,2.,3.])
a.fill(9.9) # all entries with the same value
print(a)
[9.9 9.9 9.9]
x = np.empty((3, 4)) # shape (dimension) of (3,4) specified,
# without initializing entries
print(x)
[[2. 2.5 3. 3.5]
[4. 4.5 5. 5.5]
[6. 6.5 7. 7.5]]
x = np.zeros((6,6)) # initialized with zero
x
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
y = np.ones((2,2))*2 # a 2 by 2 array with 2.0 in all the entries
print(y)
[[2. 2.]
[2. 2.]]
x[3:5,3:5] += y # Assign y to a sliced portion in x
print(x)
[[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 2. 2. 0.]
[0. 0. 0. 2. 2. 0.]
[0. 0. 0. 0. 0. 0.]]
2.9.6 Use of external package: MXNet
For computations in machine learning, a number of useful packages/modules/
libraries have been developed for creating large-scale machine learning models.
MXNet is one of those. We will also make use of it in this book. Because it is
an external package, MXNet needs to be installed in your computer system using
pip.
pip install mxnet
Note that if an error message like “No module named ‘xyz’ ” is encountered,
which is likely when running the codes in this book, one should install the
“xyz” module in a similar way, so that all the functions and variables
defined there can be made use of. Note also that there is a huge number of
modules/libraries/packages openly available, and it is not possible to install all of
them. The practical way is to install a package only when it is needed. One may encounter
issues during installation, many of which are related to compatibility of the versions of the
involved packages. Searching online for help can often resolve these issues, because
the chance is high that someone has already encountered a similar issue earlier, and
the huge online community has already provided a solution.
After the mxnet module is installed, we import it into our code.
import mxnet as mx
mx.__version__
'1.7.0'
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
from mxnet import nd # import Mxnet library/package
# see https://gluon.mxnet.io for details
x = nd.empty((3, 4)) # Try also x = nd.ones((3, 4))
print(x)
print(x.size) # print the size (number of elements) of a
# multi-dimensional mxnet or nd.array
[[ 0.000 0.000 0.000 0.000]
[ 0.000 0.000 0.000 0.000]
[ 0.000 0.000 0.000 0.000]]
<NDArray 3x4 @cpu(0)>
12
We often create arrays with randomly sampled values when working with neural
networks. In such cases, we initialize an array using a standard normal distribution
with zero mean and unit variance. For example,
y = nd.random_normal(0, 1, shape=(3, 4)) # 0 mean, variance 1
printx('y')
y
y =
[[-0.681 -0.135 0.377 0.410]
[ 0.571 -2.758 1.076 -0.614]
[ 1.831 -1.147 0.054 -2.507]]
<NDArray 3x4 @cpu(0)>
[[-0.681 -0.135 0.377 0.410]
[ 0.571 -2.758 1.076 -0.614]
[ 1.831 -1.147 0.054 -2.507]]
<NDArray 3x4 @cpu(0)>
Element-wise operations of addition, multiplication, and exponentiation all work
for multi-dimensional arrays.
x=nd.exp(y)
x
[[ 0.506 0.873 1.458 1.507]
[ 1.771 0.063 2.934 0.541]
[ 6.239 0.318 1.055 0.081]]
<NDArray 3x4 @cpu(0)>
Often used matrix (2D array) transpose can be obtained as follows:
y.T
[[-0.681 0.571 1.831]
[-0.135 -2.758 -1.147]
[ 0.377 1.076 0.054]
[ 0.410 -0.614 -2.507]]
<NDArray 4x3 @cpu(0)>
Now we can multiply matrices with compatible dimensions, using the dot-product
available in both numpy and MXNet:
nd.dot(x, y.T) # x: 3×4; y.T: 4×3
[[ 0.705 -1.476 -3.776]
[ 0.114 3.662 1.970]
[-3.860 3.774 10.910]]
<NDArray 3×3 @cpu(0)>
Note that nd arrays behave differently from np arrays. They do not usually
work together without proper conversion. Therefore, special care is needed. When
strange behavior is observed, one may print out the variable to check the array type.
The same is generally true when numpy arrays work with arrays in other external
modules, because the array objects are, in general, different from one module to
another. One may use asnumpy() to convert an nd-array to an np-array, when so
desired. Given below is an example (more on this later):
np.dot(x.asnumpy(), y.T.asnumpy())
# convert nd array to np array and then use numpy np.dot()
array([[ 0.705, -1.476, -3.776],
[ 0.114, 3.662, 1.970],
[-3.860, 3.774, 10.910]], dtype=float32)
2.9.7 In-place operations
In machine learning, we frequently deal with big data. To avoid expensive and
complicated data-moving operations, we prefer in-place operations. Let us first take
a look at the following computations and the locations of the data:
print('id(y) before operation:', id(y))
y = y + x # x and y must be shape compatible
print('id(y) after operation:', id(y))# location of y changes
id(y) before operation: 1847926752760
id(y) after operation: 1847462282128
For in-place operations, we do this:
print('id(y) before operation:', id(y))
y[:] = x + y
# addition first, put it in a temporary buffer, then copy to y[:]
print('id(y) after operation:', id(y)) # memory of y remains the same
id(y) before operation: 1847462282128
id(y) after operation: 1847462282128
To perform an in-place addition without using even a temporary buffer, we do
this in MXNet:
print('id(y) before operation:', id(y))
print(nd.elemwise_add(x, y, out=y)) # for mxnet nd.arrays
print('id(y) after operation:', id(y))
# memory location un-changed
id(y) before operation: 1847462282128
[[ 0.837 2.485 4.752 4.931]
[ 5.883 -2.568 9.878 1.009]
[ 20.547 -0.194 3.220 -2.263]]
<NDArray 3x4 @cpu(0)>
id(y) after operation: 1847462282128
If we do not plan to reuse x, then the result can be assigned to x itself. We may
do this in MXNet:
print('id(x) before operation:', id(x))
x += y
print(x)
print('id(x) after operation:', id(x))
id(x) before operation: 1847462202168
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
id(x) after operation: 1847462202168
2.9.8 Slicing from a multi-dimensional array
To read the second and third rows from x, we do this:
print(x)
x[1:3] # read the second and third rows from x
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
[[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 2x4 @cpu(0)>
x[1:2,1:3] # read the 2nd row, 2nd to 3rd columns from x
[[-2.504 12.811]]
<NDArray 1x2 @cpu(0)>
x[1,2] = 88.0 # change the value at the 2nd row, 3rd column
print(x)
x[1:2,1:3] = 88.0
# change the values in the 2nd row, 2nd to 3rd columns
print(x)
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 88.000 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
[[ 1.343 3.358 6.210 6.438]
[ 7.653 88.000 88.000 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
2.9.9 Broadcasting
What would happen if one adds a vector (1D array) y to a matrix (2D array) X? In
Python, this can be done, and is often done in machine learning. Such an operation
is performed using a procedure called “broadcasting”: the low-dimensional array
is duplicated along any axis with dimension 1 to match the shape of the high-
dimensional array, and then the desired operation is performed.
import numpy as np
y = np.arange(6) # y has an initial shape of (6,)
print('y = ', y,'Shape of y:', y.shape)
x = np.arange(24)
print('x = ', x,'Shape of x:', x.shape)
y = [0 1 2 3 4 5] Shape of y: (6,)
x = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23]
Shape of x: (24,)
X = np.reshape(x,(4,6)) # reshape to (4,6), to match (6,) of y
print(X.shape)
print('X = \n', X,'Shape of X:', X.shape)
print('X + y = \n', X + y) # y's shape expands to (4,6) by copying
# the data along axis 0,then the addition
(4, 6)
X =
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]] Shape of X: (4, 6)
X + y =
[[ 0 2 4 6 8 10]
[ 6 8 10 12 14 16]
[12 14 16 18 20 22]
[18 20 22 24 26 28]]
print('shape of X is:',X.shape, ' shape of y is:',y.shape)
print(np.dot(X,y))
shape of X is: (4, 6) shape of y is: (6,)
[ 55 145 235 325]
z = np.reshape(X,(2,3,4))
print (z)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
a = np.arange(12).reshape(3,4)
a.fill(100)
a
array([[100, 100, 100, 100],
[100, 100, 100, 100],
[100, 100, 100, 100]])
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
[[112, 113, 114, 115],
[116, 117, 118, 119],
[120, 121, 122, 123]]])
a = np.arange(4)
a.fill(100)
a
array([100, 100, 100, 100])
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
[[112, 113, 114, 115],
[116, 117, 118, 119],
[120, 121, 122, 123]]])
Broadcasting Rules: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html.
When operating on two arrays, NumPy compares their shapes element-wise. It
starts with the trailing (rightmost) dimensions and works its way to the left. Two dimensions are
compatible when
1. they are equal, or
2. one of them is 1.
If these conditions are met, the dimensions are compatible; otherwise they are not,
and a ValueError is raised. Figure 2.2 shows some examples.
Broadcasting operations:
Figure 2.2: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
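A minimal sketch (not from the text) of the two outcomes: a compatible pair of shapes broadcasts, while an incompatible pair raises a ValueError.

import numpy as np
A = np.ones((4, 6))
b = np.arange(6) # shape (6,) is compatible with (4, 6)
print((A + b).shape) # (4, 6)
c = np.arange(4) # shape (4,) is NOT compatible with (4, 6)
try:
    A + c
except ValueError as err:
    print('ValueError:', err)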
2.9.10 Converting between MXNet NDArray and NumPy
Converting between MXNet NDArrays and NumPy arrays is easy. The converted
arrays do not share memory.
import numpy as np
x = np.arange(24).reshape(4,6)
y = np.arange(6)
x # display x, the last expression in the cell
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
from mxnet import nd
ndx = nd.array(x)
npa = ndx.asnumpy()
type(npa)
numpy.ndarray
npa
array([[ 0.000, 1.000, 2.000, 3.000, 4.000, 5.000],
[ 6.000, 7.000, 8.000, 9.000, 10.000, 11.000],
[ 12.000, 13.000, 14.000, 15.000, 16.000, 17.000],
[ 18.000, 19.000, 20.000, 21.000, 22.000, 23.000]],
dtype=float32)
ndy = nd.array(npa)
ndy
[[ 0.000 1.000 2.000 3.000 4.000 5.000]
[ 6.000 7.000 8.000 9.000 10.000 11.000]
[ 12.000 13.000 14.000 15.000 16.000 17.000]
[ 18.000 19.000 20.000 21.000 22.000 23.000]]
<NDArray 4x6 @cpu(0)>
To figure out the detailed differences between the MXNet NDArrays and the
NumPy arrays, one may refer to https://gluon.mxnet.io/chapter01_crashcourse/ndarray.html.
2.9.11 Subsetting in Numpy
Another feature of numpy arrays is that they are easy to subset.
print(bmi[bmi > 25]) # Print only those with BMI above 25
[ 25.556 25.117 25.051]
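Conditions can also be combined with & (and) and | (or), with each condition in parentheses, and np.where() returns the indices of the selected entries. A minimal sketch, not from the text:

print(bmi[(bmi > 25.0) & (bmi < 25.5)]) # BMI between 25.0 and 25.5
print(np.where(bmi > 25)) # indices of the entries above 25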
2.9.12 Numpy and universal functions (ufunc)
NumPy is a useful library for Python for large-scale numerical computations,
including but not limited to machine learning. It supports efficient operations for
bulky data in multi-dimensional arrays (ndarrays) of large size. It also offers a large
collection of high-level mathematical functions to operate on these arrays [1]. In
2005, Travis Oliphant created NumPy by incorporating features of Numarray into
Numeric (which was originally created by Jim Hugunin with several other developers),
with extensive modifications. More details can be found at https://en.wikipedia.org/wiki/NumPy.
The Numpy documentation states that “a universal function (or ufunc for short)
is a function that operates on ndarrays in an element-by-element fashion, supporting
array broadcasting, type casting, and several other standard features. A ufunc is
a “vectorized” wrapper for a function that takes a fixed number of specific inputs
and produces a fixed number of specific outputs. In NumPy, universal functions
are instances of the numpy.ufunc class (to be discussed later). Many of the built-in
functions are implemented in compiled C code. The basic ufuncs operate on scalars,
but there is also a generalized kind for which the basic elements are sub-arrays
(vectors, matrices, etc.), and broadcasting is done over other dimensions. One can
also produce custom ufunc instances using the frompyfunc factory function”.
More details on ufuncs can be found in the SciPy documentation (https://docs.scipy.org/doc/numpy/reference/ufuncs.html). Given below are two of the many ufuncs:
• exp(x, /[, out, where, casting, order, ...]): Calculate the exponential of all elements in the input array.
• log(x, /[, out, where, casting, order, ...]): Natural logarithm, element-wise.
One may use the following for more details.
# help(np.log) # use this to find out what a ufunc does
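As a quick illustration (a minimal sketch, not from the text), the exp and log ufuncs operate element-wise on a whole array, and np.frompyfunc can turn an ordinary Python function into a custom ufunc:

a = np.array([1.0, 2.0, 4.0])
print(np.exp(a)) # element-wise exponential
print(np.log(a)) # element-wise natural logarithm
double = np.frompyfunc(lambda v: 2*v, 1, 1) # 1 input, 1 output
print(double(a)) # element-wise doubling via a custom ufunc (object dtype)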
2.9.13 Numpy array and vector/matrix
Numpy primarily uses the structure of arrays. The behavior of a numpy array is
similar to, and yet quite different from, the conventional concepts of vector and matrix
that we learned in linear algebra. This can be quite confusing and even frustrating
during coding and debugging. The author has been bothered by such differences quite
frequently in the past. The items below are some of the issues one may need to pay
attention to.
(1) A Numpy array is in principle multidimensional
First, we note that a numpy array is far more than a 1D vector or a 2D matrix. By
default, an array can have as many as 32 dimensions, and even that can be changed. Thus, the numpy
array is extremely powerful, and is a data structure that works well for complex machine
learning models that use multidimensional datasets and data flows. Element-wise
operations, broadcasting, handling of flows of large volumes of data, etc. all work very
efficiently. It does not, however, follow precisely the concepts of vectors and matrices
that we established in conventional linear algebra and use most frequently.
This is essentially the root of the confusion in many cases, in the author’s opinion.
Understanding the following key points is a good start to mitigate the confusion.
(2) Numpy 1D array vs. vector
A Numpy 1D array is similar to the usual vector concept in linear algebra. The
difference is that a 1D array does not distinguish between row and column vectors. It is
just a 1D array with a shape of (n,), where n is the length of the array. It behaves
largely like a row vector, but not quite. For example, transpose() has no effect on
it, because what the transpose() function does is swap two axes of an
array that has two or more axes.
The column vector in linear algebra should be treated in numpy as a special case
of a 2D array with only one column. One can create an array like the column vector
in linear algebra by adding an additional axis to the 1D array. See the following
examples.
a = np.array([1, 2, 3]) # create an 1D array with length 3
print(a, a.T, a.shape, a.T.shape) # its shape will be (3,)
[1 2 3] [1 2 3] (3,) (3,)
As shown, “.T” for transpose() has no effect to the 1D array.
an = a[:,np.newaxis] # add in a newaxis, shape becomes (3, 1)
print(an, an.shape) # it's a column vector, a special 2D array
[[1]
[2]
[3]] (3, 1)
print(an.T, an.T.shape) # .T works; it becomes a row vector
[[1 2 3]] (1, 3)
The axis-added array becomes a 2D array, and the transpose() function now works. It
creates a “real” row vector, which is a special case of a 2D array in numpy.
The same can also be achieved using the following tricks.
a1 = a.reshape(a.shape+(1,)) # reshape, adds one more dimension
print(a1, a1.shape)
aN = a[:,None] # None is equivalent to np.newaxis here
print(aN, aN.shape)
[[1]
[2]
[3]] (3, 1)
[[1]
[2]
[3]] (3, 1)
Adding an axis can be done at any axis:
a0 = a[np.newaxis,:] # the new axis is added to the 0-axis
print(a0, a0.shape) # shape becomes (1,3), a row vector
[[1 2 3]] (1, 3)
print(an+a0) # interesting trick to create a Hankel matrix
[[2 3 4]
[3 4 5]
[4 5 6]]
Once we know how to create arrays equivalent to the usual row and column
vectors of conventional linear algebra, we shall be much more comfortable
debugging codes when encountering strange behavior.
Another often encountered example is solving linear system equations.
From conventional linear algebra, we know that the right-hand-side (rhs) should
be a column vector, and the solution should also be a column vector. When using
numpy.linalg.solve() for a linear algebraic equation, we can feed in a 1D array
as the rhs vector, and it will return a solution that is also a 1D array. Of course,
we can also feed in a column vector (a 2D array with only one column). In
that case, we will get the solution in a 2D array with only one column. We shall see
examples in Section 3.1.11, and many cases in later chapters. These behaviors are
all expected and correct in numpy.
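A minimal sketch (not from the text, using a small hypothetical 2-by-2 system) showing both ways of feeding the right-hand side to numpy.linalg.solve():

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b1 = np.array([9.0, 8.0]) # 1D rhs, shape (2,)
x1 = np.linalg.solve(A, b1)
print(x1, x1.shape) # 1D solution, shape (2,)
b2 = b1[:, np.newaxis] # column rhs, shape (2, 1)
x2 = np.linalg.solve(A, b2)
print(x2, x2.shape) # column solution, shape (2, 1)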
(3) Numpy 2D array vs. matrix
A Numpy 2D array is largely similar to the usual matrix in linear algebra, but not
quite. For example, the matrix multiplication of linear algebra is done in
numpy using the dot-product, such as np.dot() or the “@” operator (available
since Python 3.5). The “*” operator is an element-wise multiplication, as shown
in Section 2.9.4. We will see more examples in Chapter 3 and later chapters. Also,
some operations on a numpy array can result in a dimension change. For example,
when mean() is applied to an array, the dimension is reduced. Thus, care is required
regarding which axis is collapsed.
Note that there is a numpy matrix class (see the definition of class in Section 2.14)
that can be used to create matrix objects. These objects behave quite similarly to
the matrix in linear algebra. We try not to use it, because it will be deprecated one
day, as announced in the online document numpy-ref.pdf.
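The following minimal sketch (not from the text) contrasts the element-wise “*”, the dot-product “@”, and the axis collapse caused by mean():

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
print(A * B) # element-wise multiplication
print(A @ B) # matrix multiplication, same as np.dot(A, B)
print(A.mean(axis=0), A.mean(axis=0).shape) # axis 0 is collapsed: shape (2,)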
Based on the author’s limited experience, once we are aware of these differences
and behavior subtleties (more discussion later when we encounter them), we can
pay attention to them as they arise. It is often helpful to frequently
check the shapes of the arrays. This allows us to work more effectively with
powerful numpy arrays, including performing proper linear algebra analysis. At this
moment, it is quite difficult to discuss the theorems of linear algebra using 1D, 2D,
or higher-dimensional array concepts. In the later chapters, we will still follow the
general rules and principles, and use the terms vector and matrix of conventional
linear algebra, because many theoretical findings and arguments are based on them. A
vector refers generally to a 1D numpy array, and a matrix to a 2D numpy array.
When examining the outcomes of numpy codes, we shall notice the behavior subtleties
of the numpy arrays.
2.10 Sets: No Duplication
A set is a collection similar to a list, but with no duplicate entries (and no inherent ordering).
sentence1 = "His first name is Mark and Mark is his first name"
words1 = sentence1.split() # use split() to form a set of words
print(words1) # whole list of these words is printed
word_set1 = set(words1) # convert to a set
print(word_set1) # print. No duplication
['His', 'first', 'name', 'is', 'Mark', 'and', 'Mark', 'is', 'his',
'first', 'name']
{'and', 'His', 'is', 'first', 'Mark', 'his', 'name'}
Using a set to get rid of duplication is useful for many situations. Many other
useful operations can be applied to sets. For example, we may want to find the
intersection of two sets. To show this, let us create a new list and then a new set.
sentence2 = "Her first name is Jane and Jane is her first name"
words2 = sentence2.split()
print(words2) # whole list of these words is printed.
word_set2 = set(words2) # convert to a set
print(word_set2) # print. No duplication
['Her', 'first', 'name', 'is', 'Jane', 'and', 'Jane', 'is', 'her',
'first', 'name']
{'her', 'Jane', 'Her', 'is', 'first', 'and', 'name'}
2.10.1 Intersection of two sets
print(word_set1.intersection(word_set2)) #intersection of two sets
{'first', 'and', 'name', 'is'}
2.10.2 Difference of two sets
print(word_set1.difference(word_set2))
{'his', 'His', 'Mark'}
This finds words in word_set1 that are not in word_set2.
We may also want to find the words that are in either set but not in both (the
symmetric difference), as follows:
print(word_set1.symmetric_difference(word_set2))
{'her', 'His', 'Jane', 'Her', 'Mark', 'his'}
Can we do similar operations to lists? Try this.
#print(words1.intersection(words2)) # intersection of two list?
# No. It throws an AttributeError.
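To use set operations on plain lists, one can simply convert them to sets first. A minimal sketch, not from the text:

print(set(words1) & set(words2)) # convert to sets, then take the intersection
print(set(words1) - set(words2)) # words in words1 but not in words2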
2.11 List Comprehensions
We used list comprehension for particular situations a few times. List comprehension
is a very powerful tool for operations on all iterables including lists and numpy
arrays. When used on a list, it creates a new list based on another list, in a single,
readable line.
In the following example, we would like to create a list of integers which specify
the length of each word in a sentence, but only if the word is not “the”. The natural
way to do this is as follows:
sentence="Raises the Sun and comes the light" # Create a string.
words = sentence.split() # Create a list of words using split().
word_lengths = [] # Empty list for the lengths of the words
words_nothe = [] # Empty list for the words that are not "the"
for word in words:
if word != "the":
word_lengths.append(len(word))
words_nothe.append(word)
print(words_nothe, ' ',word_lengths)
['Raises', 'Sun', 'and', 'comes', 'light'] [6, 3, 3, 5, 5]
With a list comprehension, we simply do this:
words = sentence.split()
word_lengths = [len(word) for word in words if word != "the"]
words_nothe = [word for word in words if word != "the"]
print(words_nothe, ' ',word_lengths)
['Raises', 'Sun', 'and', 'comes', 'light'] [6, 3, 3, 5, 5]
The following is even better:
words_nothe2=[] # Empty list to hold the lists of word & length
# for words that are not "the", lists in list
words_nothe2=[[word,len(word)] for word in words if word != "the"]
# one may use () instead of the inner []
print(words_nothe2)
[['Raises', 6], ['Sun', 3], ['and', 3], ['comes', 5], ['light', 5]]
The following example applies list comprehension to numpy arrays:
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
fx = []
x = np.arange(-2, 2, 0.5) # Numpy array with equally spaced values
fx = np.array([-(-xi)**0.5 if xi < 0.0 else xi**0.5 for xi in x])
# Creates a piecewise function (such as activation function).
# The created list is then converted to a numpy array.
print('x =',x); print('fx=',fx)
x = [-2.000 -1.500 -1.000 -0.500 0.000 0.500 1.000 1.500]
fx= [-1.414 -1.225 -1.000 -0.707 0.000 0.707 1.000 1.225]
2.12 Conditions, “if” Statements, “for” and “while” Loops
In machine learning programming, one frequently uses conditions, “if” statements,
“for” and “while” loops. Boolean variables are used to evaluate conditions. The
boolean values True or False are returned when an expression is compared or
evaluated.
2.12.1 Comparison operators
Comparison operators include ==, <, <=, >, >=, and !=. The == operator compares
the values of the two operands and checks for value equality, “!=” checks for
inequality, and the others check for ordering (less/greater than, possibly combined with equality).
x = 2
print(x == 2) # The comparison results in a boolean value: True
print(x == 3) # The comparison results in a boolean value: False
print(x < 3) # The comparison results in a boolean value: True
True
False
True
x = 2
if x == 2:
print("x equals 2!")
else:
print("x does not equal to 2.")
x equals 2!
name, age = "Richard", 18
if name == "Richard" and age == 18: # if-block!
print("He is famous. His name is", name, \
"and he is only",age,"years old.")
He is famous. His name is Richard and he is only 18 years old.
temp_critical = 48.0 #unit: degree Celsius (C).
current_temp = 50.0
if current_temp >= temp_critical:
print("The current temperature is", current_temp, \
"degree C. It is above the critical temperature of",\
temp_critical,"degree C. Actions are needed. ")
else:
print("The current temperature is", current_temp, \
"degree C. It is below the critical temperature of",\
temp_critical, "degree C. No action is needed for now.")
# Notice the use of block, and "\" to break a long line.
The current temperature is 50.0 degree C. It is above the
critical temperature of 48.0 degree C. Actions are needed.
2.12.2 The “in” operator
The “in” operator is used to check if a specified object exists within an iterable
object container, such as a list.
name1, name2 = "Kevin", 'John'
groupA= ["Kevin", "Richard"] # a list with two strings.
if name1 in groupA:
print("The person's name is either",groupA[0], "or", groupA[1])
if name2 in groupA:
print("The person's name is either",groupA[0], "or", groupA[1])
else:
print(name2, "is not found in group A")
The person's name is either Kevin or Richard
John is not found in group A
2.12.3 The “is” operator
Unlike the double equals operator “==”, the “is” operator does not check the values
of the operands. It checks whether both the operands refer to the same object or
not. For example,
x, y = ['a','b'], ['a','b']
z = y # makes z pointing to y
print(x == y,' x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints True,
# because the values in x and y are equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# because x and y have different IDs
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z
True x= ['a', 'b'] y= ['a', 'b'] 0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'b'] 0x1ae25570188 0x1ae25570208
True y= ['a', 'b'] z= ['a', 'b'] 0x1ae25570208 0x1ae25570208
y[1]='x' # change the 2nd element
print('After change one value in y')
print(x == y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# their values are no longer equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y)))
# False, x is NOT y
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z:
# they change together!
After change one value in y
False x= ['a', 'b'] y= ['a', 'x'] 0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'x'] 0x1ae25570188 0x1ae25570208
True y= ['a', 'x'] z= ['a', 'x'] 0x1ae25570208 0x1ae25570208
y.append(x)
print('After change one value in y')
print(x == y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# their values are no longer equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y)))
# False, x is NOT y
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z:
# they change together!
After change one value in y
False x= ['a', 'b'] y= ['a', 'x', ['a', 'b'], ['a', 'b']]
0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'x', ['a', 'b'], ['a', 'b']]
0x1ae25570188 0x1ae25570208
True y= ['a', 'x', ['a', 'b'], ['a', 'b']] z= ['a', 'x', ['a',
'b'], ['a', 'b']] 0x1ae25570208 0x1ae25570208
2.12.4 The ‘not’ operator
Using “not” before a boolean expression inverts the value of the expression.
print(not False) # Prints out True
print(not False == False) # Prints out False
True
False
The “not” operator can also be used with “is” and “in”: “is not”,“not in”:
x, y = [1,2,3], [1,2,3]
z = y # makes z pointing to same object as y
print(x == y,' x=',x,'y=',y,hex(id(x)),hex(id(y))) # True,
# because the values in x and y are equal
print(x is not y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # True,
# because x and y are different objects
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z
True x= [1, 2, 3] y= [1, 2, 3] 0x1ae25570348 0x1ae255702c8
True x= [1, 2, 3] y= [1, 2, 3] 0x1ae25570348 0x1ae255702c8
True y= [1, 2, 3] z= [1, 2, 3] 0x1ae255702c8 0x1ae255702c8
2.12.5 The “if ” statements
As any other language, the “if” statement is often used for programming, we have
already seen some earlier. The following are more examples for using Python’s
conditions in the “if” statement with code blocks:
temp_good = 22.0 # unit: degree Celsius (C).
temp_now1 = 48.0 # Readers may change this value and try
statement1=(temp_now1<=(temp_good+10.0))&(temp_now1>=(temp_good-10.0))
# comfortable
statement2 = temp_now1 < (temp_good-10.0) # cold
statement3 = temp_now1 > (temp_good+10.0) # hot
if statement1 is True: # do not forget ":"
print("Okay, it is",temp_now1,"degree C. Comfortable. Let's go.")
pass # do something
elif statement2 is True: # do not forget ":"
print("No, it is",temp_now1,"degree C. Too cold. We cannot go.")
pass # do something else
elif statement3 is True:
print("No, it",temp_now1,"degree C. Too hot. We cannot go.")
pass # do something else
else:
print("Let's check the temperature.") # do another thing
pass
No, it 48.0 degree C. Too hot. We cannot go.
Note that Python places no limit on how many elif blocks one can use in an if
statement.
2.12.6 The “for” loops
There are two types of loops in Python, for and while. We have used “for” loops
already. We discuss it here in more detail together with the while loops. A for loop
iterates over a given iterable sequence. The starting and stopping points and the
step-size are controlled by the sequence, as shown in the following example. It is
sequence controlled.
primes = [2, 3, 5] # define a list (that is iterable)
for prime in primes: # do not forget ":"
print(prime, end=' ')
2 3 5
For loops can iterate over a sequence of numbers using the “range” function.
range() is a built-in function which returns a range object: a sequence of integers
between the given start integer and the stop integer (the stop value is excluded).
It is generally used for iteration with for loops.
print('Members in range(10) are')
for n in range(10): # Syntax: range(start, stop[, step])
print(n, end=',')
print('\nMembers in range(3, 8) are')
for n in range(3, 8): # Default step is 1
print(n, end=',')
Members in range(10) are
0,1,2,3,4,5,6,7,8,9,
Members in range(3, 8) are
3,4,5,6,7,
print('Members in range(-3, 10, 2) are')
for n in range(-3, 10, 2): # starting from a negative value
print(n, end=',')
print('\nMembers in range(10, -3, -2) are')
for n in range(10, -3, -2): # reverse range
print(n, end=',')
Members in range(-3, 10, 2) are
-3,-1,1,3,5,7,9,
Members in range(10, -3, -2) are
10,8,6,4,2,0,-2,
For a given list of numbers, let us display each element and its double, using a
for loop and the range() function.
print("Double the numbers in a list, using for-loop and range()")
given_list = [10, 30, 40, 50]
for i in range(len(given_list)):
print("Index["+str(i)+"]","Value in the given list is",
given_list[i],", and its double is", given_list[i]*2)
Double the numbers in a list, using for-loop and range()
Index[0] Value in the given list is 10 , and its double is 20
Index[1] Value in the given list is 30 , and its double is 60
Index[2] Value in the given list is 40 , and its double is 80
Index[3] Value in the given list is 50 , and its double is 100
The range() function returns an immutable sequence object of integers, so it is
possible to convert a range() output to a list, using the list class. For example,
print("Converting python range() output to a list")
list_rng = list(range(-10,12,2))
print(list_rng)
Converting python range() output to a list
[-10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10]
2.12.7 The “while” loops
The “while” loops repeat as long as a certain boolean condition is met. The condition
controls the operations. For example,
#To print out 0,1,2,3,4,5,6,7,8,9,
count = 0
while count < 10: # do not forget ":"
print(count, end=',')
count += 1 # This is the same as count = count + 1
0,1,2,3,4,5,6,7,8,9,
“break” and “continue” statements: break is used to exit a “for” loop or a “while”
loop, whereas continue is used to skip the current block, and return to the “for” or
“while” statement.
# print all integers below a given limit.
count = 0
while True:
print(count,end=',')
count += 1
if count >= 10:
break
print('\n')
# Prints out only even numbers: 0,2,4,6,8,
for n in range(10): # for-loop to control the range
if n % 2 != 0: # Check condition, control what to print
continue # skip the print below and go on to the next n
print(n,end=',')
0,1,2,3,4,5,6,7,8,9,
0,2,4,6,8,
When the loop condition fails (i.e., the loop finishes normally), the code in the “else”
part is executed. If a break statement is executed inside the loop, the “else” part is
skipped. Note that the “else” part is still executed even if there is a continue
statement in the loop body.
# Prints out 0,1,2,3,4 and then it prints "count value reached 5"
count=0
nlimit = 5
while(count<nlimit):
print(count, end=',')
count +=1
else:
print("count value reached %d" %(nlimit))
# Prints out 1,2,3,4
for i in range(1, 10):
if(i%5 == 0): # modulo division (%)
break
print(i, end=',')
else:
print("This is not printed because for-loop is terminated due\
to the break but not due to fail in condition")
0,1,2,3,4,count value reached 5
1,2,3,4,
2.12.8 Ternary conditionals
The following 4-line code
condition=True
if condition:
x=1
else:
x=0
print(x)
1
can be written in one line, with ternary conditionals:
condition=True
x=1 if condition else 0
print(x)
1
It is simple, readable, and DRY (“Don’t Repeat Yourself”). Thus, ternary conditionals are frequently used in Python.
2.13 Functions (Methods)
Functions offer a convenient way to divide code into useful blocks that can be called
an unlimited number of times when needed. This can drastically reduce code repetition, and
make code cleaner, more readable, and easier to maintain. In addition, functions are
a good way to define interfaces for easy sharing of code among programmers.
2.13.1 Block structure for function definition
A function has a “block” structure. Block keywords include those we have already
seen, such as “if”, “for”, and “while”. Functions in Python are defined using the
block keyword “def”, followed by a function name that is also the block’s name.
The function is called using the function name followed by (), which brackets the
arguments, if any. Try this simplest function:
def print_hello(): # do not forget ":"
print("Hello, welcome to this simple function!")
print_hello()
Hello, welcome to this simple function!
2.13.2 Function with arguments
In the simple case given above, no argument is required. Functions are often created
with required arguments that are variables passed from the caller to the function.
def greeting_student(username, greeting):
print(f"Hello, {username}, greetings! Wish you {greeting}")
greeting_student("Kevin", "a fun journey in using functions!")
Hello, Kevin, greetings! Wish you a fun journey in using functions!
Functions may be created with return values to the caller, using the keyword
“return”.
def sum_two_numbers(a, b):
return a + b
x, y = 2.0, 8.0
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}') # print('%f + %f = %f'%(x,y,apb))
x, y = 20, 80
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}, perfect!')
#print('%d + %d = %d: perfect!'%(x,y,apb))
print(f'{x + y}, perfect!') #f-string allow the use of operations
2.0 + 8.0 = 10.0
20 + 80 = 100, perfect!
100, perfect!
Variable scope: the LEGB rule (local, enclosing, global, built-ins) defines the sequence
in which Python searches for a variable. The search terminates when the variable is
found.
def sum_two_numbers(a, b):
a += 1
print('Inside the function, a=',a)
print('Inside the function x=',x)
return a + b
x, y = 2.0, 8.0
print('Before the function is called,x=',x,'y=',y)
apb = sum_two_numbers(x, y)
print('After the function is called, %f + %f = %f'%(x,y,apb))
Before the function is called,x= 2.0 y= 8.0
Inside the function, a= 3.0
Inside the function x= 2.0
After the function is called, 2.000000 + 8.000000 = 11.000000
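The “enclosing” level of the LEGB rule applies to nested functions: an inner function can read variables defined in the function that encloses it. A minimal sketch, not from the text:

def outer():
    msg = "enclosing variable" # lives in the enclosing scope
    def inner():
        print('Inside inner:', msg) # found via the E (enclosing) in LEGB
    inner()
outer()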
2.13.3 Lambda functions (Anonymous functions)
A lambda function is a single-line anonymous function. It may be the simplest form
of a function, and one of the most useful. A linear function can be defined simply as
f = lambda x, k, b: k * x + b # do not forget ":"
print(f(1,1,1),f(2,2,2),f(1,2,3))
2 6 5
A quadratic function can be defined as
f = lambda x, a, b, c: a*x**2 + b*x +c
print(f(2,2,-4,6))
Often, lambda functions are used together with normal functions, especially for
returning a value, where a single-line function comes in handy.
def func2nd_order(a,b,c):
return lambda x: a*x**2 + b*x +c
f2 = func2nd_order(2,-4,6)
print('f2(2)=',f2(2), 'or', func2nd_order(2,-4,6)(2))
f2(2)= 6 or 6
2.14 Classes and Objects
A class is a single entity that encapsulates variables and functions (or methods).
A class is essentially a template for creating class objects. One can create an unlimited
number of class objects with it, and each class object gets its structure, variables/attributes,
and functions (or methods) from the class. References used in this
section include the following:
• https://www.python-course.eu/python3_class_and_instance_attributes.php.
• https://realpython.com/instance-class-and-static-methods-demystified/.
2.14.1 A simplest class
class C: # Define a class named C
''' A simplest possible class named "C" '''
ca = "class attribute" # an attribute defined in the class
help(C) # check out what has been created
Help on class C in module __main__:
class C(builtins.object)
| A simplest possible class named "C"
|
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| --------------------------------------------------------------
| Data and other attributes defined here:
|
| ca = 'class attribute'
This example shows how a class is structured, how the comments given in ''' ''' in the
class definition (the docstring) are used to convey a message to the user, and how an
attribute can be created in the class. We can now use it to observe some behavior of a
class attribute and instance attributes.
i1 = C() # an instance i1 (a class object) is created.
i2 = C() # an instance i2 (another class object) is created.
print('i1.ca=',i1.ca) # prints: 'class attribute!'
print('i2.ca=',i2.ca) # prints: 'class attribute!'
print('C.ca =',C.ca) # prints: 'class attribute!'
#print('ca=',ca) # NameError:'ca' is not defined
i1.ca= class attribute
i2.ca= class attribute
C.ca = class attribute
C.ca = "This is a changed class attribute 'ca'"
# Changing the class attribute, via class
print('C.ca =',C.ca)
print('i1.ca=',i1.ca)
print('i2.ca=',i2.ca)
C.ca = This is a changed class attribute 'ca'
i1.ca= This is a changed class attribute 'ca'
i2.ca= This is a changed class attribute 'ca'
Note the values in the instances are also changed.
i1.ca = "This is a changed instance attribute 'ca'"
# Changing an instance attribute
print('i1.ca=',i1.ca) # Changed
print('C.ca =',C.ca) # Unchanged
print('i2.ca=',i2.ca) # Unchanged
i1.ca= This is a changed instance attribute 'ca'
C.ca = This is a changed class attribute 'ca'
i2.ca= This is a changed class attribute 'ca'
The change is only effective for the instance attribute that is changed.
C.ca = "The 2nd changed class attribute 'ca'"
# Changing the class attribute, via class
print('C.ca =',C.ca) # should change accordingly
print('i1.ca=',i1.ca) # Will not change! It no longer follows C,
# because it was changed after creation
print('i2.ca=',i2.ca) # should change according to class C,
# because it has not been changed since creation
C.ca = The 2nd changed class attribute 'ca'
i1.ca= This is a changed instance attribute 'ca'
i2.ca= The 2nd changed class attribute 'ca'
Class attributes and object instance attributes are stored in separate dictionaries:
C.__dict__
mappingproxy({'__module__': '__main__',
'__doc__': ' A simplest possible class named "C" ',
'ca': "The 2nd changed class attribute 'ca'",
'__dict__': <attribute '__dict__' of 'C' objects>,
'__weakref__': <attribute '__weakref__' of 'C' objects>})
i1.__dict__
{'ca': "This is a changed instance attribute 'ca'"}
It is clear that a dictionary has been created when a change is made at the instance
level, departing from the class level.
i2.__dict__
{}
No dictionary has been created, because no change is made at the instance level. It
stays with the class.
i2.ca = "Make now a change at the instance y to the attribute 'ca'"
i2.__dict__
{'ca': "Make now a change at the instance y to the attribute 'ca'"}
A dictionary has now been created, because a change was made at the instance
level; it has departed from the class level. Any future change at the class level to this
attribute will no longer affect the attribute at this instance level.
2.14.2 A class for scientific computation
Let us look at an example of simple scientific computation. We first create a class
called Circle to compute the area of a circle for a given radius. The following is the
code:
class Circle:
''' Class "Circle": Compute the area of a circle '''
pi = 3.14159 #class attribute of constants used class-wide
# and class specific
def __init__(self, radius): # a special constructor
# __init__ is executed when the class is called.
# it is used to initiate a class. For this simple
# task we need only one variable: radius.
# "self" is used to reserve an argument place for an
# instance (to-be-created) itself to pass along.
# It is used for all the functions in a class.
self.radius = radius # This allows the instance accessing
# the variable: radius.
def circle_area(self): # function computes circle area
return self.pi * self.radius **2 # pi gets in there via
# the object instance itself
#help(Circle) # check out what has been created
We can now use this class to perform computations.
r = 10
c10 = Circle(r) # create an instance c10. c10 is now passed to
# self inside the class definition
# 10 is passed to the self.radius
print('Circle.pi before re-assignment',Circle.pi)
# access pi via class
print('Radius=',c10.radius)
# access via object c10.radius is the self.radius in __init__
print('c10.pi before re-assignment', c10.pi)
# The class attribute is accessed via instance attribute
Circle.pi before re-assignment 3.14159
Radius= 10
c10.pi before re-assignment 3.14159
c10.pi=3.14 # this will change the constant for instance c10
# It will not change the class-wide pi value
print('c10.pi after re-assignment via c10.pi:',c10.pi)
print('Circle.pi after re-assignment via c10.pi:',Circle.pi)
print('circle_area of c10 =',c10.circle_area())
print('circle_area of Circle100=',Circle(100).circle_area())
c10.pi after re-assignment via c10.pi: 3.14
Circle.pi after re-assignment via c10.pi: 3.14159
circle_area of c10 = 314.0
circle_area of Circle100= 31415.899999999998
It is seen that the Class Circle works well. Let us now create a subclass.
2.14.3 Subclass (class inheritance)
Subclasses can often be used to take advantage of the inheritance feature in
Python. This allows us to create new classes by fully making use of an existing class
and its entire structure (attributes and functions), without affecting the ongoing
use of the existing class. It is thus also useful for upgrading existing programs,
because it reduces duplication.
Assume that the Circle code created above has already been distributed and
used by many. We now decide to create another class to compute the area of a
partial circle, given the portion of the circle. We can create a
subclass for this purpose, called P_circle, without affecting the use of the already
distributed Circle. The following is the code:
class P_circle(Circle):
# Subclass P_circle referring the base (or parent)
# Circle in (). This establishes the inheritance
''' Subclass "P_circle" based on Class "Circle": Compute the\
area of a circle portion '''
def __init__(self,radius,portion):
# with 3 attributes: self, radius, and portion.
super().__init__(radius) # This brings in base attributes
# from the base class Circle.
self.portion = portion # Subclass attribute.
def pcircle_area(self):
# define a function to compute the area of a partial circle
return self.portion*self.circle_area() # New function in
# subclass. The base class Circle is used here.
#help(P_circle) # check out what has been created
Readers may remove the "#" in the above cell, execute it, and take a moment to read
through the information, to see how the subclass is structured, its connection
with the base class, how self is used to prepare for connections with the future
objects to be assigned, and which attributes and functions are newly created and
which are inherited from the base class.
pc10 = P_circle(10.,0.5) # create an object instance using the
# subclass, with argument radius=10 and portion =50%
pc10.pi # we have the same attribute from the base class
3.14159
pc10.radius # we have the same attribute from the base class
10.0
pc10.pcircle_area() # area of a 50% partial circle is computed
157.0795
Let us now make a change to constant pi via subclass instance.
pc10.pi = 3.14
pc10.pi
3.14
It changed. Let us check pi via the base-class instance c10.
c10.pi
3.14
It remains unchanged. Actions on the subclass do not affect the base class.
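As a quick check of the inheritance relationship, a minimal sketch using the built-in functions isinstance() and issubclass() on the objects created above:
print(isinstance(pc10, P_circle), isinstance(pc10, Circle))
print(issubclass(P_circle, Circle))
True True
True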
2.15 Modules
We now touch upon modules. A module is a Python file that provides a specific
functionality. For example, when writing a finite element program, we may write
one module for creating the stiffness matrix and another for solving the system
equations. Each module is a separate Python file, which can be written and edited
independently. This helps a lot in organizing and maintaining large programs.
A module in Python is a Python file with the .py extension, and the file name is
the module name. Such a module can have a set of functions, classes, and variables
defined in it. Within a module, one can import other modules using the procedure
mentioned at the beginning of this chapter.
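As a minimal sketch (the module name circle_tools.py and its contents are hypothetical), suppose the following file is saved in the working folder:
# File circle_tools.py: a hypothetical user-defined module
pi = 3.14159
def circle_area(radius):
    return pi*radius**2
It can then be imported and used in another script or notebook placed in the same folder:
import circle_tools                      # import the whole module
from circle_tools import circle_area     # or import one function from it
print(circle_tools.circle_area(10.0), circle_area(10.0))   # 314.159 314.159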
2.16 Generation of Plots
Python is also very powerful for generating plots. This is done by importing modules
that are openly available. Here, we shall present a simple demo plot of scattered
circles.
First, we import the modules needed.
import numpy as np
import matplotlib.pyplot as plt
# matplotlib.pyplot is a plot function in the matplotlib module
%matplotlib inline
# to have the plot generated inside the notebook
# Otherwise, it will be generated in a new window
We now generate sample data, and then plot 80 randomly generated circles.
n=80
x=np.random.rand(n) # Coordinates randomly generated
y=np.random.rand(n)
colors=np.random.rand(n)
areas=np.pi*(18*np.random.rand(n))**2
# circle radii from 0~18, randomly generated
plt.scatter(x,y,s=areas,c=colors,alpha=0.8)
plt.show()
Figure 2.3: Randomly generated circular areas filled with different colors.
# Plot a curve
x = range(1000)
y = [i ** 2 for i in x]
plt.plot(x,y)
plt.show();
Figure 2.4: Curve for a quadratic function.
x = np.linspace(0, 1, 1000)**1.5
plt.hist(x);
Figure 2.5: An example of a histogram.
2.17 Code Performance Assessment
Performance assessment on a code can be done in two ways. Typical example codes
are given below. Readers may make use of these codes for accessing computational
performance to his/her codes.
import time # import time module
import numpy as np
g=list(range(10_000_000))
#print(g)
q=np.array(g,'float64')
#print(q)
start = time.process_time()
sg=sum(g)
t_elapsed = (time.process_time() - start)
print(sg,'Elapsed time=',t_elapsed)
start = time.process_time()
sq=np.sum(q)
t_elapsed = (time.process_time() - start)
print(sq,t_elapsed)
49999995000000 Elapsed time= 0.28125
49999995000000.0 0.03125
%%timeit #use timeit
sg=sum(g)
329 ms ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.18 Summary
With the basic knowledge of Python and its related modules, and the essential
techniques for coding, we are now ready to learn to code for computations and
machine learning techniques. In this process, one can also gradually improve one's
Python coding skills.
Reference
[1] C.R. Harris, K.J. Millman, S.J. van der Walt et al., Array programming with NumPy,
Nature, 585(7825), 357–362, Sep 2020. http://dx.doi.org/10.1038/s41586-020-2649-2.
Chapter 3
Basic Mathematical Computations
This chapter discusses typical scientific computations using codes in Python.
We will focus on how numerical data are represented mathematically, how
they are structured or organized, stored, manipulated, operated upon, or
computed in effective manners. Subtleties in those operations in Python will
be examined. At the end of this chapter, techniques for initial treatment
of datasets will also be discussed. The reference materials used in the
chapter include Numpy documentation, Scipy documentation, https://gluon.
mxnet.io/, and https://jupyter.org/. Our discussion shall start with some
basic linear algebra operations on data with a structure of vector, matrix
and tensor.
3.1 Linear Algebra
Linear algebra is essential for any computation that involves big data,
such as in machine learning. We briefly review basic linear algebraic
operations through Python programming, using modules that
have already been developed by the Python community at large. We shall go
through the basic concepts, the mathematical notation, the data structures, and
the computation procedures. Readers should feel free to skim or skip this chapter
if they are already confident in basic linear algebra computations. Our
discussion will start from the data structure. First, we import the necessary
modules and functions.
import sys # import "sys" module
sys.path.append('grbin') # current/relative directory,
# like ..\\..\\code
# Or absolute folder like 'F:\\xxx\\...\\code'
import grcodes as gr # grcodes module is placed in the
# folder above
from grcodes import printx # import a particular function
import numpy as np # Import Numpy package
We will also use the MXNet package. If it is not installed yet, MXNet can be
installed using: pip install mxnet.
After the installation, we import the NDArray module from MXNet.
from mxnet import nd # Import NDArray and give it an alias nd
3.1.1 Scalar numbers
As discussed in Chapter 2, scalar numbers for mathematical
computations have three major types: integer, real number, and complex
number. In Python, such a number is assigned a unique name
and a given address. It is accessed by calling its name, can be updated, and
can be used as an argument of a properly defined function (built-in, defined in the code,
or in an imported module). In Python programming, all these operations
on a number are straightforward. The most often encountered problem in
computation is that a number may exceed the representable range of the machine
(overflow or underflow) and may then become illegal as an argument
for a function. Otherwise, we assume that the numbers generated in Python can
cover the entire real space within the limit of machine accuracy.
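As a minimal sketch of these machine limits (using the standard sys module and numpy), one may inspect the largest representable double-precision float and the machine epsilon, and observe an overflow:
import sys
import numpy as np
print(sys.float_info.max)          # largest double-precision float, about 1.8e308
print(np.finfo(np.float64).eps)    # machine epsilon for float64, about 2.2e-16
print(np.float64(1e308)*10)        # overflows to inf (a runtime warning may be issued)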
3.1.2 Vectors
A vector refers to an object that has more than one component stacked along
one dimension. It can have a physical meaning depending on the type of the
physics problem. For example, a force vector in three-dimensional (3D) space
has three components, each representing the projection of the force vector
onto one of the three axes of the space. The number of components is also known as
the number of degrees of freedom (DoFs). Higher dimensions are encountered in discretized numerical
models, such as the finite element method or FEM (e.g., [1]). In the FEM,
a solid structure is discretized in space with elements and nodes. The number of DoFs
of such a discretized model depends on the number of nodes, which can be
very large, often in the order of millions. Therefore, we form vectors with
millions of components or entries. In machine learning models, the features
and labels can be written in vector form.
In this chapter, we will not discuss much about the physical problems.
Instead, we discuss general aspects of a vector in the abstract, and issues with
the computational operations that we may perform on the vector for a given
coordinate system. The number of DoFs of a vector is also referred to as its length. A
vector of length p has a shape denoted in Python as (p,).
p = 15 # Length of the vector
x = nd.arange(p) # Create a vector that is an nd-array
# using nd.arange() function
gr.printx('x') # x is now a vector with n components
printx('x') # Use the printx function directly
print(x.shape)
x =
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
x =
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
(15,)
Note that in mathematics, a vector is often presented as a
column vector. In Python, it shows as a row vector.
print(x.T) # Transpose of x
print(x.T.shape)
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
(15,)
In mathematics, the transpose of a (column) vector becomes a
row vector, and vice versa. In MXNet NDArray, the transposed vector
is stored in the same place but marked as transposed, so that operations,
such as multiplication, can be performed properly. This saves operations for
physically copying and moving the data, improving efficiency. To confirm
this, we just print out the addresses, as follows.
print(id(x))
print(id(x.T)) # Transpose of x
2212173184080
2212173184080
It is clear that in NDArray, a vector is a vector; one does not distinguish
whether it is a column or a row vector. Its transpose is just a marker on it, and
only one set of data is stored. This is an example of how MXNet pays special
attention to not moving data in memory unnecessarily. We do not seem
to observe this behavior in numpy arrays, as shown below.
xnp = np.arange(15) # a Numpy array is generated
print(id(xnp), xnp.shape)
print(id(xnp.T),xnp.T.shape) # address is changed
2212173195344 (15,)
2212173195424 (15,)
The shape of the numpy array is unchanged (still a 1D array), but its transpose
is given a separate address.
3.1.3 Matrices
A matrix refers to an object that has more than one dimension in its data
structure, where each dimension has more than one component. It can be
viewed as a stack of vectors of some length. It again can have a physical meaning
depending on the type of physics problem. For example, the stiffness (and
mass) matrix created based on a discretized numerical model for a solid structure,
such as the FEM, has a two-dimensional (2D) structure. In each of the
dimensions, the number of components is the same as the number of DoFs. The
whole matrix is a kind of spatially distributed "stiffness" of the structure [1].
In machine learning models, the input data points in the feature space, and the
learning parameters in the hypothesis space, may be written in matrix form.
Again, we will not discuss much about the physical problem here.
Instead, we discuss general aspects of a matrix in the abstract, and issues with
the computational operations that we may perform on the matrix. Such an
abstract matrix can be represented as a multi-dimensional array, whose
shape in Python was defined in Chapter 2.
A = x.reshape((3, 5)) # Create a 2D matrix by reshaping
# a 1D array
print("A=", A, "\n A.T=", A.T)
A=
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14.]]
<NDArray 3x5 @cpu(0)>
A.T=
[[ 0. 5. 10.]
[ 1. 6. 11.]
[ 2. 7. 12.]
[ 3. 8. 13.]
[ 4. 9. 14.]]
<NDArray 5x3 @cpu(0)>
print(A.shape, A.T.shape)
(3, 5) (5, 3)
The shape or the dimension of the matrix A is 3 by 5, and that of its
transpose becomes (5, 3).
As discussed in Chapter 2 in detail, each of the components (or entries)
in the vector or matrix can be accessed by indexing or slicing.
print('A[1, 2] = ', A[1, 2]) # via index
print('row 2', A[2, :]) # slice the 3rd row (count from 0)
print('column 1', A[:, 1]) # slice the 2nd column
A[1, 2] =
[7.]
<NDArray 1 @cpu(0)>
row 2
[10. 11. 12. 13. 14.]
<NDArray 5 @cpu(0)>
column 1
[ 1. 6. 11.]
<NDArray 3 @cpu(0)>
The matrix can also be transposed.
A.T # A transpose, shape becomes 5 by 3
print(id(A), id(A.T))
2212173182960 2213649918832
It is found that, in MXNet, a transposed matrix has its own address (unlike a transposed vector).
3.1.4 Tensors
The term "tensor" requires some clarification. In mathematics or physics,
a tensor has a specific, well-defined meaning. It refers to structured data
(a single number, a vector, or a multi-dimensional matrix) that obeys
a certain tensor transformation rule under coordinate transformations.
Therefore, tensors are a very special group of structured data or objects, and
not all matrices can be called tensors; in fact, most of them are not. So long
as the tensor transformation rules are obeyed, tensors can be classified by
order: scalars are 0th-order tensors, vectors are 1st-order tensors, 2D
matrices are 2nd-order tensors, and so on.
Having said that, in the machine learning (ML) community, any
array with more than two dimensions is called a tensor. It can be viewed as
a stack of matrices of the same shape. This ML tensor carries a meaning
of big data that needs to be structured in high dimensions. The ML tensor
is now used as a general way of representing an array with an arbitrary
dimension or an arbitrary number of axes. ML tensors become
more convenient when dealing with, for example, images, which can have 3D
data structures, with axes corresponding to the height, the width, and the three
color (RGB) channels. In numpy, a tensor is simply a multidimensional array.
Because no such coordinate transformation is usually performed in
machine learning, there will be no possible confusion in our discussion
in this book. From now onwards, we will call the ML tensor a tensor,
with the understanding that it may not obey the real-tensor transformation
rules and that we do not perform such transformations in machine learning
programming.
We now use nd.arange() and then reshape() to create a 3D nd-array.
X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)
X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[[12. 13. 14. 15.]
[16. 17. 18. 19.]
[20. 21. 22. 23.]]]
<NDArray 2x3x4 @cpu(0)>
Element-wise operations are applicable to all tensors.
A = nd.arange(8).reshape((2, 4))
B = nd.ones_like(A)*8 # get shape of A, assign uniform entries
print('A =', A, '\n B =', B)
print('A + B =', A + B, '\n A * B =', A * B)
A =
[[0. 1. 2. 3.]
[4. 5. 6. 7.]]
<NDArray 2x4 @cpu(0)>
B =
[[8. 8. 8. 8.]
[8. 8. 8. 8.]]
<NDArray 2x4 @cpu(0)>
A + B =
[[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
<NDArray 2x4 @cpu(0)>
A * B =
[[ 0. 8. 16. 24.]
[32. 40. 48. 56.]]
<NDArray 2x4 @cpu(0)>
3.1.5 Sum and mean of a tensor
x = nd.arange(5)
print(x)
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>
nd.sum(x) # summation of all entries/elements: 0+1+2+3+4=10
[10.]
<NDArray 1 @cpu(0)>
X = nd.ones(15).reshape(3,5)
X
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
<NDArray 3x5 @cpu(0)>
nd.sum(X) # summation of all entries/elements
[15.]
<NDArray 1 @cpu(0)>
print(nd.mean(X), nd.sum(X)/X.size) # same as nd.mean()
[1.]
<NDArray 1 @cpu(0)>
[1.]
<NDArray 1 @cpu(0)>
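Sums and means can also be taken along a chosen axis. A minimal sketch using the 3 x 5 matrix X of ones created above:
print(nd.sum(X, axis=0))   # column sums -> a vector of length 5, all 3s
print(nd.sum(X, axis=1))   # row sums    -> a vector of length 3, all 5s
print(nd.mean(X, axis=0))  # column means -> a vector of length 5, all 1s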
3.1.6 Dot-product of two vectors
The dot-product may be one of the most, if not the most, widely used operations
in scientific computation, including machine learning. We discussed it
briefly in Section 2.9. Here, we shall discuss more on its use for vectors that
may have different data structures, resulting in some subtleties, as mentioned
in Section 2.9.13.
Given two vectors a and b, their dot-product is often written in linear
algebra as aᵀb or a · b. Essentially, it is just the sum of their element-wise
products, which results in a scalar. This implies that the shapes of a and b
must be compatible: both must have the same length. Let us see some examples.
a = nd.arange(5)
b = nd.ones_like(a) * 2 #This ensures the compatibility
print(f"a={a},a.shape={a.shape} \nb={b},b.shape={b.shape}")
printx('nd.dot(a, b)')
printx('nd.dot(a, b).shape')
print(f"np.dot(a, b)={np.dot(a.asnumpy(), b.asnumpy())}")
print(f"np.dot(a.T,b)={np.dot(a.asnumpy().T, b.asnumpy())}")
printx('np.dot(a.asnumpy(), b.asnumpy()).shape')
a=
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>,a.shape=(5,)
b=
[2. 2. 2. 2. 2.]
<NDArray 5 @cpu(0)>,b.shape=(5,)
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b).shape = (1,)
np.dot(a, b)=20.0
np.dot(a.T,b)=20.0
np.dot(a.asnumpy(), b.asnumpy()).shape = ()
Note that applying transpose() to this vector has no effect, because
transpose() in numpy swaps the axes of a 2D array. A numpy 1D array
has a shape of (n,) and hence no action can be taken. A numpy 1D array is
not treated as a matrix, as discussed in Section 2.9.13. When b is a column
vector, a special case of a 2D array, it has two axes like a matrix. The
dot-product a · b is the same as the matrix product ab (where a is defined as
a row vector and b as a column vector), in terms of the resulting scalar value.
Thus, in our formulation, we do not distinguish them mathematically, and
we often use the following equality.
a · b = ab (3.1)
In computations in numpy, however, there are some subtleties. The dot-product
of two (1D array) vectors gives a scalar, while the dot-product of a
(1D array) vector with a column vector gives a 1D array with the same
scalar as its sole element. In NDArray, such subtleties are not observed.
Readers may examine the following code carefully to make sense of this.
b_c = b.reshape(-1, 1) # convert a 1D array to a column vector
print(b_c, 'b_c.shape=', b_c.shape)
printx('np.dot(a.asnumpy(),b.asnumpy())') # np dot-product (scalar)
printx('np.dot(a.asnumpy(),b_c.asnumpy())') # np dot-product (array)
print(a.asnumpy()@b_c.asnumpy()) # np matrix-product (array)
printx('nd.dot(a, b)') # nd dot-product (array)
printx('nd.dot(a, b_c)') # nd dot-product (array)
printx('nd.dot(a, b_c).shape')
[[2.]
[2.]
[2.]
[2.]
[2.]]
<NDArray 5x1 @cpu(0)> b_c.shape= (5, 1)
np.dot(a.asnumpy(),b.asnumpy()) = 20.0
np.dot(a.asnumpy(),b_c.asnumpy()) = array([20.], dtype=float32)
[20.]
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c).shape = (1,)
As seen, all of these give the same scalar value, but in different data
structures.
The dot-product of two column vectors (special matrices) a and b of
equal length is written in linear algebra as aᵀb or bᵀa, which gives the same
scalar (but in a 2D array, i.e., a matrix with only one element).
a_c = a.reshape(-1, 1) # convert a 1D array to a column vector
print(a_c),
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy())') # scalar in 2D array
printx('np.dot(b_c.asnumpy().T,a_c.asnumpy())')
printx('np.dot(b_c.asnumpy().T,a_c.asnumpy()).shape')
[[0.]
[1.]
[2.]
[3.]
[4.]]
<NDArray 5x1 @cpu(0)>
np.dot(a_c.asnumpy().T,b_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()).shape = (1, 1)
To access the scalar value in a 2D array of shape (1, 1), simply use:
print(np.dot(a_c.asnumpy().T,b_c.asnumpy())[0][0])
20.0
One may use flatten() to convert a column vector back to a 1D array (row vector).
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())')
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()) = array([20.], dtype=float32)
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape = (1,)
To access the scalar value in a 1D array of shape (1,), simply use:
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0]')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0] = 20.0
One may use ravel() to convert a multidimensional array to a 1D array (in
this case no copy of the array is made).
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()')
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0]')
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel() = array([20.], dtype=float32)
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0] = 20.0
3.1.7 Outer product of two vectors
Given two vectors a and b, the outer product a⊗b becomes a matrix; in
its (i, j) position, the element is aᵢbⱼ. Thus, the shapes of a and b are always
compatible.
a = np.arange(3)
b = np.ones(5) * 2
print(a, b)
print('np.outer=\n',np.outer(a, b))
[0 1 2] [2. 2. 2. 2. 2.]
np.outer=
[[0. 0. 0. 0. 0.]
[2. 2. 2. 2. 2.]
[4. 4. 4. 4. 4.]]
A matrix (2D array) is created using two 1D arrays of arbitrary lengths,
with the help of the np.outer() function. One can achieve the same results
using the @ operator, but a needs to be a column vector with shape (n, 1)
and b needs to be a row vector with shape (1, m); a sketch is given below. Note
that although we may get the same results, using the built-in np.outer() is
recommended, because it is usually much faster and does not need additional
operations. This recommendation applies to all other similar situations.
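A minimal sketch of the exercise mentioned above, reshaping a into a column vector and b into a row vector before applying the @ operator:
a_col = a.reshape(-1, 1)   # column vector of shape (3, 1)
b_row = b.reshape(1, -1)   # row vector of shape (1, 5)
print(a_col @ b_row)       # the same matrix as np.outer(a, b)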
3.1.8 Matrix-vector product
When the dimensionality (shape) is compatible (or made compatible via
“broadcasting”), one can obtain a matrix-vector product, using the np.dot()
function.
A35 = nd.arange(15).reshape(3,5)
d5 = nd.ones(A35.shape[1]) # get the 2nd element of shape A35
print(A35,A35.shape, d5, d5.shape)
f = nd.dot(A35, d5) # shape compatible:[3,5]X[5]->vector
# of length 3
print(f,f.shape)
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14.]]
<NDArray 3x5 @cpu(0)> (3, 5)
[1. 1. 1. 1. 1.]
<NDArray 5 @cpu(0)> (5,)
[10. 35. 60.]
<NDArray 3 @cpu(0)> (3,)
#nd.dot(d5,A35) # shape error: [5]X[3,5] not compatible
d3 = nd.ones(A35.shape[0])
nd.dot(d3,A35) # this works:[3]X[3,5]-> vector of length 5
[15. 18. 21. 24. 27.]
<NDArray 5 @cpu(0)>
3.1.9 Matrix-matrix multiplication
Further, dot-product can also be used for matrix-matrix multiplications, as
long as the shape is compatible.
A23 = nd.ones(shape=(2, 3))
B35 = nd.ones(shape=(3, 5))
print(A23,B35)
nd.dot(A23, B35) # [2,3]X[3,5]: shape compatible ->[2,5]
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @cpu(0)>
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
<NDArray 3x5 @cpu(0)>
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
<NDArray 2x5 @cpu(0)>
#nd.dot(B35,A23) # this would give a shape error: [3,5]X[2,3] not compatible
In numpy, we have similar ways to perform matrix-matrix dot-product
operation.
import numpy as np
print('np.dot():\n',np.dot(A23.asnumpy(),B35.asnumpy()))
print('numpy @ operator:\n',A23.asnumpy() @ B35.asnumpy())
np.dot():
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
numpy @ operator:
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
Care is needed in dealing with matrix-vector and matrix-matrix operations,
because of the requirement of dimension compatibility. It is important to
always check the consistency of the dimensions of all the terms in the same
equation. Readers may need to struggle for a while to get used to it. Operations
between one-dimensional arrays in Python are rather simpler,
because there is only one dimension to check and we usually use only the
dot-product (inner product).
3.1.10 Norms
A norm is used to measure how "big" a vector or matrix is. There
are various types of norm measures, but they all produce a non-negative value.
The most often used, and the default, is the L2-norm: the
square root of the sum of the squared elements of the vector, matrix, or
tensor. For matrices, it is often called the Frobenius norm. The computation
is done by calling a norm() function:
d = nd.ones(9) # create an array (vector)
print(d,nd.sum(d))
printx('nd.norm(d)') # use nd.norm()
print(np.linalg.norm(d.asnumpy())) # use numpy linalg.norm()
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
<NDArray 9 @cpu(0)>
[9.]
<NDArray 1 @cpu(0)>
nd.norm(d) =
[3.]
<NDArray 1 @cpu(0)>
3.0
Note the difference: nd.norm() returns an NDArray, while
np.linalg.norm() gives a float.
#help(nd.norm) # when wondering, use this
print(nd.norm(A23),np.sqrt(6*1**2)) # nd.norm() for matrix,
# default L2
print(np.linalg.norm(A23.asnumpy())) # numpy linalg.norm()
# for matrix
[2.4494898]
<NDArray 1 @cpu(0)> 2.449489742783178
2.4494898
print(nd.norm(A23,ord=2, axis=1)) # nd.norm() for matrix,
# along axis 1
print(np.linalg.norm(A23.asnumpy(),ord=2, axis=1))
# numpy linalg.norm() for matrix
[1.7320508 1.7320508]
<NDArray 2 @cpu(0)>
[1.7320508 1.7320508]
The L1-norm of a vector is the sum of the absolute values of its elements.
The L1-norm of a matrix can be defined as the maximum of the
L1-norms of the column vectors of the matrix. For computing the L1-norm of a
vector, we use the following:
printx('nd.sum(nd.abs(d))') # use nd.norm() for vector
printx('nd.norm(d,1)')
nd.sum(nd.abs(d)) =
[9.]
<NDArray 1 @cpu(0)>
nd.norm(d,1) =
[9.]
<NDArray 1 @cpu(0)>
print(np.sum(np.abs(d.asnumpy()))) # numpy for vector
print(np.linalg.norm(d.asnumpy(),1)) # np.linalg.norm() for vector
9.0
9.0
print(np.linalg.norm(A23.asnumpy(),1)) # np.linalg.norm()
# for matrix
2.0
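To see that this value is indeed the maximum column sum of absolute values, a minimal sketch:
col_sums = np.sum(np.abs(A23.asnumpy()), axis=0)  # L1-norm of each column
print(col_sums, np.max(col_sums))                 # [2. 2. 2.] and the maximum, 2.0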
3.1.11 Solving algebraic system equations
We can use numpy.linalg.solve to solve a set of linear algebraic system
equations given as follows:
KD = F (3.2)
where K is a given positive-definite (PD) square matrix (stiffness matrix in
FEM, for example), F is a given vector (nodal forces in FEM), and D is the
unknown vector (the nodal displacements). The K matrix is symmetric
positive-definite (SPD) for well-posed FEM models [1].
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
K = np.array([[1.5, 1.], [1.5, 2.]]) # A square matrix
print('K:',K)
F = np.array([1, 1]) # one may try F = np.array([[2], [1]])
print('F:',F)
D = np.linalg.solve(K,F)
print('D:',D)
K: [[ 1.500 1.000]
[ 1.500 2.000]]
F: [1 1]
D: [ 0.667 0.000]
If one looks carefully, one should see that the input F is a 1D numpy array, and the
result is also a 1D array, which does not follow the convention of linear algebra, as
discussed in Section 2.9.13. One can also purposely define F as a column vector
(a 2D array with only one column), following the convention of linear algebra,
and get the solution. In this case, the returned solution has the
same values, but comes as a column vector.
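A minimal sketch of this exercise, defining F as a column vector:
F_col = np.array([[1], [1]])         # a 2D array with a single column
D_col = np.linalg.solve(K, F_col)
print(D_col, D_col.shape)            # same values as before, but with shape (2, 1)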
Note that solving linear algebraic system equations numerically can be
very time consuming and expensive, especially for large systems. With
the development of computer hardware and software in the past decades,
numerical algorithms for solving linear algebraic systems are well developed.
The most effective solvers for very large systems use iterative methods.
These convert the problem of solving algebraic equations into a minimization
problem, with a properly defined residual-error function as a cost or
loss function. A gradient-based algorithm, such as the conjugate gradient
method or Krylov-subspace methods, can then be used to minimize the residual error.
These methods are essentially the same as those used in machine learning.
The numpy.linalg.solve function uses routines from the widely used and efficient Linear
Algebra PACKage (LAPACK).
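A minimal sketch of such an iterative solution (assuming Scipy is installed; the small SPD matrix used here is hypothetical), using the conjugate gradient solver in scipy.sparse.linalg:
import numpy as np
from scipy.sparse.linalg import cg               # conjugate gradient solver
K_it = np.array([[4.0, 1.0], [1.0, 3.0]])        # a small SPD matrix
F_it = np.array([1.0, 2.0])
D_it, info = cg(K_it, F_it)                      # info = 0 indicates convergence
print(D_it, info)   # close to the direct solution np.linalg.solve(K_it, F_it)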
For a matrix that is not square, we shall use a least-squares solver
for the best solution in the sense of a minimized least-squares error. The function in
numpy is numpy.linalg.lstsq(). We will see examples later when discussing
interpolations.
3.1.12 Matrix inversion
Readers may notice that the solution to Eq. (3.2) can be written as
D = K⁻¹F    (3.3)
where K⁻¹ is the inverse of K. Therefore, if one can compute K⁻¹, the
solution is simply a matrix-vector product. Indeed, for small systems, this
approach does work, and is used by many. We now use numpy.linalg.inv()
to compute the inverse of a matrix.
from numpy.linalg import inv # import the inv() function
Kinv = inv(K)
print(Kinv)
[[ 1.333 -0.667]
[-1.000 1.000]]
print(np.allclose(np.dot(K, Kinv), np.eye(2)))
print(np.allclose(np.dot(Kinv, K), np.eye(2)))
True
True
The solution to Eq. (3.2) is obtained as follows:
D = np.dot(Kinv,F)
print('D:',D)
D: [ 0.667 0.000]
which is the same as the one obtained earlier.
Multiple matrices can be inverted at once.
a = np.array([[[1., 2.], [3., 4.]], [[1, 3], [3, 5]]])
print(a)
inv(a)
[[[ 1.000 2.000]
[ 3.000 4.000]]
[[ 1.000 3.000]
[ 3.000 5.000]]]
array([[[-2.000, 1.000],
[ 1.500, -0.500]],
[[-1.250, 0.750],
[ 0.750, -0.250]]])
a = np.array([[1.5, 1.], [1.5, 1.]])
# A singular matrix, because its
print(a) # two columns are parallel
#ainv = inv(a) # This would give a Singular matrix error
[[ 1.500 1.000]
[ 1.500 1.000]]
Note that computing numerically the inverse of a matrix of large size is
much more expensive, compared to solving the algebraic system equations.
Therefore, one would like to avoid the computation of inverse matrix. In
many cases, we can change the matrix inversion problem to a set of problems
of solving algebraic equations.
In machine learning computations, one may encounter matrices that
are singular, which do not have an inverse, leading to breakdown in
computations. More often, the matrices are nearly singular, which may allow
the computation to continue but can lead to serious error, showing as some
unexpected, strange behavior. When such behavior is observed, the chance
is high that the system matrix may be “bad-conditioned”, and one should
check for possible errors in the data or the formulation procedure that may
lead to the nearly singular system matrix. If the problem is rooted in the data
itself, one may need to clean up the data or check for data error. After this
is exhausted, one may resort to mathematical means, one of which is the use
of singular value decomposition (SVD) to get the best possible information
from the data. See later in this chapter on SVD.
The key point we would like to make here is that the most important
factor that controls whether we can solve an algebraic system equation for
quality solution is the property (or characteristics or condition) of the system
matrix. Therefore, studying the property of a matrix is of fundamental
importance.
Eigenvalues (if they exist) and their corresponding eigenvectors are the
characteristics of a matrix.
3.1.13 Eigenvalue decomposition of a matrix
A diagonalizable matrix can have an eigenvalue decomposition, which gives
a set of eigenvalues and the corresponding eigenvectors. The original matrix
can be decomposed into a diagonal matrix with eigenvalues at the diagonal
and a matrix consisting of eigenvectors. In particular, for real symmetric
matrices, eigenvalue decomposition is useful, and the computation can be
fast, because the eigenvalues are all real and the eigenvectors can be made
real and orthonormal. Consider a real symmetric square matrix A that is
positive-definite (PD). It has the following eigenvalue decomposition:
A = VΛVᵀ    (3.4)
where ᵀ stands for transpose, Λ is a diagonal matrix with the eigenvalues on the
diagonal, and matrix V is formed with the eigenvectors corresponding to the
eigenvalues. V is an orthonormal matrix, because
VVᵀ = I    (3.5)
which also implies that the inverse of V equals its transpose,
V⁻¹ = Vᵀ    (3.6)
In addition, once a matrix is decomposed, computing its inverse
becomes trivial, requiring only matrix multiplications. To see this, we start from
the definition of the inverse of a matrix, AA⁻¹ = I, and use Eqs. (3.4) and (3.6),
which leads to
A⁻¹ = VΛ⁻¹Vᵀ    (3.7)
Because the matrix is PD, its inverse exists and all the eigenvalues are
positive (hence nonzero). The inverse of the diagonal matrix Λ is simply
the same diagonal matrix with diagonal terms replaced by the reciprocals of
the eigenvalues.
Eigenvalue decomposition can be viewed as a special case of SVD. For
general matrices that we often encounter in machine learning, the SVD is
more widely used for matrix decomposition and will be discussed later in
this chapter, because it exists for all matrices.
In this section, let us see an example of how the eigenvalues and the
corresponding eigenvectors can be computed in Numpy.
import numpy as np
from numpy import linalg as lg # import linalg module
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) # Identity matrix
print('A=',A)
e, v = lg.eig(A)
print('Eigenvalues:',e)
print('Eigenvectors:\n',v)
A= [[1 0 0]
[0 1 0]
[0 0 1]]
Eigenvalues: [ 1.000 1.000 1.000]
Eigenvectors:
[[ 1.000 0.000 0.000]
[ 0.000 1.000 0.000]
[ 0.000 0.000 1.000]]
It is clearly seen that the identity matrix has three eigenvalues all of 1, and
their corresponding eigenvectors are three linearly independent unit vectors.
Let us look at a more general symmetric matrix.
A = np.array([[1, 0.2, 0], [0.2, 1, 0.5], [0, 0.5, 1]])
# Symmetric A
print('A:\n',A)
e, v = lg.eig(A)
print('Eigenvalues:',e, '\n Eigenvectors:\n',v)
A:
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
Eigenvalues: [ 1.539 1.000 0.461]
Eigenvectors:
[[-0.263 0.928 0.263]
[-0.707 -0.000 -0.707]
[-0.657 -0.371 0.657]]
We obtain three eigenvalues and their corresponding eigenvectors, and
they are all real numbers. These eigenvectors are orthonormal. To see this,
let us compute:
print(np.dot(v,v.T))
[[ 1.000 0.000 0.000]
[ 0.000 1.000 -0.000]
[ 0.000 -0.000 1.000]]
This means that these three eigenvectors are mutually orthogonal, and the
dot-product of each eigenvector with itself is unity. We are now ready
to recover the original matrix A, using these eigenvalues and eigenvectors.
print(A)
lamd = np.eye(3)*e
A_recovered = v@[email protected]
print(A_recovered)
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
It is clear that matrix A is recovered to within machine error. This
means that the information in matrix A is fully kept in its eigenvalues and
eigenvectors.
Next, we use the eigenvalues and eigenvectors to compute its inverse,
using Eq. (3.7).
lamd_inv = np.eye(3)/e
A_inv = v@[email protected]
print(A_inv)
[[ 1.056 -0.282 0.141]
[-0.282 1.408 -0.704]
[ 0.141 -0.704 1.352]]
which is the same as that obtained directly using the numpy.linalg.inv()
function:
from numpy.linalg import inv
print(inv(A))
[[ 1.056 -0.282 0.141]
[-0.282 1.408 -0.704]
[ 0.141 -0.704 1.352]]
Let us now compute the eigenvalues and eigenvectors of an asymmetric
matrix.
A = np.array([[1,-0.2, 0], [0.1, 1,-0.5], [0, 0.3, 1]])
# Asymmetric A
print('A:\n',A)
e, v = lg.eig(A)
print('Eigenvalues:',e, '\n Eigenvectors:\n',v)
A:
[[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]
[ 0.000 0.300 1.000]]
Eigenvalues: [1.+0.j 1.+0.4123j 1.-0.4123j]
Eigenvectors:
[[-0.9806+0.j 0. +0.3651j 0. -0.3651j]
[ 0. +0.j 0.7528+0.j 0.7528-0.j ]
[-0.1961+0.j -0. -0.5477j -0. +0.5477j]]
We now see one real eigenvalue, but the other two eigenvalues are
complex valued. These two complex eigenvalues are conjugates of each other.
Similar observations are made for the eigenvectors. We conclude that a real
asymmetric matrix can have complex eigenvalues and eigenvectors. Complex-valued
matrices shall in general have complex eigenvalues and eigenvectors.
A special class of complex-valued matrices, called Hermitian (self-adjoint)
matrices, has real eigenvalues. This example shows that the complex
space is algebraically closed, but the real space is not. An n by n real matrix
should have n eigenvalues (and eigenvectors), but they may not all be in the
real space. Some of them get into the complex space (which contains the real space
as a special case).
3.1.14 Condition number of a matrix
The condition number of a matrix is a measure of its "level" of singularity.
There are a number of norm options for computing the condition number,
but it is always larger than or equal to 1 for any measure. This implies that the
best possible condition number of a matrix is 1, which is the condition number of any unit
(identity) matrix, which has no (the lowest) singularity. Any other matrix shall have some
level of singularity. The condition number of a matrix with the highest level
of singularity is infinite. Let us see some examples.
from numpy import linalg as lg
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# A unit matrix
# Clearly, it has 3 eigenvalues of all 1.0
print(A, '\n Condition number of A=',lg.cond(A))
# or lg.cond(A,2)
# Option 2 -> L2-norm measure
[[1 0 0]
[0 1 0]
[0 0 1]]
Condition number of A= 1.0
Because matrix A is a unit matrix, we got a condition number of 1, as
expected. Another example is as follows:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
# A singular matrix
# It has 2 eigenvalues of all 1.0, and 1 eigenvalue of 0
print(A, '\n Condition number of A=',lg.cond(A))
[[1 0 0]
[0 1 0]
[0 0 0]]
Condition number of A= inf
Because matrix A is singular, its condition number is inf which is a numpy
number for infinity, as expected. More examples are as follows:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 10.]])
# Last entry is 10.
print(A, '\n Condition number of A=',lg.cond(A))
[[ 1.000 0.000 0.000]
[ 0.000 1.000 0.000]
[ 0.000 0.000 10.000]]
Condition number of A= 10.0
The condition number of this A is 10.0, which is 10.0/1.0. Again, the
condition number here is the ratio of the largest to the smallest eigenvalue
(for symmetric matrices; in general, it is the ratio of the largest to the smallest
singular value). We can now conclude that if the largest eigenvalue of a matrix is
very large or the smallest eigenvalue of the matrix is very small, the matrix
is likely nearly singular, depending on their ratio.
This finding implies that normalizing a matrix (which is often
done in machine learning) will not, in theory, change its condition number.
It may, however, help in reducing the loss of significant digits (because of the limited
precision of floating-point representation in computer hardware).
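A minimal sketch verifying these points, using the singular values (which equal the eigenvalues for this symmetric matrix) and a scaled copy of the matrix:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 10.]])
s = np.linalg.svd(A, compute_uv=False)    # singular values: [10., 1., 1.]
print(np.max(s)/np.min(s), lg.cond(A))    # both give 10.0
print(lg.cond(100.0*A))                   # scaling does not change it: still 10.0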
3.1.15 Rank of a matrix
We mentioned rank before. If a square matrix has full rank, all of its
columns (or rows) are mutually linearly independent, and such a matrix is
not singular. For a singular matrix, the rank is less than full, which is
called rank deficiency. One can further ask what the level of rank deficiency
is; the answer is the deficit in its rank. Essentially, if a matrix has
a rank of 2, we know that two linearly independent vectors can be formed
using the columns (or rows) of the matrix.
For a non-square matrix, the full rank is the number of its columns or
rows, whichever is smaller. Similarly, it can also have rank deficiency if the
rank is smaller than the full rank.
Let us now examine it in more detail using Numpy.
from numpy.linalg import matrix_rank, eig
print('Rank=',matrix_rank(np.eye(4))) # Identity matrix
Rank= 4
It is seen that the identity matrix has a shape of 4 × 4. It has a full rank.
A = np.array([[1,-0.2, 0], [0.1, 1,-0.5], [0.1, 1,-0.5]])
# singular A
print(A, '\n Rank=', matrix_rank(A))
[[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]
[ 0.100 1.000 -0.500]]
Rank= 2
This singular matrix has two linearly independent columns, and hence
a rank of 2. It has a rank deficiency of 1. Thus, it should also have one zero
eigenvalue, as shown below. If a matrix has a rank deficiency of n, it shall
have n zero eigenvalues. This is easily checked using Numpy.
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
eig(A)
(array([ 0.956, 0.544, -0.000]),
array([[ 0.955, -0.296, 0.088],
[ 0.209, -0.675, 0.438],
[ 0.209, -0.675, 0.894]]))
A = np.array([[1, -0.2, 0], [0.1, 1, -0.5]])
print('A:',A, '\n Rank=',matrix_rank(A))
A: [[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]]
Rank= 2
The matrix has only two rows, and a rank of 2. It has a full rank.
3.2 Rotation Matrix
For two-dimensional cases, the coordinate transformation (rotation) matrix
can be given as follows:
T = [ cos θ   −sin θ
      sin θ    cos θ ]    (3.8)
where T is the transformation (or rotation) matrix, and θ is the rotation
angle. A given vector (displacement, force, for example) can be written with
two components in the coordinate system as follows:
d = [u v] (3.9)
The new coordinates dθ of a vector rotated by θ can be computed
using the rotation matrix as dθ = Td.
import numpy as np
theta = 45 # Degree
thetarad = np.deg2rad(theta)
c, s= np.cos(thetarad), np.sin(thetarad)
T = np.array([[c, -s],
[s, c]])
print('Transformation matrix T:\n',T)
Transformation matrix T:
[[ 0.707 -0.707]
[ 0.707 0.707]]
d = np.array([1, 0]) # Original vector
T @ d # rotated by theta
array([ 0.707, 0.707])
T @ (T@d) # rotated by 2 thetas
array([-0.000, 1.000])
T @ T # 2 theta rotations
array([[ 0.000, -1.000],
[ 1.000, -0.000]])
T @ (T@(T @ T)) # 4 theta rotations
array([[-1.000, 0.000],
[ 0.000, -1.000]])
T@(T @ (T@(T @ (T @ (T@(T @ T)))))) # 8 theta rotations =
# no rotation
array([[ 1.000, 0.000],
[-0.000, 1.000]])
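Since T is an orthogonal matrix, its transpose is its inverse, so a rotation by −θ is obtained simply with T.T; a minimal sketch:
print(T.T @ T)        # the identity matrix (within round-off)
print(T.T @ (T @ d))  # rotating forward and then backward recovers d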
3.3 Interpolation
Interpolation is a frequently used numerical technique for obtaining
approximate data based on known data. Machine learning is, in some ways,
quite similar to interpolation. This section studies general issues related
to interpolation, using numpy. Interpolation is closely related to curve fitting.
We show here some examples of function interpolation and approximation,
using values given at discrete points in a space.
Let us try numpy.interp first. More descriptions can also be found in
Scipy documentation (https://docs.scipy.org/doc/numpy-1.13.0/reference/
generated/numpy.interp.html).
3.3.1 1-D piecewise linear interpolation using numpy.interp
# Data available
xn = [1, 2, 3] # data: given coordinates x
fn = [3, 2, 0] # data: given function values at x
# Query/Prediction f at a new location of x
x = 1.5
f = np.interp(x, xn, fn) # get approximated value at x
print(f'f({x:.3f})≈{f:.3f}')
f(1.500)≈2.500
np.interp(2, xn, fn) # Is it a data-passing interpolation?
2.0
np.interp([0, 1, 1.5, 2.72, 3.14], xn, fn) # querying at
# more points
array([ 3.000, 3.000, 2.500, 0.560, 0.000])
np.interp(4, xn, fn)
0.0
In practice, we know that interpolation can be a dangerous
operation, and hence extra care is required, especially when extrapolating.
To avoid, or at least be made aware of, extrapolation, one can set a warning value
to be returned when the query point falls outside the domain covered by the data.
out_of_domain = -109109109.0 # A warning number is used
print(np.interp(2.9, xn, fn,right=out_of_domain))
# print out the number when extrapolation occurs
print(np.interp(3.5, xn, fn,right=out_of_domain))
0.20000000000000018
-109109109.0
Interpolation using higher-order polynomials can be more accurate, but
can also cause bigger problems. Piecewise linear approximation has often been
found to be much safer, and it can be very effective when dense data are
available. Given below is an example using piecewise linear interpolation for
the approximation of a sine function.
import matplotlib.pyplot as plt # module for plot the results
x = np.linspace(0, 2*np.pi, 20) # data: x values
y = np.sin(x) # data: function values at x
xvals = np.linspace(0, 2*np.pi, 50) # generate dense x data
# at which the values are
# obtained via interpolation
yinterp = np.interp(xvals, x, y)
plt.plot(x, y, 'o') # plot the original data points
plt.plot(xvals, yinterp, '-x') # plot interpolated data points
plt.show() # show the plots
Figure 3.1: Fitted sine curve.
3.3.2 1-D least-square solution approximation
This is to fit a given set of data with a straight line in the x–y plane,
y = wx + b    (3.10)
We shall determine the gradient w and the bias b using the data pairs [xᵢ, yᵢ]. In
this example, Eq. (3.10) can be rewritten as
y = X · w    (3.11)
where X = [x, 1] and w = [w, b]. Now, we can use np.linalg.lstsq to solve
for w:
w_true,b_true = 1.0, -1.0 # used for generating data
x = np.array([0, 1, 2, 3]) # x value at which data
# will be generated
X = np.vstack([x, np.ones(len(x))]).T # Form the matrix of data
X
array([[ 0.000, 1.000],
[ 1.000, 1.000],
[ 2.000, 1.000],
[ 3.000, 1.000]])
y = w_true*x+b_true+np.random.rand(len(x))/1.0
# generate y data random noise added
print(y)
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
w, b
[-0.686 0.754 1.643 2.702]
(1.1055249814646126, -0.5550392342429096)
#help(np.linalg.lstsq) # to find out the details of this function.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, w*x + b, 'r', label='Fitted line')
plt.legend()
plt.show()
Figure 3.2: Least square approximation of data via a straight line.
We have, in fact, created the simplest machine learning model, known as
linear regression.
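The same least-square solution can also be obtained in closed form from the normal equations, w = (XᵀX)⁻¹Xᵀy; a minimal sketch for verification:
w_ne, b_ne = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
print(w_ne, b_ne)       # the same (w, b) as returned by np.linalg.lstsq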
The Scipy package offers efficient functions for machine learning computations,
including interpolation. Let us examine some examples that are available
at the Scipy documentation (https://docs.scipy.org/doc/scipy/reference/
tutorial/interpolate.html).
3.3.3 1-D interpolation using interp1d
import numpy as np
from scipy import interpolate
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
x0, xL = 0, 18
x = np.linspace(x0, xL, num=11, endpoint=True)
# x data points
y = np.sin(-x**3/8.0) # y data points
print('x.shape:',x.shape,'y.shape:',y.shape)
f = interp1d(x, y) # linear interpolation
f2 = interp1d(x, y, kind='cubic') # Cubic interpolation
# try also quadratic
xnew = np.linspace(x0,xL,num=41,endpoint=True)
# x prediction points
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
x.shape: (11,) y.shape: (11,)
Figure 3.3: Interpolation using Scipy.
3.3.4 2-D spline representation using bisplrep
x, y = np.mgrid[-1:1:28j, -1:1:28j] # x, y data grid
z = (x**2+y**2)*np.exp(-2.0*(x*x+y*y+x*y)) # z data
plt.figure()
plt.pcolor(x, y, z, shading='auto') # plot the initial data
plt.colorbar()
plt.title("Function sampled at discrete points")
plt.show()
Figure 3.4: Spline representation using bisplrep, coarse grids.
xnew, ynew = np.mgrid[-1:1:88j, -1:1:88j] # for view at grid
tck = interpolate.bisplrep(x, y, z, s=0) # B-spline
znew = interpolate.bisplev(xnew[:,0], ynew[0,:], tck) # z value
plt.figure()
plt.pcolor(xnew, ynew, znew, shading='auto')
plt.colorbar()
plt.title("Interpolated function.")
plt.show()
Figure 3.5: Spline representation using bisplrep, fine grids.
3.3.5 Radial basis functions for smoothing and interpolation
Radial basis functions (RBFs) are useful basis functions for approximation
of functions. RBFs are distance functions, and hence work well for irregular
grids (even randomly distributed points), in high dimensions, and are often
found less prone to overfitting. They are also used for constructing meshfree
methods [2]. In using Scipy, the choices of RBFs are as follows:
• “multiquadric”: sqrt((r/self.epsilon)**2 + 1)
• “inverse”: 1.0/sqrt((r/self.epsilon)**2 + 1)
• “gaussian”: exp(-(r/self.epsilon)**2)
• “linear”: r
• “cubic”: r**3
• “quintic”: r**5
• “thin plate”: r**2 * log(r).
The default is “multiquadric”.
First, let us look at one-dimensional examples.
import numpy as np
from scipy.interpolate import Rbf, InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
# Generate data
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
x = np.linspace(0, 10, 9)
print('x=',x)
y = np.sin(x)
print('y=',y)
# fine grids for plotting the interpolated data
xi = np.linspace(0, 10, 101)
# use fitpack2 method
ius=InterpolatedUnivariateSpline(x,y) # interpolation
# function
yi = ius(xi) # interpolated values at fine grids
plt.subplot(2, 1, 1) # have 2 sub-plots plotted together
plt.plot(x, y, 'bo') # original data points in blue dots
plt.plot(xi, np.sin(xi), 'r') # original function, red line
plt.plot(xi, yi, 'g') # Spline interpolated, green line
plt.title('Interpolation using univariate spline')
plt.show()
# use RBF method
rbf = Rbf(x, y)
fi = rbf(xi)
plt.subplot(2, 1, 2) # have 2 plots plotted together
plt.plot(x, y, 'bo') # original data points in blue dots
plt.plot(xi, np.sin(xi), 'r') # original function, red line
plt.plot(xi, fi, 'g') # RBF interpolated, green line
plt.title('Interpolation using RBF - multiquadrics')
plt.show()
x= [ 0.000 1.250 2.500 3.750 5.000 6.250 7.500 8.750
10.000]
y= [ 0.000 0.949 0.598 -0.572 -0.959 -0.033 0.938 0.625
-0.544]
Figure 3.6: Comparison of interpolation using spline and radial basis function (RBF).
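Should a different RBF be preferred, it can be selected via the function keyword of Rbf. The following is a minimal sketch (the sample data are redefined for self-containment, and the epsilon value is an assumption for illustration):
import numpy as np
from scipy.interpolate import Rbf
xs = np.linspace(0, 10, 9)                    # assumed sample points
ys = np.sin(xs)                               # sampled function values
rbf_g = Rbf(xs, ys, function='gaussian', epsilon=2.0)   # Gaussian RBF
print(rbf_g(np.linspace(0, 10, 5)))           # interpolated values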
We now examine some two-dimensional examples.
import numpy as np
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
from matplotlib import cm
# 2-d tests - setup scattered data
x = np.random.rand(108)*4.0-2.0
y = np.random.rand(108)*4.0-2.0
z = (x+y)*np.exp(-x**2-y**2+x*y)
di = np.linspace(-2.0, 2.0, 108)
XI, YI = np.meshgrid(di, di)
# use RBF https://docs.scipy.org/doc/scipy/reference/
# generated/scipy.
# interpolate.Rbf.html#scipy.interpolate.Rbf
rbf = Rbf(x, y, z, epsilon=2)
ZI = rbf(XI, YI)
# plot the result
plt.pcolor(XI, YI, ZI, cmap=cm.jet, shading='auto')
plt.scatter(x, y, 88, z, cmap=cm.jet)
plt.title('RBF interpolation - multiquadrics')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.colorbar();
Figure 3.7: Two-dimensional interpolation using RBF.
RBFs can be used for interpolation in N dimensions. Below is an
example in 3D.
from scipy.interpolate import Rbf
x, y, z, d = np.random.rand(4, 20)
# randomly generated data in 0~1
# 4 arrays (x, y, z, d) with 20 points each
#print(d) # original data
rbfi = Rbf(x, y, z, d) # RBF interpolator
xi = yi = zi = np.linspace(0, 1, 10)
di = rbfi(xi, yi, zi) # interpolated values
print(di)
[-0.378 -0.058 0.450 0.832 0.825 0.665 0.559 0.512
0.430 0.497]
3.4 Singular Value Decomposition
3.4.1 SVD formulation
Singular value decomposition (SVD) is an essential tool for many numerical
operations on matrices, including signal processing, statistics, and machine
learning. It is a general factorization of a matrix (real or complex) that
may be singular and of any shape. It is very powerful, because every such matrix has an SVD, which can be found numerically. It is a generalization of the eigenvalue decomposition, which works only for diagonalizable square matrices and was discussed earlier in this chapter.
A general (real or complex, square, or not square) m × p matrix A has
the following singular value decomposition:
A = UΣV∗   (3.12)
where ∗ stands for the Hermitian (conjugate) transpose of the matrix.
• U is an m × m unitary matrix.
• Σ is an m × p rectangular diagonal matrix with non-negative real numbers
on the diagonal entries.
• V is a p × p unitary matrix.
• The diagonal entries σi in Σ are known as the singular values of A.
• The columns of U are called the left-singular vectors of A.
• The columns of V are called the right-singular vectors of A.
More detailed discussions on SVD can be found at Wikipedia (https://
en.wikipedia.org/wiki/Singular value decomposition).
3.4.2 Algorithms for SVD
Computation of SVD for large matrices can be very expensive. The
often used SVD algorithm is based on the QR decomposition (https://en.
wikipedia.org/wiki/QR decomposition) and its variations. The basic idea is
to decompose the given matrix into an orthogonal matrix Q and an upper
triangular matrix R. Readers can refer to the Wikipedia page for more details
and the leads there on the related topic. Here, we discuss a simple approach
to compute SVD based on the well-established eigenvalue decomposition.
This approach is not used for practical numerical computation of SVD, because it forms a normal matrix whose condition number is squared, leading to numerical instability issues for large systems. For our theoretical analysis
and formula derivation, this is not an issue, and thus will be used here. For
this simple approach to work, we would need to impose some condition on
matrix A.
Consider a general m×p matrix A of real numbers with m>p, and assume
it has a rank of p. Such a matrix is often encountered in machine learning.
We first form a normal matrix B:
B = AᵀA   (3.13)
which will be a p × p symmetric square matrix (smaller in size). Therefore,
B will be orthogonally diagonalizable. Because matrix A has a rank of p, B
will also be symmetric-positive-definite (SPD). Thus, B has an eigenvalue decomposition, and we perform such a decomposition. The results can be written in the form of
B = Ve Λ Veᵀ   (3.14)
where Ve is a p × p orthonormal matrix of p eigenvectors of the B matrix.
Λ is a p × p square diagonal matrix. The diagonal entries are the eigenvalues
that are positive real numbers.
On the other hand, we know that matrix A has an SVD decomposition;
we thus also have
A = UΣVᵀ   (3.15)
Because matrix A has a rank of p, the singular values in Σ shall all be positive
real numbers. Using Eq. (3.15), we have
AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣUᵀUΣVᵀ = VΣ²Vᵀ = B   (3.16)
In the above derivation, we used the fact that U is unitary: UᵀU = I, and
Σ is diagonal (not affected by the transpose). Comparing Eq. (3.16) with
Eq. (3.14), we have
V = Ve   (3.17)
Σ = √Λ   (3.18)
Using now Eq. (3.15) and the orthonormal property of V: VᵀV = I, we
have
AV = UΣ (3.19)
Because all the diagonal entries of Σ are positive (Σ is invertible), we finally obtain
U = AVΣ−1 (3.20)
It is easy to confirm that the columns of U obtained this way are orthonormal. Finally, if A is rank deficient (rank(A) < p), matrix B will have zero eigenvalues. In such cases, we simply discard all the zero eigenvalues and their corresponding eigenvectors. This still gives us an SVD in a reduced form, and the process given above still holds.
Readers may derive a similar set of equations for an m × p matrix A of real numbers with p > m, assuming it has a rank of m.
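As a quick check of Eqs. (3.13)-(3.20), the following minimal sketch (with a small, randomly generated full-column-rank matrix, an illustration only) computes an SVD through the eigenvalue decomposition of B = AᵀA:
import numpy as np
np.random.seed(0)
A = np.random.randn(6, 3)               # m=6 > p=3, full column rank
B = A.T @ A                             # Eq. (3.13)
lam, Ve = np.linalg.eigh(B)             # Eq. (3.14), ascending order
idx = np.argsort(lam)[::-1]             # re-sort in descending order
lam, Ve = lam[idx], Ve[:, idx]
V = Ve                                  # Eq. (3.17)
Sigma = np.diag(np.sqrt(lam))           # Eq. (3.18)
U = A @ V @ np.linalg.inv(Sigma)        # Eq. (3.20)
print(np.allclose(A, U @ Sigma @ V.T))  # True: A is recovered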
The above analysis proves in theory that any matrix has an SVD. In
practical computations, we usually do not use the above procedure to
compute the SVD. This is because the condition number of matrix B is
squared, as seen in Eq. (3.13), leading to numerical instability. The practical
SVD algorithms often use the QR decomposition of A, which avoids forming
B. With the theoretical foundation, we now use some simple examples to
demonstrate the SVD process using Python.
3.4.3 Numerical examples
import numpy as np
a = np.random.randn(3, 6) # matrix with random numbers
print(a)
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
u, s, vh = np.linalg.svd(a, full_matrices=True)
print(u.shape, s.shape, vh.shape)
(3, 3) (3,) (6, 6)
print('u=',u,'\n','s=',s,'\n','vh=',vh)
u= [[-0.677 0.283 0.679]
[ 0.202 0.959 -0.198]
[-0.707 0.003 -0.707]]
s= [ 3.412 2.466 1.881]
vh= [[ 0.589 -0.318 -0.544 0.048 -0.158 0.479]
[-0.083 0.216 -0.652 0.227 0.606 -0.318]
[-0.128 -0.382 -0.180 -0.867 0.166 -0.159]
[-0.510 -0.736 -0.110 0.412 -0.115 -0.066]
[ 0.300 -0.343 0.476 0.115 0.723 0.171]
[-0.529 0.218 -0.083 -0.104 0.210 0.781]]
smat = np.zeros((3, 6))
smat[:3, :3] = np.diag(s)
print(smat)
[[ 3.412 0.000 0.000 0.000 0.000 0.000]
[ 0.000 2.466 0.000 0.000 0.000 0.000]
[ 0.000 0.000 1.881 0.000 0.000 0.000]]
np.allclose(a, np.dot(u, np.dot(smat, vh)))
# Is original a recovered?
True
print(a)
print(np.dot(u, np.dot(smat, vh)))
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
We note here that the SVD of a matrix keeps the full information of
the matrix: using all these singular values and vectors, one can recover the
original matrix. What if one uses only some of these singular values (and the
corresponding singular vectors)?
3.4.4 SVD for data compression
We can use SVD to compress data, by discarding some (often many) of these
singular values and vectors of the data matrix. The following is an example
of compressing an m × n array of image data.
from pylab import imshow,gray,figure
from PIL import Image, ImageOps
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
A = Image.open('../images/hummingbird.jpg') # open an image
print(np.shape(A)) # check the shape of A, it is a 3D tensor
A = np.mean(A,2) # get 2-D array by averaging RGB values
m, n = len(A[:,0]), len(A[1])
r = m/n # Aspect ratio of the original image
print(r,len(A[1]),A.shape,A.size)
fsize, dpi = 3, 80 # inch, dpi (dots per inch, resolution)
plt.figure(figsize=(fsize,fsize*r), dpi=dpi)
gray()
imshow(A)
U, S, Vh = np.linalg.svd(A, full_matrices=True)
print(U.shape, S.shape, Vh.shape)
# Recover the image
k = 20 # use first k singular values
S = np.resize(S,[m,1])*np.eye(m,n)
Compressed_A=np.dot(U[:,0:k],np.dot(S[0:k,0:k],Vh[0:k,:]))
#print(Compressed_A.shape,'Compressed_A=',Compressed_A)
plt.figure(figsize=(fsize,fsize*r), dpi=dpi)
gray()
imshow(Compressed_A)
(405, 349, 3)
1.160458452722063 349 (405, 349) 141345
(405, 405) (349,) (349, 349)
<matplotlib.image.AxesImage at 0x203185a7080>
Figure 3.8: Reproduced image using compressed data in comparison with the original
image.
It is clear that when k = 20 (out of 349) singular values are used, the
reconstructed image is quite close to the original one. Readers may estimate
how much the storage can be saved if one keeps 10% of the singular values
(and the corresponding singular vectors), assuming the reduced quality is
acceptable.
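As a rough sketch of such an estimate (an illustration, not from the book): keeping k singular values requires storing U[:, :k], S[:k], and Vh[:k, :], i.e., about k(m + n + 1) numbers instead of m × n.
m, n = 405, 349             # image size used above
k = int(0.1*n)              # keep about 10% of the singular values
full_storage = m*n
reduced_storage = k*(m + n + 1)
print(k, reduced_storage, full_storage, reduced_storage/full_storage)
# 34 25670 141345 0.18..., i.e., roughly 18% of the original storage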
3.5 Principal Component Analysis
Principal component analysis (PCA) is an effective technique to extract
features from datasets. It was invented in 1901 by Karl Pearson [3]. It is a pro-
cedure that converts a dataset of p possibly correlated variables (raw features) into a reduced set of variables (extracted features), using an orthogonal transformation. The principal components produced via a PCA are not
linearly correlated and are sorted by their variance values. Principal compo-
nents at the top of the sorted list account for higher variability in the dataset.
It is an effective way to reduce the dimension of feature spaces of datasets,
so that machine learning models such as the neural network can work more
effectively [4, 5]. Note that PCA is known to be sensitive to the relative
scaling of the original observations, see Wikipedia (https://en.wikipedia.
org/wiki/Principal component analysis#cite note-1) for more details.
PCA can be performed at least in two ways. One is to use a regression
approach which finds the set of orthogonal axes in an iterative manner.
The other is to use eigenvalue decomposition algorithms. The following
examples use the 2nd approach.
3.5.1 PCA formulation
Consider a general m × p matrix A of real numbers with m>p. We first form
a matrix B:
B = AᵀA   (3.21)
which is a p × p symmetric square matrix with reduced size. Thus, it will be
at least semi-positive-definite, and often SPD. We can perform an eigenvalue
decomposition to it, which gives
B = VΣVᵀ   (3.22)
These decomposed matrices are as follows:
• V is a p × p orthonormal matrix of p eigenvectors of the B matrix.
• Σ is a p×p square diagonal matrix. The diagonal entries are the eigenvalues
that are non-negative real numbers.
The PCA is then given as
APCA = AV   (3.23)
which is the projection of A on these p orthonormal eigenvectors. It has the
same shape as the original A that is m × p.
One may reconstruct A using the following formula, if using all the
eigenvectors:
Ar = APCA Vᵀ = AVVᵀ = A   (3.24)
This is because the eigenvectors are orthonormal. It is often the case that the first few eigenvectors (ranked by eigenvalue in descending order) contain most of the overall information of the original matrix A. In this
case, we can use only a small number of eigenvectors to reconstruct the A
matrix. For example, if we use k ≪ p eigenvectors, we have
Ar = APCA[0:m, 0:k] Vᵀ[0:k, 0:p] = A[0:m, 0:p] V[0:p, 0:k] Vᵀ[0:k, 0:p] ≈ A   (3.25)
This will, in general, not equal the original A, but can often be very close
to it. In this case, the storage becomes m × k + k × p which can be much
smaller than the original size of m × p. In Eq. (3.25), we used the Python
syntax, and hence it is very close to that in the Python code.
Note that if matrix A has dimensions of m < p, we simply treat its
transpose in the same way mentioned above.
One can also perform a similar analysis by forming a normal matrix B
using the following equation instead:
B = AAᵀ   (3.26)
which will be an m×m symmetric square matrix of reduced size. Assuming it
is at least semi-positive-definite, we can perform an eigenvalue decomposition
to it, which gives
B = VΣVᵀ   (3.27)
In this case, these decomposed matrices are as follows:
• V is an m × m orthonormal matrix of m eigenvectors of the B matrix.
• Σ is an m × m square diagonal matrix. The diagonal entries are the
eigenvalues that are non-negative real numbers.
The PCA is then given as
APCA = VᵀA   (3.28)
It has the same shape as the original A that is m×p. One may reconstruct A
using the following formula and all the eigenvectors (that are orthonormal):
Ar = V APCA = VVᵀA = A   (3.29)
We can use only a small number of eigenvectors to reconstruct the A matrix.
For example, if we use k ≪ m eigenvectors, we have
Ar = V[0:m, 0:k] APCA[0:k, 0:p] = V[0:m, 0:k] Vᵀ[0:k, 0:m] A[0:m, 0:p] ≈ A   (3.30)
Note that for large systems, we do not really form the normal matrix
B, perform eigenvalue decomposition, and then compute V numerically.
Instead, QR-decomposition-type algorithms are used. This is because
of the instability reasons mentioned in the beginning of Section 3.4.2.
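As a quick numerical check of the second formulation in Eqs. (3.26)-(3.30), the following minimal sketch uses a small random matrix (an illustration only; as noted above, practical codes avoid forming B):
import numpy as np
np.random.seed(0)
A = np.random.randn(4, 10)               # m=4 < p=10
B = A @ A.T                              # Eq. (3.26), m x m
lam, V = np.linalg.eigh(B)               # Eq. (3.27)
idx = np.argsort(lam)[::-1]              # descending eigenvalue order
lam, V = lam[idx], V[:, idx]
A_pca = V.T @ A                          # Eq. (3.28), same shape as A
A_r = V @ A_pca                          # Eq. (3.29): V V^T A = A
print(np.allclose(A, A_r))               # True when all m vectors are kept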
3.5.2 Numerical examples
3.5.2.1 Example 1: PCA using a three-line code
We show an example of PCA code with only three lines. It is from glowing-
python (https://glowingpython.blogspot.com/2011/07/principal-component-
analysis-with-numpy.html), with permission. It is inspired by the function
princomp of MATLAB's statistics toolbox and is quite easy to follow. We modified
the code to exactly follow the PCA formulation presented above.
import numpy as np
from pylab import plot,subplot,axis,show,figure
def princomp(A):
""" PCA on matrix A. Rows: m observations; columns:
p variables. A will be zero-centered and normalized
Returns:
coeff: eig-vector of A^T A. Row-reduced observations,
each column is for one principal component.
score: the principal component - representation of A in
the principal component space.Row-observations,
column-components.
latent: a vector with the eigenvalues of A^T A.
"""
# eigenvalues and eigenvectors of covariance matrix
# modified. It was:
# M = (A-np.mean(A.T,axis=1))
# [latent,coeff] = np.linalg.eig(np.cov(M))
# score = np.dot(coeff.T,M)
A=(A-np.array([np.mean(A,axis=0)])) # subtract the mean
[latent,coeff] = np.linalg.eig(np.dot(A.T,A))
score = np.dot(A,coeff) # projection on the new space
return coeff,score,latent
Let us test the code using a 2D dataset.
# A simple 2D dataset
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
Data = np.array([[2.4,0.7,2.9,2.5,2.2,3.0,2.7,1.6,1.8,1.1,
1.6,0.9],
[2.5,0.5,2.2,1.9,1.9,3.1,2.3,2.0,1.4,1.0,
1.5,1.1]])
A = Data.T # Note: transpose to have A with m>p
print('A.T:\n',Data)
coeff, score, latent = princomp(A) # change made. It was A.T
print('p-by-p matrix, eig-vectors of A:\n',coeff)
print('A.T in the principal component space:\n',score.T)
print('Eigenvalues of A, latent=\n',latent)
figure(figsize=(50,80))
figure()
subplot(121)
# every eigenvector describes the direction of a principal
# component.
m = np.mean(A,axis=0)
plot([0,-coeff[0,0]*2]+m[0], [0,-coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(Data[0,:],Data[1,:],'ob') # the data points
axis('equal')
subplot(122)
# New data produced using the scores
plot(score.T[0,:],score.T[1,:],'*g') # Note: transpose back
axis('equal')
show()
A.T:
[[ 2.40 0.70 2.90 2.50 2.20 3.00 2.70 1.60 1.80
1.10 1.60 0.90]
[ 2.50 0.50 2.20 1.90 1.90 3.10 2.30 2.00 1.40 1.00
1.50 1.10]]
p-by-p matrix, eig-vectors of A:
[[ 0.74 -0.67]
[ 0.67 0.74]]
A.T in the principal component space:
[[ 0.82 -1.79 0.98 0.49 0.26 1.66 0.90 -0.11 -0.37
-1.16 -0.45 -1.24]
[ 0.23 -0.11 -0.33 -0.28 -0.08 0.27 -0.12 0.40 -0.18 -0.01
0.03 0.20]]
Eigenvalues of A, latent=
[ 11.93 0.58]
<Figure size 3600x5760 with 0 Axes>
Figure 3.9: Data process with PCA.
3.5.2.2 Example 2: Truncated PCA
This example is a modified PCA based on the previous code. The test is
done for an image compression application. The code is from glowingpython
(https://glowingpython.blogspot.com/2011/07/pca-and-image-compression-
with-numpy.html), with permission.
import numpy as np
def princomp(A,numpc=0):
# computing eigenvalues and eigenvectors of covariance
# matrix A
A = (A-np.array([np.mean(A,axis=0)]))
# subtract the mean (along columns)
[latent,coeff] = np.linalg.eig(np.dot(A.T,A))
#was: A = (A-np.mean(A.T,axis=1)).T # subtract the mean
#was: [latent,coeff] = np.linalg.eig(np.cov(M))
p = np.size(coeff,axis=1)
idx = np.argsort(latent) # sorting the eigenvalues
idx = idx[::-1] # in descending order
# sorting eigenvectors according to eigenvalues
coeff = coeff[:,idx]
latent = latent[idx] # sorting eigenvalues
if numpc < p and numpc >= 0:
coeff = coeff[:,range(numpc)] # cutting some PCs
#score = np.dot(coeff.T,M) # projection on the new space
score = np.dot(A,coeff)
# projection of the data on the new space
return coeff,score,latent
The following code computes the PCA of matrix A, which is a color image. It first converts image A into gray scale. After the PCA is done, different reduced numbers of principal components are used to reconstruct the image.
from pylab import imread,subplot,imshow,title,gray,figure,
show,NullLocator
from ipykernel import kernelapp as app
from PIL import Image, ImageOps
%matplotlib inline
#A = Image.open('./images/hummingbirdcapsized.jpg')
A = Image.open('../images/hummingbird.jpg') # open an image
#A = ImageOps.flip(B) # flip it if so required
# or use A = imread('./images/hummingbirdcapsized.jpg')
A = np.mean(A,2) # to get a 2-D array
full_pc = np.size(A,axis=1)
# numbers of all the principal components
r = len(A[:,0])/len(A[1])
print(r,len(A[1]),A.shape,A.size)
i = 1
dist = []
figure(figsize=(11,11*r))
for numpc in range(0,full_pc+10,50): # 0 50 100 ... full_pc
coeff, score, latent = princomp(A,numpc)
print(numpc,'coeff, score, latent \n',
coeff.shape, score.shape, latent.shape)
Ar = np.dot(score,coeff.T)+np.mean(A,axis=0)
#was:Ar = np.dot(coeff,score).T+np.mean(A,axis=0)
# difference in Frobenius.norm
dist.append(np.linalg.norm(A-Ar,'fro'))
# showing the images reconstructed with no more than 250 PCs
if numpc <= 250:
ax = subplot(2,3,i,frame_on=False)
ax.xaxis.set_major_locator(NullLocator())
ax.yaxis.set_major_locator(NullLocator())
i += 1
imshow(Ar) #imshow(np.flipud(Ar))
title('PCs # '+str(numpc))
gray()
figure()
imshow(A) #imshow(np.flipud(A))
title('numpc FULL: '+str(len(A[1])))
gray()
show()
1.160458452722063 349 (405, 349) 141345
0 coeff, score, latent
(349, 0) (405, 0) (349,)
50 coeff, score, latent
(349, 50) (405, 50) (349,)
100 coeff, score, latent
(349, 100) (405, 100) (349,)
150 coeff, score, latent
(349, 150) (405, 150) (349,)
200 coeff, score, latent
(349, 200) (405, 200) (349,)
250 coeff, score, latent
(349, 250) (405, 250) (349,)
300 coeff, score, latent
(349, 300) (405, 300) (349,)
350 coeff, score, latent
(349, 349) (405, 349) (349,)
Figure 3.10: Images reconstructed using reduced PCA components, in comparison with
the original image.
We can see that 50 principal components give a pretty good quality image,
compared to the original one.
To assess the quality of the reconstruction quantitatively, we compute
the distance of the reconstructed images from the original one in the
Frobenius norm, for different numbers of eigenvalues/eigenvectors used in
the reconstruction. The results are plotted in Fig. 3.11, with the x-axis for
the number of eigenvalues/eigenvectors used. The sum of the eigenvalues is
plotted in the blue curve, and the Frobenius norm is plotted in the red curve.
The sum of the eigenvalues relates to the level of variance contribution.
from pylab import plot, axis, cumsum
figure()
perc = cumsum(latent)/sum(latent)
dist = dist/max(dist)
plot(range(len(perc)),perc,'b',range(0,full_pc+10,50), dist,'r')
axis([0,full_pc,0,1.1])
show()
Figure 3.11: Quality of the reconstructed images.
In practical computations, the QR decomposition can be used to compute the eigenvectors V, to avoid numerical instability, as discussed earlier for SVD.
3.6 Numerical Root Finding
Module scipy.optimize offers a function fsolve() to find roots of a set of
given nonlinear equations defined by f(x) = 0, starting from estimated locations of the roots. The fsolve() function is a wrapper around the algorithms in MINPACK
that uses essentially a variant of the Newton iteration method (https://en.
wikipedia.org/wiki/Newton%27s method), which finds the root using the
Figure 3.12: The Newton Iteration: The function is shown in blue and the tangent line at
local xi is in red. We see that xi gets closer and closer to the root of the function when the
number of iterations i increases (https://en.wikipedia.org/wiki/Newton%27s method#/
media/File:NewtonIteration Ani.gif) under the CC BY-SA 3.0. (https://creativecommons.
org/licenses/by-sa/3.0/) license.
function derivative to approximate the function locally. This process can
be easily viewed from the animation nicely made by Ralf Pfeifer shown in
Fig. 3.12.
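To illustrate the idea behind the iteration in Fig. 3.12, the following is a minimal hand-rolled Newton iteration (a sketch of the principle only, not the MINPACK algorithm used by fsolve()), applied to an assumed function f(x) = x**2 - 2:
def newton(f, dfdx, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / dfdx(x)      # tangent-line correction
        x -= step
        if abs(step) < tol:        # stop when the update is tiny
            break
    return x

root = newton(lambda x: x**2 - 2, lambda x: 2*x, x0=1.5)
print(root)                        # approximately 1.41421356 (sqrt(2))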
Let us define two functions, one with a single variable and another
with two variables, as examples to demonstrate how to find the roots. For
functions with a single variable x, we define the following function that is
often encountered in structural mechanics problems:
import numpy as np
def BeamOnFoundation(bz): # Deflection of a beam on foundation
return np.exp(-bz)*(np.sin(bz)+np.cos(bz))
def f(x): # function whose root to be found
return 2*BeamOnFoundation(x/2)-1-BeamOnFoundation(x)
from scipy.optimize import fsolve
starting_guess = 5 # specify estimated location of the root
x_root=fsolve(f, starting_guess)
print('x_root=',x_root)
np.isclose(f(x_root), 0) # check if f(x_root)=0.0.
x_root= [ 1.86]
array([ True])
Let us now consider a set of two functions with two variables.
def f2d(x):
return [x[0]*np.sin(x[1])-5,x[1]*x[0]-x[1]-8]
# x is an array function-1 function-2
x_roots = fsolve(f2d, [3, 2]) # specify 2 estimated roots
print('x_roots=',x_roots)
np.isclose(f2d(x_roots), [0.0, 0.0]) # check if f(x_root)=0.0.
x_roots= [ 5.25 1.88]
array([ True, True])
Note that, in general, a polynomial (or other algebraic equation) can have complex roots, even though its coefficients are all real. This is another case where the complex space is algebraically closed, but the real space is not. A polynomial of nth order has n roots (counting multiplicity), but they may not all be in the real space; some of them lie in the complex space.
3.7 Numerical Integration
Numerical integration is one of the routine operations in computations for
practical problems in sciences and engineering. This is because only simple
functions can be analytically integrated, and one has to resort to numerical
means for real-life problems. Different types of numerical integration tech-
niques have been developed in the past, and numpy made the computation
easy to implement and use. Our discussion on this topic starts from the
classical trapezoid rule that may be familiar to many readers. More reference
materials can be found from the Scipy.integrate documentation (https://
docs.scipy.org/doc/scipy/reference/tutorial/integrate.html) and notebook.
community (https://notebook.community/sodafree/backend/build/ipyth
on/docs/examples/notebooks/trapezoid rule).
3.7.1 Trapezoid rule
The trapezoid rule for definite integration uses the following formula:
$$\int_a^b f(x)\,dx \approx \frac{1}{2}\sum_{k=1}^{n_s} (x_k - x_{k-1})\,\big(f(x_k) + f(x_{k-1})\big). \qquad (3.31)$$
We define a simple polynomial function and sample it in a finite range [a, b] at ns equally spaced sampling points.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
def f(x):
return 9*x**3-8*x**2-7*x+6 # define a polynomial function
a, b, n = -1., 2, 400 # large n for plotting the curve
x = np.linspace(a, b, n) # x at n points in [a,b]
y = f(x) # compute the function values
The function is integrated over [a, b], by sampling a small number of
points.
ns = 6 # sample ns points for integration
xint = np.linspace(a, b, ns)
yint = f(xint)
Plot the function curve and the trapezoidal (shaded) areas below it.
plt.plot(x, y, lw=2) # plot the function as a line of width 2
#plt.axis([a, b, 0, 150]) # plot x and y axes
plt.fill_between(xint, 0, yint, facecolor='gray', alpha=0.4)
# plot the shaded area over which the integration is done
plt.text((a+b)/2,12,r"$\int_a^b f(x)dx$", horizontalalignment=\
'center',fontsize=15); # use \ to change line in code
Figure 3.13: Integration using the trapezoidal rule.
The trapezoid integration computes the shaded area. Thus, it is only an approximation, as shown.
from scipy.integrate import quad # quadrature (integration)
integral, error = quad(f, a, b)
# shall give the results and the error
integral_trapezoid=sum((xint[1:]-xint[:-1])*(yint[1:]+yint[:-1]))/2
# use the trapezoid formula
print("The results should be:", integral, "+/-", error)
print("The results by the trapezoid approximation with",len(xint),
"points is:", integral_trapezoid)
The results should be: 17.25 +/- 1.9775770133077287e-13
The results by the trapezoid approximation with 6 points is:
18.240000000000002
3.7.2 Gauss integration
Gauss integration (or quadrature) is regarded as one of the most effective
numerical integration techniques. It samples the integrand function at
specific points called the Gauss points and sums up these sampled function
values weighted by the Gauss weights for these points. It can produce exact
values (to machine accuracy) for the integration of a polynomial integrand,
because the Gauss point locations are the roots of the Legendre polynomials
defined in the natural coordinates in [−1, 1].
The Gauss integration is widely applied in numerical integration if the
fixed locations of sampling points are not a concern. It is a standard
integration scheme used in the FEM [1]. Here, we show an example using the
p_roots() function available in the Scipy module to find the roots of polynomials,
and then carry out the integration.
from pylab import *
from scipy.special.orthogonal import p_roots
def gauss(f,n,a,b):
[x,w] = p_roots(n+1) # roots of the Legendre polynomial
# and weights
G=0.5*(b-a)*sum(w*f(0.5*(b-a)*x+0.5*(b+a)))
# in natural coordinates
# sample the function values at these roots and sum up.
return G
def my_f(x):
return 9*x**3-8*x**2-7*x+6 # define a polynomial function
ng = 2
integral_Gauss = gauss(my_f,ng,a,b)
print("The results should be:", integral, "+/-", error)
print("The results by the trapezoid approximation with",
len(xint),"points is:", integral_trapezoid)
print("The results by the Gauss integration with", ng,
'Gauss points:', "points is:", integral_Gauss)
The results should be: 17.25 +/- 1.9775770133077287e-13
The results by the trapezoid approximation with 6 points is:
18.240000000000002
The results by the Gauss integration with 2 Gauss points is:
17.250000000000007
It is observed that the Gauss integration gives a much more accurate
solution with a much smaller number of sampling points. In fact, the solution
is exact (within the machine error) for this example because the integrand
is a polynomial of the order of 3. We need only 2 Gauss points to obtain the
exact solution. The general formula for polynomial integrands is ng = (n + 1)/2,
where n is the order of the polynomial integrand and ng is the number
of Gauss points needed to obtain the exact solution for the integral. Note
that when the trapezoid integration rule is used with 6 sampling points, the
solution is still quite far off.
For general complicated integrand functions, Gauss integration may not
give the exact solution. The accuracy, however, will still be much better
compared to the trapezoid rule or the rectangular rule (which we did not discuss, but it is very similar to the trapezoid rule). In other words, for solutions of similar accuracy, Gauss integration uses fewer sampling points.
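For convenience, Scipy also provides a ready-made fixed-order Gauss-Legendre rule, scipy.integrate.fixed_quad(). A minimal sketch using the same polynomial integrand (the function and limits are repeated here for self-containment):
from scipy.integrate import fixed_quad
val, _ = fixed_quad(lambda x: 9*x**3 - 8*x**2 - 7*x + 6, -1.0, 2.0, n=2)
print(val)   # about 17.25, matching the hand-coded Gauss integration above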
3.8 Initial Data Treatment
Finally, let us introduce techniques often used for initial treatment for
datasets. Consider a given training dataset X ∈ Xm×p . In machine learning
models, m is the number of data-points in the dataset, and p is the number of
feature variables. The values of the data are often in a wide range for real-life
problems. For numerical stability reasons, we usually perform normalization on the given dataset before feeding it to a model. Two techniques are mainly used: min-max feature scaling and standard scaling. Such a
scaling or normalization is also called transformation in many ML modules.
3.8.1 Min-max scaling
The formulation for min-max scaling is given as follows:
$$X_{\rm scaled} = \frac{X - X.\min({\rm axis}=0)}{X.\max({\rm axis}=0) - X.\min({\rm axis}=0)} \qquad (3.32)$$
where X.min and X.max will be (row) vectors, and we used the Python
syntax of broadcasting rules and element-wise divisions. This would bring
all values for each feature into [0, 1] range. A more generalized formula that
can bring these values to an arbitrary range of [a, b] is given as follows.
$$X_{\rm scaled} = a + (b - a)\,\frac{X - X.\min({\rm axis}=0)}{X.\max({\rm axis}=0) - X.\min({\rm axis}=0)} \qquad (3.33)$$
Here, we again used Python syntax so that scalars, vectors, and matrices all appear in the same formula, as in the sketch below.
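A minimal sketch of Eq. (3.33), with an assumed target range [a, b] = [-1, 1] and a small assumed array:
import numpy as np
a, b = -1.0, 1.0                               # assumed target range
X = np.array([[-1., 2., 8.],                   # a small assumed array
              [ 2.5, 6., 1.5],
              [ 3., 11., -6.],
              [21., 7., 2.]])
X_ab = a + (b - a)*(X - X.min(axis=0))/(X.max(axis=0) - X.min(axis=0))
print(X_ab)                                    # all entries now lie in [-1, 1]
The same can be achieved with Sklearn's MinMaxScaler by setting its feature_range argument, for example MinMaxScaler(feature_range=(-1, 1)).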
Once such a scaling transformation of the training dataset is done, X.min and X.max can be used to perform exactly the same transformation on the
testing dataset to ensure consistency for proper predictions.
The following is a simple code to perform min-max scaling using
Eq.(3.32).
np.set_printoptions(precision=4)
X = [[-1, 2, 8], # an assumed toy training dataset
[2.5, 6, 1.5], # with 4 samples, and 3 features
[3, 11, -6],
[21, 7, 2]]
print(f"Original training dataset X:\n{X}")
X = np.array(X)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"Scaled training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{X.max(axis=0)}")
print(f"Minimum values for each feature:\n{X.min(axis=0)}")
Original training dataset X:
[[-1, 2, 8], [2.5, 6, 1.5], [3, 11, -6], [21, 7, 2]]
Scaled training dataset X:
[[0. 0. 1. ]
[0.1591 0.4444 0.5357]
[0.1818 1. 0. ]
[1. 0.5556 0.5714]]
Maximum values for each feature:
[21. 11. 8.]
Minimum values for each feature:
[-1. 2. -6.]
We can now perform the same transformation to the testing dataset using
X.min and X.max of the training dataset.
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = (Xtest - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"Scaled corresponding testing dataset Xtest:\n{Xt_scaled}")
Scaled corresponding testing dataset Xtest:
[[-0.0455 0.1111 0.9286]
[ 0.2727 0.2222 0.8214]]
The inverse transformation can be done with ease.
X_back = X_scaled*(X.max(axis=0) - X.min(axis=0))+ X.min(axis=0)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = Xt_scaled*(X.max(axis=0) - X.min(axis=0))+ X.min(axis=0)
print(f"Back transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
It is clearly seen that the min-max scaling does no harm to the dataset.
One can get it back as needed.
The same min-max scaling can be done using Sklearn.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X) # fit with the training dataset
X_scaled = scaler.transform(X) # perform the scaling transformation
print(f"Scaled training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{scaler.data_max_}")
print(f"Minimum values for each feature:\n{scaler.data_min_}")
Scaled training dataset X:
[[0. 0. 1. ]
[0.1591 0.4444 0.5357]
[0.1818 1. 0. ]
[1. 0.5556 0.5714]]
Maximum values for each feature:
[21. 11. 8.]
Minimum values for each feature:
[-1. 2. -6.]
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = scaler.transform(Xtest)
print(f"Scaled corresponding testing dataset:\n{Xt_scaled}")
Scaled corresponding testing dataset:
[[-0.0455 0.1111 0.9286]
[ 0.2727 0.2222 0.8214]]
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
3.8.2 “One-hot” encoding
Many ML datasets use categorical features. For example, a color variable may
have values of “red”, “green”, and “blue”. These values must be converted to numerical values for building an ML model. Consider a single-column feature vector given originally as [[green], [red], [0], [blue]]; one can simply encode this feature vector as [[1], [2], [0], [3]], where the integers are arbitrary but distinct. The treatment of a dataset coded in this manner is the same as for any ordinary dataset discussed before. However, this implies that the colors carry value significance, which may not be what we want.
To avoid such a problem, we often use the so-called “one-hot” encoding.
The single-column dataset is then encoded into a matrix X with three columns, as shown in the code below. Thus, one-hot encoding results in a significant increase in the number of feature columns, so that the features can all be made unique and the categories are not given any value significance. A minimal encoding sketch is given first, after which we scale such a dataset.
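The following is a minimal sketch of such an encoding with plain numpy (the category names and their order are hypothetical, for illustration only):
import numpy as np
colors = ['green', 'red', 'blue', 'green']     # a raw categorical column
categories = ['red', 'green', 'blue']          # fixed category order
one_hot = np.array([[1 if c == cat else 0 for cat in categories]
                    for c in colors])
print(one_hot)     # each row has a single 1 marking its category
Sklearn's OneHotEncoder (in sklearn.preprocessing) provides the same functionality for larger datasets.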
# red green blue
X = [[0, 1, 0 ], # a 'one-hot' training dataset
[1, 0, 0 ],
[0, 0, 0 ],
[0, 0, 1 ]]
scaler.fit(X)
print(f"Original 'one-hot' training dataset X:\n{X}")
X_scaled = scaler.transform(X) # perform the scaling transformation
print(f"Scaled 'one-hot' training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{scaler.data_max_}")
print(f"Minimum values for each feature:\n{scaler.data_min_}")
Original 'one-hot' training dataset X:
[[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
Scaled 'one-hot' training dataset X:
[[0. 1. 0.]
[1. 0. 0.]
[0. 0. 0.]
[0. 0. 1.]]
Maximum values for each feature:
[1. 1. 1.]
Minimum values for each feature:
[0. 0. 0.]
It is seen that the min-max scaling has not changed anything in the one-hot dataset, as expected, since its entries carry no value significance.
3.8.3 Standard scaling
When the dataset has a distribution that is close to the normal distribution,
one can use the standard scaling. The formulation for the standard scaling
is given as follows.
$$X_{\rm scaled} = \frac{X - X.{\rm mean}({\rm axis}=0)}{X.{\rm std}({\rm axis}=0)} \qquad (3.34)$$
The following is a simple code to perform standard scaling using Eq. (3.34).
X = [[-1, 2, 8], # an assumed toy training dataset
[2.5, 6, 1.5],
[3, 11, -6],
[21, 7, 2]]
print(f"Original training dataset X:\n{X}")
X = np.array(X)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(f"Standard scaled training dataset X:\n{X_scaled}")
print(f"Mean value for each feature:\n{X.mean(axis=0)}")
print(f"Standard deviation for each feature:\n{X.std(axis=0)}")
Original training dataset X:
[[-1, 2, 8], [2.5, 6, 1.5], [3, 11, -6], [21, 7, 2]]
Standard scaled training dataset X:
[[-0.8592 -1.4056 1.3338]
[-0.4515 -0.1562 0.0252]
[-0.3932 1.4056 -1.4848]
[ 1.7039 0.1562 0.1258]]
Mean value for each feature:
[6.375 6.5 1.375]
Standard deviation for each feature:
[8.5832 3.2016 4.9671]
Note that the scaled values are not confined to [−1, 1]; they approximately follow a standard normal distribution (zero mean and unit standard deviation for each feature). We can now perform the same transformation on the corresponding testing dataset using the mean and standard deviation of the training dataset.
Xtest = [[-2, 3, 7], # an assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = (Xtest - X.mean(axis=0)) / X.std(axis=0)
print(f"Standard scaled corresponding testing dataset Xtest:
\n{Xt_scaled}")
Standard scaled corresponding testing dataset Xtest:
[[-0.9757 -1.0932 1.1325]
[-0.1602 -0.7809 0.8305]]
The inverse transformation can be done with ease.
X_back = X_scaled*X.std(axis=0) + X.mean(axis=0)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = Xt_scaled*X.std(axis=0) + X.mean(axis=0)
print(f"Back transformed corresponding testing dataset Xtest:
\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed corresponding testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
It is clearly seen that the standard scaling does no harm to the dataset.
One can get it back as needed.
The same standard scaling can be done using Sklearn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # create an instance
scaler.fit(X)
X_scaled = scaler.transform(X)
print(f"Scaled dataset X:\n{X_scaled}")
print(f"Mean values for each feature:\n{scaler.mean_}")
print(f"Standard deviations for each feature:\n{np.sqrt(scaler.var_)}")
Scaled dataset X:
[[-0.8592 -1.4056 1.3338]
[-0.4515 -0.1562 0.0252]
[-0.3932 1.4056 -1.4848]
[ 1.7039 0.1562 0.1258]]
Mean values for each feature:
[6.375 6.5 1.375]
Standard deviations for each feature:
[8.5832 3.2016 4.9671]
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = scaler.transform(Xtest)
print(f"Scaled corresponding testing dataset Xtest:\n{Xt_scaled}")
Scaled corresponding testing dataset Xtest:
[[-0.9757 -1.0932 1.1325]
[-0.1602 -0.7809 0.8305]]
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
Note that the same scaling can be applied to the labels in the training dataset, if they are not probability-distribution types of data. When performing testing on the trained model, or making predictions with it, the predicted labels should be scaled back to the original data units.
Also, it is good practice to take a look at the distribution of the data-points. This is usually done after scaling so that the range of the data-points is normalized. One may simply plot the so-called kernel density estimate (KDE) using, for example, seaborn.kdeplot(), as sketched below.
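A minimal sketch (with a stand-in column of scaled values, not a dataset from this book):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
x_col = np.random.randn(200)    # stand-in for one scaled feature column
sns.kdeplot(x_col)              # kernel density estimate of the column
plt.show()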
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-Heinemann, London, 2013.
[2] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[3] K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical
Magazine, 2(1), 559–572, 1901.
[4] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., Tubenet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[5] Shuyong Duan, Zhiping Hou, G.R. Liu et al., A novel inverse procedure via creating
tubenet with constraint autoencoder for feature-space dimension-reduction, Interna-
tional Journal of Applied Mechanics, 13(08), 2150091, 2021.
Chapter 4
Statistics and Probability-based Learning Model
This chapter discusses some topics of probability and statistics related to
machine learning models and the computation techniques using Python.
Referenced materials include codes from Numpy documentation (https://
numpy.org/doc/), Jupyter documentation (https://jupyter.org/), and
Wikipedia (https://en.wikipedia.org/wiki/Main Page). Codes from mxnet-
the-straight-dope (https://github.com/zackchase/mxnet-the-straight-dope)
are also used under the Apache-2.0 License.
Building a machine learning model is mostly for prediction, classification,
or identification, based on the data available and the knowledge about the
data. Predictions can be deterministic and probabilistic. We often want to
predict the probability of the occurrence of an event, which can be very
useful and more practical for some problems.
For example, for aircraft maintenance, the engineers might want to assess
how likely it is for the engine of the aircraft to get into an unhealthy state,
based on records and/or diagnostic data. For a doctor, he/she may want
to predict the possibility of a patient having a critical illness in the next
period of time, based on the patient’s health records, diagnostic data, and
the current health environment. Health care organizations want to predict
the likelihood of the occurrence of a pandemic. For all these types of tasks,
we need to resort to means of quantifying the probability of the occurrence of
the event. It can be a complicated topic of study and research, and machine
learning models may help.
This chapter focuses on some of the basic concepts, theories, formulations,
and computational techniques that we may need to build machine learning
models using probability and statistics. At the end of this chapter, we will
introduce a Naive Bayes classification model.
4.1 Analysis of Probability of an Event
4.1.1 Random sampling, controlled random sampling
In machine learning, one often needs to sample numbers in a random manner.
This can be done numerically. In Python, we import the random module to do so. Its functions, such as random(), can then be used to generate "simulated" random numbers.
First, let us use a code to produce random integers. We shall use
random.randint() for this, which samples numbers uniformly in a given
range.
# help(random.randint) # check it out
import random # random module
na, nb, n = 1, 100, 5 # n integers in na~nb
for i in range(n): # First, generate n
print(random.randint(na,nb),' ',end ='') # random integers
# in na~nb
print('\n')
for i in range(n): # Generate again
print(random.randint(na,nb),' ',end ='') # n random integers
86 61 35 81 40
4 92 95 31 70
We generated 5 random integers twice. These generated numbers are "random" in the sense that the two identical calls produced two different sets of numbers. One can execute the above cell multiple times and find that a different set of numbers is generated each time.
Now, let us redo the same, but this time we use random.seed() to specify
the same seed for each of the generations.
random.seed(1) # seed value 1 for random number generation
for i in range(n): # Generate n random integers
print(random.randint(na, nb),' ',end ='')
print('\n')
random.seed(1) # The same seed value (try also seed(2))
for i in range(n):
print(random.randint(na, nb),' ',end ='')
18 73 98 9 33
18 73 98 9 33
We see now that the same set of numbers is generated, which is some kind
of controlled random sampling by a seed value. The use of random.seed()
may confuse many beginners, but the above example should eliminate the confusion. Function random.seed() is used just to ensure repeatability when the code is rerun, which is important for reproducible code development. We will use it quite frequently.
Also, we see the fact that random numbers generated by a computer
are not entirely random and are controllable to a certain degree. Naturally, this should be the case, because any (classical) computer is deterministic in nature.
This pseudo-random feature is useful: when we study a probability event,
we make use of the randomness of random.randint() or random.random().
When we want our study and code to be repeatable, we make use of
random.seed().
Note that the seed value of 1 can be changed to any other number,
and with a different seed value used, a different set of random numbers
is generated.
Let us now generate real numbers.
#random.seed(1) # seed for random number generation
n = 5
for i in range(n): #generates n random real numbers
print(random.random())
0.11791870367106105
0.7609624449125756
0.47224524357611664
0.37961522332372777
0.20995480637147712
It is seen that real numbers are generated in between 0 and 1. It is
produced by generating a random integer first using random.randint() and
then dividing it by its maximum range. The reader may switch on and off
random.seed(1) or change the seed value to see the difference.
4.1.2 Probability
Probability is a numerical measure of the likelihood of the occurrence of an
event or a prediction. Assume, for example, the probability of the failure of
a structure is 0.1. We can then denote it mathematically as
Pr(failure = “yes”) = 0.1   (4.1)
In this case, there is only one random variable that takes two possible discrete
values: “yes” with probability of 0.1, and “no” with probability of 0.9. Such
a distribution of a random variable is known as the Bernoulli distribution.
For general events, there may be more possible discrete random variables
and random variables with continuous distributions. Statistics studies the
techniques for sampling, interpreting, and analyzing the data about an event.
Machine learning is based on a dataset available for an event, and thus
statistical analysis helps us to make sense of a dataset and hopefully produces
a prediction in terms of probability.
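As a minimal sketch of the Bernoulli event in Eq. (4.1) (with the assumed failure probability of 0.1), one can simulate it numerically and estimate the probability from samples:
import numpy as np
np.random.seed(1)
samples = np.random.binomial(n=1, p=0.1, size=10000)   # 1 = "yes", 0 = "no"
print(samples.mean())     # close to 0.1 for a large number of samples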
We use Python to perform statistical analysis to datasets. We first
import necessary packages, including the MXNet packages (https://gluon.
mxnet.io/).
import numpy as np # numpy package, give an alias np
import mxnet as mx # mxnet package, give an alias mx
from mxnet import nd # ndarray class from mxnet package
mx.random.seed(1) # seed for random number generation
# for repeatability of the code
Let us consider a simple event: tossing a die that has six identical surfaces,
each of which is marked with a unique digit number, from 1 to 6. In this
case, the random variable can take 6 possible discrete values. Assume that
such markings do not introduce any bias (fair die), and do not affect in any
way the outcome of a tossing. We want to know the probability of getting
a particular number on the top surface, after a number of tossings. One can
then perform “numerical” experiments: tossing the die a large number of times virtually in a computer and counting the times that the number shows
on the top surface. We use the following code to do this:
pr = nd.ones(6)/6 # probability distribution for a
# number on top. A total of 6
# values. Assume they have the
# same Pr (uniform distribution)
print(pr)
n_top_array = nd.sample_multinomial(pr,shape=(1))
# toss once using the
# sample_multinomial() function
print('The number on top surface =', n_top_array)
[0.16666667 0.16666667 0.16666667 0.16666667 0.16666667
0.16666667]
<NDArray 6 @cpu(0)>
The number on top surface =
[3]
<NDArray 1 @cpu(0)>
For this problem, we know (assumed) that the theoretical or the “true”
probability for a number showing on the top surface is 1/6 ≈ 0.1667.
The one-time toss above gives an nd-array with just one entry that is the
number on the top surface of the die. To obtain a probability, we shall toss
for many times for statistics to work. This is done by simply specifying the
length of the nd-array in the handy nd.sample multinomial() function.
n_surfaces = 6 # number of possible values
n_tosses = 18 # number of tosses
mx.random.seed(1)
toss_results = nd.sample_multinomial(pr, shape=(n_tosses))
# toss n_tosses times
print("Tossed", n_tosses,'times.')
print("Toss results", toss_results)
Tossed 18 times.
Toss results
[3 0 0 3 1 4 4 5 3 0 0 2 2 2 4 2 4 4]
<NDArray 18 @cpu(0)>
This time, we tossed 18 times, resulting in an nd-array with 18 entries.
Note that if mx.random.seed(1) is not used in the above cell, we would get a
different array each tossing, because of the random nature. For this controlled
tossing (that readers can repeat) with random.seed(1), we got 1 “5” in 18
times of tossing. We thus have Pr(die=“5”) = 1/18. We got 3 “3”s, which
gives Pr(die=“3”) = 3/18=1/6, and so on. The values of Pr(die=“5”) and
Pr(die=“3”) are quite far apart. Let us toss some more times.
n_t = 20
print(nd.sample_multinomial(pr, shape=(n_t))) #toss n_t times
[2 3 5 4 0 0 1 1 0 2 3 0 0 0 2 4 5 2 0 4]
<NDArray 20 @cpu(0)>
It is difficult to count and calculate the probabilities manually. Let us use
the following code available at the mxnet site to do so:
# The code is modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; # Under Apache-2.0 License.
#
n_tosses = 2000 # number of tosses
toss_results = nd.sample_multinomial(pr, shape=(n_tosses))
# toss, record the results
record=nd.zeros((n_surfaces,n_tosses)) # count the event (tossing)
# results: times of each of 6 surfaces appearing on top
n_digit_number = nd.zeros(n_surfaces) # Initial with zeros for an
# array to hold the probability of on-top
# appearances for each of the 6 numbers
for i, digit_number in enumerate(toss_results):
n_digit_number[int(digit_number.asscalar())] += 1
# counts and put in the
# corresponding place.
record[:,i] = n_digit_number # records the results
n_digit_number[:]=n_digit_number/n_tosses # compute the Pr
print('Total number of tosses:',n_tosses)
print('Probability of each of the 6 digits:',n_digit_number)
print('Theoretical (true) probabilities:',pr)
Total number of tosses: 2000
Probability of each of the 6 digits:
[0.1675 0.1865 0.1705 0.1635 0.15 0.162 ]
<NDArray 6 @cpu(0)>
Theoretical (true) probabilities:
[0.16667 0.16667 0.16667 0.16667 0.16667 0.16667]
<NDArray 6 @cpu(0)>
We see the probability values for all the 6 digits getting closer to the
theoretical or true probability.
import numpy as np
np.set_printoptions(suppress=True)
print(record) # print out the records
[[ 0. 0. 0. ... 333. 334. 335.]
[ 0. 0. 1. ... 373. 373. 373.]
[ 0. 1. 1. ... 341. 341. 341.]
[ 1. 1. 1. ... 327. 327. 327.]
[ 0. 0. 0. ... 300. 300. 300.]
[ 0. 0. 0. ... 324. 324. 324.]]
<NDArray 6x2000 @cpu(0)>
We now normalized the data, which we often do in machine learning, by
the total number of tosses using the following codes:
x = nd.arange(n_tosses).reshape((1,n_tosses)) + 1
#print(x)
observations = record / x # Pr of 6 digits for all tosses
print(observations[:,0]) # observations for 1st toss
print(observations[:,10]) # for first 10 toss
print(observations[:,999]) # for first 1000 toss
[0. 0. 0. 1. 0. 0.]
<NDArray 6 @cpu(0)>
[0.181819 0.272728 0.090909 0.181819 0.090909 0.181819]
<NDArray 6 @cpu(0)>
[0.175 0.185 0.16 0.164 0.144 0.172]
<NDArray 6 @cpu(0)>
This simple experiment gives us 1,000 observations for six possible values of a uniform distribution (any of the 6 digits has an equal chance to land on top). When the probability of the appearance of each of the six surfaces of the die is computed after 1,000 tosses, we get roughly 0.14 to 0.19. These probabilities will change a little each time we run the experiment because of the random nature. If we were to do 10,000 tosses for each experiment, we would get all probabilities quite close to the theoretical value of 1/6 ≈ 0.1667. Readers can try this very easily using the code given above.
Let us now plot the “numerical” experimental results. For this, we use
matplotlib library.
# The code is modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(observations[0,:].asnumpy(),label="Observed P(die=1)")
plt.plot(observations[1,:].asnumpy(),label="Observed P(die=2)")
plt.plot(observations[2,:].asnumpy(),label="Observed P(die=3)")
plt.plot(observations[3,:].asnumpy(),label="Observed P(die=4)")
plt.plot(observations[4,:].asnumpy(),label="Observed P(die=5)")
plt.plot(observations[5,:].asnumpy(),label="Observed P(die=6)")
plt.axhline(y=0.166667, color='black', linestyle='dashed')
plt.legend()
plt.show()
Figure 4.1: Probabilities obtained via finite sampling from a uniform distribution of a
fair die.
It is clear that the more experiments we do, the closer the probability gets to the theoretical value of 1/6.
The above discussion is for the very simple event of a die toss. It gives a clear view of some of the basic issues and procedures related to statistical analysis and probability computation for more complicated events.
4.2 Random Distributions
In machine learning, one often needs to sample numbers in a random manner.
Depending on the type of problem, the distribution of the data of a variable can be of different types. Numerical sampling of data shall be based on a given/assumed distribution type. We did so at the beginning of this chapter using a uniform distribution. We shall now examine this further.
4.2.1 Uniform distribution
Numbers generated based on a uniform distribution have an equal chance of landing anywhere within the specified range. To check the uniformity of the numbers generated using random.randint(), we can run it a large number of times, say 1 million, and see how these numbers are
distributed. We use the following code to do so:
# This code are modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
import numpy as np
import matplotlib.pyplot as plt
import random
na, nb, n = 0, 99, 100
counts = np.zeros(n) # Array to hold counted numbers
fig, axes = plt.subplots(2,3,figsize=(15,8),sharex=True)
axes = axes.reshape(6)
n_samples = 1000001
for i in range(1, n_samples):
counts[random.randint(na, nb)]+=1 # Random integers
if i in [10, 100, 1000, 10000, 100000, 1000000]:
axes[int(np.log10(i))-1].bar(np.arange(na+1,nb+2),counts)
plt.show()
Figure 4.2: Finite samplings from a uniform distribution.
It is observed that as the number of samples increases, the uniformity improves.
4.2.2 Normal distribution (Gaussian distribution)
The normal distribution is also called Gaussian distribution. It is widely used
in statistics because many events in nature, science, and engineering obey
this distribution. It is defined using the following Gaussian density function
of variable x:
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}    (4.2)
where μ and σ are, respectively, the mean and the standard deviation of the distribution. The normal distribution is often denoted as N(μ, σ²). In particular, when μ = 0 and σ = 1, we have the standard normal distribution denoted as N(0, 1), and its density function becomes simply p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}.
The gauss() function in Python's built-in random module can be used to conveniently generate normally distributed numbers.
from random import gauss
mu, sigma, n = 0., 0.1, 10
for i in range(n): # generates n random numbers
print(f'{gauss(mu, sigma):.4f} ', end ='')
# mu: mean; sigma: standard deviation
-0.1939 0.1794 0.0614 -0.1348 0.1020 0.0432 -0.2144 -0.0636
0.0502 -0.1377
Let us plot out the density function defined in Eq. (4.2). The bell shape of
the function may already be familiar to you.
x = np.arange(-0.5, 0.5, 0.001) # define variable x
def gf(mu,sigma,x): # define the Gauss function
return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-.5*((x-mu)/sigma)**2)
mu, sigma = 0, 0.1 # mean 0, standard deviation 0.1
plt.figure(figsize=(6, 4))
plt.plot(x, gf(mu,sigma,x))
plt.show()
Figure 4.3: A typical normal distribution (Gaussian distribution).
Let us now generate some random samples from a Gaussian distribution using np.random.normal(). We then compare the sampled data with the “true” Gaussian distribution.
n = 500
samples = np.random.normal(mu, sigma, n) #generate samples
count, bins, ignored = plt.hist(samples, 80, density=True)
# plot histogram of the samples
plt.plot(bins, gf(mu,sigma,bins),linewidth=2, color='r')
# plot true Gauss distribution
plt.show()
Figure 4.4: Sampling from a normal (Gaussian) distribution.
NumPy can generate samples from about 40 different types of distributions. Readers are referred to the numpy documentation (https://numpy.org/doc/1.16/reference/routines.random.html) for details when needed.
4.3 Entropy of Probability
For the given probabilities of the random variables of a statistical event, one can evaluate the corresponding entropy. It is a measure of the uncertainty of the probability distribution for the event, computed as the dot-product of the probability vector (which holds the probability values of a random variable) with its negative logarithm. The entropy Hp for an event with probability vector p is expressed by
H_p = -\sum_i p_i \log p_i = -\mathbf{p} \cdot \log(\mathbf{p})    (4.3)
where p_i is the probability of the ith possible value of the variable and \sum_i p_i = 1. Vector p holds these probabilities. The
negative sign is needed because entropy is positive while log(p_i) is never positive for 0 < p_i ≤ 1. In computation, we often normalize the entropy by dividing it by the total number of possible values. Entropy is used very often in machine learning, especially for constructing objective functions, because it is a measure of the uncertainty that needs to be minimized.
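The following is a minimal sketch of Eq. (4.3) using the normalization convention adopted in this chapter (division by the number of possible values); the helper name entropy_norm is chosen only for this illustration.

import numpy as np

def entropy_norm(p):
    # Eq. (4.3), normalized by the number of possible values
    p = np.asarray(p, dtype=float)
    return -np.dot(p, np.log(p)) / len(p)

print(entropy_norm([0.25, 0.25, 0.25, 0.25]))  # high uncertainty, ~0.347
print(entropy_norm([0.9, 0.05, 0.05]))         # low uncertainty,  ~0.13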
Since the logarithm is used frequently, we shall first examine it in more detail using the numpy log() function.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
p = np.arange(0.01, 1.0, .01) # generate variables
logp = -np.log(p)
# negative sign for positive value: log(p)<0 for 0<p<1
plt.plot(p,logp,color='blue')
plt.xlabel('Probability p')
plt.ylabel('Negative log(p)')
plt.title('Negative log Value of Probability')
plt.show()
Figure 4.5: A log function of probability.
Let us mention a few important features, which are the root reasons why the logarithm is used so often in machine learning.
• −log(p) varies monotonically with the argument p. This monotonicity is important because taking the logarithm of a function does not change the locations of its stationary points. This is an excellent property for the optimization algorithms that are used frequently in machine learning.
• −log(p) decays monotonically with increasing probability. This reverses the trend of the probability, which is the proper behavior for measuring entropy in the high-probability range: when the probability is high, the uncertainty level is low (we are quite certain that the event is likely to happen), and so is the value of the entropy. When the probability is low, the uncertainty should also be low (we are quite certain that the event is unlikely to happen). In that case, we simply make use of the probability itself in the entropy equation.
• Entropy Hp is a combination of both the probability and its negative
logarithm in the form of a product, as shown in Eq. (4.3). This combination
gives the needed behavior, and it is nicely defined to suit our purpose by
making use of the features of the logarithm function.
The following examples demonstrate how the entropy function works:
4.3.1 Example 1: Probability and its entropy
Consider an event with a variable that takes two values. We first make an observation, which produces a probability vector q1 whose entries are the probabilities of these two values. We then make another observation, which produces probabilities q2. We would like to evaluate the entropy of the probability for each of these two observations.
q1 =np.array([0.999, 0.001]) # Pr. distribution with low
# uncertainty: quite sure whether the event is to
# happen, because the variable is with either a
# very high or low chance to be observed.
q2 =np.array([ 0.5,0.5 ]) # Distribution with high uncertainty:
# Not sure whether the event is to happen,
# because the variable is with neither a
# high or low chance to be observed.
# - sign for getting a positive value:
print('q1=',q1,' -log(q1)=',-np.log(q1)) # log(p): negative
# for 0<p<1
print('q2=',q2,' -log(q2)=',-np.log(q2))
H_q1 = -np.dot(q1,np.log(q1))/len(q1) #Entropy: Uncertainty
H_q2 = -np.dot(q2,np.log(q2))/len(q2)
print('H_q1=',H_q1,'H_q2=',H_q2)
q1= [0.999 0.001] -log(q1)= [0.0010005 6.90775528]
q2= [0.5 0.5] -log(q2)= [0.69314718 0.69314718]
H_q1= 0.003953627556116044 H_q2= 0.34657359027997264
In this example, we see the opposite behavior of p and −log(p). Vector q1 has either a very low or a very high probability value for each of its two entries, meaning that the uncertainty is low, and hence the computed entropy is low. On the other hand, q2 has both probabilities in the middle, meaning that the outcome is very uncertain. The computed entropy is high, as expected.
4.3.2 Example 2: Variation of entropy
To show this more clearly, we create artificial events with a variable that takes two possible values, and let the probabilities of these two values be v1 and v2, which vary in opposite directions while their sum equals 1. We write the following code to compute how the entropy changes with the probabilities v1 and v2:
# An event with a variable that takes two values
v1 = np.arange(0.01, 1.0, .05)
gap = (v1[1]-v1[0])*len(v1)/3.
v1 /= (v1[0]+v1[-1]) # create an array that holds linearly
# changing probability values.
v2 = v1[::-1] # create the reverse of v1.
print(v1,np.sum(v1)) # to check it out
print(v2,np.sum(v2))
print(v1+v2,np.sum(v1+v2)/2)
xtick = range(len(v1)) #[0,1,2,3,4]
plt.bar(range(len(v1)),v1,width=gap*1.2,alpha=.9,color='blue')
plt.bar(range(len(v2)),v2,width=gap,alpha=.9,color='red')
plt.xlabel('Event ID, blue: v1, red: v2')
plt.ylabel('Probability')
plt.xticks(xtick)
plt.show()
H_qf = np.array([]) # initialize the array for entropy
for q1 in list(zip(v1,v2)):
# create a pair of probability compute the entropy and append
H_qf = np.append(H_qf,-(np.dot(q1,np.log(q1)))/2)
plt.plot(v1,H_qf)
plt.xlabel('Probability, v1 (v2=1-v1)')
plt.ylabel('Entropy of events')
plt.title('Entropy of Events')
plt.show()
[0.01030928 0.06185567 0.11340206 0.16494845 0.21649485 0.26804124
0.31958763 0.37113402 0.42268041 0.4742268 0.5257732 0.57731959
0.62886598 0.68041237 0.73195876 0.78350515 0.83505155 0.88659794
0.93814433 0.98969072] 10.0
[0.98969072 0.93814433 0.88659794 0.83505155 0.78350515 0.73195876
0.68041237 0.62886598 0.57731959 0.5257732 0.4742268 0.42268041
0.37113402 0.31958763 0.26804124 0.21649485 0.16494845 0.11340206
0.06185567 0.01030928] 9.999999999999998
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 10.0
Figure 4.6: Probabilities of 20 events each of which has two values.
Figure 4.7: Variation of entropy of the probabilities of the 20 events.
It is clear that when the probabilities v1 and v2 are both at 0.5, the entropy is the largest. The entropy is smallest at the two ends, as expected.
4.3.3 Example 3: Entropy for events with a variable that takes
different numbers of values of uniform distribution
Let us take a look at events with a variable that can take different numbers of possible values. We assume that the probability distribution of the variable is uniform for all these events. We want to find out how the entropy of the probability distribution changes with the number of possible values.
# An event with a variable that can take many values of uniform probability
N = 0
max_v = 100 # Events with N variables
# capped at max_values.
Ni = np.array([]) # For the number of values
H_qf = np.array([]) # To hold the entropy
while N < max_v:
N += 1
Ni = np.append(Ni,N)
qf = np.ones(N)
qf = qf/np.sum(qf) # uniform sample generated
H_qf = np.append(H_qf,-np.dot(qf,np.log(qf))/len(qf))
print('Probability distribution:',qf[0:max_v:10])
print('H_qf=', H_qf[0:max_v:10])
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.plot(Ni,H_qf)
plt.xlabel('Number of variables, all with same probability')
plt.ylabel('Entropy')
plt.title('Events with variables of uniform distribution')
plt.show()
Probability distribution: [0.01 0.01 0.01 0.01 0.01 0.01 0.01
0.01 0.01 0.01]
H_qf= [-0. 0.21799048 0.14497726 0.11077378 0.09057493
0.07709462 0.06739137 0.06003774 0.05425246 0.04956988]
Figure 4.8: Entropy of probability of events with uniform distributions.
We find that (1) the entropy is zero when N = 1; (2) it peaks at N = 3; and (3) when N gets very big, the entropy becomes small, implying that when the variable of an event has a very large number of possible values, the (normalized) entropy becomes small. This is because the probability of each value becomes very small under the uniform probability assumption.
4.4 Cross-Entropy: Predicted and True Probability
Let us now look at the cross-entropy, which is an often used concept in
statistics. The cross-entropy of a distribution q relative to the distribution
p is defined as follows:
H_{pq} = -\sum_i p_i \log q_i = -\mathbf{p} \cdot \log(\mathbf{q})    (4.4)
In general, the cross-entropy is a measure of the similarity of two distributions (from the same space). In machine learning, we are interested in the cross-entropy of the predicted probability q with respect to the true one, p. In this case, Hpq can be a measure of the performance of a prediction model, and hence is often used as an objective or loss function in machine learning models.
We note the following properties:
• Cross-entropy is not symmetric: Hpq ≠ Hqp if p ≠ q, which is obvious from Eq. (4.4).
• We shall have Hpq ≥ Hp and Hqp ≥ Hq (a quick numerical check is given after this list). The difference is the KL-divergence, which is always non-negative (see the following section).
• When these two distributions are the same, the cross-entropy becomes the entropy studied in the previous section, and the inequalities above become equalities.
• Therefore, in machine learning models, even if the prediction is perfect, the cross-entropy will still not be zero, because the true distribution itself may have an entropy. If Hp is the entropy of the true distribution, the cross-entropy Hpq is bounded from below by Hp. It can only be zero if the true distribution has no uncertainty at all (the probabilities of the values are all zero, except for one of them, which is 1).
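The sketch below implements Eq. (4.4) with the same per-element normalization used for the entropy above and checks the property Hpq ≥ Hp numerically; the distributions p and q are assumed values chosen only for this illustration.

import numpy as np

def cross_entropy(p, q):
    # Eq. (4.4), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.dot(p, np.log(q)) / len(p)

p = np.array([0.8, 0.2])      # a "true" distribution (assumed)
q = np.array([0.6, 0.4])      # a "predicted" distribution (assumed)
print(cross_entropy(p, q))                            # H_pq
print(cross_entropy(p, p))                            # H_p (entropy of p)
print(cross_entropy(p, q) >= cross_entropy(p, p))     # expected: True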
We look at some simple examples.
4.4.1 Example 1: Cross-entropy of a quality prediction
We examine a simple event with a variable that can take two possible values.
Assuming we have a good-quality prediction, how is it measured in terms of cross-entropy?
# A good prediction case
q_good = np.array([0.9,0.1]) # predicted Pr. of 2 values
y = np.array([0.99,0.01]) # true Pr. of 2 values
p = y # Truth
q = q_good # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
print(' Entropy: Hp=',-np.dot(p,np.log(p))/len(p),\
' Hq=',-np.dot(q,np.log(q))/len(q))
print('\nCross-entropy: Hpq=',-np.dot(p,np.log(q))/len(p),\
'Hqp=',-np.dot(q,np.log(p))/len(q))
#Cross-entropy: Hpq>Hp; Hqp>Hq should hold.
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.9 0.1] log(q)= [0.10536052 2.30258509]
Entropy: Hp= 0.028000767177423672 Hq= 0.1625414866957241
Cross-entropy: Hpq= 0.06366638 Hqp= 0.234781160
It is seen that the cross-entropy Hpq is low, indicating that the prediction q is good. Notice that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
4.4.2 Example 2: Cross-entropy of a poor prediction
Consider again a simple event with a variable that can take two possible
values. This time, we assume a poor-quality prediction, and we examine
how it is measured in cross-entropy.
# A totally-off prediction case
q_bad = np.array([0.1,0.9]) # predicted Pr. of 2 values
y = np.array([0.99,0.01]) # true Pr. of 2 values
p = y # Truth
q = q_bad # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
print(' Entropy: Hp=',-np.dot(p,np.log(p))/len(p),\
' Hq=',-np.dot(q,np.log(q))/len(p))
print('\nCross-entropy: Hpq=',-np.dot(p,np.log(q))/len(p),\
'Hqp=',-np.dot(q,np.log(p))/len(q))
#Cross-entropy: Hpq>Hp; Hqp>Hq should hold.
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.1 0.9] log(q)= [2.30258509 0.10536052]
Entropy: Hp= 0.028000767177423672 Hq= 0.1625414866957241
Cross-entropy: Hpq= 1.1403064236103417 Hqp= 2.072829100487316
It is seen that the cross-entropy Hpq is high, indicating that the prediction q is bad. Notice again that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
We are now ready to discuss the KL-divergence.
4.5 KL-Divergence
Kullback-Leibler Divergence or KL-divergence is a measure of the relative
entropy from one distribution to another. For the given two distributions p
and q, the KL-divergence from q to p is defined as
D_{KL}(p\|q) = \sum_i p_i\,[\log p_i - \log q_i] = \mathbf{p} \cdot [\log(\mathbf{p}) - \log(\mathbf{q})]    (4.5)
It is also referred to as the relative entropy of q with respect to p that can
be regarded as the true or reference distribution. Using the definitions for
the entropy and the cross-entropy, we shall have
DKL (p||q) = Hpq − Hp (4.6)
Note that the KL-divergence of q with respect to p is different from that of
p with respect to q. We have also
DKL (p||q) ≥ 0, equality holds only if p = q (4.7)
This is known as Gibbs' inequality (https://en.wikipedia.org/wiki/Gibbs%27_inequality).
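The following minimal sketch implements Eq. (4.5) with the same normalization convention and verifies the relation D_KL(p||q) = Hpq − Hp of Eq. (4.6); p and q are assumed values for illustration only.

import numpy as np

def kl_divergence(p, q):
    # Eq. (4.5), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.dot(p, np.log(p) - np.log(q)) / len(p)

p = np.array([0.8, 0.2])
q = np.array([0.6, 0.4])
H_pq = -np.dot(p, np.log(q)) / len(p)     # cross-entropy
H_p  = -np.dot(p, np.log(p)) / len(p)     # entropy of p
print(kl_divergence(p, q))                # non-negative (Gibbs' inequality)
print(H_pq - H_p)                         # the same value, per Eq. (4.6)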
Two simple examples of KL-divergence are given below.
4.5.1 Example 1: KL-divergence of a distribution
of quality prediction
We examine a simple event with a variable that can take two possible values.
Assuming we have a quality prediction of a distribution in relation to the
true distribution, how is it measured in KL-divergence?
# Good prediction case
p = y # true or reference distribution
q = q_good # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
Dpq=np.sum(np.dot(p,(np.log(p)-np.log(q))))/len(p)
Dqp=np.sum(np.dot(q,(np.log(q)-np.log(p))))/len(p)
print('Dpq=',Dpq,' Dqp=',Dqp)
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.9 0.1] log(q)= [0.10536052 2.30258509]
Dpq= 0.035665613538170556 Dqp= 0.07223967373775611
It is seen that the KL-divergences Dpq and Dqp are both positive. They both have low values, indicating that the prediction q is good. Notice that Dpq ≠ Dqp.
4.5.2 Example 2: KL-divergence of a poorly
predicted distribution
Consider again a simple event with a variable that can take two possible
values. Assume that we have a poor prediction of a distribution in relation
to the true or reference distribution. We examine how it is measured in
KL-divergence using the following code:
# A bad prediction case
p = y
q = q_bad
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
Dpq=np.dot(p,(np.log(p)-np.log(q)))/len(p)
Dqp=np.dot(q,(np.log(q)-np.log(p)))/len(p)
print('Dpq=',Dpq, ' Dqp=',Dqp)
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.1 0.9] log(q)= [2.30258509 0.10536052]
Dpq= 1.112305656432918 Dqp= 1.910287613791592
It is seen again that the KL-divergences Dpq and Dqp are both positive. They both have high values, indicating that the prediction q is poor. Notice also that Dpq ≠ Dqp.
4.6 Binary Cross-Entropy
Let us finally look at the so-called binary cross-entropy used in machine
learning. For the given two distributions p and q, the binary cross-entropy
of q with respect to p is defined as
H^{B}_{pq} = -\sum_i \big[\, p_i \log q_i + (1 - p_i) \log(1 - q_i) \,\big] = -\mathbf{p} \cdot \log(\mathbf{q}) - (1 - \mathbf{p}) \cdot \log(1 - \mathbf{q})    (4.8)
In machine learning models, we usually assume that p is the true distribution, whose entries can be exactly 0 or 1 and hence cannot be passed to the logarithm. The binary cross-entropy can be viewed as a measure of the entropy of the predicted probability with respect to the true one. It takes into account both the probabilities p and q and the converse probabilities (1 − p) and (1 − q), and computes the entropy of both. It roughly doubles the cross-entropy, giving a somewhat enhanced measure of the discrepancy of the predicted distribution from the true distribution. It is often used to measure the performance of a model and serves as one type of loss function.
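A minimal sketch of Eq. (4.8), again normalized by the number of values as in the examples below, is given here; the small eps guard against log(0) is our own addition for cases where q contains exact 0 or 1.

import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    # Eq. (4.8), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    q = np.clip(q, eps, 1.0 - eps)   # avoid log(0)
    return -(np.dot(p, np.log(q)) + np.dot(1.0 - p, np.log(1.0 - q))) / len(p)

p = np.array([1.0, 0.0, 0.0, 0.0])        # truth
q = np.array([0.9, 0.04, 0.03, 0.03])     # prediction
print(binary_cross_entropy(p, q))         # ~0.0518, cf. Example 1 below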
We look at some examples.
4.6.1 Example 1: Binary cross-entropy for a distribution
of quality prediction
Consider a simple event with a variable that can take four possible values.
Assume that we have a good prediction of a distribution in relation to the
true or reference distribution. We examine how it is measured in the binary
cross-entropy using the following code:
# A good prediction case
import numpy as np
q = np.array([0.9,0.04,0.03,0.03]) # prediction
p = np.array([1.0,0.0,0.,0.]) # truth
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq = -np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq = -np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[1. 0. 0. 0.] [0.9 0.04 0.03 0.03] converse: [0. 1. 1. 1.]
[0.1 0.96 0.97 0.97]
Cross-entropy cHpq: 0.02634012891445657
Binary cross-entropy bcHpq: 0.05177523128687465
It is found that the binary cross-entropy roughly doubles the cross-
entropy value, as expected.
4.6.2 Example 2: Binary cross-entropy for a poorly
predicted distribution
Consider an event with a variable that can take four possible values. Assume
that we have a poor prediction of a distribution in relation to the true
distribution. We examine again how it is measured in the binary cross-
entropy using the following code:
# A bad prediction case
q = np.array([0.4,0.2,0.3,0.1]) # prediction
p = np.array([1.0,0.0,0.,0.]) # truth
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq = -np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq = -np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[1. 0. 0. 0.] [0.4 0.2 0.3 0.1] converse: [0. 1. 1. 1.]
[0.6 0.8 0.7 0.9]
Cross-entropy cHpq: 0.22907268296853875
Binary cross-entropy bcHpq: 0.4003674356962309
It is also found that the binary cross-entropy roughly doubles the cross-
entropy, leading to an enhanced discrepancy measure.
4.6.3 Example 3: Binary cross-entropy for more uniform
true distribution: A quality prediction
In the previous two examples, we studied two cases with the true distribution
at an extreme: its probabilities are 1.0 and zeros. For both examples, we
observed an enhanced entropy measure using the binary cross-entropy. In
this example, we consider a more even true distribution and examine the
behavior of the binary cross-entropy.
# A good prediction case
q = np.array([0.4,0.2,0.3,0.1]) # prediction
p = np.array([0.3,0.3,0.2,0.2]) # truth, rather even distribution
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq =-np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq=-np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[0.3 0.3 0.2 0.2] [0.4 0.2 0.3 0.1] converse: [0.7 0.7 0.8
0.8] [0.6 0.8 0.7 0.9]
Cross-entropy cHpq: 0.36475754318911824
Binary cross-entropy bcHpq: 0.5856092407474651
In this case, it is found that the binary cross-entropy does not give as much of an enhancement: it increases the measure by noticeably less than a factor of two.
4.6.4 Example 4: Binary cross-entropy for more uniform
true distribution: A poor prediction
Same as the previous example, but consider a case with poor prediction.
# A bad prediction case
q = np.array([0.4,0.05,0.05,0.5]) # prediction
p = np.array([0.1,0.3,0.2,0.2]) # truth, rather even distribution
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse: ',p_conv,q_conv)
cHpq =-np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq=-np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[0.1 0.3 0.2 0.2] [0.4 0.05 0.05 0.5] converse: [0.9 0.7 0.8
0.8] [0.6 0.95 0.95 0.5]
Cross-entropy cHpq: 0.43203116151910004
Binary cross-entropy bcHpq: 0.7048313483737685
In this case, we find a similar behavior: the binary cross-entropy again enhances the measure, by a comparable factor.
In conclusion, our study of the simple events above shows that the binary cross-entropy enhances the discrepancy measure by taking into consideration both positive samples (with probability close to 1) and “negative” samples (with probability close to zero).
4.7 Bayesian Statistics
Consider a statistical event with more than one random variable occurring jointly. When we deal with such multiple random variables, we may want to know the joint probability Pr(A,B): the probability of both A = a and B = b occurring simultaneously, for given elements a and b.
It is clear that for any values a and b, Pr(A,B) ≤ Pr(A = a), because
Pr(A = a) is measured regardless of what happens for B. For A and B to
happen jointly, A has to happen and B also has to happen (and vice versa).
Thus, A,B cannot be more likely than A or B occurring individually.
Pr(A,B)/Pr(A) is called the conditional probability and is denoted by Pr(B|A), which is the probability that B happens under the condition that A has happened. This leads to the important Bayes' theorem.
• By construction, we have: Pr(A,B) = Pr(B|A)Pr(A).
• By symmetry, this also holds: Pr(A,B) = Pr(A|B)Pr(B).
• We thus have
Pr(A|B) = Pr(B|A) Pr(A)/Pr(B).    (4.9)
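A tiny numerical check of Eq. (4.9), with probabilities that are made up purely for illustration, is given below.

# All probability values below are assumed for this illustration only.
PrA, PrB = 0.30, 0.40          # marginal probabilities Pr(A) and Pr(B)
PrB_given_A = 0.60             # conditional probability Pr(B|A)
PrAB = PrB_given_A * PrA       # joint probability Pr(A,B) = Pr(B|A)Pr(A) = 0.18
PrA_given_B = PrB_given_A * PrA / PrB   # Bayes' theorem, Eq. (4.9)
print(PrA_given_B)             # 0.45
print(PrAB / PrB)              # the same value, from Pr(A|B) = Pr(A,B)/Pr(B)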
4.8 Naive Bayes Classification: Statistics-based Learning
4.8.1 Formulation
Based on Bayesian statistics, a popular algorithm known as the Naive Bayes classifier has been developed. Consider an event with p variables x = {x1, x2, . . . , xp} ∈ X^p. We assume that each variable xi is independent of the others. For a given label y, the conditional probability of observing x is expressed as
p(\mathbf{x}|y) = \prod_i p(x_i|y)    (4.10)
Based on Bayes’ Theorem, we have the following formula:
p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})} = \frac{\prod_i p(x_i|y)\,p(y)}{p(\mathbf{x})}    (4.11)
Although we may not know p(x) (that is the probability that x occurs in the
event), it may not be needed because it is only a matter of normalization in
computing p(y|x). Thus, we may just use the following formula instead:
p(y|\mathbf{x}) \propto p(\mathbf{x}|y)\,p(y) = \prod_i p(x_i|y)\,p(y)    (4.12)
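Before the case study, the toy sketch below illustrates Eq. (4.12) with two binary features and two classes; every probability in it is an assumed number, not taken from any dataset.

import numpy as np

p_y = np.array([0.5, 0.5])               # p(y) for classes y = 0, 1 (assumed)
p_xi_given_y = np.array([[0.9, 0.2],     # p(x1=1|y=0), p(x1=1|y=1) (assumed)
                         [0.8, 0.3]])    # p(x2=1|y=0), p(x2=1|y=1) (assumed)
x = np.array([1, 0])                     # an observed binary feature vector

score = p_y.copy()
for i, xi in enumerate(x):               # product over features, Eq. (4.12)
    score *= np.where(xi == 1, p_xi_given_y[i], 1.0 - p_xi_given_y[i])
print(score / score.sum())               # normalized p(y|x)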
4.8.2 Case study: Handwritten digits recognition
We can now use the code provided at mxnet-the-straight-dope (https://
github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter01
crashcourse/probability.ipynb) to show how a Naive Bayes classifier is coded
to identify handwritten digits. We will use the well-known MNIST dataset
to train this classifier. The MNIST dataset (https://en.wikipedia.org/wiki/MNIST_database) contains a total of 70,000 images (60,000 for training and 10,000 for testing) of the 10 handwritten digits from 0 to 9, and all these images are labeled. The images were collected from American Census Bureau employees and American high school students.
The digit classification problem is then cast as computing the probability of a given image x being digit y: p(y|x). Any image x contains p pixels xi (i = 1, 2, . . . , p), and each pixel xi can take a value of 1 (lit on) or 0 (off), and hence is a binary variable.
Equation (4.12) can then be used, in which we need to estimate p(y)
and p(xi |y). Both can be computed using the MNIST training dataset for
each digit y. For example, among the total of 60,000 images of digits in the MNIST training dataset, digit 4 is found 5,800 times, and we then have p(y = 4) = 5800/60000. To estimate p(xi|y), we can estimate p(xi = 1|y), because
xi is binary and p(xi = 0|y) = 1 − p(xi = 1|y). Estimating p(xi = 1|y) can
be done by counting the times that pixel i is on for label digit y, and then
dividing it by the number of occurrences of label y in the dataset. In this
simple algorithm, all we need is to count over the MNIST training dataset.
It is quite a straightforward strategy: the training is just counting, and we can use the following code to get this done:
4.8.3 Algorithm for the Naive Bayes classification
# The codes are modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
# import all the necessary packages
import numpy as np
import mxnet as mx
from mxnet import nd
def transform(data, label): # define a function to transform the data
    return (nd.floor(data/128)).astype(np.float32), label.astype(np.float32)
# floor(255/128) = 1, so a pixel value becomes 1 when the pixel is on.
# Divide dataset to 2 sets: one for training and one for testing
mnist_train = mx.gluon.data.vision.MNIST(train=True,
transform=transform)
mnist_test = mx.gluon.data.vision.MNIST(train=False,
transform=transform)
print('type:',type(mnist_train))
type: <class 'mxnet.gluon.data.vision.datasets.MNIST'>
import matplotlib.pyplot as plt
%matplotlib inline
image_index=8888 # Check one. Any integer <60,000
# 8888 is digit with label 3, as printed below
print(mnist_train[image_index][1]) #image in 0; label in 1
plt.imshow(mnist_train[image_index][0].reshape((28, 28)).\
asnumpy(), cmap='Greys') #image pixel: 28 by 28
3.0
<matplotlib.image.AxesImage at 0x1c5743ba630>
Figure 4.9: One sample image of handwritten digit from the MNIST dataset.
# Initialize arrays for counts for computing p(y), p(xi|y)
# We initialize all numbers with a count of 1 to avoid
# division by zero, known as Laplace smoothing.
ycount = nd.ones(shape=(10)) #10 possible digits
xcount = nd.ones(shape=(784, 10)) #784 (= 28*28) variables
# Aggregate the count of the labels in training dataset
# and number of its corresponding pixels being on (value=1)
for data, label in mnist_train: # loop over the dataset
x = data.reshape((784,))
y = int(label) # get the digit-number
ycount[y] += 1 # add 1 to (digit)th entry
xcount[:, y] += x # add the image data to
# the (digit)th column
# compute the probabilities p(xi|y) (divide per pixel counts
# by total count of the label in the training dataset)
for i in range(10):
xcount[:, i] = xcount[:, i]/ycount[i]
# Compute the probability p(y)
py = ycount / nd.sum(ycount)
The model has been trained using the training dataset. We now plot the
“trained” model.
import matplotlib.pyplot as plt
%matplotlib inline
fig, figarr = plt.subplots(1, 10, figsize=(15, 15))
for i in range(10):
figarr[i].imshow(xcount[:,i].reshape((28,28)).asnumpy(),
cmap='hot')
figarr[i].axes.get_xaxis().set_visible(False)
figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(py.asnumpy(),nd.sum(py).asnumpy())
Figure 4.10: A kind of mean appearance of handwritten digits.
[0.099 0.112 0.099 0.102 0.097 0.090 0.099 0.104
0.098 0.099] [1.000]
These pictures show the estimated probability distributions of observing a switched-on pixel for each of the 10 digits. They are the mean appearances of the digits, or what each digit looks like on average based on the training dataset.
4.8.4 Testing the Naive Bayes model
We now examine the performance of this statistics-based model using the
MNIST test dataset. The training (which is just simple counting over
the training dataset) completed above gives us p(xi = 1|y) and p(y).
For a given image x from the test dataset, we compute the likelihood of the image corresponding to a label y; that is, we compute p(y|x) using Eq. (4.12), where p(x|y) is in turn computed using the trained model. To
avoid chain multiplication of small probability numbers, we compute the
following logarithms instead (known as “log-likelihood”):
\log p(y|\mathbf{x}) \propto \log p(\mathbf{x}|y) + \log p(y) = \sum_i \log p(x_i|y) + \log p(y)    (4.13)
For the given image x, a feature xi is binary and takes a value of either 1 or 0. Because we are using the trained model to compute the probabilities, we shall have
p(x_i|y) = \begin{cases} p(x_i = 1|y) & \text{for } x_i = 1 \text{ (pixel on)} \\ 1 - p(x_i = 1|y) & \text{for } x_i = 0 \text{ (pixel off)} \end{cases}    (4.14)
Equation (4.14) can be written as a single expression using a mathematical trick:
p(x_i|y) = p(x_i = 1|y)^{x_i}\,\big(1 - p(x_i = 1|y)\big)^{1 - x_i}    (4.15)
This is a general equation for computing the probability of an event with binary variables, using a trained model that predicts the probability of the positive value. We finally have
\log p(\mathbf{x}|y) = \sum_i \log p(x_i|y) = \sum_i \big[ x_i \log p(x_i = 1|y) + (1 - x_i) \log(1 - p(x_i = 1|y)) \big]    (4.16)
It is clear now that the testing essentially measures the binary cross-entropy of the distribution of a given image (the true distribution) with that of the average image of a labeled digit computed from the dataset (the model distribution). Therefore, we can write out Eq. (4.16) directly using the binary cross-entropy formula.
To avoid re-computing the logarithms repetitively, we pre-compute
log p(y) for all y, and also log p(xi |y) and log (1 − p(xi |y)) for all pixels.
logxcount = nd.log(xcount) # pre-computations
logxcountneg = nd.log(1-xcount)
logpy = nd.log(py)
fig, figarr = plt.subplots(2, 10, figsize=(15, 3))
# test and show 10 images
ctr = 0 # initialize the control iterator
y = []
pxm = np.array([])
xi = ()
for data, label in mnist_test: # for any image
x = data.reshape((784,))
y.append(int(label))
# Incorporate the prior probability p(y) since p(y|x) is
# proportional to p(x|y) p(y)
logpx = logpy.copy() #nd.zeros_like(logpy)
for i in range(10):
# compute the log probability for a digit
logpx[i] += nd.dot(logxcount[:,i],x) + nd.dot(logxcountneg[:,i],1-x)
# normalize to prevent overflow or underflow by
# subtracting
# the largest value
logpx -= nd.max(logpx)
# and compute the softmax using logpx
px = nd.exp(logpx).asnumpy()
px = px*py.asnumpy() # this proportional to P(y|x)
px /= np.sum(px)
pxm = np.append(pxm,max(px)) # use the one with max Pr.
xi = np.append(xi,np.where(px == np.amax(px)))
# bar chart and image of digit
figarr[1, ctr].bar(range(10), px)
figarr[1, ctr].axes.get_yaxis().set_visible(False)
figarr[0, ctr].imshow(x.reshape((28,28)).asnumpy(),
cmap='hot')
figarr[0, ctr].axes.get_xaxis().set_visible(False)
figarr[0, ctr].axes.get_yaxis().set_visible(False)
ctr += 1
if ctr == 10:
break
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
plt.show()
print('True label: ',y)
xi = np.array(xi)
print('Predicted digits:',xi)
print('Correct?',np.equal(y,xi))
np.set_printoptions(formatter={'float': '{: 0.1f}'.format})
print('Maximum probability:',pxm)
Figure 4.11: Predicted digits (in probability) using images from the testing dataset of
MNIST.
True label: [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
Predicted digits: [7 2 1 0 4 1 4 9 4 9]
Correct? [True True True True True True True True False True]
Maximum probability: [1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0]
4.8.5 Discussion
The test shows that this classifier made one wrong classification among the first 10 digits of the testing dataset. The 9th digit should be 5, but it was classified as 4. For this wrongly classified digit, the confidence level is still very close to 1. The wrong prediction may be due to the incorrect assumption that each pixel is generated independently, depending only on the label. Clearly, a digit is a very complicated function of the image, and statistical information alone has its limits. This type of Naive Bayes classifier was popular in the 1980s and 1990s for applications such as spam filtering. For image processing types of problems, we now have more effective classifiers (such as the CNN; for example, see Chapter 15).
Alternatively, we can use the cross-entropy or the binary cross-entropy concept to perform the prediction. Once we have obtained p(xi|yj), j = 0, 1, . . . , 9, using the training dataset, one can compute the (binary) cross-entropy between p(xi|yj) and the pixel distribution of any given image from the test dataset (or any other handwritten digit). The yj that gives the smallest (binary) cross-entropy is regarded as the predicted digit.
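A minimal sketch of this alternative is given below: it classifies one test image by choosing the digit whose mean-appearance distribution gives the smallest binary cross-entropy. It assumes that xcount and mnist_test from the code above are still available; the eps guard is our own addition.

import numpy as np

data, label = mnist_test[0]                    # one test image and its label
x = data.reshape((784,)).asnumpy()             # binary pixel values of the image
q = xcount.asnumpy()                           # 784 x 10 array of p(xi=1|y)
eps = 1e-12                                    # guard against log(0)
bce = -(x @ np.log(q + eps) + (1 - x) @ np.log(1 - q + eps)) / len(x)
print('True label:', int(label), ' Predicted:', int(np.argmin(bce)))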
This simple example shows how statistical analyses are useful in classification and in machine learning in general. We have also shown how the statistics can be computed for given datasets. Powerful Naive Bayes classifiers can be conveniently trained using the existing module in Sklearn (https://scikit-learn.org/stable/modules/naive_bayes.html) for practical problems that are heavily governed by statistics and where the physics laws are unknown, such as medical applications, recommendation systems, text classification, and real-time prediction and recommendation.
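As a hedged sketch of how such a classifier might be set up with scikit-learn, the snippet below uses BernoulliNB on placeholder binary data; in practice, X and y would be the binarized MNIST images and their labels prepared as in the code above.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.random.randint(0, 2, size=(200, 784))   # placeholder binary "images"
y = np.random.randint(0, 10, size=200)         # placeholder digit labels
clf = BernoulliNB(alpha=1.0)                   # alpha=1.0 gives Laplace smoothing
clf.fit(X, y)
print(clf.predict(X[:5]), y[:5])               # sanity check on the training data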
Chapter 5
Prediction Function and
Universal Prediction Theory
To build an ML model for predictions, one needs to use some hypothesis,
which predefines the prediction function to connect the feature variables to
the learning parameters. Thus, a proper hypothesis may follow the function
approximation theory, which has been studied intensively in physics-law-
based models [1–4]. The most essential rule is that the prediction function
must be capable of predicting an arbitrary linear function in the feature space
by a chosen set of learning parameters. Therefore, the prediction function is
assumed as a complete linear function of the feature variables, and it is
one of the most basic hypotheses.
It turns out such a complete linear prediction function performs affine
transformations of patterns in the affine space. Such a transformation
preserves affinity, meaning that the ratios of distances (Euclidean)
between points lying on a straight line and the parallelism of parallel line
segments remain unchanged after the transformation. It may not preserve the
angles between line segments and the (Euclidean) distances between points
in the original pattern. Further discussion on more general issues with affine
transformations can be found in Wikipedia (https://en.wikipedia.org/wiki/
Affine transformation) and the links therein.
The affinity ensures a special unique point-to-point seamless and gapless
transformation, meaning that it does not merge two distinct points to one, or
split one point to two, when the learning parameters are varying smoothly.
The connection between the feature variables and the learning parameters
is also smooth. Because an affine transformation is a combination of a linear
transformation and a translation controlled by the learning parameters, any
function up to the first order in the feature space can be reproduced, which
is critically important for machine learning models to be predictive.
This chapter first discusses the formulation and predictability of prediction functions, followed by a discussion of the detailed process, properties, and behavior of affine transformations. We shall focus on two aspects: (1) the capability of predicting functions in the feature space, and (2) the affine transformation of patterns in the affine space.
Then, the concept of affine transformation unit (ATU) (or linear predic-
tion function unit) is introduced as a building block, and simple neural net-
work codes will be built to perform affine transformations and demonstrate
their behavior and property. Feature encodings by learning parameters and
the uniqueness of the encodings are then studied, demonstrating the concept
of data-parameter converter. Next, an extension of ATU to form an affine
transformation array (ATA) and further extensions of the activation function
wrapped ATA to form MLPs or deepnets will be studied, which shows how
the predictability of a high-order nonlinear function can be established with
a deepnet. Finally, a Universal Prediction Theory is presented offering the
fundamental basis of why a deepnet can be made predictive.
5.1 Linear Prediction Function and Affine Transformation
Figure 5.1 shows the mechanism of the transmission of neurotransmitters in a
synaptic cleft of sensory neurons. Molecules of a neurotransmitter are shown
in a pseudo-colored image from a scanning electron microscope. A terminal
button (green) has been opened to reveal the synaptic vesicles (orange and
blue) inside the neurotransmitter.
Figure 5.1: (a) Transmission of neurotransmitters; (b) a pseudo-colored image of a
neurotransmitter from a scanning electron microscope: a terminal button (green) has been
opened to reveal the synaptic vesicles (orange and blue) inside; (c) a typical affine trans-
formation in xw formulation in an artificial NN. (Modified based on the image given in the
Psychology book (2d) by Rose M. Spielman et al. under the CC BY 4.0 License). (Credit b:
modification of work by Tina Carvalho, NIH-NIGMS; scale-bar data from Matt Russell).
In an artificial NN, we use a so-called affine transformation between the
data-points and the learning parameters in an artificial neuron to somehow
mimic the information transformation process in a neurotransmitter. This
transformation is of most fundamental importance in many ML models, and
thus the related theory is discussed in great detail in this chapter.
5.1.1 Linear prediction function: A basic hypothesis
In machine learning models, the basic hypothesis is that the prediction
function z is given by the following equation:
z(\hat{w}; x) = \bar{x}\,\hat{w} = [1 \;\; x]\begin{bmatrix} b \\ w \end{bmatrix} = x\,w + b = W_1 x_1 + \cdots + W_p x_p + b = \sum_{i=1}^{p} W_i x_i + b    (5.1)
where z(ŵ; x) reads as “z is a function of ŵ for given x”, and all vectors are
defined as
x = [x_1, x_2, \ldots, x_p] \in \mathbb{X}^p \subset \mathbb{R}^p
\bar{x} = [1 \;\; x] = [x_0, x_1, x_2, \ldots, x_p] \in \bar{\mathbb{X}}^p \subset \mathbb{R}^{p+1}
w^\top = [W_1, W_2, \ldots, W_p] \in \mathbb{W}^p \subset \mathbb{R}^p
\hat{w}^\top = [b \;\; w^\top] = [W_0, W_1, W_2, \ldots, W_p] \in \mathbb{W}^{p+1} \subset \mathbb{R}^{p+1}    (5.2)
in which xi (i = 1, 2, . . . , p) are feature variables, and here we use only the
linear basis functions; W i (i = 1, 2, . . . , p) ∈ R are called weights. Constant
b ∈ R is called the bias, and is also often denoted as W0 ∈ R. These real numbers form vectors in the corresponding spaces (defined in Chapter 1) and are used in the operations above. The hat over w indicates that the constant basis term has been included: it absorbs the bias and becomes a new vector in the hypothesis space W^{p+1}. Both the weights and the bias are parameters that can be tuned to predict exactly a desired arbitrary linear function in the feature space. The relations between these spaces are discussed in Chapter 1. Notice the transpose used for the learning parameters, which implies that they form a matrix (with only one column in this case). The features form row vectors. This is why the learning parameter matrix acts from the right on the feature vector.
We intentionally put the most used forms of the prediction function together in Eq. (5.1) as a single unified expression, so that the relationship between all
these variables can be made clear once and for all. Readers may take a moment to digest this formulation, so that the later formulations can be understood more easily.
When we write z = xw + b, we call it the xw+b formulation. When we write z = x̄ŵ, in which the bias b is absorbed into ŵ, we call it the xw formulation. Both formulations will be used interchangeably in this book, because they are essentially the same. The xw+b formulation allows explicit viewing of the roles of the weights and biases separately during analysis. The xw formulation is more concise in derivation processes, and also allows explicit expressions of affine transformations, which are essential for major machine learning models.
5.1.2 Predictability for constants, the role of the bias
Note that the bias b is a must-have. If we set b = 0 and use only w, the hypothesis will not even be able to predict a constant function. This can easily be proven as follows.
Consider a given (label) function y(x) = c, where c is a given constant
in R independent of x. This means that at x = 0, y(x = 0) = c. In order to
predict c using Eq. (5.1) we must have z(w, b; x = 0) = c. Now, if we drop
b in Eq. (5.1), regardless of what we choose for w, the hypothesis always
predicts
z =0·w =0 (5.3)
This means that the constant c ∈ R will never be predicted by the hypothesis
without b. This means also that a pure linear transformation through w is
insufficient for proper prediction, because it cannot even predict constants.
On the other hand, when b is there, we simply choose b = c, and the
constant c is then produced by the hypothesis. This also implies that z must live in an affine space X̄^p, an augmented feature space that lives within X^{p+1}.
5.1.3 Predictability for linear functions: The role of the weights
Further, by proper choices of w and b, any linear function can be produced
using Eq. (5.1). This can also be easily proven as follows.
Consider any given (label) linear function y ∈ Y of variables x ∈ Xp :
y(x) = xk + c (5.4)
where c ∈ R is a given constant and k is a (column) vector in W^p. Note that k may in general be in R^p; however, since we need to perform vector operations on it, it must be confined to the vector space W^p.
By simply choosing w∗ = k, and b∗ = c, we obtain
z(w∗ , b∗ ; x) = xk + c = y(x) (5.5)
The given linear function is predicted exactly, using such a particular choice
of w∗ and b∗ . This means that any given arbitrary linear function of x ∈ Xp
can be predicted using hypothesis Eq. (5.1).
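A quick numerical check of Eq. (5.5) is sketched below: choosing w* = k and b* = c reproduces the label function y(x) = xk + c exactly (the values of k, c, and the test points are arbitrary).

import numpy as np

p = 3
k = np.array([2.0, -1.0, 0.5])     # gradient of the label function (assumed)
c = 4.0                            # constant part of the label function (assumed)
w, b = k, c                        # the choice w* = k, b* = c
x = np.random.rand(5, p)           # a few random feature points
print(np.allclose(x @ w + b, x @ k + c))   # True: prediction matches the label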
5.1.4 Prediction of linear functions: A machine
learning procedure
The above process showed that predicting a label linear function using Eq. (5.1) is straightforward, because we can choose the learning parameters by inspection. For more complicated problems, this is not possible, and we
would use a minimization process to find these learning parameters. Here,
we demonstrate such a process to find w∗ and b∗ . This time, let us use the
xw formulation. We rewrite Eq. (5.4) as
y(x) = \bar{x}\,\hat{k}    (5.6)
where \hat{k} = [c \;\; k] \in \mathbb{W}^{p+1} \subset \mathbb{R}^{p+1}.
Step 1: define a loss function in terms of the learning parameters. The loss
function shall evaluate the error between the hypothesis Eq. (5.1) and the
label function Eq. (5.6). It can have various forms. The widely used one is
the L2 error function that is the error squared:
L(z(\hat{w})) = \big[z(\hat{w}; x) - y(x)\big]^2
= (\bar{x}\hat{w} - \bar{x}\hat{k})^\top (\bar{x}\hat{w} - \bar{x}\hat{k})    (5.7)
= (\hat{w} - \hat{k})^\top [\bar{x}^\top \bar{x}]\, (\hat{w} - \hat{k})
It is clear that the loss function L(z) is a scalar function of the prediction function z(ŵ), which is in turn a function of the vector of learning parameters ŵ. Therefore, L(z) is in fact a functional. It takes a vector ŵ ∈ W^{p+1} and produces a non-negative number in R. It is also quadratic in ŵ.
In the second line of Eq. (5.7), we first moved the transpose into the first pair of parentheses, and then factored x̄ᵀ and x̄ out of the two pairs of parentheses to form the outer-product matrix [x̄ᵀx̄], which is a (p + 1) × (p + 1) symmetric matrix of rank 1. All of this follows the matrix operation rules. Note that (ŵ − k̂) is a vector. Therefore, Eq. (5.7) is a standard quadratic form. If [x̄ᵀx̄] were SPD, L would have a unique minimum at (ŵ − k̂) = 0, or ŵ* = k̂. This would prove that the prediction function is capable of reproducing any linear
function uniquely in the feature space, and we would be done. However, because [x̄ᵀx̄] has only rank 1, we need to manipulate a little further for deeper insight.
Step 2: minimize the loss function with respect to ŵ ∈ Wp+1 .
This allows us to find the ŵ that is the stationary point of L, by setting the gradient to zero:
\frac{\partial L(\hat{w})}{\partial \hat{w}} = 2\,\bar{x}^\top \bar{x}\,(\hat{w} - \hat{k}) = 0    (5.8)
In the above, we again used the fact that x̄ᵀx̄ is symmetric. Regardless of its contents, Eq. (5.8) is satisfied when we set ŵ = ŵ* = k̂, which gives w* = k and b* = c. This is exactly the same as the result obtained previously. It proves that the prediction function is capable of reproducing any linear function in the feature space. Because x̄ᵀx̄ is rank deficient (rank = 1), the solution ŵ* = k̂ is not unique: there are other (possibly infinitely many) solutions that satisfy Eq. (5.8), and they differ from k̂ by vectors in the null-space of x̄ᵀx̄. This implies that we need more data-points to make the null-space vanish and obtain a unique solution. All these issues relate to the solution existence theory that demands sufficient quality data-points, which will be discussed in Chapter 9.
We have now analytically solved an ML problem using a typical minimization procedure to predict a continuous function. The problem we just examined is simple, but our analysis reveals the essential issues in an ML model using prediction functions. For more complicated problems with datasets of discrete data-points, we usually need computational means to solve them, but the essential concept is the same.
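A minimal numerical version of the same minimization is sketched below, done with gradient descent on the L2 loss over several data-points (so that x̄ᵀx̄ is no longer rank 1 and the solution becomes unique); all values are assumed for illustration.

import numpy as np

np.random.seed(1)
p, m = 2, 20
k_hat = np.array([0.5, 2.0, -1.0])          # [c, k]: the label parameters (assumed)
X = np.hstack([np.ones((m, 1)), np.random.rand(m, p)])   # rows are [1, x]
y = X @ k_hat                               # labels from the linear function

w_hat = np.zeros(p + 1)                     # initial learning parameters
lr = 0.1                                    # learning rate
for _ in range(2000):                       # simple gradient-descent loop
    grad = 2 * X.T @ (X @ w_hat - y) / m    # gradient of the mean L2 loss
    w_hat -= lr * grad
print(w_hat)                                # converges toward [c, k]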
5.1.5 Affine transformation
On the other hand, Eq. (5.1) can be used to perform an affine transfor-
mation, where weights wi (i = 1, 2, . . . , p) are responsible for (pure) linear
transformation and bias b is responsible for translation. Both wi and b are
learning parameters in a machine learning model. To show how Eq. (5.1) is
explicitly used to perform an affine transformation, we perform the following
maneuver in matrix formulations:
First, using each ŵi (i = 1, 2, . . . , k) and Eq. (5.1), we obtain
z_i = \bar{x}\,\hat{w}_i    (5.9)
Now, form the following vector,
z = [z_1, z_2, \ldots, z_k] = \left[\, \bar{x}\begin{bmatrix} b_1 \\ w_1 \end{bmatrix},\; \bar{x}\begin{bmatrix} b_2 \\ w_2 \end{bmatrix},\; \ldots,\; \bar{x}\begin{bmatrix} b_k \\ w_k \end{bmatrix} \right] = \bar{x}\begin{bmatrix} b \\ W \end{bmatrix} = \bar{x}\begin{bmatrix} W_0 \\ W \end{bmatrix}, \quad \text{or simply} \quad z = \bar{x}\,\hat{W}    (5.10)
where b = [b_1, b_2, \ldots, b_k], W_0 = [W_{01}, W_{02}, \ldots, W_{0k}] = b, and W = [w_1, w_2, \ldots, w_k].
Notice that Ŵ absorbs b as the hat of W, and hence collects all the learning parameters for predicting z. Our notation Ŵ allows easy tracking of the variables. It is often used for prediction at the output layer of a neural network, because z = x̄Ŵ and an affine transformation is no longer needed at the output layer.
With Eq. (5.10), we can further construct the following matrix opera-
tion [6].
\begin{bmatrix} 1 \\ z^\top \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ b^\top & W^\top \end{bmatrix}\begin{bmatrix} 1 \\ x^\top \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ W_0^\top & W^\top \end{bmatrix}\begin{bmatrix} 1 \\ x^\top \end{bmatrix}, \quad \text{or}
[1 \;\; z] = [1 \;\; x]\begin{bmatrix} 1 & b \\ 0^\top & W \end{bmatrix} = [1 \;\; x]\begin{bmatrix} 1 & W_0 \\ 0^\top & W \end{bmatrix}, \quad \text{or simply} \quad \bar{z} = \bar{x}\,\bar{W}    (5.11)
where 0 = [0, 0, . . . , 0] contains p zeros. The matrix W̄ we derived is the affine transformation matrix. It has a dimension of (p + 1) × (k + 1) and performs an affine transformation from space X̄^p to space X̄^k for a given x. It is used in the hidden layers of neural networks, because the transformations must occur in the affine space in these layers to ensure proper connections.
Matrix W̄ will be used in Chapter 13 when studying MLPs or deepnets. Note that W̄ can be written as [e1  Ŵ], in which e1 = [1  0  · · ·  0] is the first base vector of space W^{p+1}. W̄ contains this constant unit vector as its first column, and thus the trainable parameters are all in Ŵ. In the output layer of an NN, we use only Ŵ, because an affine transformation is no longer needed there. For machine learning models, W̄ can be called the affine transformation weight matrix.
The formulation given above reveals clearly how the affine transformation weight matrix is derived, and how the traditional weights and biases enter it. Finally, we note the following points:
1. We know that [1, x] is in X̄^p. After the action of W̄ (from the right) on it, the resulting [1, z] is clearly also in an affine space, X̄^k. This is known as the automorphism property of an affine transformation. A pattern in an affine space stays in an affine space after an affine transformation. We will demonstrate this in the example section.
2. The affine transformation uses only the b and w given in Eq. (5.1), except that they need to be arranged in the form of W̄ to act properly on x̄. This may be the reason why Eq. (5.1) is often called an affine transformation (although it is not exactly one, at least in concept). Equation (5.1) is commonly used in the actual computations of affine transformations, including many parts of this book. Our Eq. (5.11), however, allows a more concise formulation and shows the automorphism explicitly. The last equation in Eq. (5.11) and Eq. (5.10) can, and should, of course also be used for computation. If we do so, the code and data structure may be even neater, because we have only xw operations and no b is involved.
3. For given W0 (or b) and w, W̄ is a linear operator. The weights w are responsible for stretching-compression (scaling) and rotation, and W0 is responsible for translation. This property gives the affinity mentioned earlier. We will discuss this further in the following sections and demonstrate it in the example section.
4. For convenience, we will use both the xw formulation and the xw+b formulation.
Let us look at some special cases when Eq. (5.11) is used to perform affine transformations.
Case 1: if we set all learning parameters to zero, Ŵ = 0, we obtain [1 z] = [1 0], meaning that any data-point [1 x] in an affine space collapses to the same point [1 0] in another affine space.
Case 2: if we set b = 0 and W = I, where I is the identity matrix, we shall have [1 z] = [1 x]. This means that any original point in the affine space is unchanged (no transformation).
Case 3: if we set b = c, where c is a constant vector, and W = I, we shall have [1 z] = [1 c + x]. This means that any original point in the affine space is translated by c.
Case 4: if we set b = [c, 0, . . . , 0] and W = [k, e2 , . . . , ep ], where ei is a base vector of W^p (with all zero entries except a 1 at the ith entry), we obtain zi = xi (i = 2, . . . , p) and
z_1 = x\,k + c    (5.12)
which is Eq. (5.5). This means that the prediction of a linear function in the feature space can be viewed as an affine transformation in the affine space. Since k in W is the gradient of the function, it is responsible for rotation.
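A small numerical check of Eq. (5.11), with assumed random values, is sketched below: it builds the affine transformation matrix from b and W, acts on [1, x], and confirms that the result matches xW + b while the leading entry stays 1 (the automorphism).

import numpy as np

p, k = 3, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((p, k))             # weights (assumed values)
b = rng.standard_normal((1, k))             # biases (assumed values)
W_bar = np.block([[np.ones((1, 1)), b],     # the matrix [[1, W0], [0, W]]
                  [np.zeros((p, 1)), W]])   # shape (p+1, k+1)

x = rng.standard_normal((1, p))             # one data-point in the feature space
x_bar = np.hstack([np.ones((1, 1)), x])     # [1, x] in the affine space
z_bar = x_bar @ W_bar                       # [1, z]: stays in an affine space
print(z_bar[0, 0])                          # the first entry remains 1
print(np.allclose(z_bar[:, 1:], x @ W + b)) # True: identical to x W + b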
5.2 Affine Transformation Unit (ATU), A Simplest Network
A p → 1 neural network can be built for predicting arbitrary linear functions in the feature space, or for performing an affine transformation in the affine space. The typical architectures are shown in Fig. 5.2.
This net can be set to predict arbitrary linear functions, or to perform the affine transformation defined in Eq. (5.1) for a given data-point xi (i = 1, 2, . . . , p), using the sets of learning parameters wi (i = 1, 2, . . . , p) and b. Clearly, any change in the learning parameters results in a different value of the function z for the same given data-point xi (i = 1, 2, . . . , p).
Note that Fig. 5.2 is a basic unit or a building block that can be used
to form a complicated neural network. Therefore, let us write a code to
Figure 5.2: A p → 1 neural network with one layer neurons taking an input of p features,
and one output layer of just one single neuron that produces a single prediction function
z. This forms an affine transformation unit or ATU (or linear function prediction unit).
The net on the left is for xw+b formulation, and on the right is for xw formulation with
p + 1 neurons in the input layer in which the one at the top is fixed as 1. Both ATUs are
essentially identical.
study it in great detail using the following examples. Let us first discuss the
data structures. Different ML algorithms may use a different one, and the
following one is quite typical.
5.3 Typical Data Structures
p → 1 nets:
Equation (5.1) can be written in the matrix form with dimensionality clearly
specified as follows:
z(ŵ; x) = x w + b = x ŵ (5.13)
1×p p×1 1×1 1×(p+1) (p+1)×1
p
The prediction function z ∈ X is now clearly specified as a function of w
and b corresponding to any x ∈ Xp . For the ith data-point xi , we have
z(ŵ; xi ) = xi w + b = xi ŵ (5.14)
1×p p×1 1×1 1×(p+1) (p+1)×1
Note that z(ŵ; x) is still a scalar for one data-point. Also, because no further transformation follows, the augmented form of z is not needed for one-layer nets.
p → k nets:
In hyperspace cases, we have many, say k, neurons in the output of the current layer, each neuron performing (independently) an affine transformation based on the same dataset (see Fig. 5.13). Therefore, the output should be an array with k entries. The data may be structured in
matrix form:
[z_1 \;\; z_2 \;\cdots\; z_k] = [x_{i1} \;\; x_{i2} \;\cdots\; x_{ip}] \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1k} \\ W_{21} & W_{22} & \cdots & W_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{p1} & W_{p2} & \cdots & W_{pk} \end{bmatrix} + [b_1 \;\; b_2 \;\cdots\; b_k]    (5.15)
where the left-hand side is z(W, b; x_i) of size 1 × k, x_i is 1 × p, W is p × k, and b is 1 × k.
The above matrix can be written in a concise matrix form as follows, with
all the dimensionality specified clearly:
\underset{1\times k}{z(\hat{W}; x_i)} = \underset{1\times p}{x_i}\;\underset{p\times k}{W} + \underset{1\times k}{b} = \underset{1\times(p+1)}{\bar{x}_i}\;\underset{(p+1)\times k}{\hat{W}}    (5.16)
Note that one can stack up as many neurons in a layer as needed, because the weights for each neuron are independent of those for any other
neuron in the stack. This stacking is powerful because it makes the well-
known universal approximation theory (see Chapter 7) workable.
p → k nets with m data-points:
For a dataset with m points, the data may be structured as matrix Xm×p ,
by vertically stacking xi . In this case, m predictions can be correspondingly
made, and the formulation in matrix form becomes
\begin{bmatrix} z_1(W, b;\, x_1) \\ z_2(W, b;\, x_2) \\ \vdots \\ z_m(W, b;\, x_m) \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mp} \end{bmatrix} \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1k} \\ W_{21} & W_{22} & \cdots & W_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{p1} & W_{p2} & \cdots & W_{pk} \end{bmatrix} + \begin{bmatrix} b \\ b \\ \vdots \\ b \end{bmatrix}    (5.17)
where vector B has the same b for all entries. The above matrix can be
written in a concise matrix form as follows, with all the dimensionality
specified clearly:
\underset{m\times k}{Z(W, B)} = \underset{m\times p}{X}\;\underset{p\times k}{W} + \underset{m\times k}{B} = \underset{m\times(p+1)}{\bar{X}}\;\underset{(p+1)\times k}{\hat{W}}    (5.18)
Note that we do not actually form the matrix Z in practical computations, because when a loss function is constructed, it takes the form of a summation over the m data-points or over a mini-batch. We will see this frequently in later chapters on actual machine learning models.
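The shapes in Eqs. (5.15)-(5.18) can be sketched in NumPy as follows (a sketch with illustrative names; in practice the bias row b is simply broadcast over the m rows, so the matrix B never needs to be formed explicitly):
import numpy as np

m, p, k = 5, 3, 2                              # data-points, features, output neurons
rng = np.random.default_rng(1)
X = rng.normal(size=(m, p))                    # m x p dataset
W = rng.normal(size=(p, k))                    # p x k weight matrix
b = rng.normal(size=(1, k))                    # 1 x k bias row

Z = X @ W + b                                  # m x k predictions; b broadcast over rows

# Equivalent xw formulation with augmented X and W-hat
X_hat = np.hstack([np.ones((m, 1)), X])        # m x (p+1)
W_hat = np.vstack([b, W])                      # (p+1) x k
print(Z.shape, np.allclose(Z, X_hat @ W_hat))  # (5, 2) True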
5.4 Demonstration Examples of Affine Transformation
We now present a number of examples of affine transformations. This is
performed as follows.
For a given geometric pattern defined with a set of multiple data-points X ∈ X² (an affine space), its ith row is xi = [1, xi1, xi2], computed using Eq. (5.1):
zI = XŵI (5.19)
where ŵI = [bI , wI ] in which bI and wI are a given set of learning
parameters in the hypothesis space W3 , and
zII = XŵII (5.20)
where ŵII = [bII , wII ] in which bII and wII are a changed set of learning
parameters in W3 .
This results in a transformed data-point Z = [1, zI, zII] ∈ X².
The above procedure, which applies the affine transformation to the original dataset X ∈ X² while varying ŵ two times, results in a transformed dataset Z that lies in the same affine space X²: an automorphism.
We now write a code to demonstrate the affinity of the above transformation. Because X² is also a 2D plane, we can conveniently plot both original and transformed patterns together in the space R², using only zI and zII for visualization and analysis. We first define some functions.
import numpy as np
def logistic0(z): # The sigmoid/logistic function
return 1. / (1. + np.exp(-z))
def net(x,w,b): # An affine transformation net
y = np.dot(x,w) + b
return y
def edge(k,dd): # Define a line pattern in 2D
# feature space (x1, x2)
x = np.arange(-1.0,1.0+dd,dd) # dd: interval
x1 = x # x1 value.
x2 = k * x # x2 value, k slope of the line
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
def circle(r,dpi): # Define a circular pattern in 2D
# feature space (x1, x2)
x = np.arange(0.0,2*np.pi,dpi)
x1 = r*np.cos(x) # circle function, for x1 value.
# Radius r and scaling factor c.
x2 = r*np.sin(x) # for x2 value.
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
def rectangle(MinX, MaxX, MinY, MaxY,dd,theta):
# data for rectangle pattern in 2D space (x1, x2(y))
x = np.arange(MinX,MaxX+dd,dd)
ymin= np.full(x.shape,MinY) #y=x2
ymax= np.full(x.shape,MaxY)
y = np.arange(MinY,MaxY+dd,dd)
xmin= np.full(y.shape,MinX)
xmax= np.full(y.shape,MaxX)
x1 = np.append(np.append(np.append(x,xmax),np.flip(x)),xmin)
x2 = np.append(np.append(np.append(ymin,y),ymax),np.flip(y))
x1 = np.append(x1,(MaxX+MinX)/2) # add the center
x2 = np.append(x2,(MaxY+MinY)/2)
X1 = x1*np.cos(theta)+x2*np.sin(theta)
X2 = x2*np.cos(theta)-x1*np.sin(theta)
X = np.stack((X1, X2), axis=-1) # X has two components.
return X1,X2,X
def spiral(alpha,c):
# Define a spiral pattern in 2D space (x1, x2)
xleft,xright,xdelta = 0.0, 40.01, 0.1
x = np.arange(xleft,xright,xdelta)
x1 = np.exp(alpha*x)*np.cos(x)/c # logarithmic spiral
# function, x1 with decay rate alpha & scaling factor c.
x2 = np.exp(alpha*x)*np.sin(x)/c # x2 value.
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
Let us now set up learning parameters w and b.
# Set w0 & b0,so that original [x1,x2] pattern is reproduced.
w0_I=np.array([1.,0.]) # Initial 2 weights (vector),2D space
b0_I=0 # Initial bias, so that z_I = x1
w0_II=np.array([0.,1.]) # Initial 2 weights (vector),2D space
b0_II=0 # Initial bias, so that z_II = x2
# Set wI=w0_I, b_I=b0_I; but w_II, b_II be arbitrary.
#w_I, b_I = w0_I, b0_I # readers may try this
w_I=np.array([.8,.2]) # Arbitrary values for the two weights
# to perform scaling and rotation
b_I = -0.6 # Arbitrary values for the bias to
# perform translation
w_II=np.array([.2,.5]) # Arbitrary values for the two weights
# to perform scaling and rotation
b_II = 0.6 # Arbitrary values for the bias to
# perform translation
Next, we define a function for plotting these patterns: the initial one, the affine transformed one using (w_I, b_I) and (w_II, b_II), and the linear transformed one using the same weights but with b_II set to 0.
%matplotlib inline
import matplotlib.pyplot as plt
def affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II):
plt.figure(figsize=(4.,4.),dpi=90)
plt.scatter(net(X,w0_I,b0_I),net(X,w0_II,b0_II),label=\
"Original: w0I=["+str(w0_I[0])+","+str(w0_I[1])+
"], b0I="+str(b0_I)+"\n w0II=["+str(w0_II[0])+","+
str(w0_II[1])+"], b0II="+str(b0_II),s=10,c='orange')
#plot the initial pattern
plt.scatter(net(X,w_I,b_I),net(X,w_II,b_II),label=\
"Affine: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], bII="+str(b_II),s=10,c='blue')
# plot the affine transformed pattern
plt.scatter(net(X,w_I,b_I),net(X,w_II,b0_II),label=\
"Linear: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], b0II="+str(b0_II),s=10,c='red')
#plot the linear transformed pattern
plt.xlabel('$z_{I}$')
plt.ylabel('$z_{II}$')
plt.title('linear and affine transformation')
plt.grid(color='r', linestyle=':', linewidth=0.3)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.axis('scaled')
#plt.ylim(-5,9)
plt.show()
5.4.1 An edge and a rectangle under affine transformation
We first create an edge (straight line segment) and a rectangle pattern
represented by a set of orange points, perform the affine transformations
defined by Eq. (5.1), and plot out these patterns before and after the affine
(blue) and linear (red) transformations. Because a rectangle consists of
straight lines, it is easy for us to observe the affinity of the transformation.
x1,x2,X = edge(1.5,0.2)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.3: Affine transformations of a straight line/edge.
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.4: Affine transformations of a rectangle.
From the above two figures, the following observations can be made:
1. After the affine transformation, the original (orange) rectangular pattern is only rotated, scaled, sheared, and translated to a new one (blue). The weights w are responsible for the linear part of the transformation, which produces the scaling, rotation, and shear, while b is responsible for the translation.
2. The transformation moves point to point, edge to edge, and quadrilateral
to quadrilateral.
3. The transformation preserves the ratio of the lengths of parallel line
segments. For example, the ratio of the two longer sides of the orange
rectangle is the same as the ratio of the two longer sides of the blue
quadrilateral.
4. Parallel line segments remain parallel after the affine transformation.
5. It does not preserve distances between points. It preserves only the ratios
of the distances between points lying on a straight line.
6. The affine transformation does not preserve angles between lines.
This simple demonstration helps one to imagine how an affine transformed pattern covers the (same) space by changing w and b. The pure linear transformation alone does not change the origin, and hence has a much more limited coverage.
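The observations above can also be verified numerically. The following sketch reuses the net() function and the (arbitrarily chosen) parameters w_I, b_I, w_II, b_II defined earlier, and checks that ratios of distances along a line and parallelism are preserved, while distances themselves are not:
import numpy as np

def affine_map(P):                    # maps a 2D point to (z_I, z_II)
    return np.array([net(P, w_I, b_I), net(P, w_II, b_II)])

A, B, C = np.array([0., 0.]), np.array([1., 0.]), np.array([3., 0.])  # collinear points
Ai, Bi, Ci = affine_map(A), affine_map(B), affine_map(C)

# Ratio of distances along a straight line is preserved
r_before = np.linalg.norm(C - A) / np.linalg.norm(B - A)
r_after = np.linalg.norm(Ci - Ai) / np.linalg.norm(Bi - Ai)
print(np.isclose(r_before, r_after))      # True

# Parallel segments remain parallel: their image directions stay proportional
d1 = affine_map(np.array([1., 1.])) - affine_map(np.array([0., 0.]))
d2 = affine_map(np.array([3., 2.])) - affine_map(np.array([2., 1.]))  # same direction (1, 1)
print(np.isclose(np.cross(d1, d2), 0.0))  # True: zero cross product

# Distances themselves are generally not preserved
print(np.linalg.norm(B - A), np.linalg.norm(Bi - Ai))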
5.4.2 A circle under affine transformation
Let us now take a look at the affine (and linear) transformation to a circle
using the same code.
x1,x2,X = circle(1.0,0.1)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.5: Affine transformations of a circle.
This time, it is clearly seen from Fig. 5.5 that after the transformation, the
original (orange) circular pattern is rotated, scaled, sheared, and translated
to an ellipse (blue). The observations that we made for the rectangles are still valid.
5.4.3 A spiral under affine transformation
Let us examine the affine (and linear) transformation to a more complicated
pattern, spiral.
x1,x2,X = spiral(0.1,10.)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.6: Affine transformations of a spiral.
In this case, the original (orange) spiral is rotated, scaled, sheared, and
translated to a new one. The observations made for the above two examples
still hold.
5.4.4 Fern leaf under affine transformation
Figure 5.7 shows an excellent example of affine transformation available
in the public domain provided by António Miguel de Campos (https://
en.wikipedia.org/wiki/Affine transformation). It is an image of a fractal of
Barnsley’s fern (https://en.wikipedia.org/wiki/Barnsley fern). A leaf of the
fern is an affine transformation of another by a combination of rotation,
scaling, reflection, and translation. The red leaf, for example, is an affine
transformation of the dark blue leaf, or of any of the light blue leaves. The fern seems to have this typical pattern coded as an ATU in its DNA. This implies that the ATU is as fundamental as DNA.
Figure 5.7: An image of a leaf of the fern-like fractal is an affine transformation of another.
5.4.5 On linear prediction function with affine transformation
One should not confuse the linear prediction function with the affine
transformation. They are essentially the same hypothesis, but viewed from
different aspects. The former is on the predictability of an arbitrary linear
function in the feature space using hypothesis Eq. (5.1), and the latter is on
affinity when Eq. (5.1) is used for pattern transformation on the affine space.
The predictability of the arbitrary linear function enables affine transforma-
tion. The predictability for constants allows translational transformation,
and the predictability for the linear function enables variable-wise scaling
and rotation, while maintaining the affinity. The prediction of a linear
function in the feature space can be viewed as an affine transformation in
the affine space. Because of this, we use these two terms interchangeably,
knowing this subtle difference.
5.4.6 Affine transformation wrapped with activation function
When an affine transformation z(w, b; x) is wrapped with a nonlinear activa-
tion function (see Chapter 7), the output φ(z) is confined by the activation
function, and the affinity is destroyed. However, φ(z) shall now have some
capability to predict nonlinear functions, because the activation functions
used in ML are continuous, smooth (at least piecewise differentiable), and
vary monotonically with z.
As an example, we wrap the affine transformed pattern with the sigmoid
function. In this case, φ(z) is confined in (0, 1). We write the following code
to demonstrate some examples of affine mapping:
# Code for Affine transformation wrapped with sigmoid function
def sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II):
# We do the same affine transformation and put the
# results to a sigmoid.
plt.figure(figsize=(4.5, 3.0),dpi=100)
plt.scatter(logistic0(net(X,w0_I,b0_I)),
logistic0(net(X,w0_II,b0_II)),label=\
"Original: w0I=["+str(w0_I[0])+","+str(w0_I[1])+
"], b0I="+str(b0_I)+"\n w0II=["+str(w0_II[0])+","+
str(w0_II[1])+"], b0II="+str(b0_II),s=10,c='orange')
#plot the initial pattern
plt.scatter(logistic0(net(X,w_I,b_I)),
logistic0(net(X,w_II,b_II)),label=\
"Affine: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], bII="+str(b_II),s=10,c='blue')
# plot the affine transformed pattern
plt.scatter(logistic0(net(X,w_I,b_I)),
logistic0(net(X,w_II,b0_II)),label=\
"Linear: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], b0II="+str(b0_II),s=10,c='red')
#plot the linear transformed pattern
plt.xlabel('$\sigma (z_{I})$')
plt.ylabel('$\sigma (z_{II})$')
plt.title('Affine transformation wrapped with sigmoid')
plt.grid(color='r', linestyle=':', linewidth=0.3)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#plt.axis('scaled')
plt.ylim(-.05,1.1)
plt.show()
x1,x2,X = edge(5.0,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.8: Nonlinear activation function wrapped affine transformations of a straight
line/edge.
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.9: Nonlinear activation function wrapped affine transformations of a rectangle.
For this sigmoid wrapped affine transformation, the following observa-
tions can be made:
1. The transformation still sends a point to a point, an edge to an edge
uniquely. The use of the nonlinear activation function does not change
the uniqueness of the point-to-point transformation. This is because
the activation functions are continuous, smooth (at least piecewise
differentiable), and vary monotonically.
2. The affinity is, however, destroyed: the ratios of distances between points
lying on a straight line are changed. Not all the parallel line segments
remain parallel after the sigmoid transformation. The use of the sigmoid
function clearly brings nonlinearity. This gives the net the following
capabilities:
• The output φ(z(ŵ; x)) is now nonlinearly dependent on the features x. One can now use it for logistic regression with labels given as 0 or 1, by training ŵ.
• φ(z(ŵ; x)) is linearly independent of the features x in the input. This allows further affine transformations to be carried out in a chain to the next layer if needed.
• φ(z(ŵ; x)) is also linearly independent of the learning parameters ŵ used in this layer. When we need more layers in the net, fresh ŵs can now be used for the next layers independently. This enables the creation of deepnets.
Let us now take a look at the wrapped affine transformation to a circle
function, using the same code.
x1,x2,X = circle(2.5,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.10: Nonlinear activation function wrapped affine transformations of a circle.
This case shows a more severe shape distortion. The uniqueness of the point-to-point transformation is still preserved. The following applies the wrapped affine transformation to the spiral pattern.
x1,x2,X = spiral(0.1,10.)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.11: Nonlinear activation function wrapped affine transformations of a spiral.
We see severe distortions, due again to the nonlinearity of the activation function, the sigmoid. The point-to-point transformation is still observed, but where σ(z) is near 0.0 or 1.0, the original points and the transformed points are "squashed" closer together by the sigmoid. Hence, information near there is transmitted much less effectively through updates of the learning parameters w and b. In other words, the gradient gets close to zero, due again to the saturation property of the sigmoid function.
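This squashing can be quantified with the derivative of the sigmoid, dσ/dz = σ(z)(1 − σ(z)), which approaches zero as σ(z) approaches 0 or 1. A minimal check (a sketch, independent of the plotting code above):
import numpy as np

def dlogistic0(z):                        # derivative of the sigmoid/logistic function
    s = 1. / (1. + np.exp(-z))
    return s * (1. - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    s = 1. / (1. + np.exp(-z))
    print(f"z={z:5.1f}  sigma={s:.5f}  dsigma/dz={dlogistic0(z):.6f}")
# The gradient is largest at z = 0 and practically vanishes once sigma(z) is near 0 or 1.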
5.5 Parameter Encoding and the Essential Mechanism
of Learning
5.5.1 The x to ŵ encoding, a data-parameter converter unit
Based on Eq. (5.1), we now see a situation where a dataset (x, z) can
be encoded in a set of learning parameters ŵ in the hypothesis space.
Figure 5.12 shows schematically such an encoded state.
The straight lines in Fig. 5.12 are encoded with a point in the hypothesis
space W2 . For example, the red line is encoded by a red dot at w0 = 1 and
w1 = 1. In other words, using w0 = 1 and w1 = 1, we can reproduce the red
line. The blue line is encoded by a blue dot at w0 = 2 and w1 = −0.5, with
which the blue line can be reproduced. The same applies to the black line. A
machine learning process is to produce an optimal set of dots using a dataset.
After that, one can then produce lines. In essence, a machine learning model
converts or encodes the data to wi . This implies that the size and quality of
the dataset are directly related to the dimension of the affine spaces used in the model.
Figure 5.12: Data (on relations of x-z or x-y for given labels) encoded in model parameters ŵ in the hypothesis space. In essence, an ML model converts data to ŵ during training.
On the other hand, if one tunes wi , different prediction functions can be
produced in the label space. Therefore, it is possible to find such a set of
wi that makes the prediction match the given label in the dataset for given
data-points. Finding such a set of wi is the process of learning. Real machine learning models are a lot more complicated, but this gives the essential mechanism.
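A minimal numerical illustration of this encoding (with hypothetical numbers consistent with Fig. 5.12): two points sampled from a line in the x-z space determine the pair (w0, w1) uniquely, and that same pair reproduces the line at any new x:
import numpy as np

# Two data-points (x, z) sampled from the red line z = 1 + 1*x in Fig. 5.12
X_hat = np.array([[1.0, 0.0],      # augmented rows [1, x]
                  [1.0, 2.0]])
z = np.array([1.0, 3.0])

w_hat = np.linalg.solve(X_hat, z)  # encode the data into [w0, w1]
print(w_hat)                       # [1. 1.]  -> the red dot at w0 = 1, w1 = 1

# Decoding: the same w_hat reproduces the line at any new x
x_new = np.array([1.0, 5.0])       # augmented point [1, x] with x = 5
print(x_new @ w_hat)               # 6.0, i.e., z = 1 + 1*5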
5.5.2 Uniqueness of the encoding
We state that the encoding of a line in the X-z space to a point in the
hypothesis space is unique. It is very easy to prove as follows.
Assume that an arbitrary line in the X-z space corresponds to two distinct points ŵ(1) and ŵ(2) in the hypothesis space; using Eq. (5.9), this line can be expressed as
z = xŵ(1) (5.21)
which holds for arbitrary x. This same line can also be expressed as
z = xŵ(2) (5.22)
which holds also for arbitrary x. Using now Eqs. (5.21) and (5.22), we obtain
0 = x[ŵ(2) − ŵ(1) ] (5.23)
Equation (5.23) must also hold for arbitrary x. Therefore, we shall have
ŵ(1) = ŵ(2) (5.24)
which completes the proof.
On the other hand, we also state that a point in the hypothesis space
gives a unique line. It can be easily proven as follows.
Given any arbitrary point in the hypothesis space ŵ, assume we can
construct two lines in the X-z space. Using Eq. (5.9), the first line can be
expressed as
z (1) = xŵ (5.25)
The second line can be expressed as
z (2) = xŵ (5.26)
Using Eqs. (5.25) and (5.26), we obtain
z (2) − z (1) = 0 (5.27)
This means that these two lines are the same, which completes the proof.
In fact, the uniqueness can be clearly observed from Fig. 5.12, because a
line is uniquely determined by its slope and bias, and both are given by ŵ.
The uniqueness is one of the most fundamental reasons for a quality
dataset to be properly encoded with the learning parameters based on the
hypothesis of affine transformations, or for a machine learning model to be
capable of reliably learning from data.
5.5.3 Uniqueness of the encoding: Not affected
by activation function
It is observed that the uniqueness of the encoding of an affine transformation
wrapped with an activation function is not affected by the activation
function. This is because activation functions are all strictly monotonic
functions (as will be shown in Chapter 7), which does not change the
uniqueness of its argument. The proof is thus essentially the same as that
given above.
On the other hand, this property implies that the activation function
must be monotonic. This is true for all the activation functions discussed in
Chapter 7.
5.6 The Gradient of the Prediction Function
The gradient of the prediction function with respect to the learning
parameters has the following simple forms:
∇w z = x,    ∇b z = 1    (5.28)
which shows that the gradient with respect to the weights is simply the feature variable (the data). The gradient with respect to the bias is, however, unity. This is because the data entry corresponding to the bias is x0 = 1. This may suggest that when a regularization technique (see Chapter 14) is used, one may choose different regularization parameters for the weights and the bias.
When the xw formulation is used, we have
∇ŵ z = x    (5.29)
where x now includes the leading entry of 1; this gradient can be used by the autograd in machine learning processes.
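Equation (5.28) can be verified with a quick finite-difference check (a sketch; the names are illustrative):
import numpy as np

rng = np.random.default_rng(2)
p = 4
x = rng.normal(size=p)
w = rng.normal(size=p)
b = 0.3
z = lambda w, b: np.dot(x, w) + b          # the linear prediction function

eps = 1e-6
grad_w = np.array([(z(w + eps*np.eye(p)[i], b) - z(w, b)) / eps for i in range(p)])
grad_b = (z(w, b + eps) - z(w, b)) / eps

print(np.allclose(grad_w, x))      # True: gradient w.r.t. the weights equals the data x
print(np.isclose(grad_b, 1.0))     # True: gradient w.r.t. the bias is 1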
5.7 Affine Transformation Array (ATA)
We can now duplicate the single neuron on the right in Fig. 5.2 vertically k times and let all neurons become densely connected, meaning that each neuron in the output is connected with each of the neurons in the input (also known as fully connected). This forms a p → k network.
The prediction functions z from an ATA can be expressed as
z(Ŵ; x) = xW + b = x Ŵ (5.30)
where z is a vector of prediction functions given by z(Ŵ; x) = [z1(ŵ1; x), z2(ŵ2; x), . . . , zk(ŵk; x)], in which
ŵj = [W0j, W1j, W2j, . . . , Wpj] with W0j = bj are the (p + 1) learning parameters for the jth neuron in the output layer,
Ŵ = [b; W] is a matrix of (p + 1) × k containing all the learning parameters for all neurons in the output layer (the bias row b stacked on top of W),
W = [Wij] (i = 1, 2, . . . , p; j = 1, 2, . . . , k) is a matrix of weights (part of the learning parameters), and
b = [b1, b2, . . . , bk] is a vector of biases (part of the learning parameters).
The vector ŵ of total learning parameters of a p → k ATA becomes,
ŵ = [ŵ1 , ŵ2 , . . . , ŵk ]
Figure 5.13: A p → k neural network with one input layer of p neurons and one output
layer of k neurons that produces k prediction functions zi (i = 1, 2, . . . , k). Each neuron
at the output connects to all the neurons in the input with its own weights. This stack of ATUs forms an affine transformation array or ATA. In other words, the p → k net
has the predictability of k functions in the feature space with p dimensions. Left: xw+b
formulation; Right: xw formulation.
which can also be regarded as the flattened Ŵ. The total number of the
learning parameters is
P = (p + 1) × k
It is clear that the hypothesis space grows fast in multiples for an ATA.
Equation (5.30) is the matrix form of a set of affine transformations.
It is important to note that each zj(wj, bj) is computed using Eq. (5.1), using its own weights wj and bias bj. This enables all zj(wj, bj), j = 1, 2, . . . , k, to be independent of each other. Therefore, the ATA given in
Fig. 5.13 creates the simplest mapping that can be used for k-dimensional
regression problems using a dataset with p features. Note also that when
k = p, it can perform the p → p affine transformation.
5.8 Predictability of High-Order Functions of a Deepnet
5.8.1 A role of activation functions
Now, we wrap the stack of prediction functions with nonlinear activation
functions. This leads to a vector of
[φ(z1(w1, b1)), φ(z2(w2, b2)), . . . , φ(zk(wk, bk))]    (5.31)
Each entry becomes a new feature xi^(new), i = 1, 2, . . . , k; these new features are linearly
independent of the original features xi , i = 1, 2, . . . , p. These new features
can then be used as inputs to the next layer. This allows the use of a new
set of learning parameters for the next layer.
It is clear that a role of the activation function is to force the outputs of an ATA to be linearly independent of those of the previous ATA, enabling further affine transformations and leading to a chain of ATAs: a deepnet. To fulfill this important role, the activation function must be nonlinear.
Such a chain of stacks of prediction functions (affine transformations)
wrapped with nonlinear activation functions gives a very complex deepnet,
resulting in a complex prediction function. Further, when affine transfor-
mation Eq. (5.1) is replaced with spatial filters, one can build a CNN
(see Chapter 15) for object detection, and when replaced with temporal
filters, we may have an RNN (see Chapter 16) for time sequential models,
and so on.
5.8.2 Formation of a deepnet by chaining ATA
These new features given in Eq. (5.31) can now be used as the inputs for the
next layer to form a deepnet. To illustrate this more clearly, we consider a
simplified deepnet with 4 − 2 − 3 neurons shown in Fig. 5.14.
Figure 5.14: Schematic drawing of a chain of stacked affine transformations wrapped with
activation functions in a deepnet for approximation of high-order nonlinear functions of
high dimensions. This case is an xw+b formulation. A deepnet using xw formulation will
be given in Section 13.1.4.
Here, let us use the number in parentheses to indicate the layer number:
1. Based on the 4 (independent input) features xi^(1) (i = 1 ∼ 4) to the first layer, a stack of 2 affine transformations zi^(1) (i = 1 ∼ 2) takes place, using a 4 × 2 weight matrix W^(1) and biases bi^(1) (i = 1 ∼ 2). Affine transformation z1^(1) uses wi1^(1) (i = 1 ∼ 4) and b1^(1), and z2^(1) uses wi2^(1) (i = 1 ∼ 4) and b2^(1). Clearly, these are carried out independently using different sets of weights and biases.
2. Next, z1^(1) and z2^(1) are, respectively, subjected to a nonlinear activation function φ, producing 2 new features xi^(2) (i = 1 ∼ 2). Because of the nonlinearity of φ, xi^(2) will no longer depend linearly on the original features xi^(1) (i = 1 ∼ 4).
3. Therefore, xi^(2) (i = 1 ∼ 2) can now be used as independent inputs for the 2nd layer of affine transformations, using a 2 × 3 weight matrix W^(2) and biases bi^(2) (i = 1 ∼ 3), in the same manner. This results in a stack of 3 affine transformations zi^(2) (i = 1 ∼ 3), which can then be wrapped again with nonlinear activation functions. This completes the 2nd layer of 3 stacked affine transformations in a chain.
The above process can continue as desired to increase the depth of the neural
network. Note also that the number of neurons in each layer can be arbitrary
in theory. Because of the stacking and chaining, the hypothesis space is
greatly increased. The stacking causes the increase in multiples, and the
chaining in additions. The prediction functions may live in an extremely high
dimensional space WP for deepnets. For this simple deepnet of 4 − 2 − 3, the
dimension of the hypothesis space becomes P = (4× 2+ 2)+ (2× 3+ 3) = 19.
In general, for a net of p − q − r − k, for example, the formulation should be
P = (p × q + q) + (q × r + r) + (r × k + k)    (5.32)
where the three groups correspond to layer 1, layer 2, and layer 3, respectively.
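Equation (5.32) generalizes to any list of layer widths. A small helper function (a sketch, not from the book's code) confirms the count of 19 for the 4 − 2 − 3 net:
def count_parameters(layers):
    # Number of learning parameters P of a dense net, e.g. layers = [p, q, r, k]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

print(count_parameters([4, 2, 3]))     # 19 = (4*2 + 2) + (2*3 + 3)
print(count_parameters([4, 2, 3, 5]))  # one more layer: 19 + (3*5 + 5) = 39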
The vector of all trainable parameters in an MLP becomes
ŵ = [Ŵ^(1).flatten(), Ŵ^(2).flatten(), . . . , Ŵ^(N_L).flatten()]
where NL is the total number of hidden layers in the MLP. Note that we
may not really perform the foregoing flattening in actual ML models. It is
just for demonstrating the growth of the dimension of the hypothesis space.
In actual computations, we may simply group them in a Python list, and
use an important autograd algorithm to automatically perform the needed
forward and backward computations in training an MLP. The computations over such a high dimension are carried out numerically. This will be discussed in detail in Chapter 8.
Figure 5.15: A 1 → 1 → 1 net with sigmoid activation at the hidden and last layers.
The prediction functions are now functions of ŵ in a high-dimensional hypothesis space W^P for MLPs. The formulation on calculating P for more
general MLPs will be given in Chapter 13. The Neurons-Samples Theory
that gives the relationship between the number of neurons and the number
of data-points will be discussed in Section 13.2.1.
5.8.3 Example: A 1 → 1 → 1 network
Consider the simplest 1 → 1 → 1 neural network shown in Fig. 5.15. Let us
use the same linear prediction function Eq. (5.1) and a sigmoid activation
function for the hidden and last layers. In this case, the output at the last
layer, x^(3), can be obtained as follows:
x^(3) = σ(z^(2)) = σ(w^(2) x^(2) + b^(2)) = 1 / (1 + exp(−(w^(2) x^(2) + b^(2))))    (5.33)
where x^(2) is the output from the hidden layer:
x^(2) = σ(z^(1)) = σ(w^(1) x^(1) + b^(1)) = 1 / (1 + exp(−(w^(1) x + b^(1))))    (5.34)
in which x^(1) = x, which is the input that can be normalized to lie in (−1, 1). The number in parentheses in the superscript stands for the layer number. Because x^(2) is in (0, 1), we next use the Taylor expansion consecutively twice to approximate the sigmoid function; we obtain
x^(3) = c0 + c1 x + c2 x² + c3 x³ + · · ·    (5.35)
where these constants are given, through a lengthy but simple derivation, by
c0 = −(1/16) b^(1) b^(2) w^(1) (b^(1) − b^(2) w^(1)) − (1/48) b^(2) w^(1) ((b^(2) w^(1))² − 12) − (1/48) b^(1) ((b^(1))² − 12)
c1 = −(1/16) w^(1) w^(2) ((b^(1))² + 2 b^(1) b^(2) w^(1) + (b^(2) w^(1))² − 4)    (5.36)
c2 = −(1/16) (w^(1))² (w^(2))² (b^(1) + b^(2) w^(1))
c3 = −(1/48) (w^(1))³ (w^(2))³
It is clear from Eq. (5.36) that by properly setting (training using a dataset) the weights and biases for the neurons in the hidden and output layers, all these constants ci, i = 0, 1, 2, 3, can be determined. The prediction function at the output, x^(3), becomes a 3rd-order polynomial of the feature x as given in Eq. (5.35). This means that our 1 → 1 → 1 net has the predictability, approximately, for 3rd-order functions, in contrast to a 1 → 1 net that is only capable of 1st-order functions.
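This claim can be checked numerically: for fixed (hypothetical) weights and biases, the output x^(3) of the 1 → 1 → 1 net is fitted closely by a cubic polynomial over a normalized input range. A minimal sketch:
import numpy as np

w1, b1, w2, b2 = 1.5, 0.2, 2.0, -0.5           # hypothetical learning parameters
sigma = lambda z: 1. / (1. + np.exp(-z))

x = np.linspace(-1., 1., 200)                  # normalized inputs
x2 = sigma(w1 * x + b1)                        # hidden-layer output, Eq. (5.34)
x3 = sigma(w2 * x2 + b2)                       # net output, Eq. (5.33)

coeffs = np.polyfit(x, x3, 3)                  # least-squares cubic fit [c3, c2, c1, c0]
residual = np.max(np.abs(np.polyval(coeffs, x) - x3))
print(coeffs)
print(residual)                                # small: a cubic captures x3 closely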
Note that the same analysis can be performed for other types of activation functions. Also, if we retain more higher-order terms in the Taylor series, we can approximate even higher-order functions. The point of our discussion here
is not about how well to approximate the sigmoid function, but to show the
capability of the simple 1 → 1 → 1 net. This simple analysis supports a
very important fact that adding layers gives the net the capacity to predict
higher-order nonlinear latent behavior. This is the reason why a deepnet is
powerful if it can be effectively trained.
We note, without further elaboration, that increasing the depth of a net is
equivalent to increasing the order of the shape functions in the physics-law-
based models such as the FEM [1] or the meshfree methods [3]. In contrast,
increasing the number of neurons in a layer is equivalent to increasing the
number of the elements or nodes [5].
5.9 Universal Prediction Theory
A deepnet can be established with the following important properties:
1. The capability of the linear prediction function (affine transformation)
in predicting exactly any function up to the first order, and one-to-one
unique transformation (with or without activation function) in each ATU.
2. The independence of the function approximation (affine transformation)
of a neuron to another in an ATA (due to its independent connections).
3. New independent features are produced in each layer using nonlinear
activation functions for each ATA.
4. Chaining of ATAs wrapped with nonlinear activation functions provides the capability of predicting complex nonlinear functions to arbitrarily high order.
In the opinion of the author, these are the fundamental reasons for various
types of deepnets being capable of creating p → k mappings for extremely
complicated problems from p inputs of features to k labels (targeted features)
existing in the dataset. We now summarize our discussion to a Universal
Prediction Theory [6].
Universal Prediction Theory: A deepnet with sufficient layers of suffi-
cient neurons wrapped with nonlinear activation functions can be established
for predictions of latent features existing in a dataset when properly trained.
This theory claims only the capability of a deepnet in terms of creating
giant prediction functions based on the dataset. How to realize its capability
requires a number of techniques, including how to set up a proper structure
of the deepnet for given types of problems and how to find these optimal
learning parameters reliably and effectively. In addition, the applicability of
the trained MLP model depends on the quality of the dataset, as mentioned
in Section 1.5.6. The dataset quality is defined as its representativeness of the underlying problem to be modeled, including correctness, size, data-point distribution over the feature space, and noise level.
5.10 Nonlinear Affine Transformations
Note that in the above formulations, the features xi , i = 1, 2, . . . , p are used
in an affine transformation as linear basis functions. However, the basis func-
tions do not have to be linear. Take a one-dimensional problem, for example;
when linear approximation is used, the vector of the features should be
x = [1, x] (5.37)
If one would like to use a 2nd-order approximation (oftentimes called nonlinear regression), the vector of the features simply becomes
x = [1, x, x²]    (5.38)
If one knows the dataset well and believes that a particular function can be
used as a basis function, one may simply add it as an additional feature. For
example, one can include sin(x) as a feature in the following form:
x = [1, x, sin(x)] (5.39)
The use of nonlinear functions as bases for features is also related to the
so-called support vector machine (SVM) models that we will discuss in
Chapter 6, where we use kernel functions for linearly un-separable classes.
This kind of nonlinear feature basis or kernel is sometimes called feature
functions.
In our neural network models, higher-order and enrichment basis func-
tions can also be used in higher dimensions. For example, for two-dimensional
spaces, we may have features like
x = [1, x1, x2, x1 x2, x1², x2², sin(x)]    (5.40)
This is a complete set of 2nd-order polynomial basis functions, enriched with a sin(x) function (with proper scaling of x). The dataset X shall also
be arranged in the order of all these features. The affine transformation
using the nonlinear basis functions can be exactly the same as using linear
basis discussed in this chapter. In fact, we have already done the affine
transformations for the circle and spiral. Note that for high-dimensional
problems, the feature space with nonlinear bases can be in extremely higher
dimensions. For such cases, the so-called kernel trick may apply to avoid
dimension increase.
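As an illustration of Eq. (5.40) (a sketch with hypothetical parameter values, taking sin(x1) as the enrichment term), the nonlinear feature map can be built explicitly and then fed to exactly the same kind of affine transformation used throughout this chapter:
import numpy as np

def feature_map(x1, x2):
    # 2nd-order polynomial basis enriched with sin(x1), in the spirit of Eq. (5.40)
    return np.stack([np.ones_like(x1), x1, x2, x1*x2, x1**2, x2**2, np.sin(x1)], axis=-1)

rng = np.random.default_rng(3)
x1 = rng.uniform(-1., 1., size=10)
x2 = rng.uniform(-1., 1., size=10)

X_feat = feature_map(x1, x2)             # 10 x 7 feature matrix
w_hat = rng.normal(size=7)               # hypothetical learning parameters (bias included)
z = X_feat @ w_hat                       # still a plain affine transformation, now in
print(X_feat.shape, z.shape)             # the enriched feature space: (10, 7) (10,)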
5.11 Feature Functions in Physics-Law-based Models
This concept of using higher-order or special feature functions is essentially
the same as in the physics-law-based computational methods. In these
physics-law-based methods, the feature functions are called basis func-
tions. These basis functions are used to approximate the field variables (displacements, stress, velocity, pressure, etc.) that are governed by a physics law in either strong or weak form.
For example, in the finite element approximation [1], and the smoothed
finite element methods [2], we frequently use higher-order polynomial bases
called higher-order elements. Special basis functions are also used but called
enrichment functions. For example, when we would like to capture the singular stress field in the domain, we add √r to the bases [2]. In the
meshfree methods [3], one can also use distance basis functions, such as
the radial basis functions (RBFs).
In methods used to solve linear mechanics problems governed by physics
laws, we often use high-order and special basis functions. The resulting
system equations will still be linear in the field variables. The essential
concept is the same: to capture necessary features in the system (governed by
law or hidden in data), one shall use feature or basis functions of necessary
complexity. The resulting models may still be linear in field variables.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-
Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space Theory
and Weakened Weak Forms, World Scientific, London, 2013.
[5] G.R. Liu, A Neural Element Method, International Journal of Computational Methods,
17(07), 2050021, 2020.
[6] G.R. Liu, A thorough study on affine transformations and a novel Universal Prediction Theory, International Journal of Computational Methods, 19(10), in press, 2022.
Chapter 6
The Perceptron and SVM
This chapter discusses two fundamentally important supervised machine
learning algorithms, the Perceptron and the Support Vector Machine (SVM),
for classification. Both are conceptually related, but very much different in
formulation and algorithm. In the opinion of the author, there are a number
of key ideas used in both classifiers in terms of computational methods. These
ideas and the resulting formulations are very inspiring and can be used for
many other machine learning algorithms. We hope the presentation in this
chapter can help readers to appreciate these ideas. The referenced materials
and much of the codes are from the Numpy documentations (https://
numpy.org/doc/), Scikit-learn documentations (https://scikit-learn.org/
stable/), mxnet-the-straight-dope (https://github.com/zackchase/mxnet-
the-straight-dope), Jupyter Notebook (https://jupyter.org/), and Wikipedia
(https://en.wikipedia.org/wiki/Main Page).
Our discussion starts naturally from the Perceptron. It is one of the earliest machine learning algorithms, developed by Frank Rosenblatt in 1957 [1, 2]. It is used for problems of binary classification and was one of the well-studied classification problems in the 1960s. Here, we first introduce the mathe-
matical model of a typical classification problem with the related formulation
and its connection to affine transformation. We then examine in detail
Coli’s Perceptron algorithm that is made available at mxnet-the-straight-
dope (https://github.com/zackchase/mxnet-the-straight-dope) (under the
Apache License 2).
The discussion on the Perceptron is naturally followed by that on SVM
[3, 4, 9]. A complete description of the SVM formulation is provided,
including detailed process leading to a quadratic programming problem, the
kernel trick for linearly inseparable datasets, as well as the concept of affine
transformation used in SVM.
6.1 Linearly Separable Classification Problems
Let us consider the following problem. Given a set of m data-points with p features, the ith data-point is denoted as xi = {xi1, xi2, . . . , xip} ∈ X^p with its corresponding label yi ∈ {±1} (meaning the label can be either +1 or −1 for a data-point). Such labels distinguish these data-points into two classes: positive points and negative ones. We assume that this set of data-points is linearly separable, meaning that these data-points can be separated into these two distinct classes using a hyperplane (the analog of a straight line in the higher-dimensional space X^p).
Consider a simple problem with only 2 features, x1 and x2 , in a two-
dimensional (2D) space x = {x1 , x2 } ∈ X2 , so that we can have a good
visualization. A set of m data-points is scattered in 2D space shown in
Fig. 6.1.
All these data-points are labeled into two classes: positive points marked
with “+” symbols, and each of the points is labeled with y = +1; negative
points marked with “−” symbols, and each of them is labeled with y = −1.
These two classes of points can be separated by straight lines, such as the
red-dashed line and the red-dotted line in Fig. 6.1. For datasets in the real
world, there are infinite numbers of such lines forming a street. Our goal
is to develop a computer algorithm to find such a line with a given labeled
dataset (the data-points with corresponding labels). This is simple but quite
a typical classification problem.
Figure 6.1: Linearly separable data-points in 2D space.
Assume for the moment that we know the orientation of one such red line, say the middle red-dashed line for easy discussion. Hence, we know its normal direction vector w = [w1, w2] ∈ W², although we do not yet know the translational location of the red-dashed line along its normal. We then have the unit normal vector w/‖w‖, with a length of 1. For any point (not necessarily a data-point) in the 2D space (marked with a small cross in Fig. 6.1), we can form a vector x starting at the origin. Now, the dot-product
x · w/‖w‖    (6.1)
becomes the length of the projection of x on the unit normal w/‖w‖. Therefore, it is the measure that we need to determine how far the point x is away from the origin in the direction of w/‖w‖, which is a useful piece of information. Because we do not yet know the translational location of the red line in relation to x, we thus introduce a parameter b/‖w‖, where b ∈ W¹ is an adjustable parameter that allows the red line to move up and down along w.
Notice in Eq. (6.1) that we used the dot-product (the inner product). This is the same as the matrix-product we use in the Python implementation, because their shapes match: x is a (row) vector, and w is a column vector (a matrix with a single column) of the same length. Therefore, we use both interchangeably in this book.
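For instance (a minimal NumPy sketch), the two products give the same scalar when the shapes match:
import numpy as np

x = np.array([1.0, 2.0])           # a data-point as a (row) vector
w = np.array([[0.5], [0.25]])      # a column vector (2 x 1 matrix)
b = 0.1

print(np.dot(x, w[:, 0]) + b)      # dot-product (inner product) form: 1.1
print((x @ w)[0] + b)              # matrix-product form, same value:  1.1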
The equation for an arbitrary line in relation to a given point x shall have this form:
x·w+b (6.2)
Our task now is to find the red-dashed line by choosing a particular
set of w and b in W3 that separates the data-points, which is the affine
transformation discussed in Chapter 5. The conditions should be
(x · w + b) > 0, for points on the upper-right side of the red-dashed line: y = +1    (6.3)
(x · w + b) < 0, for points on the lower-left side of the red-dashed line: y = −1    (6.4)
Note that Eqs. (6.3) and (6.4) are for ideal situations where these two sets of
points might be infinitely close. In practical applications, we often find that
these points are in two distinct classes, and they are separated by a street
with a finite width w (that may be very small). The formulation can now be
modified as follows:
(x · w + b) > +w/2, for points on the upper-right side of the red-dashed line: y = +1    (6.5)
(x · w + b) < −w/2, for points on the lower-left side of the red-dashed line: y = −1    (6.6)
This type of equation is also known as the decision rule: when the condition
is satisfied by an arbitrary point x, it then belongs to a labeled class (y = 1
or y = −1), when the parameters w and b are known. We made excellent
progress.
It is obvious that Eqs. (6.5) and (6.6) can be magically written in a single
equation by putting these two conditions together with their corresponding
labels y.
y(x · w + b) > w/2 or y(x · ŵ) > w/2 or mg > w/2 (6.7)
This is a simplified single-equation decision rule: when the condition is
satisfied by an arbitrary point, it belongs to the labeled domain, and is not
within the street. This single equation is a lot more convenient for developing
the algorithm called a classifier to do the task. Note that mg is called
margin, which will be formally discussed in detail in Chapter 11.
We need to now bring in a labeled dataset to find w and b using the
above decision rule. For any data-point (the ith, for example, and regardless
of which class it belongs to), it must satisfy the following equation:
yi (xi · w + b) > w/2 or yi (xi · ŵ) > w/2 or mg(i) > w/2 (6.8)
where mg(i) is the margin for the ith data-point. Because there are an
infinite number of lines (such as the red-dashed and red-dotted lines shown in
Fig. 6.1) for such a separation, there exist multiple solutions to our problem.
We just want to find one of them that satisfies Eq. (6.8) for all data-points
in the dataset. This process is called training. Because labels are used, it is
a supervised training. The trained model can be used to predict the class of
a given data-point (which may not be from the training dataset), known as
classification or prediction in general. The following is an algorithm to
perform all those: training as well as prediction.
6.2 A Python Code for the Perceptron
The following is an easy-to-follow code available at mxnet-the-straight-
dope (https://github.com/zackchase/mxnet-the-straight-dope), under the
Apache-2.0 License. We have modified it a little and added in some detailed
descriptions as comment lines with the code.
Let us examine the details in this algorithm. As usual, we import the
needed libraries.
import mxnet as mx
from mxnet import nd, autograd
import matplotlib.pyplot as plt
import numpy as np
# We now generate a synthetic dataset for this examination.
mx.random.seed(1) # for repeatable output of this code
# define a function to generate the dataset that is
# separable with a margin strt_w
def getfake(samples, dimensions, domain_size, strt_w):
wfake = nd.random_normal(shape=(dimensions)) # weights
bfake = nd.random_normal(shape=(1)) # bias
wfake = wfake / nd.norm(wfake) # normalization
# generate linearly separable data, with labels
X = nd.zeros(shape=(samples, dimensions)) # initialization
Y = nd.zeros(shape=(samples))
i = 0
while (i < samples):
tmp = nd.random_normal(shape=(1,dimensions))
margin = nd.dot(tmp, wfake) + bfake
if (nd.norm(tmp).asscalar()<domain_size) & \
(abs(margin.asscalar())>strt_w):
X[i,:] = tmp[0]
Y[i] = 1 if margin.asscalar() > 0 else -1
i += 1
return X, Y, wfake, bfake
# Plot the data with colors according to the labels
def plotdata(X,Y):
for (x,y) in zip(X,Y):
if (y.asscalar() == 1):
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='r')
else:
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='b')
# define a function to plot contour plots over [-3,3] by [-3,3]
def plotscore(w,d):
xgrid = np.arange(-3, 3, 0.02) # generating grids
ygrid = np.arange(-3, 3, 0.02)
xx, yy = np.meshgrid(xgrid, ygrid)
zz = nd.zeros(shape=(xgrid.size, ygrid.size, 2))
zz[:,:,0] = nd.array(xx)
zz[:,:,1] = nd.array(yy)
vv = nd.dot(zz,w) + d
CS = plt.contour(xgrid,ygrid,vv.asnumpy())
plt.clabel(CS, inline=1, fontsize=10)
street_w = 0.1
ndim = 2
X,Y,wfake,bfake = getfake(50,ndim,3,street_w)
#generates 50 points, in 2D space with a margin of street_w
plotdata(X,Y)
plt.show()
Figure 6.2: Computer-generated data-points that are separable with a straight line.
We now see a dataset with 50 scattered points in 2D space separated by a street of width street_w. These data-points are clearly separable by at least one line in between the blue and red data-points.
Let us first take a look at the points after their vectors are all projected on an arbitrary vector w with an arbitrary bias b (an arbitrary affine transformation). We do the same projection using also the true w and bias b used to generate all these points, for comparison. We write the following code to do so:
wa = nd.array([0.5,0.5]) # a given vector
cs = (wa [0]/nd.norm(wa)).asnumpy() # cosine value
si = (wa [1]/nd.norm(wa)).asnumpy() # sine value
ba = 0.5 * nd.norm(wa) # bias (one may change it)
Xa = nd.dot(X,wa)/nd.norm(wa) + ba # projection (affine mapping)
plt.plot(Xa.asnumpy()*cs,Xa.asnumpy()*si,zorder=1) # results
for (x,y) in zip(Xa,Y):
if (y.asscalar() == 1):
plt.scatter(x.asscalar()*cs,x. asscalar()*si,color='r')
else:
plt.scatter(x.asscalar()*cs,x.asscalar()*si, color='b')
Figure 6.3: Data-points projected on an arbitrary straight line.
cs = (wfake [0]/nd.norm(wfake)).asnumpy() # projection on true norm
si = (wfake [1]/nd.norm(wfake)).asnumpy()
Xa = nd.dot(X,wfake) + bfake # with true bias
plt.plot(Xa.asnumpy()*cs,Xa.asnumpy()*si,zorder=1)
for (x,y) in zip(Xa,Y):
if (y.asscalar() == 1):
plt.scatter(x.asscalar()*cs,x. asscalar()*si,color='r')
else:
plt.scatter(x.asscalar()*cs,x.asscalar()*si, color='b')
plt.show()
Figure 6.4: Data-points projected on a straight line that is perpendicular to a line that
separates these data-points.
It is seen that
• When these points are projected on a vector that is not the true normal
(along the blue line) direction, the blue and red points are mixed along
the blue line.
• When the true normal direction is used, all these points are distinctly
separated into two classes, blue and red, along the orange line. This dataset
is linearly separable.
Let us see how the Perceptron algorithm finds a direction and the bias,
and hence the red line that separates these two classes. We use again the algo-
rithm available at mxnet-the-straight-dope (https://github.com/zackchase/
mxnet-the-straight-dope). The algorithm is based on the following encour-
agement rule: positive events should be encouraged and negative ones should
be discouraged. This rule is used with the decision rule discussed earlier for
each data-point in a given dataset.
# The Perceptron algorithm
def Perceptron(w,b,x,y,strt_w):
if (y*(nd.dot(w,x)+b)).asscalar()<=strt_w/2:
# Decision rule: check whether this data-point falls inside the
# street defined by the line with the current parameters
w += y * x # In the street, update w
b += y # update b
update = 1
else: # Otherwise (outside the street)
update = 0 # No action
return update
The above Perception algorithm is used in an iteration to update the
learning parameters: the weights in vector w and bias b, by looping over the
dataset.
w = nd.zeros(shape=(ndim)) # starts with zero (worst case)
b = nd.zeros(shape=(1))
t = 0
print('w:',w.shape,' b:',b.shape,' X:',X.shape,' Y:',Y.shape)
for (x,y) in zip(X,Y):
update = Perceptron(w,b,x,y,street_w)
if (update == 1):
t += 1
print('In the street: update the parameters')
print('data{}, label{}'.format(x.asnumpy(),y. asscalar()))
print('weight{}, bias{}'.format(w.asnumpy(),b. asscalar()))
plotscore(w,b) # The plane with updated w and b
plotdata(X,Y) # data-points
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='g')
# currently updated data-point
plt.show()
w: (2,) b: (1,) X: (50, 2) Y: (50,)
In the street: update the parameters
data [ 0.03751943 -0.7298465 ], label -1.0
weight [-0.03751943 0.7298465 ], bias -1.0
Figure 6.5: Results at the first iteration.
In the street: update the parameters
data [-2.0401056 1.4821309], label -1.0
weight [ 2.0025861 -0.7522844], bias -2.0
Figure 6.6: Results at the second iteration.
In the street: update the parameters
data [ 1.040828 -0.45256865], label -1.0
weight [ 0.96175814 -0.29971576], bias -3.0
Figure 6.7: Results at the third iteration.
In the street: update the parameters
data [-0.934901 -1.5937568], label 1.0
weight [ 0.02685714 -1.8934726 ], bias -2.0
Figure 6.8: Results at the final iteration.
print('Total number of points:',len(Y),'; times of updates:',t)
print('weight {}, bias {}'.format(w.asnumpy(),b.asscalar()))
print('wfake {}, bfake {}'.format(wfake.asnumpy(),bfake. asscalar()))
Total number of points: 50 ; times of updates: 4
weight [ 0.02685714 -1.8934726 ], bias -2.0
wfake [ 0.0738321 -0.99727064], bfake -0.9501792788505554
It is seen that all the red dots are on the positive side of the straight line
of x · w + b = 0 with the learned parameters of the weight vector w∗ and
bias b∗ . All the data marked with blue dots are on the negative side of the
line. In the entire process, all these points stay still, and the updates are
done only on the weight vector w and bias b. We shall now examine the
fundamental reasons for this simple algorithm to work.
6.3 The Perceptron Convergence Theorem
Theorem: Consider a dataset with a finite number of data-points. The ith
data-point is paired with its label as [xi, yi]. Any data-point xi is bounded by ‖xi‖ ≤ R < ∞, and its label is yi ∈ {±1}.
• If the data-points are linearly separable, meaning that there exists at least one pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and b² ≤ 1, such that yi(xi · w∗ + b∗) ≥ w/2 > 0 for all data pairs, where w is a given scalar of the street width,
• then the Perceptron algorithm converges after at most t = 2(R² + 1)/w² ∝ (R/w)² iterations, with a pair of parameters (wt, bt) forming a line x · wt + bt = 0 that separates the data-points into two classes.
We now prove this Theorem largely following the procedure with codes
at mxnet-the-straight-dope (https://github.com/zackchase/mxnet-the-
straight-dope/blob/master/chapter01 crashcourse/probability.ipynb), under
the Apache-2.0 License. We first check the convergence behavior numerically
(this may take minutes to run).
ws = np.arange(0.025,0.45,0.025) #generate a set of street widths
number_iterations = np.zeros(shape=(ws.size))
number_tests = 10
for j in range(number_tests): #set number of tests to do
for (i,wi) in enumerate(ws):
X,Y,_,_=getfake(1000,2,3,wi) #generate dataset
for (x,y) in zip(X,Y):
number_iterations[i] += Perceptron(w,b,x,y,wi)
#for each test, record the number of updates
number_iterations = number_iterations / 10.0
plt.plot(ws,number_iterations,label='Average number of iterations')
plt.legend()
plt.show()
The test results are plotted in Fig. 6.9. It shows that the number of iterations increases with the decrease of the street width w, and the rate is roughly quadratic (inversely). This test supports the convergence theorem. Let us now prove this in a more rigorous mathematical manner.
Figure 6.9: Convergence behavior examined numerically.
The proof assumes that the data are linearly separable. Therefore, there exists a pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and b² ≤ 1. Let us
examine the inner product of the current set of parameters ŵ with the
assumed existing ŵ∗ at each iteration. What we would like the iteration
to do is to update the current ŵ to approach ŵ∗ iteration by iteration, so
that their inner product can get bigger and bigger. Eventually, they can be
parallel with each other. Let us see whether this is really what is happening
in the Perceptron algorithm given above. Our examination is also iteration
by iteration but considers only the iterations when an update is made by the
algorithm, because the algorithm does nothing otherwise. This means that
we perform an update only when yt(xt · ŵt) ≤ w/2 at the tth step.
At the initial setting in the algorithm, t = 0, we have no idea what ŵ should be, and thus set ŵ0 = 0. Here, for a neat formulation using the dot-product, we assume that the column vectors ŵ0 and ŵ∗ are flattened to (row) vectors, so that ŵ can take a dot-product directly with any other (flattened) ŵ resulting from the iteration process. This can be done easily in numpy using the flatten() function. Thus we have, at the initial setting,
ŵ0 · ŵ∗ = 0
At t = 1, following the algorithm, we bring in arbitrarily a data-point, say x1 with y1. We shall find y1(x1 · ŵ0) = 0 ≤ (w/2), because at this point the current ŵ0 is still all zero. Therefore, data-point x1 with y1 is in the street defined by the line with the current ŵ0. Next, we perform the following update:
ŵ1 = ŵ0 + y1 x1
We thus have,
ŵ1 · ŵ∗ = ŵ0 · ŵ∗ + y1 (x1 · ŵ∗ ) ≥ w/2
This is because yi(xi · ŵ∗) ≥ w/2 is given by the Theorem as a condition. We see that the direction of vector ŵ1 approaches that of ŵ∗ by w/2. They are w/2 more aligned.
Similarly, at t = 2, following the algorithm, we have the following update.
ŵ2 = ŵ1 + y2 x2
We now have,
ŵ2 · ŵ∗ = ŵ1 · ŵ∗ + y2 (x2 · ŵ∗ ) ≥ 2(w/2)
This is because of the result obtained at t = 1 and the addition of the condition yi(xi · ŵ∗) ≥ w/2 given by the Theorem. We see that the direction of vector ŵ2 approaches that of ŵ∗ by 2(w/2). They are now 2(w/2) more aligned.
It is clear that the inner product gains one (w/2) in each iteration. Now, at the tth update, we shall have
ŵt · ŵ∗ ≥ t(w/2)    (6.9)
We see that the direction of vector ŵt approaches that of ŵ∗ by t(w/2). It is clear that the algorithm drives ŵ more and more into alignment with ŵ∗, at a linear rate of (w/2). The wider the street, the faster the convergence.
We next examine the evolution of the length (amplitude) of vector ŵt+1 .
‖ŵt+1‖² = ŵt+1 · ŵt+1 = (ŵt + yt xt) · (ŵt + yt xt)
= ŵt · ŵt + 2 yt xt · ŵt + yt² xt · xt    (6.10)
= ‖ŵt‖² + 2 yt xt · ŵt + yt² ‖xt‖²
Using the conditions given by the Theorem: ‖x̂i‖² = ‖xi‖² + 1 ≤ R² + 1 (the extra 1 accounts for the appended entry of 1), yi ∈ {±1}, and yt(xt · ŵt) ≤ w/2 (this is the condition for starting the tth update), we shall have
‖ŵt+1‖² ≤ ‖ŵt‖² + R² + 1 + w
When t = 0, we have ŵ0 = 0, and hence
‖ŵ1‖² ≤ ‖ŵ0‖² + R² + 1 + w = R² + 1 + w
When t = 1, we shall have
‖ŵ2‖² ≤ ‖ŵ1‖² + R² + 1 + w ≤ 2(R² + 1 + w)
This means that each iteration adds at most (R² + 1 + w). At the t = T iteration, we shall have
‖ŵT‖² ≤ T(R² + 1 + w)    (6.11)
Using the Cauchy-Schwartz inequality, i.e., a·b ≥ a·b, and then Eq. (6.9),
we obtain,
ŵT ŵ∗ ≥ ŵT · ŵ∗ ≥ T (w/2)
Using the conditions given by the Theorem: ŵ∗ = w∗ + b ≤ 1 + 1 = 2,
we have
√
ŵT 2 ≥ T (w/2)
Combining this with the inequality Eq. (6.11) yields,
√(2T(R² + 1 + w)) ≥ T(w/2)    (6.12)
Let us examine this inequality. The number of iterations T appears only under a square root on the left side of Eq. (6.12) but linearly on the right side, while all other quantities are constants. Thus, this inequality cannot hold for arbitrarily large T. Therefore, T must be limited to satisfy Eq. (6.12), which means that the Perceptron algorithm will converge in a finite number of iterations, with

T ≤ 8(R² + 1 + w)/w² ∝ (R/w)²
This can also be written as

R/w ∝ √T    (6.13)
This means that the number of iterations needed for the Perceptron algorithm to converge grows with the square of the relative measure R/w, i.e., the data bound relative to the street width.
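As a quick numerical illustration of this scaling (a minimal sketch; the chosen values of R and w are arbitrary), the upper bound on the number of updates can be evaluated directly:

def iteration_bound(R, w):
    # Upper bound on the number of Perceptron updates: T <= 8(R^2 + 1 + w)/w^2
    return 8.0 * (R**2 + 1.0 + w) / w**2

for R, w in [(1.0, 0.5), (2.0, 0.5), (4.0, 0.5)]:
    print(R, w, iteration_bound(R, w))   # the bound grows roughly as (R/w)^2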
We now give the following remarks:
1. The Perceptron convergence proof requires that the data-points be separable by a line.
2. The convergence is independent of the dimensionality of the data. This
is not a condition used in the proof. It is also independent of the number
of observations.
3. The number of iterations increases with the decrease of the street width
w, and the rate is (inversely) quadratic. This echoes the numerical test
conducted earlier.
4. The number of iterations increases with the increase of the upper bound of the data R, and the rate is also quadratic. Together with the previous remark, this shows that what matters is the relative measure R/w: if the street is wide relative to the overall spread of the data, the two classes are easier to separate, which is intuitively understandable.
5. The algorithm updates only for data-points that are not yet in sufficient alignment, i.e., those with yt (xt · ŵt) ≤ w/2. It simply skips the data-points that are already in alignment, as illustrated in the short sketch below.
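For concreteness, the update rule analyzed in this proof can be sketched as follows (a minimal sketch under the assumptions above; the function name, the stopping test, and the choice of half_width are illustrative only, not the exact code used earlier in this chapter):

import numpy as np

def perceptron_updates(X, y, half_width, max_epochs=1000):
    # X: (n, p) data matrix; y: labels in {-1, +1}; half_width: the threshold w/2
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # augment each data-point with 1
    w_hat = np.zeros(Xa.shape[1])                  # initial setting: w_hat_0 = 0
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for xt, yt in zip(Xa, y):
            if yt * (xt @ w_hat) <= half_width:    # data-point inside the street
                w_hat += yt * xt                   # update; aligned points are skipped
                updates += 1
                changed = True
        if not changed:                            # no data-point violates the margin
            break
    return w_hat, updates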
6.4 Support Vector Machine
6.4.1 Problem statement
In our discussions above on the Perceptron, it is seen that there are in fact
an infinite number of solutions of parameters w and b in the hypothesis
space to form straight lines for a linearly separable dataset, as long as the
street width w has a finite value. Readers may observe the different straight
lines obtained by simply starting the classifier with different initial weights
and biases. One may naturally ask what the best solution is among all these
possible solutions. One answer is that the line that separates these two classes of data with the largest “street” width and sits on the middle line of the street may be the best. To obtain this widest-street optimal solution, one can formulate the problem as an optimization one. In fact, if efficiency is not a concern, one can simply bring in an optimization algorithm to control the Perceptron over multiple trials to find an optimal solution.
Here, we introduce the well-known support vector machine, or SVM. The initial idea was invented in the early 1960s in Vapnik’s PhD thesis, and it became popular in the 1990s when it (with the kernel trick) was applied to handwritten digit recognition [3, 4].
The SVM is an effective algorithm that is constructed using a systematic formulation and the Lagrange multiplier approach. Given below is a detailed description and formulation of SVM. A good reference is the excellent lecture by Prof. Patrick Winston at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o), and also some workings available online (https://towardsdatascience.com/support-vector-machine-python-example-d67d9b63f1c8).
6.4.2 Formulation of objective function and constraints
Our formulation continues from the formulation we derived for the Percep-
tron. The difference here is that the street width is no longer given in this
SVM setting. We just assume it is there. We must find a way to formulate
the street width, and then maximize it for a given dataset. We first derive
the formula for the street width.
Because we assume that the dataset is linearly separable, there must be a street of some finite width w. From the decision rule we derived,
we know the equation for the middle line of the street can be written in an
affine transformation form as
x·w+b=0 (6.14)
where b ∈ W is a learning parameter called bias, x ∈ Xp is a position
vector of an arbitrary point in the feature space with linear polynomial
bases (features), x1 , x2 , . . . , xp ,
x = [x1 , x2 , . . . , xp ] (6.15)
and w ∈ Wp is a vector of weights that are also learning parameters
w1 , w2 , . . . , wp :
w = [w1 , w2 , . . . , wp ] (6.16)
This middle line in a 2D feature space X2 is shown in Fig. 6.10 with a red dash-dot line; it is approximated using the weights w and bias b in the hypothesis space W3.
Figure 6.10: Linearly separable data-points in 2D space, the width of the street is to be
maximized using SVM.
On the upper-right gutter, there should be at least one positive data-
point, say x1 , right on it. Its vector x1 is the blue arrow, a support vector
to the gutter. Because x1 belongs to the positive class, its label is +1. The
equation for this gutter line of the street can be given as
x1 · w + b = +1 (6.17)
This is the decision border for data-point x1 . Similarly, on the lower-left
gutter, there is at least one negative data-point, say x2 , right on it. Its vector
x2 supports the gutter. Because x2 belongs to the negative class, its label is
−1. The equation for this gutter line can be given as
x2 · w + b = −1 (6.18)
Consider now the projections of these two support vectors, x1 and x2, on the unit normal of the middle line of the street, w/‖w‖. The projection of x1 gives the (Euclidean) distance of the upper-right gutter to the origin along the normalized w. Similarly, the projection of x2 gives the distance of the lower-left gutter to the origin along the normalized w. Therefore, their difference gives the width of the street:

w = (x1 · w/‖w‖ + b) − (x2 · w/‖w‖ + b) = (x1 · w − x2 · w)/‖w‖    (6.19)
Substituting Eqs. (6.17) and (6.18) into (6.19), we obtain

w = 2/‖w‖    (6.20)
A very simple formula. We now have the equation for the street width, and it depends only on the training-parameter weights w! This is not so difficult
to understand because these weights determine the orientation of the gutter
of the street and hence the direction of the street. When the weights change,
the street turns accordingly while remaining in touch with both data-points
x1 and x2 , which results in a change in the street width. The bias b affects
only the translational location of a line. Because the street width is the
difference of the two gutter lines, the bias is thus canceled. Therefore, the
bias b should not affect the width. Here, we observe the fact that the affine space is not a vector space (as mentioned in Chapter 1): the difference of two vectors in the affine space comes back to the feature space (because of the cancellation of the augmented 1).
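To make Eq. (6.20) concrete, the following small check (a minimal sketch; the particular weights and bias are arbitrary) constructs one point on each gutter and verifies that the projected distance between them equals 2/‖w‖:

import numpy as np

w = np.array([3.0, 4.0])                      # an arbitrary weight vector, norm = 5
b = -2.0                                      # an arbitrary bias

# Points on the gutters x.w + b = +1 and x.w + b = -1, taken along the direction of w
x1 = ((+1.0 - b) / (w @ w)) * w               # on the upper-right gutter
x2 = ((-1.0 - b) / (w @ w)) * w               # on the lower-left gutter

width = (x1 - x2) @ (w / np.linalg.norm(w))   # project the difference onto w/||w||
print(width, 2.0 / np.linalg.norm(w))         # both print 0.4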
However, why is the street width inversely related to the norm of w? This may seem counterintuitive, but it is really the mathematics at work. To examine what is really happening here, let us look at a simpler setting where both x1 and x2 (which are on the gutters) are sitting on the x2-axis, as shown in Fig. 6.11.

Figure 6.11: Change of the width of the street when the street is turned with respect to w.
When the gutters are at the horizontal direction, the equation for the
upper gutter is
0 · x1 + 1 · x2 = +1 − b    (6.21)
The street width is w0 . The normal vector w and its norm are given as
follows:
w = [0, 1],   ‖w‖ = √(0² + 1²) = 1    (6.22)
When these gutters rotate to have a slope k while remaining supported by
both x1 and x2 , the equation for the upper gutter becomes
k · x1 + 1 · x2 = +1 − b    (6.23)
The new normal vector wk and its norm are given as follows:
wk = [k, 1],   ‖wk‖ = √(k² + 1²)    (6.24)
It is obvious that the street width after the rotation, wk, is smaller than the original street width before the rotation, w0, while the norm of w has increased. This is true for any nonzero value of k. The street width is at its maximum when the street is along the horizontal direction, which is perpendicular to the vector x1 − x2. We write the following code to plot this relationship:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig, figarr = plt.subplots(1, 2, figsize=(11, 4))

def w_norm(k):                 # norm of the weight vector w for a gutter of slope k
    return np.sqrt(1. + k**2)

k = np.arange(-10, 10, .1)     # a range of gutter slopes
y = 2/w_norm(k)                # width of the street, w = 2/||w||

figarr[0].plot(k, y, c='r')
figarr[0].set_xlabel('Slope of the street gutter, $k$')
figarr[0].set_ylabel('Width of the street')
figarr[1].plot(w_norm(k), y)
figarr[1].set_xlabel('Norm of the weight vector')
figarr[1].set_ylabel('Width of the street')
plt.show()
Figure 6.12: Variation of the street width with the slope of the street gutter (left) and
with the norm of the weight vector (right).
It is clear from Fig. 6.12 that the street width is inversely related to the norm of w.
Most importantly, this analysis shows that if the street width is maxi-
mized, w must be perpendicular to the gutters (decision boundaries). This
conclusion is true for an arbitrary pair of data-points x1 and x2 on these two gutters.
Now, Eq. (6.19) can be rewritten as
w = [x1 − x2] · w/‖w‖    (6.25)
This means that when w is maximized, the inner product of [x1 − x2] and w/‖w‖ is maximized (these two vectors are parallel), where [x1 − x2] is a vector built from a pair of data-points in the linear polynomial bases x1, x2, . . ., used to approximate a line in the feature space, and w/‖w‖ is the vector of the normalized weights or tuning/optimization parameters. The use of linear polynomial bases here is because we assume the data-points are linearly separable by a hyperplane.
Remember that our original goal is to find the maximum street width. Based on Eq. (6.20), this is equivalent to minimizing the norm of w, which in turn is the same as minimizing ½‖w‖². The benefit of such simple conversions will soon be evidenced. We now have our objective function:

L = ½ ‖w‖² = ½ w · w    (6.26)
The above function needs to be minimized. We see a nice property of the above formulation: the objective function is quadratic, and its Hessian matrix is the identity matrix, which is clearly SPD. Therefore, it has one and only one minimum, and the local minimum is the global one. This is the fundamental reason
why local minim