MACHINE LEARNING WITH PYTHON
Theory and Applications
G. R. Liu
University of Cincinnati, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data
Names: Liu, G. R. (Gui-Rong), author.
Title: Machine learning with Python : theory and applications / G.R. Liu, University of Cincinnati, USA.
Description: Singapore ; Hackensack, NJ : World Scientific Publishing Co. Pte. Ltd., [2023] |
Includes bibliographical references and index.
Identifiers: LCCN 2022001048 | ISBN 9789811254178 (hardcover) |
ISBN 9789811254185 (ebook for institutions) | ISBN 9789811254192 (ebook for individuals)
Subjects: LCSH: Machine learning. | Python (Computer program language)
Classification: LCC Q325.5 .L58 2023 | DDC 006.3/1--dc23/eng20220328
LC record available at https://lccn.loc.gov/2022001048
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or
mechanical, including photocopying, recording or any information storage and retrieval system now known or to
be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center,
Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from
the publisher.
For any available supplementary material, please visit
https://www.worldscientific.com/worldscibooks/10.1142/12774#t=suppl
Desk Editors: Jayanthi Muthuswamy/Steven Patt
Typeset by Stallion Press
Email: [email protected]
Printed in Singapore
About the Author
G. R. Liu received his Ph.D. from Tohoku University, Japan, in 1991. He was a post-doctoral fellow at Northwestern University, USA, from 1991 to 1993. He was a Professor at the National University of Singapore until 2010 and is currently a Professor at the University of Cincinnati, USA. He is the founder of the Association for Computational Mechanics (Singapore) (SACM) and served as the President of SACM until 2010. He served as the President of the Asia-Pacific Association for Computational Mechanics (APACM) (2010–2013) and as an Executive Council Member of the International Association for Computational Mechanics (IACM) (2005–2010; 2020–2026). He has authored a large number of journal papers and books, including two bestsellers: Mesh Free Methods: Moving Beyond the Finite Element Method and Smoothed Particle Hydrodynamics: A Meshfree Particle Method. He is the Editor-in-Chief of the International Journal of Computational Methods and served as an Associate Editor for IPSE and MANO. He is the recipient of numerous awards, including the Singapore Defence Technology Prize, NUS Outstanding University Researcher Award and Best Teacher Award, APACM Computational Mechanics Awards, JSME Computational Mechanics Awards, ASME Ted Belytschko Applied Mechanics Award, the Zienkiewicz Medal from APACM, the AJCM Computational Mechanics Award, and the Humboldt Research Award. He has been listed as a world top 1% most influential scientist (Highly Cited Researchers) by Thomson Reuters in 2014–2016, 2018, and 2019. His ISI citations by others number about 22,000, with an ISI H-index of about 85 and a Google Scholar H-index of 110.
Contents
About the Author v
1 Introduction 1
1.1 Naturally Learned Ability for Problem Solving . . . . . . . 1
1.2 Physics-Law-based Models . . . . . . . . . . . . . . . . . . 1
1.3 Machine Learning Models, Data-based . . . . . . . . . . . 3
1.4 General Steps for Training Machine Learning Models . . . 4
1.5 Some Mathematical Concepts, Variables, and Spaces . . . 5
1.5.1 Toy examples . . . . . . . . . . . . . . . . . . . . . 5
1.5.2 Feature space . . . . . . . . . . . . . . . . . . . . . 6
1.5.3 Affine space . . . . . . . . . . . . . . . . . . . . . 7
1.5.4 Label space . . . . . . . . . . . . . . . . . . . . . . 8
1.5.5 Hypothesis space . . . . . . . . . . . . . . . . . . . 9
1.5.6 Definition of a typical machine learning model,
a mathematical view . . . . . . . . . . . . . . . . . 10
1.6 Requirements for Creating Machine Learning Models . . . 11
1.7 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Relation Between Physics-Law-based and Data-based
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.10 Who May Read This Book . . . . . . . . . . . . . . . . . . 14
1.11 Codes Used in This Book . . . . . . . . . . . . . . . . . . . 14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Basics of Python 19
2.1 An Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Briefing on Python . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Variable Types . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Underscore placeholder . . . . . . . . . . . . . . . 28
2.3.3 Strings . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.4 Conversion between types of variables . . . . . . . 36
2.3.5 Variable formatting . . . . . . . . . . . . . . . . . 38
2.4 Arithmetic Operators . . . . . . . . . . . . . . . . . . . . . 39
2.4.1 Addition, subtraction, multiplication, division,
and power . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Built-in functions . . . . . . . . . . . . . . . . . . 40
2.5 Boolean Values and Operators . . . . . . . . . . . . . . . . 41
2.6 Lists: A diversified variable type container . . . . . . . . . 42
2.6.1 List creation, appending, concatenation,
and updating . . . . . . . . . . . . . . . . . . . . . 42
2.6.2 Element-wise addition of lists . . . . . . . . . . . . 44
2.6.3 Slicing strings and lists . . . . . . . . . . . . . . . 46
2.6.4 Underscore placeholders for lists . . . . . . . . . . 49
2.6.5 Nested list (lists in lists in lists) . . . . . . . . . . 49
2.7 Tuples: Value preserved . . . . . . . . . . . . . . . . . . . . 50
2.8 Dictionaries: Indexable via keys . . . . . . . . . . . . . . . 51
2.8.1 Assigning data to a dictionary . . . . . . . . . . . 51
2.8.2 Iterating over a dictionary . . . . . . . . . . . . . 52
2.8.3 Removing a value . . . . . . . . . . . . . . . . . . 53
2.8.4 Merging two dictionaries . . . . . . . . . . . . . . 54
2.9 Numpy Arrays: Handy for scientific computation . . . . . . 55
2.9.1 Lists vs. Numpy arrays . . . . . . . . . . . . . . . 55
2.9.2 Structure of a numpy array . . . . . . . . . . . . . 55
2.9.3 Axis of a numpy array . . . . . . . . . . . . . . . . 60
2.9.4 Element-wise computations . . . . . . . . . . . . . 61
2.9.5 Handy ways to generate multi-dimensional
arrays . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9.6 Use of external package: MXNet . . . . . . . . . . 63
2.9.7 In-place operations . . . . . . . . . . . . . . . . . 66
2.9.8 Slicing from a multi-dimensional array . . . . . . . 67
2.9.9 Broadcasting . . . . . . . . . . . . . . . . . . . . . 67
2.9.10 Converting between MXNet NDArray
and NumPy . . . . . . . . . . . . . . . . . . . . . 70
2.9.11 Subsetting in Numpy . . . . . . . . . . . . . . . . 71
2.9.12 Numpy and universal functions (ufunc) . . . . . . 71
2.9.13 Numpy array and vector/matrix . . . . . . . . . . 72
2.10 Sets: No Duplication . . . . . . . . . . . . . . . . . . . . . 75
2.10.1 Intersection of two sets . . . . . . . . . . . . . . . 75
2.10.2 Difference of two sets . . . . . . . . . . . . . . . . 75
2.11 List Comprehensions . . . . . . . . . . . . . . . . . . . . . 76
2.12 Conditions, “if” Statements, “for” and “while” Loops . . . 77
2.12.1 Comparison operators . . . . . . . . . . . . . . . . 77
2.12.2 The “in” operator . . . . . . . . . . . . . . . . . . 78
2.12.3 The “is” operator . . . . . . . . . . . . . . . . . . 78
2.12.4 The “not” operator . . . . . . . . . . . . . . . . 80
2.12.5 The “if” statements . . . . . . . . . . . . . . . . . 80
2.12.6 The “for” loops . . . . . . . . . . . . . . . . . . . 81
2.12.7 The “while” loops . . . . . . . . . . . . . . . . . . 82
2.12.8 Ternary conditionals . . . . . . . . . . . . . . . . . 84
2.13 Functions (Methods) . . . . . . . . . . . . . . . . . . . . . 84
2.13.1 Block structure for function definition . . . . . . . 84
2.13.2 Function with arguments . . . . . . . . . . . . . . 84
2.13.3 Lambda functions (Anonymous functions) . . . . 86
2.14 Classes and Objects . . . . . . . . . . . . . . . . . . . . . . 86
2.14.1 A simplest class . . . . . . . . . . . . . . . . . . . 86
2.14.2 A class for scientific computation . . . . . . . . . 89
2.14.3 Subclass (class inheritance) . . . . . . . . . . . . . 90
2.15 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.16 Generation of Plots . . . . . . . . . . . . . . . . . . . . . . 92
2.17 Code Performance Assessment . . . . . . . . . . . . . . . . 93
2.18 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3 Basic Mathematical Computations 95
3.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.1.1 Scalar numbers . . . . . . . . . . . . . . . . . . . . 96
3.1.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . 96
3.1.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . 98
3.1.4 Tensors . . . . . . . . . . . . . . . . . . . . . . . . 100
3.1.5 Sum and mean of a tensor . . . . . . . . . . . . . 101
3.1.6 Dot-product of two vectors . . . . . . . . . . . . . 102
3.1.7 Outer product of two vectors . . . . . . . . . . . . 105
3.1.8 Matrix-vector product . . . . . . . . . . . . . . . . 106
3.1.9 Matrix-matrix multiplication . . . . . . . . . . . . 106
3.1.10 Norms . . . . . . . . . . . . . . . . . . . . . . . . 108
3.1.11 Solving algebraic system equations . . . . . . . . . 109
3.1.12 Matrix inversion . . . . . . . . . . . . . . . . . . . 111
3.1.13 Eigenvalue decomposition of a matrix . . . . . . . 113
3.1.14 Condition number of a matrix . . . . . . . . . . . 116
3.1.15 Rank of a matrix . . . . . . . . . . . . . . . . . . 118
3.2 Rotation Matrix . . . . . . . . . . . . . . . . . . . . . . . . 119
3.3 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.3.1 1-D piecewise linear interpolation using
numpy.interp . . . . . . . . . . . . . . . . . . . . . 121
3.3.2 1-D least-squares solution approximation . . . . . 122
3.3.3 1-D interpolation using interp1d . . . . . . . . . . 124
3.3.4 2-D spline representation
using bisplrep . . . . . . . . . . . . . . . . . . . . 124
3.3.5 Radial basis functions for smoothing and
interpolation . . . . . . . . . . . . . . . . . . . . . 126
3.4 Singular Value Decomposition . . . . . . . . . . . . . . . . 129
3.4.1 SVD formulation . . . . . . . . . . . . . . . . . . . 129
3.4.2 Algorithms for SVD . . . . . . . . . . . . . . . . . 130
3.4.3 Numerical examples . . . . . . . . . . . . . . . . . 131
3.4.4 SVD for data compression . . . . . . . . . . . . . 133
3.5 Principal Component Analysis . . . . . . . . . . . . . . . . 135
3.5.1 PCA formulation . . . . . . . . . . . . . . . . . . 135
3.5.2 Numerical examples . . . . . . . . . . . . . . . . . 137
3.6 Numerical Root Finding . . . . . . . . . . . . . . . . . . . 143
3.7 Numerical Integration . . . . . . . . . . . . . . . . . . . . . 145
3.7.1 Trapezoid rule . . . . . . . . . . . . . . . . . . . . 145
3.7.2 Gauss integration . . . . . . . . . . . . . . . . . . 147
3.8 Initial data treatment . . . . . . . . . . . . . . . . . . . . . 148
3.8.1 Min-max scaling . . . . . . . . . . . . . . . . . . . 149
3.8.2 “One-hot” encoding . . . . . . . . . . . . . . . . . 152
3.8.3 Standard scaling . . . . . . . . . . . . . . . . . . . 153
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4 Statistics and Probability-based Learning Model 157
4.1 Analysis of Probability of an Event . . . . . . . . . . . . . 158
4.1.1 Random sampling, controlled random
sampling . . . . . . . . . . . . . . . . . . . . . . . 158
4.1.2 Probability . . . . . . . . . . . . . . . . . . . . . . 160
4.2 Random Distributions . . . . . . . . . . . . . . . . . . . . . 164
4.2.1 Uniform distribution . . . . . . . . . . . . . . . . . 165
4.2.2 Normal distribution (Gaussian distribution) . . . 165
4.3 Entropy of Probability . . . . . . . . . . . . . . . . . . . . 167
4.3.1 Example 1: Probability and its entropy . . . . . . 169
4.3.2 Example 2: Variation of entropy . . . . . . . . . . 170
4.3.3 Example 3: Entropy for events with a variable
that takes different numbers of values of uniform
distribution . . . . . . . . . . . . . . . . . . . . . . 172
4.4 Cross-Entropy: Predicted and True Probability . . . . . . 173
4.4.1 Example 1: Cross-entropy of a quality
prediction . . . . . . . . . . . . . . . . . . . . . . . 174
4.4.2 Example 2: Cross-entropy of a poor
prediction . . . . . . . . . . . . . . . . . . . . . . . 175
4.5 KL-Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.5.1 Example 1: KL-divergence of a distribution
of quality prediction . . . . . . . . . . . . . . . . . 176
4.5.2 Example 2: KL-divergence of a poorly
predicted distribution . . . . . . . . . . . . . . . . 176
4.6 Binary Cross-Entropy . . . . . . . . . . . . . . . . . . . . . 177
4.6.1 Example 1: Binary cross-entropy for a distribution
of quality prediction . . . . . . . . . . . . . . . . . 178
4.6.2 Example 2: Binary cross-entropy for a poorly
predicted distribution . . . . . . . . . . . . . . . . 178
4.6.3 Example 3: Binary cross-entropy for more uniform
true distribution: A quality prediction . . . . . . . 179
4.6.4 Example 4: Binary cross-entropy for more uniform
true distribution: A poor prediction . . . . . . . . 180
4.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . 180
4.8 Naive Bayes Classification: Statistics-based Learning . . . 181
4.8.1 Formulation . . . . . . . . . . . . . . . . . . . . . 181
4.8.2 Case study: Handwritten digits recognition . . . . 181
4.8.3 Algorithm for the Naive Bayes classification . . . 182
4.8.4 Testing the Naive Bayes model . . . . . . . . . . . 185
4.8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . 187
5 Prediction Function and Universal Prediction Theory 189
5.1 Linear Prediction Function and Affine Transformation . . . 190
5.1.1 Linear prediction function: A basic
hypothesis . . . . . . . . . . . . . . . . . . . . . . 191
5.1.2 Predictability for constants, the role
of the bias . . . . . . . . . . . . . . . . . . . . . . 192
5.1.3 Predictability for linear functions:
The role of the weights . . . . . . . . . . . . . . . 192
5.1.4 Prediction of linear functions: A machine
learning procedure . . . . . . . . . . . . . . . . . . 193
5.1.5 Affine transformation . . . . . . . . . . . . . . . . 194
5.2 Affine Transformation Unit (ATU), A Simplest
Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.3 Typical Data Structures . . . . . . . . . . . . . . . . . . . 198
5.4 Demonstration Examples of Affine Transformation . . . . . 199
5.4.1 An edge, a rectangle under affine
transformation . . . . . . . . . . . . . . . . . . . . 202
5.4.2 A circle under affine transformation . . . . . . . . 204
5.4.3 A spiral under affine transformation . . . . . . . . 205
5.4.4 Fern leaf under affine transformation . . . . . . . 205
5.4.5 On linear prediction function with affine
transformation . . . . . . . . . . . . . . . . . . . . 206
5.4.6 Affine transformation wrapped with activation
function . . . . . . . . . . . . . . . . . . . . . . . . 206
5.5 Parameter Encoding and the Essential Mechanism
of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.5.1 The x to ŵ encoding, a data-parameter
converter unit . . . . . . . . . . . . . . . . . . . . 210
5.5.2 Uniqueness of the encoding . . . . . . . . . . . . . 211
5.5.3 Uniqueness of the encoding: Not affected
by activation function . . . . . . . . . . . . . . . . 212
5.6 The Gradient of the Prediction Function . . . . . . . . . . 213
5.7 Affine Transformation Array (ATA) . . . . . . . . . . . . . 213
5.8 Predictability of High-Order Functions of a Deepnet . . . . 214
5.8.1 A role of activation functions . . . . . . . . . . . . 214
5.8.2 Formation of a deepnet by chaining ATA . . . . . 215
5.8.3 Example: A 1 → 1 → 1 network . . . . . . . . . . 217
5.9 Universal Prediction Theory . . . . . . . . . . . . . . . . . 218
5.10 Nonlinear Affine Transformations . . . . . . . . . . . . . . 219
5.11 Feature Functions in Physics-Law-based Models . . . . . . 220
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6 The Perceptron and SVM 223
6.1 Linearly Separable Classification Problems . . . . . . . . . 224
6.2 A Python Code for the Perceptron . . . . . . . . . . . . . 226
6.3 The Perceptron Convergence Theorem . . . . . . . . . . . 233
6.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . 237
6.4.1 Problem statement . . . . . . . . . . . . . . . . . 237
6.4.2 Formulation of objective function
and constraints . . . . . . . . . . . . . . . . . . . . 238
6.4.3 Modified objective function with constraints:
Multipliers method . . . . . . . . . . . . . . . . . 242
6.4.4 Converting to a standard quadratic programming
problem . . . . . . . . . . . . . . . . . . . . . . . . 245
6.4.5 Prediction in SVM . . . . . . . . . . . . . . . . . . 249
6.4.6 Example: A Python code for SVM . . . . . . . . . 250
6.4.7 Confusion matrix . . . . . . . . . . . . . . . . . . 254
6.4.8 Example: A Scikit-learn class for SVM . . . . . . 254
6.4.9 SVM for datasets not separable with
hyperplanes . . . . . . . . . . . . . . . . . . . . . 256
6.4.10 Kernel trick . . . . . . . . . . . . . . . . . . . . . 257
6.4.11 Example: SVM classification with curves . . . . . 258
6.4.12 Multiclass classification via SVM . . . . . . . . . . 260
6.4.13 Example: Use of SVM classifiers for
iris dataset . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7 Activation Functions and Universal
Approximation Theory 265
7.1 Sigmoid Function (σ(z)) . . . . . . . . . . . . . . . . . . . 266
7.2 Sigmoid Function of an Affine Transformation
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
7.3 Neural-Pulse-Unit (NPU) . . . . . . . . . . . . . . . . . . 269
7.4 Universal Approximation Theorem . . . . . . . . . . . . . 274
7.4.1 Function approximation using NPUs . . . . . . . . 274
7.4.2 Function approximations using neuron
basis functions . . . . . . . . . . . . . . . . . . . . 275
7.4.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . 281
7.5 Hyperbolic Tangent Function (tanh) . . . . . . . . . . . . . 282
7.6 Relu Functions . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.7 Softplus Function . . . . . . . . . . . . . . . . . . . . . . . 286
7.8 Conditions for activation functions . . . . . . . . . . . . . . 288
7.9 Novel activation functions . . . . . . . . . . . . . . . . . . 288
7.9.1 Rational activation function . . . . . . . . . . . . 288
7.9.2 Power function . . . . . . . . . . . . . . . . . . . . 292
7.9.3 Power-linear function . . . . . . . . . . . . . . . . 294
7.9.4 Power-quadratic function . . . . . . . . . . . . . . 297
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8 Automatic Differentiation and Autograd 303
8.1 General Issues on Optimization and Minimization . . . . . 303
8.2 Analytic Differentiation . . . . . . . . . . . . . . . . . . . . 304
8.3 Numerical Differentiation . . . . . . . . . . . . . . . . . . . 305
8.4 Automatic Differentiation . . . . . . . . . . . . . . . . . . . 305
8.4.1 The concept of automatic or algorithmic
differentiation . . . . . . . . . . . . . . . . . . . . 305
8.4.2 Differentiation of a function with respect
to a vector and matrix . . . . . . . . . . . . . . . 306
8.5 Autograd Implemented in Numpy . . . . . . . . . . . . . . 308
8.6 Autograd Implemented in the MXNet . . . . . . . . . . . . 310
8.6.1 Gradients of scalar functions with simple
variable . . . . . . . . . . . . . . . . . . . . . . . . 311
8.6.2 Gradients of scalar functions in high
dimensions . . . . . . . . . . . . . . . . . . . . . . 313
8.6.3 Gradients of scalar functions with quadratic
variables in high dimensions . . . . . . . . . . . . 318
8.6.4 Gradient of scalar function with a matrix of
variables in high dimensions . . . . . . . . . . . . 319
8.6.5 Head gradient . . . . . . . . . . . . . . . . . . . . 320
8.7 Gradients for Functions with Conditions . . . . . . . . . . 322
8.8 Example: Gradients of an L2 Loss Function for
a Single Neuron . . . . . . . . . . . . . . . . . . . . . . . . 323
8.9 Examples: Differences Between Analytical, Autograd,
and Numerical Differentiation . . . . . . . . . . . . . . . . 327
8.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9 Solution Existence Theory and
Optimization Techniques 331
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 331
9.2 Analytic Optimization Methods: Ideal Cases . . . . . . . . 332
9.2.1 Least square formulation . . . . . . . . . . . . . . 332
9.2.2 L2 loss function . . . . . . . . . . . . . . . . . . . 333
9.2.3 Normal equation . . . . . . . . . . . . . . . . . . . 334
9.2.4 Solution existence analysis . . . . . . . . . . . . . 334
9.2.5 Solution existence theory . . . . . . . . . . . . . . 336
9.2.6 Effects of parallel data-points . . . . . . . . . . . . 337
9.2.7 Predictability of the solution against
the label . . . . . . . . . . . . . . . . . . . . . . . 337
9.3 Considerations in Optimization for Complex
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.3.1 Local minima . . . . . . . . . . . . . . . . . . . . 339
9.3.2 Saddle points . . . . . . . . . . . . . . . . . . . . . 340
9.3.3 Convex functions . . . . . . . . . . . . . . . . . . 343
9.4 Gradient Descent (GD) Method for Optimization . . . . . 344
9.4.1 Gradient descent in one dimension . . . . . . . . . 345
9.4.2 Remarks . . . . . . . . . . . . . . . . . . . . . . . 346
9.4.3 Gradient descent in hyper-dimensions . . . . . . . 347
9.4.4 Property of a convex function . . . . . . . . . . . 348
9.4.5 The convergence theorem for the Gradient
Descent algorithm . . . . . . . . . . . . . . . . 349
9.4.6 Setting of the learning rates . . . . . . . . . . 351
9.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 353
9.5.1 Numerical experiment . . . . . . . . . . . . . . . . 354
9.6 Gradient Descent with Momentum . . . . . . . . . . . . . 363
9.6.1 The most critical problem with GD methods . . . 363
9.6.2 Formulation . . . . . . . . . . . . . . . . . . . . . 365
9.6.3 Numerical experiment . . . . . . . . . . . . . . . . 368
9.7 Nesterov Accelerated Gradient . . . . . . . . . . . . . . . . 370
9.7.1 Formulation . . . . . . . . . . . . . . . . . . . . . 370
9.8 AdaGrad Gradient Algorithm . . . . . . . . . . . . . . . . 371
9.8.1 Formulation . . . . . . . . . . . . . . . . . . . . . 371
9.8.2 Numerical experiment . . . . . . . . . . . . . . . . 372
9.9 RMSProp Gradient Algorithm . . . . . . . . . . . . . . . . 374
9.9.1 Formulation . . . . . . . . . . . . . . . . . . . . . 375
9.9.2 Numerical experiment . . . . . . . . . . . . . . . . 375
9.10 AdaDelta Gradient Algorithm . . . . . . . . . . . . . . . . 378
9.10.1 The idea . . . . . . . . . . . . . . . . . . . . . . . 378
9.10.2 Numerical experiment . . . . . . . . . . . . . . . . 378
9.11 Adam Gradient Algorithm . . . . . . . . . . . . . . . . . . 381
9.11.1 Formulation . . . . . . . . . . . . . . . . . . . . . 381
9.11.2 Numerical experiment . . . . . . . . . . . . . . . . 382
9.12 A Case Study: Compare Minimization Techniques
Used in MLPClassifier . . . . . . . . . . . . . . . . . . . . 385
9.13 Other Algorithms . . . . . . . . . . . . . . . . . . . . . . . 386
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10 Loss Functions for Regression 389
10.1 Formulations for Linear Regression . . . . . . . . . . . . . 390
10.1.1 Mathematical model . . . . . . . . . . . . . . . . . 390
10.1.2 Neural network configuration . . . . . . . . . . . . 390
10.1.3 The xw formulation . . . . . . . . . . . . . . . . . 391
10.2 Loss Functions for Linear Regression . . . . . . . . . . . . 391
10.2.1 Mean squared error loss or L2 loss function . . . . 392
10.2.2 Absolute error loss or L1 loss function . . . . . . . 393
10.2.3 Huber loss function . . . . . . . . . . . . . . . . . 394
10.2.4 Log-cosh loss function . . . . . . . . . . . . . . . . 394
10.2.5 Comparison between these loss functions . . . . . 395
10.2.6 Python codes for these loss functions . . . . . . . 396
10.3 Python Codes for Regression . . . . . . . . . . . . . . . . . 398
10.3.1 Linear regression using high-order polynomial
and other feature functions . . . . . . . . . . . . . 401
10.3.2 Linear regression using Gaussian basis
functions . . . . . . . . . . . . . . . . . . . . . . . 404
10.4 Neural Network Model for Linear Regressions
with Big Datasets . . . . . . . . . . . . . . . . . . . . . . . 406
10.4.1 Setting up neural network models . . . . . . . . . 406
10.4.2 Create data iterators . . . . . . . . . . . . . . . . 409
10.4.3 Training parameters . . . . . . . . . . . . . . . . . 411
10.4.4 Define the neural network . . . . . . . . . . . . . . 412
10.4.5 Define the loss function . . . . . . . . . . . . . . . 412
10.4.6 Use of optimizer . . . . . . . . . . . . . . . . . . . 412
10.4.7 Execute the training . . . . . . . . . . . . . . . . . 412
10.4.8 Examining training progress . . . . . . . . . . . . 413
10.5 Neural Network Model for Nonlinear Regression . . . . . . 415
10.5.1 Train models on the Boston housing price
dataset . . . . . . . . . . . . . . . . . . . . . . . . 416
10.5.2 Plotting partial dependence for two features . . . 416
10.5.3 Plot curves on top of each other . . . . . . . . . . 418
10.6 On Nonlinear Regressions . . . . . . . . . . . . . . . . . . . 418
10.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
11 Loss Functions and Models for Classification 421
11.1 Prediction Functions . . . . . . . . . . . . . . . . . . . . . 421
11.1.1 Linear function . . . . . . . . . . . . . . . . . . . . 422
11.1.2 Logistic prediction function . . . . . . . . . . . . . 422
11.1.3 The tanh prediction function . . . . . . . . . . . . 423
11.2 Loss Functions for Classification Problems . . . . . . . . . 423
11.2.1 The margin concept . . . . . . . . . . . . . . . . . 423
11.2.2 0–1 loss . . . . . . . . . . . . . . . . . . . . . . . . 424
11.2.3 Hinge loss . . . . . . . . . . . . . . . . . . . . . . 425
11.2.4 Logistic loss . . . . . . . . . . . . . . . . . . . . . 426
11.2.5 Exponential loss . . . . . . . . . . . . . . . . . . . 427
11.2.6 Square loss . . . . . . . . . . . . . . . . . . . . . . 427
11.2.7 Binary cross-entropy loss . . . . . . . . . . . . . . 429
11.2.8 Remarks . . . . . . . . . . . . . . . . . . . . . . . 432
11.3 A Simple Neural Network for Classification . . . . . . . . . 432
11.4 Example of Binary Classification Using Neural
Network with mxnet . . . . . . . . . . . . . . . . . . . . . 433
11.4.1 Dataset for binary classification . . . . . . . . . . 433
11.4.2 Define loss functions . . . . . . . . . . . . . . . . . 435
11.4.3 Plot the convergence curve of the
loss function . . . . . . . . . . . . . . . . . . . . . 437
11.4.4 Computing the accuracy of the
trained model . . . . . . . . . . . . . . . . . . . . 437
11.5 Example of Binary Classification Using Sklearn . . . . . . 438
11.6 Regression with Decision Tree, AdaBoost,
and Gradient Boosting . . . . . . . . . . . . . . . . . . . . 443
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
12 Multiclass Classification 445
12.1 Softmax Activation Neural Networks for
k-Classifications . . . . . . . . . . . . . . . . . . . . . . . . 445
12.2 Cross-Entropy Loss Function for k-Classifications . . . . . 447
12.3 Case Study 1: Handwritten Digit Classification
with 1-Layer NN . . . . . . . . . . . . . . . . . . . . . . . . 448
12.3.1 Set contexts according to computer hardware . . . 448
12.3.2 Loading the MNIST dataset . . . . . . . . . . . . 448
12.3.3 Set model parameters . . . . . . . . . . . . . . . . 451
12.3.4 Multiclass logistic regression . . . . . . . . . . . . 451
12.3.5 Defining a neural network model . . . . . . . . . . 452
12.3.6 Defining the cross-entropy loss function . . . . . . 452
12.3.7 Optimization method . . . . . . . . . . . . . . . . 453
12.3.8 Accuracy evaluation . . . . . . . . . . . . . . . . . 453
12.3.9 Initiation of the model and training execution . . 453
12.3.10 Prediction with the trained model . . . . . . . . . 455
12.4 Case Study 2: Handwritten Digit Classification with
Sklearn Random Forest Multi-Classifier . . . . . . . . . . . 456
12.5 Case Study 3: Comparison of Random Forest,
Extra-Forest, and Gradient Boosting for
Multi-Classifier . . . . . . . . . . . . . . . . . . . . . . . . 460
12.6 Multi-Classification via TensorFlow . . . . . . . . . . . . . 464
12.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
13 Multilayer Perceptron (MLP) for Regression
and Classification 467
13.1 The General Architecture and Formulations of MLP . . . . 467
13.1.1 The general architecture . . . . . . . . . . . . . . 467
13.1.2 The xw+b formulation . . . . . . . . . . . . . . . 469
13.1.3 The xw formulation, use of affine transformation
weight matrix . . . . . . . . . . . . . . . . . . . . 471
13.1.4 MLP configuration with affine transformation
weight matrix . . . . . . . . . . . . . . . . . . . . 473
13.1.5 Space evolution process in MLP . . . . . . . . . . 474
13.2 Neurons-Samples Theory . . . . . . . . . . . . . . . . . . . 474
13.2.1 Affine spaces and the training parameters
used in an MLP . . . . . . . . . . . . . . . . . . . 475
13.2.2 Neurons-Samples Theory for MLPs . . . . . . . . 476
13.3 Nonlinear Activation Functions for the Hidden Layers . . . 478
13.4 General Rule for Estimating Learning Parameters
in an MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
13.5 Key Techniques for MLP and Its Capability . . . . . . . . 479
13.6 A Case Study on Handwritten Digits Using MXNet . . . . 481
13.6.1 Import necessary libraries and load data . . . . . 481
13.6.2 Set neural network model parameters . . . . . . . 482
13.6.3 Softmax cross entropy loss function . . . . . . . . 482
13.6.4 Define a neural network model . . . . . . . . . . . 483
13.6.5 Optimization method . . . . . . . . . . . . . . . . 484
13.6.6 Model accuracy evaluation . . . . . . . . . . . . . 484
13.6.7 Training the neural network and timing
the training . . . . . . . . . . . . . . . . . . . . . . 484
13.6.8 Prediction with the model trained . . . . . . . . . 486
13.6.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . 487
13.7 Visualization of MLP Weights Using Sklearn . . . . . . . . 488
13.7.1 Import necessary Sklearn module . . . . . . . . . 488
13.7.2 Load MNIST dataset . . . . . . . . . . . . . . . . 488
13.7.3 Set an MLP model . . . . . . . . . . . . . . . . . . 489
13.7.4 Training the MLP model and time the
training . . . . . . . . . . . . . . . . . . . . . . . . 489
13.7.5 Performance analysis . . . . . . . . . . . . . . . . 489
13.7.6 Viewing the weight matrix as images . . . . . . . 490
13.8 MLP for Nonlinear Regression . . . . . . . . . . . . . . . . 490
13.8.1 California housing data and preprocessing . . . . . 492
13.8.2 Configure, train, and test the MLP . . . . . . . . 493
13.8.3 Compute and plot the partial dependence . . . . . 494
13.8.4 Comparison studies on different regressors . . . . 495
13.8.5 Gradient boosting regressor . . . . . . . . . . . . . 495
13.8.6 Decision tree regressor . . . . . . . . . . . . . . . . 498
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
14 Overfitting and Regularization 501
14.1 Why Regularization . . . . . . . . . . . . . . . . . . . . . . 501
14.2 Tikhonov Regularization . . . . . . . . . . . . . . . . . . . 504
14.2.1 Demonstration examples: One data-point . . . . . 508
14.2.2 Demonstration examples: Two data-points . . . . 517
14.2.3 Demonstration examples: Three data-points . . . 521
14.2.4 Summary of the case studies . . . . . . . . . . . . 525
14.3 A Case Study on Regularization Effects using MXNet . . . 526
14.3.1 Load the MNIST dataset . . . . . . . . . . . . . . 527
14.3.2 Define a neural network model . . . . . . . . . . . 527
14.3.3 Define loss function and optimizer . . . . . . . . . 527
14.3.4 Define a function to evaluate the accuracy . . . . 528
14.3.5 Define a utility function plotting
convergence curve . . . . . . . . . . . . . . . . . . 528
14.3.6 Train the neural network model . . . . . . . . . . 529
14.3.7 Evaluation of the trained model: A typical case
of overfitting . . . . . . . . . . . . . . . . . . . . . 531
14.3.8 Application of L2 regularization . . . . . . . . . . 531
14.3.9 Re-initializing the parameters . . . . . . . . . . . 531
14.3.10 Training the L2-regularized neural
network model . . . . . . . . . . . . . . . . . . . . 531
14.3.11 Effect of the L2 regularization . . . . . . . . . . . 533
14.4 A Case Study on Regularization Parameters
Using Sklearn . . . . . . . . . . . . . . . . . . . . . . . . . 534
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
15 Convolutional Neural Network (CNN)
for Classification and Object Detection 539
15.1 Filter and Convolution . . . . . . . . . . . . . . . . . . . . 539
15.2 Affine Transformation Unit in CNNs . . . . . . . . . . . . 542
15.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
15.4 Up Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 545
15.5 Configuration of a Typical CNN . . . . . . . . . . . . . . . 545
15.6 Some Landmark CNNs . . . . . . . . . . . . . . . . . . . . 546
15.6.1 LeNet-5 . . . . . . . . . . . . . . . . . . . . . . . . 547
15.6.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . 548
15.6.3 VGG-16 . . . . . . . . . . . . . . . . . . . . . . . . 549
15.6.4 ResNet . . . . . . . . . . . . . . . . . . . . . . . . 549
15.6.5 Inception . . . . . . . . . . . . . . . . . . . . . . . 551
15.6.6 YOLO: A CONV net for object detection . . . . . 551
15.7 An Example of Convolutional Neural Network . . . . . . . 552
15.7.1 Import TensorFlow . . . . . . . . . . . . . . . . . 553
15.7.2 Download and preparation of a CIFAR10
dataset . . . . . . . . . . . . . . . . . . . . . . . . 553
15.7.3 Verification of the data . . . . . . . . . . . . . . . 553
15.7.4 Creation of Conv2D layers . . . . . . . . . . . . . 554
15.7.5 Add Dense layers to the Conv2D layers . . . . . . 556
15.7.6 Compile and train the CNN model . . . . . . . . . 557
15.7.7 Evaluation of the trained CNN model . . . . . . . 557
15.8 Applications of YOLO for Object Detection . . . . . . . . 558
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
16 Recurrent Neural Network (RNN) and Sequence
Feature Models 563
16.1 A Typical Structure of LSTMs . . . . . . . . . . . . . . . . 564
16.2 Formulation of LSTMs . . . . . . . . . . . . . . . . . . . . 565
16.2.1 General formulation . . . . . . . . . . . . . . . . . 565
16.2.2 LSTM layer and standard neural layer . . . . . . . 566
16.2.3 Reduced LSTM . . . . . . . . . . . . . . . . . . . 566
16.3 Peephole LSTM . . . . . . . . . . . . . . . . . . . . . . . . 567
16.4 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . 568
16.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
16.5.1 A simple reduced LSTM with a standard NN layer
for regression . . . . . . . . . . . . . . . . . . . . . 569
16.5.2 LSTM class in tensorflow.keras . . . . . . . . . . . 574
16.5.3 Using LSTM for handwritten digit recognition . . 575
16.5.4 Using LSTM for predicting dynamics of
moving vectors . . . . . . . . . . . . . . . . . . . . 578
16.6 Examples of LSTM for Speech Recognition . . . . . . . . . 584
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
17 Unsupervised Learning Techniques 585
17.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 585
17.2 K-means for Clustering . . . . . . . . . . . . . . . . . . . . 585
17.2.1 Initialization of means . . . . . . . . . . . . . . . . 586
17.2.2 Assignment of data-points to clusters . . . . . . . 587
17.2.3 Update of means . . . . . . . . . . . . . . . . . . . 588
17.2.4 Example 1: Case studies on comparison of
initiation methods for K-means clustering . . . . 590
17.2.5 Example 2: K-means clustering on the
handwritten digit dataset . . . . . . . . . . . . . . 601
17.3 Mean-Shift for Clustering Without Pre-Specifying k . . . . 605
17.4 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . 609
17.4.1 Basic structure of autoencoders . . . . . . . . . . 610
17.4.2 Example 1: Image compression and denoising . . . 611
17.4.3 Example 2: Image segmentation . . . . . . . . . . 611
17.5 Autoencoder vs. PCA . . . . . . . . . . . . . . . . . . . . . 615
17.6 Variational Autoencoder (VAE) . . . . . . . . . . . . . . . 617
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
18 Reinforcement Learning (RL) 625
18.1 Basic Underlying Concept . . . . . . . . . . . . . . . . . . 625
18.1.1 Problem statement . . . . . . . . . . . . . . . . . 625
18.1.2 Applications in sciences, engineering,
and business . . . . . . . . . . . . . . . . . . . . . 626
18.1.3 Reinforcement learning approach . . . . . . . . . . 627
18.1.4 Actions in discrete time: Solution strategy . . . . 628
18.2 Markov Decision Process . . . . . . . . . . . . . . . . . . . 629
18.3 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
18.4 Value Functions . . . . . . . . . . . . . . . . . . . . . . . . 630
18.5 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . 631
18.6 Q-learning Algorithm . . . . . . . . . . . . . . . . . . . . . 633
18.6.1 Example 1: A robot explores a room with
unknown obstacles with Q-learning algorithm . . . 633
18.6.2 OpenAI Gym . . . . . . . . . . . . . . . . . . . . . 635
18.6.3 Define utility functions . . . . . . . . . . . . . . . 636
18.6.4 A simple Q-learning algorithm . . . . . . . . . . . 636
18.6.5 Hyper-parameters and convergence . . . . . . . . 640
18.7 Q-Network Learning . . . . . . . . . . . . . . . . . . . . . . 641
18.7.1 Example 2: A robot explores a room with
unknown obstacles with Q-Network . . . . . . . . 641
18.7.2 Building TensorFlow graph . . . . . . . . . . . . . 642
18.7.3 Results from the Q-Network . . . . . . . . . . . . 644
18.8 Policy gradient methods . . . . . . . . . . . . . . . . . . . 646
18.8.1 PPO with NN policy . . . . . . . . . . . . . . . . 646
18.8.2 Strategy used in policy gradient methods
and PPO . . . . . . . . . . . . . . . . . . . . . . . 647
18.8.3 Ratio policy . . . . . . . . . . . . . . . . . . . . . 649
18.8.4 PPO: Controlling a pole staying upright . . . . . . 650
18.8.5 Save and reload the learned model . . . . . . . . . 654
18.8.6 Evaluate and view the trained model . . . . . . . 654
18.8.7 PPO: Self-driving car . . . . . . . . . . . . . . . . 657
18.8.8 View samples of the racing car before training . . 658
18.8.9 Train the racing car using the CNN policy . . . . 659
18.8.10 Evaluate and view the learned model . . . . . . . 660
18.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
Index 663
Chapter 1
Introduction
1.1 Naturally Learned Ability for Problem Solving
We deal with all kinds of problems every day and want to solve them in time to make decisions and take actions. We may notice that for many daily-life problems, our decisions are made spontaneously and swiftly, without much conscious effort. This is because we have been learning to solve such problems ever since we were born, and the solutions are already encoded in the neurons of our brain. When we face a similar problem, our decision is spontaneous.
For many complicated problems, especially in science and engineering, one needs to think harder, and often to conduct extensive research and study of the related issues, before a solution can be provided. What if we want to give spontaneous, reliable solutions to these types of problems as well? Some scientists and engineers may be able to do this for some problems, but not many; they have been intensively trained or educated in specially designed courses for dealing with complicated problems.
What if a layman would also like to be able to solve these challenging types of problems? One way is to go through a special learning process. The alternative may be machine learning: developing a special computer model with a mechanism that can be trained to extract features from experience or data, so that it provides a reliable and instantaneous solution for a given type of problem.
1.2 Physics-Law-based Models
Problems in science and engineering are usually much more difficult to solve. This is because we humans can only experience or observe the phenomena associated with a problem, and many phenomena are not easily observable and have very complicated underlying logic. Scientists have been trying to unveil this underlying logic by developing theories (or laws or principles) that best describe the phenomena. These theories are then formulated as algebraic, differential, or integral system equations that govern the key variables involved in the phenomena. The next step is to find a method that can solve these equations for the variables varying in space and time. The final step is to validate the theory by observations and/or experiments that measure the values of these variables. The validated theory is then used to build models to solve problems that exhibit the same phenomena. This type of model is called a physics-law-based model.
The above-mentioned process is essentially what humans have been doing in trying to understand nature, and we have made tremendous progress so far. In this process, we have established a huge number of areas of study, such as physics, mathematics, and biology, which are now referred to as the sciences.

Understanding nature is only a part of the story. Humans want to invent and build new things. A good understanding of various phenomena enables us to do so, and we have built practically everything around us: buildings, bridges, airplanes, space stations, cars, ships, computers, cell phones, the internet, communication systems, and energy systems; the list is endless. In this process, we have established a huge number of areas of development, which are now referred to as engineering.
Understanding biology has helped us discover medicines, treatments for illnesses of humans and animals, treatments for plants and the environment, as well as proper measures and policies for the relationships between humans, animals, plants, and the environment. In this process, we have established a huge number of areas of study, including medicine, agriculture, and ecology.
In this relentless quest throughout history, countless theories, laws, techniques, and methods have been developed in various areas of science, engineering, and biology. For example, in the small area of computational mechanics for designing structural systems, we have developed the finite element method (FEM) [1], the smoothed finite element method (S-FEM) [2], meshfree methods [3, 4], and inverse techniques [5], just to name a few that the author has been working on. It is neither possible nor necessary to list all such methods and techniques here. Our discussion is intended only to provide an overall view of how a problem can be solved based on physics laws.
Note that for many problems in nature, engineering, and society, it is difficult to find proper physics laws that describe them and solve them accurately and effectively. Alternative means are thus needed.
1.3 Machine Learning Models, Data-based
There is a large class of complicated problems (in science, engineering, biology, and daily life) that do not yet have known governing physics laws, or for which the solutions to the governing equations are too expensive to obtain. For this type of problem, on the other hand, we often have data obtained and accumulated through observations, measurements, or historical records. When the data are sufficiently large in volume and of good quality, it is possible to develop computer models that learn from these data. Such a model can then be used to find a solution for this type of problem. This kind of computer model is referred to in this book as a data-based model or machine learning model.
Different types of effective artificial Neural Networks (NNs) with various configurations have been developed and widely used for practical problems in science and engineering, including the multilayer perceptron (MLP) [6–9], Convolutional Neural Networks (CNNs) [10–14], and Recurrent Neural Networks (RNNs) [15–17]. TrumpetNets [8] and TubeNets [9, 18–20] were also recently proposed by the author for creating two-way deepnets using physics-law-based models, such as the FEM [1] and S-FEM [2], as trainers. The unique feature of TrumpetNets and TubeNets is their effectiveness for both forward and inverse problems [5], owing to their unique net architecture. Most importantly, solutions to inverse problems can, for the first time, be derived analytically in explicit formulae. This implies that when a data-based model is built properly, one can find solutions very efficiently.
Machine learning essentially mimics the natural learning process occurring in biological brains, which can have huge numbers of neurons. In terms of the usage of data, there are three major categories:
1. Supervised Learning, using data with true labels (teachers).
2. Unsupervised Learning, using data without labels.
3. Reinforcement Learning, using a predefined environment.
In terms of problems to solve, there are the following:
1. Binary classification problems, answered by a probability of yes or no.
2. k-classification problems, answered by probabilities over k classes.
3. k-clustering problems, answered by k clusters of data-points.
4. Regression (linear or nonlinear), answered by predictions of continuous functions.
5. Feature extraction, answered by the key features in the dataset.
6. Abnormality detection, answered by the abnormal data.
7. Inverse analysis, answered by predictions of features from known responses.
In terms of learning methodology or algorithms, we have the following (a short sketch after this list shows where many of them are available in the Scikit-learn library):
1. Linear and logistic regression, supervised.
2. Decision Tree, supervised.
3. Support Vector Machine (SVM), supervised.
4. Naive Bayes, supervised.
5. Multi-Layer Perceptron (MLP) or artificial Neural Networks (NNs),
supervised.
6. k-Nearest Neighbors (kNN), supervised.
7. Random Forest, supervised.
8. Gradient Boosting types of algorithms, supervised.
9. Principal Components Analysis (PCA), unsupervised.
10. K-means, Mean-Shift, unsupervised.
11. Autoencoders, unsupervised.
12. Markov Decision Process, reinforcement learning.
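As a hedged illustration (not from the original text), the sketch below shows where many of these algorithms are available as ready-made estimators in the Scikit-learn library, which is also used in later chapters; the module paths are standard Scikit-learn ones.

```python
# Where several of the listed algorithms live in Scikit-learn (supervised unless noted).
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC                         # Support Vector Machine
from sklearn.naive_bayes import GaussianNB          # Naive Bayes
from sklearn.neural_network import MLPClassifier    # Multi-Layer Perceptron
from sklearn.neighbors import KNeighborsClassifier  # k-Nearest Neighbors
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.decomposition import PCA               # unsupervised
from sklearn.cluster import KMeans, MeanShift       # unsupervised

# All supervised estimators share one interface: fit on (X, y) and then predict,
# where X is an (m, p) array of m samples with p features and y holds the labels:
#   model = RandomForestClassifier().fit(X_train, y_train)
#   y_pred = model.predict(X_test)
```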
This book will cover most of these algorithms, but our focus will be more on neural-network-based models, because rigorous theory and predictive models can be established for them.
Machine learning is a very active area of research and development. New models, including so-called cognitive machine learning models, are being studied, and there are also techniques for manipulating various ML models. This book, however, will not cover those topics.
1.4 General Steps for Training Machine Learning Models
General steps for training machine learning models are summarized as follows (a minimal code sketch after the list illustrates them):
1. Obtain the dataset for the problem, by your own means of data generation, by importing it from existing sources, or by computer synthesis.
2. Clean up the dataset if it contains objectively known defects.
3. Determine the type of hypothesis for the model.
4. Develop or import a proper module for the algorithm needed for the problem. The learning ability (number of learning parameters) of the model and the size of the dataset shall be properly balanced, if possible. Otherwise, consider the use of regularization techniques.
5. Randomly initialize the learning parameters, or import known pre-trained learning parameters.
6. Perform the training with proper optimization techniques and monitoring measures.
7. Test the trained model using an independent test dataset. This can also be done during the training.
8. Deploy the trained and tested model to the same type of problems from which the training and testing datasets were collected or generated.
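The sketch below is a minimal, hedged illustration of steps 1–7 on a synthetic regression dataset; it assumes the scikit-learn utilities make_regression, train_test_split, StandardScaler, and MLPRegressor, and is not the specific workflow used later in the book.

```python
# A minimal sketch of the general training steps, assuming scikit-learn is available.
from sklearn.datasets import make_regression          # step 1: obtain (here, synthesize) a dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler      # step 2: basic cleanup/scaling
from sklearn.neural_network import MLPRegressor       # steps 3-4: choose hypothesis and module

X, y = make_regression(n_samples=8000, n_features=3, n_targets=2, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 5 (random initialization) happens inside the model; step 6 is the training itself.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
model.fit(X_train, y_train)

# Step 7: test on an independent dataset before deployment (step 8).
print("Test R^2 score:", model.score(X_test, y_test))
```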
1.5 Some Mathematical Concepts, Variables, and Spaces
We shall define the variables and spaces often used in this book for ease of discussion. We first state that this book deals only with real numbers, unless otherwise specified where geometrically closed operations are required. Let us first introduce two toy examples.
1.5.1 Toy examples
Toy Example-1, Regression: Assume we are to build a machine learning model to predict the quality of fruits. Based on three features, size, weight, and roundness (which can easily be observed and measured), we aim to establish a machine learning regression model that predicts the values of two characteristics, sweetness and vitamin-C content (which are difficult to quantify nondestructively), for any given fruit. To build such a model, we make 8,000 measurements on randomly selected fruits from the market and create a dataset with 8,000 paired data-points. Each data-point records the values of the three features and is paired with the values of the two characteristics. The values of these two characteristics are called the labels (ground truth) of the data-point. Such a dataset is called a labeled dataset and can be used systematically to train a machine learning model.
Toy Example-2, Classification: Assume we are to build a machine learning model to classify the type of fruit based on the same three features (size, weight, and roundness). In this case, we want a machine to predict whether any given fruit is an apple or an orange, so that the fruits can be packaged separately in an automatic manner. To achieve this, we make 8,000 measurements on randomly selected fruits of these two types from the market, and create a dataset with 8,000 paired data-points. Each data-point records the values of the three features and is paired with two yes-or-no labels (ground truth), one for apple and one for orange. This dataset is also a labeled dataset for model training.
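As a concrete (hypothetical) illustration of how such labeled datasets are typically held in memory, the arrays below use randomly generated numbers in place of real measurements; only the shapes matter here.

```python
import numpy as np

m, p = 8000, 3                      # m data-points (samples), p features each

# Toy Example-1 (regression): features paired with two continuous labels.
X_reg = np.random.rand(m, p)        # size, weight, roundness for each fruit
y_reg = np.random.rand(m, 2)        # sweetness, vitamin-C content (ground truth)

# Toy Example-2 (classification): features paired with two yes-or-no labels.
X_cls = np.random.rand(m, p)
y_cls = np.zeros((m, 2))            # [1, 0] for apple, [0, 1] for orange
y_cls[: m // 2, 0] = 1.0
y_cls[m // 2 :, 1] = 1.0

print(X_reg.shape, y_reg.shape)     # (8000, 3) (8000, 2)
```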
With an understanding of these two typical types of examples, it should
be easy to extend this to many other types of problems for which a machine
learning model can be effective.
1.5.2 Feature space
Feature space $\mathbb{X}^p$: Machine learning uses datasets that contain $p$ observed or measured real-valued variables, often called features. In our two toy examples, $p = 3$. We may define a $p$-dimensional feature space $\mathbb{X}^p$, which is a vector space (https://en.wikipedia.org/wiki/Vector_space) over the real numbers $\mathbb{R}$ with an inner product defined. A vector in $\mathbb{X}^p$ for an arbitrary point $(x_1, x_2, \ldots, x_p)$ is written as
$$\mathbf{x} = [x_1, x_2, \ldots, x_p], \quad \mathbf{x} \in \mathbb{X}^p \tag{1.1}$$
The origin of $\mathbb{X}^p$ is at $\mathbf{x} = [0, 0, \ldots, 0]$, following the standard for all vector spaces. Note that we use italic for scalar variables, boldface for vectors and matrices, and blackboard bold for spaces (or sets, or objects of that nature); this convention is followed throughout this book. Also, we define all vectors as row vectors by default, as is usual in Python programming. A column vector is treated as a special case of a 2D array (matrix) with only one column.

It is clear that the feature space $\mathbb{X}^p$ is a special case (with vector operations defined) of the real space $\mathbb{R}^p$. Thus, $\mathbb{X}^p \in \mathbb{R}^p$.

Also, $x_i$ $(i = 1, 2, \ldots, p)$ are called linear basis functions (not to be confused with the basis vectors), because a linear combination of the $x_i$ gives a new $\mathbf{x}$ that is still in $\mathbb{X}^p$. A two-dimensional (2D) feature space $\mathbb{X}^2$ is the black plane $x_1$–$x_2$ shown in Fig. 1.1.

An observed data-point with $p$ features is a discrete point in the space, and the corresponding vector $\mathbf{x}_i$ is expressed as
$$\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}], \quad \mathbf{x}_i \in \mathbb{X}^p, \quad \forall\, i = 1, 2, \ldots, m \tag{1.2}$$
where $m$ is the number of measurements, observations, or data-points in the dataset, often also referred to as the number of samples. For the two toy examples, $m = 8{,}000$. For the example shown in Fig. 1.1, the 4 blue vectors represent four data-points in the space $\mathbb{X}^2$, and $m = 4$.
Figure 1.1: Data-points in a 2D feature space $\mathbb{X}^2$ with blue vectors $\mathbf{x}_i = [x_{i1}, x_{i2}]$, and the same data-points in the augmented feature space $\bar{\mathbb{X}}^2$, called the affine space, with red vectors $\mathbf{x}_i = [1, x_{i1}, x_{i2}]$; $i = 1, 2, 3, 4$.
These data-points $\mathbf{x}_i$ $(i = 1, 2, \ldots, m)$ can be stacked to form a dataset matrix denoted $\mathbf{X} \in \mathbb{X}^p$. This is for convenience in formulation; we do not usually form such a matrix in computation, because it can be very large for big datasets with large $m$.
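A small (hypothetical) numerical illustration of this stacking, for the four 2D data-points of Fig. 1.1, might look as follows; the coordinate values are made up for the example.

```python
import numpy as np

# Four data-points in a 2D feature space, each a row vector x_i = [x_i1, x_i2].
x1 = np.array([0.5, 1.0])
x2 = np.array([1.5, 0.5])
x3 = np.array([2.0, 2.0])
x4 = np.array([0.8, 1.7])

# Stacking the m = 4 data-points row by row gives the dataset matrix X of shape (m, p).
X = np.vstack([x1, x2, x3, x4])
print(X.shape)   # (4, 2)
```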
1.5.3 Affine space
Affine space $\bar{\mathbb{X}}^p$: This is an augmented feature space. It is the red plane shown in Fig. 1.1. It has a "complete" set of linear bases (or basis functions):
$$\mathbf{x} = [1, x_1, x_2, \ldots, x_p] \tag{1.3}$$
By complete linear bases, we mean all bases up to the 1st order of all the variables, including the 0th order. The 0th-order basis is the constant basis 1, which provides the augmentation. An affine space is not a vector space, because $\mathbf{0} \notin \bar{\mathbb{X}}^p$ and $(\mathbf{x}_i + \mathbf{x}_j) \notin \bar{\mathbb{X}}^p$, where $i, j = 1, 2, 3,$ or $4$ in Fig. 1.1. This special and fundamentally useful space always has a constant 1 as a component, and thus it has no origin by definition. An operation that acts on an affine space and stays in an affine space is called an affine transformation. It is the most essential operation in the major machine learning models, and the fundamental reason such models are predictive.

An observed data-point with $p$ features can also be represented as an augmented discrete point in the $\bar{\mathbb{X}}^p$ space and can be expressed as
$$\mathbf{x}_i = [1, x_{i1}, x_{i2}, \ldots, x_{ip}], \quad \mathbf{x}_i \in \bar{\mathbb{X}}^p, \quad \forall\, i = 1, 2, \ldots, m \tag{1.4}$$
p
A X space can be created by first spanning Xp by one dimension to Xp+1
via introduction of a new variable x0 as
[x0 , x1 , x2 , . . . , xp ] (1.5)
and then set x0 = 1. These 4 red vectors shown in Fig. 1.1 live in an affine
2
space X .
p
Note that the affine space X is neither Xp+1 nor Xp , and is quite
p
special. A vector in a X is in Xp+1 , but the tip of the vector is confined
in “hyperplane” of x0 = 1. For convenience of discussion in this book, we
say that an affine space has a pseudo-dimension that is p + 1. Its true
dimension is p, but it is a hyperplane in a Xp+1 space.
In terms of function approximation, the linear bases given in Eq. (1.3)
can be used to construct any arbitrary linear function in the feature
space. A proper linear combination of these complete linear bases is still
in the affine space. Such a combination can be used to perform an affine
transformation, which will be discussed in detail in Chapter 5.
These data-points xi (i = 1, 2, . . . , m) are stacked to form an augmented dataset X ∈ X̄p, which is the well-known moment matrix in function approximation theory [1–4]. Again, this is for convenience in formulation. We may not form such a matrix in computation.
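As a minimal sketch (continuing the made-up NumPy example above), the augmentation can be performed by prepending a column of ones, which plays the role of the constant basis x0 = 1:

import numpy as np

X = np.array([[0.5, 1.0],
              [1.5, 0.5],
              [1.0, 2.0],
              [2.0, 1.5]])                  # m = 4 data-points, p = 2 features
ones = np.ones((X.shape[0], 1))             # the 0th-order (constant) basis
X_bar = np.hstack([ones, X])                # augmented dataset, shape (m, p + 1)
print(X_bar)                                # each row is [1, xi1, xi2]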
1.5.4 Label space
Label space Yk: Consider a labeled dataset for supervised machine learning model creation. We shall introduce variables (y1, y2, . . . , yk) of real numbers in R. For toy example-1, k = 2. We may define a label space Yk over the real numbers. It is a vector space. A vector in the space Yk can be written as
y = [y1, y2, . . . , yk], y ∈ Yk ⊆ Rk    (1.6)
A label in a dataset is paired with a data-point. The label for data-point xi
which is denoted as yi can be expressed as
yi = [yi1 , yi2 , . . . , yik ], yi ∈ Yk , ∀i = 1, 2, . . . , m (1.7)
For the toy example-1, yij (i = 1, 2, . . . , 8000; j = 1, 2) are 8,000 real numbers
in the 2D space Y2. For the toy example-2, each label, yi1 or yi2, has a value of 0 or 1 (or −1 or 1), but the labels can still be viewed as living in Y2.
These labels yi (i = 1, 2, . . . , m) can be stacked to form a label set Y ∈ Yk ,
although we may not really do so in computation.
Typically, affine transformations end at the output layer in a neural network and produce a vector in a label space, so that a loss function can be constructed there for “terminal control”.
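As a minimal sketch (with made-up values), labels can likewise be stored as an array with one row per data-point and k columns:

import numpy as np

# Real-valued labels, in the spirit of toy example-1 (k = 2):
Y_reg = np.array([[0.2, 1.3],
                  [0.8, 0.4],
                  [1.1, 2.2],
                  [1.9, 1.0]])
# 0/1 labels, in the spirit of toy example-2 (k = 2):
Y_cls = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1]])
print(Y_reg.shape, Y_cls.shape)    # (4, 2) (4, 2)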
1.5.5 Hypothesis space
The learning parameters ŵ in a machine learning model are continuous variables that live in a hypothesis space denoted as WP over the real numbers. Learning parameters are also called training or trainable parameters; we use these terms interchangeably. The learning parameters include the weights and biases in each and every layer. The hat above w implies that it is a collection of all weights and biases, so that we have a single vector notation for all learning parameters. Its dimension P depends on the type of hypothesis used, including the configuration of the neural networks or ML models. These parameters always work with feature vectors, producing intermediate feature vectors in a new feature space or in a label space, through a properly designed architecture.
These parameters need to be updated, which involves vector operations. To ensure convergence, we need the vector of all learning parameters to obey important vector properties, such as inner products, norms, and the Cauchy-Schwarz inequality. We will carry out such proofs multiple times in this book. Therefore, we require WP to be a vector space, so that each update to the current learning parameters results in new parameters that are still in the same vector space, until they converge.
Note that the learning parameters, in general, are in the form of matrices or column vectors (which can be viewed as a special case of matrices). In a typical machine learning model, there can be multiple matrices of different sizes. These matrices form affine transformation matrices that operate on features in affine spaces. A component in a “vector” of the hypothesis space can in fact be a matrix in general, and thus it is not easy to comprehend intuitively. The easiest (and valid) way is to “flatten” all the matrices and then “concatenate” them together to form a tall vector, which is then treated as a usual vector. We do this kind of flattening and concatenation all the time in Python. Such a flattened tall vector ŵ in the hypothesis space WP can be written generally as
ŵ = [W0, W1, . . . , WP] ∈ WP    (1.8)
We will discuss the details of WP for various models, including estimation of the dimension P, in later chapters.
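As a minimal sketch (the layer sizes here are made up), the flattening and concatenation can be done with NumPy as follows:

import numpy as np

# Hypothetical weight matrices and bias vectors of a small two-layer model.
W1, b1 = np.ones((3, 2)), np.zeros(3)        # first layer: 3x2 weights, 3 biases
W2, b2 = np.ones((1, 3)), np.zeros(1)        # second layer: 1x3 weights, 1 bias

w_hat = np.concatenate([W1.ravel(), b1.ravel(),
                        W2.ravel(), b2.ravel()])   # one tall vector in WP
print(w_hat.shape)                           # (13,), so P = 13 for this toy model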
1.5.6 Definition of a typical machine learning model,
a mathematical view
Finally, we can mathematically define ML models for prediction as a mapping operator:
M(ŵ ∈ WP; X ∈ X̄p, Y ∈ Yk) : Xp → Yk    (1.9)
It reads that the ML model M uses a given dataset X with Y to train its learning parameters ŵ, and produces a map (or giant functions) that makes a prediction in the label space for any point in the feature space.
The ML model shown in Eq. (1.9) is in fact a data-parameter converter: it converts a given dataset to learning parameters during training, and then converts the parameters back when making a prediction for a given set of feature variables. It can also be mathematically viewed as a giant function with k components over the feature space Xp, controlled (parameterized) by the training parameters in WP. When the parameters are tuned, one gets a set of k giant functions over the feature space.
On the other hand, this set of k giant functions can also be viewed as continuous (differentiable) functions of these parameters for any given data-point in the dataset, which can be used to form a loss function that is also differentiable. Such a loss function can be the error between these k giant functions and the corresponding k labels given in the dataset. It can be viewed as a functional of the prediction functions, which in turn are functions of ŵ in the vector space WP. The training is to minimize such a loss function over all the data-points in the dataset, by updating the training parameters until they become minimizers. This overview picture will be made explicit in a formula in later chapters. The success factors for building a quality ML model include (1) the type of hypothesis, (2) the number of learning parameters in WP, (3) the quality of the dataset in Xp (its representativeness of the underlying problem to be modeled, including correctness, size, data-point distribution over the feature space, and noise level), and (4) the techniques used to find the minimizing learning parameters that best reproduce the labels in the dataset. We will discuss this in detail in later chapters for different machine learning models.
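As a minimal sketch of this picture (a plain linear hypothesis on made-up data, with a squared-error loss minimized by simple gradient descent; actual models and training algorithms are discussed in later chapters):

import numpy as np

np.random.seed(0)
X = np.random.normal(size=(100, 2))             # made-up dataset: m = 100, p = 2
y = X @ np.array([2.0, -1.0]) + 0.5             # made-up labels, k = 1

X_bar = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented feature matrix
w_hat = np.zeros(3)                             # learning parameters [b, w1, w2]

for _ in range(200):                            # minimize the mean squared error
    residual = X_bar @ w_hat - y                # prediction error for all data-points
    grad = 2.0 * X_bar.T @ residual / len(y)    # gradient of the loss w.r.t. w_hat
    w_hat -= 0.1 * grad                         # update in the hypothesis space WP

print(w_hat)                                    # approaches [0.5, 2.0, -1.0]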
Concepts of spaces are helpful in our later analysis of the predictive properties of machine learning models. Readers may find it difficult to comprehend these concepts at this stage; they are advised to form just a rough idea for now and to revisit this section when reading the relevant chapters. Readers may also jump to Section 13.1.5 and take a look at Eq. (13.13) there for a quick glance at how the spaces evolve in a deepnet.
Note also that there are ML models for discontinuous feature variables,
and the learning parameters may not need to be continuous. Such methods
are often developed based on proper intuitive rules and techniques, and
we will discuss some of those. The concepts on spaces may not be directly
applicable but can often help.
1.6 Requirements for Creating Machine Learning Models
To train a machine learning model, one would need the following:
1. A dataset, which may be obtained via observations, experiments, or physics-law-based models. The dataset is usually divided (in a random manner) into two mutually independent subsets, a training dataset and a testing dataset, typically at a ratio of 75:25 (a minimal sketch of such a split is given after this list). The independence of the testing dataset is critical, because ML models are determined largely by the training dataset, and hence their reliability depends on objective testing.
2. Labels with the dataset, if possible.
3. Prior information on the dataset if possible, such as the quality of the
data and key features of the data. This can be useful in choosing a proper
algorithm for the problem, and in application of regularization techniques
in the training.
4. Proper computer software modules and/or effective algorithms.
5. A computer, preferably connected to the internet.
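As a minimal sketch of item 1 (using the widely used scikit-learn utility train_test_split on made-up data; any equivalent random split would do):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)       # a made-up dataset with m = 20 data-points
y = np.arange(20)                      # and their labels

# Random 75:25 split into mutually independent training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)     # (15, 2) (5, 2)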
1.7 Types of Data
Data are the key to any data-based model. There are many types of data available for different types of problems that one may make use of, as follows:
• Images: photos from cameras (more often now cellphones), images obtained from open websites, computed tomography (CT), X-ray, ultrasound, magnetic resonance imaging (MRI), etc.
• Computer-generated data: data from proven physics-law-based mod-
els, other surrogate models, other reliable trained machine learning
models, etc.
• Text: unclassified text documents, books, emails, webpages, social media
records, etc.
• Audio and video: audio and video recordings.
Note that the quality and the sampling domain of the dataset play important roles in training reliable machine learning models. Use of a trained model beyond the data sampling domain requires special caution, because it can go wrong unexpectedly, and hence be very dangerous.
1.8 Relation Between Physics-Law-based and
Data-based Models
Machine learning models are in general slow learners but fast predictors, while physics-law-based models do not need to learn (they use existing laws) but are slow in prediction. This is because the strategies for physics-law-based models and those for data-based models are quite different. ML models
use datasets to train the parameters, but physics-law-based models use laws
to determine the parameters.
However, at the detailed computational methodology level, many tech-
niques used in both models are in fact the same or quite similar. For example,
when we express a variable as a function of other variables, both models
use basis functions (polynomial, radial basis functions (RBFs), or both). In constructing objective functions, the least-squares error formulation is
used in both. In addition, the regularization methods used are also quite
similar. Therefore, one should not study these models in total isolation. The
ideas and techniques may be deeply connected and mutually adaptable. This
realization can be useful in better understanding and further development
of more effective methods for both models, by exchanging the ideas and
techniques from one to another. In general, for physics-law-based computa-
tional methods, such as the general form of meshfree methods, we understand
reasonably well why and how a method works in theory [3]. Therefore, we are
quite confident about what we are going to obtain when a method is used
for a problem. For data-based methods, however, this is not always true.
Therefore, it is of importance to develop fundamental theories for data-based
methods. The author made some attempts [21] to reveal the relationship
between physics-law-based and data-based models, and to establish some
theoretical foundation for data-based models. In this book, we will try to
discuss the similarities and differences, when a computational method is
used in both models.
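As a minimal sketch of this shared machinery (a least-squares fit with the linear bases [1, x] on made-up measurements; the same error formulation appears in both families of models):

import numpy as np

x = np.linspace(0.0, 1.0, 20)                    # made-up sampling points
y = 3.0 * x + 1.0 + 0.05 * np.sin(20.0 * x)      # made-up "measurements" with a small perturbation

X_bar = np.column_stack([np.ones_like(x), x])    # moment matrix built from the bases [1, x]
w, *_ = np.linalg.lstsq(X_bar, y, rcond=None)    # least-squares solution for the parameters
print(w)                                         # close to [1.0, 3.0]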
1.9 This Book
This book offers an introduction to general topics on machine learning. Our
focus will be on the basic concepts, fundamental theories, and essential
computational techniques related to creation of various machine learn-
ing models. We decided not to provide a comprehensive document for
all the machine learning techniques, models, and algorithms. This is
because the topic of machine learning is very extensive and it is not possible
to be comprehensive in content. Also, it is really not possible for many read-
ers to learn all the content. In addition, there are in fact plenty of documents
and codes available publicly online. There is no lack of material, and there is
no need to simply reproduce these materials. In the opinion of the author, the
best learning approach is to learn the most essential basics and build a strong
foundation, which is sufficient to learn other related topics, methods, and
algorithms. Most importantly, readers with strong fundamentals can even
develop innovative and more effective machine learning models for their problems. Based on this philosophy, the highlights of the book that cannot be found easily, or in complete form, in the open literature are listed as follows; many of them are the outcomes of the author's studies in the past years:
1. Detailed discussion on and demonstration of predictability for arbitrary
linear functions of the basic hypothesis used in major ML models.
2. Affine transformation properties and their demonstrations, affine space,
affine transformation unit, array, chained arrays, roles of the weights and
biases, and roles of activation functions for deepnet construction.
3. Examination of predictability of high-order functions and a Universal
Prediction Theory for deepnets.
4. A concept of data-parameter converter, parameter encoding, and unique-
ness of the encoding.
5. Role of affine transformation in SVM, complete description of SVM
formulation, and the kernel trick.
6. Detailed discussion on and demonstration of activation functions,
Neural-Pulse-Unit (NPU), leading to the Universal Approximation
Theorem for wide-nets.
7. Differentiation of a function with respect to a vector and matrix, leading
to automatic differentiation and Autograd.
8. Solution Existence Theory, effects of parallel data-points, and pre-
dictability of the solution against the label.
9. Neurons-Samples Theory gives, for the first time, a general rule of thumb
on the relationship between the number of data-points and the number of neurons in a neural network (or the total pseudo-dimensions of the affine
spaces involved).
10. Detailed discussion on and demonstration of Tikhonov regularization
effects.
The author has made substantial effort to write Python codes to demonstrate the essential and difficult concepts and formulations, which allows readers to comprehend each chapter more readily. Based on the learning experience of the author, this can make the learning more effective.
The chapters of this book are written, in principle, to be readable independently, which requires allowing some duplication. The necessary cross-references provided between chapters are kept to a minimum.
1.10 Who May Read This Book
The book is written for beginners interested in learning the basics of machine learning, including university students who have completed their first
year, graduate students, researchers, and professionals in engineering and
sciences. Engineers and practitioners who want to learn to build machine
learning models may also find the book useful. Basic knowledge of college
mathematics is helpful in reading this book smoothly.
This book may be used as a textbook for undergraduates (3rd year or
senior) and graduate students. If this book is adopted as a textbook, the
instructor may contact the author ([email protected]) directly for some
homework and course projects and solutions.
Machine learning is still a fast-developing area of research. There still exist
many challenging problems, which offer ample opportunities for research to
develop new methods and algorithms. Currently, it is a hot topic of research
and applications. Different techniques are being developed every day, and
new businesses are formed constantly. It is the hope of the author that this
book can be helpful in studying existing and developing machine learning
models.
1.11 Codes Used in This Book
The book has been written using Jupyter Notebook with codes.
Readers who purchased the book may contact the author directly
(mailto:[email protected]) to request a softcopy of the book with codes
(which may be updated), free for academic use after registration. The
conditions for use of the book and codes developed by the author, in both
hardcopy and softcopy, are as follows:
1. Users are entirely at their own risk when using any part of the codes and techniques.
2. The book and codes are only for your own use. You are not allowed to further distribute them without permission from the author of the code.
3. There will be no user support.
4. Proper reference and acknowledgment must be given for the use of the
book, codes, ideas, and techniques.
Note that the handcrafted codes provided in the book are mainly for
studying and better understanding the theory and formulation of ML
methods. For production runs, well-established and well-tested packages
should be used, and there are plenty out there, including but not limited
to scikit-learn, PyTorch, TensorFlow, and Keras. Also, the codes provided
are often run with various packages/modules. Therefore, care is needed when
using these codes, because the behavior of the codes often depends on the
versions of Python and all these packages/modules. When the codes do not
run as expected, version mismatch could be one of the problems. When this
book was written, the versions of Python and some of the packages/modules
were as follows:
• Python 3.6.13 :: Anaconda, Inc.
• Jupyter Notebook (web-based) 6.3.0
• TensorFlow 2.4.1
• keras 2.4.3
• gym 0.18.0
When issues are encountered in running a code, readers may need to
check the versions of the packages/modules used. If Anaconda Navigator
is used, the versions of all the packages/modules installed with the Python environment are listed when that environment is highlighted. You can also check the version of a package in a code cell of the Jupyter Notebook. For example, to check the version of the current Python environment, one may use
!python -V # ! is used to execute an external command
Python 3.6.13 :: Anaconda, Inc.
To check the version of a package/module, one may use
• import package name
• print(‘package name version’,package name)
For example,
import keras
print('keras version',keras.__version__)
import tensorflow as tf
print('tensorflow version',tf.version.VERSION)
keras version 2.4.3
tensorflow version 2.4.1
If the version is indeed an issue, one would need to either modify the code to fit the version or install the correct version in your system, perhaps by creating an alternative environment. It is very useful to search the web using the error message; solutions or leads can often be found. This is the approach the author often takes when encountering an issue in running a code. Finally, this book has used materials and information available on the web, with links. These links may change over time, because of the nature of the web. The most effective way (and the one often used by the author) of dealing with this matter is to search online using keywords, if the link is lost.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course,
Butterworth-Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor
and Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space
Theory and Weakened Weak Forms, World Scientific, New Jersey, 2013.
[5] G.R. Liu and X. Han, Computational Inverse Techniques in Nondestructive Evalua-
tion, Taylor and Francis Group, New York, 2003.
[6] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms, New York, 1962. https://books.google.com/books?id=7FhRAA
AAMAAJ.
[7] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations
by Error Propagation, 1986.
[8] G.R. Liu, FEA-AI and AI-AI: Two-way deepnets for real-time computations for both
forward and inverse mechanics problems, International Journal of Computational
Methods, 16(08), 1950045, 2019.
[9] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., TubeNet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[10] Fukushima Kunihiko, Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position, Biological Cyber-
netics, 36(4), 193–202, Apr 1980. https://doi.org/10.1007%2Fbf00344251.
[11] D. Ciregan, U. Meier and J. Schmidhuber, Multi-column deep neural networks
for image classification, 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[12] M.V. Valueva, N.N. Nagornov, P.A. Lyakhov et al., Application of the residue number
system to reduce hardware costs of the convolutional neural network implementation,
Mathematics and Computers in Simulation, 177, 232–243, 2020.
[13] Duan Shuyong, Ma Honglei, G.R. Liu et al., Development of an automatic lawnmower
with real-time computer vision for obstacle avoidance, International Journal of
Computational Methods, Accepted, 2021.
[14] Duan Shuyong, Lu Ningning, Lyu Zhongwei et al., An anchor box setting technique
based on differences between categories for object detection, International Journal of
Intelligent Robotics and Applications, 6, 38–51, 2021.
[15] M. Warren and P. Walter, A logical calculus of ideas immanent in nervous activity,
Bulletin of Mathematical Biophysics, 5, 127–147, 1943.
[16] J. Schmidhuber, Habilitation Thesis: An Ancient Experiment with Credit Assignment
Across 1200 Time Steps or Virtual Layers and Unsupervised Pre-training for a
Stack of Recurrent NNs, 1993, TUM. https://people.idsia.ch//∼juergen/habilitation/
node114.html.
[17] Yu Yong, Si Xiaosheng, Hu Changhua et al., A review of recurrent neural networks:
LSTM cells and network architectures, Neural Computation, 31(7), 1235–1270,
2019. https://direct.mit.edu/neco/article/31/7/1235/8500/A-Review-of-Recurrent-
Neural-Networks-LSTM-Cells.
[18] L. Shi, F. Wang, S. Duan et al., Two-way TubeNets uncertain inverse methods for
improving positioning accuracy of robots based on interval, The 11th International
Conference on Computational Methods (ICCM2020), 2020.
[19] Duan Shuyong, Shi Lutong, G.R. Liu et al., An uncertainty inversion technique using
two-way neural network for parameter identification of robot arms, Inverse Problems
in Science & Engineering, 29, 3279–3304, 2021.
[20] Duan Shuyong, Wang Li, G.R. Liu et al., A technique for inversely identifying joint-
stiffnesses of robot arms via two-way TubeNets, Inverse Problems in Science &
Engineering, 13, 3041–3061, 2021.
[21] G.R. Liu, A neural element method, International Journal of Computational Methods,
17(07), 2050021, 2020.
Chapter 2
Basics of Python
This chapter discusses the basics of the Python language for coding machine learning models. Python is a very powerful high-level programming language without the need for compiling, but with some level of the efficiency of machine-level languages. It has become the most popular tool for the development of tools and applications in the general area of machine learning. It has rich libraries for open access, and new libraries are constantly being developed. The language itself is powerful in terms of functionality. It is an excellent tool for effective and productive coding and programming. It is also fast, and the structure of the language is well suited to making use of bulky data, which is often the case in machine learning.
This chapter is not formal training on Python; it is just to help readers have a smoother start in learning and practicing the materials in the later chapters. Our focus will be on some useful simple tricks that are familiar to the author, and on some behavioral subtleties that often affect our coding in ML.
ML. Readers familiar with Python may simply skip this chapter. We will
use the Jupyter Notebook as the platform for the discussions, so that the
documentation and demonstration can be all in a single file.
You may go online and have the Jupyter Notebook installed from, for example, https://www.anaconda.com/distribution/, where you can have Jupyter Notebook and Python installed at the same time, perhaps along with another useful Python IDE (Integrated Development Environment) called Spyder. On my laptop, I have all three pieces ready to use.
A Jupyter Notebook consists of “cells” of different types: cells for code and cells for text called “markdown” cells. Each cell is framed with colored borders, and the color shows up when the cell is clicked on. A green border indicates that the cell is in the input mode, and one can type and edit the contents. Pressing “Ctrl + Enter” within the cell, the green border changes
to blue color, indicating that this cell is formatted or executed, and may
produce an outcome. Double clicking on the blue framed cell sets it back to
the input mode. The right vertical border is made thicker for better viewing.
This should be sufficient for us to get going. One will get more skills (such
as adding cells, deleting cells, and converting cell types) by playing and
navigating among the menu bars on the top of the Notebook window.
Googling the open online sources is excellent for getting help when one
has a question. The author does this all the time. Sources of the reference
materials include the following:
• https://docs.python.org/3.7/
• https://docs.scipy.org/doc/numpy/reference/?v=20191112052936
• https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
• https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
• https://www.python.org/about/
• https://www.learnpython.org/
• https://en.wikipedia.org/wiki/Python (programming language)
• https://www.learnpython.org/en/Basic Operators+
• https://www.python.org/+
• https://jupyter.org/
• https://www.youtube.com/watch?v=HW29067qVWkb
• https://pynative.com/
The following lists the details of the versions of modules related to Jupyter Notebook in the current installation on the author's laptop (use “!” to execute an external command):
!jupyter --version
jupyter core : 4.7.1
jupyter-notebook : 6.3.0
qtconsole : not installed
ipython : 7.16.1
ipykernel : 5.3.4
jupyter client : 6.1.12
jupyter lab : not installed
nbconvert : 6.0.7
ipywidgets : 7.6.3
nbformat : 5.1.3
traitlets : 4.3.3
2.1 An Exercise
Let us have a less conventional introduction here. Different from other books
on computer languages, we start the discussion on how to make use of our
own codes that we may develop during the course of study.
First, we “import” the Python system library or module from external or
internal sources, so that functions (also called methods) there can be used
in our code for interaction with your computer system. The most important
environment setting is the path.
import sys # import an external module "sys" which
# provides tools for accessing the computer system.
sys.path.append('grbin')
# I made a code in folder grbin in the current
# working directory, and want to use it later.
#print(sys.path)   # check the current paths.
# To execute this or any cell, use Ctrl-Enter (hold Ctrl and press Enter).
Note that “#” in a code cell starts a comment line. It can be put anywhere
in a line in a code. Everything in the line behind # becomes comments, and
is not treated as a part of the code.
One may remove “#” in front of print(sys.path), execute it, and a number
of paths will be printed. Many of them were set during the installations
of the system and various codes, including the Anaconda and Python.
“grbin” in the current working directory has just been added using sys.path.append().
When multiple lines of comments are needed, we use “doc-strings” as
follows:
'''Inside here are all comments with multiple lines. It is \
a good way to convey information to users, co-programmers. \
Use a backslash to break a line.'''
'Inside here are all comments with multiple lines. It is a
good way to convey information to users, co-programmers.
Use a backslash to break a line.'
Just for demonstration purposes, we now import our own “module” (a
Python file named as grcodes.py) “grcodes”, and then give it an alias “gr” for
easy reference later, when accessing the attributes, functions, classes, etc.,
inside the module.
import grcodes as gr # a Python code grcodes.py in 'grbin'.
The following cell contains the Python code “grcodes.py”. Readers may
create the “grcodes.py” file and put it in the folder “grbin” (or any other
folder), so that the cell above can be executed and “gr.printx()” can be used.
from __future__ import print_function  # import external module
import sys

# Define a function
def printx(name):
    """ This prints out both the name and its value together.
        usage: name = 88.0; printx('name') """
    frame = sys._getframe(1)
    print(name, '=', repr(eval(name, frame.f_globals, frame.f_locals)))
Let us try to use a function in the imported module grcodes, using its
alias gr.
x = 1.0 # Assign x a value.
print(x) # The Python built-in print() function prints
# the value of the given argument x.
gr.printx('x') # a function from the gr module. It prints the
# argument name, and its value at the same time.
1.0
x = 1.0
help(gr.printx) # Find out the usage of the gr.printx function
Help on function printx in module grcodes:
printx(name)
This prints out both the name and its value together.
usage: name = 88.0; printx('name')
Nice. I have actually completed a simple task of printing out “x” using
Python, and in two ways. The gr.printx function is equivalent to doing the
following:
print('x=',x) # you must type the same x twice
x= 1.0
Notice in this case that you must type the same x twice, which gives room for error. A good code should have as little repetition as possible, allowing
easy maintenance. When a change is needed, the programmer (or others
using or maintaining the code) shall just need to do it once.
One can also import functions from a module in the following manner:
from grcodes import printx # you may import more functions
# by adding the function names separated with ",".
#from grcodes import * # Import everything from grcodes
# This is not a very good practice, because it can
# lead to problems when some function names in
# grcodes happened to be the same as those in the code.
In this case, we can now use the imported functions as if they were written in the current file (notebook).
gr.printx('x')
printx('x') # Notice that "gr." is no longer needed.
x = 1.0
x = 1.0
2.2 Briefing on Python
Now, what is Python? Python was created by Guido van Rossum and first
released in 1991. Python’s design philosophy emphasizes code “readability”.
It uses an object-oriented approach aiming to help programmers to write
clear, less repetitive, logical codes for small- and large-scale projects that
may have teams of people working together.
Python is open source, and its interpreters are available for many
operating systems. A global community of programmers develops and
maintains CPython, an open-source reference implementation. A non-
profit organization, the Python Software Foundation, manages and directs
resources for Python and CPython development.
The language’s core philosophy is summarized in the document The Zen
of Python (PEP 20), which includes aphorisms such as the following:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.
Guido van Rossum manages Python development projects together with
a steering council. There are two major Python versions, Python 2 and
Python 3, and they are quite different. Python 3.0, released 2008, was a
major revision. It is not completely backward-compatible with Python 2.
Due to the number of codes written for Python 2, support for Python 2.7
(the last release in the 2.x series) was extended to 2020. At the time of
writing this book, Python 3.9 had already been released. This tutorial uses
Python 3.6 because it supports more existing libraries and modules.
There are a huge number (probably in the order of hundreds) of computer
programming languages developed so far. The author’s first experience with
computer programming languages was in the 1970s, when learning BASIC
for programming. He used ALGOL60 later and then FORTRAN for a long
time from the 1970s till today, along with limited use of Matlab, C, C++,
and now Python. Any programming language has a complicated syntax and
deeply organized logic. For a user like the author, the best approach to learn a
computer programming language is via examples and practice, while paying
attention to the syntax, properties, and behavior. For a beginner, following
examples is probably the best approach to get started. This will be the
guidance in writing this section of the book. For rigorous syntax, readers may
read the relevant documentations that are readily available online. We will
give a lot of examples, with explanations in the form of comments (as a
programmer often does). All these examples may be directly executed within
this book while reading, so that readers can have a real-time feeling for
easy observation of the behavior and hence comprehension. Readers may
also make use of it via a simple copy and paste to form his/her notebook.
Because of this example-based approach, the discussions on different topics
may jump a little here and there.
To write and execute a Python code, one often uses an IDE. Jupyter
Notebook, PyCharm, and Spyder are among the popular IDEs. In this
book, we use Jupyter Notebook (https://jupyter.org/) via the distributor
Anaconda (https://www.anaconda.com/). Jupyter Notebook can be used
not only as an IDE but also as a nice document generator allowing
description text (markdown cells) and code cells to be edited together in
one document. The lecture notes used in the author’s machine learning
course have also been mostly developed using Jupyter Notebook. The
documents created using Jupyter Notebook can be exported in various types
of documents, including ascii doc, html, latex, markdown, pdf, Python(.py),
and slides(.slides.html). Readers and students may use this notebook as a
template for your documents (project reports, homeworks, etc.), if so desired.
If one needs to use a spelling check when typing in the markdown cells in
a Jupyter Notebook, the following commands should be executed in the
Anaconda Prompt:
• %pip install jupyter_contrib_nbextensions
• %jupyter contrib nbextension install --user
• %jupyter nbextension enable spellchecker/main
This would mark the misspelled words for you (but will not provide
suggestions). Other necessary modules with add-on functions may also be
installed in a similar manner.
This book covers in a brief manner a tiny portion of Python.
2.3 Variable Types
Python is said to be object oriented. Every variable in Python is an object. It is “dynamically typed”: the type of a variable is determined at the point where a value is assigned to it. You do not need to declare variables before using them, or declare their types. It has some basic types of variables: Numbers and Strings. These
variables can stand alone, or form a Tuple, List, Dictionary, Set, Numpy
Arrays, etc. They all can be subjected to various operations (arithmetic,
boolean, logical, formatting, etc.) in a code. Note that the variable is loosely
defined, meaning it could be a Tuple, List, Dictionary, etc. For example, a
List can be in a List, a Tuple in a List, or a List in a Tuple.
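For instance, the nesting just described can be seen in a small made-up example:

nested = [1, [2, 3], (4, 5), "six"]    # a list holding an integer, a list, a tuple, and a string
print(nested[1], nested[1][0])         # [2, 3] 2
print(type(nested[2]))                 # <class 'tuple'>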
2.3.1 Numbers
Python supports three types of numbers — integers, floating point
numbers, and complex numbers.
To define an integer variable, one can simply assign it with an integer.
my_int = 48 # by this assignment, my_int becomes an integer
print(my_int); printx('my_int')
48
my_int = 48
type(my_int) # Check the type of the variable. print() is not
# needed, if it is the last line in the cell
int
my_int = 5.0 # by this my_int becomes now a float
type(my_int)
float
my_complex=8.0+5.0j # by this my_complex is a complex number
print(my_complex)
printx('my_complex')
(8+5j)
my_complex = (8+5j)
my_int, my_float, my_string = 20, 10.0, "Hello!"
if my_string == "Hello!":   # comparison operators: ==, !=, <, <=, >, >=
    print("A string: %s" % my_string)   # Indented 4 spaces
if isinstance(my_float, float) and my_float == 10.0:
    # isinstance(): returns True if an object is an instance of a class
    print("This is a float, and it is %f" % my_float)
if isinstance(my_int, int) and my_int == 20:
    print("This is an integer, and it is: %d" % my_int)
A string: Hello!
This is a float, and it is 10.000000
This is an integer, and it is: 20
To list all variables, functions, modules currently in the memory, try this:
#%whos # you may remove "#" and try this
The type of a variable can be converted:
my_float = 5.0 # by this assignment, my_float becomes a float
print(my_float)
my_float = float(6)
# create a float, by converting an integer, using float()
print(my_float)
print(int(7.0)) # float is converted to integer.
printx('int(7.0)')
5.0
6.0
7
int(7.0) = 7
To check the memory address of a variable, use
a = 1.0
print('a=',a, 'at memory address:',id(a))
a= 1.0 at memory address: 1847993237600
b = a
print('b=',b, 'at memory address: ',id(b))
b= 1.0 at memory address: 1847993237600
Notice that ‘b’ has the same address of ‘a’.
a, b = 2.0, 3.0
print('a=',a, 'at memory address: ',id(a))
print('b=',b, 'at memory address: ',id(b))
a= 2.0 at memory address: 1847974064064
b= 3.0 at memory address: 1847974063944
Notice the change in address when the value of a variable changes.
2.3.2 Underscore placeholder
n1=100000000000
n_1=100_000_000_000 # for easy reading
print('Yes, n1 is same as n_1') if n1==n_1 else print('No')
# Ternary if Statement
n2=1_000_000_000
print('Total=',n1+n2)
print('Total=',f'{n1+n2:,}') # f-string (Python3.6 or later)
total=n1+n2
print('Total=',f'{total:_}')
Yes, n1 is same as n_1
Total= 101000000000
Total= 101,000,000,000
Total= 101_000_000_000
2.3.3 Strings
Strings are bits of text, which are very useful in coding in generating labels
and file names for outputs. Strings can be defined with anything between
quotes. The quote can be either a pair of single quotes or a pair of double
quotes.
my_string = "How are you?"
# a string is defined; the characters in it can be indexed
print(my_string, my_string[0:3],my_string[5],my_string[10:])
my_string = 'Hello,' + " hello!" + " I am here."
# note "+" operator for strings is concatenation
print(my_string)
How are you? How r u?
Hello, hello! I am here.
Although both single and double quotes can be used, when there are apostrophes in a string one should use double quotes; otherwise, the apostrophes would terminate the string if single quotes are used, and vice versa. For example,
my_string = "Do not worry, just use double quotes to 'escape'."
print(my_string)
Do not worry, just use double quotes to 'escape'.
One may exchange the role of these two types of quotes:
my_string = 'Do not worry about "double quotes".'
print(my_string)
Do not worry about "double quotes".
One should refer to the Python documentation when needing to include things such as carriage returns, backslashes, and Unicode characters. Below are some more handy and clean operations applied to numbers and strings. You may try them out and get some experience.
one, two, three = 1, 2, 3 # Assign values to variables.
summation = one + two + three
print('summation=',summation) # printx('summation')
summation= 6
one, two, three = 1, 2, 3.0 # variable type can be mixed!
summation = one + two + three
print('Summation=',summation)
Summation= 6.0
one3 = two3 = three3 = 3 # Assign a same value to variables
print(one3, two3, three3)
3 3 3
More handy operations:
hello, world = "Hello,", "world!"
helloworld = hello + " " + world + "!!" # concatenate strings
print(helloworld, ' ', hello + " " + world)
lhw=len(helloworld) # length of the string, counting the space
# and the punctuations.
print('The length of the "helloword" is',lhw)
Hello, world!!! Hello, world!
The length of the "helloword" is 15
You can split the string to a list of strings, each of which is a word.
the_words = helloworld.split(" ") # creates a list of strings
# Similar operations on Lists later
print("Split the words of the string: %s" % the_words)
print('Joined together again with a space as separator:',
      ' '.join(the_words))
Split the words of the string: ['Hello,', 'world!!!']
Joined together again with a space as separator: Hello, world!!!
To find a letter (character) in a string, try this:
my_string = "Hello world!"
print('"o" is right after the',my_string.index("o"),\
'th letter.') # "\" is used to break a line
print('The first letter "l" is right after the', \
my_string.index("l"), 'nd letter.')
"o" is right after the 4 th letter.
The first letter "l" is right after the 2 nd letter.
Do not like the white-spaces between “4” and “th”, and between “2” and “nd”? Use string concatenation:
print('The position of the letter "o" is right after the ' +
str(my_string.index("o")) + 'th letter.')
# "+" concatenate
print('The 1st letter "l" is right after the ' +
str(my_string.index("l")) + 'nd letter.')
The position of the letter "o" is right after the 4th letter.
The 1st letter "l" is right after the 2nd letter.
You may need to find the frequency of each element in a list.
from collections import Counter # import Counter module.
my_list = ['a','a','b','b','b','c','d','d','d','d','d']
count = Counter(my_list) # Counter object is a dictionary
print(count) # of frequencies of each element in the list
# See also Dictionary later
Counter({'d': 5, 'b': 3, 'a': 2, 'c': 1})
print('The frequency of "b" is', count['b'])
# frequency of an element indexed by its key
The frequency of "b" is 3
Note Python (and many other programming languages) starts counting at 0
instead of 1.
We list below more operations that can be useful.
Conversion between uppercase and lowercase of a string
my_string = "Hello world!"
print(my_string.upper(),my_string.lower(),my_string.title())
# convert to uppercase and lowercase, respectively.
HELLO WORLD! hello world! Hello World!
• Reversion of a string using slicing (also see section on Lists).
my_string = "ABCDEFG"
reversed_string = my_string[::-1]
print(reversed_string)
GFEDCBA
The title() function of string class
my_string = "my name is professor g r liu"
new_string = my_string.title()
print(new_string)
My Name Is Professor G R Liu
Use of repetitions
n = 8
my_list = [0]*n
print(my_list)
[0, 0, 0, 0, 0, 0, 0, 0]
my_string = "abcdefg "
print(my_string*2) #concatenated n times and then print out
abcdefg abcdefg
lotsofhellos = "Hello " * 5 #concatenate 5 times
print(lotsofhellos)
Hello Hello Hello Hello Hello
my_list = [1,2,3,4,5]
print(my_list*2) #concatenate 2 times and then print out
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Length of a given argument using len()
print('The length of "ABCD" is', len('ABCD'))
The length of "ABCD" is 4
print('The length of',my_string,'is', len(my_string))
The length of abcdefg is 8
even_numbers, odd_numbers= [2,4,6,8], [1,3,5,7]
length = len(even_numbers) #get the length using len()
all_numbers = odd_numbers + even_numbers #concatenation
print(all_numbers,' The original length is',
length, '. The new length is',len(all_numbers))
[1, 3, 5, 7, 2, 4, 6, 8] The original length is 4 . The new
length is 8
Sort the elements using sorted()
print(sorted('BACD'),sorted('ABCD',reverse=True))
print(sorted(all_numbers), sorted(all_numbers,reverse=True))
['A', 'B', 'C', 'D'] ['D', 'C', 'B', 'A']
[1, 2, 3, 4, 5, 6, 7, 8] [8, 7, 6, 5, 4, 3, 2, 1]
Multiplying each element in a list by a same number
original_list, n = [1,2,3,4], 2
new_list = [n*x for x in original_list]
# list comprehension for element-wise operations
print(new_list)
[2, 4, 6, 8]
Generating index for a list using enumerate()
my_list = ['a', 'b', 'c']
for index, value in enumerate(my_list):
    print('{0}: {1}'.format(index+1, value))
1: a
2: b
3: c
for index, value in enumerate(my_list):   # generate indices
    print(f'{index+1}: {value}')           # f-string
1: a
2: b
3: c
Error exception tracks code while avoiding stop execution
a, b = 1, 2
try:
    print(a/b)                 # exception raised when b=0
except ZeroDivisionError:
    print("division by zero")
else:
    print("no exceptions raised")
finally:
    print("Regardless of what happened, run this always")
0.5
no exceptions raised
Regardless of what happened, run this always
Get the memory size in bytes
import sys #import sys module
num = "AAA"
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of string
The memory size is 52 bytes
num = 21099
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of integer
The memory size is 28 bytes
num = 21099.0
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of float
The memory size is 24 bytes
Check whether the string starts with or ends with something
astring = "Hello world!"
print(astring.startswith("Hello"),astring.endswith("asdf"),\
astring.endswith("!"))
True False True
The first one printed True, as the string starts with “Hello”. The second one
printed False, as the string certainly does not end with “asdf”. The third
printed True, as the string ends with “!”. Their boolean values are useful
when creating conditions. More such functions:
my_string="Hello World!"
my_string1="HelloWorld"
my_string2="HELLO WORLD!"
print (my_string.isalnum()) #check if all char are numbers
print (my_string1.isalpha()) #check if all char are alphabetic
print (my_string2.isupper()) #test if string is upper case
False
True
True
my_string3, my_string4, my_string5="hello world!"," ", "8888a"
print (my_string3.istitle()) #test if contains title words
print (my_string3.islower()) #test if string is lower case
print (my_string4.isspace()) #test if string is spaces
print (my_string5.isdigit()) #test if string is digits
False
True
True
False
Checking the type and other attributes of variables
n, x, s = 8888,8.0, 'string'
print (type(n), type(x), type(s)) # check the type of an object
print (len(s),len(str(n)),len(str(x)))
<class 'int'> <class 'float'> <class 'str'>
6 4 3
2.3.4 Conversion between types of variables
When one of the variables in an operation with integers is a floating point number, the result becomes a floating point number.
a = 2
print('a=',a, ' type of a:',type(a))
b = 3.0; a = a + b
print('a=',a, ' type of a:',type(a))
print('b=',b, ' type of b:',type(b))
a= 2 type of a: <class 'int'>
a= 5.0 type of a: <class 'float'>
b= 3.0 type of b: <class 'float'>
The type of a variable can be converted to other types.
n, x, s = 8888,8.5, 'string'
sfn = str(n) #integer to string
print(sfn,type(sfn))
sfx = str(x) #float to string
print(sfx,type(sfx))
8888 <class 'str'>
8.5 <class 'str'>
xfn = float(n) #integer to float
print(xfn,type(xfn))
nfx = int(x) #float to integer
print(nfx,type(nfx))
8888.0 <class 'float'>
8 <class 'int'>
#a = int('Hello') # string to integer: produces ValueError
#a = int('8.5') # string to integer: produces ValueError
a = int('85') # works,'85' is converted to an integer
print(a,type(a))
85 <class 'int'>
a = float('85') # how about this one?
print(a,type(a))
85.0 <class 'float'>
8.0 + float("8.0") #try this
16.0
a = int(False) # check this out
print(a,type(a))
0 <class 'int'>
However, operators with mixed numbers and strings are not permitted, and
it triggers a TypeError:
my_mix = my_float + my_string
----------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-67-8c4f7138852e> in <module>
----> 1 my_mix = my_float + my_string
TypeError: unsupported operand type(s) for +: 'float' and 'str'
2.3.5 Variable formatting
Formatting is very useful in printing out variables. Python uses C-style string
formatting to create new, formatted strings. The “%” operator is used to
format a set of variables enclosed in a “tuple”, which is a fixed size list (to
be discussed later). It produces a normal text at a location given by one
of the “argument specifiers” like “%s”, “%d”, “%f”, etc. The best way to
understand this is through examples.
name = "John"
print("Hello, %s! how are you?" % name) # %s is for string
# for two or more argument specifiers, use a tuple:
name, age = "Kevin", 23
print("%s is %d years old." % (name,age)) # %d is for digit
Hello, John! how are you?
Kevin is 23 years old.
print(f"{name} is {age} years old.")
# f-string Python 3.6 or later.
Kevin is 23 years old.
Any object that is not a string (a list for example) can also be formatted
using the %s operator. The %s operator formats the object as a string using
the “str” method and returns it. For example:
list1,list2,x=[1,2,3],['John','Ian'],21.5 # multi- assignment
print("List1:%s; List2:%s\n x=%s,x=%f,x=%.3f,x=%e" \
% (list1, list2, x, x, x, x))
List1:[1, 2, 3]; List2:['John', 'Ian']
x=21.5,x=21.500000,x=21.500,x=2.150000e+01
print(f"List1:{list1};List2:{list2};x={x},x={x:.2f}, x={x:.3e}")
# powerful f-string
List1:[1, 2, 3];List2:['John', 'Ian'];x=21.5,x=21.50, x=2.150e+01
Often used formatting argument specifiers (if not using f-string):
• %s - String (or any object with a string representation, like numbers).
• %d - Integers.
• %f - Floating point numbers.
• %.f - Floating point numbers with a fixed number-of-digits to the right of
the dot.
• %e - scientific notation: a float multiplied by the specified power of 10.
• %x/%X - Integers in hex representation (lowercase/uppercase).
2.4 Arithmetic Operators
Addition, subtraction, multiplication, and division operators can be used
with numbers.
2.4.1 Addition, subtraction, multiplication, division, and power
+   −   ∗   /   // (floor division)   % (remainder of integer division, or modulo)   ∗∗ (power)
number = 1 + 2 * 3 / 4.0
print(number)
2.5
The modulo (%) operator returns the integer remainder of the division:
dividend % divisor = remainder.
numerator, denominator = 11, 2
floor = numerator // denominator #floor division
print(str(numerator)+'//'+str(denominator)+ '=', floor)
remainder = numerator % denominator
print(str(numerator) +'%'+ str(denominator) +'=', remainder)
print(floor*denominator + remainder)
11//2= 5
11%2= 1
11
Using two multiplication symbols makes a power relationship.
squared, cubed = 7 ** 2, 2 ** 3
print('7 ** 2 =', squared, ', and 2 ** 3 =',cubed)
7 ** 2 = 49 , and 2 ** 3 = 8
bwlg_XOR = 7^2
print(bwlg_XOR) # ^ is the XOR (bitwise logic gate) operator, not power!
5
Python allows simple swap operation between two variables.
a, b = 100, 200
print('a=',a,'b=',b)
a, b = b, a # swapping without using a "mid-man"
print('a=',a,'b=',b)
a= 100 b= 200
a= 200 b= 100
2.4.2 Built-in functions
Python provides a number of built-in functions and types that are always available. For a quick glance, see the short example below, or find more details at https://docs.python.org/3/library/functions.html.
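For example (a few of these built-in functions in action, on made-up values):

numbers = [3, 1, 4, 1, 5, 9]
print(len(numbers), max(numbers), min(numbers), sum(numbers))   # 6 9 1 23
print(abs(-2.5), round(3.14159, 2), sorted(numbers))            # 2.5 3.14 [1, 1, 3, 4, 5, 9]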
#help(all) # to find out what a builtin function does
2.5 Boolean Values and Operators
Boolean values are two constant objects: True and False. When used as an argument to an arithmetic operator, they behave like the integers 1 and 0, respectively. The built-in function bool() can be used to cast any value to a Boolean. The definitions are as follows:
print(bool(5),bool(-5),bool(0.2),bool(-0.1),bool(str('a')),
bool(str('0')))
# True True True True True True
print(bool(0),bool(-0.0)) # These are all zero
# False False
print(bool(''),bool([]),bool({}),bool(())) # all empty (0)
# False False False False
True True True True True True
False False
False False False False
bool() returns False only if the value is zero or the container is empty; otherwise, it returns True. Note that str('0') is neither zero nor empty.
Boolean operators include “and” and “or”.
print(True and True, False or True, True or True,)
# True True True
print(False and False, False and True)
# False False
True True True
False False
2.6 Lists: A diversified variable type container
We have already seen lists a few times. This section gives more details. A list is a collection of variables, and it is very similar to an array (see the Numpy Array
section for more details). A list may contain any type of variables, and as
many variables as one likes. These variables are held in a pair of square
brackets [ ]. Lists can be iterated over for operations when needed. It is
one of the “iterables”. Let us look at the following examples.
2.6.1 List creation, appending, concatenation, and updating
x_list = [] # Use [] to define a placeholder for x_list.
# It is empty but with an address assigned.
print('x_list=',x_list)
print(hex(id(x_list))) # memory address in hexadecimal
x_list= []
0x1ae44e9f548
x_list.append(1) # 1 is appended as the 0th member in this list
x_list.append(2) # 2 is appended as the 1st member
x_list.append(3.) # Variable type changed!
print(x_list[0]) # prints 1, the 0th element ...
print(x_list[1]) # prints 2
print(x_list[2]) # prints 3
print(x_list) # print all in the list
1
2
3.0
[1, 2, 3.0]
for x in x_list:     # prints out 1,2,3.0 in an iteration
    print(x, end=',')
print('\n')
x_list2 = x_list*2
# concatenation of 2 x_list (not element-wise multiplication!);
# this creates an independent new x_list2
print(x_list2)
1,2,3.0,
[1, 2, 3.0, 1, 2, 3.0]
print(id(x_list),id(x_list2)) # addresses are different
1847992120648 1847993139592
id(x_list[1]) # Again, print() function is not needed
# because this is the last line in the cell
1594536160
x_list3 = x_list # assignment is a "pointer" to x_list3
print(x_list,' ',x_list3,)
[1, 2, 3.0] [1, 2, 3.0]
print(id(x_list),id(x_list3)) # They share the same address
1847992120648 1847992120648
x_list4 = x_list.copy() # copy() function creates x_list4
# it is a new independent list
print(x_list,' ',x_list4)
[1, 2, 3.0] [1, 2, 3.0]
print(id(x_list),id(x_list4)) # x_list4 has its own address
1847992120648 1847993186760
x_list[0] = 4.0 # Assign the 0th element a new value
print(x_list)
[4.0, 2, 3.0]
print(x_list3,' ',x_list4) # x_list3 is changed with x_list,
# because assignment creates a "pointer". x_list4 is not
# changed, because it was created using copy() function.
[4.0, 2, 3.0] [1, 2, 3.0]
print(x_list2) # Changes to x_list have no effect
[1, 2, 3.0, 1, 2, 3.0]
Creating a list by unpacking a string of digits:
num = 19345678
list_of_digits=list(map(int, str(num))) #list iterable
print(list_of_digits)
list_of_digits=[int(x) for x in str(num)] #list comprehension
print(list_of_digits)
[1, 9, 3, 4, 5, 6, 7, 8]
[1, 9, 3, 4, 5, 6, 7, 8]
2.6.2 Element-wise addition of lists
Element-wise addition of lists needs a little trick. The best ways, including
the use of numpy arrays, will be discussed in the list comprehension section.
Here, we use a primitive method to achieve this.
list1, list2= [20, 30, 40], [5, 6, 8]
print (list1, ' ', list2, ' ', list1+list2)
print ("Original list 1: " + str(list1))
print ("Original list 2: " + str(list2))
print('"+" is not addition, it is concatenation:', list1+list2)
[20, 30, 40] [5, 6, 8] [20, 30, 40, 5, 6, 8]
Original list 1: [20, 30, 40]
Original list 2: [5, 6, 8]
"+" is not addition, it is concatenation: [20, 30, 40, 5, 6, 8]
# We shall use a for-loop to achieve element-wise addition:
add_list = []
for i in range(0, len(list1)):   # for-loop to add up one-by-one!
    add_list.append(list1[i] + list2[i])
print("Element-wise addition of 2 lists: " + str(add_list))
Element-wise addition of 2 lists: [25, 36, 48]
id(add_list[0]) # check the address of the list
1594536896
id(list1[0])
1594536736
add_list = []
for i1, i2 in zip(list1, list2):   # for-loop and zip() to add it up
    add_list.append(i1 + i2)
print("The element-wise addition of 2 lists: ", add_list)
The element-wise addition of 2 lists: [25, 36, 48]
2.6.3 Slicing strings and lists
Slicing is a useful and efficient operation to manipulate parts of a string, list,
or array (to be discussed later). Our discussion starts from slicing strings,
and then lists.
my_string = "Hello world!"
#123456789TET # conventional order 1-12 (T = ten, E = eleven, T = twelve)
print('0123456789TE') # Python ordering 0-11 (T = ten, E = eleven)
print (my_string)
print('5th=',my_string[4]) # take the 5th character
print('7-11th=',my_string[6:11]) # 7th to 11th
0123456789TE
Hello world!
5th= o
7-11th= world
print('[6:-1]=',my_string[6:-1])
# "-1" for the last slice from the 6th to (last-1)th
print('[:]=',my_string[:]) # all characters in the string
print('[6:]=',my_string[6:]) # slice from 7th to the end
print('[:-1]=',my_string[:-1]) # to (last-1)th
[6:-1]= world
[:]= Hello world!
[6:]= world!
[:-1]= Hello world
my_string = "Hello world!"
#123456789TET # conventional order
print('[3:9:2]=',my_string[3:9:2]) # 4th to 9th step 2
# Syntax:[start:stop:step]
my_string = "Hello world!"
[3:9:2]= l o
Using a negative step, we can easily reverse a string, as we have seen earlier:
my_string = "Hello world!"
print('string:',my_string)
print('[::-1]=',my_string[::-1]) # all but from the last
string: Hello world!
[::-1]= !dlrow olleH
In summary, if we have just one number in the brackets, the slicing takes the
character at the (number+1)th position, because Python counts from zero. A colon
stands for all available: if used alone, the slice is the entire string; with a number
on its left, the slice runs from that position to the right end; with a number on its
right, it runs from the start up to (but excluding) that position. A negative number
counts from the right end: −3 means “the 3rd character from the right end”. One
can also use the step option for skipping.
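As a quick check of these rules, the following minimal sketch (not from the text) exercises negative indices and the step option on the same string:

my_string = "Hello world!"
print(my_string[-3]) # 'l', the 3rd character from the right end
print(my_string[-6:]) # 'world!', the last 6 characters
print(my_string[::2]) # 'Hlowrd', every 2nd character from the start
print(my_string[-1::-2]) # '!lo le', every 2nd character, starting from the last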
Note that when accessing a string with an index which does not exist, it
generates an exception of IndexError.
print('[14]=',my_string[14]) # index out of range error
-----------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-102-024f17c69a4f> in <module>
----> 1 print('[14]=',my_string[14]) # gives an index out of range error
IndexError: string index out of range
print('[14:]=',my_string[14:]) # This will not give an
# error, but gives nothing: nothing can be sliced
[14:]=
The very similar rules detailed above for strings apply also to a list, by
treating a variable in the list as a character.
# Create my_list that contains mixed type variables:
list2=[]
my_list=[0, 1, 2, 3,'4E', 5, 6, 7,[8,8], 9]
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 -> indices
# 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 -> Python indices
#-10 -9,-8,-7,-6, -5,-4,-3, -2, -1 -> Python reverse indices
print(my_list[0:10:1]) # [start:stop:step] the stop index is exclusive
print(my_list[:]) # A colon alone stands for all elements
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
print(my_list[0:]) # from (0+1)st to the end
print(my_list[1:]) # from (1+1)th to the end
print(my_list[8:]) # from the (8+1)th to the end
print(my_list[8:9]) # Gives a list in list
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[[8, 8], 9]
[[8, 8]]
print(my_list[:]) # A colon alone stands for all elements
print(my_list[:3]) # 0,1,2,
print(my_list[:1]) # from 1st to 1st: 0
print(my_list[:0]) # from 1st to 0th: empty []
print(my_list[-1]) # reads out the last: 9
print(my_list[-1:]) # Slices out the last:
print(my_list[::-1]) # reverse the list
[0, 1, 2, 3, '4E', 5, 6, 7, [8, 8], 9]
[0, 1, 2]
[0]
[]
9
[9]
[9, [8, 8], 7, 6, 5, '4E', 3, 2, 1, 0]
When accessing a list with an index that does not exist, it generates an
exception of IndexError.
#print(my_list[11]) # will give an index out of range error
print(my_list[10:]) # from the (10+1)th to the end: no more
# there, not out of range, but an empty list
[]
2.6.4 Underscore placeholders for lists
nlist = [10, 20, 30, 40, 50, 6.0, '7H'] # Mixed variables
_, _, n3,_,*nn = nlist # when only the 3rd is needed,
# skip one, and then the rest
print ('n3=',n3, 'the rest numbers',*nn)
nlist = [10, 20, 30, 40, 50, 60, 70]
_, _, n3, *nn, nlast = nlist # The 3rd, last and the rest
# in between are needed
print('n3=',n3,', the last=', nlast,', and all the rest numbers',*nn)
n3= 30 the rest numbers 50 6.0 7H
n3= 30 , the last= 70 , and all the rest numbers 40 50 60
2.6.5 Nested list (lists in lists in lists)
nested_list = [[11, 12], ['2B',22], [31, [32,3.2]]]
# A nested list of mixed types of variables
printx('nested_list')
print(len(nested_list)) #number of sub-lists in nested_list
nested_list = [[11, 12], ['2B', 22], [31, [32, 3.2]]]
3
print(nested_list[0]) #1st sub-list in the nested_list
print(nested_list[1]) #2nd sub-list in the nested_list
print(nested_list[2]) #3rd sub-list in the nested_list
print(nested_list) #print all for easy viewing
[11, 12]
['2B', 22]
[31, [32, 3.2]]
[[11, 12], ['2B', 22], [31, [32, 3.2]]]
print(nested_list[0][0]) #1st element in 1st sub-list
print(nested_list[0][1]) #2nd element in 1st sub-list
11
12
print(nested_list)
print(nested_list[1][0]) # Try this: what would this be?
print(nested_list[2][1]) #?
print(nested_list[2][1][0]) #?
[[11, 12], ['2B', 22], [31, [32, 3.2]]]
2B
[32, 3.2]
32
2.7 Tuples: Value preserved
After the discussion about the List, discussing Tuples becomes straightforward,
because they are essentially the same. The major differences are as follows:
• A Tuple is usually enclosed with (), but a List is with [].
• A Tuple is immutable, but a List is mutable. This means that tuples
cannot be changed after they are created. Values in Tuples are preserved.
Because a Tuple is immutable, it is used to store data that must be preserved,
such as constants that should not be accidentally changed; its use is therefore more
limited than that of a List. Operating on Tuples is also somewhat faster.
Apart from these differences, a Tuple behaves like a List. It can be accessed
via index, iterated over, and assigned to other variables. Below are some
examples.
ttuple = (10, 20, 30, 40, 50, 6.0, '7H') # create a Tuple
gr.printx('ttuple') # print(ttuple)
aa = ttuple[0]
print('aa=',aa)
print(ttuple[1], ' ',ttuple[6],' ',ttuple[-1])
ttuple = (10, 20, 30, 40, 50, 6.0, '7H')
aa= 10
20 7H 7H
for i, data in enumerate(ttuple):
# use enumerate function to get both index and content
if i < 3:
print(i, ':', data)
0 : 10
1 : 20
2 : 30
# ttuple[2] = 300 # this gives an error
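If one does attempt such an assignment, Python raises a TypeError. The following minimal sketch (not from the text) catches the error explicitly:

try:
    ttuple[2] = 300 # attempt to modify an immutable Tuple
except TypeError as err:
    print('TypeError:', err) # 'tuple' object does not support item assignment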
The above may be all we need to know about Tuples. We now discuss
another useful data structure in Python.
2.8 Dictionaries: Indexable via keys
A dictionary is a data type similar to a list. It contains paired keys and
values. The key is typically a string (any immutable object can serve as a key) and
is used for indexing. The value can be any type of object: a string, a number, a
list, etc. Because keys and values are paired, each value stored in a dictionary can
be accessed using the corresponding key. A dictionary does not contain any
duplicated keys, but the values may be duplicated.
2.8.1 Assigning data to a dictionary
For example, phone numbers can be assigned to a dictionary in the following
format:
phonebook1 = {} # placeholder for a dictionary
phonebook1["Kevin"] = 513476565
phonebook1["Richard"] = 513387234
phonebook1["Jim"] = 513682762
phonebook1["Mark"] = 513387234 # A duplicated value
gr.printx('phonebook1')
print(phonebook1)
phonebook1 = {'Kevin': 513476565, 'Richard': 513387234,
'Jim': 513682762, 'Mark': 513387234}
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513387234}
A dictionary can also be initialized in the following way:
phonebook2 = {'Joanne': 656477456, 'Yao': 656377243,
'Das': 656662798}
print(phonebook2)
{'Joanne': 656477456, 'Yao': 656377243, 'Das': 656662798}
phonebook0 = {
"John" : [788567837,788347278],
# this is a list, John has 2 phonenumbers
'Mark': 513683222,
'Joanne': 656477456
}
print(phonebook0)
{'John': [788567837, 788347278], 'Mark': 513683222, 'Joanne':
656477456}
2.8.2 Iterating over a dictionary
Like lists, dictionaries can be iterated over. Because keys and values are
recorded in pairs, we may use for-loop to access them.
for name, number in phonebook1.items():
print("Phone number of %s is %d" % (name, number))
Phone number of Kevin is 513476565
Phone number of Richard is 513387234
Phone number of Jim is 513682762
Phone number of Mark is 513387234
for key, value in phonebook1.items():
print(key, value)
Kevin 513476565
Richard 513387234
Jim 513682762
Mark 513387234
for key in phonebook1.keys():
print(key)
Kevin
Richard
Jim
Mark
for value in phonebook1.values():
print(value)
513476565
513387234
513682762
513387234
2.8.3 Removing a value
To delete a key-value pair, we use the del statement or the pop() method, together
with the key.
phonebook2 = {'Joanne': 656477456, 'Yao': 656377243,
'Das': 656662798}
del phonebook2["Yao"]
print(phonebook2)
value = phonebook2.pop("Das")
print(phonebook2)
print('value for popped out key',value)
{'Joanne': 656477456, 'Das': 656662798}
{'Joanne': 656477456}
value for popped out key 656662798
2.8.4 Merging two dictionaries
First, use the update() method.
phonebook1.update(phonebook2) # phonebook1 is updated
print(phonebook1)
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513387234, 'Joanne': 656477456}
Next, use a simpler means: double-star (**) unpacking. This allows one to create a
third, new dictionary that is a combination of the two dictionaries, without affecting
the original two.
phonebook3 = {**phonebook1, **phonebook0}
print(phonebook3)
{'Kevin': 513476565, 'Richard': 513387234, 'Jim': 513682762,
'Mark': 513683222,
'Joanne': 656477456, 'John': [788567837, 788347278]}
If the two dictionaries share a key, the new dictionary keeps a single entry, whose value comes from the dictionary unpacked later:
dict_1 = {'Apple': 7, 'Banana': 5}
dict_2 = {'Banana': 3, 'Orange': 4} #'Banana' is in dict_1
combined_dict = {**dict_1, **dict_2}
#'Banana' in dict_1 will be replaced in the new dictionary
print(combined_dict)
{'Apple': 7, 'Banana': 3, 'Orange': 4}
2.9 Numpy Arrays: Handy for scientific computation
Numpy arrays are similar to Lists, and much easier to work with for scientific
computations. Operations on Numpy arrays are usually much faster for bulky
data.
2.9.1 Lists vs. Numpy arrays
(1) Similarities:
• Both are mutable (the elements there can be added and removed after
the creation. A mutating operation is also called “destructive”, because it
modifies the list/array in place instead of returning a new one).
• Both can be indexed.
• Both can be sliced.
(2) Differences:
• To use arrays, one needs to import the Numpy module, but lists are built-in.
• Arrays support element-wise operations, but lists do not (some extra coding is
needed).
• Data types in an array must be the same, but a list can hold different types of
data (part of the reason why element-wise operations are not generally supported
for lists).
• Numpy arrays can be multi-dimensional.
• Operations with arrays are, in general, much faster than those on lists.
• Storing arrays uses less memory than storing lists.
• Numpy arrays are more convenient to use for mathematical operations
and machine learning algorithms.
2.9.2 Structure of a numpy array
We first briefly describe the structure of a numpy array in comparison with the lists
discussed earlier. To start the discussion, we import the numpy package.
import numpy as np # Import numpy & give it an alias np
#dir(np) #try this (remove #)
x1 = np.array([28, 3, 28, 0]) # a one-dimensional (1D) numpy array
print('x1=',x1) # A numpy array looks like a list.
gr.printx('x1') # This specifies that it is an array.
x1 = [28 3 28 0]
x1 = array([28, 3, 28, 0])
As shown above, a numpy array is “framed” in a pair of square brackets
(same as a list).
x2 = np.array([[51,22.0],[0,0],(18+9j,3.)]) # mixed types
print('x2=',x2) # All become complex-valued
x2= [[51.+0.j 22.+0.j]
[ 0.+0.j 0.+0.j]
[18.+9.j 3.+0.j]]
This is a 2D numpy array. It is framed in a double pair of square brackets.
A list does not have multi-dimensionality, except in the form of nesting: lists
in lists.
We can also create numpy arrays from lists. In the following, we first create
two lists, and then create Numpy arrays from them:
list_w = [57.5, 64.3, 71.6, 68.2] # list, peoples' weights (Kg)
list_h = [1.5, 1.6, 1.7, 1.65] # list heights (m)
print('list_w=',list_w, '; list_h= ',list_h)
list w= [57.5, 64.3, 71.6, 68.2] ; list h= [1.5, 1.6, 1.7, 1.65]
narray_w = np.array(list_w) # convert list to numpy array
narray_h = np.array(list_h)
print('narray_w=',narray_w, '; narray_h= ',narray_h)
narray w= [57.5 64.3 71.6 68.2] ; narray h= [1.5 1.6 1.7 1.65]
Let us create a function that prints out the information of a given numpy array.
def getArrayInfo(a):
'''Get the information about a given array:
getArrayInfo(array)'''
print('elements of the first axis of the array:',a[0])
print('type:',type(a))
print('number of dimensions, a.ndim:', a.ndim)
print('a.shape:', a.shape)
print('number of elements, a.size:', a.size)
print('a.dtype:', a.dtype)
print('memory address',a.data)
help(getArrayInfo) # may try this
Help on function getArrayInfo in module __main__:
getArrayInfo(a)
Get the information about a given array: getArrayInfo(array)
We see here that the docstring given in ''' ''' is useful for providing a simple instruction on the use of a created
function. Let us now use it to get the information for narray_w.
getArrayInfo(narray_w)
elements of the first axis of the array: 57.5
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 1
a.shape: (4,)
number of elements, a.size: 4
a.dtype: float64
memory address <memory at 0x000001AE7D549408>
We note that narray_w is a 1D array and has a shape of (4,), meaning it
has four entries. The shape of a numpy array is given as a tuple.
Slicing also works for a numpy array, in a similar way as for lists. Let us take
a slice of an array.
print(list_w[1:3]) # a slice between the 2nd and 3rd elements
[64.3, 71.6]
print(narray_w[1:3]) # a slice between the 2nd and 3rd elements
[64.3 71.6]
Let us now append an element to both the list and the numpy array.
# For lists, we use:
list_w.append(59.8)
print(list_w)
# For numpy array we shall use:
print(np.append(narray_w,59.8))
[57.5, 64.3, 71.6, 68.2, 59.8]
[57.5 64.3 71.6 68.2 59.8]
print(list_w,' ',narray_w)
print(type(list_w),' ',type(narray_w))
print(len(list_w),' ', narray_w.ndim) # Use len() to get the length
[57.5, 64.3, 71.6, 68.2, 59.8] [57.5 64.3 71.6 68.2]
<class 'list'> <class 'numpy.ndarray'>
5 1
nwh = (narray_w,narray_h) # This forms a tuple of np arrays
print(nwh)
(array([57.5, 64.3, 71.6, 68.2]), array([1.5 , 1.6 , 1.7 , 1.65]))
To form a multi-dimensional array, we may use the following (more on this later):
arr = np.array([narray_w,narray_h])
arr
array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
getArrayInfo(arr)
elements of the first axis of the array: [57.5 64.3 71.6 68.2]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (2, 4)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
We note that arr is of dimension 2, and has a shape of (2, 4), meaning it has
two entries along axis 0 and 4 entries along axis 1. We see again that the shape
of a numpy array is given in a tuple. A multi-dimensional numpy array can be
transposed:
arrT = arr.T
print(arrT)
getArrayInfo(arrT) # see the change in shape from (2,4) to (4,2)
[[57.5 1.5 ]
[64.3 1.6 ]
[71.6 1.7 ]
[68.2 1.65]]
elements of the first axis of the array: [57.5 1.5]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (4, 2)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
It is seen that the dimension remains 2, but the shape is changed from (2,4) to
(4,2). The value of an entry in a numpy array can be changed.
arr = np.array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrb = arr
printx('arrb')
arr[0,0]= 888.0 # change is done to arr only
printx('arrb')
arrb = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrb = array([[888. , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
Notice the behavior of an array created via assignment: changes to one array
affect the other. The same behavior was observed for lists. To create an independent
array, use the copy() function.
arr = np.array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrc = arr.copy() # This is expensive. Do it only when it is necessary
printx('arrc')
arr[0,0]= 77.0
printx('arr')
printx('arrc')
arrc = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arr = array([[77. , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
arrc = array([[57.5 , 64.3 , 71.6 , 68.2 ],
[ 1.5 , 1.6 , 1.7 , 1.65]])
2.9.3 Axis of a numpy array
Axis is an important concept for numpy array operations. A 1D array has axis 0, a
2D array has two axes, 0 and 1, and so on. The definition is given in Fig. 2.1.
Multidimensional numpy array structure and axes:
We can now use an axis to stack up arrays to form new arrays, as follows:
arr=np.stack([narray_w,narray_h],axis=0)
#stack up 1D arrays along axis 0
print(arr)
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
We can use np.ravel to flatten an array.
print(arr)
rarr=np.ravel(arr)
print(rarr)
getArrayInfo(rarr)
Figure 2.1: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
[57.5 64.3 71.6 68.2 1.5 1.6 1.7 1.65]
elements of the first axis of the array: 57.5
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 1
a.shape: (8,)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE7D5494C8>
It is seen that the dimension is changed from 2 to 1, and the shape is changed
from (2,4) to (8,).
In machine learning computations, we often perform summation of entries of an
array along an axis of the array. This can be done easily using the np.sum function.
print(arr)
print('Column-sum:',np.sum(arr,axis=0),np.sum(arr,axis=0).shape)
print('row-sum:',np.sum(arr,axis=1),np.sum(arr,axis=1).shape)
[[57.5 64.3 71.6 68.2 ]
[ 1.5 1.6 1.7 1.65]]
Column-sum: [59. 65.9 73.3 69.85] (4,)
row-sum: [261.6 6.45] (2,)
Notice that the dimension of the summed array is reduced.
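If the collapsed axis needs to be kept (for example, to keep shapes compatible for later broadcasting), np.sum accepts a keepdims argument. A minimal sketch, not from the text:

print(np.sum(arr,axis=0,keepdims=True).shape) # (1, 4): the summed axis is kept
print(np.sum(arr,axis=1,keepdims=True).shape) # (2, 1)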
2.9.4 Element-wise computations
Element-wise computation using Numpy arrays is very handy, and quite different
from the situation for lists.
print('listwh=',list_w+list_h) # + is a concatenation for lists.
listwh= [57.5, 64.3, 71.6, 68.2, 59.8, 1.5, 1.6, 1.7, 1.65]
print('narraywh=',narray_w+narray_h)
# + is element-wise addition for numpy arrays.
narraywh= [59. 65.9 73.3 69.85]
Let us compute the weights in pounds, using 1kg = 2.20462 lbs.
print(narray_w * 2.20462) # element-wise multiplication
[126.76565 141.757066 157.850792 150.355084]
Let us compute the Body Mass Index or BMI using these narrays.
bmi = narray_w / narray_h ** 2 # formula to compute the BMI
print(bmi)
[25.55555556 25.1171875 24.77508651 25.05050505]
This includes element-wise power operation and division as well.
# lbmi = list_w / list_h ** 2 # would this work? Try it!
We discussed element-wise operations for lists earlier, using for-loops and special
functions such as zip() (list comprehensions are discussed in Section 2.11). The alternative is the
“numpy-way”: first convert the lists to numpy arrays, then
perform the operations in numpy on these arrays, and finally convert the
results back to a list. When the lists are large in size, this numpy-way can be much
faster, because all these operations can be performed in bulk in numpy, without
element-by-element memory access.
import numpy as np
list1 = [20, 30, 40, 50, 60]
list2 = [4, 5, 6, 2, 8]
(np.array(list1) + np.array(list2)).tolist()
[24, 35, 46, 52, 68]
The results are the same as those we obtained before using special list element-
wise operations.
2.9.5 Handy ways to generate multi-dimensional arrays
In machine learning and mathematical computations in general, multi-dimensional
arrays are frequently used, because one has to deal with big data frequently. Numpy
supports the necessary functions (tools) to generate, manipulate, and operate multi-
dimensional arrays.
np.arange(2, 8, 0.5, dtype=np.float64) # equally spaced values
array([2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. , 7.5])
np.linspace(1., 4., 6)
# an array with a specified number of equally spaced elements
array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])
a = np.array([1.,2.,3.])
a.fill(9.9) # all entries with the same value
print(a)
[9.9 9.9 9.9]
x = np.empty((3, 4)) # shape (dimension) of (3,4) specified,
# without initializing entries
print(x)
[[2. 2.5 3. 3.5]
[4. 4.5 5. 5.5]
[6. 6.5 7. 7.5]]
x = np.zeros((6,6)) # initialized with zero
x
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
y = np.ones((2,2))*2 # a 2 by 2 array with 2.0 in all the entries
print(y)
[[2. 2.]
[2. 2.]]
x[3:5,3:5] += y # Assign y to a sliced portion in x
print(x)
[[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 2. 2. 0.]
[0. 0. 0. 2. 2. 0.]
[0. 0. 0. 0. 0. 0.]]
2.9.6 Use of external package: MXNet
For computations in machine learning, a number of useful packages/modules/
libraries have been developed for creating large-scale machine learning models.
MXNet is one of those. We will also make use of it in this book. Because it is
an external package, MXNet needs to be installed in your computer system using
pip.
pip install mxnet
Note that if an error message like “No module named ‘xyz’ ” is encountered,
which is likely when running the codes in this book, one should install the
“xyz” module in a similar way, so that all the functions and variables
defined there can be made use of. Note also that there is a huge number of
modules/libraries/packages openly available, and it is not possible to install all of
them. The practical way is to install a package only when it is needed. One may encounter
issues during installation, many of which are related to compatibility of the versions of the
involved packages. Searching online for help can often resolve these issues, because
the chance is high that someone has already encountered a similar issue earlier, and
the huge online community has already provided a solution.
After the mxnet module is installed, we import it into our code.
import mxnet as mx
mx.__version__
'1.7.0'
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
from mxnet import nd # import Mxnet library/package
# see https://gluon.mxnet.io for details
x = nd.empty((3, 4)) # Try also x = nd.ones((3, 4))
print(x)
print(x.size) # print the size (number of elements) of a
# multi-dimensional mxnet or nd.array
[[ 0.000 0.000 0.000 0.000]
[ 0.000 0.000 0.000 0.000]
[ 0.000 0.000 0.000 0.000]]
<NDArray 3x4 @cpu(0)>
12
We often create arrays with randomly sampled values when working with neural
networks. In such cases, we initialize an array using a standard normal distribution
with zero mean and unit variance. For example,
y = nd.random_normal(0, 1, shape=(3, 4)) # 0 mean, variance 1
printx('y')
y
y =
[[-0.681 -0.135 0.377 0.410]
[ 0.571 -2.758 1.076 -0.614]
[ 1.831 -1.147 0.054 -2.507]]
<NDArray 3x4 @cpu(0)>
[[-0.681 -0.135 0.377 0.410]
[ 0.571 -2.758 1.076 -0.614]
[ 1.831 -1.147 0.054 -2.507]]
<NDArray 3x4 @cpu(0)>
Element-wise operations of addition, multiplication, and exponentiation all work
for multi-dimensional arrays.
x=nd.exp(y)
x
[[ 0.506 0.873 1.458 1.507]
[ 1.771 0.063 2.934 0.541]
[ 6.239 0.318 1.055 0.081]]
<NDArray 3x4 @cpu(0)>
Often used matrix (2D array) transpose can be obtained as follows:
y.T
[[-0.681 0.571 1.831]
[-0.135 -2.758 -1.147]
[ 0.377 1.076 0.054]
[ 0.410 -0.614 -2.507]]
<NDArray 4x3 @cpu(0)>
Now we can multiply matrices with compatible dimensions, using the dot-product
available in both numpy and MXNet:
nd.dot(x, y.T) # x: 3×4; y.T: 4×3
[[ 0.705 -1.476 -3.776]
[ 0.114 3.662 1.970]
[-3.860 3.774 10.910]]
<NDArray 3×3 @cpu(0)>
Note that nd arrays behave differently from np arrays. They do not usually
work together without proper conversion. Therefore, special care is needed. When
strange behavior is observed, one may print out the variable to check the array type.
The same is generally true when numpy arrays work with arrays in other external
modules, because the array objects are, in general, different from one module to
another. One may use asnumpy() to convert an nd-array to an np-array, when so
desired. Given below is an example (more on this later):
np.dot(x.asnumpy(), y.T.asnumpy())
# convert nd array to np array and then use numpy np.dot()
array([[ 0.705, -1.476, -3.776],
[ 0.114, 3.662, 1.970],
[-3.860, 3.774, 10.910]], dtype=float32)
2.9.7 In-place operations
In machine learning, we frequently deal with big data. To avoid expensive and
complicated data-moving operations, we prefer in-place operations. Let us first take
a look at the following computations and the locations of the data:
print('id(y) before operation:', id(y))
y = y + x # x and y must be shape compatible
print('id(y) after operation:', id(y))# location of y changes
id(y) before operation: 1847926752760
id(y) after operation: 1847462282128
For in-place operations, we do this:
print('id(y) before operation:', id(y))
y[:] = x + y
# addition first, put it in a temporary buffer, then copy to y[:]
print('id(y) after operation:', id(y)) # memory of y remains the same
id(y) before operation: 1847462282128
id(y) after operation: 1847462282128
To perform an in-place addition without using even a temporary buffer, we do
this in MXNet:
print('id(y) before operation:', id(y))
print(nd.elemwise_add(x, y, out=y)) # for mxnet nd.arrays
print('id(y) after operation:', id(y))
# memory location un-changed
id(y) before operation: 1847462282128
[[ 0.837 2.485 4.752 4.931]
[ 5.883 -2.568 9.878 1.009]
[ 20.547 -0.194 3.220 -2.263]]
<NDArray 3x4 @cpu(0)>
id(y) after operation: 1847462282128
If we do not plan to reuse x, then the result can be assigned to x itself. We may
do this in MXNet:
print('id(x) before operation:', id(x))
x += y
print(x)
print('id(x) after operation:', id(x))
id(x) before operation: 1847462202168
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
id(x) after operation: 1847462202168
2.9.8 Slicing from a multi-dimensional array
To read the second and third rows from x, we do this:
print(x)
x[1:3] # read the second and third rows from x
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
[[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 2x4 @cpu(0)>
x[1:2,1:3] # read the 2nd row, 2nd to 3rd columns from x
[[-2.504 12.811]]
<NDArray 1x2 @cpu(0)>
x[1,2] = 88.0 # change the value at the 2nd row, 3rd column
print(x)
x[1:2,1:3] = 88.0
# change the values in the 2nd row, 2nd to 3rd columns
print(x)
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 88.000 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
[[ 1.343 3.358 6.210 6.438]
[ 7.653 88.000 88.000 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
2.9.9 Broadcasting
What would happen if one adds a vector (1D array) y to a matrix (2D array) X? In
Python, this can be done, and is often done in machine learning. Such an operation
is performed using a procedure called “broadcasting”: the low-dimensional array
is duplicated along any axis with dimension 1 to match the shape of the high-
dimensional array, and then the desired operation is performed.
import numpy as np
y = np.arange(6) # y has an initial shape of (6,)
print('y = ', y,'Shape of y:', y.shape)
x = np.arange(24)
print('x = ', x,'Shape of x:', x.shape)
y = [0 1 2 3 4 5] Shape of y: (6,)
x = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23]
Shape of x: (24,)
X = np.reshape(x,(4,6)) # reshape to (4,6), to match (6,) of y
print(X.shape)
print('X = \n', X,'Shape of X:', X.shape)
print('X + y = \n', X + y) # y's shape expands to (4,6) by copying
# the data along axis 0,then the addition
(4, 6)
X =
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]] Shape of X: (4, 6)
X + y =
[[ 0 2 4 6 8 10]
[ 6 8 10 12 14 16]
[12 14 16 18 20 22]
[18 20 22 24 26 28]]
print('shape of X is:',X.shape, ' shape of y is:',y.shape)
print(np.dot(X,y))
shape of X is: (4, 6) shape of y is: (6,)
[ 55 145 235 325]
z = np.reshape(X,(2,3,4))
print (z)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
a = np.arange(12).reshape(3,4)
a.fill(100)
a
array([[100, 100, 100, 100],
[100, 100, 100, 100],
[100, 100, 100, 100]])
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
[[112, 113, 114, 115],
[116, 117, 118, 119],
[120, 121, 122, 123]]])
a = np.arange(4)
a.fill(100)
a
array([100, 100, 100, 100])
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
[[112, 113, 114, 115],
[116, 117, 118, 119],
[120, 121, 122, 123]]])
Broadcasting Rules: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html.
When operating on two arrays, NumPy compares their shapes element-wise. It
starts with the trailing (rightmost) dimensions and works its way to the left. Two dimensions are
compatible when
1. they are equal, or
2. one of them is 1.
If these conditions are met, the dimensions are compatible; otherwise they are not,
and a ValueError is raised. Figure 2.2 shows some examples.
Broadcasting operations:
Figure 2.2: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
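A minimal sketch (not from the text) of the two outcomes: a compatible pair of shapes broadcasts, while an incompatible pair raises a ValueError.

import numpy as np
A = np.ones((4, 6))
b = np.arange(6) # shape (6,) is compatible with (4, 6)
print((A + b).shape) # (4, 6)
c = np.arange(4) # shape (4,) is NOT compatible with (4, 6)
try:
    A + c
except ValueError as err:
    print('ValueError:', err)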
2.9.10 Converting between MXNet NDArray and NumPy
Converting between MXNet NDArrays and NumPy arrays is easy. The converted
arrays do not share memory.
import numpy as np
x = np.arange(24).reshape(4,6)
y = np.arange(6)
x # display x, the last expression in the cell
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
from mxnet import nd
ndx = nd.array(x)
npa = ndx.asnumpy()
type(npa)
numpy.ndarray
npa
array([[ 0.000, 1.000, 2.000, 3.000, 4.000, 5.000],
[ 6.000, 7.000, 8.000, 9.000, 10.000, 11.000],
[ 12.000, 13.000, 14.000, 15.000, 16.000, 17.000],
[ 18.000, 19.000, 20.000, 21.000, 22.000, 23.000]],
dtype=float32)
ndy = nd.array(npa)
ndy
[[ 0.000 1.000 2.000 3.000 4.000 5.000]
[ 6.000 7.000 8.000 9.000 10.000 11.000]
[ 12.000 13.000 14.000 15.000 16.000 17.000]
[ 18.000 19.000 20.000 21.000 22.000 23.000]]
<NDArray 4x6 @cpu(0)>
To figure out the detailed differences between the MXNet NDArrays and the
NumPy arrays, one may refer to https://gluon.mxnet.io/chapter01_crashcourse/ndarray.html.
2.9.11 Subsetting in Numpy
Another feature of numpy arrays is that they are easy to subset.
print(bmi[bmi > 25]) # Print only those with BMI above 25
[ 25.556 25.117 25.051]
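Conditions can also be combined with & (and) and | (or), with each condition in parentheses, and np.where() returns the indices of the selected entries. A minimal sketch, not from the text:

print(bmi[(bmi > 25.0) & (bmi < 25.5)]) # BMI between 25.0 and 25.5
print(np.where(bmi > 25)) # indices of the entries above 25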
2.9.12 Numpy and universal functions (ufunc)
NumPy is a useful library for Python for large-scale numerical computations,
including but not limited to machine learning. It supports efficient operations for
bulky data in multi-dimensional arrays (ndarrays) of large size. It also offers a large
collection of high-level mathematical functions to operate on these arrays [1]. In
2005, Travis Oliphant created NumPy by incorporating features of Numarray into
Numeric (which was originally created by Jim Hugunin with several other developers),
with extensive modifications. More details can be found at https://en.wikipedia.org/wiki/NumPy.
The Numpy documentation states that “a universal function (or ufunc for short)
is a function that operates on ndarrays in an element-by-element fashion, supporting
array broadcasting, type casting, and several other standard features. A ufunc is
a “vectorized” wrapper for a function that takes a fixed number of specific inputs
and produces a fixed number of specific outputs. In NumPy, universal functions
are instances of the numpy.ufunc class (to be discussed later). Many of the built-in
functions are implemented in compiled C code. The basic ufuncs operate on scalars,
but there is also a generalized kind for which the basic elements are sub-arrays
(vectors, matrices, etc.), and broadcasting is done over other dimensions. One can
also produce custom ufunc instances using the frompyfunc factory function”.
More details on ufuncs can be found in the SciPy documentation (https://docs.scipy.org/doc/numpy/reference/ufuncs.html). Given below are two of the many ufuncs:
• exp(x, /[, out, where, casting, order, ...]): Calculate the exponential of all elements in the input array.
• log(x, /[, out, where, casting, order, ...]): Natural logarithm, element-wise.
One may use the following for more details.
# help(np.log) # use this to find out what a ufunc does
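As a quick illustration (a minimal sketch, not from the text), the exp and log ufuncs operate element-wise on a whole array, and np.frompyfunc can turn an ordinary Python function into a custom ufunc:

a = np.array([1.0, 2.0, 4.0])
print(np.exp(a)) # element-wise exponential
print(np.log(a)) # element-wise natural logarithm
double = np.frompyfunc(lambda v: 2*v, 1, 1) # 1 input, 1 output
print(double(a)) # element-wise doubling via a custom ufunc (object dtype)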
2.9.13 Numpy array and vector/matrix
Numpy primarily uses the structure of arrays. The behavior of a numpy array is
similar to, and yet quite different from, the conventional concepts of vector and matrix
that we learned in linear algebra. This can be quite confusing and even frustrating
during coding and debugging. The author has been bothered by such differences quite
frequently in the past. The items below are some of the issues one may need to pay
attention to.
(1) A Numpy array is in principle multidimensional
First, we note that a numpy array is far more than a 1D vector or a 2D matrix. By
default, an array can have as many as 32 dimensions, and even that can be changed. Thus, the numpy
array is extremely powerful, and is a data structure that works well for complex machine
learning models that use multidimensional datasets and data flows. Element-wise
operations, broadcasting, handling of flows of large volumes of data, etc. all work very
efficiently. It does not, however, follow precisely the concepts of vectors and matrices
that we established in conventional linear algebra and use most frequently.
This is essentially the root of the confusion in many cases, in the author’s opinion.
Understanding the following key points is a good start to mitigate the confusion.
(2) Numpy 1D array vs. vector
A Numpy 1D array is similar to the usual vector concept in linear algebra. The
difference is that a 1D array does not distinguish between row and column vectors. It is
just a 1D array with a shape of (n,), where n is the length of the array. It behaves
largely like a row vector, but not quite. For example, transpose() has no effect on
it, because what the transpose() function does is swap two axes of an
array that has two or more axes.
The column vector in linear algebra should be treated in numpy as a special case
of a 2D array with only one column. One can create an array like the column vector
in linear algebra by adding an additional axis to the 1D array. See the following
examples.
a = np.array([1, 2, 3]) # create an 1D array with length 3
print(a, a.T, a.shape, a.T.shape) # its shape will be (3,)
[1 2 3] [1 2 3] (3,) (3,)
As shown, “.T” for transpose() has no effect to the 1D array.
an = a[:,np.newaxis] # add in a newaxis, shape becomes (3, 1)
print(an, an.shape) # it's a column vector, a special 2D array
[[1]
[2]
[3]] (3, 1)
print(an.T, an.T.shape) # .T works; it becomes a row vector
[[1 2 3]] (1, 3)
The axis-added array becomes a 2D array, and the transpose() function now works. It
creates a “real” row vector, which is a special case of a 2D array in numpy.
The same can also be achieved using the following tricks.
a1 = a.reshape(a.shape+(1,)) # reshape, adds one more dimension
print(a1, a1.shape)
aN = a[:,None] # None is equivalent to np.newaxis here
print(aN, aN.shape)
[[1]
[2]
[3]] (3, 1)
[[1]
[2]
[3]] (3, 1)
Adding an axis can be done at any axis:
a0 = a[np.newaxis,:] # the new axis is added to the 0-axis
print(a0, a0.shape) # shape becomes (1,3), a row vector
[[1 2 3]] (1, 3)
print(an+a0) # interesting trick to create a Hankel matrix
[[2 3 4]
[3 4 5]
[4 5 6]]
Once we know how to create arrays equivalent to the usual row and column
vectors of conventional linear algebra, we shall be much more comfortable
debugging codes when encountering strange behavior.
Another often encountered example is solving linear system equations.
From conventional linear algebra, we know that the right-hand-side (rhs) should
be a column vector, and the solution should also be a column vector. When using
numpy.linalg.solve() for a linear algebraic equation, we can feed in a 1D array
as the rhs vector, and it will return a solution that is also a 1D array. Of course,
we can also feed in a column vector (a 2D array with only one column). In
that case, we will get the solution in a 2D array with only one column. We shall see
examples in Section 3.1.11, and many cases in later chapters. These behaviors are
all expected and correct in numpy.
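A minimal sketch (not from the text, using a small hypothetical 2-by-2 system) showing both ways of feeding the right-hand side to numpy.linalg.solve():

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b1 = np.array([9.0, 8.0]) # 1D rhs, shape (2,)
x1 = np.linalg.solve(A, b1)
print(x1, x1.shape) # 1D solution, shape (2,)
b2 = b1[:, np.newaxis] # column rhs, shape (2, 1)
x2 = np.linalg.solve(A, b2)
print(x2, x2.shape) # column solution, shape (2, 1)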
(3) Numpy 2D array vs. matrix
A Numpy 2D array is largely similar to the usual matrix in linear algebra, but not
quite. For example, the matrix multiplication of linear algebra is done in
numpy using the dot-product, such as np.dot() or the “@” operator (available
since Python 3.5). The “*” operator is an element-wise multiplication, as shown
in Section 2.9.4. We will see more examples in Chapter 3 and later chapters. Also,
some operations on a numpy array can result in a dimension change. For example,
when mean() is applied to an array, the dimension is reduced. Thus, care is required
regarding which axis is collapsed.
Note that there is a numpy matrix class (see the definition of class in Section 2.14)
that can be used to create matrix objects. These objects behave quite similarly to
the matrix in linear algebra. We try not to use it, because it will be deprecated one
day, as announced in the online document numpy-ref.pdf.
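The following minimal sketch (not from the text) contrasts the element-wise “*”, the dot-product “@”, and the axis collapse caused by mean():

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
print(A * B) # element-wise multiplication
print(A @ B) # matrix multiplication, same as np.dot(A, B)
print(A.mean(axis=0), A.mean(axis=0).shape) # axis 0 is collapsed: shape (2,)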
Based on the author’s limited experience, once we are aware of these differences
and behavior subtleties (more discussion later when we encounter them), we can
pay attention to them as they arise. It is often helpful to frequently
check the shapes of the arrays. This allows us to work more effectively with
powerful numpy arrays, including performing proper linear algebra analysis. At this
moment, it is quite difficult to discuss the theorems of linear algebra using 1D, 2D,
or higher-dimensional array concepts. In the later chapters, we will still follow the
general rules and principles, and use the terms vector and matrix of conventional
linear algebra, because many theoretical findings and arguments are based on them. A
vector refers generally to a 1D numpy array, and a matrix to a 2D numpy array.
When examining the outcomes of numpy codes, we shall notice the behavior subtleties
of the numpy arrays.
2.10 Sets: No Duplication
A set is a collection similar to a list, but with no duplicate entries (and no inherent ordering).
sentence1 = "His first name is Mark and Mark is his first name"
words1 = sentence1.split() # use split() to form a set of words
print(words1) # whole list of these words is printed
word_set1 = set(words1) # convert to a set
print(word_set1) # print. No duplication
['His', 'first', 'name', 'is', 'Mark', 'and', 'Mark', 'is', 'his',
'first', 'name']
{'and', 'His', 'is', 'first', 'Mark', 'his', 'name'}
Using a set to get rid of duplication is useful for many situations. Many other
useful operations can be applied to sets. For example, we may want to find the
intersection of two sets. To show this, let us create a new list and then a new set.
sentence2 = "Her first name is Jane and Jane is her first name"
words2 = sentence2.split()
print(words2) # whole list of these words is printed.
word_set2 = set(words2) # convert to a set
print(word_set2) # print. No duplication
['Her', 'first', 'name', 'is', 'Jane', 'and', 'Jane', 'is', 'her',
'first', 'name']
{'her', 'Jane', 'Her', 'is', 'first', 'and', 'name'}
2.10.1 Intersection of two sets
print(word_set1.intersection(word_set2)) #intersection of two sets
{'first', 'and', 'name', 'is'}
2.10.2 Difference of two sets
print(word_set1.difference(word_set2))
{'his', 'His', 'Mark'}
This finds words in word_set1 that are not in word_set2.
We may also want to find the words that are in either set but not in both (the
symmetric difference), as follows:
print(word_set1.symmetric_difference(word_set2))
{'her', 'His', 'Jane', 'Her', 'Mark', 'his'}
Can we do similar operations to lists? Try this.
#print(words1.intersection(words2)) # intersection of two list?
# No. It throws an AttributeError.
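To use set operations on plain lists, one can simply convert them to sets first. A minimal sketch, not from the text:

print(set(words1) & set(words2)) # convert to sets, then take the intersection
print(set(words1) - set(words2)) # words in words1 but not in words2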
2.11 List Comprehensions
We used list comprehension for particular situations a few times. List comprehension
is a very powerful tool for operations on all iterables including lists and numpy
arrays. When used on a list, it creates a new list based on another list, in a single,
readable line.
In the following example, we would like to create a list of integers which specify
the length of each word in a sentence, but only if the word is not “the”. The natural
way to do this is as follows:
sentence="Raises the Sun and comes the light" # Create a string.
words = sentence.split() # Create a list of words using split().
word_lengths = [] # Empty list for the lengths of the words
words_nothe = [] # Empty list for the words that are not "the"
for word in words:
if word != "the":
word_lengths.append(len(word))
words_nothe.append(word)
print(words_nothe, ' ',word_lengths)
['Raises', 'Sun', 'and', 'comes', 'light'] [6, 3, 3, 5, 5]
With a list comprehension, we simply do this:
words = sentence.split()
word_lengths = [len(word) for word in words if word != "the"]
words_nothe = [word for word in words if word != "the"]
print(words_nothe, ' ',word_lengths)
['Raises', 'Sun', 'and', 'comes', 'light'] [6, 3, 3, 5, 5]
The following is even better:
words_nothe2=[] # Empty list to hold the lists of word & length
# for words that are not "the", lists in list
words_nothe2=[[word,len(word)] for word in words if word != "the"]
# one may use () instead of the inner []
print(words_nothe2)
[['Raises', 6], ['Sun', 3], ['and', 3], ['comes', 5], ['light', 5]]
The following example applies list comprehension to numpy arrays:
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
fx = []
x = np.arange(-2, 2, 0.5) # Numpy array with equally spaced values
fx = np.array([-(-xi)**0.5 if xi < 0.0 else xi**0.5 for xi in x])
# Creates a piecewise function (such as activation function).
# The created list is then converted to a numpy array.
print('x =',x); print('fx=',fx)
x = [-2.000 -1.500 -1.000 -0.500 0.000 0.500 1.000 1.500]
fx= [-1.414 -1.225 -1.000 -0.707 0.000 0.707 1.000 1.225]
2.12 Conditions, “if” Statements, “for” and “while” Loops
In machine learning programming, one frequently uses conditions, “if” statements,
“for” and “while” loops. Boolean variables are used to evaluate conditions. The
boolean values True or False are returned when an expression is compared or
evaluated.
2.12.1 Comparison operators
Comparison operators include ==, <, <=, >, >=, and !=. The == operator compares
the values of the two operands and checks for value equality, “!=” checks for
inequality, and the others check for ordering (less/greater than, possibly combined with equality).
x = 2
print(x == 2) # The comparison results in a boolean value: True
print(x == 3) # The comparison results in a boolean value: False
print(x < 3) # The comparison results in a boolean value: True
True
False
True
x = 2
if x == 2:
print("x equals 2!")
else:
print("x does not equal to 2.")
x equals 2!
name, age = "Richard", 18
if name == "Richard" and age == 18: # if-block!
print("He is famous. His name is", name, \
"and he is only",age,"years old.")
He is famous. His name is Richard and he is only 18 years old.
temp_critical = 48.0 #unit: degree Celsius (C).
current_temp = 50.0
if current_temp >= temp_critical:
print("The current temperature is", current_temp, \
"degree C. It is above the critical temperature of",\
temp_critical,"degree C. Actions are needed. ")
else:
print("The current temperature is", current_temp, \
"degree C. It is below the critical temperature of",\
temp_critical, "degree C. No action is needed for now.")
# Notice the use of block, and "\" to break a long line.
The current temperature is 50.0 degree C. It is above the
critical temperature of 48.0 degree C. Actions are needed.
2.12.2 The “in” operator
The “in” operator is used to check if a specified object exists within an iterable
object container, such as a list.
name1, name2 = "Kevin", 'John'
groupA= ["Kevin", "Richard"] # a list with two strings.
if name1 in groupA:
print("The person's name is either",groupA[0], "or", groupA[1])
if name2 in groupA:
print("The person's name is either",groupA[0], "or", groupA[1])
else:
print(name2, "is not found in group A")
The person's name is either Kevin or Richard
John is not found in group A
2.12.3 The “is” operator
Unlike the double equals operator “==”, the “is” operator does not check the values
of the operands. It checks whether both the operands refer to the same object or
not. For example,
x, y = ['a','b'], ['a','b']
z = y # makes z pointing to y
print(x == y,' x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints True,
# because the values in x and y are equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# because x and y have different IDs
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z
True x= ['a', 'b'] y= ['a', 'b'] 0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'b'] 0x1ae25570188 0x1ae25570208
True y= ['a', 'b'] z= ['a', 'b'] 0x1ae25570208 0x1ae25570208
y[1]='x' # change the 2nd element
print('After change one value in y')
print(x == y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# their values are no longer equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y)))
# False, x is NOT y
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z:
# they change together!
After change one value in y
False x= ['a', 'b'] y= ['a', 'x'] 0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'x'] 0x1ae25570188 0x1ae25570208
True y= ['a', 'x'] z= ['a', 'x'] 0x1ae25570208 0x1ae25570208
y.append(x)
print('After change one value in y')
print(x == y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # Prints False,
# their values are no longer equal
print(x is y,'x=',x,'y=',y,hex(id(x)),hex(id(y)))
# False, x is NOT y
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z:
# they change together!
After change one value in y
False x= ['a', 'b'] y= ['a', 'x', ['a', 'b'], ['a', 'b']]
0x1ae25570188 0x1ae25570208
False x= ['a', 'b'] y= ['a', 'x', ['a', 'b'], ['a', 'b']]
0x1ae25570188 0x1ae25570208
True y= ['a', 'x', ['a', 'b'], ['a', 'b']] z= ['a', 'x', ['a',
'b'], ['a', 'b']] 0x1ae25570208 0x1ae25570208
2.12.4 The ‘not’ operator
Using “not” before a boolean expression inverts the value of the expression.
print(not False) # Prints out True
print(not False == False) # Prints out False
True
False
The “not” operator can also be used with “is” and “in”: “is not”,“not in”:
x, y = [1,2,3], [1,2,3]
z = y # makes z pointing to same object as y
print(x == y,' x=',x,'y=',y,hex(id(x)),hex(id(y))) # True,
# because the values in x and y are equal
print(x is not y,'x=',x,'y=',y,hex(id(x)),hex(id(y))) # True,
# because x and y are different objects
print(y is z,' y=',y,'z=',z,hex(id(y)),hex(id(z))) # True, y is z
True x= [1, 2, 3] y= [1, 2, 3] 0x1ae25570348 0x1ae255702c8
True x= [1, 2, 3] y= [1, 2, 3] 0x1ae25570348 0x1ae255702c8
True y= [1, 2, 3] z= [1, 2, 3] 0x1ae255702c8 0x1ae255702c8
2.12.5 The “if ” statements
As any other language, the “if” statement is often used for programming, we have
already seen some earlier. The following are more examples for using Python’s
conditions in the “if” statement with code blocks:
temp_good = 22.0 # unit: degree Celsius (C).
temp_now1 = 48.0 # Readers may change this value and try
statement1=(temp_now1<=(temp_good+10.0))&(temp_now1>=(temp_good-10.0))
# comfortable
statement2 = temp_now1 < (temp_good-10.0) # cold
statement3 = temp_now1 > (temp_good+10.0) # hot
if statement1 is True: # do not forget ":"
print("Okay, it is",temp_now1,"degree C. Comfortable. Let's go.")
pass # do something
elif statement2 is True: # do not forget ":"
print("No, it is",temp_now1,"degree C. Too cold. We cannot go.")
pass # do something else
elif statement3 is True:
print("No, it",temp_now1,"degree C. Too hot. We cannot go.")
pass # do something else
else:
print("Let's check the temperature.") # do another thing
pass
No, it 48.0 degree C. Too hot. We cannot go.
Note that Python places no limit on how many elif blocks one can use in an if
statement.
2.12.6 The “for” loops
There are two types of loops in Python, for and while. We have used “for” loops
already. We discuss it here in more detail together with the while loops. A for loop
iterates over a given iterable sequence. The starting and stopping points and the
step-size are controlled by the sequence, as shown in the following example. It is
sequence controlled.
primes = [2, 3, 5] # define a list (that is iterable)
for prime in primes: # do not forget ":"
print(prime, end=' ')
2 3 5
For loops can iterate over a sequence of numbers using the “range” function.
range() is a built-in function which returns a range object: a sequence of integers
between the given start integer and the stop integer (the stop value is excluded).
It is generally used for iteration with for loops.
print('Members in range(10) are')
for n in range(10): # Syntax: range(start, stop[, step])
print(n, end=',')
print('\nMembers in range(3, 8) are')
for n in range(3, 8): # Default step is 1
print(n, end=',')
Members in range(10) are
0,1,2,3,4,5,6,7,8,9,
Members in range(3, 8) are
3,4,5,6,7,
print('Members in range(-3, 10, 2) are')
for n in range(-3, 10, 2): # starting from a negative value
print(n, end=',')
print('\nMembers in range(10, -3, -2) are')
for n in range(10, -3, -2): # reverse range
print(n, end=',')
Members in range(-3, 10, 2) are
-3,-1,1,3,5,7,9,
Members in range(10, -3, -2) are
10,8,6,4,2,0,-2,
For a given list of numbers, let us display each element and its double, using a
for loop and the range() function.
print("Double the numbers in a list, using for-loop and range()")
given_list = [10, 30, 40, 50]
for i in range(len(given_list)):
print("Index["+str(i)+"]","Value in the given list is",
given_list[i],", and its double is", given_list[i]*2)
Double the numbers in a list, using for-loop and range()
Index[0] Value in the given list is 10 , and its double is 20
Index[1] Value in the given list is 30 , and its double is 60
Index[2] Value in the given list is 40 , and its double is 80
Index[3] Value in the given list is 50 , and its double is 100
The range() function returns an immutable sequence object of integers, so it is
possible to convert a range() output to a list, using the list class. For example,
print("Converting python range() output to a list")
list_rng = list(range(-10,12,2))
print(list_rng)
Converting python range() output to a list
[-10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10]
2.12.7 The “while” loops
The “while” loops repeat as long as a certain boolean condition is met. The condition
controls the operations. For example,
#To print out 0,1,2,3,4,5,6,7,8,9,
count = 0
while count < 10: # do not forget ":"
print(count, end=',')
count += 1 # This is the same as count = count + 1
0,1,2,3,4,5,6,7,8,9,
“break” and “continue” statements: break is used to exit a “for” loop or a “while”
loop, whereas continue is used to skip the current block, and return to the “for” or
“while” statement.
# print all integers below a given limit.
count = 0
while True:
print(count,end=',')
count += 1
if count >= 10:
break
print('\n')
# Prints out only even numbers: 0,2,4,6,8,
for n in range(10): # for-loop to control the range
if n % 2 != 0: # Check condition, control what to print
continue # skip the print below and go on to the next n
print(n,end=',')
0,1,2,3,4,5,6,7,8,9,
0,2,4,6,8,
When the loop condition fails (i.e., the loop finishes normally), the code in the “else”
part is executed. If a break statement is executed inside the loop, the “else” part is
skipped. Note that the “else” part is still executed even if there is a continue
statement in the loop body.
# Prints out 0,1,2,3,4 and then it prints "count value reached 5"
count=0
nlimit = 5
while(count<nlimit):
print(count, end=',')
count +=1
else:
print("count value reached %d" %(nlimit))
# Prints out 1,2,3,4
for i in range(1, 10):
if(i%5 == 0): # modulo division (%)
break
print(i, end=',')
else:
print("This is not printed because for-loop is terminated due\
to the break but not due to fail in condition")
0,1,2,3,4,count value reached 5
1,2,3,4,
2.12.8 Ternary conditionals
The following 4-line code
condition=True
if condition:
x=1
else:
x=0
print(x)
1
can be written in one line, with ternary conditionals:
condition=True
x=1 if condition else 0
print(x)
1
It is simple, readable, and DRY (“Don’t Repeat Yourself”). Thus, ternary conditionals are frequently used in Python.
2.13 Functions (Methods)
Functions offer a convenient way to divide code into useful blocks that can be called
an unlimited number of times when needed. This can drastically reduce code repetition, and
make code cleaner, more readable, and easier to maintain. In addition, functions are
a good way to define interfaces for easy sharing of code among programmers.
2.13.1 Block structure for function definition
A function has a “block” structure. Block keywords include those we have already
seen, such as “if”, “for”, and “while”. Functions in Python are defined using the
block keyword “def”, followed by a function name that is also the block’s name.
The function is called using the function name followed by (), which brackets the
arguments, if any. Try this simplest function:
def print_hello(): # do not forget ":"
print("Hello, welcome to this simple function!")
print_hello()
Hello, welcome to this simple function!
2.13.2 Function with arguments
In the simple case given above, no argument is required. Functions are often created
with required arguments that are variables passed from the caller to the function.
def greeting_student(username, greeting):
print(f"Hello, {username}, greetings! Wish you {greeting}")
greeting_student("Kevin", "a fun journey in using functions!")
Hello, Kevin, greetings! Wish you a fun journey in using functions!
Functions may be created with return values to the caller, using the keyword
“return”.
def sum_two_numbers(a, b):
return a + b
x, y = 2.0, 8.0
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}') # print('%f + %f = %f'%(x,y,apb))
x, y = 20, 80
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}, perfect!')
#print('%d + %d = %d: perfect!'%(x,y,apb))
print(f'{x + y}, perfect!') #f-string allow the use of operations
2.0 + 8.0 = 10.0
20 + 80 = 100, perfect!
100, perfect!
Variable scope: the LEGB rule (local, enclosing, global, built-ins) defines the sequence
in which Python searches for a variable. The search terminates when the variable is
found.
def sum_two_numbers(a, b):
a += 1
print('Inside the function, a=',a)
print('Inside the function x=',x)
return a + b
x, y = 2.0, 8.0
print('Before the function is called,x=',x,'y=',y)
apb = sum_two_numbers(x, y)
print('After the function is called, %f + %f = %f'%(x,y,apb))
Before the function is called,x= 2.0 y= 8.0
Inside the function, a= 3.0
Inside the function x= 2.0
After the function is called, 2.000000 + 8.000000 = 11.000000
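The “enclosing” level of the LEGB rule applies to nested functions: an inner function can read variables defined in the function that encloses it. A minimal sketch, not from the text:

def outer():
    msg = "enclosing variable" # lives in the enclosing scope
    def inner():
        print('Inside inner:', msg) # found via the E (enclosing) in LEGB
    inner()
outer()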
2.13.3 Lambda functions (Anonymous functions)
A lambda function is a single-line anonymous function. It may be the simplest form
of a function, and one of the most useful. A linear function can be defined simply as
f = lambda x, k, b: k * x + b # do not forget ":"
print(f(1,1,1),f(2,2,2),f(1,2,3))
2 6 5
A quadratic function can be defined as
f = lambda x, a, b, c: a*x**2 + b*x +c
print(f(2,2,-4,6))
Often, lambda functions are used together with normal functions, especially for
returning a value, where a single-line function comes in handy.
def func2nd_order(a,b,c):
return lambda x: a*x**2 + b*x +c
f2 = func2nd_order(2,-4,6)
print('f2(2)=',f2(2), 'or', func2nd_order(2,-4,6)(2))
f2(2)= 6 or 6
2.14 Classes and Objects
A class is a single entity that encapsulates variables and functions (or methods).
A class is essentially a template for creating class objects. One can create an unlimited
number of class objects with it, and each class object gets its structure, variables/attributes,
and functions (or methods) from the class. References used in this
section include the following:
• https://www.python-course.eu/python3_class_and_instance_attributes.php.
• https://realpython.com/instance-class-and-static-methods-demystified/.
2.14.1 A simplest class
class C: # Define a class named C
''' A simplest possible class named "C" '''
ca = "class attribute" # an attribute defined in the class
help(C) # check out what has been created
Help on class C in module __main__:
class C(builtins.object)
| A simplest possible class named "C"
|
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| --------------------------------------------------------------
| Data and other attributes defined here:
|
| ca = 'class attribute'
This example shows how a class is structured, how the comments given in ''' ''' in the
class definition (the docstring) are used to convey a message to the user, and how an
attribute can be created in the class. We can now use it to observe some behavior of a
class attribute and instance attributes.
i1 = C() # an instance i1 (a class object) is created.
i2 = C() # an instance i2 (another class object) is created.
print('i1.ca=',i1.ca) # prints: 'class attribute!'
print('i2.ca=',i2.ca) # prints: 'class attribute!'
print('C.ca =',C.ca) # prints: 'class attribute!'
#print('ca=',ca) # NameError:'ca' is not defined
i1.ca= class attribute
i2.ca= class attribute
C.ca = class attribute
C.ca = "This is a changed class attribute 'ca'"
# Changing the class attribute, via class
print('C.ca =',C.ca)
print('i1.ca=',i1.ca)
print('i2.ca=',i2.ca)
C.ca = This is a changed class attribute 'ca'
i1.ca= This is a changed class attribute 'ca'
i2.ca= This is a changed class attribute 'ca'
Note the values in the instances are also changed.
i1.ca = "This is a changed instance attribute 'ca'"
# Changing an instance attribute
print('i1.ca=',i1.ca) # Changed
print('C.ca =',C.ca) # Unchanged
print('i2.ca=',i2.ca) # Unchanged
i1.ca= This is a changed instance attribute 'ca'
C.ca = This is a changed class attribute 'ca'
i2.ca= This is a changed class attribute 'ca'
The change is only effective for the instance attribute that is changed.
C.ca = "The 2nd changed class attribute 'ca'"
# Changing the class attribute, via class
print('C.ca =',C.ca) # should change accordingly
print('i1.ca=',i1.ca) # Will not change! It no longer follows C,
# because it was changed after creation
print('i2.ca=',i2.ca) # should change according to class C,
# because it has not been changed since creation
C.ca = The 2nd changed class attribute 'ca'
i1.ca= This is a changed instance attribute 'ca'
i2.ca= The 2nd changed class attribute 'ca'
Class attributes and object instance attributes are stored in separate dictionaries:
C.__dict__
mappingproxy({'__module__': '__main__',
'__doc__': ' A simplest possible class named "C" ',
'ca': "The 2nd changed class attribute 'ca'",
'__dict__': <attribute '__dict__' of 'C' objects>,
'__weakref__': <attribute '__weakref__' of 'C' objects>})
i1.__dict__
{'ca': "This is a changed instance attribute 'ca'"}
It is clear that a dictionary has been created when a change is made at the instance
level, departing from the class level.
i2.__dict__
{}
No dictionary has been created, because no change is made at the instance level. It
stays with the class.
i2.ca = "Make now a change at the instance y to the attribute 'ca'"
i2.__dict__
{'ca': "Make now a change at the instance y to the attribute 'ca'"}
A dictionary has now been created, because a change was made at the instance
level; it has departed from the class level. Any future change at the class level to this
attribute will no longer affect the attribute at this instance level.
2.14.2 A class for scientific computation
Let us look at an example of simple scientific computation. We first create a class
called Circle to compute the area of a circle for a given radius. The following is the
code:
class Circle:
''' Class "Circle": Compute the area of a circle '''
pi = 3.14159 #class attribute of constants used class-wide
# and class specific
def __init__(self, radius): # a special constructor
# __init__ is executed when the class is called.
# it is used to initiate a class. For this simple
# task we need only one variable: radius.
# "self" is used to reserve an argument place for an
# instance (to-be-created) itself to pass along.
# It is used for all the functions in a class.
self.radius = radius # This allows the instance accessing
# the variable: radius.
def circle_area(self): # function computes circle area
return self.pi * self.radius **2 # pi gets in there via
# the object instance itself
#help(Circle) # check out what has been created
We can now use this class to perform computations.
r = 10
c10 = Circle(r) # create an instance c10. c10 is now passed to
# self inside the class definition
# 10 is passed to the self.radius
print('Circle.pi before re-assignment',Circle.pi)
# access pi via class
print('Radius=',c10.radius)
# access via object c10.radius is the self.radius in __init__
print('c10.pi before re-assignment', c10.pi)
# The class attribute is accessed via instance attribute
Circle.pi before re-assignment 3.14159
Radius= 10
c10.pi before re-assignment 3.14159
c10.pi=3.14 # this will change the constant for instance c10
# It will not change the class-wide pi value
print('c10.pi after re-assignment via c10.pi:',c10.pi)
print('Circle.pi after re-assignment via c10.pi:',Circle.pi)
print('circle_area of c10 =',c10.circle_area())
print('circle_area of Circle100=',Circle(100).circle_area())
c10.pi after re-assignment via c10.pi: 3.14
Circle.pi after re-assignment via c10.pi: 3.14159
circle_area of c10 = 314.0
circle_area of Circle100= 31415.899999999998
It is seen that the Class Circle works well. Let us now create a subclass.
2.14.3 Subclass (class inheritance)
Subclasses can often be used to take advantage of the inheritance feature in
Python. This allows us to create new classes by fully making use of an existing class
and its entire structure (attributes and functions), without affecting the ongoing
use of the existing class. It is thus also useful for upgrading existing programs,
because it reduces duplication.
Assume that the Circle code created above has already been distributed and
used by many. We now decide to create another class to compute the area of a
partial circle, given the portion of the circle. We can create a
subclass for this purpose, called P_circle, without affecting the use of the already
distributed Circle. The following is the code:
class P_circle(Circle):
# Subclass P_circle referring the base (or parent)
# Circle in (). This establishes the inheritance
''' Subclass "P_circle" based on Class "Circle": Compute the\
area of a circle portion '''
def __init__(self,radius,portion):
# with 3 attributes: self, radius, and portion.
super().__init__(radius) # This brings in base attributes
# from the base class Circle.
self.portion = portion # Subclass attribute.
def pcircle_area(self):
# define a function to compute the area of a partial circle
return self.portion*self.circle_area() # New function in
# subclass. The base class Circle is used here.
#help(P_circle) # check out what has been created
Readers may remove the "#" in the above cell, execute it, and take a moment to read
through the information, to see how the subclass is structured, its connection
with the base class, how self is used to prepare for connections with the future
objects to be assigned, and which attributes and functions are newly created and
which are inherited from the base class.
pc10 = P_circle(10.,0.5) # create an object instance using the
# subclass, with argument radius=10 and portion =50%
pc10.pi # we have the same attribute from the base class
3.14159
pc10.radius # we have the same attribute from the base class
10.0
pc10.pcircle_area() # area of a 50% partial circle is computed
157.0795
Let us now make a change to constant pi via subclass instance.
pc10.pi = 3.14
pc10.pi
3.14
It changed. Let us check pi via the base-class instance c10.
c10.pi
3.14
It remains unchanged. Actions on the subclass do not affect the base class.
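As a quick check of the inheritance relationship, a minimal sketch using the built-in functions isinstance() and issubclass() on the objects created above:
print(isinstance(pc10, P_circle), isinstance(pc10, Circle))
print(issubclass(P_circle, Circle))
True True
True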
2.15 Modules
We now touch upon modules. A module is a Python file that provides a specific
functionality. For example, when writing a finite element program, we may write
one module for creating the stiffness matrix and another for solving the system
equations. Each module is a separate Python file, which can be written and edited
independently. This helps a lot in organizing and maintaining large programs.
A module in Python is a Python file with the .py extension, and the file name is
the module name. Such a module can have a set of functions, classes, and variables
defined in it. Within a module, one can import other modules using the procedure
mentioned at the beginning of this chapter.
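As a minimal sketch (the module name circle_tools.py and its contents are hypothetical), suppose the following file is saved in the working folder:
# File circle_tools.py: a hypothetical user-defined module
pi = 3.14159
def circle_area(radius):
    return pi*radius**2
It can then be imported and used in another script or notebook placed in the same folder:
import circle_tools                      # import the whole module
from circle_tools import circle_area     # or import one function from it
print(circle_tools.circle_area(10.0), circle_area(10.0))   # 314.159 314.159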
2.16 Generation of Plots
Python is also very powerful for generating plots. This is done by importing modules
that are openly available. Here, we shall present a simple demo plot of scattered
circles.
First, we import the modules needed.
import numpy as np
import matplotlib.pyplot as plt
# matplotlib.pyplot is a plot function in the matplotlib module
%matplotlib inline
# to have the plot generated inside the notebook
# Otherwise, it will be generated in a new window
We now generate sample data, and then plot 80 randomly generated circles.
n=80
x=np.random.rand(n) # Coordinates randomly generated
y=np.random.rand(n)
colors=np.random.rand(n)
areas=np.pi*(18*np.random.rand(n))**2
# circle radii from 0~18, randomly generated
plt.scatter(x,y,s=areas,c=colors,alpha=0.8)
plt.show()
Figure 2.3: Randomly generated circular areas filled with different colors.
# Plot a curve
x = range(1000)
y = [i ** 2 for i in x]
plt.plot(x,y)
plt.show();
Figure 2.4: Curve for a quadratic function.
x = np.linspace(0, 1, 1000)**1.5
plt.hist(x);
Figure 2.5: An example of a histogram.
2.17 Code Performance Assessment
Performance assessment on a code can be done in two ways. Typical example codes
are given below. Readers may make use of these codes for accessing computational
performance to his/her codes.
import time # import time module
import numpy as np
g=list(range(10_000_000))
#print(g)
q=np.array(g,'float64')
#print(q)
start = time.process_time()
sg=sum(g)
t_elapsed = (time.process_time() - start)
print(sg,'Elapsed time=',t_elapsed)
start = time.process_time()
sq=np.sum(q)
t_elapsed = (time.process_time() - start)
print(sq,t_elapsed)
49999995000000 Elapsed time= 0.28125
49999995000000.0 0.03125
%%timeit #use timeit
sg=sum(g)
329 ms ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.18 Summary
With the basic knowledge of Python and its related modules, and the essential
techniques for coding, we are now ready to learn to code for computations and
machine learning techniques. In this process, one can also gradually improve one's
Python coding skills.
Reference
[1] C.R. Harris, K.J. Millman, S.J. van der Walt et al., Array programming with NumPy,
Nature, 585(7825), 357–362, Sep 2020. http://dx.doi.org/10.1038/s41586-020-2649-2.
Chapter 3
Basic Mathematical Computations
This chapter discusses typical scientific computations using codes in Python.
We will focus on how numerical data are represented mathematically, how
they are structured or organized, stored, manipulated, operated upon, or
computed in effective manners. Subtleties in those operations in Python will
be examined. At the end of this chapter, techniques for initial treatment
of datasets will also be discussed. The reference materials used in the
chapter include Numpy documentation, Scipy documentation, https://gluon.
mxnet.io/, and https://jupyter.org/. Our discussion shall start with some
basic linear algebra operations on data with a structure of vector, matrix
and tensor.
3.1 Linear Algebra
Linear algebra is essential for any computation that involves big data,
such as in machine learning. We briefly review basic linear algebraic
operations through Python programming, using modules that
have already been developed by the Python community at large. We shall go
through the basic concepts, the mathematical notation, the data structures, and
the computation procedures. Readers should feel free to skim or skip this chapter
if they are already confident in basic linear algebra computations. Our
discussion will start from the data structure. First, we import the necessary
modules and functions.
import sys # import "sys" module
sys.path.append('grbin') # current/relative directory,
# like ..\\..\\code
# Or absolute folder like 'F:\\xxx\\...\\code'
import grcodes as gr # grcodes module is placed in the
# folder above
from grcodes import printx # import a particular function
import numpy as np # Import Numpy package
We will also use the MXNet package. If it is not installed yet, MXNet can be
installed using: pip install mxnet.
After the installation, we import the NDArray module from MXNet.
from mxnet import nd # Import NDArray and give it an alias nd
3.1.1 Scalar numbers
As discussed in Chapter 2, scalar numbers for mathematical
computations have three major types: integer, real number, and complex
number. In Python, such a number is assigned a unique name
and a given address. It is accessed by calling its name, can be updated, and
can be used as an argument of a properly defined function (built-in, defined in the code,
or in an imported module). In Python programming, all these operations
on a number are straightforward. The most often encountered problem in
computation is that a number may exceed the representable range of the machine
(overflow or underflow) and may then become illegal as an argument
for a function. Otherwise, we assume that the numbers generated in Python can
cover the entire real space within the limit of machine accuracy.
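As a minimal sketch of these machine limits (using the standard sys module and numpy), one may inspect the largest representable double-precision float and the machine epsilon, and observe an overflow:
import sys
import numpy as np
print(sys.float_info.max)          # largest double-precision float, about 1.8e308
print(np.finfo(np.float64).eps)    # machine epsilon for float64, about 2.2e-16
print(np.float64(1e308)*10)        # overflows to inf (a runtime warning may be issued)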
3.1.2 Vectors
A vector refers to an object that has more than one component stacked along
one dimension. It can have a physical meaning depending on the type of the
physics problem. For example, a force vector in three-dimensional (3D) space
has three components, each representing the projection of the force vector
onto one of the three axes of the space. The number of components is also known as
the number of degrees of freedom (DoFs). Higher dimensions are encountered in discretized numerical
models, such as the finite element method or FEM (e.g., [1]). In the FEM,
a solid structure is discretized in space with elements and nodes. The number of DoFs
of such a discretized model depends on the number of nodes, which can be
very large, often in the order of millions. Therefore, we form vectors with
millions of components or entries. In machine learning models, the features
and labels can be written in vector form.
In this chapter, we will not discuss much about the physical problems.
Instead, we discuss general aspects of a vector in the abstract, and issues with
the computational operations that we may perform on the vector for a given
coordinate system. The number of DoFs of a vector is also referred to as its length. A
vector of length p has a shape denoted in Python as (p,).
p = 15 # Length of the vector
x = nd.arange(p) # Create a vector that is an nd-array
# using nd.arange() function
gr.printx('x') # x is now a vector with n components
printx('x') # Use the printx function directly
print(x.shape)
x =
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
x =
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
(15,)
Note that in mathematics, a vector is often presented as a
column vector. In Python, it shows as a row vector.
print(x.T) # Transpose of x
print(x.T.shape)
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
(15,)
In mathematics, the transpose of a (column) vector becomes a
row vector, and vice versa. In MXNet NDArray, the transposed vector
is stored in the same place but marked as transposed, so that operations,
such as multiplication, can be performed properly. This saves operations for
physically copying and moving the data, improving efficiency. To confirm
this, we just print out the addresses, as follows.
print(id(x))
print(id(x.T)) # Transpose of x
2212173184080
2212173184080
It is clear that in NDArray, a vector is a vector; one does not distinguish
whether it is a column or a row vector. Its transpose is just a marker on it, and
only one set of data is stored. This is an example of how MXNet pays special
attention to not moving data in memory unnecessarily. We do not seem
to observe this behavior in numpy arrays, as shown below.
xnp = np.arange(15) # a Numpy array is generated
print(id(xnp), xnp.shape)
print(id(xnp.T),xnp.T.shape) # address is changed
2212173195344 (15,)
2212173195424 (15,)
The shape of the numpy array is unchanged (still a 1D array), but its transpose
is given a separate address.
3.1.3 Matrices
A matrix refers to an object that has more than one dimension in its data
structure, where each dimension has more than one component. It can be
viewed as a stack of vectors of some length. It again can have a physical meaning
depending on the type of physics problem. For example, the stiffness (and
mass) matrix created based on a discretized numerical model for a solid structure,
such as the FEM, has a two-dimensional (2D) structure. In each of the
dimensions, the number of components is the same as the number of DoFs. The
whole matrix is a kind of spatially distributed "stiffness" of the structure [1].
In machine learning models, the input data points in the feature space, and the
learning parameters in the hypothesis space, may be written in matrix form.
Again, we will not discuss much about the physical problem here.
Instead, we discuss general aspects of a matrix in the abstract, and issues with
the computational operations that we may perform on the matrix. Such an
abstract matrix can be represented as a multi-dimensional array, whose
shape in Python was defined in Chapter 2.
A = x.reshape((3, 5)) # Create a 2D matrix by reshaping
# a 1D array
print("A=", A, "\n A.T=", A.T)
A=
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14.]]
<NDArray 3x5 @cpu(0)>
A.T=
[[ 0. 5. 10.]
[ 1. 6. 11.]
[ 2. 7. 12.]
[ 3. 8. 13.]
[ 4. 9. 14.]]
<NDArray 5x3 @cpu(0)>
print(A.shape, A.T.shape)
(3, 5) (5, 3)
The shape or the dimension of the matrix A is 3 by 5, and that of its
transpose becomes (5, 3).
As discussed in Chapter 2 in detail, each of the components (or entries)
in the vector or matrix can be accessed by indexing or slicing.
print('A[1, 2] = ', A[1, 2]) # via index
print('row 2', A[2, :]) # slice the 3rd row (count from 0)
print('column 1', A[:, 1]) # slice the 2nd column
A[1, 2] =
[7.]
<NDArray 1 @cpu(0)>
row 2
[10. 11. 12. 13. 14.]
<NDArray 5 @cpu(0)>
column 1
[ 1. 6. 11.]
<NDArray 3 @cpu(0)>
The matrix can also be transposed.
A.T # A transpose, shape becomes 5 by 3
print(id(A), id(A.T))
2212173182960 2213649918832
It is found that, in MXNet, a transposed matrix has its own address (unlike a transposed vector).
3.1.4 Tensors
The term "tensor" requires some clarification. In mathematics or physics,
a tensor has a specific, well-defined meaning. It refers to structured data
(a single number, a vector, or a multi-dimensional matrix) that obeys
a certain tensor transformation rule under coordinate transformations.
Therefore, tensors are a very special group of structured data or objects, and
not all matrices can be called tensors; in fact, most of them are not. So long
as the tensor transformation rules are obeyed, tensors can be classified by
order: scalars are 0th-order tensors, vectors are 1st-order tensors, 2D
matrices are 2nd-order tensors, and so on.
Having said that, in the machine learning (ML) community, any
array with more than two dimensions is called a tensor. It can be viewed as
a stack of matrices of the same shape. This ML tensor carries a meaning
of big data that needs to be structured in high dimensions. The ML tensor
is now used as a general way of representing an array with an arbitrary
dimension or an arbitrary number of axes. ML tensors become
more convenient when dealing with, for example, images, which can have 3D
data structures, with axes corresponding to the height, the width, and the three
color (RGB) channels. In numpy, a tensor is simply a multidimensional array.
Because no such coordinate transformation is usually performed in
machine learning, there will be no possible confusion in our discussion
in this book. From now onwards, we will call the ML tensor a tensor,
with the understanding that it may not obey the real-tensor transformation
rules and that we do not perform such transformations in machine learning
programming.
We now use nd.arange() and then reshape() to create a 3D nd-array.
X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)
X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[[12. 13. 14. 15.]
[16. 17. 18. 19.]
[20. 21. 22. 23.]]]
<NDArray 2x3x4 @cpu(0)>
Element-wise operations are applicable to all tensors.
A = nd.arange(8).reshape((2, 4))
B = nd.ones_like(A)*8 # get shape of A, assign uniform entries
print('A =', A, '\n B =', B)
print('A + B =', A + B, '\n A * B =', A * B)
A =
[[0. 1. 2. 3.]
[4. 5. 6. 7.]]
<NDArray 2x4 @cpu(0)>
B =
[[8. 8. 8. 8.]
[8. 8. 8. 8.]]
<NDArray 2x4 @cpu(0)>
A + B =
[[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
<NDArray 2x4 @cpu(0)>
A * B =
[[ 0. 8. 16. 24.]
[32. 40. 48. 56.]]
<NDArray 2x4 @cpu(0)>
3.1.5 Sum and mean of a tensor
x = nd.arange(5)
print(x)
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>
nd.sum(x) # summation of all entries/elements: 0+1+2+3+4=10
[10.]
<NDArray 1 @cpu(0)>
X = nd.ones(15).reshape(3,5)
X
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
<NDArray 3x5 @cpu(0)>
nd.sum(X) # summation of all entries/elements
[15.]
<NDArray 1 @cpu(0)>
print(nd.mean(X), nd.sum(X)/X.size) # same as nd.mean()
[1.]
<NDArray 1 @cpu(0)>
[1.]
<NDArray 1 @cpu(0)>
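Sums and means can also be taken along a chosen axis. A minimal sketch using the 3 x 5 matrix X of ones created above:
print(nd.sum(X, axis=0))   # column sums -> a vector of length 5, all 3s
print(nd.sum(X, axis=1))   # row sums    -> a vector of length 3, all 5s
print(nd.mean(X, axis=0))  # column means -> a vector of length 5, all 1s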
3.1.6 Dot-product of two vectors
The dot-product may be one of the most, if not the most, widely used operations
in scientific computation, including machine learning. We discussed it
briefly in Section 2.9. Here, we shall discuss more on its use for vectors that
may have different data structures, resulting in some subtleties, as mentioned
in Section 2.9.13.
Given two vectors a and b, their dot-product is often written in linear
algebra as aᵀb or a · b. Essentially, it is just the sum of their element-wise
products, which results in a scalar. This implies that the shapes of a and b
must be compatible: both must have the same length. Let us see some examples.
a = nd.arange(5)
b = nd.ones_like(a) * 2 #This ensures the compatibility
print(f"a={a},a.shape={a.shape} \nb={b},b.shape={b.shape}")
printx('nd.dot(a, b)')
printx('nd.dot(a, b).shape')
print(f"np.dot(a, b)={np.dot(a.asnumpy(), b.asnumpy())}")
print(f"np.dot(a.T,b)={np.dot(a.asnumpy().T, b.asnumpy())}")
printx('np.dot(a.asnumpy(), b.asnumpy()).shape')
a=
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>,a.shape=(5,)
b=
[2. 2. 2. 2. 2.]
<NDArray 5 @cpu(0)>,b.shape=(5,)
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b).shape = (1,)
np.dot(a, b)=20.0
np.dot(a.T,b)=20.0
np.dot(a.asnumpy(), b.asnumpy()).shape = ()
Note that applying transpose() to this vector has no effect, because
transpose() in numpy swaps the axes of a 2D array. A numpy 1D array
has a shape of (n,) and hence no action can be taken. A numpy 1D array is
not treated as a matrix, as discussed in Section 2.9.13. When b is a column
vector, a special case of a 2D array, it has two axes like a matrix. The
dot-product a · b is the same as the matrix product ab (where a is defined as
a row vector and b as a column vector), in terms of the resulting scalar value.
Thus, in our formulation, we do not distinguish them mathematically, and
we often use the following equality.
a · b = ab (3.1)
In computations in numpy, however, there are some subtleties. The dot-product
of two (1D array) vectors gives a scalar, while the dot-product of a
(1D array) vector with a column vector gives a 1D array with the same
scalar as its sole element. In NDArray, such subtleties are not observed.
Readers may examine the following code carefully to make sense of this.
b_c = b.reshape(-1, 1) # convert a 1D array to a column vector
print(b_c, 'b_c.shape=', b_c.shape)
printx('np.dot(a.asnumpy(),b.asnumpy())') # np dot-product (scalar)
printx('np.dot(a.asnumpy(),b_c.asnumpy())') # np dot-product (array)
print(a.asnumpy()@b_c.asnumpy()) # np matrix-product (array)
printx('nd.dot(a, b)') # nd dot-product (array)
printx('nd.dot(a, b_c)') # nd dot-product (array)
printx('nd.dot(a, b_c).shape')
[[2.]
[2.]
[2.]
[2.]
[2.]]
<NDArray 5x1 @cpu(0)> b_c.shape= (5, 1)
np.dot(a.asnumpy(),b.asnumpy()) = 20.0
np.dot(a.asnumpy(),b_c.asnumpy()) = array([20.], dtype=float32)
[20.]
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c).shape = (1,)
As seen, all of these give the same scalar value, but in different data
structures.
The dot-product of two column vectors (special matrices) a and b of
equal length is written in linear algebra as aᵀb or bᵀa, which gives the same
scalar (but in a 2D array, i.e., a matrix with only one element).
a_c = a.reshape(-1, 1) # convert a 1D array to a column vector
print(a_c),
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy())') # scalar in 2D array
printx('np.dot(b_c.asnumpy().T,a_c.asnumpy())')
printx('np.dot(b_c.asnumpy().T,a_c.asnumpy()).shape')
[[0.]
[1.]
[2.]
[3.]
[4.]]
<NDArray 5x1 @cpu(0)>
np.dot(a_c.asnumpy().T,b_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()).shape = (1, 1)
To access the scalar value in a 2D array of shape (1, 1), simply use:
print(np.dot(a_c.asnumpy().T,b_c.asnumpy())[0][0])
20.0
One may use flatten() to convert a column vector back to a 1D array (row vector).
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())')
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()) = array([20.], dtype=float32)
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape = (1,)
To access the scalar value in a 1D array of shape (1,), simply use:
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0]')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0] = 20.0
One may use ravel() to convert a multidimensional array to a 1D array (in
this case no copy of the array is made).
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()')
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0]')
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel() = array([20.], dtype=float32)
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0] = 20.0
3.1.7 Outer product of two vectors
Given two vectors a and b, the outer product a⊗b becomes a matrix; in
its (i, j) position, the element is aᵢbⱼ. Thus, the shapes of a and b are always
compatible.
a = np.arange(3)
b = np.ones(5) * 2
print(a, b)
print('np.outer=\n',np.outer(a, b))
[0 1 2] [2. 2. 2. 2. 2.]
np.outer=
[[0. 0. 0. 0. 0.]
[2. 2. 2. 2. 2.]
[4. 4. 4. 4. 4.]]
A matrix (2D array) is created using two 1D arrays of arbitrary lengths,
with the help of the np.outer() function. One can achieve the same results
using the @ operator, but a needs to be a column vector with shape (n, 1)
and b needs to be a row vector with shape (1, m); a sketch is given below. Note
that although we may get the same results, using the built-in np.outer() is
recommended, because it is usually much faster and does not need additional
operations. This recommendation applies to all other similar situations.
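A minimal sketch of the exercise mentioned above, reshaping a into a column vector and b into a row vector before applying the @ operator:
a_col = a.reshape(-1, 1)   # column vector of shape (3, 1)
b_row = b.reshape(1, -1)   # row vector of shape (1, 5)
print(a_col @ b_row)       # the same matrix as np.outer(a, b)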
3.1.8 Matrix-vector product
When the dimensionality (shape) is compatible (or made compatible via
“broadcasting”), one can obtain a matrix-vector product, using the np.dot()
function.
A35 = nd.arange(15).reshape(3,5)
d5 = nd.ones(A35.shape[1]) # get the 2nd element of shape A35
print(A35,A35.shape, d5, d5.shape)
f = nd.dot(A35, d5) # shape compatible:[3,5]X[5]->vector
# of length 3
print(f,f.shape)
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14.]]
<NDArray 3x5 @cpu(0)> (3, 5)
[1. 1. 1. 1. 1.]
<NDArray 5 @cpu(0)> (5,)
[10. 35. 60.]
<NDArray 3 @cpu(0)> (3,)
#nd.dot(d5,A35) # shape error: [5]X[3,5] not compatible
d3 = nd.ones(A35.shape[0])
nd.dot(d3,A35) # this works:[3]X[3,5]-> vector of length 5
[15. 18. 21. 24. 27.]
<NDArray 5 @cpu(0)>
3.1.9 Matrix-matrix multiplication
Further, dot-product can also be used for matrix-matrix multiplications, as
long as the shape is compatible.
A23 = nd.ones(shape=(2, 3))
B35 = nd.ones(shape=(3, 5))
print(A23,B35)
nd.dot(A23, B35) # [2,3]X[3,5]: shape compatible ->[2,5]
[[1. 1. 1.]
[1. 1. 1.]]
<NDArray 2x3 @cpu(0)>
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
<NDArray 3x5 @cpu(0)>
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
<NDArray 2x5 @cpu(0)>
#nd.dot(B35,A23) # this would give a shape error: [3,5]X[2,3] not compatible
In numpy, we have similar ways to perform matrix-matrix dot-product
operation.
import numpy as np
print('np.dot():\n',np.dot(A23.asnumpy(),B35.asnumpy()))
print('numpy @ operator:\n',A23.asnumpy() @ B35.asnumpy())
np.dot():
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
numpy @ operator:
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
Care is needed in dealing with matrix-vector and matrix-matrix operations,
because of the requirement of dimension compatibility. It is important to
always check the consistency of the dimensions of all the terms in the same
equation. Readers may need to struggle for a while to get used to it. Operations
between one-dimensional arrays in Python are rather simpler,
because there is only one dimension to check and we usually use only the
dot-product (inner product).
3.1.10 Norms
A norm is used to measure how "big" a vector or matrix is. There
are various types of norm measures, but they all produce a non-negative value.
The most often used, and the default, is the L2-norm: the
square root of the sum of the squared elements of the vector, matrix, or
tensor. For matrices, it is often called the Frobenius norm. The computation
is done by calling a norm() function:
d = nd.ones(9) # create an array (vector)
print(d,nd.sum(d))
printx('nd.norm(d)') # use nd.norm()
print(np.linalg.norm(d.asnumpy())) # use numpy linalg.norm()
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
<NDArray 9 @cpu(0)>
[9.]
<NDArray 1 @cpu(0)>
nd.norm(d) =
[3.]
<NDArray 1 @cpu(0)>
3.0
Note the difference: nd.norm() returns an NDArray, while
np.linalg.norm() gives a float.
#help(nd.norm) # when wondering, use this
print(nd.norm(A23),np.sqrt(6*1**2)) # nd.norm() for matrix,
# default L2
print(np.linalg.norm(A23.asnumpy())) # numpy linalg.norm()
# for matrix
[2.4494898]
<NDArray 1 @cpu(0)> 2.449489742783178
2.4494898
print(nd.norm(A23,ord=2, axis=1)) # nd.norm() for matrix,
# along axis 1
print(np.linalg.norm(A23.asnumpy(),ord=2, axis=1))
# numpy linalg.norm() for matrix
[1.7320508 1.7320508]
<NDArray 2 @cpu(0)>
[1.7320508 1.7320508]
The L1-norm of a vector is the sum of the absolute values of its elements.
The L1-norm of a matrix can be defined as the maximum of the
L1-norms of the column vectors of the matrix. For computing the L1-norm of a
vector, we use the following:
printx('nd.sum(nd.abs(d))') # use nd.norm() for vector
printx('nd.norm(d,1)')
nd.sum(nd.abs(d)) =
[9.]
<NDArray 1 @cpu(0)>
nd.norm(d,1) =
[9.]
<NDArray 1 @cpu(0)>
print(np.sum(np.abs(d.asnumpy()))) # numpy for vector
print(np.linalg.norm(d.asnumpy(),1)) # np.linalg.norm() for vector
9.0
9.0
print(np.linalg.norm(A23.asnumpy(),1)) # np.linalg.norm()
# for matrix
2.0
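To see that this value is indeed the maximum column sum of absolute values, a minimal sketch:
col_sums = np.sum(np.abs(A23.asnumpy()), axis=0)  # L1-norm of each column
print(col_sums, np.max(col_sums))                 # [2. 2. 2.] and the maximum, 2.0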
3.1.11 Solving algebraic system equations
We can use numpy.linalg.solve to solve a set of linear algebraic system
equations given as follows:
KD = F (3.2)
where K is a given positive-definite (PD) square matrix (stiffness matrix in
FEM, for example), F is a given vector (nodal forces in FEM), and D is the
unknown vector (the nodal displacements). The K matrix is symmetric
positive-definite (SPD) for well-posed FEM models [1].
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
K = np.array([[1.5, 1.], [1.5, 2.]]) # A square matrix
print('K:',K)
F = np.array([1, 1]) # one may try F = np.array([[2], [1]])
print('F:',F)
D = np.linalg.solve(K,F)
print('D:',D)
K: [[ 1.500 1.000]
[ 1.500 2.000]]
F: [1 1]
D: [ 0.667 0.000]
If one looks carefully, one should see that the input F is a 1D numpy array, and the
result is also a 1D array, which does not follow the convention of linear algebra, as
discussed in Section 2.9.13. One can also purposely define F as a column vector
(a 2D array with only one column), following the convention of linear algebra,
and get the solution. In this case, the returned solution has the
same values, but comes as a column vector.
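A minimal sketch of this exercise, defining F as a column vector:
F_col = np.array([[1], [1]])         # a 2D array with a single column
D_col = np.linalg.solve(K, F_col)
print(D_col, D_col.shape)            # same values as before, but with shape (2, 1)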
Note that solving linear algebraic system equations numerically can be
very time consuming and expensive, especially for large systems. With
the development of computer hardware and software in the past decades,
numerical algorithms for solving linear algebraic systems are well developed.
The most effective solvers for very large systems use iterative methods.
These convert the problem of solving algebraic equations into a minimization
problem, with a properly defined residual-error function as a cost or
loss function. A gradient-based algorithm, such as the conjugate gradient
method or Krylov-subspace methods, can then be used to minimize the residual error.
These methods are essentially the same as those used in machine learning.
The numpy.linalg.solve function uses routines from the widely used and efficient Linear
Algebra PACKage (LAPACK).
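A minimal sketch of such an iterative solution (assuming Scipy is installed; the small SPD matrix used here is hypothetical), using the conjugate gradient solver in scipy.sparse.linalg:
import numpy as np
from scipy.sparse.linalg import cg               # conjugate gradient solver
K_it = np.array([[4.0, 1.0], [1.0, 3.0]])        # a small SPD matrix
F_it = np.array([1.0, 2.0])
D_it, info = cg(K_it, F_it)                      # info = 0 indicates convergence
print(D_it, info)   # close to the direct solution np.linalg.solve(K_it, F_it)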
For a matrix that is not square, we shall use a least-squares solver
for the best solution in the sense of a minimized least-squares error. The function in
numpy is numpy.linalg.lstsq(). We will see examples later when discussing
interpolations.
3.1.12 Matrix inversion
Readers may notice that the solution to Eq. (3.2) can be written as
D = K⁻¹F    (3.3)
where K⁻¹ is the inverse of K. Therefore, if one can compute K⁻¹, the
solution is simply a matrix-vector product. Indeed, for small systems, this
approach does work, and is used by many. We now use numpy.linalg.inv()
to compute the inverse of a matrix.
from numpy.linalg import inv # import the inv() function
Kinv = inv(K)
print(Kinv)
[[ 1.333 -0.667]
[-1.000 1.000]]
print(np.allclose(np.dot(K, Kinv), np.eye(2)))
print(np.allclose(np.dot(Kinv, K), np.eye(2)))
True
True
The solution to Eq. (3.2) is obtained as follows:
D = np.dot(Kinv,F)
print('D:',D)
D: [ 0.667 0.000]
which is the same as the one obtained earlier.
Multiple matrices can be inverted at once.
a = np.array([[[1., 2.], [3., 4.]], [[1, 3], [3, 5]]])
print(a)
inv(a)
[[[ 1.000 2.000]
[ 3.000 4.000]]
[[ 1.000 3.000]
[ 3.000 5.000]]]
array([[[-2.000, 1.000],
[ 1.500, -0.500]],
[[-1.250, 0.750],
[ 0.750, -0.250]]])
a = np.array([[1.5, 1.], [1.5, 1.]])
# A singular matrix, because its
print(a) # two columns are parallel
#ainv = inv(a) # This would give a Singular matrix error
[[ 1.500 1.000]
[ 1.500 1.000]]
Note that computing numerically the inverse of a matrix of large size is
much more expensive, compared to solving the algebraic system equations.
Therefore, one would like to avoid the computation of inverse matrix. In
many cases, we can change the matrix inversion problem to a set of problems
of solving algebraic equations.
In machine learning computations, one may encounter matrices that
are singular, which do not have an inverse, leading to breakdown in
computations. More often, the matrices are nearly singular, which may allow
the computation to continue but can lead to serious error, showing as some
unexpected, strange behavior. When such behavior is observed, the chance
is high that the system matrix may be “bad-conditioned”, and one should
check for possible errors in the data or the formulation procedure that may
lead to the nearly singular system matrix. If the problem is rooted in the data
itself, one may need to clean up the data or check for data error. After this
is exhausted, one may resort to mathematical means, one of which is the use
of singular value decomposition (SVD) to get the best possible information
from the data. See later in this chapter on SVD.
The key point we would like to make here is that the most important
factor that controls whether we can solve an algebraic system equation for
quality solution is the property (or characteristics or condition) of the system
matrix. Therefore, studying the property of a matrix is of fundamental
importance.
Eigenvalues (if they exist) and their corresponding eigenvectors are the
characteristics of a matrix.
3.1.13 Eigenvalue decomposition of a matrix
A diagonalizable matrix can have an eigenvalue decomposition, which gives
a set of eigenvalues and the corresponding eigenvectors. The original matrix
can be decomposed into a diagonal matrix with eigenvalues at the diagonal
and a matrix consisting of eigenvectors. In particular, for real symmetric
matrices, eigenvalue decomposition is useful, and the computation can be
fast, because the eigenvalues are all real and the eigenvectors can be made
real and orthonormal. Consider a real symmetric square matrix A that is
positive-definite (PD). It has the following eigenvalue decomposition:
A = VΛVᵀ    (3.4)
where ᵀ stands for transpose, Λ is a diagonal matrix with the eigenvalues on the
diagonal, and matrix V is formed with the eigenvectors corresponding to the
eigenvalues. V is an orthonormal matrix, because
VVᵀ = I    (3.5)
which also implies that the inverse of V equals its transpose,
V⁻¹ = Vᵀ    (3.6)
In addition, once a matrix is decomposed, computing its inverse
becomes trivial, requiring only matrix multiplications. To see this, we start from
the definition of the inverse of a matrix, AA⁻¹ = I, and use Eqs. (3.4) and (3.6),
which leads to
A⁻¹ = VΛ⁻¹Vᵀ    (3.7)
Because the matrix is PD, its inverse exists and all the eigenvalues are
positive (hence nonzero). The inverse of the diagonal matrix Λ is simply
the same diagonal matrix with diagonal terms replaced by the reciprocals of
the eigenvalues.
Eigenvalue decomposition can be viewed as a special case of SVD. For
general matrices that we often encounter in machine learning, the SVD is
more widely used for matrix decomposition and will be discussed later in
this chapter, because it exists for all matrices.
In this section, let us see an example of how the eigenvalues and the
corresponding eigenvectors can be computed in Numpy.
import numpy as np
from numpy import linalg as lg # import linalg module
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) # Identity matrix
print('A=',A)
e, v = lg.eig(A)
print('Eigenvalues:',e)
print('Eigenvectors:\n',v)
A= [[1 0 0]
[0 1 0]
[0 0 1]]
Eigenvalues: [ 1.000 1.000 1.000]
Eigenvectors:
[[ 1.000 0.000 0.000]
[ 0.000 1.000 0.000]
[ 0.000 0.000 1.000]]
It is clearly seen that the identity matrix has three eigenvalues all of 1, and
their corresponding eigenvectors are three linearly independent unit vectors.
Let us look at a more general symmetric matrix.
A = np.array([[1, 0.2, 0], [0.2, 1, 0.5], [0, 0.5, 1]])
# Symmetric A
print('A:\n',A)
e, v = lg.eig(A)
print('Eigenvalues:',e, '\n Eigenvectors:\n',v)
A:
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
Eigenvalues: [ 1.539 1.000 0.461]
Eigenvectors:
[[-0.263 0.928 0.263]
[-0.707 -0.000 -0.707]
[-0.657 -0.371 0.657]]
We obtain three eigenvalues and their corresponding eigenvectors, and
they are all real numbers. These eigenvectors are orthonormal. To see this,
let us compute:
print(np.dot(v,v.T))
[[ 1.000 0.000 0.000]
[ 0.000 1.000 -0.000]
[ 0.000 -0.000 1.000]]
This means that these three eigenvectors are mutually orthogonal, and the
dot-product of each eigenvector with itself is unity. We are now ready
to recover the original matrix A, using these eigenvalues and eigenvectors.
print(A)
lamd = np.eye(3)*e
A_recovered = v@[email protected]
print(A_recovered)
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
It is clear that matrix A is recovered to within machine error. This
means that the information in matrix A is fully kept in its eigenvalues and
eigenvectors.
Next, we use the eigenvalues and eigenvectors to compute its inverse,
using Eq. (3.7).
lamd_inv = np.eye(3)/e
A_inv = v@[email protected]
print(A_inv)
[[ 1.056 -0.282 0.141]
[-0.282 1.408 -0.704]
[ 0.141 -0.704 1.352]]
which is the same as that obtained directly using the numpy.linalg.inv()
function:
from numpy.linalg import inv
print(inv(A))
[[ 1.056 -0.282 0.141]
[-0.282 1.408 -0.704]
[ 0.141 -0.704 1.352]]
Let us now compute the eigenvalues and eigenvectors of an asymmetric
matrix.
A = np.array([[1,-0.2, 0], [0.1, 1,-0.5], [0, 0.3, 1]])
# Asymmetric A
print('A:\n',A)
e, v = lg.eig(A)
print('Eigenvalues:',e, '\n Eigenvectors:\n',v)
A:
[[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]
[ 0.000 0.300 1.000]]
Eigenvalues: [1.+0.j 1.+0.4123j 1.-0.4123j]
Eigenvectors:
[[-0.9806+0.j 0. +0.3651j 0. -0.3651j]
[ 0. +0.j 0.7528+0.j 0.7528-0.j ]
[-0.1961+0.j -0. -0.5477j -0. +0.5477j]]
We now see one real eigenvalue, but the other two eigenvalues are
complex valued. These two complex eigenvalues are conjugates of each other.
Similar observations are made for the eigenvectors. We conclude that a real
asymmetric matrix can have complex eigenvalues and eigenvectors. Complex-valued
matrices shall in general have complex eigenvalues and eigenvectors.
A special class of complex-valued matrices, called Hermitian (self-adjoint)
matrices, has real eigenvalues. This example shows that the complex
space is algebraically closed, but the real space is not. An n by n real matrix
should have n eigenvalues (and eigenvectors), but they may not all be in the
real space. Some of them get into the complex space (which contains the real space
as a special case).
3.1.14 Condition number of a matrix
The condition number of a matrix is a measure of its "level" of singularity.
There are a number of norm options for computing the condition number,
but it is always larger than or equal to 1 for any measure. This implies that the
best possible condition number of a matrix is 1, which is the condition number of any unit
(identity) matrix, which has no (the lowest) singularity. Any other matrix shall have some
level of singularity. The condition number of a matrix with the highest level
of singularity is infinite. Let us see some examples.
from numpy import linalg as lg
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# A unit matrix
# Clearly, it has 3 eigenvalues of all 1.0
print(A, '\n Condition number of A=',lg.cond(A))
# or lg.cond(A,2)
# Option 2 -> L2-norm measure
[[1 0 0]
[0 1 0]
[0 0 1]]
Condition number of A= 1.0
Because matrix A is a unit matrix, we got a condition number of 1, as
expected. Another example is as follows:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
# A singular matrix
# It has 2 eigenvalues of all 1.0, and 1 eigenvalue of 0
print(A, '\n Condition number of A=',lg.cond(A))
[[1 0 0]
[0 1 0]
[0 0 0]]
Condition number of A= inf
Because matrix A is singular, its condition number is inf which is a numpy
number for infinity, as expected. More examples are as follows:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 10.]])
# Last entry is 10.
print(A, '\n Condition number of A=',lg.cond(A))
[[ 1.000 0.000 0.000]
[ 0.000 1.000 0.000]
[ 0.000 0.000 10.000]]
Condition number of A= 10.0
The condition number of this A is 10.0, which is 10.0/1.0. Again, the
condition number here is the ratio of the largest to the smallest eigenvalue
(for symmetric matrices; in general, it is the ratio of the largest to the smallest
singular value). We can now conclude that if the largest eigenvalue of a matrix is
very large or the smallest eigenvalue of the matrix is very small, the matrix
is likely nearly singular, depending on their ratio.
This finding implies that normalizing a matrix (which is often
done in machine learning) will not, in theory, change its condition number.
It may, however, help in reducing the loss of significant digits (because of the limited
precision of floating-point representation in computer hardware).
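A minimal sketch verifying these points, using the singular values (which equal the eigenvalues for this symmetric matrix) and a scaled copy of the matrix:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 10.]])
s = np.linalg.svd(A, compute_uv=False)    # singular values: [10., 1., 1.]
print(np.max(s)/np.min(s), lg.cond(A))    # both give 10.0
print(lg.cond(100.0*A))                   # scaling does not change it: still 10.0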
3.1.15 Rank of a matrix
We mentioned rank before. If a square matrix has full rank, all of its
columns (or rows) are mutually linearly independent, and such a matrix is
not singular. For a singular matrix, the rank is less than full, which is
called rank deficiency. One can further ask what the level of rank deficiency
is; the answer is the deficit in its rank. Essentially, if a matrix has
a rank of 2, we know that two linearly independent vectors can be formed
using the columns (or rows) of the matrix.
For a non-square matrix, the full rank is the number of its columns or
rows, whichever is smaller. Similarly, it can also have rank deficiency if the
rank is smaller than the full rank.
Let us now examine it in more detail using Numpy.
from numpy.linalg import matrix_rank, eig
print('Rank=',matrix_rank(np.eye(4))) # Identity matrix
Rank= 4
It is seen that the identity matrix has a shape of 4 × 4. It has a full rank.
A = np.array([[1,-0.2, 0], [0.1, 1,-0.5], [0.1, 1,-0.5]])
# singular A
print(A, '\n Rank=', matrix_rank(A))
[[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]
[ 0.100 1.000 -0.500]]
Rank= 2
This singular matrix has two linearly independent columns, and hence
a rank of 2. It has a rank deficiency of 1. Thus, it should also have one zero
eigenvalue, as shown below. If a matrix has a rank deficiency of n, it shall
have n zero eigenvalues. This is easily checked using Numpy.
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
eig(A)
(array([ 0.956, 0.544, -0.000]),
array([[ 0.955, -0.296, 0.088],
[ 0.209, -0.675, 0.438],
[ 0.209, -0.675, 0.894]]))
A = np.array([[1, -0.2, 0], [0.1, 1, -0.5]])
print('A:',A, '\n Rank=',matrix_rank(A))
A: [[ 1.000 -0.200 0.000]
[ 0.100 1.000 -0.500]]
Rank= 2
The matrix has only two rows, and a rank of 2. It has a full rank.
3.2 Rotation Matrix
For two-dimensional cases, the coordinate transformation (rotation) matrix
can be given as follows:
T = [ cos θ   −sin θ
      sin θ    cos θ ]    (3.8)
where T is the transformation (or rotation) matrix, and θ is the rotation
angle. A given vector (displacement, force, for example) can be written with
two components in the coordinate system as follows:
d = [u v] (3.9)
The new coordinates dθ of a vector rotated by θ can be computed
using the rotation matrix as dθ = Td.
import numpy as np
theta = 45 # Degree
thetarad = np.deg2rad(theta)
c, s= np.cos(thetarad), np.sin(thetarad)
T = np.array([[c, -s],
[s, c]])
print('Transformation matrix T:\n',T)
Transformation matrix T:
[[ 0.707 -0.707]
[ 0.707 0.707]]
d = np.array([1, 0]) # Original vector
T @ d # rotated by theta
array([ 0.707, 0.707])
T @ (T@d) # rotated by 2 thetas
array([-0.000, 1.000])
T @ T # 2 theta rotations
array([[ 0.000, -1.000],
[ 1.000, -0.000]])
T @ (T@(T @ T)) # 4 theta rotations
array([[-1.000, 0.000],
[ 0.000, -1.000]])
T@(T @ (T@(T @ (T @ (T@(T @ T)))))) # 8 theta rotations =
# no rotation
array([[ 1.000, 0.000],
[-0.000, 1.000]])
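Since T is an orthogonal matrix, its transpose is its inverse, so a rotation by −θ is obtained simply with T.T; a minimal sketch:
print(T.T @ T)        # the identity matrix (within round-off)
print(T.T @ (T @ d))  # rotating forward and then backward recovers d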
3.3 Interpolation
Interpolation is a frequently used numerical technique for obtaining
approximate data based on known data. Machine learning is, in some ways,
quite similar to interpolation. This section studies general issues related
to interpolation, using numpy. Interpolation is closely related to curve fitting.
We show here some examples of function interpolation and approximation,
using values given at discrete points in a space.
Let us try numpy.interp first. More descriptions can also be found in
Scipy documentation (https://docs.scipy.org/doc/numpy-1.13.0/reference/
generated/numpy.interp.html).
3.3.1 1-D piecewise linear interpolation using numpy.interp
# Data available
xn = [1, 2, 3] # data: given coordinates x
fn = [3, 2, 0] # data: given function values at x
# Query/Prediction f at a new location of x
x = 1.5
f = np.interp(x, xn, fn) # get approximated value at x
print(f'f({x:.3f})≈{f:.3f}')
f(1.500)≈2.500
np.interp(2, xn, fn) # Is it a data-passing interpolation?
2.0
np.interp([0, 1, 1.5, 2.72, 3.14], xn, fn) # querying at
# more points
array([ 3.000, 3.000, 2.500, 0.560, 0.000])
np.interp(4, xn, fn)
0.0
In practice, we know that interpolation can be a dangerous
operation, and hence extra care is required, especially when extrapolating.
To avoid, or at least be made aware of, extrapolation, one can set a warning value
to be returned when the query point falls outside the domain covered by the data.
out_of_domain = -109109109.0 # A warning number is used
print(np.interp(2.9, xn, fn,right=out_of_domain))
# print out the number when extrapolation occurs
print(np.interp(3.5, xn, fn,right=out_of_domain))
0.20000000000000018
-109109109.0
Interpolation using higher-order polynomials can be more accurate, but
can also cause bigger problems. Piecewise linear approximation has often been
found to be much safer, and it can be very effective when dense data are
available. Given below is an example using piecewise linear interpolation for
the approximation of a sine function.
import matplotlib.pyplot as plt # module for plot the results
x = np.linspace(0, 2*np.pi, 20) # data: x values
y = np.sin(x) # data: function values at x
xvals = np.linspace(0, 2*np.pi, 50) # generate dense x data
# at which the values are
# obtained via interpolation
yinterp = np.interp(xvals, x, y)
plt.plot(x, y, 'o') # plot the original data points
plt.plot(xvals, yinterp, '-x') # plot interpolated data points
plt.show() # show the plots
Figure 3.1: Fitted sine curve.
3.3.2 1-D least-square solution approximation
This is to fit a given set of data with a straight line in the x–y plane,
y = wx + b    (3.10)
We shall determine the gradient w and the bias b using the data pairs [xᵢ, yᵢ]. In
this example, Eq. (3.10) can be rewritten as
y = X · w    (3.11)
where X = [x, 1] and w = [w, b]. Now, we can use np.linalg.lstsq to solve
for w:
w_true,b_true = 1.0, -1.0 # used for generating data
x = np.array([0, 1, 2, 3]) # x value at which data
# will be generated
X = np.vstack([x, np.ones(len(x))]).T # Form the matrix of data
X
array([[ 0.000, 1.000],
[ 1.000, 1.000],
[ 2.000, 1.000],
[ 3.000, 1.000]])
y = w_true*x+b_true+np.random.rand(len(x))/1.0
# generate y data random noise added
print(y)
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
w, b
[-0.686 0.754 1.643 2.702]
(1.1055249814646126, -0.5550392342429096)
#help(np.linalg.lstsq) # to find out the details of this function.
import matplotlib.pyplot as plt
plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, w*x + b, 'r', label='Fitted line')
plt.legend()
plt.show()
Figure 3.2: Least square approximation of data via a straight line.
We have, in fact, created the simplest machine learning model, known as
linear regression.
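The same least-square solution can also be obtained in closed form from the normal equations, w = (XᵀX)⁻¹Xᵀy; a minimal sketch for verification:
w_ne, b_ne = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
print(w_ne, b_ne)       # the same (w, b) as returned by np.linalg.lstsq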
The Scipy package offers efficient functions for machine learning computations,
including interpolation. Let us examine some examples that are available
at the Scipy documentation (https://docs.scipy.org/doc/scipy/reference/
tutorial/interpolate.html).
3.3.3 1-D interpolation using interp1d
import numpy as np
from scipy import interpolate
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
x0, xL = 0, 18
x = np.linspace(x0, xL, num=11, endpoint=True)
# x data points
y = np.sin(-x**3/8.0) # y data points
print('x.shape:',x.shape,'y.shape:',y.shape)
f = interp1d(x, y) # linear interpolation
f2 = interp1d(x, y, kind='cubic') # Cubic interpolation
# try also quadratic
xnew = np.linspace(x0,xL,num=41,endpoint=True)
# x prediction points
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
x.shape: (11,) y.shape: (11,)
Figure 3.3: Interpolation using Scipy.
3.3.4 2-D spline representation using bisplrep
x, y = np.mgrid[-1:1:28j, -1:1:28j] # x, y data grid
z = (x**2+y**2)*np.exp(-2.0*(x*x+y*y+x*y)) # z data
plt.figure()
plt.pcolor(x, y, z, shading='auto') # plot the initial data
plt.colorbar()
plt.title("Function sampled at discrete points")
plt.show()
Figure 3.4: Spline representation using bisplrep, coarse grids.
xnew, ynew = np.mgrid[-1:1:88j, -1:1:88j] # for view at grid
tck = interpolate.bisplrep(x, y, z, s=0) # B-spline
znew = interpolate.bisplev(xnew[:,0], ynew[0,:], tck) # z value
plt.figure()
plt.pcolor(xnew, ynew, znew, shading='auto')
plt.colorbar()
plt.title("Interpolated function.")
plt.show()
Figure 3.5: Spline representation using bisplrep, fine grids.
3.3.5 Radial basis functions for smoothing and interpolation
Radial basis functions (RBFs) are useful basis functions for approximation
of functions. RBFs are distance functions, and hence work well for irregular
grids (even randomly distributed points), in high dimensions, and are often
found less prone to overfitting. They are also used for constructing meshfree
methods [2]. In using Scipy, the choices of RBFs are as follows:
• “multiquadric”: sqrt((r/self.epsilon)**2 + 1)
• “inverse”: 1.0/sqrt((r/self.epsilon)**2 + 1)
• “gaussian”: exp(-(r/self.epsilon)**2)
• “linear”: r
• “cubic”: r**3
• “quintic”: r**5
• “thin plate”: r**2 * log(r).
The default is “multiquadric”.
First, let us look at one-dimensional examples.
import numpy as np
from scipy.interpolate import Rbf, InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
# Generate data
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
x = np.linspace(0, 10, 9)
print('x=',x)
y = np.sin(x)
print('y=',y)
# fine grids for plotting the interpolated data
xi = np.linspace(0, 10, 101)
# use fitpack2 method
ius=InterpolatedUnivariateSpline(x,y) # interpolation
# function
yi = ius(xi) # interpolated values at fine grids
plt.subplot(2, 1, 1) # have 2 sub-plots plotted together
plt.plot(x, y, 'bo') # original data points in blue dots
plt.plot(xi, np.sin(xi), 'r') # original function, red line
plt.plot(xi, yi, 'g') # Spline interpolated, green line
plt.title('Interpolation using univariate spline')
plt.show()
# use RBF method
rbf = Rbf(x, y)
fi = rbf(xi)
plt.subplot(2, 1, 2) # have 2 plots plotted together
plt.plot(x, y, 'bo') # original data points in blue dots
plt.plot(xi, np.sin(xi), 'r') # original function, red line
plt.plot(xi, fi, 'g') # RBF interpolated, green line
plt.title('Interpolation using RBF - multiquadrics')
plt.show()
x= [ 0.000 1.250 2.500 3.750 5.000 6.250 7.500 8.750
10.000]
y= [ 0.000 0.949 0.598 -0.572 -0.959 -0.033 0.938 0.625
-0.544]
Figure 3.6: Comparison of interpolation using spline and radial basis function (RBF).
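Should a different RBF be preferred, it can be selected via the function keyword of Rbf. The following is a minimal sketch (the sample data are redefined for self-containment, and the epsilon value is an assumption for illustration):
import numpy as np
from scipy.interpolate import Rbf
xs = np.linspace(0, 10, 9)                    # assumed sample points
ys = np.sin(xs)                               # sampled function values
rbf_g = Rbf(xs, ys, function='gaussian', epsilon=2.0)   # Gaussian RBF
print(rbf_g(np.linspace(0, 10, 5)))           # interpolated values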
We now examine some two-dimensional examples.
import numpy as np
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
from matplotlib import cm
# 2-d tests - setup scattered data
x = np.random.rand(108)*4.0-2.0
y = np.random.rand(108)*4.0-2.0
z = (x+y)*np.exp(-x**2-y**2+x*y)
di = np.linspace(-2.0, 2.0, 108)
XI, YI = np.meshgrid(di, di)
# use RBF https://docs.scipy.org/doc/scipy/reference/
# generated/scipy.
# interpolate.Rbf.html#scipy.interpolate.Rbf
rbf = Rbf(x, y, z, epsilon=2)
ZI = rbf(XI, YI)
# plot the result
plt.pcolor(XI, YI, ZI, cmap=cm.jet, shading='auto')
plt.scatter(x, y, 88, z, cmap=cm.jet)
plt.title('RBF interpolation - multiquadrics')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.colorbar();
Figure 3.7: Two-dimensional interpolation using RBF.
RBFs can be used for interpolation in N dimensions. Below is an
example in 3D.
from scipy.interpolate import Rbf
x, y, z, d = np.random.rand(4, 20)
# randomly generated data in 0~1
# 4 arrays (x, y, z, d) with 20 points each
#print(d) # original data
rbfi = Rbf(x, y, z, d) # RBF interpolator
xi = yi = zi = np.linspace(0, 1, 10)
di = rbfi(xi, yi, zi) # interpolated values
print(di)
[-0.378 -0.058 0.450 0.832 0.825 0.665 0.559 0.512
0.430 0.497]
3.4 Singular Value Decomposition
3.4.1 SVD formulation
Singular value decomposition (SVD) is an essential tool for many numerical
operations on matrices, including signal processing, statistics, and machine
learning. It is a general factorization of a matrix (real or complex) that
may be singular and of any shape. It is very powerful, because every such matrix has an SVD, which can be found numerically. It is a generalization of the eigenvalue decomposition, which works only for diagonalizable square matrices and was discussed earlier in this chapter.
A general (real or complex, square, or not square) m × p matrix A has
the following singular value decomposition:
A = UΣV∗   (3.12)
where ∗ stands for the Hermitian (conjugate) transpose of the matrix.
• U is an m × m unitary matrix.
• Σ is an m × p rectangular diagonal matrix with non-negative real numbers
on the diagonal entries.
• V is a p × p unitary matrix.
• The diagonal entries σi in Σ are known as the singular values of A.
• The columns of U are called the left-singular vectors of A.
• The columns of V are called the right-singular vectors of A.
More detailed discussions on SVD can be found at Wikipedia (https://
en.wikipedia.org/wiki/Singular value decomposition).
3.4.2 Algorithms for SVD
Computation of SVD for large matrices can be very expensive. The
often used SVD algorithm is based on the QR decomposition (https://en.
wikipedia.org/wiki/QR decomposition) and its variations. The basic idea is
to decompose the given matrix into an orthogonal matrix Q and an upper
triangular matrix R. Readers can refer to the Wikipedia page for more details
and the leads there on the related topic. Here, we discuss a simple approach
to compute SVD based on the well-established eigenvalue decomposition.
This approach is not used for practical numerical computation of SVD, because it forms a normal matrix whose condition number is squared, leading to numerical instability issues for large systems. For our theoretical analysis
and formula derivation, this is not an issue, and thus will be used here. For
this simple approach to work, we would need to impose some condition on
matrix A.
Consider a general m×p matrix A of real numbers with m>p, and assume
it has a rank of p. Such a matrix is often encountered in machine learning.
We first form a normal matrix B:
B = AᵀA   (3.13)
which will be a p × p symmetric square matrix (smaller in size). Therefore,
B will be orthogonally diagonalizable. Because matrix A has a rank of p, B
will also be symmetric-positive-definite (SPD). Thus, B has an eigenvalue decomposition, and we perform such a decomposition. The results can be written in the form of
B = Ve Λ Veᵀ   (3.14)
where Ve is a p × p orthonormal matrix of p eigenvectors of the B matrix.
Λ is a p × p square diagonal matrix. The diagonal entries are the eigenvalues
that are positive real numbers.
On the other hand, we know that matrix A has an SVD decomposition;
we thus also have
A = UΣVᵀ   (3.15)
Because matrix A has a rank of p, the singular values in Σ shall all be positive
real numbers. Using Eq. (3.15), we have
AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣUᵀUΣVᵀ = VΣ²Vᵀ = B   (3.16)
In the above derivation, we used the fact that U is unitary: UᵀU = I, and
Σ is diagonal (not affected by the transpose). Comparing Eq. (3.16) with
Eq. (3.14), we have
V = Ve   (3.17)
Σ = √Λ   (3.18)
Using now Eq. (3.15) and the orthonormal property of V: VᵀV = I, we
have
AV = UΣ (3.19)
Because all the diagonal entries of Σ are positive (Σ is invertible), we finally obtain
U = AVΣ−1 (3.20)
It is easy to confirm that the columns of U obtained this way are orthonormal. Finally, if A is rank deficient (rank(A) < p), matrix B will have zero eigenvalues. In such cases, we simply discard all the zero eigenvalues and their corresponding eigenvectors. This still gives us an SVD in a reduced form, and the process given above still holds.
Readers may derive a similar set of equations for an m × p matrix A of real numbers with p > m, assuming it has a rank of m.
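As a quick check of Eqs. (3.13)-(3.20), the following minimal sketch (with a small, randomly generated full-column-rank matrix, an illustration only) computes an SVD through the eigenvalue decomposition of B = AᵀA:
import numpy as np
np.random.seed(0)
A = np.random.randn(6, 3)               # m=6 > p=3, full column rank
B = A.T @ A                             # Eq. (3.13)
lam, Ve = np.linalg.eigh(B)             # Eq. (3.14), ascending order
idx = np.argsort(lam)[::-1]             # re-sort in descending order
lam, Ve = lam[idx], Ve[:, idx]
V = Ve                                  # Eq. (3.17)
Sigma = np.diag(np.sqrt(lam))           # Eq. (3.18)
U = A @ V @ np.linalg.inv(Sigma)        # Eq. (3.20)
print(np.allclose(A, U @ Sigma @ V.T))  # True: A is recovered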
The above analysis proves in theory that any matrix has an SVD. In
practical computations, we usually do not use the above procedure to
compute the SVD. This is because the condition number of matrix B is
squared, as seen in Eq. (3.13), leading to numerical instability. The practical
SVD algorithms often use the QR decomposition of A, which avoids forming
B. With the theoretical foundation, we now use some simple examples to
demonstrate the SVD process using Python.
3.4.3 Numerical examples
import numpy as np
a = np.random.randn(3, 6) # matrix with random numbers
print(a)
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
u, s, vh = np.linalg.svd(a, full_matrices=True)
print(u.shape, s.shape, vh.shape)
(3, 3) (3,) (6, 6)
print('u=',u,'\n','s=',s,'\n','vh=',vh)
u= [[-0.677 0.283 0.679]
[ 0.202 0.959 -0.198]
[-0.707 0.003 -0.707]]
s= [ 3.412 2.466 1.881]
vh= [[ 0.589 -0.318 -0.544 0.048 -0.158 0.479]
[-0.083 0.216 -0.652 0.227 0.606 -0.318]
[-0.128 -0.382 -0.180 -0.867 0.166 -0.159]
[-0.510 -0.736 -0.110 0.412 -0.115 -0.066]
[ 0.300 -0.343 0.476 0.115 0.723 0.171]
[-0.529 0.218 -0.083 -0.104 0.210 0.781]]
smat = np.zeros((3, 6))
smat[:3, :3] = np.diag(s)
print(smat)
[[ 3.412 0.000 0.000 0.000 0.000 0.000]
[ 0.000 2.466 0.000 0.000 0.000 0.000]
[ 0.000 0.000 1.881 0.000 0.000 0.000]]
np.allclose(a, np.dot(u, np.dot(smat, vh)))
# Is original a recovered?
True
print(a)
print(np.dot(u, np.dot(smat, vh)))
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
We note here that the SVD of a matrix keeps the full information of
the matrix: using all these singular values and vectors, one can recover the
original matrix. What if one uses only some of these singular values (and the
corresponding singular vectors)?
3.4.4 SVD for data compression
We can use SVD to compress data, by discarding some (often many) of these
singular values and vectors of the data matrix. The following is an example
of compressing an m × n array of image data.
from pylab import imshow,gray,figure
from PIL import Image, ImageOps
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
A = Image.open('../images/hummingbird.jpg') # open an image
print(np.shape(A)) # check the shape of A, it is a 3D tensor
A = np.mean(A,2) # get 2-D array by averaging RGB values
m, n = len(A[:,0]), len(A[1])
r = m/n # Aspect ratio of the original image
print(r,len(A[1]),A.shape,A.size)
fsize, dpi = 3, 80 # inch, dpi (dots per inch, resolution)
plt.figure(figsize=(fsize,fsize*r), dpi=dpi)
gray()
imshow(A)
U, S, Vh = np.linalg.svd(A, full_matrices=True)
print(U.shape, S.shape, Vh.shape)
# Recover the image
k = 20 # use first k singular values
S = np.resize(S,[m,1])*np.eye(m,n)
Compressed_A=np.dot(U[:,0:k],np.dot(S[0:k,0:k],Vh[0:k,:]))
#print(Compressed_A.shape,'Compressed_A=',Compressed_A)
plt.figure(figsize=(fsize,fsize*r), dpi=dpi)
gray()
imshow(Compressed_A)
(405, 349, 3)
1.160458452722063 349 (405, 349) 141345
(405, 405) (349,) (349, 349)
<matplotlib.image.AxesImage at 0x203185a7080>
Figure 3.8: Reproduced image using compressed data in comparison with the original
image.
It is clear that when k = 20 (out of 349) singular values are used, the
reconstructed image is quite close to the original one. Readers may estimate
how much the storage can be saved if one keeps 10% of the singular values
(and the corresponding singular vectors), assuming the reduced quality is
acceptable.
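As a rough sketch of such an estimate (an illustration, not from the book): keeping k singular values requires storing U[:, :k], S[:k], and Vh[:k, :], i.e., about k(m + n + 1) numbers instead of m × n.
m, n = 405, 349             # image size used above
k = int(0.1*n)              # keep about 10% of the singular values
full_storage = m*n
reduced_storage = k*(m + n + 1)
print(k, reduced_storage, full_storage, reduced_storage/full_storage)
# 34 25670 141345 0.18..., i.e., roughly 18% of the original storage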
3.5 Principal Component Analysis
Principal component analysis (PCA) is an effective technique to extract
features from datasets. It was invented in 1901 by Karl Pearson [3]. It is a pro-
cedure that converts a dataset of p possibly correlated variables (raw features) into a reduced set of variables (extracted features), using an orthogonal transformation. The principal components produced via a PCA are not
linearly correlated and are sorted by their variance values. Principal compo-
nents at the top of the sorted list account for higher variability in the dataset.
It is an effective way to reduce the dimension of feature spaces of datasets,
so that machine learning models such as the neural network can work more
effectively [4, 5]. Note that PCA is known to be sensitive to the relative
scaling of the original observations, see Wikipedia (https://en.wikipedia.
org/wiki/Principal component analysis#cite note-1) for more details.
PCA can be performed at least in two ways. One is to use a regression
approach which finds the set of orthogonal axes in an iterative manner.
The other is to use eigenvalue decomposition algorithms. The following
examples use the 2nd approach.
3.5.1 PCA formulation
Consider a general m × p matrix A of real numbers with m>p. We first form
a matrix B:
B = AᵀA   (3.21)
which is a p × p symmetric square matrix with reduced size. Thus, it will be
at least semi-positive-definite, and often SPD. We can perform an eigenvalue
decomposition to it, which gives
B = VΣVᵀ   (3.22)
These decomposed matrices are as follows:
• V is a p × p orthonormal matrix of p eigenvectors of the B matrix.
• Σ is a p×p square diagonal matrix. The diagonal entries are the eigenvalues
that are non-negative real numbers.
The PCA is then given as
APCA = AV   (3.23)
which is the projection of A on these p orthonormal eigenvectors. It has the
same shape as the original A that is m × p.
One may reconstruct A using the following formula, if using all the
eigenvectors:
Ar = APCA Vᵀ = AVVᵀ = A   (3.24)
This is because the eigenvectors are orthonormal. It is often the case that the first few eigenvectors (ranked by eigenvalue in descending order) contain most of the overall information of the original matrix A. In this
case, we can use only a small number of eigenvectors to reconstruct the A
matrix. For example, if we use k ≪ p eigenvectors, we have
Ar = APCA[0:m, 0:k] Vᵀ[0:k, 0:p] = A[0:m, 0:p] V[0:p, 0:k] Vᵀ[0:k, 0:p] ≈ A   (3.25)
This will, in general, not equal the original A, but can often be very close
to it. In this case, the storage becomes m × k + k × p which can be much
smaller than the original size of m × p. In Eq. (3.25), we used the Python
syntax, and hence it is very close to that in the Python code.
Note that if matrix A has dimensions of m < p, we simply treat its
transpose in the same way mentioned above.
One can also perform a similar analysis by forming a normal matrix B
using the following equation instead:
B = AAᵀ   (3.26)
which will be an m×m symmetric square matrix of reduced size. Assuming it
is at least semi-positive-definite, we can perform an eigenvalue decomposition
to it, which gives
B = VΣVᵀ   (3.27)
In this case, these decomposed matrices are as follows:
• V is an m × m orthonormal matrix of m eigenvectors of the B matrix.
• Σ is an m × m square diagonal matrix. The diagonal entries are the
eigenvalues that are non-negative real numbers.
The PCA is then given as
APCA = VᵀA   (3.28)
It has the same shape as the original A that is m×p. One may reconstruct A
using the following formula and all the eigenvectors (that are orthonormal):
Ar = V APCA = VVᵀA = A   (3.29)
We can use only a small number of eigenvectors to reconstruct the A matrix.
For example, if we use k ≪ m eigenvectors, we have
Ar = V[0:m, 0:k] APCA[0:k, 0:p] = V[0:m, 0:k] Vᵀ[0:k, 0:m] A[0:m, 0:p] ≈ A   (3.30)
Note that for large systems, we do not really form the normal matrix
B, perform eigenvalue decomposition, and then compute V numerically.
Instead, QR-decomposition-type algorithms are used. This is because
of the instability reasons mentioned in the beginning of Section 3.4.2.
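As a quick numerical check of the second formulation in Eqs. (3.26)-(3.30), the following minimal sketch uses a small random matrix (an illustration only; as noted above, practical codes avoid forming B):
import numpy as np
np.random.seed(0)
A = np.random.randn(4, 10)               # m=4 < p=10
B = A @ A.T                              # Eq. (3.26), m x m
lam, V = np.linalg.eigh(B)               # Eq. (3.27)
idx = np.argsort(lam)[::-1]              # descending eigenvalue order
lam, V = lam[idx], V[:, idx]
A_pca = V.T @ A                          # Eq. (3.28), same shape as A
A_r = V @ A_pca                          # Eq. (3.29): V V^T A = A
print(np.allclose(A, A_r))               # True when all m vectors are kept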
3.5.2 Numerical examples
3.5.2.1 Example 1: PCA using a three-line code
We show an example of PCA code with only three lines. It is from glowing-
python (https://glowingpython.blogspot.com/2011/07/principal-component-
analysis-with-numpy.html), with permission. It is inspired by the function
princomp of MATLAB's statistics toolbox and is quite easy to follow. We modified
the code to exactly follow the PCA formulation presented above.
import numpy as np
from pylab import plot,subplot,axis,show,figure
def princomp(A):
""" PCA on matrix A. Rows: m observations; columns:
p variables. A will be zero-centered and normalized
Returns:
coeff: eig-vector of A^T A. Row-reduced observations,
each column is for one principal component.
score: the principal component - representation of A in
the principal component space.Row-observations,
column-components.
latent: a vector with the eigenvalues of A^T A.
"""
# eigenvalues and eigenvectors of covariance matrix
# modified. It was:
# M = (A-np.mean(A.T,axis=1))
# [latent,coeff] = np.linalg.eig(np.cov(M))
# score = np.dot(coeff.T,M)
A=(A-np.array([np.mean(A,axis=0)])) # subtract the mean
[latent,coeff] = np.linalg.eig(np.dot(A.T,A))
score = np.dot(A,coeff) # projection on the new space
return coeff,score,latent
Let us test the code using a 2D dataset.
# A simple 2D dataset
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
Data = np.array([[2.4,0.7,2.9,2.5,2.2,3.0,2.7,1.6,1.8,1.1,
1.6,0.9],
[2.5,0.5,2.2,1.9,1.9,3.1,2.3,2.0,1.4,1.0,
1.5,1.1]])
A = Data.T # Note: transpose to have A with m>p
print('A.T:\n',Data)
coeff, score, latent = princomp(A) # change made. It was A.T
print('p-by-p matrix, eig-vectors of A:\n',coeff)
print('A.T in the principal component space:\n',score.T)
print('Eigenvalues of A, latent=\n',latent)
figure(figsize=(50,80))
figure()
subplot(121)
# every eigenvector describes the direction of a principal
# component.
m = np.mean(A,axis=0)
plot([0,-coeff[0,0]*2]+m[0], [0,-coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(Data[0,:],Data[1,:],'ob') # the data points
axis('equal')
subplot(122)
# New data produced using the scores
plot(score.T[0,:],score.T[1,:],'*g') # Note: transpose back
axis('equal')
show()
A.T:
[[ 2.40 0.70 2.90 2.50 2.20 3.00 2.70 1.60 1.80
1.10 1.60 0.90]
[ 2.50 0.50 2.20 1.90 1.90 3.10 2.30 2.00 1.40 1.00
1.50 1.10]]
p-by-p matrix, eig-vectors of A:
[[ 0.74 -0.67]
[ 0.67 0.74]]
A.T in the principal component space:
[[ 0.82 -1.79 0.98 0.49 0.26 1.66 0.90 -0.11 -0.37
-1.16 -0.45 -1.24]
[ 0.23 -0.11 -0.33 -0.28 -0.08 0.27 -0.12 0.40 -0.18 -0.01
0.03 0.20]]
Eigenvalues of A, latent=
[ 11.93 0.58]
<Figure size 3600x5760 with 0 Axes>
Figure 3.9: Data process with PCA.
3.5.2.2 Example 2: Truncated PCA
This example is a modified PCA based on the previous code. The test is
done for an image compression application. The code is from glowingpython
(https://glowingpython.blogspot.com/2011/07/pca-and-image-compression-
with-numpy.html), with permission.
import numpy as np
def princomp(A,numpc=0):
# computing eigenvalues and eigenvectors of covariance
# matrix A
A = (A-np.array([np.mean(A,axis=0)]))
# subtract the mean (along columns)
[latent,coeff] = np.linalg.eig(np.dot(A.T,A))
#was: A = (A-np.mean(A.T,axis=1)).T # subtract the mean
#was: [latent,coeff] = np.linalg.eig(np.cov(M))
p = np.size(coeff,axis=1)
idx = np.argsort(latent) # sorting the eigenvalues
idx = idx[::-1] # in descending order
# sorting eigenvectors according to eigenvalues
coeff = coeff[:,idx]
latent = latent[idx] # sorting eigenvalues
if numpc < p and numpc >= 0:
coeff = coeff[:,range(numpc)] # cutting some PCs
#score = np.dot(coeff.T,M) # projection on the new space
score = np.dot(A,coeff)
# projection of the data on the new space
return coeff,score,latent
The following code computes the PCA of matrix A, which is a color image. It first converts image A into gray scale. After the PCA is done, different reduced numbers of principal components are used to reconstruct the image.
from pylab import imread,subplot,imshow,title,gray,figure,
show,NullLocator
from ipykernel import kernelapp as app
from PIL import Image, ImageOps
%matplotlib inline
#A = Image.open('./images/hummingbirdcapsized.jpg')
A = Image.open('../images/hummingbird.jpg') # open an image
#A = ImageOps.flip(B) # flip it if so required
# or use A = imread('./images/hummingbirdcapsized.jpg')
A = np.mean(A,2) # to get a 2-D array
full_pc = np.size(A,axis=1)
# numbers of all the principal components
r = len(A[:,0])/len(A[1])
print(r,len(A[1]),A.shape,A.size)
i = 1
dist = []
figure(figsize=(11,11*r))
for numpc in range(0,full_pc+10,50): # 0 50 100 ... full_pc
coeff, score, latent = princomp(A,numpc)
print(numpc,'coeff, score, latent \n',
coeff.shape, score.shape, latent.shape)
Ar = np.dot(score,coeff.T)+np.mean(A,axis=0)
#was:Ar = np.dot(coeff,score).T+np.mean(A,axis=0)
# difference in Frobenius.norm
dist.append(np.linalg.norm(A-Ar,'fro'))
# showing the images reconstructed with no more than 250 PCs
if numpc <= 250:
ax = subplot(2,3,i,frame_on=False)
ax.xaxis.set_major_locator(NullLocator())
ax.yaxis.set_major_locator(NullLocator())
i += 1
imshow(Ar) #imshow(np.flipud(Ar))
title('PCs # '+str(numpc))
gray()
figure()
imshow(A) #imshow(np.flipud(A))
title('numpc FULL: '+str(len(A[1])))
gray()
show()
1.160458452722063 349 (405, 349) 141345
0 coeff, score, latent
(349, 0) (405, 0) (349,)
50 coeff, score, latent
(349, 50) (405, 50) (349,)
100 coeff, score, latent
(349, 100) (405, 100) (349,)
150 coeff, score, latent
(349, 150) (405, 150) (349,)
200 coeff, score, latent
(349, 200) (405, 200) (349,)
250 coeff, score, latent
(349, 250) (405, 250) (349,)
300 coeff, score, latent
(349, 300) (405, 300) (349,)
350 coeff, score, latent
(349, 349) (405, 349) (349,)
Figure 3.10: Images reconstructed using reduced PCA components, in comparison with
the original image.
We can see that 50 principal components give a pretty good quality image,
compared to the original one.
To assess the quality of the reconstruction quantitatively, we compute
the distance of the reconstructed images from the original one in the
Frobenius norm, for different numbers of eigenvalues/eigenvectors used in
the reconstruction. The results are plotted in Fig. 3.11, with the x-axis for
the number of eigenvalues/eigenvectors used. The sum of the eigenvalues is
plotted in the blue curve, and the Frobenius norm is plotted in the red curve.
The sum of the eigenvalues relates to the level of variance contribution.
from pylab import plot, axis, cumsum
figure()
perc = cumsum(latent)/sum(latent)
dist = dist/max(dist)
plot(range(len(perc)),perc,'b',range(0,full_pc+10,50), dist,'r')
axis([0,full_pc,0,1.1])
show()
Figure 3.11: Quality of the reconstructed images.
In practical computations, the QR decomposition can be used to compute the eigenvectors V, to avoid numerical instability, as discussed earlier for SVD.
3.6 Numerical Root Finding
Module scipy.optimize offers a function fsolve() to find roots of a set of
given nonlinear equations defined by f(x) = 0, starting from estimated locations of the roots. The fsolve() function is a wrapper around the algorithms in MINPACK
that uses essentially a variant of the Newton iteration method (https://en.
wikipedia.org/wiki/Newton%27s method), which finds the root using the
Figure 3.12: The Newton Iteration: The function is shown in blue and the tangent line at
local xi is in red. We see that xi gets closer and closer to the root of the function when the
number of iterations i increases (https://en.wikipedia.org/wiki/Newton%27s method#/
media/File:NewtonIteration Ani.gif) under the CC BY-SA 3.0. (https://creativecommons.
org/licenses/by-sa/3.0/) license.
function derivative to approximate the function locally. This process can
be easily viewed from the animation nicely made by Ralf Pfeifer shown in
Fig. 3.12.
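To illustrate the idea behind the iteration in Fig. 3.12, the following is a minimal hand-rolled Newton iteration (a sketch of the principle only, not the MINPACK algorithm used by fsolve()), applied to an assumed function f(x) = x**2 - 2:
def newton(f, dfdx, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / dfdx(x)      # tangent-line correction
        x -= step
        if abs(step) < tol:        # stop when the update is tiny
            break
    return x

root = newton(lambda x: x**2 - 2, lambda x: 2*x, x0=1.5)
print(root)                        # approximately 1.41421356 (sqrt(2))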
Let us define two functions, one with a single variable and another
with two variables, as examples to demonstrate how to find the roots. For
functions with a single variable x, we define the following function that is
often encountered in structural mechanics problems:
import numpy as np
def BeamOnFoundation(bz): # Deflection of a beam on foundation
return np.exp(-bz)*(np.sin(bz)+np.cos(bz))
def f(x): # function whose root to be found
return 2*BeamOnFoundation(x/2)-1-BeamOnFoundation(x)
from scipy.optimize import fsolve
starting_guess = 5 # specify estimated location of the root
x_root=fsolve(f, starting_guess)
print('x_root=',x_root)
np.isclose(f(x_root), 0) # check if f(x_root)=0.0.
x_root= [ 1.86]
array([ True])
Let us now consider a set of two functions with two variables.
def f2d(x):
return [x[0]*np.sin(x[1])-5,x[1]*x[0]-x[1]-8]
# x is an array function-1 function-2
x_roots = fsolve(f2d, [3, 2]) # specify 2 estimated roots
print('x_roots=',x_roots)
np.isclose(f2d(x_roots), [0.0, 0.0]) # check if f(x_root)=0.0.
x_roots= [ 5.25 1.88]
array([ True, True])
Note that, in general, a polynomial (or other algebraic equation) can have complex roots, even though its coefficients are all real. This is another case where the complex space is algebraically closed, but the real space is not. A polynomial of nth order has n roots (counting multiplicity), but they may not all be in the real space; some of them lie in the complex space.
3.7 Numerical Integration
Numerical integration is one of the routine operations in computations for
practical problems in sciences and engineering. This is because only simple
functions can be analytically integrated, and one has to resort to numerical
means for real-life problems. Different types of numerical integration tech-
niques have been developed in the past, and numpy made the computation
easy to implement and use. Our discussion on this topic starts from the
classical trapezoid rule that may be familiar to many readers. More reference
materials can be found from the Scipy.integrate documentation (https://
docs.scipy.org/doc/scipy/reference/tutorial/integrate.html) and notebook.
community (https://notebook.community/sodafree/backend/build/ipyth
on/docs/examples/notebooks/trapezoid rule).
3.7.1 Trapezoid rule
The trapezoid rule for definite integration uses the following formula:
$$\int_a^b f(x)\,dx \approx \frac{1}{2}\sum_{k=1}^{n_s} (x_k - x_{k-1})\,\big(f(x_k) + f(x_{k-1})\big). \qquad (3.31)$$
We define a simple polynomial function and sample it in a finite range [a, b] at ns equally spaced sampling points.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
def f(x):
return 9*x**3-8*x**2-7*x+6 # define a polynomial function
a, b, n = -1., 2, 400 # large n for plotting the curve
x = np.linspace(a, b, n) # x at n points in [a,b]
y = f(x) # compute the function values
The function is integrated over [a, b], by sampling a small number of
points.
ns = 6 # sample ns points for integration
xint = np.linspace(a, b, ns)
yint = f(xint)
Plot the function curve and the trapezoidal (shaded) areas below it.
plt.plot(x, y, lw=2) # plot the function as a line of width 2
#plt.axis([a, b, 0, 150]) # plot x and y axes
plt.fill_between(xint, 0, yint, facecolor='gray', alpha=0.4)
# plot the shaded area over which the integration is done
plt.text((a+b)/2,12,r"$\int_a^b f(x)dx$", horizontalalignment=\
'center',fontsize=15); # use \ to change line in code
Figure 3.13: Integration using the trapezoidal rule.
The trapezoid integration computes the shaded area. Thus, it is only an approximation, as shown.
from scipy.integrate import quad # quadrature (integration)
integral, error = quad(f, a, b)
# shall give the results and the error
integral_trapezoid=sum((xint[1:]-xint[:-1])*(yint[1:]+yint[:-1]))/2
# use the trapezoid formula
print("The results should be:", integral, "+/-", error)
print("The results by the trapezoid approximation with",len(xint),
"points is:", integral_trapezoid)
The results should be: 17.25 +/- 1.9775770133077287e-13
The results by the trapezoid approximation with 6 points is:
18.240000000000002
3.7.2 Gauss integration
Gauss integration (or quadrature) is regarded as one of the most effective
numerical integration techniques. It samples the integrand function at
specific points called the Gauss points and sums up these sampled function
values weighted by the Gauss weights for these points. It can produce exact
values (to machine accuracy) for the integration of a polynomial integrand,
because the Gauss point locations are the roots of the Legendre polynomials
defined in the natural coordinates in [−1, 1].
The Gauss integration is widely applied in numerical integration if the
fixed locations of sampling points are not a concern. It is a standard
integration scheme used in the FEM [1]. Here, we show an example using the
p_roots() function available in the Scipy module to find the roots of polynomials,
and then carry out the integration.
from pylab import *
from scipy.special.orthogonal import p_roots
def gauss(f,n,a,b):
[x,w] = p_roots(n+1) # roots of the Legendre polynomial
# and weights
G=0.5*(b-a)*sum(w*f(0.5*(b-a)*x+0.5*(b+a)))
# in natural coordinates
# sample the function values at these roots and sum up.
return G
def my_f(x):
return 9*x**3-8*x**2-7*x+6 # define a polynomial function
ng = 2
integral_Gauss = gauss(my_f,ng,a,b)
print("The results should be:", integral, "+/-", error)
print("The results by the trapezoid approximation with",
len(xint),"points is:", integral_trapezoid)
print("The results by the Gauss integration with", ng,
'Gauss points:', "points is:", integral_Gauss)
The results should be: 17.25 +/- 1.9775770133077287e-13
The results by the trapezoid approximation with 6 points is:
18.240000000000002
The results by the Gauss integration with 2 Gauss points is:
17.250000000000007
It is observed that the Gauss integration gives a much more accurate
solution with a much smaller number of sampling points. In fact, the solution
is exact (within the machine error) for this example because the integrand
is a polynomial of the order of 3. We need only 2 Gauss points to obtain the
exact solution. The general formula for polynomial integrands is ng = (n + 1)/2,
where n is the order of the polynomial integrand and ng is the number
of Gauss points needed to obtain the exact solution for the integral. Note
that when the trapezoid integration rule is used with 6 sampling points, the
solution is still quite far off.
For general complicated integrand functions, Gauss integration may not
give the exact solution. The accuracy, however, will still be much better
compared to the trapezoid rule or the rectangular rule (which we did not discuss, but it is very similar to the trapezoid rule). In other words, for solutions of similar accuracy, Gauss integration uses fewer sampling points.
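For convenience, Scipy also provides a ready-made fixed-order Gauss-Legendre rule, scipy.integrate.fixed_quad(). A minimal sketch using the same polynomial integrand (the function and limits are repeated here for self-containment):
from scipy.integrate import fixed_quad
val, _ = fixed_quad(lambda x: 9*x**3 - 8*x**2 - 7*x + 6, -1.0, 2.0, n=2)
print(val)   # about 17.25, matching the hand-coded Gauss integration above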
3.8 Initial Data Treatment
Finally, let us introduce techniques often used for initial treatment for
datasets. Consider a given training dataset X ∈ Xm×p . In machine learning
models, m is the number of data-points in the dataset, and p is the number of
feature variables. The values of the data are often in a wide range for real-life
problems. For numerical stability reasons, we usually perform normalization on the given dataset before feeding it to a model. Two techniques are mainly used: min-max feature scaling and standard scaling. Such a
scaling or normalization is also called transformation in many ML modules.
3.8.1 Min-max scaling
The formulation for min-max scaling is given as follows:
$$X_{\rm scaled} = \frac{X - X.\min({\rm axis}=0)}{X.\max({\rm axis}=0) - X.\min({\rm axis}=0)} \qquad (3.32)$$
where X.min and X.max will be (row) vectors, and we used the Python
syntax of broadcasting rules and element-wise divisions. This would bring
all values for each feature into [0, 1] range. A more generalized formula that
can bring these values to an arbitrary range of [a, b] is given as follows.
$$X_{\rm scaled} = a + (b - a)\,\frac{X - X.\min({\rm axis}=0)}{X.\max({\rm axis}=0) - X.\min({\rm axis}=0)} \qquad (3.33)$$
Here, we again used Python syntax so that scalars, vectors, and matrices all appear in the same formula, as in the sketch below.
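A minimal sketch of Eq. (3.33), with an assumed target range [a, b] = [-1, 1] and a small assumed array:
import numpy as np
a, b = -1.0, 1.0                               # assumed target range
X = np.array([[-1., 2., 8.],                   # a small assumed array
              [ 2.5, 6., 1.5],
              [ 3., 11., -6.],
              [21., 7., 2.]])
X_ab = a + (b - a)*(X - X.min(axis=0))/(X.max(axis=0) - X.min(axis=0))
print(X_ab)                                    # all entries now lie in [-1, 1]
The same can be achieved with Sklearn's MinMaxScaler by setting its feature_range argument, for example MinMaxScaler(feature_range=(-1, 1)).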
Once such a scaling transformation of the training dataset is done, X.min and X.max can be used to perform exactly the same transformation on the
testing dataset to ensure consistency for proper predictions.
The following is a simple code to perform min-max scaling using
Eq.(3.32).
np.set_printoptions(precision=4)
X = [[-1, 2, 8], # an assumed toy training dataset
[2.5, 6, 1.5], # with 4 samples, and 3 features
[3, 11, -6],
[21, 7, 2]]
print(f"Original training dataset X:\n{X}")
X = np.array(X)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"Scaled training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{X.max(axis=0)}")
print(f"Minimum values for each feature:\n{X.min(axis=0)}")
Original training dataset X:
[[-1, 2, 8], [2.5, 6, 1.5], [3, 11, -6], [21, 7, 2]]
Scaled training dataset X:
[[0. 0. 1. ]
[0.1591 0.4444 0.5357]
[0.1818 1. 0. ]
[1. 0.5556 0.5714]]
Maximum values for each feature:
[21. 11. 8.]
Minimum values for each feature:
[-1. 2. -6.]
We can now perform the same transformation to the testing dataset using
X.min and X.max of the training dataset.
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = (Xtest - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"Scaled corresponding testing dataset Xtest:\n{Xt_scaled}")
Scaled corresponding testing dataset Xtest:
[[-0.0455 0.1111 0.9286]
[ 0.2727 0.2222 0.8214]]
The inverse transformation can be done with ease.
X_back = X_scaled*(X.max(axis=0) - X.min(axis=0))+ X.min(axis=0)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = Xt_scaled*(X.max(axis=0) - X.min(axis=0))+ X.min(axis=0)
print(f"Back transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
It is clearly seen that the min-max scaling does no harm to the dataset.
One can get it back as needed.
The same min-max scaling can be done using Sklearn.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X) # fit with the training dataset
X_scaled = scaler.transform(X) # perform the scaling transformation
print(f"Scaled training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{scaler.data_max_}")
print(f"Minimum values for each feature:\n{scaler.data_min_}")
Scaled training dataset X:
[[0. 0. 1. ]
[0.1591 0.4444 0.5357]
[0.1818 1. 0. ]
[1. 0.5556 0.5714]]
Maximum values for each feature:
[21. 11. 8.]
Minimum values for each feature:
[-1. 2. -6.]
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = scaler.transform(Xtest)
print(f"Scaled corresponding testing dataset:\n{Xt_scaled}")
Scaled corresponding testing dataset:
[[-0.0455 0.1111 0.9286]
[ 0.2727 0.2222 0.8214]]
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
3.8.2 “One-hot” encoding
Many ML datasets use categorical features. For example, a color variable may
have values of “red”, “green”, and “blue”. These values must be converted to numerical values for building an ML model. Consider a single-column feature vector given originally as [[green], [red], [0], [blue]]; one can simply encode this feature vector as [[1], [2], [0], [3]], where the integers are arbitrary but distinct. The treatment of a dataset coded in this manner is the same as for any ordinary dataset discussed before. However, this implies that the colors carry value significance, which may not be what we want.
To avoid such a problem, we often use the so-called “one-hot” encoding.
The single-column dataset is then encoded into a matrix X with three columns, as shown in the code below. Thus, one-hot encoding results in a significant increase in the number of feature columns, so that the features can all be made unique and the categories are not given any value significance. A minimal encoding sketch is given first, after which we scale such a dataset.
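The following is a minimal sketch of such an encoding with plain numpy (the category names and their order are hypothetical, for illustration only):
import numpy as np
colors = ['green', 'red', 'blue', 'green']     # a raw categorical column
categories = ['red', 'green', 'blue']          # fixed category order
one_hot = np.array([[1 if c == cat else 0 for cat in categories]
                    for c in colors])
print(one_hot)     # each row has a single 1 marking its category
Sklearn's OneHotEncoder (in sklearn.preprocessing) provides the same functionality for larger datasets.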
# red green blue
X = [[0, 1, 0 ], # a 'one-hot' training dataset
[1, 0, 0 ],
[0, 0, 0 ],
[0, 0, 1 ]]
scaler.fit(X)
print(f"Original 'one-hot' training dataset X:\n{X}")
X_scaled = scaler.transform(X) # perform the scaling transformation
print(f"Scaled 'one-hot' training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{scaler.data_max_}")
print(f"Minimum values for each feature:\n{scaler.data_min_}")
Original 'one-hot' training dataset X:
[[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
Scaled 'one-hot' training dataset X:
[[0. 1. 0.]
[1. 0. 0.]
[0. 0. 0.]
[0. 0. 1.]]
Maximum values for each feature:
[1. 1. 1.]
Minimum values for each feature:
[0. 0. 0.]
It is seen that the min-max scaling has not changed anything in the one-hot dataset, as expected, since its entries carry no value significance.
3.8.3 Standard scaling
When the dataset has a distribution that is close to the normal distribution,
one can use the standard scaling. The formulation for the standard scaling
is given as follows.
$$X_{\rm scaled} = \frac{X - X.{\rm mean}({\rm axis}=0)}{X.{\rm std}({\rm axis}=0)} \qquad (3.34)$$
The following is a simple code to perform standard scaling using Eq. (3.34).
X = [[-1, 2, 8], # an assumed toy training dataset
[2.5, 6, 1.5],
[3, 11, -6],
[21, 7, 2]]
print(f"Original training dataset X:\n{X}")
X = np.array(X)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(f"Standard scaled training dataset X:\n{X_scaled}")
print(f"Mean value for each feature:\n{X.mean(axis=0)}")
print(f"Standard deviation for each feature:\n{X.std(axis=0)}")
Original training dataset X:
[[-1, 2, 8], [2.5, 6, 1.5], [3, 11, -6], [21, 7, 2]]
Standard scaled training dataset X:
[[-0.8592 -1.4056 1.3338]
[-0.4515 -0.1562 0.0252]
[-0.3932 1.4056 -1.4848]
[ 1.7039 0.1562 0.1258]]
Mean value for each feature:
[6.375 6.5 1.375]
Standard deviation for each feature:
[8.5832 3.2016 4.9671]
Note that the scaled values are not confined to [−1, 1]; they approximately follow a standard normal distribution (zero mean and unit standard deviation for each feature). We can now perform the same transformation on the corresponding testing dataset using the mean and standard deviation of the training dataset.
Xtest = [[-2, 3, 7], # an assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = (Xtest - X.mean(axis=0)) / X.std(axis=0)
print(f"Standard scaled corresponding testing dataset Xtest:
\n{Xt_scaled}")
Standard scaled corresponding testing dataset Xtest:
[[-0.9757 -1.0932 1.1325]
[-0.1602 -0.7809 0.8305]]
The inverse transformation can be done with ease.
X_back = X_scaled*X.std(axis=0) + X.mean(axis=0)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = Xt_scaled*X.std(axis=0) + X.mean(axis=0)
print(f"Back transformed corresponding testing dataset Xtest:
\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed corresponding testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
It is clearly seen that the standard scaling does no harm to the dataset.
One can get it back as needed.
The same standard scaling can be done using Sklearn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # create an instance
scaler.fit(X)
X_scaled = scaler.transform(X)
print(f"Scaled dataset X:\n{X_scaled}")
print(f"Mean values for each feature:\n{scaler.mean_}")
print(f"Standard deviations for each feature:\n{np.sqrt(scaler.var_)}")
Scaled dataset X:
[[-0.8592 -1.4056 1.3338]
[-0.4515 -0.1562 0.0252]
[-0.3932 1.4056 -1.4848]
[ 1.7039 0.1562 0.1258]]
Mean values for each feature:
[6.375 6.5 1.375]
Standard deviations for each feature:
[8.5832 3.2016 4.9671]
Xtest = [[-2, 3, 7], # assumed testing dataset
[5, 4, 5.5]]
Xt_scaled = scaler.transform(Xtest)
print(f"Scaled corresponding testing dataset Xtest:\n{Xt_scaled}")
Scaled corresponding testing dataset Xtest:
[[-0.9757 -1.0932 1.1325]
[-0.1602 -0.7809 0.8305]]
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
Back transformed training dataset:
[[-1. 2. 8. ]
[ 2.5 6. 1.5]
[ 3. 11. -6. ]
[21. 7. 2. ]]
Back transformed testing dataset Xtest:
[[-2. 3. 7. ]
[ 5. 4. 5.5]]
Note that the same scaling can be applied to the labels in the training dataset, if they are not probability-distribution types of data. When performing testing on the trained model, or making predictions with it, the predicted labels should be scaled back to the original data units.
Also, it is good practice to take a look at the distribution of the data-points. This is usually done after scaling so that the range of the data-points is normalized. One may simply plot the so-called kernel density estimate (KDE) using, for example, seaborn.kdeplot(), as sketched below.
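A minimal sketch (with a stand-in column of scaled values, not a dataset from this book):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
x_col = np.random.randn(200)    # stand-in for one scaled feature column
sns.kdeplot(x_col)              # kernel density estimate of the column
plt.show()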
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-Heinemann, London, 2013.
[2] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[3] K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical
Magazine, 2(1), 559–572, 1901.
[4] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., Tubenet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[5] Shuyong Duan, Zhiping Hou, G.R. Liu et al., A novel inverse procedure via creating
tubenet with constraint autoencoder for feature-space dimension-reduction, Interna-
tional Journal of Applied Mechanics, 13(08), 2150091, 2021.
Chapter 4
Statistics and Probability-based Learning Model
This chapter discusses some topics of probability and statistics related to
machine learning models and the computation techniques using Python.
Referenced materials include codes from Numpy documentation (https://
numpy.org/doc/), Jupyter documentation (https://jupyter.org/), and
Wikipedia (https://en.wikipedia.org/wiki/Main Page). Codes from mxnet-
the-straight-dope (https://github.com/zackchase/mxnet-the-straight-dope)
are also used under the Apache-2.0 License.
Building a machine learning model is mostly for prediction, classification,
or identification, based on the data available and the knowledge about the
data. Predictions can be deterministic and probabilistic. We often want to
predict the probability of the occurrence of an event, which can be very
useful and more practical for some problems.
For example, for aircraft maintenance, the engineers might want to assess
how likely it is for the engine of the aircraft to get into an unhealthy state,
based on records and/or diagnostic data. For a doctor, he/she may want
to predict the possibility of a patient having a critical illness in the next
period of time, based on the patient’s health records, diagnostic data, and
the current health environment. Health care organizations want to predict
the likelihood of the occurrence of a pandemic. For all these types of tasks,
we need to resort to means of quantifying the probability of the occurrence of
the event. It can be a complicated topic of study and research, and machine
learning models may help.
This chapter focuses on some of the basic concepts, theories, formulations,
and computational techniques that we may need to build machine learning
models using probability and statistics. At the end of this chapter, we will
introduce a Naive Bayes classification model.
4.1 Analysis of Probability of an Event
4.1.1 Random sampling, controlled random sampling
In machine learning, one often needs to sample numbers in a random manner.
This can be done numerically. In Python, we import the random module to do so. Its functions, such as random(), can then be used to generate "simulated" random numbers.
First, let us use a code to produce random integers. We shall use
random.randint() for this, which samples numbers uniformly in a given
range.
# help(random.randint) # check it out
import random # random module
na, nb, n = 1, 100, 5 # n integers in na~nb
for i in range(n): # First, generate n
print(random.randint(na,nb),' ',end ='') # random integers
# in na~nb
print('\n')
for i in range(n): # Generate again
print(random.randint(na,nb),' ',end ='') # n random integers
86 61 35 81 40
4 92 95 31 70
We generated 5 random integers twice. These generated numbers are "random" in the sense that the two identical calls produced two different sets of numbers. One can execute the above cell multiple times and find that a different set of numbers is generated each time.
Now, let us redo the same, but this time we use random.seed() to specify
the same seed for each of the generations.
random.seed(1) # seed value 1 for random number generation
for i in range(n): # Generate n random integers
print(random.randint(na, nb),' ',end ='')
print('\n')
random.seed(1) # The same seed value (try also seed(2))
for i in range(n):
print(random.randint(na, nb),' ',end ='')
18 73 98 9 33
18 73 98 9 33
We see now that the same set of numbers is generated, which is some kind
of controlled random sampling by a seed value. The use of random.seed()
may confuse many beginners, but the above example should eliminate the confusion. Function random.seed() is used just to ensure repeatability when the code is rerun, which is important for reproducible code development. We will use it quite frequently.
Also, we see the fact that random numbers generated by a computer
are not entirely random and are controllable to a certain degree. Naturally, this should be the case, because any (classical) computer is deterministic in nature.
This pseudo-random feature is useful: when we study a probability event,
we make use of the randomness of random.randint() or random.random().
When we want our study and code to be repeatable, we make use of
random.seed().
Note that the seed value of 1 can be changed to any other number,
and with a different seed value used, a different set of random numbers
is generated.
Let us now generate real numbers.
#random.seed(1) # seed for random number generation
n = 5
for i in range(n): #generates n random real numbers
print(random.random())
0.11791870367106105
0.7609624449125756
0.47224524357611664
0.37961522332372777
0.20995480637147712
It is seen that real numbers are generated in between 0 and 1. It is
produced by generating a random integer first using random.randint() and
then dividing it by its maximum range. The reader may switch on and off
random.seed(1) or change the seed value to see the difference.
4.1.2 Probability
Probability is a numerical measure of the likelihood of the occurrence of an
event or a prediction. Assume, for example, the probability of the failure of
a structure is 0.1. We can then denote it mathematically as
Pr(failure = “yes”) = 0.1   (4.1)
In this case, there is only one random variable that takes two possible discrete
values: “yes” with probability of 0.1, and “no” with probability of 0.9. Such
a distribution of a random variable is known as the Bernoulli distribution.
For general events, there may be more possible discrete random variables
and random variables with continuous distributions. Statistics studies the
techniques for sampling, interpreting, and analyzing the data about an event.
Machine learning is based on a dataset available for an event, and thus
statistical analysis helps us to make sense of a dataset and hopefully produces
a prediction in terms of probability.
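As a minimal sketch of the Bernoulli event in Eq. (4.1) (with the assumed failure probability of 0.1), one can simulate it numerically and estimate the probability from samples:
import numpy as np
np.random.seed(1)
samples = np.random.binomial(n=1, p=0.1, size=10000)   # 1 = "yes", 0 = "no"
print(samples.mean())     # close to 0.1 for a large number of samples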
We use Python to perform statistical analysis to datasets. We first
import necessary packages, including the MXNet packages (https://gluon.
mxnet.io/).
import numpy as np # numpy package, give an alias np
import mxnet as mx # mxnet package, give an alias mx
from mxnet import nd # ndarray class from mxnet package
mx.random.seed(1) # seed for random number generation
# for repeatability of the code
Let us consider a simple event: tossing a die that has six identical surfaces,
each of which is marked with a unique digit number, from 1 to 6. In this
case, the random variable can take 6 possible discrete values. Assume that
such markings do not introduce any bias (fair die), and do not affect in any
way the outcome of a tossing. We want to know the probability of getting
a particular number on the top surface, after a number of tossings. One can
then perform “numerical” experiments: tossing the die a large number of times virtually in a computer and counting the times that the number shows
on the top surface. We use the following code to do this:
pr = nd.ones(6)/6 # probability distribution for a
# number on top. A total of 6
# values. Assume they have the
# same Pr (uniform distribution)
print(pr)
n_top_array = nd.sample_multinomial(pr,shape=(1))
# toss once using the
# sample_multinomial() function
print('The number on top surface =', n_top_array)
[0.16666667 0.16666667 0.16666667 0.16666667 0.16666667
0.16666667]
<NDArray 6 @cpu(0)>
The number on top surface =
[3]
<NDArray 1 @cpu(0)>
For this problem, we know (assumed) that the theoretical or the “true”
probability for a number showing on the top surface is 1/6 ≈ 0.1667.
The one-time toss above gives an nd-array with just one entry that is the
number on the top surface of the die. To obtain a probability, we shall toss
for many times for statistics to work. This is done by simply specifying the
length of the nd-array in the handy nd.sample multinomial() function.
n_surfaces = 6 # number of possible values
n_tosses = 18 # number of tosses
mx.random.seed(1)
toss_results = nd.sample_multinomial(pr, shape=(n_tosses))
# toss n_tosses times
print("Tossed", n_tosses,'times.')
print("Toss results", toss_results)
Tossed 18 times.
Toss results
[3 0 0 3 1 4 4 5 3 0 0 2 2 2 4 2 4 4]
<NDArray 18 @cpu(0)>
This time, we tossed 18 times, resulting in an nd-array with 18 entries.
Note that if mx.random.seed(1) is not used in the above cell, we would get a
different array each tossing, because of the random nature. For this controlled
tossing (that readers can repeat) with random.seed(1), we got 1 “5” in 18
times of tossing. We thus have Pr(die=“5”) = 1/18. We got 3 “3”s, which
gives Pr(die=“3”) = 3/18=1/6, and so on. The values of Pr(die=“5”) and
Pr(die=“3”) are quite far apart. Let us toss some more times.
n_t = 20
print(nd.sample_multinomial(pr, shape=(n_t))) #toss n_t times
[2 3 5 4 0 0 1 1 0 2 3 0 0 0 2 4 5 2 0 4]
<NDArray 20 @cpu(0)>
It is difficult to count and calculate the probabilities manually. Let us use
the following code available at the mxnet site to do so:
# The code is modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; # Under Apache-2.0 License.
#
n_tosses = 2000 # number of tosses
toss_results = nd.sample_multinomial(pr, shape=(n_tosses))
# toss, record the results
record=nd.zeros((n_surfaces,n_tosses)) # count the event (tossing)
# results: times of each of 6 surfaces appearing on top
n_digit_number = nd.zeros(n_surfaces) # Initial with zeros for an
# array to hold the probability of on-top
# appearances for each of the 6 numbers
for i, digit_number in enumerate(toss_results):
n_digit_number[int(digit_number.asscalar())] += 1
# counts and put in the
# corresponding place.
record[:,i] = n_digit_number # records the results
n_digit_number[:]=n_digit_number/n_tosses # compute the Pr
print('Total number of tosses:',n_tosses)
print('Probability of each of the 6 digits:',n_digit_number)
print('Theoretical (true) probabilities:',pr)
Total number of tosses: 2000
Probability of each of the 6 digits:
[0.1675 0.1865 0.1705 0.1635 0.15 0.162 ]
<NDArray 6 @cpu(0)>
Theoretical (true) probabilities:
[0.16667 0.16667 0.16667 0.16667 0.16667 0.16667]
<NDArray 6 @cpu(0)>
We see the probability values for all the 6 digits getting closer to the
theoretical or true probability.
import numpy as np
np.set_printoptions(suppress=True)
print(record) # print out the records
[[ 0. 0. 0. ... 333. 334. 335.]
[ 0. 0. 1. ... 373. 373. 373.]
[ 0. 1. 1. ... 341. 341. 341.]
[ 1. 1. 1. ... 327. 327. 327.]
[ 0. 0. 0. ... 300. 300. 300.]
[ 0. 0. 0. ... 324. 324. 324.]]
<NDArray 6x2000 @cpu(0)>
We now normalized the data, which we often do in machine learning, by
the total number of tosses using the following codes:
x = nd.arange(n_tosses).reshape((1,n_tosses)) + 1
#print(x)
observations = record / x # Pr of 6 digits for all tosses
print(observations[:,0]) # observations for 1st toss
print(observations[:,10]) # for first 10 toss
print(observations[:,999]) # for first 1000 toss
[0. 0. 0. 1. 0. 0.]
<NDArray 6 @cpu(0)>
[0.181819 0.272728 0.090909 0.181819 0.090909 0.181819]
<NDArray 6 @cpu(0)>
[0.175 0.185 0.16 0.164 0.144 0.172]
<NDArray 6 @cpu(0)>
This simple experiment gives us 1,000 observations for six possible values of a uniform distribution (any of the 6 digits has an equal chance to land on top). When the probability of the appearance of each of the six surfaces of the die is computed after 1,000 tosses, we get roughly 0.14 to 0.19. These probabilities will change a little each time we run the experiment because of the random nature. If we were to do 10,000 tosses for each experiment, we would get all probabilities quite close to the theoretical value of 1/6 ≈ 0.1667. Readers can try this very easily using the code given above.
Let us now plot the “numerical” experimental results. For this, we use
matplotlib library.
# The code is modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(observations[0,:].asnumpy(),label="Observed P(die=1)")
plt.plot(observations[1,:].asnumpy(),label="Observed P(die=2)")
plt.plot(observations[2,:].asnumpy(),label="Observed P(die=3)")
plt.plot(observations[3,:].asnumpy(),label="Observed P(die=4)")
plt.plot(observations[4,:].asnumpy(),label="Observed P(die=5)")
plt.plot(observations[5,:].asnumpy(),label="Observed P(die=6)")
plt.axhline(y=0.166667, color='black', linestyle='dashed')
plt.legend()
plt.show()
Figure 4.1: Probabilities obtained via finite sampling from a uniform distribution of a
fair die.
It is clear that the more experiments we do, the closer the probability gets to the theoretical value of 1/6.
The above discussion is for the very simple event of a die toss. It gives a clear view of some of the basic issues and procedures related to statistical analysis and probability computation for more complicated events.
4.2 Random Distributions
In machine learning, one often needs to sample numbers in a random manner.
Depending on the type of problem, the distribution of the data of a variable can be of different types. Numerical sampling of data shall be based on a given/assumed distribution type. We did so at the beginning of this chapter using a uniform distribution. We shall now examine this further.
4.2.1 Uniform distribution
Numbers generated based on a uniform distribution have an equal chance of landing anywhere within the specified range. To check the uniformity of the numbers generated using random.randint(), we can run it a large number of times, say 1 million, and see how these numbers are
distributed. We use the following code to do so:
# This code are modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
import numpy as np
import matplotlib.pyplot as plt
import random
na, nb, n = 0, 99, 100
counts = np.zeros(n) # Array to hold counted numbers
fig, axes = plt.subplots(2,3,figsize=(15,8),sharex=True)
axes = axes.reshape(6)
n_samples = 1000001
for i in range(1, n_samples):
counts[random.randint(na, nb)]+=1 # Random integers
if i in [10, 100, 1000, 10000, 100000, 1000000]:
axes[int(np.log10(i))-1].bar(np.arange(na+1,nb+2),counts)
plt.show()
Figure 4.2: Finite samplings from a uniform distribution.
It is observed that as the number of samples increases, the uniformity improves.
4.2.2 Normal distribution (Gaussian distribution)
The normal distribution is also called Gaussian distribution. It is widely used
in statistics because many events in nature, science, and engineering obey
this distribution. It is defined using the following Gaussian density function
of variable x:
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}    (4.2)
where μ and σ are, respectively, the mean and the standard deviation of the distribution. The normal distribution is often denoted as N(μ, σ²). In particular, when μ = 0 and σ = 1, we have the standard normal distribution denoted as N(0, 1), and its density function becomes simply p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}.
The gauss() function in Python's built-in random module can be used to conveniently generate normally distributed numbers.
from random import gauss
mu, sigma, n = 0., 0.1, 10
for i in range(n): # generates n random numbers
print(f'{gauss(mu, sigma):.4f} ', end ='')
# mu: mean; sigma: standard deviation
-0.1939 0.1794 0.0614 -0.1348 0.1020 0.0432 -0.2144 -0.0636
0.0502 -0.1377
Let us plot out the density function defined in Eq. (4.2). The bell shape of
the function may already be familiar to you.
x = np.arange(-0.5, 0.5, 0.001) # define variable x
def gf(mu,sigma,x): # define the Gauss function
return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-.5*((x-mu)/sigma)**2)
mu, sigma = 0, 0.1 # mean 0, standard deviation 0.1
plt.figure(figsize=(6, 4))
plt.plot(x, gf(mu,sigma,x))
plt.show()
Figure 4.3: A typical normal distribution (Gaussian distribution).
Let us now generate some random samples from a Gaussian distribution using np.random.normal(). We then compare the sampled data with the “true” Gaussian distribution.
n = 500
samples = np.random.normal(mu, sigma, n) #generate samples
count, bins, ignored = plt.hist(samples, 80, density=True)
# plot histogram of the samples
plt.plot(bins, gf(mu,sigma,bins),linewidth=2, color='r')
# plot true Gauss distribution
plt.show()
Figure 4.4: Sampling from a normal (Gaussian) distribution.
NumPy can generate samples from about 40 different types of distributions. Readers are referred to the numpy documentation (https://numpy.org/doc/1.16/reference/routines.random.html) for details when needed.
4.3 Entropy of Probability
For the given probabilities of the random variables of a statistical event, one can evaluate the corresponding entropy. It is a measure of the uncertainty of the probability distribution for the event, computed as the dot-product of the probability vector (which holds the probability values of a random variable) with its negative logarithm. The entropy Hp for an event with probability vector p is expressed by
H_p = -\sum_i p_i \log p_i = -\mathbf{p} \cdot \log(\mathbf{p})    (4.3)
where p_i is the probability of the ith possible value of the variable and \sum_i p_i = 1. Vector p holds these probabilities. The
negative sign is needed because entropy is positive while log(p_i) is never positive for 0 < p_i ≤ 1. In computation, we often normalize the entropy by dividing it by the total number of possible values. Entropy is used very often in machine learning, especially for constructing objective functions, because it is a measure of the uncertainty that needs to be minimized.
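The following is a minimal sketch of Eq. (4.3) using the normalization convention adopted in this chapter (division by the number of possible values); the helper name entropy_norm is chosen only for this illustration.

import numpy as np

def entropy_norm(p):
    # Eq. (4.3), normalized by the number of possible values
    p = np.asarray(p, dtype=float)
    return -np.dot(p, np.log(p)) / len(p)

print(entropy_norm([0.25, 0.25, 0.25, 0.25]))  # high uncertainty, ~0.347
print(entropy_norm([0.9, 0.05, 0.05]))         # low uncertainty,  ~0.13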
Since the logarithm is used frequently, we shall first examine it in more detail using the numpy log() function.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
p = np.arange(0.01, 1.0, .01) # generate variables
logp = -np.log(p)
# negative sign for positive value: log(p)<0 for 0<p<1
plt.plot(p,logp,color='blue')
plt.xlabel('Probability p')
plt.ylabel('Negative log(p)')
plt.title('Negative log Value of Probability')
plt.show()
Figure 4.5: A log function of probability.
Let us mention a few important features, which are the root reasons why the logarithm is used so often in machine learning.
• −log(p) varies monotonically with the argument p. This monotonicity is important because taking the logarithm of a function does not change the locations of its stationary points. This is an excellent property for the optimization algorithms that are used frequently in machine learning.
• −log(p) decays monotonically with increasing probability. This reverses the trend of the probability, which is the proper behavior for measuring entropy in the high-probability range: when the probability is high, the uncertainty level is low (we are quite certain that the event is likely to happen), and so is the value of the entropy. When the probability is low, the uncertainty should also be low (we are quite certain that the event is unlikely to happen). In that case, we simply make use of the probability itself in the entropy equation.
• Entropy Hp is a combination of both the probability and its negative
logarithm in the form of a product, as shown in Eq. (4.3). This combination
gives the needed behavior, and it is nicely defined to suit our purpose by
making use of the features of the logarithm function.
The following examples demonstrate how the entropy function works:
4.3.1 Example 1: Probability and its entropy
Consider an event with a variable that takes two values. We first make an observation, which produces a probability vector q1 whose entries are the probabilities of these two values. We then make another observation, which produces probabilities q2. We would like to evaluate the entropy of the probability for each of these two observations.
q1 =np.array([0.999, 0.001]) # Pr. distribution with low
# uncertainty: quite sure whether the event is to
# happen, because the variable is with either a
# very high or low chance to be observed.
q2 =np.array([ 0.5,0.5 ]) # Distribution with high uncertainty:
# Not sure whether the event is to happen,
# because the variable is with neither a
# high or low chance to be observed.
# - sign for getting a positive value:
print('q1=',q1,' -log(q1)=',-np.log(q1)) # log(p): negative
# for 0<p<1
print('q2=',q2,' -log(q2)=',-np.log(q2))
H_q1 = -np.dot(q1,np.log(q1))/len(q1) #Entropy: Uncertainty
H_q2 = -np.dot(q2,np.log(q2))/len(q2)
print('H_q1=',H_q1,'H_q2=',H_q2)
q1= [0.999 0.001] -log(q1)= [0.0010005 6.90775528]
q2= [0.5 0.5] -log(q2)= [0.69314718 0.69314718]
H_q1= 0.003953627556116044 H_q2= 0.34657359027997264
In this example, we see the opposite behavior of p and −log(p). Vector q1 has either a very low or a very high probability value for each of its two entries, meaning that the uncertainty is low, and hence the computed entropy is low. On the other hand, q2 has both probabilities in the middle, meaning that the outcome is very uncertain. The computed entropy is high, as expected.
4.3.2 Example 2: Variation of entropy
To show this more clearly, we create artificial events with a variable that takes two possible values, and let the probabilities of these two values be v1 and v2, which vary in opposite directions while their sum equals 1. We write the following code to compute how the entropy changes with the probabilities v1 and v2:
# An event with a variable that takes two values
v1 = np.arange(0.01, 1.0, .05)
gap = (v1[1]-v1[0])*len(v1)/3.
v1 /= (v1[0]+v1[-1]) # create an array that holds linearly
# changing probability values.
v2 = v1[::-1] # create the reverse of v1.
print(v1,np.sum(v1)) # to check it out
print(v2,np.sum(v2))
print(v1+v2,np.sum(v1+v2)/2)
xtick = range(len(v1)) #[0,1,2,3,4]
plt.bar(range(len(v1)),v1,width=gap*1.2,alpha=.9,color='blue')
plt.bar(range(len(v2)),v2,width=gap,alpha=.9,color='red')
plt.xlabel('Event ID, blue: v1, red: v2')
plt.ylabel('Probability')
plt.xticks(xtick)
plt.show()
H_qf = np.array([]) # initialize the array for entropy
for q1 in list(zip(v1,v2)):
# create a pair of probability compute the entropy and append
H_qf = np.append(H_qf,-(np.dot(q1,np.log(q1)))/2)
plt.plot(v1,H_qf)
plt.xlabel('Probability, v1 (v2=1-v1)')
plt.ylabel('Entropy of events')
plt.title('Entropy of Events')
plt.show()
[0.01030928 0.06185567 0.11340206 0.16494845 0.21649485 0.26804124
0.31958763 0.37113402 0.42268041 0.4742268 0.5257732 0.57731959
0.62886598 0.68041237 0.73195876 0.78350515 0.83505155 0.88659794
0.93814433 0.98969072] 10.0
[0.98969072 0.93814433 0.88659794 0.83505155 0.78350515 0.73195876
0.68041237 0.62886598 0.57731959 0.5257732 0.4742268 0.42268041
0.37113402 0.31958763 0.26804124 0.21649485 0.16494845 0.11340206
0.06185567 0.01030928] 9.999999999999998
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 10.0
Figure 4.6: Probabilities of 20 events each of which has two values.
Figure 4.7: Variation of entropy of the probabilities of the 20 events.
It is clear that when the probabilities v1 and v2 are both at 0.5, the entropy is the largest. The entropy is smallest at the two ends, as expected.
4.3.3 Example 3: Entropy for events with a variable that takes
different numbers of values of uniform distribution
Let us take a look at events with a variable that can take different numbers of possible values. We assume that the probability distribution of the variable is uniform for all these events. We want to find out how the entropy of the probability distribution changes with the number of possible values.
# An event with a variable that can take many values of uniform probability
N = 0
max_v = 100 # Events with N variables
# capped at max_values.
Ni = np.array([]) # For the number of values
H_qf = np.array([]) # To hold the entropy
while N < max_v:
N += 1
Ni = np.append(Ni,N)
qf = np.ones(N)
qf = qf/np.sum(qf) # uniform sample generated
H_qf = np.append(H_qf,-np.dot(qf,np.log(qf))/len(qf))
print('Probability distribution:',qf[0:max_v:10])
print('H_qf=', H_qf[0:max_v:10])
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.plot(Ni,H_qf)
plt.xlabel('Number of variables, all with same probability')
plt.ylabel('Entropy')
plt.title('Events with variables of uniform distribution')
plt.show()
Probability distribution: [0.01 0.01 0.01 0.01 0.01 0.01 0.01
0.01 0.01 0.01]
H_qf= [-0. 0.21799048 0.14497726 0.11077378 0.09057493
0.07709462 0.06739137 0.06003774 0.05425246 0.04956988]
Figure 4.8: Entropy of probability of events with uniform distributions.
We find that (1) the entropy is zero when N = 1; (2) it peaks at N = 3; and (3) when N gets very big, the entropy becomes small, implying that when the variable of an event has a very large number of possible values, the (normalized) entropy becomes small. This is because the probability of each value becomes very small under the uniform probability assumption.
4.4 Cross-Entropy: Predicted and True Probability
Let us now look at the cross-entropy, which is an often used concept in
statistics. The cross-entropy of a distribution q relative to the distribution
p is defined as follows:
H_{pq} = -\sum_i p_i \log q_i = -\mathbf{p} \cdot \log(\mathbf{q})    (4.4)
In general, the cross-entropy is a measure of the similarity of two distributions (from the same space). In machine learning, we are interested in the cross-entropy of the predicted probability q with respect to the true one, p. In this case, Hpq can be a measure of the performance of a prediction model, and hence is often used as an objective or loss function in machine learning models.
We note the following properties:
• Cross-entropy is not symmetric: Hpq ≠ Hqp if p ≠ q, which is obvious from Eq. (4.4).
• We shall have Hpq ≥ Hp and Hqp ≥ Hq (a quick numerical check is given after this list). The difference is the KL-divergence, which is always non-negative (see the following section).
• When these two distributions are the same, the cross-entropy becomes the entropy studied in the previous section, and the inequalities above become equalities.
• Therefore, in machine learning models, even if the prediction is perfect, the cross-entropy will still not be zero, because the true distribution itself may have an entropy. If Hp is the entropy of the true distribution, the cross-entropy Hpq is bounded from below by Hp. It can only be zero if the true distribution has no uncertainty at all (the probabilities of the values are all zero, except for one of them, which is 1).
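The sketch below implements Eq. (4.4) with the same per-element normalization used for the entropy above and checks the property Hpq ≥ Hp numerically; the distributions p and q are assumed values chosen only for this illustration.

import numpy as np

def cross_entropy(p, q):
    # Eq. (4.4), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.dot(p, np.log(q)) / len(p)

p = np.array([0.8, 0.2])      # a "true" distribution (assumed)
q = np.array([0.6, 0.4])      # a "predicted" distribution (assumed)
print(cross_entropy(p, q))                            # H_pq
print(cross_entropy(p, p))                            # H_p (entropy of p)
print(cross_entropy(p, q) >= cross_entropy(p, p))     # expected: True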
We look at some simple examples.
4.4.1 Example 1: Cross-entropy of a quality prediction
We examine a simple event with a variable that can take two possible values.
Assuming we have a good-quality prediction, how is it measured in terms of cross-entropy?
# A good prediction case
q_good = np.array([0.9,0.1]) # predicted Pr. of 2 values
y = np.array([0.99,0.01]) # true Pr. of 2 values
p = y # Truth
q = q_good # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
print(' Entropy: Hp=',-np.dot(p,np.log(p))/len(p),\
' Hq=',-np.dot(q,np.log(q))/len(q))
print('\nCross-entropy: Hpq=',-np.dot(p,np.log(q))/len(p),\
'Hqp=',-np.dot(q,np.log(p))/len(q))
#Cross-entropy: Hpq>Hp; Hqp>Hq should hold.
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.9 0.1] log(q)= [0.10536052 2.30258509]
Entropy: Hp= 0.028000767177423672 Hq= 0.1625414866957241
Cross-entropy: Hpq= 0.06366638 Hqp= 0.234781160
It is seen that the cross-entropy Hpq is low, indicating that the prediction q is good. Notice that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
4.4.2 Example 2: Cross-entropy of a poor prediction
Consider again a simple event with a variable that can take two possible
values. This time, we assume a poor-quality prediction, and we examine
how it is measured in cross-entropy.
# A totally-off prediction case
q_bad = np.array([0.1,0.9]) # predicted Pr. of 2 values
y = np.array([0.99,0.01]) # true Pr. of 2 values
p = y # Truth
q = q_bad # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
print(' Entropy: Hp=',-np.dot(p,np.log(p))/len(p),\
' Hq=',-np.dot(q,np.log(q))/len(p))
print('\nCross-entropy: Hpq=',-np.dot(p,np.log(q))/len(p),\
'Hqp=',-np.dot(q,np.log(p))/len(q))
#Cross-entropy: Hpq>Hp; Hqp>Hq should hold.
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.1 0.9] log(q)= [2.30258509 0.10536052]
Entropy: Hp= 0.028000767177423672 Hq= 0.1625414866957241
Cross-entropy: Hpq= 1.1403064236103417 Hqp= 2.072829100487316
It is seen that the cross-entropy Hpq is high, indicating that the prediction q is bad. Notice again that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
We are now ready to discuss the KL-divergence.
4.5 KL-Divergence
Kullback-Leibler Divergence or KL-divergence is a measure of the relative
entropy from one distribution to another. For the given two distributions p
and q, the KL-divergence from q to p is defined as
D_{KL}(p\|q) = \sum_i p_i\,[\log p_i - \log q_i] = \mathbf{p} \cdot [\log(\mathbf{p}) - \log(\mathbf{q})]    (4.5)
It is also referred to as the relative entropy of q with respect to p that can
be regarded as the true or reference distribution. Using the definitions for
the entropy and the cross-entropy, we shall have
DKL (p||q) = Hpq − Hp (4.6)
Note that the KL-divergence of q with respect to p is different from that of
p with respect to q. We have also
DKL (p||q) ≥ 0, equality holds only if p = q (4.7)
This is known as Gibbs' inequality (https://en.wikipedia.org/wiki/Gibbs%27_inequality).
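The following minimal sketch implements Eq. (4.5) with the same normalization convention and verifies the relation D_KL(p||q) = Hpq − Hp of Eq. (4.6); p and q are assumed values for illustration only.

import numpy as np

def kl_divergence(p, q):
    # Eq. (4.5), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.dot(p, np.log(p) - np.log(q)) / len(p)

p = np.array([0.8, 0.2])
q = np.array([0.6, 0.4])
H_pq = -np.dot(p, np.log(q)) / len(p)     # cross-entropy
H_p  = -np.dot(p, np.log(p)) / len(p)     # entropy of p
print(kl_divergence(p, q))                # non-negative (Gibbs' inequality)
print(H_pq - H_p)                         # the same value, per Eq. (4.6)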
Two simple examples of KL-divergence are given below.
4.5.1 Example 1: KL-divergence of a distribution
of quality prediction
We examine a simple event with a variable that can take two possible values.
Assuming we have a quality prediction of a distribution in relation to the
true distribution, how is it measured in KL-divergence?
# Good prediction case
p = y # true or reference distribution
q = q_good # prediction
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
Dpq=np.sum(np.dot(p,(np.log(p)-np.log(q))))/len(p)
Dqp=np.sum(np.dot(q,(np.log(q)-np.log(p))))/len(p)
print('Dpq=',Dpq,' Dqp=',Dqp)
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.9 0.1] log(q)= [0.10536052 2.30258509]
Dpq= 0.035665613538170556 Dqp= 0.07223967373775611
It is seen that the KL-divergences Dpq and Dqp are both positive. They both have low values, indicating that the prediction q is good. Notice that Dpq ≠ Dqp.
4.5.2 Example 2: KL-divergence of a poorly
predicted distribution
Consider again a simple event with a variable that can take two possible
values. Assume that we have a poor prediction of a distribution in relation
to the true or reference distribution. We examine how it is measured in
KL-divergence using the following code:
# A bad prediction case
p = y
q = q_bad
print('p=',p,' log(p)=',-np.log(p))
print('q=',q,' log(q)=',-np.log(q))
Dpq=np.dot(p,(np.log(p)-np.log(q)))/len(p)
Dqp=np.dot(q,(np.log(q)-np.log(p)))/len(p)
print('Dpq=',Dpq, ' Dqp=',Dqp)
p= [0.99 0.01] log(p)= [0.01005034 4.60517019]
q= [0.1 0.9] log(q)= [2.30258509 0.10536052]
Dpq= 1.112305656432918 Dqp= 1.910287613791592
It is seen again that the KL-divergences Dpq and Dqp are both positive. They both have high values, indicating that the prediction q is poor. Notice also that Dpq ≠ Dqp.
4.6 Binary Cross-Entropy
Let us finally look at the so-called binary cross-entropy used in machine
learning. For the given two distributions p and q, the binary cross-entropy
of q with respect to p is defined as
H^{B}_{pq} = -\sum_i \big[\, p_i \log q_i + (1 - p_i) \log(1 - q_i) \,\big] = -\mathbf{p} \cdot \log(\mathbf{q}) - (1 - \mathbf{p}) \cdot \log(1 - \mathbf{q})    (4.8)
In machine learning models, we usually assume that p is the true distribution, whose entries can be exactly 0 or 1 and hence cannot be passed to the logarithm. The binary cross-entropy can be viewed as a measure of the entropy of the predicted probability with respect to the true one. It takes into account both the probabilities p and q and the converse probabilities (1 − p) and (1 − q), and computes the entropy of both. It roughly doubles the cross-entropy, giving a somewhat enhanced measure of the discrepancy of the predicted distribution from the true distribution. It is often used to measure the performance of a model and serves as one type of loss function.
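A minimal sketch of Eq. (4.8), again normalized by the number of values as in the examples below, is given here; the small eps guard against log(0) is our own addition for cases where q contains exact 0 or 1.

import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    # Eq. (4.8), normalized by the number of possible values
    p, q = np.asarray(p, float), np.asarray(q, float)
    q = np.clip(q, eps, 1.0 - eps)   # avoid log(0)
    return -(np.dot(p, np.log(q)) + np.dot(1.0 - p, np.log(1.0 - q))) / len(p)

p = np.array([1.0, 0.0, 0.0, 0.0])        # truth
q = np.array([0.9, 0.04, 0.03, 0.03])     # prediction
print(binary_cross_entropy(p, q))         # ~0.0518, cf. Example 1 below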
We look at some examples.
4.6.1 Example 1: Binary cross-entropy for a distribution
of quality prediction
Consider a simple event with a variable that can take four possible values.
Assume that we have a good prediction of a distribution in relation to the
true or reference distribution. We examine how it is measured in the binary
cross-entropy using the following code:
# A good prediction case
import numpy as np
q = np.array([0.9,0.04,0.03,0.03]) # prediction
p = np.array([1.0,0.0,0.,0.]) # truth
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq = -np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq = -np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[1. 0. 0. 0.] [0.9 0.04 0.03 0.03] converse: [0. 1. 1. 1.]
[0.1 0.96 0.97 0.97]
Cross-entropy cHpq: 0.02634012891445657
Binary cross-entropy bcHpq: 0.05177523128687465
It is found that the binary cross-entropy roughly doubles the cross-
entropy value, as expected.
4.6.2 Example 2: Binary cross-entropy for a poorly
predicted distribution
Consider an event with a variable that can take four possible values. Assume
that we have a poor prediction of a distribution in relation to the true
distribution. We examine again how it is measured in the binary cross-
entropy using the following code:
# A bad prediction case
q = np.array([0.4,0.2,0.3,0.1]) # prediction
p = np.array([1.0,0.0,0.,0.]) # truth
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq = -np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq = -np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[1. 0. 0. 0.] [0.4 0.2 0.3 0.1] converse: [0. 1. 1. 1.]
[0.6 0.8 0.7 0.9]
Cross-entropy cHpq: 0.22907268296853875
Binary cross-entropy bcHpq: 0.4003674356962309
It is also found that the binary cross-entropy roughly doubles the cross-
entropy, leading to an enhanced discrepancy measure.
4.6.3 Example 3: Binary cross-entropy for more uniform
true distribution: A quality prediction
In the previous two examples, we studied two cases with the true distribution
at an extreme: its probabilities are 1.0 and zeros. For both examples, we
observed an enhanced entropy measure using the binary cross-entropy. In
this example, we consider a more even true distribution and examine the
behavior of the binary cross-entropy.
# A good prediction case
q = np.array([0.4,0.2,0.3,0.1]) # prediction
p = np.array([0.3,0.3,0.2,0.2]) # truth, rather even distribution
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse:',p_conv,q_conv)
cHpq =-np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq=-np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[0.3 0.3 0.2 0.2] [0.4 0.2 0.3 0.1] converse: [0.7 0.7 0.8
0.8] [0.6 0.8 0.7 0.9]
Cross-entropy cHpq: 0.36475754318911824
Binary cross-entropy bcHpq: 0.5856092407474651
In this case, it is found that the binary cross-entropy does not give as much of an enhancement: it increases the measure by noticeably less than a factor of two.
4.6.4 Example 4: Binary cross-entropy for more uniform
true distribution: A poor prediction
Same as the previous example, but consider a case with poor prediction.
# A bad prediction case
q = np.array([0.4,0.05,0.05,0.5]) # prediction
p = np.array([0.1,0.3,0.2,0.2]) # truth, rather even distribution
p_conv = 1.0 - p # converse side of truth
q_conv = 1.0 - q # converse side of prediction
print(p, q, ' converse: ',p_conv,q_conv)
cHpq =-np.sum(np.dot(p,np.log(q)))/len(p)
bcHpq=-np.sum(np.dot(p,np.log(q))+np.dot(p_conv,
np.log(q_conv)))/len(p)
print('Cross-entropy cHpq:',cHpq)
print('Binary cross-entropy bcHpq:',bcHpq)
[0.1 0.3 0.2 0.2] [0.4 0.05 0.05 0.5] converse: [0.9 0.7 0.8
0.8] [0.6 0.95 0.95 0.5]
Cross-entropy cHpq: 0.43203116151910004
Binary cross-entropy bcHpq: 0.7048313483737685
In this case, we find a similar behavior: the binary cross-entropy again enhances the measure, by a comparable factor.
In conclusion, our study of the simple events above shows that the binary cross-entropy enhances the discrepancy measure by taking into consideration both positive samples (with probability close to 1) and “negative” samples (with probability close to zero).
4.7 Bayesian Statistics
Consider a statistical event with more than one random variable occurring jointly. When we deal with such multiple random variables, we may want to know the joint probability Pr(A,B): the probability of both A = a and B = b occurring simultaneously, for given elements a and b.
It is clear that for any values a and b, Pr(A,B) ≤ Pr(A = a), because
Pr(A = a) is measured regardless of what happens for B. For A and B to
happen jointly, A has to happen and B also has to happen (and vice versa).
Thus, A,B cannot be more likely than A or B occurring individually.
Pr(A,B)/Pr(A) is called the conditional probability and is denoted by Pr(B|A), which is the probability that B happens under the condition that A has happened. This leads to the important Bayes' theorem.
• By construction, we have: Pr(A,B) = Pr(B|A)Pr(A).
• By symmetry, this also holds: Pr(A,B) = Pr(A|B)Pr(B).
• We thus have
Pr(A|B) = Pr(B|A) Pr(A)/Pr(B).    (4.9)
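A tiny numerical check of Eq. (4.9), with probabilities that are made up purely for illustration, is given below.

# All probability values below are assumed for this illustration only.
PrA, PrB = 0.30, 0.40          # marginal probabilities Pr(A) and Pr(B)
PrB_given_A = 0.60             # conditional probability Pr(B|A)
PrAB = PrB_given_A * PrA       # joint probability Pr(A,B) = Pr(B|A)Pr(A) = 0.18
PrA_given_B = PrB_given_A * PrA / PrB   # Bayes' theorem, Eq. (4.9)
print(PrA_given_B)             # 0.45
print(PrAB / PrB)              # the same value, from Pr(A|B) = Pr(A,B)/Pr(B)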
4.8 Naive Bayes Classification: Statistics-based Learning
4.8.1 Formulation
Based on Bayesian statistics, a popular algorithm known as the Naive Bayes classifier has been developed. Consider an event with p variables x = {x1, x2, . . . , xp} ∈ X^p. We assume that each variable xi is independent of the others. For a given label y, the conditional probability of observing x is expressed as
p(\mathbf{x}|y) = \prod_i p(x_i|y)    (4.10)
Based on Bayes’ Theorem, we have the following formula:
p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})} = \frac{\prod_i p(x_i|y)\,p(y)}{p(\mathbf{x})}    (4.11)
Although we may not know p(x) (that is the probability that x occurs in the
event), it may not be needed because it is only a matter of normalization in
computing p(y|x). Thus, we may just use the following formula instead:
p(y|\mathbf{x}) \propto p(\mathbf{x}|y)\,p(y) = \prod_i p(x_i|y)\,p(y)    (4.12)
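Before the case study, the toy sketch below illustrates Eq. (4.12) with two binary features and two classes; every probability in it is an assumed number, not taken from any dataset.

import numpy as np

p_y = np.array([0.5, 0.5])               # p(y) for classes y = 0, 1 (assumed)
p_xi_given_y = np.array([[0.9, 0.2],     # p(x1=1|y=0), p(x1=1|y=1) (assumed)
                         [0.8, 0.3]])    # p(x2=1|y=0), p(x2=1|y=1) (assumed)
x = np.array([1, 0])                     # an observed binary feature vector

score = p_y.copy()
for i, xi in enumerate(x):               # product over features, Eq. (4.12)
    score *= np.where(xi == 1, p_xi_given_y[i], 1.0 - p_xi_given_y[i])
print(score / score.sum())               # normalized p(y|x)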
4.8.2 Case study: Handwritten digits recognition
We can now use the code provided at mxnet-the-straight-dope (https://
github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter01
crashcourse/probability.ipynb) to show how a Naive Bayes classifier is coded
to identify handwritten digits. We will use the well-known MNIST dataset
to train this classifier. The MNIST dataset (https://en.wikipedia.org/wiki/MNIST_database) contains a total of 70,000 images (60,000 for training and 10,000 for testing) of the 10 handwritten digits from 0 to 9, and all these images are labeled. The images were collected from American Census Bureau employees and American high school students.
The digit classification problem is then cast as computing the probability of a given image x being digit y: p(y|x). Any image x contains p pixels xi (i = 1, 2, . . . , p), and each pixel xi can take a value of 1 (lit on) or 0 (off), and hence is a binary variable.
Equation (4.12) can then be used, in which we need to estimate p(y)
and p(xi |y). Both can be computed using the MNIST training dataset for
each digit y. For example, among the total of 60,000 images of digits in the MNIST training dataset, digit 4 is found 5,800 times, and we then have p(y = 4) = 5800/60000. To estimate p(xi|y), we can estimate p(xi = 1|y), because
xi is binary and p(xi = 0|y) = 1 − p(xi = 1|y). Estimating p(xi = 1|y) can
be done by counting the times that pixel i is on for label digit y, and then
dividing it by the number of occurrences of label y in the dataset. In this
simple algorithm, all we need is to count over the MNIST training dataset.
It is quite a straightforward strategy: the training is just counting, and we can use the following code to get this done:
4.8.3 Algorithm for the Naive Bayes classification
# The codes are modified from these at https://github.com/
# zackchase/mxnet-the-straight-dope/blob/master/chapter01_
# crashcourse/probability.ipynb; Under Apache-2.0 License.
# import all the necessary packages
import numpy as np
import mxnet as mx
from mxnet import nd
def transform(data, label): # define a function to transform the data
    return (nd.floor(data/128)).astype(np.float32), label.astype(np.float32)
# floor(255/128) = 1, so a pixel value becomes 1 when the pixel is on.
# Divide dataset to 2 sets: one for training and one for testing
mnist_train = mx.gluon.data.vision.MNIST(train=True,
transform=transform)
mnist_test = mx.gluon.data.vision.MNIST(train=False,
transform=transform)
print('type:',type(mnist_train))
type: <class 'mxnet.gluon.data.vision.datasets.MNIST'>
import matplotlib.pyplot as plt
%matplotlib inline
image_index=8888 # Check one. Any integer <60,000
# 8888 is digit with label 3, as printed below
print(mnist_train[image_index][1]) #image in 0; label in 1
plt.imshow(mnist_train[image_index][0].reshape((28, 28)).\
asnumpy(), cmap='Greys') #image pixel: 28 by 28
3.0
<matplotlib.image.AxesImage at 0x1c5743ba630>
Figure 4.9: One sample image of handwritten digit from the MNIST dataset.
# Initialize arrays for counts for computing p(y), p(xi|y)
# We initialize all numbers with a count of 1 to avoid
# division by zero, known as Laplace smoothing.
ycount = nd.ones(shape=(10)) #10 possible digits
xcount = nd.ones(shape=(784, 10)) #784 (= 28*28) variables
# Aggregate the count of the labels in training dataset
# and number of its corresponding pixels being on (value=1)
for data, label in mnist_train: # loop over the dataset
x = data.reshape((784,))
y = int(label) # get the digit-number
ycount[y] += 1 # add 1 to (digit)th entry
xcount[:, y] += x # add the image data to
# the (digit)th column
# compute the probabilities p(xi|y) (divide per pixel counts
# by total count of the label in the training dataset)
for i in range(10):
xcount[:, i] = xcount[:, i]/ycount[i]
# Compute the probability p(y)
py = ycount / nd.sum(ycount)
The model has been trained using the training dataset. We now plot the
“trained” model.
import matplotlib.pyplot as plt
%matplotlib inline
fig, figarr = plt.subplots(1, 10, figsize=(15, 15))
for i in range(10):
figarr[i].imshow(xcount[:,i].reshape((28,28)).asnumpy(),
cmap='hot')
figarr[i].axes.get_xaxis().set_visible(False)
figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(py.asnumpy(),nd.sum(py).asnumpy())
Figure 4.10: A kind of mean appearance of handwritten digits.
[0.099 0.112 0.099 0.102 0.097 0.090 0.099 0.104
0.098 0.099] [1.000]
These pictures show the estimated probability distributions of observing a switched-on pixel for each of the 10 digits. They are the mean appearances of the digits, or what each digit looks like on average based on the training dataset.
4.8.4 Testing the Naive Bayes model
We now examine the performance of this statistics-based model using the
MNIST test dataset. The training (which is just simple counting over
the training dataset) completed above gives us p(xi = 1|y) and p(y).
For a given image x from the test dataset, we compute the likelihood of the image corresponding to a label y; that is, we compute p(y|x) using Eq. (4.12), where p(x|y) is in turn computed using the trained model. To
avoid chain multiplication of small probability numbers, we compute the
following logarithms instead (known as “log-likelihood”):
\log p(y|\mathbf{x}) \propto \log p(\mathbf{x}|y) + \log p(y) = \sum_i \log p(x_i|y) + \log p(y)    (4.13)
For the given image x, a feature xi is binary and takes a value of either 1 or 0. Because we are using the trained model to compute the probabilities, we shall have
p(x_i|y) = \begin{cases} p(x_i = 1|y) & \text{for } x_i = 1 \text{ (pixel on)} \\ 1 - p(x_i = 1|y) & \text{for } x_i = 0 \text{ (pixel off)} \end{cases}    (4.14)
Equation (4.14) can be written as a single expression using a mathematical trick:
p(x_i|y) = p(x_i = 1|y)^{x_i}\,\big(1 - p(x_i = 1|y)\big)^{1 - x_i}    (4.15)
This is a general equation for computing the probability of an event with binary variables, using a trained model that predicts the probability of the positive value. We finally have
\log p(\mathbf{x}|y) = \sum_i \log p(x_i|y) = \sum_i \big[ x_i \log p(x_i = 1|y) + (1 - x_i) \log(1 - p(x_i = 1|y)) \big]    (4.16)
It is clear now that the testing essentially measures the binary cross-entropy of the distribution of a given image (the true distribution) with that of the average image of a labeled digit computed from the dataset (the model distribution). Therefore, we can write out Eq. (4.16) directly using the binary cross-entropy formula.
To avoid re-computing the logarithms repetitively, we pre-compute
log p(y) for all y, and also log p(xi |y) and log (1 − p(xi |y)) for all pixels.
logxcount = nd.log(xcount) # pre-computations
logxcountneg = nd.log(1-xcount)
logpy = nd.log(py)
fig, figarr = plt.subplots(2, 10, figsize=(15, 3))
# test and show 10 images
ctr = 0 # initialize the control iterator
y = []
pxm = np.array([])
xi = ()
for data, label in mnist_test: # for any image
x = data.reshape((784,))
y.append(int(label))
# Incorporate the prior probability p(y) since p(y|x) is
# proportional to p(x|y) p(y)
logpx = logpy.copy() #nd.zeros_like(logpy)
for i in range(10):
# compute the log probability for a digit
logpx[i] += nd.dot(logxcount[:,i],x) + nd.dot(logxcountneg[:,i],1-x)
# normalize to prevent overflow or underflow by
# subtracting
# the largest value
logpx -= nd.max(logpx)
# and compute the softmax using logpx
px = nd.exp(logpx).asnumpy()
px = px*py.asnumpy() # this proportional to P(y|x)
px /= np.sum(px)
pxm = np.append(pxm,max(px)) # use the one with max Pr.
xi = np.append(xi,np.where(px == np.amax(px)))
# bar chart and image of digit
figarr[1, ctr].bar(range(10), px)
figarr[1, ctr].axes.get_yaxis().set_visible(False)
figarr[0, ctr].imshow(x.reshape((28,28)).asnumpy(),
cmap='hot')
figarr[0, ctr].axes.get_xaxis().set_visible(False)
figarr[0, ctr].axes.get_yaxis().set_visible(False)
ctr += 1
if ctr == 10:
break
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
plt.show()
print('True label: ',y)
xi = np.array(xi)
print('Predicted digits:',xi)
print('Correct?',np.equal(y,xi))
np.set_printoptions(formatter={'float': '{: 0.1f}'.format})
print('Maximum probability:',pxm)
Figure 4.11: Predicted digits (in probability) using images from the testing dataset of
MNIST.
True label: [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
Predicted digits: [7 2 1 0 4 1 4 9 4 9]
Correct? [True True True True True True True True False True]
Maximum probability: [1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0]
4.8.5 Discussion
The test shows that this classifier made one wrong classification among the first 10 digits of the testing dataset. The 9th digit should be 5, but it was classified as 4. For this wrongly classified digit, the confidence level is still very close to 1. The wrong prediction may be due to the incorrect assumption that each pixel is generated independently, depending only on the label. Clearly, a digit is a very complicated function of the image, and statistical information alone has its limits. This type of Naive Bayes classifier was popular in the 1980s and 1990s for applications such as spam filtering. For image processing types of problems, we now have more effective classifiers (such as the CNN; for example, see Chapter 15).
Alternatively, we can use the cross-entropy or the binary cross-entropy concept to perform the prediction. Once we have obtained p(xi|yj), j = 0, 1, . . . , 9, using the training dataset, one can compute the (binary) cross-entropy between p(xi|yj) and the pixel distribution of any given image from the test dataset (or any other handwritten digit). The yj that gives the smallest (binary) cross-entropy is regarded as the predicted digit.
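A minimal sketch of this alternative is given below: it classifies one test image by choosing the digit whose mean-appearance distribution gives the smallest binary cross-entropy. It assumes that xcount and mnist_test from the code above are still available; the eps guard is our own addition.

import numpy as np

data, label = mnist_test[0]                    # one test image and its label
x = data.reshape((784,)).asnumpy()             # binary pixel values of the image
q = xcount.asnumpy()                           # 784 x 10 array of p(xi=1|y)
eps = 1e-12                                    # guard against log(0)
bce = -(x @ np.log(q + eps) + (1 - x) @ np.log(1 - q + eps)) / len(x)
print('True label:', int(label), ' Predicted:', int(np.argmin(bce)))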
This simple example shows how statistical analyses are useful in classification and in machine learning in general. We have also shown how the statistics can be computed for given datasets. Powerful Naive Bayes classifiers can be conveniently trained using the existing module in Sklearn (https://scikit-learn.org/stable/modules/naive_bayes.html) for practical problems that are heavily governed by statistics and where the physics laws are unknown, such as medical applications, recommendation systems, text classification, and real-time prediction and recommendation.
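As a hedged sketch of how such a classifier might be set up with scikit-learn, the snippet below uses BernoulliNB on placeholder binary data; in practice, X and y would be the binarized MNIST images and their labels prepared as in the code above.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.random.randint(0, 2, size=(200, 784))   # placeholder binary "images"
y = np.random.randint(0, 10, size=200)         # placeholder digit labels
clf = BernoulliNB(alpha=1.0)                   # alpha=1.0 gives Laplace smoothing
clf.fit(X, y)
print(clf.predict(X[:5]), y[:5])               # sanity check on the training data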
Chapter 5
Prediction Function and
Universal Prediction Theory
To build an ML model for predictions, one needs to use some hypothesis,
which predefines the prediction function to connect the feature variables to
the learning parameters. Thus, a proper hypothesis may follow the function
approximation theory, which has been studied intensively in physics-law-
based models [1–4]. The most essential rule is that the prediction function
must be capable of predicting an arbitrary linear function in the feature space
by a chosen set of learning parameters. Therefore, the prediction function is
assumed as a complete linear function of the feature variables, and it is
one of the most basic hypotheses.
It turns out such a complete linear prediction function performs affine
transformations of patterns in the affine space. Such a transformation
preserves affinity, meaning that the ratios of distances (Euclidean)
between points lying on a straight line and the parallelism of parallel line
segments remain unchanged after the transformation. It may not preserve the
angles between line segments and the (Euclidean) distances between points
in the original pattern. Further discussion on more general issues with affine
transformations can be found in Wikipedia (https://en.wikipedia.org/wiki/
Affine transformation) and the links therein.
The affinity ensures a special unique point-to-point seamless and gapless
transformation, meaning that it does not merge two distinct points to one, or
split one point to two, when the learning parameters are varying smoothly.
The connection between the feature variables and the learning parameters
is also smooth. Because an affine transformation is a combination of a linear
transformation and a translation controlled by the learning parameters, any
function up to the first order in the feature space can be reproduced, which
is critically important for machine learning models to be predictive.
This chapter first discusses the formulation and predictability of prediction functions, followed by a discussion of the detailed process, properties, and behavior of affine transformations. We shall focus on two aspects: (1) the capability of predicting functions in the feature space, and (2) the affine transformation of patterns in the affine space.
Then, the concept of affine transformation unit (ATU) (or linear predic-
tion function unit) is introduced as a building block, and simple neural net-
work codes will be built to perform affine transformations and demonstrate
their behavior and property. Feature encodings by learning parameters and
the uniqueness of the encodings are then studied, demonstrating the concept
of data-parameter converter. Next, an extension of ATU to form an affine
transformation array (ATA) and further extensions of the activation function
wrapped ATA to form MLPs or deepnets will be studied, which shows how
the predictability of a high-order nonlinear function can be established with
a deepnet. Finally, a Universal Prediction Theory is presented offering the
fundamental basis of why a deepnet can be made predictive.
5.1 Linear Prediction Function and Affine Transformation
Figure 5.1 shows the mechanism of the transmission of neurotransmitters in a
synaptic cleft of sensory neurons. Molecules of a neurotransmitter are shown
in a pseudo-colored image from a scanning electron microscope. A terminal
button (green) has been opened to reveal the synaptic vesicles (orange and
blue) inside the neurotransmitter.
Figure 5.1: (a) Transmission of neurotransmitters; (b) a pseudo-colored image of a
neurotransmitter from a scanning electron microscope: a terminal button (green) has been
opened to reveal the synaptic vesicles (orange and blue) inside; (c) a typical affine trans-
formation in xw formulation in an artificial NN. (Modified based on the image given in the
Psychology book (2d) by Rose M. Spielman et al. under the CC BY 4.0 License). (Credit b:
modification of work by Tina Carvalho, NIH-NIGMS; scale-bar data from Matt Russell).
In an artificial NN, we use a so-called affine transformation between the
data-points and the learning parameters in an artificial neuron to somehow
mimic the information transformation process in a neurotransmitter. This
transformation is of most fundamental importance in many ML models, and
thus the related theory is discussed in great detail in this chapter.
5.1.1 Linear prediction function: A basic hypothesis
In machine learning models, the basic hypothesis is that the prediction
function z is given by the following equation:
z(\hat{w}; x) = \bar{x}\,\hat{w} = [1 \;\; x]\begin{bmatrix} b \\ w \end{bmatrix} = x\,w + b = W_1 x_1 + \cdots + W_p x_p + b = \sum_{i=1}^{p} W_i x_i + b    (5.1)
where z(ŵ; x) reads as “z is a function of ŵ for given x”, and all vectors are
defined as
x = [x_1, x_2, \ldots, x_p] \in \mathbb{X}^p \subset \mathbb{R}^p
\bar{x} = [1 \;\; x] = [x_0, x_1, x_2, \ldots, x_p] \in \bar{\mathbb{X}}^p \subset \mathbb{R}^{p+1}
w^\top = [W_1, W_2, \ldots, W_p] \in \mathbb{W}^p \subset \mathbb{R}^p
\hat{w}^\top = [b \;\; w^\top] = [W_0, W_1, W_2, \ldots, W_p] \in \mathbb{W}^{p+1} \subset \mathbb{R}^{p+1}    (5.2)
in which xi (i = 1, 2, . . . , p) are feature variables, and here we use only the
linear basis functions; W i (i = 1, 2, . . . , p) ∈ R are called weights. Constant
b ∈ R is called the bias, and is also often denoted as W0 ∈ R. These real numbers form vectors in the corresponding spaces (defined in Chapter 1) and are used in the operations above. The hat over w indicates that the constant basis term has been included: it absorbs the bias and becomes a new vector in the hypothesis space W^{p+1}. Both the weights and the bias are parameters that can be tuned to predict exactly a desired arbitrary linear function in the feature space. The relations between these spaces are discussed in Chapter 1. Notice the transpose used for the learning parameters, which implies that they form a matrix (with only one column in this case). The features form row vectors. This is why the learning parameter matrix acts from the right on the feature vector.
We intentionally put the most used forms of the prediction function together in Eq. (5.1) as a single unified expression, so that the relationship between all
these variables can be made clear once and for all. Readers may take a moment to digest this formulation, so that the later formulations can be understood more easily.
When we write z = xw + b, we call it the xw+b formulation. When we write z = x̄ŵ, in which the bias b is absorbed into ŵ, we call it the xw formulation. Both formulations will be used interchangeably in this book, because they are essentially the same. The xw+b formulation allows explicit viewing of the roles of the weights and biases separately during analysis. The xw formulation is more concise in derivation processes, and also allows explicit expressions of affine transformations, which are essential for major machine learning models.
5.1.2 Predictability for constants, the role of the bias
Note that the bias b is a must-have. If we set b = 0 and use only w, the hypothesis will not even be able to predict a constant function. This can easily be proven as follows.
Consider a given (label) function y(x) = c, where c is a given constant
in R independent of x. This means that at x = 0, y(x = 0) = c. In order to
predict c using Eq. (5.1) we must have z(w, b; x = 0) = c. Now, if we drop
b in Eq. (5.1), regardless of what we choose for w, the hypothesis always
predicts
z =0·w =0 (5.3)
This means that the constant c ∈ R will never be predicted by the hypothesis
without b. This means also that a pure linear transformation through w is
insufficient for proper prediction, because it cannot even predict constants.
On the other hand, when b is there, we simply choose b = c, and the
constant c is then produced by the hypothesis. This also implies that z must live in an affine space X̄^p, an augmented feature space that lives within X^{p+1}.
5.1.3 Predictability for linear functions: The role of the weights
Further, by proper choices of w and b, any linear function can be produced
using Eq. (5.1). This can also be easily proven as follows.
Consider any given (label) linear function y ∈ Y of variables x ∈ Xp :
y(x) = xk + c (5.4)
where c ∈ R is a given constant and k is a (column) vector in W^p. Note that k may in general be in R^p; however, since we need to perform vector operations on it, it must be confined to the vector space W^p.
By simply choosing w∗ = k, and b∗ = c, we obtain
z(w∗ , b∗ ; x) = xk + c = y(x) (5.5)
The given linear function is predicted exactly, using such a particular choice
of w∗ and b∗ . This means that any given arbitrary linear function of x ∈ Xp
can be predicted using hypothesis Eq. (5.1).
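A quick numerical check of Eq. (5.5) is sketched below: choosing w* = k and b* = c reproduces the label function y(x) = xk + c exactly (the values of k, c, and the test points are arbitrary).

import numpy as np

p = 3
k = np.array([2.0, -1.0, 0.5])     # gradient of the label function (assumed)
c = 4.0                            # constant part of the label function (assumed)
w, b = k, c                        # the choice w* = k, b* = c
x = np.random.rand(5, p)           # a few random feature points
print(np.allclose(x @ w + b, x @ k + c))   # True: prediction matches the label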
5.1.4 Prediction of linear functions: A machine
learning procedure
The above process showed that predicting a label linear function using Eq. (5.1) is straightforward, because we can choose the learning parameters by inspection. For more complicated problems, this is not possible, and we
would use a minimization process to find these learning parameters. Here,
we demonstrate such a process to find w∗ and b∗ . This time, let us use the
xw formulation. We rewrite Eq. (5.4) as
y(x) = \bar{x}\,\hat{k}    (5.6)
where \hat{k} = [c \;\; k] \in \mathbb{W}^{p+1} \subset \mathbb{R}^{p+1}.
Step 1: define a loss function in terms of the learning parameters. The loss
function shall evaluate the error between the hypothesis Eq. (5.1) and the
label function Eq. (5.6). It can have various forms. The widely used one is
the L2 error function that is the error squared:
L(z(\hat{w})) = \big[z(\hat{w}; x) - y(x)\big]^2
= (\bar{x}\hat{w} - \bar{x}\hat{k})^\top (\bar{x}\hat{w} - \bar{x}\hat{k})    (5.7)
= (\hat{w} - \hat{k})^\top [\bar{x}^\top \bar{x}]\, (\hat{w} - \hat{k})
It is clear that the loss function L(z) is a scalar function of the prediction function z(ŵ), which is in turn a function of the vector of learning parameters ŵ. Therefore, L(z) is in fact a functional. It takes a vector ŵ ∈ W^{p+1} and produces a non-negative number in R. It is also quadratic in ŵ.
In the second line of Eq. (5.7), we first moved the transpose into the first pair of parentheses, and then factored x̄ᵀ and x̄ out of the two pairs of parentheses to form the outer-product matrix [x̄ᵀx̄], which is a (p + 1) × (p + 1) symmetric matrix of rank 1. All of this follows the matrix operation rules. Note that (ŵ − k̂) is a vector. Therefore, Eq. (5.7) is a standard quadratic form. If [x̄ᵀx̄] were SPD, L would have a unique minimum at (ŵ − k̂) = 0, or ŵ* = k̂. This would prove that the prediction function is capable of reproducing any linear
function uniquely in the feature space, and we would be done. However, because [x̄ᵀx̄] has only rank 1, we need to manipulate a little further for deeper insight.
Step 2: minimize the loss function with respect to ŵ ∈ Wp+1 .
This allows us to find the ŵ that is the stationary point of L, by setting the gradient to zero:
\frac{\partial L(\hat{w})}{\partial \hat{w}} = 2\,\bar{x}^\top \bar{x}\,(\hat{w} - \hat{k}) = 0    (5.8)
In the above, we again used the fact that x̄ᵀx̄ is symmetric. Regardless of its contents, Eq. (5.8) is satisfied when we set ŵ = ŵ* = k̂, which gives w* = k and b* = c. This is exactly the same as the result obtained previously. It proves that the prediction function is capable of reproducing any linear function in the feature space. Because x̄ᵀx̄ is rank deficient (rank = 1), the solution ŵ* = k̂ is not unique: there are other (possibly infinitely many) solutions that satisfy Eq. (5.8), and they differ from k̂ by vectors in the null-space of x̄ᵀx̄. This implies that we need more data-points to make the null-space vanish and obtain a unique solution. All these issues relate to the solution existence theory that demands sufficient quality data-points, which will be discussed in Chapter 9.
We have now analytically solved an ML problem using a typical minimization procedure to predict a continuous function. The problem we just examined is simple, but our analysis reveals the essential issues in an ML model using prediction functions. For more complicated problems with datasets of discrete data-points, we usually need computational means to solve them, but the essential concept is the same.
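A minimal numerical version of the same minimization is sketched below, done with gradient descent on the L2 loss over several data-points (so that x̄ᵀx̄ is no longer rank 1 and the solution becomes unique); all values are assumed for illustration.

import numpy as np

np.random.seed(1)
p, m = 2, 20
k_hat = np.array([0.5, 2.0, -1.0])          # [c, k]: the label parameters (assumed)
X = np.hstack([np.ones((m, 1)), np.random.rand(m, p)])   # rows are [1, x]
y = X @ k_hat                               # labels from the linear function

w_hat = np.zeros(p + 1)                     # initial learning parameters
lr = 0.1                                    # learning rate
for _ in range(2000):                       # simple gradient-descent loop
    grad = 2 * X.T @ (X @ w_hat - y) / m    # gradient of the mean L2 loss
    w_hat -= lr * grad
print(w_hat)                                # converges toward [c, k]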
5.1.5 Affine transformation
On the other hand, Eq. (5.1) can be used to perform an affine transfor-
mation, where weights wi (i = 1, 2, . . . , p) are responsible for (pure) linear
transformation and bias b is responsible for translation. Both wi and b are
learning parameters in a machine learning model. To show how Eq. (5.1) is
explicitly used to perform an affine transformation, we perform the following
maneuver in matrix formulations:
First, using each ŵi (i = 1, 2, . . . , k) and Eq. (5.1), we obtain
z_i = \bar{x}\,\hat{w}_i    (5.9)
Now, form the following vector,
z = [z_1, z_2, \ldots, z_k] = \left[\, \bar{x}\begin{bmatrix} b_1 \\ w_1 \end{bmatrix},\; \bar{x}\begin{bmatrix} b_2 \\ w_2 \end{bmatrix},\; \ldots,\; \bar{x}\begin{bmatrix} b_k \\ w_k \end{bmatrix} \right] = \bar{x}\begin{bmatrix} b \\ W \end{bmatrix} = \bar{x}\begin{bmatrix} W_0 \\ W \end{bmatrix}, \quad \text{or simply} \quad z = \bar{x}\,\hat{W}    (5.10)
where b = [b_1, b_2, \ldots, b_k], W_0 = [W_{01}, W_{02}, \ldots, W_{0k}] = b, and W = [w_1, w_2, \ldots, w_k].
Notice that Ŵ absorbs b as the hat of W, and hence collects all the learning parameters for predicting z. Our notation Ŵ allows easy tracking of the variables. It is often used for prediction at the output layer of a neural network, because z = x̄Ŵ and an affine transformation is no longer needed at the output layer.
With Eq. (5.10), we can further construct the following matrix opera-
tion [6].
\begin{bmatrix} 1 \\ z^\top \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ b^\top & W^\top \end{bmatrix}\begin{bmatrix} 1 \\ x^\top \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ W_0^\top & W^\top \end{bmatrix}\begin{bmatrix} 1 \\ x^\top \end{bmatrix}, \quad \text{or}
[1 \;\; z] = [1 \;\; x]\begin{bmatrix} 1 & b \\ 0^\top & W \end{bmatrix} = [1 \;\; x]\begin{bmatrix} 1 & W_0 \\ 0^\top & W \end{bmatrix}, \quad \text{or simply} \quad \bar{z} = \bar{x}\,\bar{W}    (5.11)
where 0 = [0, 0, . . . , 0] contains p zeros. The matrix W̄ we derived is the affine transformation matrix. It has a dimension of (p + 1) × (k + 1) and performs an affine transformation from space X̄^p to space X̄^k for a given x. It is used in the hidden layers of neural networks, because the transformations must occur in the affine space in these layers to ensure proper connections.
Matrix W̄ will be used in Chapter 13 when studying MLPs or deepnets. Note that W̄ can be written as [e1  Ŵ], in which e1 = [1  0  · · ·  0] is the first base vector of space W^{p+1}. W̄ contains this constant unit vector as its first column, and thus the trainable parameters are all in Ŵ. In the output layer of an NN, we use only Ŵ, because an affine transformation is no longer needed there. For machine learning models, W̄ can be called the affine transformation weight matrix.
The formulation given above reveals clearly how the affine transformation weight matrix is derived, and how the traditional weights and biases enter it. Finally, we note the following points:
1. We know that [1, x] is in X̄^p. After the action of W̄ (from the right) on it, the resulting [1, z] is clearly also in an affine space, X̄^k. This is known as the automorphism property of an affine transformation. A pattern in an affine space stays in an affine space after an affine transformation. We will demonstrate this in the example section.
2. The affine transformation uses only the b and w given in Eq. (5.1), except that they need to be arranged in the form of W̄ to act properly on x̄. This may be the reason why Eq. (5.1) is often called an affine transformation (although it is not exactly one, at least in concept). Equation (5.1) is commonly used in the actual computations of affine transformations, including many parts of this book. Our Eq. (5.11), however, allows a more concise formulation and shows the automorphism explicitly. The last equation in Eq. (5.11) and Eq. (5.10) can, and should, of course also be used for computation. If we do so, the code and data structure may be even neater, because we have only xw operations and no b is involved.
3. For given W0 (or b) and w, W̄ is a linear operator. The weights w are responsible for stretching-compression (scaling) and rotation, and W0 is responsible for translation. This property gives the affinity mentioned earlier. We will discuss this further in the following sections and demonstrate it in the example section.
4. For convenience, we will use both the xw formulation and the xw+b formulation.
Let us look at some special cases when Eq. (5.11) is used to perform affine transformations.
Case 1: if we set all learning parameters to zero, Ŵ = 0, we obtain [1 z] = [1 0], meaning that any data-point [1 x] in an affine space collapses to the same point [1 0] in another affine space.
Case 2: if we set b = 0 and W = I, where I is the identity matrix, we shall have [1 z] = [1 x]. This means that any original point in the affine space is unchanged (no transformation).
Case 3: if we set b = c, where c is a constant vector, and W = I, we shall have [1 z] = [1 c + x]. This means that any original point in the affine space is translated by c.
Case 4: if we set b = [c, 0, . . . , 0] and W = [k, e2 , . . . , ep ], where ei is a base vector of W^p (with all zero entries except a 1 at the ith entry), we obtain zi = xi (i = 2, . . . , p) and
z_1 = x\,k + c    (5.12)
which is Eq. (5.5). This means that the prediction of a linear function in the feature space can be viewed as an affine transformation in the affine space. Since k in W is the gradient of the function, it is responsible for rotation.
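A small numerical check of Eq. (5.11), with assumed random values, is sketched below: it builds the affine transformation matrix from b and W, acts on [1, x], and confirms that the result matches xW + b while the leading entry stays 1 (the automorphism).

import numpy as np

p, k = 3, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((p, k))             # weights (assumed values)
b = rng.standard_normal((1, k))             # biases (assumed values)
W_bar = np.block([[np.ones((1, 1)), b],     # the matrix [[1, W0], [0, W]]
                  [np.zeros((p, 1)), W]])   # shape (p+1, k+1)

x = rng.standard_normal((1, p))             # one data-point in the feature space
x_bar = np.hstack([np.ones((1, 1)), x])     # [1, x] in the affine space
z_bar = x_bar @ W_bar                       # [1, z]: stays in an affine space
print(z_bar[0, 0])                          # the first entry remains 1
print(np.allclose(z_bar[:, 1:], x @ W + b)) # True: identical to x W + b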
5.2 Affine Transformation Unit (ATU), A Simplest Network
A p → 1 neural network can be built for predicting arbitrary linear functions in the feature space, or for performing an affine transformation in the affine space. The typical architectures are shown in Fig. 5.2.
This net can be set to predict arbitrary linear functions, or to perform the affine transformation defined in Eq. (5.1) for a given data-point xi (i = 1, 2, . . . , p), using the sets of learning parameters wi (i = 1, 2, . . . , p) and b. Clearly, any change in the learning parameters results in a different value of the function z for the same given data-point xi (i = 1, 2, . . . , p).
Note that Fig. 5.2 is a basic unit or a building block that can be used
to form a complicated neural network. Therefore, let us write a code to
Figure 5.2: A p → 1 neural network with one layer neurons taking an input of p features,
and one output layer of just one single neuron that produces a single prediction function
z. This forms an affine transformation unit or ATU (or linear function prediction unit).
The net on the left is for xw+b formulation, and on the right is for xw formulation with
p + 1 neurons in the input layer in which the one at the top is fixed as 1. Both ATUs are
essentially identical.
study it in great detail using the following examples. Let us first discuss the
data structures. Different ML algorithms may use a different one, and the
following one is quite typical.
5.3 Typical Data Structures
p → 1 nets:
Equation (5.1) can be written in the matrix form with dimensionality clearly
specified as follows:
z(ŵ; x) = x w + b = x ŵ (5.13)
1×p p×1 1×1 1×(p+1) (p+1)×1
p
The prediction function z ∈ X is now clearly specified as a function of w
and b corresponding to any x ∈ Xp . For the ith data-point xi , we have
z(ŵ; xi ) = xi w + b = xi ŵ (5.14)
1×p p×1 1×1 1×(p+1) (p+1)×1
Note that z(ŵ; x) is still a scalar for one data-point. Also, because no further transformation follows, the augmented form of z is not needed for one-layer nets.
p → k nets:
In hyperspace cases, we have many, say k, neurons in the output of the current layer, each neuron performing (independently) an affine transformation based on the same dataset (see Fig. 5.13). Therefore, the output should be an array with k entries. The data may be structured in
matrix form:
[z_1 \;\; z_2 \;\cdots\; z_k] = [x_{i1} \;\; x_{i2} \;\cdots\; x_{ip}] \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1k} \\ W_{21} & W_{22} & \cdots & W_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{p1} & W_{p2} & \cdots & W_{pk} \end{bmatrix} + [b_1 \;\; b_2 \;\cdots\; b_k]    (5.15)
where the left-hand side is z(W, b; x_i) of size 1 × k, x_i is 1 × p, W is p × k, and b is 1 × k.
The above matrix can be written in a concise matrix form as follows, with
all the dimensionality specified clearly:
\underset{1\times k}{z(\hat{W}; x_i)} = \underset{1\times p}{x_i}\;\underset{p\times k}{W} + \underset{1\times k}{b} = \underset{1\times(p+1)}{\bar{x}_i}\;\underset{(p+1)\times k}{\hat{W}}    (5.16)
Note that one can stack up as many neurons in a layer as needed, because the weights for each neuron are independent of those for any other
neuron in the stack. This stacking is powerful because it makes the well-
known universal approximation theory (see Chapter 7) workable.
p → k nets with m data-points:
For a dataset with m points, the data may be structured as matrix Xm×p ,
by vertically stacking xi . In this case, m predictions can be correspondingly
made, and the formulation in matrix form becomes
\begin{bmatrix} z_1(W, b;\, x_1) \\ z_2(W, b;\, x_2) \\ \vdots \\ z_m(W, b;\, x_m) \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mp} \end{bmatrix} \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1k} \\ W_{21} & W_{22} & \cdots & W_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{p1} & W_{p2} & \cdots & W_{pk} \end{bmatrix} + \begin{bmatrix} b \\ b \\ \vdots \\ b \end{bmatrix}    (5.17)
where vector B has the same b for all entries. The above matrix can be
written in a concise matrix form as follows, with all the dimensionality
specified clearly:
\underset{m\times k}{Z(W, B)} = \underset{m\times p}{X}\;\underset{p\times k}{W} + \underset{m\times k}{B} = \underset{m\times(p+1)}{\bar{X}}\;\underset{(p+1)\times k}{\hat{W}}    (5.18)
Note that we do not actually form the matrix Z in practical computations, because when a loss function is constructed, it takes the form of a summation over the m data-points or over a mini-batch. We will see this frequently in later chapters on actual machine learning models.
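The shapes in Eqs. (5.15)-(5.18) can be sketched in NumPy as follows (a sketch with illustrative names; in practice the bias row b is simply broadcast over the m rows, so the matrix B never needs to be formed explicitly):
import numpy as np

m, p, k = 5, 3, 2                              # data-points, features, output neurons
rng = np.random.default_rng(1)
X = rng.normal(size=(m, p))                    # m x p dataset
W = rng.normal(size=(p, k))                    # p x k weight matrix
b = rng.normal(size=(1, k))                    # 1 x k bias row

Z = X @ W + b                                  # m x k predictions; b broadcast over rows

# Equivalent xw formulation with augmented X and W-hat
X_hat = np.hstack([np.ones((m, 1)), X])        # m x (p+1)
W_hat = np.vstack([b, W])                      # (p+1) x k
print(Z.shape, np.allclose(Z, X_hat @ W_hat))  # (5, 2) True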
5.4 Demonstration Examples of Affine Transformation
We now present a number of examples of affine transformations. This is
performed as follows.
For a given geometric pattern defined with a set of multiple data-points X ∈ X² (an affine space), its ith row is xi = [1, xi1, xi2], computed using Eq. (5.1):
zI = XŵI (5.19)
where ŵI = [bI , wI ] in which bI and wI are a given set of learning
parameters in the hypothesis space W3 , and
zII = XŵII (5.20)
where ŵII = [bII , wII ] in which bII and wII are a changed set of learning
parameters in W3 .
This results in a transformed data-point Z = [1, zI, zII] ∈ X².
The above procedure, which applies the affine transformation to the original dataset X ∈ X² while varying ŵ two times, results in a transformed dataset Z that lies in the same affine space X²: an automorphism.
We now write a code to demonstrate the affinity of the above transformation. Because X² is also a 2D plane, we can conveniently plot both original and transformed patterns together in the space R², using only zI and zII for visualization and analysis. We first define some functions.
import numpy as np
def logistic0(z): # The sigmoid/logistic function
return 1. / (1. + np.exp(-z))
def net(x,w,b): # An affine transformation net
y = np.dot(x,w) + b
return y
def edge(k,dd): # Define a line pattern in 2D
# feature space (x1, x2)
x = np.arange(-1.0,1.0+dd,dd) # dd: interval
x1 = x # x1 value.
x2 = k * x # x2 value, k slope of the line
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
def circle(r,dpi): # Define a circular pattern in 2D
# feature space (x1, x2)
x = np.arange(0.0,2*np.pi,dpi)
x1 = r*np.cos(x) # circle function, for x1 value.
# Radius r and scaling factor c.
x2 = r*np.sin(x) # for x2 value.
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
def rectangle(MinX, MaxX, MinY, MaxY,dd,theta):
# data for rectangle pattern in 2D space (x1, x2(y))
x = np.arange(MinX,MaxX+dd,dd)
ymin= np.full(x.shape,MinY) #y=x2
ymax= np.full(x.shape,MaxY)
y = np.arange(MinY,MaxY+dd,dd)
xmin= np.full(y.shape,MinX)
xmax= np.full(y.shape,MaxX)
x1 = np.append(np.append(np.append(x,xmax),np.flip(x)),xmin)
x2 = np.append(np.append(np.append(ymin,y),ymax),np.flip(y))
x1 = np.append(x1,(MaxX+MinX)/2) # add the center
x2 = np.append(x2,(MaxY+MinY)/2)
X1 = x1*np.cos(theta)+x2*np.sin(theta)
X2 = x2*np.cos(theta)-x1*np.sin(theta)
X = np.stack((X1, X2), axis=-1) # X has two components.
return X1,X2,X
def spiral(alpha,c):
# Define a spiral pattern in 2D space (x1, x2)
xleft,xright,xdelta = 0.0, 40.01, 0.1
x = np.arange(xleft,xright,xdelta)
x1 = np.exp(alpha*x)*np.cos(x)/c # logarithmic spiral
# function, x1 with decay rate alpha & scaling factor c.
x2 = np.exp(alpha*x)*np.sin(x)/c # x2 value.
x1 = np.append(x1,0) # add the center
x2 = np.append(x2,0)
X = np.stack((x1, x2), axis=-1) # X has two components.
return x1,x2,X
Let us now set up learning parameters w and b.
# Set w0 & b0,so that original [x1,x2] pattern is reproduced.
w0_I=np.array([1.,0.]) # Initial 2 weights (vector),2D space
b0_I=0 # Initial bias, so that z_I = x1
w0_II=np.array([0.,1.]) # Initial 2 weights (vector),2D space
b0_II=0 # Initial bias, so that z_II = x2
# Set wI=w0_I, b_I=b0_I; but w_II, b_II be arbitrary.
#w_I, b_I = w0_I, b0_I # readers may try this
w_I=np.array([.8,.2]) # Arbitrary values for the two weights
# to perform scaling and rotation
b_I = -0.6 # Arbitrary values for the bias to
# perform translation
w_II=np.array([.2,.5]) # Arbitrary values for the two weights
# to perform scaling and rotation
b_II = 0.6 # Arbitrary values for the bias to
# perform translation
Next, we define a function for plotting these patterns: the initial one, the affine transformed one using (w_I, b_I) and (w_II, b_II), and the linear transformed one using the same weights but with b_II set to 0.
%matplotlib inline
import matplotlib.pyplot as plt
def affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II):
plt.figure(figsize=(4.,4.),dpi=90)
plt.scatter(net(X,w0_I,b0_I),net(X,w0_II,b0_II),label=\
"Original: w0I=["+str(w0_I[0])+","+str(w0_I[1])+
"], b0I="+str(b0_I)+"\n w0II=["+str(w0_II[0])+","+
str(w0_II[1])+"], b0II="+str(b0_II),s=10,c='orange')
#plot the initial pattern
plt.scatter(net(X,w_I,b_I),net(X,w_II,b_II),label=\
"Affine: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], bII="+str(b_II),s=10,c='blue')
# plot the affine transformed pattern
plt.scatter(net(X,w_I,b_I),net(X,w_II,b0_II),label=\
"Linear: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], b0II="+str(b0_II),s=10,c='red')
#plot the linear transformed pattern
plt.xlabel('$z_{I}$')
plt.ylabel('$z_{II}$')
plt.title('linear and affine transformation')
plt.grid(color='r', linestyle=':', linewidth=0.3)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.axis('scaled')
#plt.ylim(-5,9)
plt.show()
5.4.1 An edge and a rectangle under affine transformation
We first create an edge (straight line segment) and a rectangle pattern
represented by a set of orange points, perform the affine transformations
defined by Eq. (5.1), and plot out these patterns before and after the affine
(blue) and linear (red) transformations. Because a rectangle consists of
straight lines, it is easy for us to observe the affinity of the transformation.
x1,x2,X = edge(1.5,0.2)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.3: Affine transformations of a straight line/edge.
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.4: Affine transformations of a rectangle.
From the above two figures, the following observations can be made:
1. After the affine transformation, the original (orange) rectangular pattern is only rotated, scaled, sheared, and translated to a new one (blue). The weights w are responsible for the linear part of the transformation, which produces the scaling, rotation, and shear, while b is responsible for the translation.
2. The transformation moves point to point, edge to edge, and quadrilateral
to quadrilateral.
3. The transformation preserves the ratio of the lengths of parallel line
segments. For example, the ratio of the two longer sides of the orange
rectangle is the same as the ratio of the two longer sides of the blue
quadrilateral.
4. Parallel line segments remain parallel after the affine transformation.
5. It does not preserve distances between points. It preserves only the ratios
of the distances between points lying on a straight line.
6. The affine transformation does not preserve angles between lines.
This simple demonstration helps one to imagine how an affine transformed pattern covers the (same) space by changing w and b. The pure linear transformation alone does not change the origin, and hence has a much more limited coverage.
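The observations above can also be verified numerically. The following sketch reuses the net() function and the (arbitrarily chosen) parameters w_I, b_I, w_II, b_II defined earlier, and checks that ratios of distances along a line and parallelism are preserved, while distances themselves are not:
import numpy as np

def affine_map(P):                    # maps a 2D point to (z_I, z_II)
    return np.array([net(P, w_I, b_I), net(P, w_II, b_II)])

A, B, C = np.array([0., 0.]), np.array([1., 0.]), np.array([3., 0.])  # collinear points
Ai, Bi, Ci = affine_map(A), affine_map(B), affine_map(C)

# Ratio of distances along a straight line is preserved
r_before = np.linalg.norm(C - A) / np.linalg.norm(B - A)
r_after = np.linalg.norm(Ci - Ai) / np.linalg.norm(Bi - Ai)
print(np.isclose(r_before, r_after))      # True

# Parallel segments remain parallel: their image directions stay proportional
d1 = affine_map(np.array([1., 1.])) - affine_map(np.array([0., 0.]))
d2 = affine_map(np.array([3., 2.])) - affine_map(np.array([2., 1.]))  # same direction (1, 1)
print(np.isclose(np.cross(d1, d2), 0.0))  # True: zero cross product

# Distances themselves are generally not preserved
print(np.linalg.norm(B - A), np.linalg.norm(Bi - Ai))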
5.4.2 A circle under affine transformation
Let us now take a look at the affine (and linear) transformation to a circle
using the same code.
x1,x2,X = circle(1.0,0.1)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.5: Affine transformations of a circle.
This time, it is clearly seen from Fig. 5.5 that after the transformation, the
original (orange) circular pattern is rotated, scaled, sheared, and translated
to an ellipse (blue). The observations that we made for the rectangles are still valid.
5.4.3 A spiral under affine transformation
Let us examine the affine (and linear) transformation to a more complicated
pattern, spiral.
x1,x2,X = spiral(0.1,10.)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
Figure 5.6: Affine transformations of a spiral.
In this case, the original (orange) spiral is rotated, scaled, sheared, and
translated to a new one. The observations made for the above two examples
still hold.
5.4.4 Fern leaf under affine transformation
Figure 5.7 shows an excellent example of affine transformation available
in the public domain provided by António Miguel de Campos (https://
en.wikipedia.org/wiki/Affine transformation). It is an image of a fractal of
Barnsley’s fern (https://en.wikipedia.org/wiki/Barnsley fern). A leaf of the
fern is an affine transformation of another by a combination of rotation,
scaling, reflection, and translation. The red leaf, for example, is an affine
transformation of the dark blue leaf, or of any of the light blue leaves. The fern seems to have this typical pattern coded as an ATU in its DNA. This implies that the ATU is as fundamental as DNA.
Figure 5.7: An image of a leaf of the fern-like fractal is an affine transformation of another.
5.4.5 On linear prediction function with affine transformation
One should not confuse the linear prediction function with the affine
transformation. They are essentially the same hypothesis, but viewed from
different aspects. The former is on the predictability of an arbitrary linear
function in the feature space using hypothesis Eq. (5.1), and the latter is on
affinity when Eq. (5.1) is used for pattern transformation on the affine space.
The predictability of the arbitrary linear function enables affine transforma-
tion. The predictability for constants allows translational transformation,
and the predictability for the linear function enables variable-wise scaling
and rotation, while maintaining the affinity. The prediction of a linear
function in the feature space can be viewed as an affine transformation in
the affine space. Because of this, we use these two terms interchangeably,
knowing this subtle difference.
5.4.6 Affine transformation wrapped with activation function
When an affine transformation z(w, b; x) is wrapped with a nonlinear activa-
tion function (see Chapter 7), the output φ(z) is confined by the activation
function, and the affinity is destroyed. However, φ(z) shall now have some
capability to predict nonlinear functions, because the activation functions
used in ML are continuous, smooth (at least piecewise differentiable), and
vary monotonically with z.
As an example, we wrap the affine transformed pattern with the sigmoid
function. In this case, φ(z) is confined in (0, 1). We write the following code
to demonstrate some examples of affine mapping:
# Code for Affine transformation wrapped with sigmoid function
def sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II):
# We do the same affine transformation and put the
# results to a sigmoid.
plt.figure(figsize=(4.5, 3.0),dpi=100)
plt.scatter(logistic0(net(X,w0_I,b0_I)),
logistic0(net(X,w0_II,b0_II)),label=\
"Original: w0I=["+str(w0_I[0])+","+str(w0_I[1])+
"], b0I="+str(b0_I)+"\n w0II=["+str(w0_II[0])+","+
str(w0_II[1])+"], b0II="+str(b0_II),s=10,c='orange')
#plot the initial pattern
plt.scatter(logistic0(net(X,w_I,b_I)),
logistic0(net(X,w_II,b_II)),label=\
"Affine: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], bII="+str(b_II),s=10,c='blue')
# plot the affine transformed pattern
plt.scatter(logistic0(net(X,w_I,b_I)),
logistic0(net(X,w_II,b0_II)),label=\
"Linear: wI=["+str(w_I[0])+","+str(w_I[1])+
"], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
str(w_II[1])+"], b0II="+str(b0_II),s=10,c='red')
#plot the linear transformed pattern
plt.xlabel('$\sigma (z_{I})$')
plt.ylabel('$\sigma (z_{II})$')
plt.title('Affine transformation wrapped with sigmoid')
plt.grid(color='r', linestyle=':', linewidth=0.3)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#plt.axis('scaled')
plt.ylim(-.05,1.1)
plt.show()
x1,x2,X = edge(5.0,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.8: Nonlinear activation function wrapped affine transformations of a straight
line/edge.
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.9: Nonlinear activation function wrapped affine transformations of a rectangle.
For this sigmoid wrapped affine transformation, the following observa-
tions can be made:
1. The transformation still sends a point to a point, an edge to an edge
uniquely. The use of the nonlinear activation function does not change
the uniqueness of the point-to-point transformation. This is because
the activation functions are continuous, smooth (at least piecewise
differentiable), and vary monotonically.
2. The affinity is, however, destroyed: the ratios of distances between points
lying on a straight line are changed. Not all the parallel line segments
remain parallel after the sigmoid transformation. The use of the sigmoid
function clearly brings nonlinearity. This gives the net the following
capabilities:
• The output φ(z(ŵ; x)) is now nonlinearly dependent on the features x. One can now use it for logistic regression with labels given as 0 or 1, by training ŵ.
• φ(z(ŵ; x)) is linearly independent of the features x in the input. This allows further affine transformations to be carried out in a chain to the next layer if needed.
• φ(z(ŵ; x)) is also linearly independent of the learning parameters ŵ used in this layer. When we need more layers in the net, fresh ŵs can now be used for the next layers independently. This enables the creation of deepnets.
Let us now take a look at the wrapped affine transformation to a circle
function, using the same code.
x1,x2,X = circle(2.5,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.10: Nonlinear activation function wrapped affine transformations of a circle.
This case shows a more severe shape distortion. The uniqueness of the point-to-point transformation is still preserved. The following applies the wrapped affine transformation to the spiral pattern.
x1,x2,X = spiral(0.1,10.)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Figure 5.11: Nonlinear activation function wrapped affine transformations of a spiral.
We see severe distortions, due again to the nonlinearity of the activation function, the sigmoid. The point-to-point transformation is still observed, but where σ(z) is near 0.0 or 1.0, the original points and the transformed points are "squashed" closer together by the sigmoid. Hence, information near there is transmitted much less effectively through updates of the learning parameters w and b. In other words, the gradient gets close to zero, due again to the saturation property of the sigmoid function.
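This squashing can be quantified with the derivative of the sigmoid, dσ/dz = σ(z)(1 − σ(z)), which approaches zero as σ(z) approaches 0 or 1. A minimal check (a sketch, independent of the plotting code above):
import numpy as np

def dlogistic0(z):                        # derivative of the sigmoid/logistic function
    s = 1. / (1. + np.exp(-z))
    return s * (1. - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    s = 1. / (1. + np.exp(-z))
    print(f"z={z:5.1f}  sigma={s:.5f}  dsigma/dz={dlogistic0(z):.6f}")
# The gradient is largest at z = 0 and practically vanishes once sigma(z) is near 0 or 1.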
5.5 Parameter Encoding and the Essential Mechanism
of Learning
5.5.1 The x to ŵ encoding, a data-parameter converter unit
Based on Eq. (5.1), we now see a situation where a dataset (x, z) can
be encoded in a set of learning parameters ŵ in the hypothesis space.
Figure 5.12 shows schematically such an encoded state.
The straight lines in Fig. 5.12 are encoded with a point in the hypothesis
space W2 . For example, the red line is encoded by a red dot at w0 = 1 and
w1 = 1. In other words, using w0 = 1 and w1 = 1, we can reproduce the red
line. The blue line is encoded by a blue dot at w0 = 2 and w1 = −0.5, with
which the blue line can be reproduced. The same applies to the black line. A
machine learning process is to produce an optimal set of dots using a dataset.
After that, one can then produce lines. In essence, a machine learning model
converts or encodes the data to wi . This implies that the size and quality of
the dataset are directly related to the dimension of the affine spaces used in the model.
Figure 5.12: Data (on relations of x-z or x-y for given labels) encoded in model parameters ŵ in the hypothesis space. In essence, an ML model converts data to ŵ during training.
On the other hand, if one tunes wi , different prediction functions can be
produced in the label space. Therefore, it is possible to find such a set of
wi that makes the prediction match the given label in the dataset for given
data-points. Finding such a set of wi is the process of learning. Real machine learning models are a lot more complicated, but this gives the essential mechanism.
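A minimal numerical illustration of this encoding (with hypothetical numbers consistent with Fig. 5.12): two points sampled from a line in the x-z space determine the pair (w0, w1) uniquely, and that same pair reproduces the line at any new x:
import numpy as np

# Two data-points (x, z) sampled from the red line z = 1 + 1*x in Fig. 5.12
X_hat = np.array([[1.0, 0.0],      # augmented rows [1, x]
                  [1.0, 2.0]])
z = np.array([1.0, 3.0])

w_hat = np.linalg.solve(X_hat, z)  # encode the data into [w0, w1]
print(w_hat)                       # [1. 1.]  -> the red dot at w0 = 1, w1 = 1

# Decoding: the same w_hat reproduces the line at any new x
x_new = np.array([1.0, 5.0])       # augmented point [1, x] with x = 5
print(x_new @ w_hat)               # 6.0, i.e., z = 1 + 1*5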
5.5.2 Uniqueness of the encoding
We state that the encoding of a line in the X-z space to a point in the
hypothesis space is unique. It is very easy to prove as follows.
Assume that an arbitrary line in the X-z space corresponds to two distinct points ŵ(1) and ŵ(2) in the hypothesis space; using Eq. (5.9), this line can be expressed as
z = xŵ(1) (5.21)
which holds for arbitrary x. This same line can also be expressed as
z = xŵ(2) (5.22)
which holds also for arbitrary x. Using now Eqs. (5.21) and (5.22), we obtain
0 = x[ŵ(2) − ŵ(1) ] (5.23)
Equation (5.23) must also hold for arbitrary x. Therefore, we shall have
ŵ(1) = ŵ(2) (5.24)
which completes the proof.
On the other hand, we also state that a point in the hypothesis space
gives a unique line. It can be easily proven as follows.
Given any arbitrary point in the hypothesis space ŵ, assume we can
construct two lines in the X-z space. Using Eq. (5.9), the first line can be
expressed as
z (1) = xŵ (5.25)
The second line can be expressed as
z (2) = xŵ (5.26)
Using Eqs. (5.25) and (5.26), we obtain
z (2) − z (1) = 0 (5.27)
This means that these two lines are the same, which completes the proof.
In fact, the uniqueness can be clearly observed from Fig. 5.12, because a
line is uniquely determined by its slope and bias, and both are given by ŵ.
The uniqueness is one of the most fundamental reasons for a quality
dataset to be properly encoded with the learning parameters based on the
hypothesis of affine transformations, or for a machine learning model to be
capable of reliably learning from data.
5.5.3 Uniqueness of the encoding: Not affected
by activation function
It is observed that the uniqueness of the encoding of an affine transformation
wrapped with an activation function is not affected by the activation
function. This is because activation functions are all strictly monotonic
functions (as will be shown in Chapter 7), which does not change the
uniqueness of its argument. The proof is thus essentially the same as that
given above.
On the other hand, this property implies that the activation function
must be monotonic. This is true for all the activation functions discussed in
Chapter 7.
5.6 The Gradient of the Prediction Function
The gradient of the prediction function with respect to the learning
parameters has the following simple forms:
∇w z = x,    ∇b z = 1    (5.28)
which shows that the gradient with respect to the weights is simply the feature variable (the data). The gradient with respect to the bias is, however, unity. This is because the data entry corresponding to the bias is x0 = 1. This may suggest that when a regularization technique (see Chapter 14) is used, one may choose different regularization parameters for the weights and the bias.
When the xw formulation is used, we have
∇ŵ z = x    (5.29)
where x now includes the leading entry of 1; this gradient can be used by the autograd in machine learning processes.
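Equation (5.28) can be verified with a quick finite-difference check (a sketch; the names are illustrative):
import numpy as np

rng = np.random.default_rng(2)
p = 4
x = rng.normal(size=p)
w = rng.normal(size=p)
b = 0.3
z = lambda w, b: np.dot(x, w) + b          # the linear prediction function

eps = 1e-6
grad_w = np.array([(z(w + eps*np.eye(p)[i], b) - z(w, b)) / eps for i in range(p)])
grad_b = (z(w, b + eps) - z(w, b)) / eps

print(np.allclose(grad_w, x))      # True: gradient w.r.t. the weights equals the data x
print(np.isclose(grad_b, 1.0))     # True: gradient w.r.t. the bias is 1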
5.7 Affine Transformation Array (ATA)
We can now duplicate the single neuron on the right in Fig. 5.2 vertically k times and let all neurons become densely connected, meaning that each neuron in the output is connected with each of the neurons in the input (also known as fully connected). This forms a p → k network.
The prediction functions z from an ATA can be expressed as
z(Ŵ; x) = xW + b = x Ŵ (5.30)
where z is a vector of prediction functions given by z(Ŵ; x) = [z1(ŵ1; x), z2(ŵ2; x), . . . , zk(ŵk; x)], in which
ŵj = [W0j, W1j, W2j, . . . , Wpj] with W0j = bj are the (p + 1) learning parameters for the jth neuron in the output layer,
Ŵ = [b; W] is a matrix of (p + 1) × k containing all the learning parameters for all neurons in the output layer (the bias row b stacked on top of W),
W = [Wij] (i = 1, 2, . . . , p; j = 1, 2, . . . , k) is a matrix of weights (part of the learning parameters), and
b = [b1, b2, . . . , bk] is a vector of biases (part of the learning parameters).
The vector ŵ of total learning parameters of a p → k ATA becomes,
ŵ = [ŵ1 , ŵ2 , . . . , ŵk ]
Figure 5.13: A p → k neural network with one input layer of p neurons and one output
layer of k neurons that produces k prediction functions zi (i = 1, 2, . . . , k). Each neuron
at the output connects to all the neurons in the input with its own weights. This stack of ATUs forms an affine transformation array or ATA. In other words, the p → k net
has the predictability of k functions in the feature space with p dimensions. Left: xw+b
formulation; Right: xw formulation.
which can also be regarded as the flattened Ŵ. The total number of the
learning parameters is
P = (p + 1) × k
It is clear that the hypothesis space grows fast in multiples for an ATA.
Equation (5.30) is the matrix form of a set of affine transformations.
It is important to note that each zj(wj, bj) is computed using Eq. (5.1), using its own weights wj and bias bj. This enables all zj(wj, bj), j = 1, 2, . . . , k, to be independent of each other. Therefore, the ATA given in
Fig. 5.13 creates the simplest mapping that can be used for k-dimensional
regression problems using a dataset with p features. Note also that when
k = p, it can perform the p → p affine transformation.
5.8 Predictability of High-Order Functions of a Deepnet
5.8.1 A role of activation functions
Now, we wrap the stack of prediction functions with nonlinear activation
functions. This leads to a vector of
[φ(z1(w1, b1)), φ(z2(w2, b2)), . . . , φ(zk(wk, bk))]    (5.31)
Each entry becomes a new feature xi^(new), i = 1, 2, . . . , k; these new features are linearly
independent of the original features xi , i = 1, 2, . . . , p. These new features
can then be used as inputs to the next layer. This allows the use of a new
set of learning parameters for the next layer.
It is clear that a role of the activation function is to force the outputs of an ATA to be linearly independent of those of the previous ATA, enabling further affine transformations and leading to a chain of ATAs: a deepnet. To fulfill this important role, the activation function must be nonlinear.
Such a chain of stacks of prediction functions (affine transformations)
wrapped with nonlinear activation functions gives a very complex deepnet,
resulting in a complex prediction function. Further, when affine transfor-
mation Eq. (5.1) is replaced with spatial filters, one can build a CNN
(see Chapter 15) for object detection, and when replaced with temporal
filters, we may have an RNN (see Chapter 16) for time sequential models,
and so on.
5.8.2 Formation of a deepnet by chaining ATA
These new features given in Eq. (5.31) can now be used as the inputs for the
next layer to form a deepnet. To illustrate this more clearly, we consider a
simplified deepnet with 4 − 2 − 3 neurons shown in Fig. 5.14.
Figure 5.14: Schematic drawing of a chain of stacked affine transformations wrapped with
activation functions in a deepnet for approximation of high-order nonlinear functions of
high dimensions. This case is an xw+b formulation. A deepnet using xw formulation will
be given in Section 13.1.4.
Here, let us use the number in parentheses to indicate the layer number:
1. Based on the 4 (independent input) features xi^(1) (i = 1 ∼ 4) to the first layer, a stack of 2 affine transformations zi^(1) (i = 1 ∼ 2) takes place, using a 4 × 2 weight matrix W^(1) and biases bi^(1) (i = 1 ∼ 2). Affine transformation z1^(1) uses wi1^(1) (i = 1 ∼ 4) and b1^(1), and z2^(1) uses wi2^(1) (i = 1 ∼ 4) and b2^(1). Clearly, these are carried out independently using different sets of weights and biases.
2. Next, z1^(1) and z2^(1) are, respectively, subjected to a nonlinear activation function φ, producing 2 new features xi^(2) (i = 1 ∼ 2). Because of the nonlinearity of φ, xi^(2) will no longer depend linearly on the original features xi^(1) (i = 1 ∼ 4).
3. Therefore, xi^(2) (i = 1 ∼ 2) can now be used as independent inputs for the 2nd layer of affine transformations, using a 2 × 3 weight matrix W^(2) and biases bi^(2) (i = 1 ∼ 3), in the same manner. This results in a stack of 3 affine transformations zi^(2) (i = 1 ∼ 3), which can then be wrapped again with nonlinear activation functions. This completes the 2nd layer of 3 stacked affine transformations in a chain.
The above process can continue as desired to increase the depth of the neural
network. Note also that the number of neurons in each layer can be arbitrary
in theory. Because of the stacking and chaining, the hypothesis space is
greatly increased. The stacking causes the increase in multiples, and the
chaining in additions. The prediction functions may live in an extremely high
dimensional space WP for deepnets. For this simple deepnet of 4 − 2 − 3, the
dimension of the hypothesis space becomes P = (4× 2+ 2)+ (2× 3+ 3) = 19.
In general, for a net of p − q − r − k, for example, the formulation should be
P = (p × q + q) + (q × r + r) + (r × k + k)    (5.32)
where the three groups correspond to layer 1, layer 2, and layer 3, respectively.
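Equation (5.32) generalizes to any list of layer widths. A small helper function (a sketch, not from the book's code) confirms the count of 19 for the 4 − 2 − 3 net:
def count_parameters(layers):
    # Number of learning parameters P of a dense net, e.g. layers = [p, q, r, k]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

print(count_parameters([4, 2, 3]))     # 19 = (4*2 + 2) + (2*3 + 3)
print(count_parameters([4, 2, 3, 5]))  # one more layer: 19 + (3*5 + 5) = 39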
The vector of all trainable parameters in an MLP becomes
ŵ = [Ŵ^(1).flatten(), Ŵ^(2).flatten(), . . . , Ŵ^(N_L).flatten()]
where NL is the total number of hidden layers in the MLP. Note that we
may not really perform the foregoing flattening in actual ML models. It is
just for demonstrating the growth of the dimension of the hypothesis space.
In actual computations, we may simply group them in a Python list, and
use an important autograd algorithm to automatically perform the needed
forward and backward computations in training an MLP. The computations over such a high dimension are carried out numerically. This will be discussed in detail in Chapter 8.
Figure 5.15: A 1 → 1 → 1 net with sigmoid activation at the hidden and last layers.
The prediction functions are now functions of ŵ in a high-dimensional hypothesis space W^P for MLPs. The formulation on calculating P for more
general MLPs will be given in Chapter 13. The Neurons-Samples Theory
that gives the relationship between the number of neurons and the number
of data-points will be discussed in Section 13.2.1.
5.8.3 Example: A 1 → 1 → 1 network
Consider the simplest 1 → 1 → 1 neural network shown in Fig. 5.15. Let us
use the same linear prediction function Eq. (5.1) and a sigmoid activation
function for the hidden and last layers. In this case, the output at the last
layer, x^(3), can be obtained as follows:
x^(3) = σ(z^(2)) = σ(w^(2) x^(2) + b^(2)) = 1 / (1 + exp(−(w^(2) x^(2) + b^(2))))    (5.33)
where x^(2) is the output from the hidden layer:
x^(2) = σ(z^(1)) = σ(w^(1) x^(1) + b^(1)) = 1 / (1 + exp(−(w^(1) x + b^(1))))    (5.34)
in which x^(1) = x, which is the input that can be normalized to lie in (−1, 1). The number in parentheses in the superscript stands for the layer number. Because x^(2) is in (0, 1), we next use the Taylor expansion consecutively twice to approximate the sigmoid function; we obtain
x^(3) = c0 + c1 x + c2 x² + c3 x³ + · · ·    (5.35)
where these constants are given, through a lengthy but simple derivation, by
c0 = −(1/16) b^(1) b^(2) w^(1) (b^(1) − b^(2) w^(1)) − (1/48) b^(2) w^(1) ((b^(2) w^(1))² − 12) − (1/48) b^(1) ((b^(1))² − 12)
c1 = −(1/16) w^(1) w^(2) ((b^(1))² + 2 b^(1) b^(2) w^(1) + (b^(2) w^(1))² − 4)    (5.36)
c2 = −(1/16) (w^(1))² (w^(2))² (b^(1) + b^(2) w^(1))
c3 = −(1/48) (w^(1))³ (w^(2))³
It is clear from Eq. (5.36) that by properly setting (training using a dataset) the weights and biases for the neurons in the hidden and output layers, all these constants ci, i = 0, 1, 2, 3, can be determined. The prediction function at the output, x^(3), becomes a 3rd-order polynomial of the feature x as given in Eq. (5.35). This means that our 1 → 1 → 1 net has the predictability, approximately, for 3rd-order functions, in contrast to a 1 → 1 net that is only capable of 1st-order functions.
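This claim can be checked numerically: for fixed (hypothetical) weights and biases, the output x^(3) of the 1 → 1 → 1 net is fitted closely by a cubic polynomial over a normalized input range. A minimal sketch:
import numpy as np

w1, b1, w2, b2 = 1.5, 0.2, 2.0, -0.5           # hypothetical learning parameters
sigma = lambda z: 1. / (1. + np.exp(-z))

x = np.linspace(-1., 1., 200)                  # normalized inputs
x2 = sigma(w1 * x + b1)                        # hidden-layer output, Eq. (5.34)
x3 = sigma(w2 * x2 + b2)                       # net output, Eq. (5.33)

coeffs = np.polyfit(x, x3, 3)                  # least-squares cubic fit [c3, c2, c1, c0]
residual = np.max(np.abs(np.polyval(coeffs, x) - x3))
print(coeffs)
print(residual)                                # small: a cubic captures x3 closely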
Note that the same analysis can be performed for other types of activation functions. Also, if we retain more higher-order terms in the Taylor series, we can approximate even higher-order functions. The point of our discussion here
is not about how well to approximate the sigmoid function, but to show the
capability of the simple 1 → 1 → 1 net. This simple analysis supports a
very important fact that adding layers gives the net the capacity to predict
higher-order nonlinear latent behavior. This is the reason why a deepnet is
powerful if it can be effectively trained.
We note, without further elaboration, that increasing the depth of a net is
equivalent to increasing the order of the shape functions in the physics-law-
based models such as the FEM [1] or the meshfree methods [3]. In contrast,
increasing the number of neurons in a layer is equivalent to increasing the
number of the elements or nodes [5].
5.9 Universal Prediction Theory
A deepnet can be established with the following important properties:
1. The capability of the linear prediction function (affine transformation)
in predicting exactly any function up to the first order, and one-to-one
unique transformation (with or without activation function) in each ATU.
2. The independence of the function approximation (affine transformation)
of a neuron to another in an ATA (due to its independent connections).
3. New independent features are produced in each layer using nonlinear
activation functions for each ATA.
4. Chaining of ATAs wrapped with nonlinear activation functions provides the capability of predicting complex nonlinear functions to arbitrarily high order.
In the opinion of the author, these are the fundamental reasons for various
types of deepnets being capable of creating p → k mappings for extremely
complicated problems from p inputs of features to k labels (targeted features)
existing in the dataset. We now summarize our discussion to a Universal
Prediction Theory [6].
Universal Prediction Theory: A deepnet with sufficient layers of suffi-
cient neurons wrapped with nonlinear activation functions can be established
for predictions of latent features existing in a dataset when properly trained.
This theory claims only the capability of a deepnet in terms of creating
giant prediction functions based on the dataset. How to realize its capability
requires a number of techniques, including how to set up a proper structure
of the deepnet for given types of problems and how to find these optimal
learning parameters reliably and effectively. In addition, the applicability of
the trained MLP model depends on the quality of the dataset, as mentioned
in Section 1.5.6. The dataset quality is defined as its representativeness of the underlying problem to be modeled, including correctness, size, data-point distribution over the feature space, and noise level.
5.10 Nonlinear Affine Transformations
Note that in the above formulations, the features xi , i = 1, 2, . . . , p are used
in an affine transformation as linear basis functions. However, the basis func-
tions do not have to be linear. Take a one-dimensional problem, for example;
when linear approximation is used, the vector of the features should be
x = [1, x] (5.37)
If one would like to use a 2nd-order approximation (oftentimes called nonlinear regression), the vector of the features simply becomes
x = [1, x, x²]    (5.38)
If one knows the dataset well and believes that a particular function can be
used as a basis function, one may simply add it as an additional feature. For
example, one can include sin(x) as a feature in the following form:
x = [1, x, sin(x)] (5.39)
The use of nonlinear functions as bases for features is also related to the
so-called support vector machine (SVM) models that we will discuss in
Chapter 6, where we use kernel functions for linearly un-separable classes.
This kind of nonlinear feature basis or kernel is sometimes called feature
functions.
In our neural network models, higher-order and enrichment basis func-
tions can also be used in higher dimensions. For example, for two-dimensional
spaces, we may have features like
x = [1, x1, x2, x1 x2, x1², x2², sin(x)]    (5.40)
This is a complete set of 2nd-order polynomial basis functions, enriched with a sin(x) function (with proper scaling of x). The dataset X shall also
be arranged in the order of all these features. The affine transformation
using the nonlinear basis functions can be exactly the same as using linear
basis discussed in this chapter. In fact, we have already done the affine
transformations for the circle and spiral. Note that for high-dimensional
problems, the feature space with nonlinear bases can be in extremely higher
dimensions. For such cases, the so-called kernel trick may apply to avoid
dimension increase.
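As an illustration of Eq. (5.40) (a sketch with hypothetical parameter values, taking sin(x1) as the enrichment term), the nonlinear feature map can be built explicitly and then fed to exactly the same kind of affine transformation used throughout this chapter:
import numpy as np

def feature_map(x1, x2):
    # 2nd-order polynomial basis enriched with sin(x1), in the spirit of Eq. (5.40)
    return np.stack([np.ones_like(x1), x1, x2, x1*x2, x1**2, x2**2, np.sin(x1)], axis=-1)

rng = np.random.default_rng(3)
x1 = rng.uniform(-1., 1., size=10)
x2 = rng.uniform(-1., 1., size=10)

X_feat = feature_map(x1, x2)             # 10 x 7 feature matrix
w_hat = rng.normal(size=7)               # hypothetical learning parameters (bias included)
z = X_feat @ w_hat                       # still a plain affine transformation, now in
print(X_feat.shape, z.shape)             # the enriched feature space: (10, 7) (10,)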
5.11 Feature Functions in Physics-Law-based Models
This concept of using higher-order or special feature functions is essentially
the same as in the physics-law-based computational methods. In these
physics-law-based methods, the feature functions are called basis func-
tions. These basis functions are used to approximate the field variables (displacements, stress, velocity, pressure, etc.) that are governed by a physics law in either strong or weak form.
For example, in the finite element approximation [1], and the smoothed
finite element methods [2], we frequently use higher-order polynomial bases
called higher-order elements. Special basis functions are also used but called
enrichment functions. For example, when we would like to capture the singular stress field in the domain, we add √r to the bases [2]. In the
meshfree methods [3], one can also use distance basis functions, such as
the radial basis functions (RBFs).
In methods used to solve linear mechanics problems governed by physics
laws, we often use high-order and special basis functions. The resulting
system equations will still be linear in the field variables. The essential
concept is the same: to capture necessary features in the system (governed by
law or hidden in data), one shall use feature or basis functions of necessary
complexity. The resulting models may still be linear in field variables.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-
Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space Theory
and Weakened Weak Forms, World Scientific, London, 2013.
[5] G.R. Liu, A Neural Element Method, International Journal of Computational Methods,
17(07), 2050021, 2020.
[6] G.R. Liu, A thorough study on affine transformations and a novel Universal Prediction Theory, International Journal of Computational Methods, 19(10), in press, 2022.
Chapter 6
The Perceptron and SVM
This chapter discusses two fundamentally important supervised machine
learning algorithms, the Perceptron and the Support Vector Machine (SVM),
for classification. Both are conceptually related, but very much different in
formulation and algorithm. In the opinion of the author, there are a number
of key ideas used in both classifiers in terms of computational methods. These
ideas and the resulting formulations are very inspiring and can be used for
many other machine learning algorithms. We hope the presentation in this
chapter can help readers to appreciate these ideas. The referenced materials
and much of the codes are from the Numpy documentations (https://
numpy.org/doc/), Scikit-learn documentations (https://scikit-learn.org/
stable/), mxnet-the-straight-dope (https://github.com/zackchase/mxnet-
the-straight-dope), Jupyter Notebook (https://jupyter.org/), and Wikipedia
(https://en.wikipedia.org/wiki/Main Page).
Our discussion starts naturally from the Perceptron. It is one of the earliest machine learning algorithms, developed by Frank Rosenblatt in 1957 [1, 2]. It is used for problems of binary classification and was one of the well-studied classification problems in the 1960s. Here, we first introduce the mathe-
matical model of a typical classification problem with the related formulation
and its connection to affine transformation. We then examine in detail
Coli’s Perceptron algorithm that is made available at mxnet-the-straight-
dope (https://github.com/zackchase/mxnet-the-straight-dope) (under the
Apache License 2).
The discussion on the Perceptron is naturally followed by that on SVM
[3, 4, 9]. A complete description of the SVM formulation is provided,
including detailed process leading to a quadratic programming problem, the
kernel trick for linearly inseparable datasets, as well as the concept of affine
transformation used in SVM.
6.1 Linearly Separable Classification Problems
Let us consider the following problem. Given a set of m data-points with p features, the ith data-point is denoted as xi = {xi1, xi2, . . . , xip} ∈ X^p with its corresponding label yi ∈ {±1} (meaning the label can be either +1 or −1 for a data-point). Such labels distinguish these data-points into two classes: positive points and negative ones. We assume that this set of data-points is linearly separable, meaning that these data-points can be separated into these two distinct classes using a hyperplane (the analog of a straight line in the higher-dimensional space X^p).
Consider a simple problem with only 2 features, x1 and x2 , in a two-
dimensional (2D) space x = {x1 , x2 } ∈ X2 , so that we can have a good
visualization. A set of m data-points is scattered in 2D space shown in
Fig. 6.1.
All these data-points are labeled into two classes: positive points marked
with “+” symbols, and each of the points is labeled with y = +1; negative
points marked with “−” symbols, and each of them is labeled with y = −1.
These two classes of points can be separated by straight lines, such as the
red-dashed line and the red-dotted line in Fig. 6.1. For datasets in the real
world, there are infinite numbers of such lines forming a street. Our goal
is to develop a computer algorithm to find such a line with a given labeled
dataset (the data-points with corresponding labels). This is simple but quite
a typical classification problem.
Figure 6.1: Linearly separable data-points in 2D space.
Assume for the moment that we know the orientation of one such red line, say the middle red-dashed line for easy discussion. Hence, we know its normal direction vector w = [w1, w2] ∈ W², although we do not yet know the translational location of the red-dashed line along its normal. We then have the unit normal vector w/‖w‖, with a length of 1. For any point (not necessarily a data-point) in the 2D space (marked with a small cross in Fig. 6.1), we can form a vector x starting at the origin. Now, the dot-product
x · w/‖w‖    (6.1)
becomes the length of the projection of x on the unit normal w/‖w‖. Therefore, it is the measure that we need to determine how far the point x is away from the origin in the direction of w/‖w‖, which is a useful piece of information. Because we do not yet know the translational location of the red line in relation to x, we thus introduce a parameter b/‖w‖, where b ∈ W¹ is an adjustable parameter that allows the red line to move up and down along w.
Notice in Eq. (6.1) that we used the dot-product (the inner product). This is the same as the matrix-product we use in the Python implementation, because their shapes match: x is a (row) vector, and w is a column vector (a matrix with a single column) of the same length. Therefore, we use both interchangeably in this book.
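For instance (a minimal NumPy sketch), the two products give the same scalar when the shapes match:
import numpy as np

x = np.array([1.0, 2.0])           # a data-point as a (row) vector
w = np.array([[0.5], [0.25]])      # a column vector (2 x 1 matrix)
b = 0.1

print(np.dot(x, w[:, 0]) + b)      # dot-product (inner product) form: 1.1
print((x @ w)[0] + b)              # matrix-product form, same value:  1.1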
The equation for an arbitrary line in relation to a given point x shall have this form:
x·w+b (6.2)
Our task now is to find the red-dashed line by choosing a particular
set of w and b in W3 that separates the data-points, which is the affine
transformation discussed in Chapter 5. The conditions should be
(x · w + b) > 0, for points on the upper-right side of the red-dashed line: y = +1    (6.3)
(x · w + b) < 0, for points on the lower-left side of the red-dashed line: y = −1    (6.4)
Note that Eqs. (6.3) and (6.4) are for ideal situations where these two sets of
points might be infinitely close. In practical applications, we often find that
these points are in two distinct classes, and they are separated by a street
with a finite width w (that may be very small). The formulation can now be
modified as follows:
(x · w + b) > +w/2, for points on the upper-right side of the red-dashed line: y = +1    (6.5)
(x · w + b) < −w/2, for points on the lower-left side of the red-dashed line: y = −1    (6.6)
This type of equation is also known as the decision rule: when the condition
is satisfied by an arbitrary point x, it then belongs to a labeled class (y = 1
or y = −1), when the parameters w and b are known. We made excellent
progress.
It is obvious that Eqs. (6.5) and (6.6) can be magically written in a single
equation by putting these two conditions together with their corresponding
labels y.
y(x · w + b) > w/2 or y(x · ŵ) > w/2 or mg > w/2 (6.7)
This is a simplified single-equation decision rule: when the condition is
satisfied by an arbitrary point, it belongs to the labeled domain, and is not
within the street. This single equation is a lot more convenient for developing
the algorithm called a classifier to do the task. Note that mg is called
margin, which will be formally discussed in detail in Chapter 11.
We need to now bring in a labeled dataset to find w and b using the
above decision rule. For any data-point (the ith, for example, and regardless
of which class it belongs to), it must satisfy the following equation:
yi (xi · w + b) > w/2 or yi (xi · ŵ) > w/2 or mg(i) > w/2 (6.8)
where mg(i) is the margin for the ith data-point. Because there are an
infinite number of lines (such as the red-dashed and red-dotted lines shown in
Fig. 6.1) for such a separation, there exist multiple solutions to our problem.
We just want to find one of them that satisfies Eq. (6.8) for all data-points
in the dataset. This process is called training. Because labels are used, it is
a supervised training. The trained model can be used to predict the class of
a given data-point (which may not be from the training dataset), known as
classification or prediction in general. The following is an algorithm to
perform all those: training as well as prediction.
6.2 A Python Code for the Perceptron
The following is an easy-to-follow code available at mxnet-the-straight-
dope (https://github.com/zackchase/mxnet-the-straight-dope), under the
Apache-2.0 License. We have modified it a little and added in some detailed
descriptions as comment lines with the code.
Let us examine the details in this algorithm. As usual, we import the
needed libraries.
import mxnet as mx
from mxnet import nd, autograd
import matplotlib.pyplot as plt
import numpy as np
# We now generate a synthetic dataset for this examination.
mx.random.seed(1) # for repeatable output of this code
# define a function to generate the dataset that is
# separable with a margin strt_w
def getfake(samples, dimensions, domain_size, strt_w):
wfake = nd.random_normal(shape=(dimensions)) # weights
bfake = nd.random_normal(shape=(1)) # bias
wfake = wfake / nd.norm(wfake) # normalization
# generate linearly separable data, with labels
X = nd.zeros(shape=(samples, dimensions)) # initialization
Y = nd.zeros(shape=(samples))
i = 0
while (i < samples):
tmp = nd.random_normal(shape=(1,dimensions))
margin = nd.dot(tmp, wfake) + bfake
if (nd.norm(tmp).asscalar()<domain_size) & \
(abs(margin.asscalar())>strt_w):
X[i,:] = tmp[0]
Y[i] = 1 if margin.asscalar() > 0 else -1
i += 1
return X, Y, wfake, bfake
# Plot the data with colors according to the labels
def plotdata(X,Y):
for (x,y) in zip(X,Y):
if (y.asscalar() == 1):
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='r')
else:
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='b')
# define a function to plot contour plots over [-3,3] by [-3,3]
def plotscore(w,d):
xgrid = np.arange(-3, 3, 0.02) # generating grids
ygrid = np.arange(-3, 3, 0.02)
xx, yy = np.meshgrid(xgrid, ygrid)
zz = nd.zeros(shape=(xgrid.size, ygrid.size, 2))
zz[:,:,0] = nd.array(xx)
zz[:,:,1] = nd.array(yy)
vv = nd.dot(zz,w) + d
CS = plt.contour(xgrid,ygrid,vv.asnumpy())
plt.clabel(CS, inline=1, fontsize=10)
street_w = 0.1
ndim = 2
X,Y,wfake,bfake = getfake(50,ndim,3,street_w)
#generates 50 points, in 2D space with a margin of street_w
plotdata(X,Y)
plt.show()
Figure 6.2: Computer-generated data-points that are separable with a straight line.
We now see a dataset with 50 scattered points in 2D space separated by a street of width street_w. These data-points are clearly separable by at least one line in between the blue and red data-points.
Let us first take a look at the points after their vectors are all projected on an arbitrary vector w with an arbitrary bias b (an arbitrary affine transformation). We do the same projection using also the true w and bias b used to generate all these points, for comparison. We write the following code to do so:
wa = nd.array([0.5,0.5]) # a given vector
cs = (wa [0]/nd.norm(wa)).asnumpy() # cosine value
si = (wa [1]/nd.norm(wa)).asnumpy() # sine value
ba = 0.5 * nd.norm(wa) # bias (one may change it)
Xa = nd.dot(X,wa)/nd.norm(wa) + ba # projection (affine mapping)
plt.plot(Xa.asnumpy()*cs,Xa.asnumpy()*si,zorder=1) # results
for (x,y) in zip(Xa,Y):
if (y.asscalar() == 1):
plt.scatter(x.asscalar()*cs,x. asscalar()*si,color='r')
else:
plt.scatter(x.asscalar()*cs,x.asscalar()*si, color='b')
Figure 6.3: Data-points projected on an arbitrary straight line.
cs = (wfake [0]/nd.norm(wfake)).asnumpy() # projection on true norm
si = (wfake [1]/nd.norm(wfake)).asnumpy()
Xa = nd.dot(X,wfake) + bfake # with true bias
plt.plot(Xa.asnumpy()*cs,Xa.asnumpy()*si,zorder=1)
for (x,y) in zip(Xa,Y):
if (y.asscalar() == 1):
plt.scatter(x.asscalar()*cs,x. asscalar()*si,color='r')
else:
plt.scatter(x.asscalar()*cs,x.asscalar()*si, color='b')
plt.show()
Figure 6.4: Data-points projected on a straight line that is perpendicular to a line that
separates these data-points.
It is seen that
• When these points are projected on a vector that is not the true normal
(along the blue line) direction, the blue and red points are mixed along
the blue line.
• When the true normal direction is used, all these points are distinctly
separated into two classes, blue and red, along the orange line. This dataset
is linearly separable.
Let us see how the Perceptron algorithm finds a direction and the bias,
and hence the red line that separates these two classes. We use again the algo-
rithm available at mxnet-the-straight-dope (https://github.com/zackchase/
mxnet-the-straight-dope). The algorithm is based on the following encour-
agement rule: positive events should be encouraged and negative ones should
be discouraged. This rule is used with the decision rule discussed earlier for
each data-point in a given dataset.
# The Perceptron algorithm
def Perceptron(w,b,x,y,strt_w):
if (y*(nd.dot(w,x)+b)).asscalar()<=strt_w/2:
# Decision rule: check whether this data-point falls inside the
# street defined by the line with the current parameters
w += y * x # In the street, update w
b += y # update b
update = 1
else: # Otherwise (outside the street)
update = 0 # No action
return update
The above Perception algorithm is used in an iteration to update the
learning parameters: the weights in vector w and bias b, by looping over the
dataset.
w = nd.zeros(shape=(ndim)) # starts with zero (worst case)
b = nd.zeros(shape=(1))
t = 0
print('w:',w.shape,' b:',b.shape,' X:',X.shape,' Y:',Y.shape)
for (x,y) in zip(X,Y):
update = Perceptron(w,b,x,y,street_w)
if (update == 1):
t += 1
print('In the street: update the parameters')
print('data{}, label{}'.format(x.asnumpy(),y. asscalar()))
print('weight{}, bias{}'.format(w.asnumpy(),b. asscalar()))
plotscore(w,b) # The plane with updated w and b
plotdata(X,Y) # data-points
plt.scatter(x[0].asscalar(),x[1]. asscalar(),color='g')
# currently updated data-point
plt.show()
w: (2,) b: (1,) X: (50, 2) Y: (50,)
In the street: update the parameters
data [ 0.03751943 -0.7298465 ], label -1.0
weight [-0.03751943 0.7298465 ], bias -1.0
Figure 6.5: Results at the first iteration.
In the street: update the parameters
data [-2.0401056 1.4821309], label -1.0
weight [ 2.0025861 -0.7522844], bias -2.0
Figure 6.6: Results at the second iteration.
In the street: update the parameters
data [ 1.040828 -0.45256865], label -1.0
weight [ 0.96175814 -0.29971576], bias -3.0
Figure 6.7: Results at the third iteration.
In the street: update the parameters
data [-0.934901 -1.5937568], label 1.0
weight [ 0.02685714 -1.8934726 ], bias -2.0
Figure 6.8: Results at the final iteration.
print('Total number of points:',len(Y),'; times of updates:',t)
print('weight {}, bias {}'.format(w.asnumpy(),b.asscalar()))
print('wfake {}, bfake {}'.format(wfake.asnumpy(),bfake. asscalar()))
Total number of points: 50 ; times of updates: 4
weight [ 0.02685714 -1.8934726 ], bias -2.0
wfake [ 0.0738321 -0.99727064], bfake -0.9501792788505554
It is seen that all the red dots are on the positive side of the straight line
of x · w + b = 0 with the learned parameters of the weight vector w∗ and
bias b∗ . All the data marked with blue dots are on the negative side of the
line. In the entire process, all these points stay still, and the updates are
done only on the weight vector w and bias b. We shall now examine the
fundamental reasons for this simple algorithm to work.
6.3 The Perceptron Convergence Theorem
Theorem: Consider a dataset with a finite number of data-points. The ith
data-point is paired with its label as [xi, yi]. Any data-point xi is bounded by ‖xi‖ ≤ R < ∞, and its label is yi ∈ {±1}.
• If the data-points are linearly separable, meaning that there exists at least one pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and b² ≤ 1, such that yi(xi · w∗ + b∗) ≥ w/2 > 0 for all data pairs, where w is a given scalar of the street width,
• then the Perceptron algorithm converges after at most t = 2(R² + 1)/w² ∝ (R/w)² iterations, with a pair of parameters (wt, bt) forming a line x · wt + bt = 0 that separates the data-points into two classes.
We now prove this Theorem largely following the procedure with codes
at mxnet-the-straight-dope (https://github.com/zackchase/mxnet-the-
straight-dope/blob/master/chapter01 crashcourse/probability.ipynb), under
the Apache-2.0 License. We first check the convergence behavior numerically
(this may take minutes to run).
ws = np.arange(0.025,0.45,0.025) #generate a set of street widths
number_iterations = np.zeros(shape=(ws.size))
number_tests = 10
for j in range(number_tests): #set number of tests to do
for (i,wi) in enumerate(ws):
X,Y,_,_=getfake(1000,2,3,wi) #generate dataset
for (x,y) in zip(X,Y):
number_iterations[i] += Perceptron(w,b,x,y,wi)
#for each test, record the number of updates
number_iterations = number_iterations / 10.0
plt.plot(ws,number_iterations,label='Average number of iterations')
plt.legend()
plt.show()
The test results are plotted in Fig. 6.9. It shows that the number of iterations increases with the decrease of the street width w, and the rate is roughly quadratic (inversely). This test supports the convergence theorem. Let us now prove this in a more rigorous mathematical manner.
Figure 6.9: Convergence behavior examined numerically.
The proof assumes that the data are linearly separable. Therefore, there exists a pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and b² ≤ 1. Let us
examine the inner product of the current set of parameters ŵ with the
assumed existing ŵ∗ at each iteration. What we would like the iteration
to do is to update the current ŵ to approach ŵ∗ iteration by iteration, so
that their inner product can get bigger and bigger. Eventually, they can be
parallel with each other. Let us see whether this is really what is happening
in the Perceptron algorithm given above. Our examination is also iteration
by iteration but considers only the iterations when an update is made by the
algorithm, because the algorithm does nothing otherwise. This means that
we perform an update only when yt(xt · ŵt) ≤ w/2 at the tth step.
At the initial setting in the algorithm, t = 0, we have no idea what ŵ should be, and thus set ŵ0 = 0. Here, for a neat formulation using the dot-product, we assume that the column vectors ŵ0 and ŵ∗ are flattened to (row) vectors, so that ŵ can take a dot-product directly with any other (flattened) ŵ resulting from the iteration process. This can be done easily in numpy using the flatten() function. Thus we have, at the initial setting,
ŵ0 · ŵ∗ = 0
At t = 1, following the algorithm, we bring in arbitrarily a data-point, say x1 with y1. We shall find y1(x1 · ŵ0) = 0 ≤ (w/2), because at this point the current ŵ0 is still all zero. Therefore, data-point x1 with y1 is in the street defined by the line with the current ŵ0. Next, we perform the following update:
ŵ1 = ŵ0 + y1 x1
We thus have,
ŵ1 · ŵ∗ = ŵ0 · ŵ∗ + y1 (x1 · ŵ∗ ) ≥ w/2
This is because yi(xi · ŵ∗) ≥ w/2 is given by the Theorem as a condition. We see that the direction of vector ŵ1 approaches that of ŵ∗ by w/2. They are w/2 more aligned.
Similarly, at t = 2, following the algorithm, we have the following update.
ŵ2 = ŵ1 + y2 x2
We now have,
ŵ2 · ŵ∗ = ŵ1 · ŵ∗ + y2 (x2 · ŵ∗ ) ≥ 2(w/2)
This is because of the result obtained at t = 1 and the addition of the condition yi(xi · ŵ∗) ≥ w/2 given by the Theorem. We see that the direction of vector ŵ2 approaches that of ŵ∗ by 2(w/2). They are now 2(w/2) more aligned.
It is clear that the inner product gains one (w/2) in each iteration. Now, at the tth update, we shall have
ŵt · ŵ∗ ≥ t(w/2)    (6.9)
We see that the direction of vector ŵt approaches that of ŵ∗ by t(w/2). It is clear that the algorithm drives ŵ more and more into alignment with ŵ∗, at a linear rate of (w/2). The wider the street, the faster the convergence.
We next examine the evolution of the length (amplitude) of vector ŵt+1 .
‖ŵt+1‖² = ŵt+1 · ŵt+1 = (ŵt + yt xt) · (ŵt + yt xt)
= ŵt · ŵt + 2 yt xt · ŵt + yt² xt · xt    (6.10)
= ‖ŵt‖² + 2 yt xt · ŵt + yt² ‖xt‖²
Using the conditions given by the Theorem: ‖x̂i‖² = ‖xi‖² + 1 ≤ R² + 1 (the extra 1 accounts for the appended entry of 1), yi ∈ {±1}, and yt(xt · ŵt) ≤ w/2 (this is the condition for starting the tth update), we shall have
‖ŵt+1‖² ≤ ‖ŵt‖² + R² + 1 + w
When t = 0, we have ŵ0 = 0, and hence
‖ŵ1‖² ≤ ‖ŵ0‖² + R² + 1 + w = R² + 1 + w
When t = 1, we shall have
‖ŵ2‖² ≤ ‖ŵ1‖² + R² + 1 + w ≤ 2(R² + 1 + w)
This means that each iteration adds at most (R² + 1 + w). At the t = T iteration, we shall have
‖ŵT‖² ≤ T(R² + 1 + w)    (6.11)
Using the Cauchy-Schwartz inequality, i.e., a·b ≥ a·b, and then Eq. (6.9),
we obtain,
ŵT ŵ∗ ≥ ŵT · ŵ∗ ≥ T (w/2)
Using the conditions given by the Theorem: ŵ∗ = w∗ + b ≤ 1 + 1 = 2,
we have
√
ŵT 2 ≥ T (w/2)
Combining this with the inequality Eq. (6.11) yields,
√(2T(R² + 1 + w)) ≥ T(w/2)    (6.12)
Let us examine this inequality. The number of iterations T appears only under a square root on the left side of Eq. (6.12) but linearly on the right side, while all other quantities are constants. Thus, this inequality cannot hold for arbitrarily large T. Therefore, T must be limited to satisfy Eq. (6.12), which means that the Perceptron algorithm will converge in a finite number of iterations, with

T ≤ 8(R² + 1 + w)/w² ∝ (R/w)²
This can also be written as

R/w ∝ √T    (6.13)
This means that the number of iterations needed for the Perceptron algorithm to converge grows with the square of the relative measure R/w, i.e., the data bound relative to the street width.
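As a quick numerical illustration of this scaling (a minimal sketch; the chosen values of R and w are arbitrary), the upper bound on the number of updates can be evaluated directly:

def iteration_bound(R, w):
    # Upper bound on the number of Perceptron updates: T <= 8(R^2 + 1 + w)/w^2
    return 8.0 * (R**2 + 1.0 + w) / w**2

for R, w in [(1.0, 0.5), (2.0, 0.5), (4.0, 0.5)]:
    print(R, w, iteration_bound(R, w))   # the bound grows roughly as (R/w)^2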
We now give the following remarks:
1. The Perceptron convergence proof requires that the data-points be separable by a line.
2. The convergence is independent of the dimensionality of the data. This
is not a condition used in the proof. It is also independent of the number
of observations.
3. The number of iterations increases with the decrease of the street width
w, and the rate is (inversely) quadratic. This echoes the numerical test
conducted earlier.
4. The number of iterations increases with the increase of the upper bound of the data R, and the rate is also quadratic. Together with the previous remark, this shows that what matters is the relative measure R/w: if the street is wide relative to the overall spread of the data, the two classes are easier to separate, which is intuitively understandable.
5. The algorithm updates only for data-points that are not yet in sufficient alignment, i.e., those with yt (xt · ŵt) ≤ w/2. It simply skips the data-points that are already in alignment, as illustrated in the short sketch below.
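For concreteness, the update rule analyzed in this proof can be sketched as follows (a minimal sketch under the assumptions above; the function name, the stopping test, and the choice of half_width are illustrative only, not the exact code used earlier in this chapter):

import numpy as np

def perceptron_updates(X, y, half_width, max_epochs=1000):
    # X: (n, p) data matrix; y: labels in {-1, +1}; half_width: the threshold w/2
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # augment each data-point with 1
    w_hat = np.zeros(Xa.shape[1])                  # initial setting: w_hat_0 = 0
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for xt, yt in zip(Xa, y):
            if yt * (xt @ w_hat) <= half_width:    # data-point inside the street
                w_hat += yt * xt                   # update; aligned points are skipped
                updates += 1
                changed = True
        if not changed:                            # no data-point violates the margin
            break
    return w_hat, updates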
6.4 Support Vector Machine
6.4.1 Problem statement
In our discussions above on the Perceptron, it is seen that there are in fact
an infinite number of solutions of parameters w and b in the hypothesis
space to form straight lines for a linearly separable dataset, as long as the
street width w has a finite value. Readers may observe the different straight
lines obtained by simply starting the classifier with different initial weights
and biases. One may naturally ask what the best solution is among all these
possible solutions. One answer is that the line that separates these two classes of data with the largest “street” width and sits on the middle line of the street may be the best. To obtain this widest-street optimal solution, one can formulate the problem as an optimization one. In fact, if efficiency is not a concern, one can simply bring in an optimization algorithm to control the Perceptron over multiple trials to find an optimal solution.
Here, we introduce the well-known support vector machine, or SVM. The initial idea was invented in the early 1960s in Vapnik’s PhD thesis, and it became popular in the 1990s when it (with the kernel trick) was applied to handwritten digit recognition [3, 4].
The SVM is an effective algorithm that is constructed using a systematic formulation and the Lagrange multiplier approach. Given below is a detailed description and formulation of SVM. A good reference is the excellent lecture by Prof. Patrick Winston at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o), and also some workings available online (https://towardsdatascience.com/support-vector-machine-python-example-d67d9b63f1c8).
6.4.2 Formulation of objective function and constraints
Our formulation continues from the formulation we derived for the Percep-
tron. The difference here is that the street width is no longer given in this
SVM setting. We just assume it is there. We must find a way to formulate
the street width, and then maximize it for a given dataset. We first derive
the formula for the street width.
Because we assume that the dataset is linearly separable, there must be a street of some finite width w. From the decision rule we derived,
we know the equation for the middle line of the street can be written in an
affine transformation form as
x·w+b=0 (6.14)
where b ∈ W is a learning parameter called bias, x ∈ Xp is a position
vector of an arbitrary point in the feature space with linear polynomial
bases (features), x1 , x2 , . . . , xp ,
x = [x1 , x2 , . . . , xp ] (6.15)
and w ∈ Wp is a vector of weights that are also learning parameters
w1 , w2 , . . . , wp :
w = [w1 , w2 , . . . , wp ] (6.16)
This middle line in a 2D feature space X2 is shown in Fig. 6.10 with a red dash-dot line; it is approximated using the weights w and bias b in the hypothesis space W3.
Figure 6.10: Linearly separable data-points in 2D space, the width of the street is to be
maximized using SVM.
On the upper-right gutter, there should be at least one positive data-
point, say x1 , right on it. Its vector x1 is the blue arrow, a support vector
to the gutter. Because x1 belongs to the positive class, its label is +1. The
equation for this gutter line of the street can be given as
x1 · w + b = +1 (6.17)
This is the decision border for data-point x1 . Similarly, on the lower-left
gutter, there is at least one negative data-point, say x2 , right on it. Its vector
x2 supports the gutter. Because x2 belongs to the negative class, its label is
−1. The equation for this gutter line can be given as
x2 · w + b = −1 (6.18)
Consider now the projections of these two support vectors, x1 and x2, on the unit normal of the middle line of the street, w/‖w‖. The projection of x1 gives the (Euclidean) distance of the upper-right gutter to the origin along the normalized w. Similarly, the projection of x2 gives the distance of the lower-left gutter to the origin along the normalized w. Therefore, their difference gives the width of the street:

w = (x1 · w/‖w‖ + b) − (x2 · w/‖w‖ + b) = (x1 · w − x2 · w)/‖w‖    (6.19)
Substituting Eqs. (6.17) and (6.18) into (6.19), we obtain

w = 2/‖w‖    (6.20)
A very simple formula. We now have the equation for the street width, and it depends only on the training-parameter weights w! This is not so difficult
to understand because these weights determine the orientation of the gutter
of the street and hence the direction of the street. When the weights change,
the street turns accordingly while remaining in touch with both data-points
x1 and x2 , which results in a change in the street width. The bias b affects
only the translational location of a line. Because the street width is the
difference of the two gutter lines, the bias is thus canceled. Therefore, the
bias b should not affect the width. Here, we observe the fact that the affine space is not a vector space (as mentioned in Chapter 1): the difference of two vectors in the affine space comes back to the feature space (because of the cancellation of the augmented 1).
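To make Eq. (6.20) concrete, the following small check (a minimal sketch; the particular weights and bias are arbitrary) constructs one point on each gutter and verifies that the projected distance between them equals 2/‖w‖:

import numpy as np

w = np.array([3.0, 4.0])                      # an arbitrary weight vector, norm = 5
b = -2.0                                      # an arbitrary bias

# Points on the gutters x.w + b = +1 and x.w + b = -1, taken along the direction of w
x1 = ((+1.0 - b) / (w @ w)) * w               # on the upper-right gutter
x2 = ((-1.0 - b) / (w @ w)) * w               # on the lower-left gutter

width = (x1 - x2) @ (w / np.linalg.norm(w))   # project the difference onto w/||w||
print(width, 2.0 / np.linalg.norm(w))         # both print 0.4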
However, why is the street width inversely related to the norm of w? This may seem counterintuitive, but it is really the mathematics at work. To examine what is really happening here, let us look at a simpler setting where both x1 and x2 (which are on the gutters) are sitting on the x2-axis, as shown in Fig. 6.11.

Figure 6.11: Change of the width of the street when the street is turned with respect to w.
When the gutters are at the horizontal direction, the equation for the
upper gutter is
0 · x1 + 1 · x2 = +1 − b    (6.21)
The street width is w0 . The normal vector w and its norm are given as
follows:
w = [0, 1],   ‖w‖ = √(0² + 1²) = 1    (6.22)
When these gutters rotate to have a slope k while remaining supported by
both x1 and x2 , the equation for the upper gutter becomes
k · x1 + 1 · x2 = +1 − b    (6.23)
The new normal vector wk and its norm are given as follows:
wk = [k, 1],   ‖wk‖ = √(k² + 1²)    (6.24)
It is obvious that the street width after the rotation, wk, is smaller than the original street width before the rotation, w0, while the norm of w has increased. This is true for any nonzero value of k. The street width is at its maximum when the street is along the horizontal direction, which is perpendicular to the vector x1 − x2. We write the following code to plot this relationship:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig, figarr = plt.subplots(1, 2, figsize=(11, 4))

def w_norm(k):                 # norm of the weight vector w for a gutter of slope k
    return np.sqrt(1. + k**2)

k = np.arange(-10, 10, .1)     # a range of gutter slopes
y = 2/w_norm(k)                # width of the street, w = 2/||w||

figarr[0].plot(k, y, c='r')
figarr[0].set_xlabel('Slope of the street gutter, $k$')
figarr[0].set_ylabel('Width of the street')
figarr[1].plot(w_norm(k), y)
figarr[1].set_xlabel('Norm of the weight vector')
figarr[1].set_ylabel('Width of the street')
plt.show()
Figure 6.12: Variation of the street width with the slope of the street gutter (left) and
with the norm of the weight vector (right).
It is clear from Fig. 6.12 that the street width is inversely related to the norm of w.
Most importantly, this analysis shows that if the street width is maxi-
mized, w must be perpendicular to the gutters (decision boundaries). This
conclusion is true for an arbitrary pair of data-points x1 and x2 on these two gutters.
Now, Eq. (6.19) can be rewritten as
w = [x1 − x2] · w/‖w‖    (6.25)
This means that when w is maximized, the inner product of [x1 − x2] and w/‖w‖ is maximized (these two vectors are parallel), where [x1 − x2] is a vector built from a pair of data-points in the linear polynomial bases x1, x2, . . ., used to approximate a line in the feature space, and w/‖w‖ is the vector of the normalized weights or tuning/optimization parameters. The use of linear polynomial bases here is because we assume the data-points are linearly separable by a hyperplane.
Remember that our original goal is to find the maximum street width. Based on Eq. (6.20), this is equivalent to minimizing the norm of w, which in turn is the same as minimizing ½‖w‖². The benefit of such simple conversions will soon be evidenced. We now have our objective function:

L = ½ ‖w‖² = ½ w · w    (6.26)
The above function needs to be minimized. We see a nice property of the above formulation: the objective function is quadratic, and its Hessian matrix is the identity matrix, which is clearly SPD. Therefore, it has one and only one minimum, and the local minimum is the global one. This is the fundamental reason
why local minim