Comp Data Science
Introduction to
Using ScalaTion
...
John A. Miller
Department of Computer Science
University of Georgia
...
Brief Table of Contents
I Foundations 47
2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
II Modeling 149
6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
III Simulation 567
Appendices 721
A Optimization in Data Science 723
A.1 Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
A.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
A.5 Stochastic Gradient Descent with Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
A.6 SGD with ADAptive Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
Contents
I Foundations 47
2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.1 Vector Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.2 Element-wise Multiplication and Division . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.3 Vector Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.4 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.5 Vector Operations in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.1 Gradient Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Jacobian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.3 Hessian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.1 Matrix Operation in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.1 Three Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.2 Four Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.1 Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.1 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.2 Continuous Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Probability Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.1 Continuous Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.2 Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.1 Expectation is a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.2 Variance is not a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.3 Convolution of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.1 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.2 Quantile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.1 Discrete Case: Joint and Marginal Mass . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.2 Continuous Case: Joint and Marginal Density . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.3 Discrete Case: Conditional Mass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.4 Continuous Case: Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.8.6 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.8.7 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11.1 Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11.2 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.11.3 Estimation for Discrete Outcomes/Responses . . . . . . . . . . . . . . . . . . . . . . . 97
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.12.1 Positive Log Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.12.2 Joint Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.3 Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.4 Relative Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.5 Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.12.6 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.12.7 Probability Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5 Data Preprocessing 141
5.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.1 Remove Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.2 Convert String Columns to Numeric Columns . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.3 Identify Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.4 Preliminary Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Methods for Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.1 Based on Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.2 Based on InterQuartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.3 Based on Quantiles/Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3 Imputation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 Imputation Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.4 Align Multiple Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 Creating Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
II Modeling 149
6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.1.1 Predictor Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.1 Fit Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3.5 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4.5 SimplerRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.5 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.5.6 SimpleRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.6.4 Matrix Inversion Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.6.5 LU Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.6.6 Cholesky Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.6.7 QR Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.8 Use of Factorization in Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.9 Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.6.10 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.11 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.6.12 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.6.13 Regression Problem: Texas Temperatures . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.6.14 Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.16 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.4 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.5 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.7.6 Comparing RidgeRegression with Regression . . . . . . . . . . . . . . . . . . . . . . 204
6.7.7 RidgeRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.7.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.3 Optimization Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.4 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.5 Regularized and Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.6 LassoRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.2 Comparison of quadratic and Regression . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.3 SymbolicRegression.quadratic Method . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.9.4 Quadratic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.9.5 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.2 Comparison of cubic, quadratic and Regression . . . . . . . . . . . . . . . . . . . . 222
6.10.3 SymbolicRegression.cubic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.10.4 Cubic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.1 Sample Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.2 As a Data Science Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.3 SymbolicRegression Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.11.4 Implementation of the apply Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.11.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.12.3 Square Root Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.12.4 Log Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.12.5 Reciprocal Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
6.12.6 Box-Cox Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.7 Quality of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.8 TranRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.1 Handling Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.2 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.3 RegressionCat Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.4 RegressionCat Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.2 Root Absolute Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.3 RegressionWLS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.2 PolyRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.3 PolyORegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.2 TrigRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.1.1 Classifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.2.1 FitC Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.3.1 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.1 Factoring the Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.2 Estimating Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.4.3 Laplace Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
7.4.4 Table Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.5 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.6 The test Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.7 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.8 The lpredictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.9 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.10 NaiveBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.5.1 BayesClassifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.6.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.2 Conditional Probability Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.3 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7.6.4 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.5 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.6 TANBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.7.1 Network Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.1 Markov Blanket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.2 Factoring the Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.2 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.3 Early Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.4 DecisionTree Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.5 DecisionTree ID3 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.9.6 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.9.7 DecisionTree ID3wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.10.2 Forward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
7.10.3 Backward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.4 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.6 Reestimation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.7 HiddenMarkov Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
8.7.2 DecisionTree C45 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
8.7.3 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.4 DecisionTree C45wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.8 Bagging Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.1 Creating Subsample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.3 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.8.4 BaggingTrees Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.9 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.1 Extracting Sub-features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.3 RandomForest Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.10 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.1 Separating Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.2 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
8.10.3 Running the Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.4 SupportVectorMachine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.11 Neural Network Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.2 Training Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.3 Prediction Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.5 NeuralNet Class 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
9.5.4 RegressionTreeMT class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.6 Random Forest Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.6.1 RegressionTreeRF Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.7 Gradient Boosting Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.7.1 RegressionTreeGB Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.9 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
10.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
10.6.5 NeuralNet 2L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.6.6 NeuralNet 2L Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.7 Three-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.2 Ridge Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
10.7.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.5 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
10.7.6 train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.7 Stochastic Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.8 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
10.7.9 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.10 NeuralNet 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
10.8 Multi-Hidden Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.4 Number of Nodes in Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.5 Avoidance of Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.6 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.7 NeuralNet XL Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
10.9 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.10 1D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
10.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.10.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.10.5 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
10.10.6 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.7 Two Convolutional Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.8 CNN 1D Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
10.10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.1 Filtering Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.2 Pooling Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
10.11.3 Flattening Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
10.11.4 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.1 Definition of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.2 Type of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
10.12.3 NeuralNet XLT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
10.12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.4 ELM 3L1 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
10.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
11.4.5 AR Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.5 Moving-Average (MA) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
11.5.1 MA(q) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
11.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
11.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
11.6 Auto-Regressive, Moving Average (ARMA) Models . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.1 Selection Based on ACF and PACF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
11.6.3 ARMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
11.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
11.7 Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.1 1-Fold Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.2 Rolling Validation and the Forecasting Matrix . . . . . . . . . . . . . . . . . . . . . . 478
11.7.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
11.8 ARIMA (Integrated) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.1 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.3 Backshift Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.4 Stationary Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
11.8.5 ARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
11.8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
11.9 SARIMA (Seasonal) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.1 Determination of the Seasonal Period . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.2 Seasonal Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.3 Seasonal AR and MA Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.4 Case Study: COVID-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.5 SARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
11.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
12.2.1 Model Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
12.2.2 SARIMAX Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.3 Vector Auto-Regressive (VAR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.1 VAR(p, 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.2 VAR(p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.4 VAR Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.5 AR∗(p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.4 Nonlinear Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.1 Nonlinear Autoregressive (NAR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.2 Autoregressive Neural Network (ARNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.3 Nonlinear Autoregressive, Moving-Average (NARMA) . . . . . . . . . . . . . . . . . . 509
12.5 Recurrent Neural Networks (RNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.1 RNN(1, 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.2 RNN(p, nh) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
12.5.3 RNN(p, nh, nv) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
12.5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
12.6 Gated Recurrent Unit (GRU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
12.6.1 A GRU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
12.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
12.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
12.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
12.7 Minimal Gated Unit (MGU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
12.8 Long Short Term Memory (LSTM) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
12.9 Encoder-Decoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.9.1 Simple Encoder-Decoder Consisting of Two GRU Cells . . . . . . . . . . . . . . . . . . 530
12.9.2 Teacher Forcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.3 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
12.10 Transformer Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.2 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
12.10.3 Encoder-Decoder Architecture for Transformers . . . . . . . . . . . . . . . . . . . . . . 536
12.10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
12.10.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
13 Dimensionality Reduction 539
13.1 Reducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
13.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
13.3 Autoencoder (AE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.2 Denoising Autoencoder (DAE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.1 KNN Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.1 Initial Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.2 Reassignment of Points to Closest Clusters . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.4 KMeansClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
14.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.4.1 Adjusted Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.2 KMeansClusteringHW Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.1 Picking Initial Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.2 KMeansClustererPP Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.2 ClusteringPredictor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.1 HierClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.1 MarkovClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
15.2 Types of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.2.1 Example: Modeling an M/M/1 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.3 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.1 Example RNG: Random0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.2 Testing Random Number Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
15.3.3 Example RNG: Random3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.4 Random Variate Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.1 Inverse Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.2 Convolution Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
15.4.3 Acceptance-Rejection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.5 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
15.5.1 Generating a Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
15.5.2 Generating a Non-Homogeneous Poisson Process . . . . . . . . . . . . . . . . . . . . . 585
15.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
15.6 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.1 Simulation of Card Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.2 Integral of a Complex Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
15.6.3 Grain Dropping Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
15.6.4 Simulation of the Monty Hall Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.7 Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
15.7.1 Little’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
15.7.2 Event Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
15.7.3 Spreadsheet Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.8 Tableau-Oriented Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.1 Iterating through Tableau Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.2 Reproducing the Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.3 Customized Logic/Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.4 Tableau.scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
15.8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
16.2.4 MarkovChain Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.5 Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.6 Limiting/Steady-State Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
16.2.7 MarkovChainCT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.8 Queueing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.9 MMc Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.10 MMcK Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
16.3 Dynamic Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
16.3.1 Example: Traffic Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
16.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
16.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.1 Example: Golf Ball Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
16.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
16.5 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.2 Example: SEIHRD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
16.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
16.6 ODE Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
18 Process-Oriented Models 667
18.1 Base Traits and Classes for Process-Oriented Models . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.1 Identifiable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.2 Locatable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.3 Modelable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.4 Temporal Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.2 Concurrent Processing of Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.1 Java’s Thread Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.2 ScalaTion’s Coroutine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
18.3 Process Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
18.3.1 Model Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
18.3.2 Component Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.3 Example: BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.4 Executing the Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.5 Network Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.6 Comparison to Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.7 SimActor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
18.3.8 Source Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
18.3.9 Sink Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
18.3.10 Transport Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.11 Resource Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.12 WaitQueue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
18.3.13 WaitQueue LCFS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.14 Junction Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.15 Gate Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
18.3.16 Route Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.17 Model Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.18 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
18.3.19 Model MBM Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.3.20 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.4 Agent-Based Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
18.4.1 SimAgent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
18.4.2 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.3 Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.4 Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
18.4.5 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
18.4.6 Hybrid Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.5 Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.1 2D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.2 3D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
18.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
19 Simulation Output Analysis 705
19.1 Point and Interval Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
19.2 One-Shot Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
19.3 Simulation Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
19.3.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
19.3.2 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
19.4 Method of Independent Replications (MIR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
19.4.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
19.4.2 Example: MIR Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
19.5 Method of Batch Means (MBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.1 Effect of Increasing the Number of Batches . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.2 Effect on Batch Correlation of Increasing the Batch Size . . . . . . . . . . . . . . . . . 716
19.5.3 MBM versus MIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
19.5.4 Relative Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.5.5 Example: MBM Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Appendices 721
A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.1 Newton-Raphson Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.2 Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.3 BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
A.9.4 Limited Memory-BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
A.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
A.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.11.1 Active and Inactive Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.14.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.2 LassoAddm Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
A.15.1 NelderMeadSimplex Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
A.15.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
B.4.1 Embedding Relationships in Vertex-Types . . . . . . . . . . . . . . . . . . . . . . . . . 792
B.4.2 Resource Description Framework (RDF) Graphs . . . . . . . . . . . . . . . . . . . . . 793
B.4.3 From Relational to Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
B.5 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.1 Type Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.2 Constraints and Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
B.5.3 KGTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
B.6 Exercises - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
B.7 Graph Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.1 Path Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.2 Centrality and Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.3 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.8 Graph Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.1 Graph Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.2 Subgraph Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.3 Graph Homomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.4 Application to Query Processing in Graph Databases . . . . . . . . . . . . . . . . . . 803
B.9 Graph Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.1 Matrix Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.2 Graph Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
B.10 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
B.10.1 AGGREGATE and COMBINE Operations . . . . . . . . . . . . . . . . . . . . . . . . 807
B.11 Exercises - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
Preface
Applied Mathematics accelerated starting with the differential equations of Euler's analytical mechanics,
published in the early 1700s [45, 117]. Over time, increasingly accurate mathematical models of natural phenomena
were developed. The models are scrutinized by how well they match empirical data and related models.
Theories were developed that featured a collection of consistent, related models. In his theory of Universal
Gravity [132], Newton argued for the sufficiency of this approach, while others sought to understand the underlying
substructures and causal mechanisms [117].
Data Science can trace its lineage back to Applied Mathematics. One way to represent a mathematical
model is as a function f : Rn → R.
y = f (x, b) + ε
This illustrates that a response variable y is functionally related to other predictive variables x (vector in
bold font). Uncertainty in the relationship is modeled as a random variable ε (blue font) that follows some
probability distribution.
Making useful predictions, or even inferences such as that one product lasts longer than another, is
clouded by this uncertainty. DeMoivre developed a limiting distribution for the Binomial Distribution.
Laplace derived a central limit theorem showing that the sample means from several distributions follow
this same distribution. Gauss [180] studied this uncertainty and deduced a distribution for measurement
errors from basic principles. This distribution is now known as the Gaussian or Normal distribution. Infer-
ences such as which of two products has the longer expected lifetime can now be made to a certain level of
confidence. Gauss also developed the method of least squares estimation.
Momentum in using probability distributions to analyze data, fit parameters and make inferences under
uncertainty led to mathematical statistics emerging from applied mathematics in the late 1800s. In par-
ticular, Galton and Pearson collected and transformed statistical techniques into a mathematical discipline
(e.g., Pearson correlation coefficient, method of moments estimation, p-value, Chi-square test, statistical
hypothesis testing, principal component analysis). In the early 1900s, Gosset and Fisher expanded mathe-
matical statistics (e.g., analysis of variance, design of experiments, maximum likelihood estimation, Student’s
t-distribution, F-distribution).
With the increasing capabilities of computers, the amount of data available for training models grew
rapidly. This led Computer Scientists into the fray, with machine learning coined in 1959 and data mining
beginning in the late 1980s. Machine Learning developed slowly over the decades until the invention of the
back-propagation algorithm for neural networks in the mid 1980s led to important advances. Data Mining
billed itself as finding patterns in data. Databases are often utilized and data preprocessing is featured, in
the sense that mining through large amounts of data should be done with care.
With greater computing capabilities and larger amounts of data, statistics and machine learning are
leaning toward each other: The emphasis is to develop accurate, interpretable and explainable models
for prediction, classification and forecasting. Data may also be clustered and simulation models that mimic
phenomena or systems may be created. Training a model is typically done using an optimization algorithm
(e.g., gradient descent) to minimize the errors in the model’s predictions. These constitute the elements of
data science.
This book is an introduction to data science that includes mathematical and computational foundations.
It is divided into three parts: (I) Foundations, (II) Modeling, and (III) Simulation. A review of Optimization
from the point of view of data science is included in the Appendix. The level of the book is College Junior
through beginning Graduate Student. The ideal mathematical background includes Differential, Integral and
Vector Calculus, Applied Linear Algebra, Probability and Mathematical Statistics. The following advanced
topics may be found useful for Data Science: Differential Equations, Nonlinear Optimization, Measure
Theory, Functional Analysis and Differential Geometry. Data Science also involves Computer Programming,
Database Management, Data Structures and Algorithms. Advanced topics include Parallel Processing,
Distributed Systems and Big Data frameworks (e.g., Hadoop and Spark). This book has been used in the
Data Science I and Data Science II courses at the University of Georgia.
Chapter 1
algorithms used for learning. Mathematical derivations are provided for the loss functions that are used
to train the models. Short Scala code snippets are provided to illustrate how the algorithms work. The
Scala object-oriented, functional language allows the creation of concise code that looks very much like the
mathematical expressions. Modeling based on ordinary differential equations and simulation models are also
provided.
The prerequisite material for data science includes Vector Calculus, Applied Linear Algebra and Calculus-
based Probability and Statistics. Datasets can be stored as vectors and matrices, learning/parameter esti-
mation often involves taking gradients, and probability and statistics are needed to handle uncertainty.
1.2 ScalaTion
ScalaTion supports multi-paradigm modeling that can be used for simulation, optimization and analytics.
In ScalaTion, the modeling package provides tools for performing data analytics. Datasets are becom-
ing so large that statistical analysis or machine learning software should utilize parallel and/or distributed
processing. Databases are also scaling up to handle greater amounts of data, while at the same time increas-
ing their analytics capabilities beyond the traditional On-Line Analytic Processing (OLAP). ScalaTion
provides many analytics techniques found in tools like MATLAB, R and Weka. The analytics component
contains six types of tools: predictors, classifiers, forecasters, clusterers, recommenders and reduc-
ers. A trait is defined for each type.
To use ScalaTion, go to the Website http://www.cs.uga.edu/~jam/scalation.html and click on the
most recent version of ScalaTion and follow the first three steps: download, unzip, build.
Current projects are targeting Big Data Analytics in four ways: (i) use of sparse matrices, (ii) parallel
implementations using Scala’s support for parallelism (e.g., .par methods, parallel collections and actors),
(iii) distributed implementations using Akka, and (iv) high performance data stores including columnar
databases (e.g., Vertica), document databases (e.g., MongoDB), graph databases (e.g., Neo4j) and distributed
file systems (e.g., HDFS).
2. modeling - regression models with sub-packages for classification, clustering, neural networks, and time
series
3. simulation - multiple simulation engines
• if
1 if x < y then if x < y :
2 x += 1 x += 1
3 else if x > y then elif x > y :
4 y += 1 y += 1
5 else else :
6 x += 1 x += 1
7 y += 1 y += 1
8 end if
The else and end are optional, as are the line breaks. Note, the x += 1 shortcut simply means x =
x + 1 for both languages.
• match
1 z = c match match c :
2 case ’+ ’ = > x + y case ’+ ’:
3 z = x + y
4 case ’ - ’ = > x - y case ’ - ’:
5 z = x - y
6 case ’* ’ = > x * y case ’* ’:
7 z = x * y
8 case ’/ ’ = > x / y case ’/ ’:
9 z = x / y
10 case _ = > println ( " not supported " ) case _ :
11 print ( " not supported " )
In Scala 3, the case may be indented like Python. Also an end may be added.
• while
1 while x <= y do while x <= y :
2 x += 0.5 x += 0.5
3 end while
• for
1 for i <- 0 until 10 do for i in range (0 , 10) :
2 a ( i ) = 0.5 * i ~^ 2 a [ i ] = 0.5 * i **2
3 end for
The end is optional, as are the line breaks. Note: for i <- 0 to 10 do will include 10, while until
will stop at 9. Both Scala and Python support other varieties of for. The for-yield collects all the
computed values into a.
1 val a = for i <- 0 until 10 yield 0.5 * i ~^ 2
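The Python analogue of the for-yield above is a list comprehension, which likewise collects the computed values into a:

```python
# collect 0.5 * i^2 for i = 0, 1, ..., 9 into a list
a = [0.5 * i ** 2 for i in range(10)]
print(a[:3])   # [0.0, 0.5, 2.0]
```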
• cfor
1 var i = 0
2 cfor ( i < 10 , i += 1) {
3 a ( i ) = 0.5 * i ~^ 2
4 } // cfor
This for follows more of a C-style, provides improved efficiency and allows returns inside the loop. It
is defined as follows:
1 inline def cfor ( pred : = > Boolean , step : = > Unit ) ( body : = > Unit ) : Unit =
2 while pred do { body ; step }
3 end cfor
• try
1 try try :
2 file = new File ( " myfile . csv " ) x = 1 / 0
3 catch except ZeroDivisionError :
4 case ex : FileNotFound = > println ( " not found " ) print ( " division by zero " )
5 end try
The end is optional. Both languages support a finally clause, and Python provides a shortcut with
statement that comes in handy for opening files and automatically closing them at the end of the
statement's scope.
• assign with if
1 val y = if x < 1 then sqrt ( x ) else x ~^ 2 y = sqrt ( x ) if x < 1 else x **2
All Scala control structures return values and so can be used in assignment statements. Note, prefix
sqrt with math for Python.
Note, the end tags are optional since Scala 3 uses significant indentation like Python.
Optionally, an end hypotenuse may be added, which is often useful for functions that include several lines
of code. The Python code below is very similar, with the exception of the exponentiation operator: ~^ for
ScalaTion and ** in Python. Outside of the scalation package, add import scalation.~^. Both Double in Scala and
float in Python indicate 64-bit floating point numbers.
1 import math
2
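A sketch of such a hypotenuse function in Python, assuming the definition suggested by the surrounding text (the exact original listing is not reproduced here):

```python
import math

def hypotenuse(a: float, b: float) -> float:
    """Length of the hypotenuse, using Python's ** exponentiation operator."""
    return math.sqrt(a ** 2 + b ** 2)

print(hypotenuse(3.0, 4.0))   # 5.0
```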
The dot product operator on vectors is used extensively in data science. It multiplies corresponding elements
of the two vectors and then sums the products. An implementation in ScalaTion is given, followed by a
similar implementation in Python that includes type annotations for improved readability and type checking.
1 import scalation . mathstat . VectorD
2
1 import numpy as np
2
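A loop-based sketch of the dot product in Python (the helper name dot_product is illustrative, not part of numpy):

```python
def dot_product(x, y) -> float:
    """Multiply corresponding elements of x and y, then sum the products."""
    assert len(x) == len(y), "vectors must have the same dimension"
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 1*4 + 2*5 + 3*6 = 32.0
```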
Note, see the Chapter on Linear Algebra for more efficient implementations of dot product. Also, both
numpy.ndarray and VectorD directly provide dot product.
1 val z = x dot y
1 z = x . dot ( y )
In cases where the arguments are 2D arrays, np.dot is the same as matrix multiplication (x @ y) and for
scalars it is simple multiplication (x * y). ScalaTion supports several forms of multiplication for both
vectors and matrices (see the Linear Algebra Chapter).
Executable top-level functions can also be defined in similar ways in both Scala 3 and Python.
1 @main def hello () : Unit =
2 val message = " Hello Data Science "
3 println ( s " message = $message " )
1.2.4 Classes
Defining a class is a good way to combine a data structure with its natural operations. The class will
consist of fields/attributes for maintaining data and methods for retrieving and updating the data.
An example of a class in Scala 3 is the Complex class that supports complex numbers (e.g., 2.1 + 3.2i)
and operations on complex numbers such as the + method. Of course, the actual implementation provides
many methods (see scalation.mathstat.Complex).
1 @ param re the real part ( e . g . , 2.1)
2 @ param im the imaginary part ( e . g . , 3.2)
3
5 extends Fractional [ Complex ] with Ordered [ Complex ]:
6
Notice that the second argument im provides a default value of 0.0, so the class can be instantiated using either
one or two arguments/parameters.
Also observe that the first variable cannot be reassigned as it is declared val, while the second variable c2 can be,
as it is declared var. Finally, notice that the Complex class extends both Fractional and Ordered. These
are traits that the class Complex inherits. Some of the functionality (e.g., method implementations) can be
provided by the trait itself. The class must implement those that are not implemented or override the ones
with implementations to customize their behavior, if need be. Classes can extend several traits (multiple
inheritance), but may only extend one class (single inheritance).
Although Python already has a built-in complex type, one could imagine coding one as follows:
1 class Complex :
2 def __init__ ( self , re : float , im : float = 0.0) :
3 self . re = re
4 self . im = im
5
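A fuller Python sketch adds an __add__ method, so that + works for two objects (illustrative only, not Python's built-in complex):

```python
class Complex:
    """A simple complex number: re + im*i."""
    def __init__(self, re: float, im: float = 0.0):
        self.re = re
        self.im = im

    def __add__(self, other):
        # defining __add__ allows c1 + c2 for two Complex objects
        return Complex(self.re + other.re, self.im + other.im)

c = Complex(2.0, 3.0) + Complex(1.0)   # second argument defaults to 0.0
print(c.re, c.im)                      # 3.0 3.0
```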
Notice there are a few differences: The constructor for Scala is any code at the top level of the class and
arguments to the constructor are given in the class definition, while Python has an explicit constructor
called __init__. Scala has an implicit reference to the instance object called this, while Python has an
explicit reference to the instance object called self. Furthermore, naming the method __add__ makes it so
+ can be used to add two complex numbers. Another difference (not shown here) is that fields/attributes
as well as methods in Scala can be made internal using the private access modifier. In Python, a code
convention of having the first character of an identifier be an underscore (_) indicates that it should not be used
externally.
The basic data-types in Scala are integer types: Byte (8 bits), Short (16), Int (32) and Long (64), floating
point types: Float (32) and Double (64), character types: Char (single quotes) and String (double quotes),
and Boolean.
Corresponding Python data types are integer types: int (unlimited precision); floating point types: numpy's float32 (32)
and float (64); complex (128); character types: str (single or double quotes); and bool.
There are many operators that can be applied to these data-types, see https://docs.scala-lang.
org/tour/operators.html for the precedence of the operators. ScalaTion adds a few itself such as ~^
for exponentiation. Also, ScalaTion provides complex numbers via the Complex class in the mathstat
package.
1.2.6 Collection Types
The most commonly used collection types in Scala are Array, ArrayBuffer, Range, List, Map, Set, and
Tuple. The Python rough equivalents (in lower case) are on the right (Map becomes dict).
1 val a = Array . ofDim [ Double ] (10) a = np . zeros (10)
2 val b = ArrayBuffer (2 , 3 , 3)
3 val r = 0 until 10 r = range (10)
4 val l = List (2 , 3 , 3) l = [2 , 3 , 3]
5 val m = Map ( " no " -> 0 , " yes " -> 1) m = { " no " : 0 , " yes " : 1}
6 val s = Set (1 , 2 , 3 , 5 , 7) s = {1 , 2 , 3 , 5 , 7}
7 val t = ( firstName , lastName ) t = ( firstName , lastName )
For more collection types consult their documentation: https://scala-lang.org/api/3.x/ for Scala and
https://docs.python.org/3/library/collections.html for Python. Scala typically has mutable and
immutable versions of most collection types.
A matrix is a 2D array; in this case, a 9-by-2 matrix holding two variables/features x0 and x1 in the
columns of the matrix.
1 // col0 col1
2 val x = MatrixD ((9 , 2) , 1 , 8 , // row 0
3 2, 7, // row 1
4 3, 6, // row 2
5 4, 5, // row 3
6 5, 5, // row 4
7 6, 4, // row 5
8 7, 4, // row 6
9 8, 3, // row 7
10 9 , 2) // row 8
As practice, try to find a vector b of length/dimension 2, so that x * b is close to y. The * operator does
matrix-vector multiplication. It takes the dot product of the ith row of matrix x and vector b to obtain the
ith element in the resulting vector.
In Python, numpy arrays can be used to do the same thing. The following 1D array can represent a
vector. Note the use of the period in "1." to make the elements floating point numbers. (The "D" indicates
such for ScalaTion.)
1 y = np . array ([1. , 2. , 4. , 7. , 9. , 8. , 6. , 5. , 3.])
Using double square brackets “[[”, numpy can be used to represent matrices. Each “[ ... ]” corresponds to a
row in the matrix.
1 # col0 col1
2 x = np . array ([[1. , 8.] , # row 0
3 [2. , 7.] , # row 1
4 [3. , 6.] , # row 2
5 [4. , 5.] , # row 3
6 [5. , 5.] , # row 4
7 [6. , 4.] , # row 5
8 [7. , 4.] , # row 6
9 [8. , 3.] , # row 7
10 [9. , 2.]]) # row 8
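As a sketch of the exercise above (finding a vector b so that x * b is close to y), numpy's least-squares solver can estimate such a b; this is one possible approach, not the only one:

```python
import numpy as np

y = np.array([1., 2., 4., 7., 9., 8., 6., 5., 3.])
x = np.array([[1., 8.], [2., 7.], [3., 6.], [4., 5.], [5., 5.],
              [6., 4.], [7., 4.], [8., 3.], [9., 2.]])

# least-squares estimate of b minimizing || x b - y ||
b, *rest = np.linalg.lstsq(x, y, rcond=None)
yhat = x @ b   # matrix-vector multiplication: dot product of each row of x with b
print(b.shape, yhat.shape)
```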
8 13 , 14 , 15 , // 0 0 ,1 ,2 1
9 16 , 17 , 18 , // 1 0 ,1 ,2 1
10 19 , 20 , 21 , // 2 0 ,1 ,2 1
11 22 , 23 , 24) // 3 0 ,1 ,2 1
In Python, the above tensor can be defined as a 3D numpy array. Each row and column position has two
sheet values, e.g., "[1., 13.]".
1 # column 0 column 1 column 2
2 z = np . array ([[[1. , 13.] , [2. , 14.] , [3. , 15.]] , # row 0
3 [[4. , 16.] , [5. , 17.] , [6. , 18.]] , # row 1
4 [[7. , 19.] , [8. , 20.] , [9. , 21.]] , # row 2
5 [[10. , 22.] , [11. , 23.] , [12. , 24.]]]) # row 3
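A quick sketch confirms the layout of this 3D array (its shape and one element):

```python
import numpy as np

z = np.array([[[1., 13.], [2., 14.], [3., 15.]],      # row 0
              [[4., 16.], [5., 17.], [6., 18.]],      # row 1
              [[7., 19.], [8., 20.], [9., 21.]],      # row 2
              [[10., 22.], [11., 23.], [12., 24.]]])  # row 3

print(z.shape)      # (4, 3, 2): 4 rows, 3 columns, 2 sheets
print(z[2, 1, 1])   # 20.0: row 2, column 1, sheet 1
```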
Vectors, matrices and tensors will be discussed in greater detail in the Linear Algebra Chapter.
1.3 A Data Science Project
The orientation of this textbook is that of developing modeling techniques and the understanding of how
to apply them. A secondary goal is to explain the mathematics behind the models in sufficient detail to
understand the algorithms implementing the modeling techniques. Concise code based on the mathematics
is included and explained in the textbook. Readers may drill down to see the actual ScalaTion code.
The textbook is intended to facilitate trying out the modeling techniques as they are learned and to
support a group-based term project that includes the following ten elements. The term project is to culminate
in a presentation that explains what was done concerning these ten elements.
1. Problem Statement. Imagine that your group is hired as consultants to solve some problem for a
company or government agency. The answers and recommendations that your group produces should
not depend solely on prior knowledge, but rather on sophisticated analytics performed on multiple
large-scale datasets. In particular, the study should be focused and the purpose of the study should be
clearly stated. What not to do: The following datasets are relevant to the company, so we ran them
through an analytics package (e.g., R) and obtained the following results.
2. Collection and Description of Datasets. To reduce the chances of results being relevant only to a
single dataset, multiple datasets should be used for the study (at least two). Explanation must be given
to how each dataset relates to the other datasets as well as to the problem statement. When a dataset
is in the form of a matrix, metadata should be collected for each column/variable. In some cases the
response column(s)/variable(s) will be obvious; in others it will depend on the purpose of the study.
Initially, the rest of the columns/variables may be considered as features that may be useful in predicting
responses. Ideally, the datasets should be loaded into a well-designed database. ScalaTion provides two
high-performance database systems: a relational database system and a graph database system
in scalation.database.table and scalation.database.graph, respectively.
3. Data Preprocessing Techniques Applied. During the preprocessing phase (before the modeling
techniques are applied), the data should be cleaned up. This includes elimination of features with zero
variance or too many missing values, as well as the elimination of key columns (e.g., on the training
data, the employee-id could perfectly predict the salary of an employee, but is unlikely to be of any
value in making predictions on the test data). For the remaining columns, strings should be converted
to integers and imputation techniques should be used to replace missing values.
4. Visual Examination. At this point, Exploratory Data Analysis (EDA) should be applied. Com-
monly, one column of a dataset in the combined data matrix will be chosen as the response column,
call it the response vector y, and the rest of the columns that remain after preprocessing form m-by-n
data matrix X. In general, models are of the form
y = f (x) + ε (1.1)
where f is a function mapping feature vector x into a predicted value for response y. The last term ε may be
viewed as random error. In an ideal model, the last term will be pure error (e.g., white noise). Since most
models are approximations, technically the last term should be referred to as a residual (that which
is not explained by the model). During exploratory data analysis, the value of y should be plotted
against each feature/column x:j of data matrix X. The relationships between the columns should
be examined by computing a correlation matrix. Two columns that are very highly correlated are
supplying redundant information, and typically, one should be removed. For a regression type problem,
where y is treated as continuous random variable, a simple linear regression model should be created
for each feature xj ,
y = b0 + b1 xj + ε (1.2)
where the parameters b = [b0 , b1 ] are to be estimated. The line generated by the model should be
plotted along with the {(xij , yi )} data points. Visually, look for patterns such as white noise, linear
relationship, quadratic relationship, etc. Plotting the residuals {(xij , εi )} will also be useful. One
should also create Histograms and Box-Plots for each variable as well as consider removing outliers.
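As an illustrative sketch, numpy's polyfit can estimate the parameters of equation (1.2); the feature and response values below reuse the example data from earlier in the chapter:

```python
import numpy as np

xj = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])   # one feature column
y  = np.array([1., 2., 4., 7., 9., 8., 6., 5., 3.])   # response vector

b1, b0 = np.polyfit(xj, y, deg=1)      # least-squares estimates of slope and intercept
residuals = y - (b0 + b1 * xj)         # what the fitted line fails to explain
# for a least-squares line with an intercept, the residuals sum to (nearly) zero
print(abs(float(np.sum(residuals))) < 1e-9)   # True
```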
5. Modeling Techniques Chosen. For every type of modeling problem, there is the notion of a
NullModel: for prediction, it is to guess the mean, i.e., given a feature vector z, predict the value E [y],
regardless of the value of z. The coefficient of determination R2 for such models will be zero. If a
more sophisticated model cannot beat the NullModel, it is not helpful in predicting or explaining the
phenomena. Projects should include four classes of models: (i) NullModel, (ii) simple, easy to explain
models (e.g., Multiple Linear Regression), (iii) complex, performant models (e.g., Quadratic Regression,
Extreme Learning Machines), and (iv) complex, time-consuming models (e.g., Neural Networks). If classes
(ii-iv) do not improve upon class (i) models, new datasets should be collected. If this does not help, a
new problem should be sought. On the flip side, if class (ii) models are nearly perfect (R2 close to 1),
the problem being addressed may be too simple for a term project. At least one modeling technique
should be chosen from each class.
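The NullModel baseline and its R2 of zero can be sketched in a few lines (the helper r_squared is illustrative, not ScalaTion code):

```python
import numpy as np

def r_squared(y, yhat) -> float:
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    sse = np.sum((y - yhat) ** 2)          # sum of squared errors
    sst = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1.0 - sse / sst

y = np.array([1., 2., 4., 7., 9., 8., 6., 5., 3.])
null_pred = np.full_like(y, np.mean(y))    # NullModel: always guess the mean
print(r_squared(y, null_pred))             # 0.0
```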
7. Feature Selection. Although feature selection can occur during multiple phases in a modeling study,
an overview should be given at this point in the presentation. Explain which features were eliminated
and why they were eliminated prior to building the models. During model building, what features
were eliminated, e.g., using forward selection, backward elimination, Lasso Regression, dimensionality
reduction, etc. Also address and quantify the relative importance of the remaining features. Explain
how categorical features are handled.
8. Reporting of Results. First the experimental setup should be described in sufficient detail to
facilitate reproducibility of your results. One way to show overall results is to plot predicted responses
ŷ and actual responses y versus the instance index i = 0 until m. Reports are to include the Quality
of Fit (QoF) for the various models and datasets in the form of tables, figures and explanation of the
results. Besides the overall model, for many modeling techniques the importance/significance of model
parameters/variables may be assessed as well. Tables and figures must include descriptive captions
and color/shape schemes should be consistent across figures.
9. Interpretation of Results. With the results clearly presented, they need to be given insightful
interpretations. What are the ramifications of the results? Are the modeling techniques useful in
making predictions, classifications or forecasts?
10. Recommendations of Study. The organization that hired your group would like some take-home
messages that may result in improvements to the organization (e.g., what to produce, what processes
to adapt, how to market, etc.). A brief discussion of how the study could be improved (possibly leading
to further consulting work) should be given.
1.4 Additional Textbooks
More detailed development of this material can be found in textbooks on statistical learning.
See Table 1.1 for a mapping between the chapters in the four textbooks.
The next two chapters serve as quick reviews of the two principal mathematical foundations for data
science: linear algebra and probability.
Part I
Foundations
Chapter 2
Linear Algebra
2.1 Linear System of Equations
Data science and analytics make extensive use of linear algebra. For example, let yi be the income of the ith
individual and xij be the value of the j th predictor/feature (age, education, etc.) for the ith individual. The
responses (outcomes of interest) are collected into a vector y, the values for predictors/features are collected
in a matrix X and the parameters/coefficients b are fit to the data.
y0 = x00 b0 + x01 b1
y1 = x10 b0 + x11 b1
This linear system has two equations with two variables having unknown values, b0 and b1 . Such linear
systems can be used to solve problems like the following: Suppose a movie theatre charges 10 dollars per
child and 20 dollars per adult. The evening attendance is 100, while the revenue is 1600 dollars. How many
children (b0 ) and adults (b1 ) were in attendance?
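The theatre problem can be worked out directly from the two equations; the setup below, with the attendance equation first, is one consistent choice.

```latex
\begin{align*}
  b_0 + b_1       &= 100  && \text{attendance} \\
  10 b_0 + 20 b_1 &= 1600 && \text{revenue}
\end{align*}
% Subtracting 10 times the first equation from the second gives
% 10 b_1 = 600, so b_1 = 60 adults and b_0 = 100 - 60 = 40 children.
```

Checking: 40 children at 10 dollars plus 60 adults at 20 dollars gives 400 + 1200 = 1600 dollars.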
In matrix form, such a linear system is written as

\[ y = Xb \tag{2.1} \]
2.2 Matrix Inversion
If the matrix is of full rank with m = n, then the unknown vector b may be uniquely determined by
multiplying both sides of the equation by the inverse of X, X^{-1}.

\[ b = X^{-1} y \tag{2.2} \]
Multiplying matrix X and its inverse X^{-1}, X^{-1}X results in an n-by-n identity matrix I_n = [1_{i=j}], where
the indicator function 1_{i=j} equals 1 when i = j and 0 otherwise.
A faster and more numerically stable way to solve for b is to perform Lower-Upper (LU) Factorization.
This is done by factoring matrix X into lower L and upper U triangular matrices.
X = LU (2.3)
Then LUb = y, so multiplying both sides by L^{-1} gives Ub = L^{-1}y. Taking an augmented matrix

\[ \begin{bmatrix} 1 & 3 & 1 \\ 2 & 1 & 7 \end{bmatrix} \]

and performing row operations to make it upper triangular has the effect of multiplying by L^{-1}. In
this case, the first row multiplied by -2 is added to the second row to give

\[ \begin{bmatrix} 1 & 3 & 1 \\ 0 & -5 & 5 \end{bmatrix} \]

From this, backward substitution can be used to determine b_1 = -1 and then that b_0 = 4, i.e.,

\[ b = \begin{bmatrix} 4 \\ -1 \end{bmatrix} \]
In cases where m > n, the system may be overdetermined, and no exact solution may exist. Values for b are
then often determined to make y and Xb agree as closely as possible, e.g., by minimizing the absolute or squared
differences.
Vector notation is used in this book, with vectors shown in boldface and matrices in uppercase. Note,
matrices in ScalaTion are in lowercase, since by convention, uppercase indicates a type, not a variable.
ScalaTion supports vectors and matrices in its mathstat package. A commonly used operation is the dot
(inner) product, x · y, or in ScalaTion, x dot y.
2.3 Vector
A vector may be viewed as a point in multi-dimensional space, e.g., in three space, we may have

\[ x = \frac{1}{\sqrt{3}} [1, 1, 1] \qquad y = [1, 1, 0] \]

where x is a point on the unit sphere and y is a point in the plane determined by the first two coordinates.
\[ x \cdot y = \sum_{i=0}^{n-1} x_i y_i = 1.1547 \tag{2.4} \]
Note, the inner product applies more generally, e.g., ⟨x, y⟩ may be applied when x and y are infinite sequences
or functions. See the exercises for the definition of an inner product space.
2.3.4 Norm
The norm of a vector is its length. Assuming Euclidean distance, the norm is
\[ \|x\| = \sqrt{\sum_{i=0}^{n-1} x_i^2} = 1 \tag{2.5} \]

The norm of y is √2. If θ is the angle between the x and y vectors, then the dot product is the product of
their norms and the cosine of the angle.

\[ \cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|} = \frac{1.1547}{1 \cdot \sqrt{2}} = 0.8165 \]
so the angle θ = .616 radians. Vectors x and y are orthogonal if the angle θ = π/2 radians (90 degrees).
In general there are `p norms. The two that are used here are the `2 norm kxk = kxk2 (Euclidean
distance) and the `1 norm kxk1 (Manhattan distance).
\[ \|x\|_1 = \sum_{i=0}^{n-1} |x_i| \tag{2.7} \]
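The running example can be checked numerically. The sketch below uses plain Scala (no ScalaTion dependency) with x = [1, 1, 1]/√3 and y = [1, 1, 0], vectors consistent with the values x · y = 1.1547, ‖x‖ = 1 and ‖y‖ = √2 quoted above.

```scala
// Dot product, norm and angle for the running example (plain Scala sketch).
def dot(x: Array[Double], y: Array[Double]): Double =
  x.indices.map(i => x(i) * y(i)).sum

def norm(x: Array[Double]): Double = math.sqrt(dot(x, x))

@main def angleDemo(): Unit =
  val s = 1.0 / math.sqrt(3.0)
  val x = Array(s, s, s)                       // unit vector: ||x|| = 1
  val y = Array(1.0, 1.0, 0.0)                 // ||y|| = sqrt(2)
  val cos = dot(x, y) / (norm(x) * norm(y))
  println(f"x dot y = ${dot(x, y)}%1.4f")      // 1.1547
  println(f"cos     = $cos%1.4f")              // 0.8165
  println(f"theta   = ${math.acos(cos)}%1.4f radians")
```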
Vector notation facilitates concise mathematical expressions. Many common statistical measures for
populations or samples can be given in vector notation. For an m dimensional vector (m-vector) the following
may be defined.
\[ \mu(x) = \mu_x = \frac{1 \cdot x}{m} \]
\[ \sigma^2(x) = \sigma_x^2 = \frac{(x - \mu_x) \cdot (x - \mu_x)}{m} = \frac{x \cdot x}{m} - \mu_x^2 \]
\[ \sigma(x, y) = \sigma_{x,y} = \frac{(x - \mu_x) \cdot (y - \mu_y)}{m} = \frac{x \cdot y}{m} - \mu_x \mu_y \]
\[ \rho(x, y) = \rho_{x,y} = \frac{\sigma_{x,y}}{\sigma_x \sigma_y} \]
which are the population mean, variance, covariance and correlation, respectively.
The size of the population is m, which corresponds to the number of elements in the vector. A vector of
all ones is denoted by 1. For an m-vector k1k2 = 1 · 1 = m. Note, the sample mean uses the same formula,
while the sample variance and covariance divide by m − 1, rather than m (sample indicates that only some
fraction of population is used in the calculation).
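These formulas translate directly into code. Below is a plain-Scala sketch (hypothetical helper names, no ScalaTion dependency) computing the population statistics via dot products.

```scala
// Population mean, variance, covariance and correlation via dot products.
def dot(x: Array[Double], y: Array[Double]): Double =
  x.indices.map(i => x(i) * y(i)).sum

def mu(x: Array[Double]): Double = x.sum / x.length            // 1·x / m
def variance(x: Array[Double]): Double =
  dot(x, x) / x.length - mu(x) * mu(x)                         // x·x/m - mu^2
def cov(x: Array[Double], y: Array[Double]): Double =
  dot(x, y) / x.length - mu(x) * mu(y)
def rho(x: Array[Double], y: Array[Double]): Double =
  cov(x, y) / math.sqrt(variance(x) * variance(y))
```

For x = [1, 2, 3, 4] and y = 2x, ρ(x, y) = 1, as expected for perfectly linearly related data.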
Vectors may be used for describing the motion of an object through space over time. Let u(t) be the
location of an object (e.g., golf ball) in three dimensional space R3 at time t,
To describe the motion, let v(t) be the velocity at time t, and a be the constant acceleration, then according
to Newton’s Second Law of Motion,
\[ u(t) = u(0) + v(0)\, t + \frac{1}{2} a\, t^2 \]
The time varying function u(t) over time will show the trajectory of the golf ball.
8          with Cloneable [VectorD]
9          with DefaultSerializable:
VectorD includes methods for size, indices, set, copy, filter, select, concatenate, vector arithmetic, power,
square, reciprocal, abs, sum, mean, variance, rank, cumulate, normalize, dot, norm, max, min, mag, argmax,
argmin, indexOf, indexWhere, count, contains, sort and swap.
2.4 Vector Calculus
Data science uses optimization to fit parameters in models, where for example a quality of fit measure (e.g.,
sum of squared errors) is minimized. Typically, gradients are involved. In some cases, the gradient of the
measure can be set to zero allowing the optimal parameters to be determined by matrix factorization. For
complex models, this may not work, so an optimization algorithm that moves in the direction opposite to
the gradient can be applied.
For example, consider the function f(x, y) = (x − 2)² + (y − 3)². The functional value at the point [3, 2] is f([3, 2]) =
1 + 1 = 2 and at the point [1, 1], f([1, 1]) = 1 + 4 = 5. The following contour curves illustrate how the elevation
of the function increases with distance from the point [2, 3].
[Figure: contour curves of f in the xy-plane, centered at the point [2, 3]]
The gradient of function f consists of a vector formed from the two partial derivatives.

\[ \mathrm{grad}\, f = \nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] \]

The gradient evaluated at point/vector u ∈ R² is

\[ \nabla f(u) = \left[ \frac{\partial f}{\partial x}(u), \frac{\partial f}{\partial y}(u) \right] \]

The gradient indicates the direction of steepest increase/ascent. For example, the gradient at the point [3, 2] is
∇f([3, 2]) = [2, −2] (in blue), while at [1, 1], ∇f([1, 1]) = [−2, −4] (in purple).
A gradient’s norm indicates the magnitude of the rate of change (or steepness). When the elevation
changes are fixed (here they differ by one), the closeness of the contour curves also indicates steepness.
Notice that the gradient vector at point [x, y] is orthogonal to the contour curve intersecting that point.
By setting the gradient equal to zero, in this case
\[ \frac{\partial f}{\partial x} = 2(x - 2) \]
\[ \frac{\partial f}{\partial y} = 2(y - 3) \]
one may find the vector that minimizes function f, namely u = [2, 3] where f = 0. For more complex
functions, repeatedly moving in the direction opposite to the gradient may lead to finding a minimal value.
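This idea, stepping opposite the gradient, can be sketched in plain Scala for f(x, y) = (x − 2)² + (y − 3)²; the learning rate and iteration count below are illustrative choices, not from the text.

```scala
// Gradient descent on f(x, y) = (x - 2)^2 + (y - 3)^2, whose gradient is
// [2(x - 2), 2(y - 3)]; the iterates should converge to the minimizer [2, 3].
def descend(eta: Double = 0.1, steps: Int = 200): (Double, Double) =
  var x = 0.0; var y = 0.0                 // starting point [0, 0]
  for _ <- 1 to steps do
    x -= eta * 2.0 * (x - 2.0)             // step opposite df/dx
    y -= eta * 2.0 * (y - 3.0)             // step opposite df/dy
  (x, y)
```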
In general, the gradient (or gradient vector) of function f : Rn → R is
\[ \nabla f = \frac{\partial f}{\partial x} = \left[ \frac{\partial f}{\partial x_0}, \ldots, \frac{\partial f}{\partial x_{n-1}} \right] \tag{2.8} \]

or evaluated at point/vector x ∈ Rⁿ is

\[ \nabla f(x) = \frac{\partial f}{\partial x}(x) = \left[ \frac{\partial f}{\partial x_0}(x), \ldots, \frac{\partial f}{\partial x_{n-1}}(x) \right] \tag{2.9} \]
In data science, it is often convenient to take the gradient of a dot product of two functions of x, in which
case the following product rule can be applied.

\[ \nabla [f(x) \cdot g(x)] = J_f(x)^{\top} g(x) + J_g(x)^{\top} f(x) \]

For a vector-valued function f : Rⁿ → Rᵐ, the Jacobian matrix J f(x) stacks the gradients of its component functions.

\[ J f(x) = \begin{bmatrix} \nabla f_0(x) \\ \nabla f_1(x) \\ \vdots \\ \nabla f_{m-1}(x) \end{bmatrix} \]

This follows the numerator layout where the functions correspond to rows (the opposite is called the denominator
layout, which is the transpose of the numerator layout).
Consider the following function f : R² → R² that maps vectors in R² into other vectors in R².

\[ f(x) = \begin{bmatrix} (x_0 - 2)^2 + (x_1 - 3)^2 \\ 2(x_0 - 3)^2 + 3(x_1 - 2)^2 \end{bmatrix} \]
\[ J f(x) = \begin{bmatrix} \frac{\partial f_0}{\partial x_0} & \frac{\partial f_0}{\partial x_1} \\ \frac{\partial f_1}{\partial x_0} & \frac{\partial f_1}{\partial x_1} \end{bmatrix} \]

Taking the partial derivatives gives the following Jacobian matrix.

\[ J f(x) = \begin{bmatrix} 2x_0 - 4 & 2x_1 - 6 \\ 4x_0 - 12 & 6x_1 - 12 \end{bmatrix} \]
The Hessian matrix of a scalar-valued function collects its second partial derivatives.

\[ H f(x) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{0 \le i < n,\; 0 \le j < n} \tag{2.12} \]

For a function of two variables, this is

\[ H f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2} & \frac{\partial^2 f}{\partial x_0 \partial x_1} \\ \frac{\partial^2 f}{\partial x_1 \partial x_0} & \frac{\partial^2 f}{\partial x_1^2} \end{bmatrix} \]

Taking the second partial derivatives of f₁ gives the following Hessian matrix.

\[ H f_1(x) = \begin{bmatrix} 4 & 0 \\ 0 & 6 \end{bmatrix} \]
Consider a differentiable function of n variables, f : Rⁿ → R. The points at which its gradient vector ∇f
is zero are referred to as critical points. In particular, they may be local minima, local maxima or saddle
points/inconclusive, depending on whether the Hessian matrix H is positive definite, negative definite, or
otherwise. A symmetric matrix A is positive definite if x⊤Ax > 0 for all x ≠ 0 (alternatively, all of A’s
eigenvalues are positive). Note: a positive/negative semi-definite Hessian matrix may or may not indicate
an optimal (minimal/maximal) point.
2.5 Matrix
A matrix may be viewed as a collection of vectors, one for each row in the matrix. Matrices may be used to
represent linear transformations
\[ f : \mathbb{R}^n \to \mathbb{R}^m \tag{2.13} \]

that map vectors in Rⁿ to vectors in Rᵐ. For example, in ScalaTion an m-by-n matrix A with m = 3
rows and n = 2 columns may be created as follows:
1 val a = MatrixD ((3 , 2) , 1 , 2 ,
2 3, 4,
3 5 , 6)
to produce matrix A.

\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \]
Matrix A will transform u vectors in R2 into v vectors in R3 .
Au = v (2.14)
For example,

\[ A \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \\ 17 \end{bmatrix} \]
ScalaTion supports retrieval of row vectors, column vectors and matrix elements. In particular, the
following access operations are supported.
A = a = matrix
ai = a(i) = row vector i
a:j = a(?, j) = column vector j
aij = a(i, j) = the element at row i and column j
Ai:k,j:l = a(i to k, j to l) = row and column matrix slice
Note, i to k does not include k. Common operations on matrices are supported as well.
Matrix Addition and Subtraction
Matrix Multiplication
\[ c_{ij} = \sum_{k=0}^{n-1} a_{ik} b_{kj} = a_i \cdot b_{:j} \tag{2.16} \]
Mathematically, this is written as C = AB. The ij element in matrix C is the vector dot product of the ith
row of A with the j th column of B.
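Equation (2.16) can be sketched directly in plain Scala; this is a minimal illustration, not ScalaTion's implementation.

```scala
// Matrix multiplication: c_ij is the dot product of row i of a and column j of b.
def matMul(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
  Array.tabulate(a.length, b(0).length)((i, j) =>
    b.indices.map(k => a(i)(k) * b(k)(j)).sum)
```

Multiplying the 3-by-2 matrix A above by the column vector [1, 2]⊤ reproduces [5, 11, 17]⊤.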
Matrix Transpose
The transpose of matrix A, written A⊤ (val t = a.transpose or val t = a.T), simply exchanges the roles
of rows and columns.
1 def transpose : MatrixD =
2     val a = Array.ofDim [Double] (dim2, dim)
3     for j <- indices do
4         val v_j = v(j)
5         var i = 0
6         cfor (i < dim2, i += 1) { a(i)(j) = v_j(i) }
7     end for
8     new MatrixD (dim2, dim, a)
9 end transpose
Matrix Determinant
The determinant of square (m = n) matrix A, written |A| (val d = a.det), indicates whether a matrix is
singular or not (and hence invertible), based on whether the determinant is zero or not.
Trace of a Matrix
The trace of matrix A ∈ Rn×n is simply the sum of its diagonal elements.
n−1
X
tr(A) = aii (2.17)
i=0
In ScalaTion, the trace is computed using the trace method (e.g., a.trace).
Matrix Dot Product
ScalaTion provides several types of dot products on both vectors and matrices, two of which are shown
below. The first method computes the usual dot product between two vectors. Note, the parameter y is
generalized to take any vector-like data type.
1 def dot (y : IndexedSeq [Double]) : Double =
2     var sum = 0.0
3     for i <- v.indices do sum += v(i) * y(i)
4     sum
5 end dot
When relevant, an n-vector (e.g., x ∈ Rⁿ) may be viewed as an n-by-1 matrix (column vector), in which case
x⊤ would be viewed as a 1-by-n matrix (row vector). Consequently, the dot product (and outer product) can
be defined in terms of matrix multiplication and transpose operations.

\[ x \cdot y = x^{\top} y \qquad \text{dot (inner) product} \tag{2.18} \]
\[ x \otimes y = x\, y^{\top} \qquad \text{outer product} \tag{2.19} \]
The second method takes the dot product of two matrices. It extends the notion of dot product to
matrices and is an efficient way to compute A⊤B = A · B = a.transpose * b = a dot b.
1 def dot (y : MatrixD) : MatrixD =
2     if dim2 != y.dim then
3         flaw ("dot", s"matrix dot matrix - incompatible cross dimensions:
4                dim2 = $dim2, y.dim = ${y.dim}")
5     ...
23             end for
24         end for
25     end for
26     new MatrixD (dim, y.dim, a)
27 end dot
2.6 Matrix Factorization
Many problems in data science involve matrix factorization to, for example, solve linear systems of equations
or perform Ordinary Least Squares (OLS) estimation of parameters. ScalaTion supports several factorization
techniques, including the techniques shown in Table 2.2.
These algorithms are faster or more numerically stable than algorithms for matrix inversion. See the
Prediction chapter to see how matrix factorization is used in Ordinary Least Squares estimation.
Consider the matrix A with vectors x = [1, 0] and z = [0, 1].

\[ A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \]

Multiplying A and x yields [2, 0]⊤, while multiplying A and z yields [0, 3]⊤. Thus, letting λ₀ = 2 and λ₁ = 3, we
see that Ax = λ₀x and Az = λ₁z. In general, a matrix A ∈ Rⁿˣⁿ of rank r will have r non-zero eigenvalues λᵢ
with corresponding eigenvectors x⁽ⁱ⁾ such that

\[ A x^{(i)} = \lambda_i x^{(i)} \]

In other words, there will be r unit eigenvectors, for which multiplying by the matrix simply rescales the
eigenvector x⁽ⁱ⁾ by its eigenvalue λᵢ. The same will happen for any non-zero vector in alignment with one of
the r unit eigenvectors.
Given an eigenvalue λᵢ, an eigenvector may be found by noticing that

\[ (A - \lambda_i I)\, x = 0 \]

Any vector in the nullspace of the matrix A − λᵢI is an eigenvector corresponding to λᵢ. Note, if the above
equation is transposed, it is called a left eigenvalue problem (see the section on Markov Chains).
In low dimensions, the eigenvalues may be found as roots of the characteristic polynomial/equation
derived from taking the determinant of A − λᵢI. Software like ScalaTion, however, uses iterative algorithms
that convert a matrix into Hessenberg and tridiagonal forms.
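The eigen-relationship Ax = λx can be checked numerically for the diagonal example above with a plain Scala sketch.

```scala
// Verify A x = lambda x for A = [[2, 0], [0, 3]] with eigenpairs
// (2, [1, 0]) and (3, [0, 1]).
def matVec(a: Array[Array[Double]], x: Array[Double]): Array[Double] =
  a.map(row => row.indices.map(j => row(j) * x(j)).sum)

@main def eigenCheck(): Unit =
  val a = Array(Array(2.0, 0.0), Array(0.0, 3.0))
  println(matVec(a, Array(1.0, 0.0)).mkString("[", ", ", "]"))  // [2.0, 0.0]
  println(matVec(a, Array(0.0, 1.0)).mkString("[", ", ", "]"))  // [0.0, 3.0]
```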
2.7 Internal Representation
The current internal representation used for storing the elements in a dense matrix is Array [Array [Double]]
in row major order (row-by-row). Depending on usage, operations may be more efficient using column major
order (column-by-column). Also, using a one dimensional array Array [Double] mapping (i, j) to the
kth location may be more efficient. Furthermore, having operations access data through sub-matrices (blocks)
may improve performance because of caching efficiency, and may also improve performance for parallel and
distributed versions.
The mathstat package provides several classes implementing multiple types of vectors and matrices as
shown in Table 2.3 including VectorD and MatrixD.
The suffix ‘D’ indicates the base element type is Double. There are also implementations for Complex ‘C’,
Int ‘I’, Long ‘L’, Rational ‘Q’, Real ‘R’, String ‘S’, and TimeNum ‘T’.
Note, ScalaTion 2.0 currently only supports dense vectors and matrices. See older versions for the
other types of vectors and matrices.
ScalaTion supports many operations involving matrices and vectors, including those shown in
Table 2.5.
2.8 Tensor
Loosely speaking, a tensor is a generalization of scalar, vector and matrix. The order of the tensor indicates
the number of dimensions. In this text, tensors are treated as hyper-matrices and issues such as basis
independence, contravariant and covariant vectors/tensors, and the rules for index notation involving super and
subscripts are ignored [111]. To examine the relationship between order 2 tensors and matrices more deeply,
see the last exercise.
For data science, input into a model may be a vector (e.g., simple regression, univariate time series), a
matrix (e.g., multiple linear regression, neural networks), a tensor with three dimensions (e.g., monochro-
matic/greyscale images), and a tensor with four dimensions (e.g., color images).
5 class TensorD ( val dim : Int , val dim2 : Int , val dim3 : Int ,
6 private [ mathstat ] var v : Array [ Array [ Array [ Double ]]] = null )
7 extends Error with Serializable
A tensor T is stored in a triple array [t_{ijk}]. Below is an example of a 2-by-2-by-2 tensor, T = [T::0 | T::1]

\[ \begin{bmatrix} t_{000} & t_{010} & | & t_{001} & t_{011} \\ t_{100} & t_{110} & | & t_{101} & t_{111} \end{bmatrix} \]
2.8.2 Four Dimensional Tensors
In ScalaTion, tensors with four dimensions are supported by the Tensor4D class. The default names for
the dimensions [111] were chosen to follow a common convention (row, column, sheet, channel).
1 @param dim   size of the 1st level/dimension (row) of the tensor (height)
2 @param dim2  size of the 2nd level/dimension (column) of the tensor (width)
3 @param dim3  size of the 3rd level/dimension (sheet) of the tensor (depth)
4 @param dim4  size of the 4th level/dimension (channel) of the tensor (spectra)
5
6 class Tensor4D (val dim : Int, val dim2 : Int, val dim3 : Int, val dim4 : Int,
7                 private [mathstat] var v : Array [Array [Array [Array [Double]]]] = null)
8       extends Error with Serializable
2.9 Exercises
1. Draw two 2-dimensional non-zero vectors, x and y, whose dot product x · y is zero.
2. A vector can be transformed into a unit vector in the same direction by dividing by its norm, x/‖x‖.
Let y = 2x and show that the dot product of the corresponding unit vectors equals one. This means that their
Cosine Similarity equals one.

\[ \cos_{xy} = \cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|} \quad \text{where } \theta \text{ is the angle between the vectors} \]
3. Correlation ρ_{xy} vs. Cosine Similarity cos_{xy}. What does it mean when the correlation (cosine similarity)
is 1, 0, -1, respectively? In general, does ρ_{xy} = cos_{xy}? What about in special cases?
4. Given the matrix X and the vector y, solve for the vector b in the equation y = Xb using matrix
inversion and LU factorization.
1 import scalation.mathstat.{MatrixD, VectorD, Fac_LU}
2 val x = MatrixD ((2, 2), 1, 3,
3                          2, 1)
4 val y = VectorD (1, 7)
5 println ("using inverse: b = X^-1 y = " + x.inverse * y)
6 println ("using LU fact: Lb = Uy = " + { val lu = new Fac_LU (x); lu.factor ().solve (y) })
Modify the code to show the inverse matrix X −1 and the factorization into the L and U matrices.
5. If Q is an orthogonal matrix, then Q⊤Q becomes what type of matrix? What about QQ⊤? Illustrate
with an example 3-by-3 matrix. What is the inverse of Q?
6. Show that the Hessian matrix of a scalar-valued function f : Rⁿ → R is the transpose of the Jacobian
of the gradient, i.e.,

\[ H f(x) = [J\, \nabla f(x)]^{\top} \]
7. Critical points for a function f : Rⁿ → R occur when ∇f(x) = 0. How can the Hessian matrix be
used to decide whether a particular critical point is a local minimum or maximum?
8. Define three functions, f1 (x, y), f2 (x, y) and f3 (x, y), that have critical points (zero gradient) at the
point [2, 3] such that this point is (a) a minimal point, (b) a maximal point, (c) a saddle point,
respectively. Compute the Hessian matrix at this point for each function and use it to explain the type
of critical point. Plot the three surfaces in 3D.
Hint: see https://www.math.usm.edu/lambers/mat280/spr10/lecture8.pdf
9. Determine the eigenvalues for the matrix A given in the section on eigenvalues and eigenvectors, by
setting the determinant of A − λI equal to zero.
" #
2−λ 0
0 3−λ
(2 − λ)(3 − λ) − 0 = 0
10. A vector space V over field K (e.g., R or C) is a set of objects, e.g., vectors x, y, and z, and two
operations, addition and scalar multiplication,
x, y ∈ V =⇒ x + y ∈ V (2.22)
x ∈ V and a ∈ K =⇒ ax ∈ V (2.23)

The two operations satisfy the following axioms:
(x + y) + z = x + (y + z)
x+y = y+x
∃ 0 ∈ V s.t. x + 0 = x
∃ − x ∈ V s.t. x + (−x) = 0
(ab)x = a(bx)
a(x + y) = ax + ay
(a + b)x = ax + bx
∃1 ∈ K s.t. 1x = x
11. A normed vector space V over field K is a vector space with a function defined that gives the length
(norm) of a vector,

\[ x \in V \implies \|x\| \in \mathbb{R}^+ \]
The norm induces a distance (metric) between two vectors, d(x, y) = ‖x − y‖.
\[ \|x\|_p = \left( \sum_{i=0}^{n-1} |x_i|^p \right)^{1/p} \]
Norms and distances are very useful in data science, for example, loss functions used to judge/optimize
models are often defined in terms of norms or distances.
Show that the last axiom, called the triangle inequality, holds for ℓ₂-norms.
Hint: ‖x‖₂² is the sum of the squared elements of x.
12. An inner product space H over field K is a vector space with one more operation, inner product,
\[ x, y \in H \implies \langle x, y \rangle \in K \]
\[ \langle x, y \rangle = \langle y, x \rangle^{*} \]
\[ \langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle \]
\[ \langle x, x \rangle > 0 \ \text{unless} \ x = 0 \]
Note, the complex conjugate negates the imaginary part of a complex number, e.g., (c + di)* = c − di.
Show that an n-dimensional Euclidean vector space using the definition of dot product given in this
chapter is an inner product space over R.
13. Explain the meaning of the following statement, “a tensor of order 2 for a given coordinate system can
be represented by a matrix.”
Hint: see “Tensors: A Brief Introduction” [32]
2.10 Further Reading
1. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares [21]
Chapter 3
Probability
Probability is used to measure the likelihood of certain events occurring, such as flipping a coin and getting a
head, rolling a pair of dice and getting a sum of 7, or getting a full house in five card draw. Given a random
experiment, the sample space Ω is the set of all possible outcomes.
Technically speaking, an event is a measurable subset of Ω (see [41] for a measure-theoretic definition).
Letting F be the set of all possible events, one may define a probability space as follows:
Definition: A probability space is defined as a triple (Ω, F, P ).
Given an event A ∈ F, the probability of its occurrence is restricted to the unit interval, P (A) ∈ [0, 1].
Thus, P may be viewed as a function that maps events to the unit interval.
P : F → [0, 1] (3.2)
If events A and B are independent, simply take the product of the individual probabilities,

\[ P(AB) = P(A)\, P(B) \]
3.1.2 Conditional Probability
The conditional probability of the occurrence of event A, given it is known that event B has occurred/will
occur is
\[ P(A|B) = \frac{P(AB)}{P(B)} \tag{3.5} \]
If events A and B are independent, the conditional probability reduces to P(A|B) = P(A).
Bayes Theorem

\[ P(A|B) = \frac{P(B|A)\, P(A)}{P(B)} \tag{3.7} \]
When determining the conditional probability P(A|B) is difficult, one may try going the other direction and first
determine P(B|A).
Example
The size of the outcome space is 4 and since the event space F contains all subsets of Ω, its size is 24 = 16.
Define the following two events:
What is the probability that event A occurred, given that you know that event B occurred? If fair coins are
used, the probability of a head (or tail) is 1/2 and the probabilities reduce to the ratios of set sizes.
3.2 Random Variable
Rather than just looking at individual events, e.g., E1 or E2 , one is often more interested in the probability
that random variables take on certain values.
Definition: A random variable y is a function that maps outcomes in the sample space Ω into a set/domain
of numeric values Dy .
y : Ω → Dy (3.8)
Some commonly used domains are real numbers R, integers Z, natural numbers N, or subsets thereof. An
example of a mapping from outcomes to numeric values is tail → 0, head → 1. In other cases such as the
roll of one dice, the map is the identity function.
One may think of a random variable y (blue font) as taking on values from a given domain Dy . With a
random variable, its value is uncertain, i.e., its value is only known probabilistically.
For A ⊆ D_y one can measure the probability of the random variable y taking on a value from the set
A. This is denoted by

\[ P(y \in A) \tag{3.9} \]

The corresponding event is the preimage of A under the mapping y,

\[ E = y^{-1}(A) \tag{3.10} \]

so this probability is just the probability of that event,

\[ P(E) \tag{3.11} \]
3.3 Probability Distribution
A random variable y is characterized by how its probability is distributed over its domain Dy . This can be
captured by functions that map Dy to R+ .
One such function is the Cumulative Distribution Function (CDF).

\[ F_y : D_y \to [0, 1] \tag{3.12} \]

It measures the amount of probability or mass accumulated over the domain up to and including the point y.
The color highlighted symbol y is the random variable, while y simply represents a value.
Fy (y) = P (y ≤ y) (3.13)
To illustrate the concept, let x1 and x2 be the number on dice 1 and dice 2, respectively. Let y = x1 + x2 ,
then Fy (6) = P (y ≤ 6) = 5/12. The entire CDF for the discrete random variable y (roll of two dice), Fy (y)
is
{(2, 1/36), (3, 3/36), (4, 6/36), (5, 10/36), (6, 15/36), (7, 21/36), (8, 26/36), (9, 30/36), (10, 33/36), (11, 35/36), (12, 36/36)}
As another example, the CDF for a continuous random variable y that is defined to be uniformly distributed
on the interval [0, 2] is
\[ F_y(y) = \frac{y}{2} \quad \text{on } [0, 2] \]

When random variable y follows this CDF, we may say that y is distributed as Uniform (0, 2), symbolically,
y ∼ Uniform (0, 2).
For a discrete random variable, the probability mass function (pmf) gives the probability at each point in the domain.

\[ p_y : D_y \to [0, 1] \tag{3.14} \]

It can be calculated as the first difference of the CDF, i.e., the amount of accumulated mass at point y_i
minus the amount of accumulated mass at the previous point y_{i−1}.
For one dice x1 , the pmf is
{(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}
A second dice x2 will have the same pmf. Both random variables follow the Discrete Uniform Distribution,
Randi (1, 6).
\[ p_x(x) = \frac{1}{6}\, \mathbb{1}_{\{1 \le x \le 6\}} \tag{3.16} \]
The pmf for the sum y = x₁ + x₂ of the two dice is

{(2, 1/36), (3, 2/36), (4, 3/36), (5, 4/36), (6, 5/36), (7, 6/36), (8, 5/36), (9, 4/36), (10, 3/36), (11, 2/36), (12, 1/36)}
The random variable y follows the Discrete Triangular Distribution (that peaks in the middle) and not the
flat Discrete Uniform Distribution.
\[ p_y(y) = \frac{\min(y - 1,\; 13 - y)}{36}\, \mathbb{1}_{\{2 \le y \le 12\}} \tag{3.17} \]

or equivalently

\[ p_y(y) = \frac{6 - |7 - y|}{36} \quad \text{for } y \in \{2, \ldots, 12\} \tag{3.18} \]
Suppose y is defined on the continuous domain, e.g., Dy = [0, 2], and that mass/probability is uniformly
spread amongst all the points in the domain. In such situations, it is not productive to consider the mass at
one particular point. Rather one would like to consider the mass in a small interval and scale it by dividing
by the length of the interval. In the limit this is the derivative which gives the density. For a continuous
random variable, if the function Fy is differentiable, a probability density function (pdf) may be defined.
\[ f_y : D_y \to \mathbb{R}^+ \tag{3.19} \]
\[ f_y(y) = \frac{dF_y(y)}{dy} \tag{3.20} \]

For example, the pdf for a uniformly distributed random variable y on [0, 2] is

\[ f_y(y) = \frac{d}{dy}\, \frac{y}{2} = \frac{1}{2} \quad \text{on } [0, 2] \]
The pdf for the Uniform Distribution is shown in the figure below.
[Figure: pdf for the Uniform (0, 2) Distribution: f_y(y) = 1/2 on [0, 2]]
Random variates of this type may be generated using ScalaTion’s Uniform (0, 2) class within the
scalation.random package.
1 val rvg = Uniform (0 , 2)
2 val yi = rvg . gen
For another example, the pdf for an exponentially distributed random variable y on [0, ∞) with rate
parameter λ is

\[ f_y(y) = \lambda e^{-\lambda y} \]

The pdf for the Exponential (λ = 1) Distribution is shown in the figure below.
[Figure: pdf for the Exponential (λ = 1) Distribution: f_y(y) = e^{−y} on [0, 4]]
Going the other direction, the CDF Fy (y) can be computed by summing the pmf py (y)
\[ F_y(y) = \sum_{x_i \le y} p_y(x_i) \tag{3.21} \]
3.4 Empirical Distribution
An empirical distribution may be used to describe a dataset probabilistically. Consider a dataset (X, y)
where X ∈ Rm×n is the data matrix collected about the predictor variables and y ∈ Rm is the data vector
collected about the response variable. In other words, the dataset consists of m instances of an n-dimensional
predictor vector xi and a response value yi .
The joint empirical probability mass function (epmf) may be defined on the basis of a given dataset
(X, y).
\[ p_{data}(x, y) = \frac{\nu(x, y)}{m} = \frac{1}{m} \sum_{i=0}^{m-1} \mathbb{1}_{\{x_i = x,\, y_i = y\}} \tag{3.23} \]
where ν(x, y) is the frequency count and 1{c} is the indicator function (if c then 1 else 0).
The corresponding Empirical Cumulative Distribution Function (ECDF) may be defined as follows:
\[ F_{data}(x, y) = \frac{1}{m} \sum_{i=0}^{m-1} \mathbb{1}_{\{x_i \le x,\, y_i \le y\}} \tag{3.24} \]
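For a single variable, the empirical pmf is just a table of normalized frequency counts, as in this plain-Scala sketch (the helper name `epmf` is hypothetical):

```scala
// Empirical pmf: relative frequency of each distinct value in a sample.
def epmf(y: Seq[Int]): Map[Int, Double] =
  y.groupMapReduce(identity)(_ => 1.0)(_ + _)    // frequency counts nu(y)
    .map((k, count) => k -> count / y.size)      // divide by m
```

For example, epmf(Seq(1, 1, 2, 3)) yields Map(1 -> 0.5, 2 -> 0.25, 3 -> 0.25).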
3.5 Expectation
Using the definition of a CDF, one can determine the expected value (or mean) for random variable y using
a Riemann-Stieltjes integral.
\[ E[y] = \int_{D_y} y \, dF_y(y) \tag{3.25} \]
The mean specifies the center of mass, e.g., a two-meter rod with the mass evenly distributed throughout
would have its center of mass at 1 meter. Although it will not affect the center of mass calculation, since the
total probability is 1, unit mass is assumed (one kilogram). The center of mass is the balance point in the
middle of the bar.
\[ E[y] = \sum_{y \in D_y} y\, p_y(y) \tag{3.27} \]
The mean for rolling two dice is E [y] = 7. One way to interpret this is to imagine winning y dollars by
playing a game, e.g., two dollars for rolling a 2 and twelve dollars for rolling a 12, etc. The expected earnings
when playing the game once is seven dollars. Also, by the law of large numbers, the average earnings for
playing the game n times will converge to seven dollars as n gets large.
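The two-dice mean can be confirmed by enumerating the 36 equally likely outcomes (plain Scala sketch):

```scala
// E[y] for y = x1 + x2, the sum of two fair dice: the average over all
// 36 equally likely outcomes.
def expectedDiceSum: Double =
  val sums = for x1 <- 1 to 6; x2 <- 1 to 6 yield (x1 + x2).toDouble
  sums.sum / sums.size
```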
3.5.3 Variance
The variance of random variable y is given by

\[ V[y] = E\left[(y - E[y])^2\right] \tag{3.28} \]
The variance specifies how the mass spreads out from the center of mass. For example, the variance of y ∼
Uniform (0, 2) is
\[ V[y] = E\left[(y - 1)^2\right] = \int_0^2 (y - 1)^2\, \frac{1}{2}\, dy = \frac{1}{3} \]
That is, the variance of the one kilogram, two-meter rod is 1/3 kilogram·meter². Again, for probability to
be viewed as mass, unit mass (one kilogram) must be used, so the answer may also be given as 1/3 meter².
Similarly to interpreting the mean as the center of mass, the variance corresponds to the moment of inertia.
The standard deviation is simply the square root of variance.
\[ SD[y] = \sqrt{V[y]} \tag{3.29} \]
For the two-meter rod, the standard deviation is 1/√3 = 0.57735. The percentage of mass within one
standard deviation unit of the center of mass is then 58%. Many distributions, such as the Normal (Gaussian)
distribution, concentrate mass closer to the center. For example, the Standard Normal Distribution has the
following pdf.
\[ f_y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \tag{3.30} \]
The mean for this distribution is 0, while the variance is 1. The percentage of mass within one standard
deviation unit of the center of mass is 68%. The pdf for the Normal (µ = 0, σ 2 = 1) Distribution is shown
in the figure below.
[Figure: pdf for the Normal (µ = 0, σ² = 1) Distribution on [−3, 3]]
Note, the uncentered variance (or mean square) of the random variable y is simply E[y²].
3.5.4 Covariance
The covariance of two random variables x and y is given by

\[ C[x, y] = E\left[(x - E[x])(y - E[y])\right] \]

For a random vector z with k elements, the covariance matrix collects the pairwise covariances.

\[ C[z] = \left[ C[z_i, z_j] \right]_{0 \le i, j < k} \tag{3.32} \]
3.6 Algebra of Random Variables
When random variables x₁ and x₂ are added to create a new random variable y,

\[ y = x_1 + x_2 \]

how is y described in terms of mean, variance and probability distribution? Also, what happens when a
random variable is multiplied by a constant?

\[ y = a x \]

The expectation of a sum is the sum of the expectations, E[x₁ + x₂] = E[x₁] + E[x₂]. The expectation of a
random variable multiplied by a constant is the constant multiplied by the random variable’s expectation.

\[ E[a x] = a\, E[x] \]

When the random variables are independent, the covariance is zero, so the variance of the sum is just the sum of the
variances.

\[ V[x_1 + x_2] = V[x_1] + V[x_2] \]

The variance of a random variable multiplied by a constant is the constant squared multiplied by the random
variable’s variance.

\[ V[a x] = a^2\, V[x] \]
Convolution: Discrete Case
Assuming the random variables are independent and discrete, the pmf of the sum py is the convolution of
two pmfs px1 and px2 .
\[ p_y(y) = \sum_{x \in D_x} p_{x_1}(x)\, p_{x_2}(y - x) \tag{3.39} \]
For example, letting x₁, x₂ ∼ Bernoulli(p), i.e., p_x(x) = p^x (1 − p)^{1−x} on D_x = {0, 1}, gives

\[ p_y(0) = \sum_{x \in D_x} p_{x_1}(x)\, p_{x_2}(0 - x) = (1 - p)^2 \]
\[ p_y(1) = \sum_{x \in D_x} p_{x_1}(x)\, p_{x_2}(1 - x) = 2p(1 - p) \]
\[ p_y(2) = \sum_{x \in D_x} p_{x_1}(x)\, p_{x_2}(2 - x) = p^2 \]
which indicates that y ∼ Binomial(p, 2). The pmf for the Binomial(p, n) distribution is

\[ p_y(y) = \binom{n}{y}\, p^y (1 - p)^{n - y} \tag{3.40} \]
[Figure: 6-by-6 lattice of outcomes (x, z) for two dice, x, z ∈ {1, …, 6}; outcomes on a downward diagonal share the same sum y = x + z]
As the joint pmf pxz (xi , zj ) = px (xi )pz (zj ) = 1/36 is constant over all points, the convolution sum for a
particular value of y corresponds to the downward diagonal sum where the dice sum to that value, e.g.,
py (3) = 2/36, py (7) = 6/36.
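The discrete convolution in Equation (3.39) can be sketched in plain Scala, here recovering the two-dice pmf from two copies of the single-die pmf.

```scala
// pmf of the sum of two independent discrete random variables (convolution).
def convolve(p1: Map[Int, Double], p2: Map[Int, Double]): Map[Int, Double] =
  (for (x1, q1) <- p1.toSeq; (x2, q2) <- p2.toSeq yield (x1 + x2, q1 * q2))
    .groupMapReduce(_._1)(_._2)(_ + _)

@main def diceConvolution(): Unit =
  val die = (1 to 6).map(_ -> 1.0 / 6).toMap
  val py  = convolve(die, die)
  println(py(3))   // ≈ 2/36
  println(py(7))   // ≈ 6/36
```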
Convolution: Continuous Case
Now, assuming the random variables are independent and continuous, the pdf of the sum fy is the convolution
of two pdfs fx1 and fx2 .
f_y(y) = \int_{D_x} f_{x_1}(x)\, f_{x_2}(y − x)\, dx    (3.42)
For example, letting x1 , x2 ∼ Uniform(0, 1), i.e., fx1 (x) = 1 on Dx = [0, 1], gives
for y ∈ [0, 1]:   f_y(y) = \int_0^y f_{x_1}(x)\, f_{x_2}(y − x)\, dx = y

for y ∈ [1, 2]:   f_y(y) = \int_{y−1}^1 f_{x_1}(x)\, f_{x_2}(y − x)\, dx = 2 − y
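The triangular shape of this density can be confirmed numerically by approximating the convolution integral with a midpoint Riemann sum. This is a plain Scala sketch; the step count n = 10000 is an arbitrary choice for the illustration:

```scala
// pdf of Uniform(0, 1)
def fx (x: Double): Double = if x >= 0.0 && x <= 1.0 then 1.0 else 0.0

// midpoint-rule approximation of the convolution integral f_y(y)
def fy (y: Double, n: Int = 10000): Double =
    val h = 1.0 / n
    (0 until n).map (i => { val x = (i + 0.5) * h; fx (x) * fx (y - x) * h }).sum

println (fy (0.5))     // ≈ 0.5  (rising part: fy(y) = y)
println (fy (1.5))     // ≈ 0.5  (falling part: fy(y) = 2 - y)
```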
More generally, consider the sum of m random variables.

y = \sum_{i=0}^{m−1} x_i

When x_i ∼ Uniform(0, 1) with mean \frac{1}{2} and variance \frac{1}{12}, then for m large enough y will follow a Normal distribution

y ∼ Normal(µ, σ^2)

where µ = \frac{m}{2} and σ^2 = \frac{m}{12}. The pdf for the Normal Distribution is

f_y(y) = \frac{1}{\sqrt{2\pi}\,σ}\, e^{−\frac{1}{2}\left(\frac{y−µ}{σ}\right)^2}    (3.43)
For most distributions, summed random variables will be approximately Normally distributed, as indicated by the Central Limit Theorem (CLT); for proofs see [47, 11]. Suppose x_i ∼ F with mean µ_x and variance σ_x^2 < ∞; then the sum of m independent and identically distributed (iid) random variables is distributed as follows:

y = \sum_{i=0}^{m−1} x_i ∼ N(mµ_x, mσ_x^2) \quad as\ m → ∞    (3.44)

This is one simple form of the CLT. See the exercises for a visual illustration of the CLT.
Similarly, the sum of m independent and identically distributed random variables (with mean µ_x and variance σ_x^2) divided by m will also be Normally distributed for sufficiently large m.

y = \frac{1}{m} \sum_{i=0}^{m−1} x_i

The expectation of y is E[y] = \frac{1}{m}\, mµ_x = µ_x, while the variance is V[y] = σ_x^2/m. As E[y] = µ_x, y can serve as an unbiased estimator of µ_x. This can be transformed to the Standard Normal Distribution with the following transformation.

z = \frac{y − µ_x}{σ_x/\sqrt{m}} ∼ Normal(0, 1)
The Normal distribution is also referred to as the Gaussian distribution. See the exercises for related
distributions: Chi-square, Student’s t and F .
3.7 Median, Mode and Quantiles
As stated, the mean is the expected value, a probability weighted sum/integral of the values in the domain of
the random variable. Other ways of characterizing a distribution are based more directly on the probability.
3.7.1 Median
Moving along the distribution, the place at which half of the mass is below you and half is above you is the
median.
P(y ≤ median) ≥ \frac{1}{2} \quad and \quad P(y ≥ median) ≥ \frac{1}{2}    (3.45)

Given equally likely values (1, 2, 3), the median is 2. Given equally likely values (1, 2, 3, 4), there are two common interpretations for the median: the smallest value satisfying the above equation (i.e., 2) or the average of the values satisfying the equation (i.e., 2.5). The median for two dice (with the numbers summed), which follow the Triangular distribution, is 7.
3.7.2 Quantile
The median is also referred to as the half quantile.

Q[y] = F_y^{−1}\left(\frac{1}{2}\right)    (3.46)

More generally, the p ∈ [0, 1] quantile is given by F_y^{−1}(p). For the two-meter rod example, the CDF is

p = F_y(y) = \frac{y}{2} \quad on\ [0, 2]

Taking the inverse yields the iCDF, F_y^{−1}(p) = 2p.
3.7.3 Mode
Similarly, one may be interested in the mode, which is the average of the points of maximal probability mass.
The mode for rolling two dice is y = 7. For continuous random variables, it is the average of points of
maximal probability density.
For the two-meter rod, the mean, median and mode are all equal to 1.
3.8 Joint, Marginal and Conditional Distributions
Knowledge of one random variable may be useful in narrowing down the possibilities for another random
variable. Therefore, it is important to understand how probability is distributed in multiple dimensions.
There are three main concepts: joint, marginal and conditional.
In general, the joint CDF for two random variables x and y is

F_{xy}(x, y) = P(x ≤ x, y ≤ y)

For discrete random variables, the joint pmf may be recovered from the joint CDF by differencing.

p_{xy}(x_i, y_j) = F_{xy}(x_i, y_j) − [F_{xy}(x_{i−1}, y_j) + F_{xy}(x_i, y_{j−1}) − F_{xy}(x_{i−1}, y_{j−1})]    (3.51)
See the exercises to check this formula for the matrix shown below.
3.8.1 Discrete Case: Joint and Marginal Mass

Imagine nine weights placed in a 3-by-3 grid with the number indicating the relative mass. The marginal mass functions are obtained by summing out the other random variable.

p_x(x_i) = \sum_{y_j ∈ D_y} p_{xy}(x_i, y_j) \quad (sum out y)    (3.52)

p_y(y_j) = \sum_{x_i ∈ D_x} p_{xy}(x_i, y_j) \quad (sum out x)    (3.53)

Carrying out the summations, or calling margProbX (pxy) for p_x(x_i) and margProbY (pxy) for p_y(y_j), gives the two marginal pmfs. It is now easy to see that p_x is based on row sums, while p_y is based on column sums.
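The row/column view can be sketched in a few lines of plain Scala. The 3-by-3 weights below are hypothetical (the original grid is not reproduced here), and margProbX/margProbY are stand-ins mirroring the ScalaTion method names mentioned above:

```scala
// marginal pmfs as row sums (for x) and column sums (for y) of a joint pmf
def margProbX (pxy: Array [Array [Double]]): Array [Double] = pxy.map (_.sum)
def margProbY (pxy: Array [Array [Double]]): Array [Double] = pxy.transpose.map (_.sum)

// hypothetical 3-by-3 grid of relative weights, normalized into a joint pmf
val nu  = Array (Array (1.0, 2.0, 1.0),
                 Array (2.0, 4.0, 2.0),
                 Array (1.0, 2.0, 1.0))
val tot = nu.map (_.sum).sum
val pxy = nu.map (_.map (_ / tot))

val px = margProbX (pxy)          // row sums:    [0.25, 0.5, 0.25]
val py = margProbY (pxy)          // column sums: [0.25, 0.5, 0.25]
```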
3.8.2 Continuous Case: Joint and Marginal Density
In the continuous case, the joint pdf for two random variables x and y is
f_{xy}(x, y) = \frac{∂^2 F_{xy}}{∂x\,∂y}(x, y)    (3.54)
Consider the following joint pdf that specifies the distribution of one kilogram of mass (or probability)
uniformly over a 2-by-3 meter plate.
f_{xy}(x, y) = \frac{1}{6} \quad on\ [0, 2] × [0, 3]

[Figure: uniform mass over the 2-by-3 plate, with a vertical line collected onto the x-axis and a horizontal line collected onto the y-axis]
The joint CDF is obtained by a double integral.

F_{xy}(x, y) = \int_0^x \int_0^y \frac{1}{6}\, dy\, dx = \frac{xy}{6}
There are two marginal pdfs that are single integrals: Think of the mass of the vertical red line being
collected into the thick red bar at the bottom. Collecting all such lines creates the red bar at the bottom
and its mass is distributed as follows:
f_x(x) = \int_0^3 \frac{1}{6}\, dy = \frac{3}{6} = \frac{1}{2} \quad on\ [0, 2] \quad (integrate out y)
Now think of the mass of the horizontal green line being collected into the thick green bar on the left.
Collecting all such lines creates the green bar on the left and its mass is distributed as follows:
f_y(y) = \int_0^2 \frac{1}{6}\, dx = \frac{2}{6} = \frac{1}{3} \quad on\ [0, 3] \quad (integrate out x)
3.8.3 Discrete Case: Conditional Mass
Conditional probability can be examined locally. Given two discrete random variables x and y, the conditional
mass function of x given y is defined as follows:
p_{x|y}(x_i, y_j) = P(x = x_i | y = y_j) = \frac{p_{xy}(x_i, y_j)}{p_y(y_j)}    (3.55)
where pxy (xi , yj ) is the joint mass function. Again, the marginal mass functions are
p_x(x_i) = \sum_{y_j ∈ D_y} p_{xy}(x_i, y_j)

p_y(y_j) = \sum_{x_i ∈ D_x} p_{xy}(x_i, y_j)

Consider the following example: roll two dice. Let x be the value on the first die and y be the sum of the two dice. Compute the conditional pmf for x given that it is known that y = 2.

p_{x|y}(x_i, 2) = P(x = x_i | y = 2) = \frac{p_{xy}(x_i, 2)}{p_y(2)}    (3.56)
Try this problem for each possible value for y.
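The dice are small enough to enumerate the conditional probabilities exactly. The following plain Scala sketch works directly from the joint pmf of (die 1 value, dice sum):

```scala
// joint pmf of (die 1 value x, dice sum y) for two fair dice
def pxy (x: Int, y: Int): Double =
    if x >= 1 && x <= 6 && y - x >= 1 && y - x <= 6 then 1.0 / 36.0 else 0.0

def py (y: Int): Double = (1 to 6).map (pxy (_, y)).sum      // marginal pmf of the sum

def pxGivenY (x: Int, y: Int): Double = pxy (x, y) / py (y)  // conditional pmf

println (pxGivenY (1, 2))    // = 1.0: a sum of 2 forces die 1 to be 1
println (pxGivenY (3, 7))    // ≈ 1/6: given a sum of 7, die 1 is uniform on 1..6
```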
3.8.4 Continuous Case: Conditional Density

Analogously, given two continuous random variables x and y, the conditional density function of x given y is defined as follows:

f_{x|y}(x, y) = \frac{f_{xy}(x, y)}{f_y(y)}    (3.57)
where fxy (x, y) is the joint density function. The marginal density functions are
f_x(x) = \int_{y ∈ D_y} f_{xy}(x, y)\, dy    (3.58)

f_y(y) = \int_{x ∈ D_x} f_{xy}(x, y)\, dx    (3.59)
The marginal density function in the x-dimension is the probability mass projected onto the x-axis from all
other dimensions, e.g., for a bivariate distribution with mass distributed in the first xy quadrant, all the
mass will fall onto the x-axis.
Consider the example below where the random variable x indicates how far down the center-line of a
straight golf hole the golf ball was driven in units of 100 yards. The random variable y indicates how far left
(positive) or right (negative) the golf ball ends up from the center of the fairway. Let us call these random variables distance and dispersion. The golfer teed the ball up at location [0, 0]. For simplicity, assume the
probability is uniformly distributed within the triangle.
[Figure: triangular region of possible ball positions, widening from the tee at (0, 0) to y ∈ [−1, 1] at x = 3]

f_{xy}(x, y) = \frac{1}{3} \quad on\ x ∈ [0, 3],\ y ∈ [−x/3, x/3]
The distribution (density) of the driving distance down the center-line is given by the marginal density for the random variable x.

f_x(x) = \int_{−x/3}^{x/3} \frac{1}{3}\, dy = \frac{y}{3} \Big|_{−x/3}^{x/3} = \frac{2x}{9}

Therefore, the conditional density of dispersion y given distance x is

f_{y|x}(x, y) = \frac{f_{xy}(x, y)}{f_x(x)} = \frac{1/3}{2x/9} = \frac{3}{2x} \quad on\ y ∈ [−x/3, x/3]
3.8.5 Independence
The two random variables x and y are said to be independent, denoted x ⊥ y, when the joint CDF (equivalently pmf/pdf) can be factored into the product of its marginal CDFs (equivalently pmfs/pdfs).

F_{xy}(x, y) = F_x(x)\, F_y(y)

For example, determine which of the following two joint density functions defined on [0, 1]^2 signify independence.
For the first joint density, f_{xy}(x, y) = 4xy, the two marginal densities are the following:

f_x(x) = \int_0^1 4xy\, dy = \frac{4xy^2}{2} \Big|_0^1 = 2x

f_y(y) = \int_0^1 4xy\, dx = \frac{4x^2 y}{2} \Big|_0^1 = 2y

The product of the marginal densities, f_x(x)\, f_y(y) = 2x · 2y = 4xy, equals the joint density, signifying independence.
Compute the conditional density under the assumption that the random variables, x and y, are indepen-
dent.
f_{x|y}(x, y) = \frac{f_{xy}(x, y)}{f_y(y)}    (3.62)

As the joint density can be factored, f_{xy}(x, y) = f_x(x)\, f_y(y), we obtain,

f_{x|y}(x, y) = \frac{f_x(x)\, f_y(y)}{f_y(y)} = f_x(x)    (3.63)
showing that the value of random variable y has no effect on x. See the exercises for a proof that independence
implies zero covariance (and therefore zero correlation).
3.8.6 Conditional Expectation

The conditional expectation of x given y = y is the probability weighted sum over the conditional pmf.

E[x | y = y] = \sum_{x ∈ D_x} x\, p_{x|y}(x, y)    (3.65)

Consider the previous example on the dispersion y of a golf ball conditioned on the driving distance x. Compute the conditional mean and the conditional variance for y given x.

µ_{y|x} = E[y | x = x] = \int_{−x/3}^{x/3} y\, f_{y|x}(x, y)\, dy

σ_{y|x}^2 = E[(y − µ_{y|x})^2 | x = x] = \int_{−x/3}^{x/3} (y − µ_{y|x})^2\, f_{y|x}(x, y)\, dy
3.8.7 Conditional Independence

A wide class of modeling techniques fall under the umbrella of probabilistic graphical models (e.g., Bayesian Networks and Markov Networks). They work by factoring a joint probability based on conditional independencies. Random variables x and y are conditionally independent given z, denoted

x ⊥ y | z

which means that

P(x, y | z) = P(x | z)\, P(y | z)
3.9 Odds
Another way of looking at probability is odds. This is the ratio of the probability of an event A occurring over the probability of the event not occurring, S − A.

odds(y ∈ A) = \frac{P(y ∈ A)}{P(y ∈ S − A)} = \frac{P(y ∈ A)}{1 − P(y ∈ A)}    (3.67)

For example, the odds of rolling a pair of dice and getting a natural (a 7 or 11) are 8 to 28.

odds(y ∈ {7, 11}) = \frac{8}{28} = \frac{2}{7} ≈ .2857
Of the 36 individual outcomes, eight will be a natural and 28 will not. Odds can be easily calculated from
probability.
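The conversion in both directions can be sketched in plain Scala (probabilityOf is a hypothetical helper name for the inverse mapping):

```scala
// odds of an event from its probability, and the inverse mapping
def odds (p: Double): Double = p / (1.0 - p)
def probabilityOf (o: Double): Double = o / (1.0 + o)

val pNatural = 8.0 / 36.0              // P(rolling a natural: 7 or 11)
val oNatural = odds (pNatural)         // 8/28 = 2/7 ≈ .2857

println (oNatural)
println (probabilityOf (oNatural))     // back to 8/36
```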
3.10 Example Problems
Understanding some of the techniques to be discussed requires background in conditional probability.
1. Consider the probability of rolling a natural (i.e., 7 or 11) with two dice where the random variable y
is the sum of the dice.
If you knew you rolled a natural, what is the conditional probability that you rolled a 5 or 7?
P(y ∈ A | x ∈ B) = \frac{P(y ∈ A, x ∈ B)}{P(x ∈ B)}

where

P(y ∈ A, x ∈ B) = P(x ∈ B | y ∈ A)\, P(y ∈ A)

so that

P(y ∈ A | x ∈ B) = \frac{P(x ∈ B | y ∈ A)\, P(y ∈ A)}{P(x ∈ B)}
This is Bayes Theorem written using random variables, which provides an alternative way to compute conditional probabilities. Here, P(y ∈ {5, 7} | y ∈ {7, 11}) = P(y = 7)/P(y ∈ {7, 11}) = (6/36)/(8/36) = 3/4, since only 7 is in both sets.

2. Suppose one of three coins is picked at random, where the first coin is two-headed and the other two are fair. Let the random variable x indicate which coin was picked. What is the probability that the first (two-headed) coin was picked?

P(x = 1) = 1/3

Obviously, the probability is 1/3, since the probability of picking any of the three coins is the same. This is the prior probability.

Not satisfied with this level of uncertainty, you conduct experiments. In particular, you flip the selected coin three times and get all heads. Let y indicate the number of heads flipped. Using Bayes Theorem, we have,

P(x = 1 | y = 3) = \frac{P(y = 3 | x = 1)\, P(x = 1)}{P(y = 3)} = \frac{1 · (1/3)}{5/12} = 4/5
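The calculation can be laid out in plain Scala. The coin identities (one two-headed coin, two fair coins) are inferred from the probabilities in the text, so treat that setup as an assumption:

```scala
// Bayes Theorem: posterior for coin 1 (assumed two-headed) after observing
// three heads in three flips; the other two coins are assumed fair.
val prior      = Array (1.0/3.0, 1.0/3.0, 1.0/3.0)
val likelihood = Array (1.0, 0.125, 0.125)                        // P(y = 3 | coin i)
val evidence   = (prior zip likelihood).map ((p, l) => p * l).sum // P(y = 3) = 5/12
val posterior  = likelihood(0) * prior(0) / evidence              // P(x = 1 | y = 3)

println (evidence)
println (posterior)      // 0.8, i.e., 4/5
```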
3.11 Estimating Parameters from Samples
Given a model for predicting a response value for y from a feature/predictor vector x,
y = f(x; b) + ε

one needs to pick a functional form for f and collect a sample of data to estimate the parameters b. The sample will consist of m instances (y_i, x_i) that form the response/output vector y and the data/input matrix X.

y = f(X; b) + ε
There are multiple types of estimation procedures. The central ideas are to minimize error or maximize
the likelihood that the model would generate data like the sample. A common way to minimize error is to
minimize the Mean Squared Error (MSE). The error vector is the difference between the actual response
vector y and the predicted response vector ŷ.
ε = y − ŷ = y − f(X; b)

The mean squared error is based on the length (Euclidean norm) of the error vector ‖ε‖:

E[‖ε‖^2] = V[‖ε‖] + E[‖ε‖]^2    (3.68)

where V[‖ε‖] is the error variance and E[‖ε‖] is the error mean. If the model is unbiased, the error mean will be zero, in which case the goal is to minimize the error variance.

Consider a simple model where the response y varies about a mean µ.

y = µ + ε

where ε ∼ N(0, σ^2). Create a sample of size m = 100 data points, using a Normal random variate generator.
The population values for the mean µ and standard deviation σ are typically unknown and need to be
estimated from the sample, hence the names sample mean µ̂ and sample standard deviation σ̂. Show the
generated sample, by plotting the data points and displaying a histogram.
@main def sampleStats (): Unit =
    ...
end sampleStats

Imports: scalation.mathstat._, scalation.random._.
µ̂ = \frac{1 · y}{m} = \frac{1}{m} \sum_{i=0}^{m−1} y_i
To create a confidence interval, we need to determine the variability or variance in the estimate µ̂.

V[µ̂] = V\left[\frac{1}{m} \sum_{i=0}^{m−1} y_i\right] = \frac{1}{m^2} \sum_{i=0}^{m−1} V[y_i] = \frac{σ^2}{m}
The difference between the estimate from the sample and the population mean is Normally distributed and
centered at zero (show that µ̂ is an unbiased estimator for µ, i.e., E [µ̂] = µ).
µ̂ − µ ∼ N\left(0, \frac{σ^2}{m}\right)

We would like to transform the difference so that the resulting expression follows a Standard Normal distribution. This can be done by dividing by σ/\sqrt{m}.

\frac{µ̂ − µ}{σ/\sqrt{m}} ∼ N(0, 1)
Consequently, the probability that the expression is greater than z is given by the CDF of the Standard
Normal distribution, FN (z).
P\left(\frac{µ̂ − µ}{σ/\sqrt{m}} > z\right) = 1 − F_N(z)

One might consider that if this standardized difference exceeds z = 2, two standard deviation units, then the estimate is not close enough. The same problem can exist on the negative side, so we should require

\left|\frac{µ̂ − µ}{σ/\sqrt{m}}\right| ≤ 2
In other words,
|µ̂ − µ| ≤ \frac{2σ}{\sqrt{m}}

This condition implies that µ would likely be inside the following confidence interval.

\left[\,µ̂ − \frac{2σ}{\sqrt{m}},\ µ̂ + \frac{2σ}{\sqrt{m}}\,\right]
For example, with σ = 8 and m = 100, it is easy to compute values for the lower and upper bounds of the confidence interval: the interval half width is simply \frac{2 · 8}{10} = 1.6, which is subtracted from and added to the sample mean.

Use ScalaTion to determine the probability that µ is within such confidence intervals.

println (s"1 - F(2) = ${1 - normalCDF (2)}")

The probability is one minus twice this value. If 1.96 is used instead of 2, what is the probability, expressed as a percentage?
Typically, the population standard deviation is unlikely to be known. It would need to be estimated by using the sample standard deviation, where the sample variance is

σ̂^2 = \frac{1}{m − 1} \sum_{i=0}^{m−1} (y_i − µ̂)^2    (3.69)
Note, this textbook uses θ̂ to indicate an estimator for parameter θ, regardless of whether it is a Maximum Likelihood (ML) estimator. This substitution introduces more variability into the estimation of the confidence interval and results in the Standard Normal distribution (z-distribution) interval

\left[\,µ̂ − \frac{z^* σ}{\sqrt{m}},\ µ̂ + \frac{z^* σ}{\sqrt{m}}\,\right]    (3.70)

being replaced by the Student's t distribution interval

\left[\,µ̂ − \frac{t^* σ̂}{\sqrt{m}},\ µ̂ + \frac{t^* σ̂}{\sqrt{m}}\,\right]    (3.71)
where z^* and t^* represent distances from zero, e.g., 1.96 or 2.09, that are large enough so that the analyst is comfortable with the probability that they may be wrong.

The numerators for the interval half widths (ihw) are calculated by the following top-level functions in Statistics.scala. The z_sigma function is used for the z-distribution.

def z_sigma (sig: Double, p: Double = .95): Double =
    val pp = 1.0 - (1.0 - p) / 2.0                 // e.g., .95 --> .975 (two tails)
    val z  = random.Quantile.normalInv (pp)
    z * sig
end z_sigma
Does the probability you determined in the last example problem make sense? Seemingly, if you took several samples, only a certain percentage of them would have the population mean within their confidence interval.

@main def confidenceIntervalTest (): Unit =
    ...
    val (mu_, sig_) = (y.mean, y.stdev)            // sample mean and std dev
    ...
end confidenceIntervalTest
3.12 Entropy
The entropy of a discrete random variable y with probability mass function (pmf) py (y) is the negative of
the expected value of the log of the probability.
H(y) = H(p_y) = −E[\log_2 p_y] = −\sum_{y ∈ D_y} p_y(y) \log_2 p_y(y)    (3.72)
For finite domains of size k = |Dy |, entropy H(y) ranges from 0 to log2 (k). Low entropy (close to 0) means
that there is low uncertainty/risk in predicting an outcome of an experiment involving the random variable
y, while high entropy (close to log2 k) means that there is high uncertainty/risk in predicting an outcome of
such an experiment. For binary classification (k = 2), the upper bound on entropy is 1.
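The entropy computation can be sketched in a few lines of plain Scala (ScalaTion's Probability object provides an entropy method with a similar effect; this stand-alone version treats 0 log 0 as 0):

```scala
def log2 (x: Double): Double = math.log (x) / math.log (2.0)

// entropy of a pmf given as an array (zero-probability terms contribute 0)
def entropy (p: Array [Double]): Double =
    - p.filter (_ > 0.0).map (pi => pi * log2 (pi)).sum

println (entropy (Array (0.5, 0.5)))     // ≈ 1.0: maximal uncertainty for k = 2
println (entropy (Array (0.9, 0.1)))     // ≈ 0.469: low uncertainty
```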
The entropy may be normalized by setting the base of the logarithm to the size of the domain k, in which
case, the entropy will be in the interval [0, 1].
H_k(y) = H_k(p_y) = −E[\log_k p_y] = −\sum_{y ∈ D_y} p_y(y) \log_k p_y(y)
A random variable y ∼ Bernoulli(p) may be used to model the flip of a single coin that has a probability of
success/head (1) of p. Its pmf is given by the following formula.
p_y(y) = p^y (1 − p)^{1−y}
The figure below plots the entropy H([p, 1 − p]) as probability of a head p ranges from 0 to 1.
[Figure: Entropy for the Bernoulli pmf, H([p, 1 − p]) plotted vs. p, rising from 0 at p = 0 to 1 at p = .5 and falling back to 0 at p = 1]
A random variable y = z_1 + z_2, where z_1, z_2 are distributed as Bernoulli(p), may be used to model the number of heads when flipping two coins.
See the exercises for how to extend entropy to continuous random variables.
The concept of plog (the negative log base-2 probability), plog(p) = −\log_2(p), can also be used in place of probability and offers several advantages: (1) multiplying many small probabilities may lead to round-off error or underflow; (2) independence leads to addition of plog values rather than multiplication of probabilities; and (3) it is related to the log-likelihood in Maximum Likelihood Estimation.

plog(x) = \sum_j plog(x_j) \quad for\ independent\ random\ variables    (3.75)

The greater the plog, the less likely the occurrence, e.g., the plog of rolling snake eyes (1, 1) with two dice is about 5.17, while the plog of rolling a 7 is about 2.58. Note, probabilities 1 and .5 give plogs of 0 and 1, respectively.
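These plog values are easy to verify. The inline plog below follows the definition in ScalaTion's Probability object (shown later in this section), restated in plain Scala:

```scala
def log2 (x: Double): Double = math.log (x) / math.log (2.0)
inline def plog (p: Double): Double = - log2 (p)

val plogSnakeEyes = plog (1.0 / 36.0)     // ≈ 5.17 (least likely roll)
val plogSeven     = plog (6.0 / 36.0)     // ≈ 2.58 (most likely roll)

println (plogSnakeEyes)
println (plogSeven)
```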
3.12.2 Joint Entropy
Entropy may be defined for multiple random variables as well. Given two discrete random variables, x, y,
with a joint pmf px,y (x, y) the joint entropy is defined as follows:
H(x, y) = H(p_{x,y}) = −E[\log_2 p_{x,y}] = −\sum_{x ∈ D_x} \sum_{y ∈ D_y} p_{x,y}(x, y) \log_2 p_{x,y}(x, y)    (3.76)

3.12.3 Conditional Entropy

The conditional entropy of x given y replaces the joint pmf inside the logarithm with the conditional pmf.

H(x|y) = H(p_{x|y}) = −E[\log_2 p_{x|y}] = −\sum_{x ∈ D_x} \sum_{y ∈ D_y} p_{x,y}(x, y) \log_2 p_{x|y}(x, y)    (3.77)

Suppose an experiment involves two random variables x and y. Initially, the overall entropy is given by the joint entropy H(x, y). Now, partial evidence allows the value of y to be determined, so the overall entropy should decrease by y's entropy, leaving the conditional entropy.

H(x|y) = H(x, y) − H(y)

When there is no dependency between x and y (i.e., they are independent), H(x, y) = H(x) + H(y), so

H(x|y) = H(x)

At the other extreme, when there is full dependency (i.e., the value of x can be determined from the value of y),

H(x|y) = 0    (3.80)
3.12.4 Relative Entropy

Given a discrete random variable y with two candidate probability mass functions (pmfs) p_y(y) and q_y(y), the relative entropy is defined as follows:

H(p_y ‖ q_y) = E\left[\log_2 \frac{p_y}{q_y}\right] = \sum_{y ∈ D_y} p_y(y) \log_2 \frac{p_y(y)}{q_y(y)}    (3.81)
One way to look at relative entropy is that it measures the uncertainty that is introduced by replacing
the true/empirical distribution py with an approximate/model distribution qy . If the distributions are
identical, then the relative entropy is 0, i.e., H(py ||py ) = 0. The larger the value of H(py ||qy ) the greater
the dissimilarity between the distributions py and qy .
As an example, assume the true distribution for a coin is [.6, .4], but it is thought that the coin is fair
[.5, .5]. The relative entropy is computed as follows:
H(p_y ‖ q_y) = .6 \log_2 (.6/.5) + .4 \log_2 (.4/.5) ≈ 0.029
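This small calculation can be sketched in plain Scala, mirroring the rentropy method in ScalaTion's Probability object:

```scala
def log2 (x: Double): Double = math.log (x) / math.log (2.0)

// relative entropy (KL divergence) of pmf p from pmf q
def rentropy (p: Array [Double], q: Array [Double]): Double =
    (p zip q).map ((pi, qi) => pi * log2 (pi / qi)).sum

val re = rentropy (Array (0.6, 0.4), Array (0.5, 0.5))
println (re)                                              // ≈ 0.029
```

For identical pmfs the relative entropy is 0, e.g., rentropy (Array (0.6, 0.4), Array (0.6, 0.4)) returns 0.0.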
Given a continuous random variable y with two candidate probability density functions (pdfs) f_y(y) and g_y(y), the relative entropy is defined as follows:

H(f_y ‖ g_y) = E\left[\log_2 \frac{f_y}{g_y}\right] = \int_{D_y} f_y(y) \log_2 \frac{f_y(y)}{g_y(y)}\, dy    (3.82)
In this subsection, we examine the relationship between KL Divergence and Maximum Likelihood. Consider the dissimilarity of an empirical distribution p_{data}(x, y) and a model generated distribution p_{mod}(x, y, b).

H(p_{data}(x, y) ‖ p_{mod}(x, y, b)) = E\left[\log \frac{p_{data}(x, y)}{p_{mod}(x, y, b)}\right]    (3.83)

= \sum_{i=0}^{m−1} p_{data}(x_i, y_i) \log \frac{p_{data}(x_i, y_i)}{p_{mod}(x_i, y_i, b)}    (3.84)

Note that p_{data}(x_i, y_i) is unaffected by the choice of parameters b, so the first term contributes a constant C.

H(p_{data}(x, y) ‖ p_{mod}(x, y, b)) = C − \sum_{i=0}^{m−1} p_{data}(x_i, y_i) \log p_{mod}(x_i, y_i, b)    (3.85)

The probability for the i-th data instance is \frac{1}{m}, thus

H(p_{data}(x, y) ‖ p_{mod}(x, y, b)) = C − \frac{1}{m} \sum_{i=0}^{m−1} \log p_{mod}(x_i, y_i, b)    (3.86)

The second term is the negative log-likelihood divided by m (see the Chapter on Generalized Linear Models for details), so minimizing the relative entropy corresponds to maximizing the log-likelihood.

3.12.5 Cross Entropy

The cross entropy is the sum of the entropy of the empirical distribution and the model distribution's relative entropy to the empirical distribution. It can be calculated using the following formula (see exercises for details):
H(p_y × q_y) = −\sum_{y ∈ D_y} p_y(y) \log_2 q_y(y)    (3.88)
Since cross entropy is more efficient to calculate than relative entropy, it is a good candidate as a loss function
for machine learning algorithms. The smaller the cross entropy, the more the model (e.g., Neural Network)
agrees with the empirical distribution (dataset). The formula looks like the one for ordinary entropy with
qy substituted in as the argument for the log function. Hence the name cross entropy.
3.12.6 Mutual Information
Recall that if x and y are independent, then for all x ∈ D_x and y ∈ D_y,

p_{xy}(x, y) = p_x(x)\, p_y(y)

The relative entropy (KL divergence) of the joint distribution to the product of the marginal distributions is referred to as mutual information.

I(x; y) = H(p_{xy} ‖ p_x p_y) = \sum_{x ∈ D_x} \sum_{y ∈ D_y} p_{xy}(x, y) \log_2 \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}

As with covariance (or correlation), mutual information will be zero when x and y are independent. However, while independence merely implies zero covariance, independence is equivalent to zero mutual information. Mutual information is symmetric and non-negative. See the exercises for additional comparisons between covariance/correlation and mutual information.

While mutual information measures the dependence between two random variables, relative entropy (KL divergence) measures the dissimilarity of two distributions.
Mutual Information corresponds to Information Gain, i.e., the drop in entropy of one random variable
due to knowledge of the value of the other random variable.
3.12.7 Probability Object
Class Methods:
object Probability:

    def isProbability (px: VectorD): Boolean  = px.min >= 0.0 && abs (px.sum - 1.0) < EPSILON
    def isProbability (pxy: MatrixD): Boolean = pxy.mmin >= 0.0 && abs (pxy.sum - 1.0) < EPSILON
    def freq (x: VectorI, vc: Int, y: VectorI, k: Int): MatrixD =
    def freq (x: VectorI, y: VectorI, k: Int, vl: Int): (Double, VectorI) =
    def freq (x: VectorD, y: VectorI, k: Int, vl: Int, cont: Boolean,
    def count (x: VectorD, vl: Int, cont: Boolean, thres: Double): Int =
    def toProbability (nu: VectorI): VectorD = nu.toDouble / nu.sum.toDouble
    def toProbability (nu: VectorI, n: Int): VectorD = nu.toDouble / n.toDouble
    def toProbability (nu: MatrixD): MatrixD = nu / nu.sum
    def toProbability (nu: MatrixD, n: Int): MatrixD = nu / n.toDouble
    def probY (y: VectorI, k: Int): VectorD = y.freq (k)._2
    def jointProbXY (px: VectorD, py: VectorD): MatrixD = outer (px, py)
    def margProbX (pxy: MatrixD): VectorD =
    def margProbY (pxy: MatrixD): VectorD =
    def condProbY_X (pxy: MatrixD, px_ : VectorD = null): MatrixD =
    def condProbX_Y (pxy: MatrixD, py_ : VectorD = null): MatrixD =
    inline def plog (p: Double): Double = - log2 (p)
    def plog (px: VectorD): VectorD = px.map (plog (_))
    def entropy (px: VectorD): Double =
    def entropy (nu: VectorI): Double =
    def entropy (px: VectorD, b: Int): Double =
    def nentropy (px: VectorD): Double =
    def rentropy (px: VectorD, qx: VectorD): Double =
    def centropy (px: VectorD, qx: VectorD): Double =
    def entropy (pxy: MatrixD): Double =
    def entropy (pxy: MatrixD, px_y: MatrixD): Double =
    def muInfo (pxy: MatrixD, px: VectorD, py: VectorD): Double =
    def muInfo (pxy: MatrixD): Double = muInfo (pxy, margProbX (pxy), margProbY (pxy))
For example, the following freq method is used by Naı̈ve Bayes Classifiers. It computes the Joint Frequency
Table (JFT) for all value combinations of vectors x and y by counting the number of cases where xi = v
and yi = c.
@param x    the variable/feature vector
@param vc   the number of distinct values in vector x (value count)
@param y    the response/classification vector
@param k    the maximum value of y + 1 (number of classes)
3.13 Exercises
Several random number and random variate generators can be found in ScalaTion’s random package. Some
of the following exercises will utilize these generators.
1. Let the random variable h be the number of heads when two coins are flipped. Determine the following conditional probability: P(h = 2 | h ≥ 1).

2. Derive Bayes Theorem from the definition of conditional probability.

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}
3. Compute the mean and variance for the Bernoulli Distribution with success probability p.
4. Use the Randi random variate generator to run experiments to check the pmf and CDF for rolling two
dice.
import scalation.mathstat._
import scalation.random.Randi
...

6. Show that the variance equals the mean square minus the squared mean.

V[y] = E[(y − E[y])^2] = E[y^2] − E[y]^2
7. Show that the covariance of two independent, continuous random variables, x and y, is zero.
C[x, y] = E[(x − µ_x)(y − µ_y)] = \int_{D_y} \int_{D_x} (x − µ_x)(y − µ_y)\, f_{xy}(x, y)\, dx\, dy
8. Derive the formula for the expectation of the sum of random variables.
9. Derive the formula for the variance of the sum of random variables.
10. Use the Uniform random variate generator and the Histogram class to run experiments illustrating
the Central Limit Theorem (CLT).
import scalation.mathstat._
import scalation.random.Uniform

@main def cLTTest (): Unit =

    val rg = Uniform ()
    val x  = VectorD (for i <- 0 until 100000 yield rg.gen + rg.gen + rg.gen + rg.gen)
    new Histogram (x)

end cLTTest
11. Let z ∼ N(0, 1). Show empirically that its square follows the Chi-square distribution with one degree of freedom.

z^2 ∼ χ_1^2

12. Let v ∼ χ_k^2 be independent of z. Show empirically that the ratio below follows the Student's t distribution with k degrees of freedom.

\frac{z}{\sqrt{v/k}} ∼ t_k

13. Let u ∼ χ_{k_1}^2 and v ∼ χ_{k_2}^2 be independent. Show empirically that their ratio, each divided by its degrees of freedom, follows the F distribution.

\frac{u/k_1}{v/k_2} ∼ F_{k_1, k_2}

14. Run the confidenceIntervalTest main function (see the Confidence Interval section) for values of m = 20 to 40, 60, 80 and 100. Report the confidence interval and the number of cases when the true value was inside the confidence interval for (a) the z-distribution and (b) the t-distribution. Explain.
16. Show that the formula for computing the joint probability mass function (pmf) for the 3-by-3 grid of weights is correct. Hint: add/subtract rectangular regions of the grid and make sure nothing is double counted.

17. Show for k = 2, where pp = [p, 1 − p], that H(pp) = −p \log_2(p) − (1 − p) \log_2(1 − p). Plot the entropy H(pp) versus p.

val p = VectorD.range (1, 100) / 100.0
val h = p.map (p => -p * log2 (p) - (1 - p) * log2 (1 - p))
new Plot (p, h)
18. Plot the entropy H and normalized entropy Hk for the first 16 Binomial(p, n) distributions, i.e., for
the number of coins n = 1, . . . , 16. Try with p = .6 and p = .5.
19. Entropy can be defined for continuous random variables. Take the definition for discrete random
variables and replace the sum with an integral and the pmf with a pdf. Compute the entropy for
y ∼ Uniform(0, 1).
20. Using the summation formulas for entropy, relative entropy and cross entropy, show that cross entropy
is the sum of entropy and relative entropy.
21. Show that mutual information equals the sum of marginal entropies minus the joint entropy, i.e.,

I(x; y) = H(x) + H(y) − H(x, y)
22. Compare correlation and mutual information in terms of how well they measure dependence between
random variables x and y. Try various functional relationships: negative exponential, reciprocal,
constant, logarithmic, square root, linear, right-arm quadratic, symmetric quadratic, cubic, exponential
and trigonometric.
y = f(x) + ε
Other types of relationships are also possible. Try various constrained mathematical relations: circle,
ellipse and diamond.
f(x, y) + ε = c
23. Consider an experiment involving the roll of two dice. Let x indicate the value of die 1 and x2 indicate the value of die 2. In order to examine dependency between random variables, define y = x + x2.
The joint pmf px,y can be recorded in a 6-by-11 matrix that can be computed from the following
feasible occurrence matrix (0 → cannot occur, 1 → can occur), since all the non-zero probabilities are
the same (equal likelihood).
// X  - die 1:   1, 2, 3, 4, 5, 6
// X2 - die 2:   1, 2, 3, 4, 5, 6
// Y = X + X2:   2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
val nuxy = MatrixD ((6, 11), 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                             0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
                             0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
                             0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
                             0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
                             0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
Use methods in the Probability object to compute the joint, marginal and conditional probability
distributions, as well as the joint, marginal, conditional and relative entropy, and mutual information.
Explore the independence between random variables x and y.
24. Convolution. The convolution operator may be applied to vectors as well as functions (including
mass and density functions). Consider two vectors c ∈ Rm and x ∈ Rn . Without loss of generality let
m ≤ n, then their convolution is defined as follows:
y = c ⋆ x, \quad where\ \ y_k = \sum_{j=0}^{m−1} c_j\, x_{k−j}, \quad k = 0, …, m + n − 2    (3.92)
Note, there are also ’same’ and ’valid’ versions of convolution operators.
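A plain Scala sketch of the 'full' version, plus one common reading of 'valid' (conventions for 'same' and 'valid' vary between libraries, so treat the valid variant as an assumption):

```scala
// 'full' convolution of c (length m) and x (length n): result length m + n - 1
def convFull (c: Array [Double], x: Array [Double]): Array [Double] =
    val y = Array.fill (c.length + x.length - 1)(0.0)
    for j <- c.indices; k <- x.indices do y(j + k) += c(j) * x(k)
    y

// 'valid' convolution: only the n - m + 1 positions where c fully overlaps x
def convValid (c: Array [Double], x: Array [Double]): Array [Double] =
    val m = c.length
    (0 to x.length - m).map (k => (0 until m).map (j => c(j) * x(k + m - 1 - j)).sum).toArray

val c = Array (1.0, 2.0)
val x = Array (1.0, 1.0, 1.0)
println (convFull (c, x).mkString (", "))     // 1.0, 3.0, 3.0, 2.0
println (convValid (c, x).mkString (", "))    // 3.0, 3.0
```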
25. Consider a distribution with density on the interval [0, 2]. Let the probability density function (pdf)
for this distribution be the following:
f_y(y) = \frac{y}{2} \quad on\ [0, 2]
(i) Draw/plot the pdf fy (y) vs. y for the interval [0, 2].
(ii) Determine the Cumulative Distribution Function (CDF), Fy (y).
(iii) Draw/plot the CDF Fy (y) vs y for the interval [0, 2].
(iv) Determine the expected value of the Random Variable (RV) y, i.e., E [y].
26. Take the limit of the difference quotient of the monomial x^n to show that

\frac{d}{dx} x^n = n x^{n−1}

Recall the definition of the derivative as the limit of the difference quotient.

\frac{d}{dx} f(x) = \lim_{h → 0} \frac{f(x + h) − f(x)}{h}

Recall the notations due to Leibniz, Lagrange, and Euler.

\frac{d}{dx} f(x) = f'(x) = D_x f(x)
27. Take the integral and then the derivative of the monomial x^n to show that

\frac{d}{dx} \int x^n\, dx = x^n
3.14 Further Reading
1. Probability and Mathematical Statistics [163].
3.15 Notational Conventions
With respect to random variables, vectors and matrices, the following notational conventions shown in Table
3.1 will be used in this book.
Built on the Functional Programming features in Scala, ScalaTion supports several function types:

type FunctionS2S = Double => Double       // function of a scalar
type FunctionS2V = Double => VectorD      // vector-valued function of a scalar
...
These function types are defined in the scalation and scalation.mathstat packages. A scalar-valued
function type ends in ’S’, a vector-valued function type ends in ’V’, and a matrix-valued function type ends
in ’M’.
Mathematically, the scalar-valued functions are denoted by a symbol, e.g., f .
S2S function  f : R → R
V2S function  f : R^n → R
S2V function  f : R → R^n
V2V function  f : R^m → R^n
M2V function  f : R^{m×p} → R^n
V2M function  f : R^p → R^{m×n}
M2M function  f : R^{p×q} → R^{m×n}
3.16 Model
Models are about making predictions: given certain properties of a car, predict the car's mileage; given recent performance of a stock index fund, forecast its future value; or given a person's credit report, classify them as either likely or not likely to repay a loan. The thing that is being predicted, forecasted or classified is referred to as the response/output variable, call it y. In many cases, the "given something" is either captured by other input/feature variables collected into a vector, call it x,

y = f(x; b) + ε    (3.94)

or by previous values of y. Some functional form f is chosen to map input vector x into a predicted value for response y. The last term indicates the difference between actual and predicted values, i.e., the residual ε. The function f is parameterized and often these parameters can be collected into a vector b.
If values for the parameter vector b are set randomly, the model is unlikely to produce accurate pre-
dictions. The model needs to be trained by collecting a dataset, i.e., several (m) instances of (xi , yi ), and
optimizing the parameter vector b to minimize some loss function, such as mean squared error (mse),
mse = \frac{1}{m} ‖y − ŷ‖^2    (3.95)
where y is the vector from all the response instances and ŷ = f (X; b) is the vector of predicted response
values and X is the matrix formed from all the input/feature vector instances.
Estimation Procedures
Although there are many types of parameter estimation procedures, this text only utilizes the three most
commonly used procedures [14].
The method of moments develops equations that relate the moments of a distribution to the parameters of the model, in order to create estimates for the parameters. Least Squares Estimation takes the sum of squared
errors and sets the parameter values to minimize this sum. It has three main varieties: Ordinary Least
Squares (OLS), Weighted Least Squares (WLS), and Generalized Least Squares (GLS). Finally, Maximum
Likelihood Estimation sets the parameter values so that the observed data is likely to occur. The easiest
way to think about this is to imagine that one wants to create a generative model (a model that generates
data). One would want to set the parameters of the model so it generates data that looks like the given
dataset.
Setting of parameters is done by solving a system of equations for the simpler models, or by using an
optimization algorithm for more complex models.
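For instance, for the Exponential distribution the first moment is E[y] = 1/λ, so matching it to the sample mean gives the method of moments estimate λ̂ = 1/ȳ. A minimal sketch in plain Scala (hypothetical data, for illustration only):

```scala
// Method of moments for Exponential(lambda): E[y] = 1/lambda  =>  lambda-hat = 1/mean
val sample = Array (0.5, 1.0, 1.5, 2.0)    // hypothetical observations
val mean   = sample.sum / sample.length    // first sample moment (1.25)
val lambdaHat = 1.0 / mean                 // estimated rate parameter
```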
Quality of Fit (QoF)
After a model is trained, its Quality of Fit (QoF) should be evaluated. One way to perform the evaluation
is to train the model on the full dataset and test as well on the full dataset. For complex models with many
parameters, over-fitting will likely occur. Then its excellent evaluation is unlikely to be reproduced when the
model is applied in the real world. To avoid overly optimistic evaluations due to over-fitting, it is common
to divide a dataset (X, y) into a training dataset and testing dataset where training is conducted on the
training dataset (Xr , yr ) and evaluation is done on the test dataset (Xe , ye ). The conventions used in this
book for the full, training and test datasets are shown in Table 3.3.
Note, when training and testing on the full dataset, the training and test dataset are actually the same, i.e.,
they are the full dataset. If a model has many parameters, the Quality of Fit (QoF) found from training
and testing on the full dataset should be suspect. See the section on cross-validation for more details.
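A basic train/test split can be sketched in plain Scala as follows (an 80/20 split on shuffled instance indices; ScalaTion provides its own support for splitting, so this is an illustration only):

```scala
import scala.util.Random

// Split m instance indices into training (80%) and test (20%) index sets
val m       = 10
val idx     = Random.shuffle ((0 until m).toList)      // shuffled instance indices
val cut     = (0.8 * m).toInt
val trainIdx = idx.take (cut)                          // rows for (Xr, yr)
val testIdx  = idx.drop (cut)                          // rows for (Xe, ye)
```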
In ScalaTion, the Model trait serves as the base trait for all the modeling techniques in the modeling package and its sub-packages classifying, clustering, fda, forecasting, and recommender.
Model Trait
Trait Methods:
trait Model:
The getFname method returns the predictor variable/feature names in the model. The train method
will use a training or full dataset to train the model, i.e., optimize its parameter vector b to minimize a
given loss function. After training, the quality of the model may be assessed using the test method. The
evaluation may be performed on a test or full dataset. Finally, information about the model may be extracted
113
by the following three methods: (1) hparameter showing the hyper-parameters, (2) parameter showing the parameters, and (3) report showing the hyper-parameters, the parameters, and the Quality of Fit (QoF) of
the model. Note, hyper-parameters are used by some modeling techniques to influence either the result or
how the result is obtained.
Classes that implement (directly or indirectly) the Model trait should default x and x_e to the full data/input matrix x, and y and y_e to the full response/output vector y that are passed into the class constructor.
Implementations of the train method take a training data/input matrix x and a training response/output vector y and optimize the parameter vector b to, for example, minimize error or maximize likelihood. Implementations of the test method take a test data/input matrix x_e and the corresponding test response/output vector y_e to compute errors and evaluate the Quality of Fit (QoF). Note that with cross-validation (to be explained later), there will be multiple training and test datasets created from one full dataset. Implementations of the hparameter method simply return the hyper-parameter vector hparam, while implementations of the parameter method simply return the optimized parameter vector b. (The fname and technique parameters for Regression are the feature names and the solution/optimization technique used to estimate the parameter vector, respectively.)
Associated with the Model trait is the FitM trait that provides QoF measures common to all types of
models. For prediction, Fit extends FitM with several additional QoF measures that are explained in the Prediction Chapter. Similarly, FitC extends FitM for classification models.
FitM Trait
Trait Methods:
trait FitM:
The diagnose method takes the actual response/output vector y and the predictions from the model yp
and calculates the basic QoF measures.
@param y   the actual response/output vector to use (test/full)
@param yp  the predicted response/output vector (test/full)
@param w   the weights on the instances (defaults to null)

val mu = y.mean                                        // mean of y (may be zero)
val e  = y - yp                                        // residual/error vector
sse    = e.normSq                                      // sum of squares for error
if w == null then
    sst = y.cnormSq                                    // sum of squares total
    ssr = sst - sse                                    // sum of squares model
else
    ssr = (w * (yp - (w * yp / w.sum).sum) ~^ 2).sum   // regression sum of squares
    sst = ssr + sse
end if
Note, ~^ is the exponentiation operator provided in ScalaTion, where the first character is ~ to give the operator higher precedence than multiplication (*).
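In Scala, an operator's precedence is determined by its first character, and special characters such as ~ rank above * and /. A minimal sketch of defining such an operator (written here as a plain Scala extension method, independent of ScalaTion's actual implementation):

```scala
// Define '~^' as exponentiation; its first character '~' makes it bind tighter than '*'
extension (x: Double)
  def ~^ (y: Double): Double = math.pow (x, y)

val z = 2.0 * 3.0 ~^ 2.0    // parsed as 2.0 * (3.0 ~^ 2.0) = 18.0
```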
One of the measures is based on absolute errors, Mean Absolute Error (MAE), and is computed as the ℓ1 norm of the error vector divided by the number of elements in the response vector (m). The rest are based on squared values. Various squared ℓ2 norms may be taken to compute these quantities, i.e., sst = y.cnormSq is the centered norm squared of y, while sse = e.normSq is the norm squared of e. Then ssr, the sum of squares model/regression, is the difference. The idea is that one started with the variation in the response, some of which can be accounted for by the model, with the remaining part considered errors. As models are less than perfect, what remains is better referred to as residuals, part of which a better model could account for. The fraction of the total variation accounted for by the model is called the coefficient of determination R2 = ssr/sst ≤ 1. A measure that parallels MAE is the Root Mean Squared Error (RMSE). It is typically higher, as a large squared term has more of an effect. Both are interpretable as they are in the units of the response variable, e.g., imagine one hits a golf ball at 150 mph with an MAE of 7 mph and an RMSE of 10 mph. Further explanations are given in the Prediction Chapter.
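These QoF measures can be sketched in plain Scala (illustrative data; the definitions match those above, with the unweighted case for sst and ssr):

```scala
val y  = Array (150.0, 148.0, 155.0)    // actual responses (e.g., ball speeds in mph)
val yp = Array (152.0, 147.0, 153.0)    // predicted responses
val m  = y.length
val e  = y.indices.map (i => y(i) - yp(i))          // residual/error vector

val mae  = e.map (math.abs).sum / m                 // l1-based measure
val rmse = math.sqrt (e.map (v => v * v).sum / m)   // l2-based, >= mae
val mu   = y.sum / m                                // mean of y
val sst  = y.map (v => (v - mu) * (v - mu)).sum     // centered norm squared
val sse  = e.map (v => v * v).sum                   // norm squared of residuals
val rSq  = (sst - sse) / sst                        // coefficient of determination
```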
Chapter 4
Data Management
4.1 Introduction
Data Science relies on having large amounts of quality data. Collecting data and handling data quality issues
are of utmost importance. Without support from a system or framework, this can be very time-consuming
and error-prone. This chapter provides a quick overview of the support provided by ScalaTion for data
management.
In the era of big data, a variety of database management technologies have been proposed, including
those under the umbrella of Not-only-SQL (NoSQL). These technologies include the following:
• Key-value stores (e.g., Memcached). When the purpose of the data store is very rapid lookup and not
advanced query capabilities, a key-value store may be ideal. They are often implemented as distributed
hash tables.
• Document-oriented databases (e.g., MongoDB). These databases are intended for storage and retrieval
of unstructured (e.g., text) and semi-structured (e.g., XML or JSON) data.
• Columnar databases (e.g., Vertica). Such databases are intended for structured data like traditional
relational databases, but to better facilitate data compression and analytic operations. Data is stored
in columns rather than rows as in traditional relational databases.
• Graph databases (e.g., Neo4j). These make the implicit relationships (via foreign-key, primary-key
pairs) in relational databases explicit. A tuple in a relational database is mapped to a node in a graph
database, while an implicit relationship is mapped to an edge in a graph database. The database then consists of a collection of directed graphs, each consisting of nodes and edges connecting the nodes. These databases are particularly suited to social networks.
The purpose of these database technologies is to provide enhanced performance over traditional, row-oriented
relational databases, and each of the above is best suited to particular types of data.
Data management capabilities provided by ScalaTion include Relational Databases, Columnar Databases
and Graph Databases. All include extensions making them suitable as a Time Series DataBase (TSDB).
Graph databases are discussed in the Appendix.
Preprocessing of data should be done before applying analytics techniques to ensure they are working on
quality data. ScalaTion provides a variety of preprocessing techniques, as discussed in the next chapter.
4.1.1 Analytics Databases
In data science, it is convenient to collect data from multiple sources and store the data in a database.
Analytics databases are organized to support efficient data analytics.
A database supporting data science should make it easy and efficient to view and select data to be fed
into models. The structures supported by the database should make it easy to extract data to create vectors,
matrices and tensors that are used by data science tools and packages.
Multiple systems, including ScalaTion’s TSDB, are built on top of columnar, main memory databases
in order to provide high performance. ScalaTion’s TSDB is a Time Series DataBase that has built-in
capabilities for handling time series data. It is able to store non-time series data as well. It provides multiple
Application Programming Interfaces (APIs) for convenient access to the data [?].
trait Tabular [T <: Tabular [T]] (val name: String, val schema: Schema,
                                  val domain: Domain, val key: Schema)
      extends Serializable:
For convenience, the following two Scala type definitions are utilized.
type Schema = Array [String]
type Domain = Array [Char]
Tabular structures are logically linked together via foreign keys. A foreign key is an attribute that
references a primary key in some table (typically another table). In ScalaTion, the foreign key specification
is added via the following method call after the Tabular structure is created.
def addForeignKey (fkey: String, refTab: T): Unit
ScalaTion supports the following domains/data-types: ’D’ouble, ’I’nt, ’L’ong, ’S’tring, and ’T’imeNum.
'D' - Double  - VectorD - 64 bit double precision floating point number
'I' - Int     - VectorI - 32 bit integer
'L' - Long    - VectorL - 64 bit long integer
'S' - String  - VectorS - variable length numeric string
'T' - TimeNum - VectorT - time numbers for date-time
These data types are generalized into a ValueType as a Scala union type.
type ValueType = (Double | Int | Long | String | TimeNum)
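Scala 3 union types let a single value hold any one of the listed alternatives, with pattern matching used to recover the specific type. A minimal sketch (TimeNum omitted, as it is ScalaTion-specific):

```scala
// A reduced version of ValueType using only standard types
type Value = Double | Int | Long | String

val cell1: Value = 18.0                    // a Double cell
val cell2: Value = "chevrolet chevelle"    // a String cell

// Pattern match to recover the concrete type of a cell
def show (v: Value): String = v match
  case d: Double => d.toString
  case i: Int    => i.toString
  case l: Long   => l.toString
  case s: String => s
```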
4.2 Relational Data Model
A relational database table may be built up as follows: A cell in the table holds an atomic value of type
ValueType. A tuple (or row) in the table is simply an array of ValueType. A relational Table consists of a
bag (or multi-set) of tuples. Each column in the Table is restricted to a particular domain. Note, uniqueness
of primary keys is enforced by creating a primary index.
type Tuple = Array [ValueType]
Using the operator += as an alias for the add method, the following code may be used to populate the Bank database.

customer += ("Peter", "Oak St", "Bogart")
         += ("Paul", "Elm St", "Watkinsville")
         += ("Mary", "Maple St", "Athens")
customer.show ()
deposit.show ()
Fundamental Relational Algebra Operators
The following six relational algebra operators form the fundamental operators for ScalaTion’s table pack-
age and are shown in Table 4.1. They are fundamental in the sense that the rest of the operators, although convenient,
do not increase the power of the query language.
1. Rename Operator. The rename operator will give the table customer the new name client.

customer.ρ ("client")
2. Project Operator. The project operator will return the specified columns in table customer.
3. Select Operator. The select operator will return the rows that match the predicate in table customer.
4. Union Operator. The union operator will return the union of rows from deposit and loan. Duplicate
tuples may be eliminated by creating an index. For this operator the textbook syntax and ScalaTion
syntax are identical.
deposit ∪ loan
deposit ∪ loan
5. Minus Operator. The minus operator will return the rows from account (result of the union) that
are not in loan. For this operator the textbook syntax and ScalaTion syntax are identical.
account − loan
account - loan
6. Cartesian Product Operator. The product operator will return all combinations of rows in customer
with rows in deposit. For this operator the textbook syntax and ScalaTion syntax are identical.
customer × deposit
customer × deposit
Additional Relational Algebra Operators
The next eight operators, although not fundamental, are important operators in ScalaTion’s table package and are shown in Table 4.1.
1. Join Operator. In order to combine information from two tables, join operators are preferred over
products, as they are much more efficient and only combine related rows. ScalaTion’s table package
supports natural-join, equi-join, theta-join, left outer join, and right outer join, as shown below. For
each tuple in the left table, the equi-join pairs it with all tuples in the right table that match it on
the given attributes (in this case customer.bname = deposit.bname). The natural-join is an equi-
join on the common attributes in the two tables, followed by projecting away any duplicate columns.
The theta-join generalizes an equi-join by allowing any comparison operator to be used (in this case
deposit1 .balance < deposit2 .balance). The symbol for semi-join is adopted for outer joins as it is a
Unicode symbol. The left join keeps all tuples from the left (null padding if need be), while the right
join keeps all tuples from the right table.
customer ./ deposit
customer ./ ("cname == cname", deposit)
deposit  ./ ("balance < balance", deposit)
customer ⋉ deposit
customer ⋊ deposit
Additional forms of joins are also available in the Table class. Join is not fundamental as its result
can be made by combining product and select.
2. Divide Operator. For the query below, the divide operator will return the cnames where the customer has a deposit account at all branches (of course, it would make sense to first select on the branches).
The divide operator requires the other attributes (in this case cname) in the left table to be paired up
with all the attribute values (in this case bname) in the right table.
3. Intersect Operator. The intersect operator will return the rows in account that are also in loan.
For this operator the textbook syntax and ScalaTion syntax are identical.
account ∩ loan
account ∩ loan
4. GroupBy Operator. The groupBy operator forms groups among the relation based on the equality
of attributes. The following example groups the tuples in the deposit table based on the value of the
bname attribute.
γbname (deposit)
5. Aggregate Operator. The aggregate operator returns values for the grouped-by attribute (e.g.,
bname) and applies aggregate operators on the specified columns (e.g., avg (balance)). Typically it is
called after the groupBy operator.
deposit F ("bname", (count, "accno"), (avg, "balance"))
6. OrderBy Operator. The orderBy operator effectively puts the rows into ascending order based on
the given attributes.
↑bname (deposit)
7. OrderByDesc Operator. The orderByDesc operator effectively puts the rows into descending order
based on the given attributes.
↓bname (deposit)
8. Select-Project Operator. The selproject is a combination operator added for convenience and efficiency, especially for columnar relational databases (see the next section). As whole columns are stored together, this operator only requires one column to be accessed.
customer.σπ ("ccity", _ == "Athens")
4.2.4 Example Queries
1. List the names of customers who live in the city of Athens.
val liveAthens = customer.σ ("ccity == 'Athens'").π ("cname")
liveAthens.show ()
2. List the names of customers who live in Athens or bank (have deposits in branches located) in Athens.
val bankAthens = (deposit ./ branch).σ ("bcity == 'Athens'").π ("cname")
bankAthens.show ()
3. List the names of customers who live and bank in the same city.
val sameCity = (customer ./ deposit ./ branch).σ ("ccity == bcity").π ("cname")
sameCity.create_index ()
sameCity.show ()
4. List the names and account numbers of customers with the largest balance.
val largest = deposit.π ("cname, accno") -
              (deposit ./ ("balance < balance", deposit)).π ("cname, accno")
largest.show ()
5. List the names of customers who are silver club members (have loans where they have deposits).
val silver = (loan.π ("cname, bname") ∩ deposit.π ("cname, bname")).π ("cname")
silver.create_index ()
silver.show ()
6. List the names of customers who are gold club members (have loans only where they have deposits).
val gold = loan.π ("cname") -
           (loan.π ("cname, bname") - deposit.π ("cname, bname")).π ("cname")
gold.create_index ()
gold.show ()
8. List the names of customers who have deposits at all branches located in Athens.
val allAthens = deposit.π ("cname, bname") / inAthens
allAthens.create_index ()
allAthens.show ()
4.2.5 Persistence
Modern databases do much of their processing in main-memory due to its large size and high speed. Although main-memory built with MRAM may be persistent, typically it is volatile, meaning that if the power is lost, so is the data. It is therefore essential to provide efficient mechanisms for making and maintaining the persistence of data.
Traditional database management systems achieve this by having a persistent data store in non-volatile
storage (e.g., Hard-Disk Drives (HDD) or Solid-State Devices (SSD)) and a large database cache in main-
memory. Complex page management algorithms are used to ensure persistence and transactional correctness
(see the next subsection).
A simple way to provide persistence is to design the database management system to operate in main-
memory and then provide load and save methods that utilize built-in serialization to save to or load from
persistent storage. This is what ScalaTion does.
The load method will read a table with a given name into main-memory using serialization.
@param name  the name of the table to load
The save method will write the entire contents of this table into a file using serialization.
def save (): Unit =
    val oos = new ObjectOutputStream (new FileOutputStream (STORE_DIR + name + SER))
    oos.writeObject (this)
    oos.close ()
end save
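A matching load can be sketched as the inverse of save, using the standard Java deserialization API (this assumes the same STORE_DIR and SER constants as save; it is an illustration, not necessarily ScalaTion's exact code):

```scala
import java.io.{FileInputStream, ObjectInputStream}

// Read a serialized table back into main-memory (counterpart of save)
def load (name: String): Table =
    val ois = new ObjectInputStream (new FileInputStream (STORE_DIR + name + SER))
    val tab = ois.readObject ().asInstanceOf [Table]
    ois.close ()
    tab
end load
```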
For small databases, this approach is fine, but as databases become large, greater efficiency must be sought. One cannot save a whole table every time there is a change. See the exercises for alternatives.
4.2.6 Transactions
The idea of a transaction is to bundle a sequence of operations into a meaningful action that one wants to
succeed, such as transferring money from one bank account to another.
Making the action a transaction has the main benefit of making it atomic: the action either completes successfully (called a commit) or is completely undone, having no effect on the database state (called a rollback). The third option, a partially completed action, would in this case lead to a bank customer losing their money.
Making a transaction atomic can be achieved by maintaining a log. Operations can be written to the log
and then only saved once the transaction commits. If a transaction cannot commit, it must be rolled back.
There must also be a recovery procedure to handle the situation when volatile storage is lost. For this to
function, committed log records must be flushed to persistent storage.
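The commit/rollback mechanics can be illustrated with a toy in-memory undo log (a sketch only; real systems use write-ahead logging with committed records flushed to persistent storage):

```scala
import scala.collection.mutable

// Toy database: before each overwrite, record the old value in an undo log,
// so a rollback can restore the pre-transaction state
class TinyDB:
  private val state   = mutable.Map [String, Double] ()
  private var undoLog = List.empty [(String, Option [Double])]   // newest first

  def write (k: String, v: Double): Unit =
    undoLog = (k, state.get (k)) :: undoLog    // remember old value (or absence)
    state (k) = v

  def read (k: String): Option [Double] = state.get (k)

  def commit (): Unit = undoLog = Nil          // keep effects, discard undo records

  def rollback (): Unit =                      // undo in reverse chronological order
    for (k, old) <- undoLog do old match
      case Some (v) => state (k) = v
      case None     => state.remove (k)
    undoLog = Nil
```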
A second important advantage of making an action a transaction is to protect it from other transactions,
so it can think of itself as if it is running in isolation. Rather than worrying about how other transactions
may corrupt the action, this worry is turned over to the database management system. One form of potential interference involves two transactions running concurrently and accessing the same bank accounts. If one transaction accesses all the accounts first, there will be no corruption. Such an execution of two transactions is called a serial execution (one transaction executes at a time). Unfortunately, modern high-performance database management systems could not operate at the slow speed this would dictate. Transactions must be run concurrently, not serially. The correctness condition called serializability allows
transactions to run with their concurrency controlled by a protocol that ensures their effects on the database are equivalent to one of their slow-running, serially-executing cousin schedules. In other words, the fast-running serializable schedule for a set of transactions must be equivalent to some serial execution of the same set of transactions. See the exercises for more details on equivalence (e.g., conflict and view equivalence) and various concurrency control protocols that can be used to ensure correctness with minimal impact on performance.
class Table (override val name: String, override val schema: Schema,
             override val domain: Domain, override val key: Schema)
      extends Tabular [Table] (name, schema, domain, key)
      with Serializable:
Internally, the Table class maintains a collection of tuples. Using a Bag allows for duplicates, if wanted.
Creating an index on the primary key will efficiently eliminate any duplicates. Foreign key relationships are
specified in linkTypes. It also provides a groupMap used by the groupBy operator.
The Table class supports three types of indices:
1. Primary Index. A unique index on the primary key (may be composite).
private [table] val index = IndexMap [KeyType, Tuple] ()
2. Secondary Unique Indices. A unique index on a single attribute (other than the primary key). For
example, a student id may be used as a primary key for a Student table, while email may also be required
to be unique. Since there can be multiple such indices a Map is used to name each index.
private [table] val sindex = Map [String, IndexMap [ValueType, Tuple]] ()
3. Non-Unique Indices. When fast-lookup is required based on an attribute/column that is not required
to be unique (e.g., name) such an index may be used. Again, since there can be multiple such indices
a Map is used to name each index.
private [table] val mindex = Map [String, MIndexMap [ValueType, Tuple]] ()
The following methods may be used to create the various types of indices: primary unique index, secondary
unique index, or non-unique index, respectively.
def create_index (rebuild: Boolean = false): Unit =
def create_sindex (atr: String): Unit =
def create_mindex (atr: String): Unit =
The following factory method in the companion object provides a more convenient way to create a table.
The strim method splits a string into an array of strings based on a separation character and then trims
away any white-space.
def apply (name: String, schema: String, domain_ : String, key: String): Table =
    new Table (name, strim (schema), strim (domain_).map (_.head), strim (key))
end apply
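A hypothetical usage of this factory method, creating the Bank example's customer table (the attribute names are taken from the queries earlier in this chapter; strim splits each string on commas and trims white-space):

```scala
// Creates a table named "customer" with three String ('S') columns, keyed on cname
val customer = Table ("customer", "cname, street, ccity", "S, S, S", "cname")
```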
The following two classes extend the Table class in the direction of the Graph Data Model, see Appendix C.
case class LTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema)
     extends Table (name_, schema_, domain_, key_)
     with Serializable:
The LTable class (for Linked-Table) simply adds an explicit link from the foreign key to the primary key
that it references. For each tuple in a linked-table, add a link to the referenced table, so that the foreign key
is linked to the primary key. Caveat: LTable does not handle composite foreign keys. Although in general
primary keys may be composite, a foreign key is conceptualized as a column value and its associated link.
@param fkey    the foreign key column
@param refTab  the referenced table being linked to
The LTable class makes many-to-one relationships/associations explicit and improves the efficiency of
the most common form of join operation which is based on equating a foreign key (fkey) to a primary key
(pkey). Without an index, these are performed using a Nested-Loop Join algorithm. The existence of an index
on the primary key allows a much more efficient Indexed Join algorithm to be utilized. The direct linkage
provides for additional speed up of such join operations (see the exercises for a comparison). Note that the
linkage is only in one direction, so joining from the primary key table to the foreign key table would require
a non-unique index on the foreign key column, or resorting to a slow nested loop join.
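The difference between the two join algorithms can be sketched in plain Scala (hypothetical row types; a hash map plays the role of the primary index):

```scala
case class Dep (accno: Int, bname: String)      // row holding a foreign key (bname)
case class Br  (bname: String, bcity: String)   // row holding a primary key (bname)

// Nested-Loop Join: compare every pair of rows -- O(|left| * |right|)
def nestedLoopJoin (ds: Seq [Dep], bs: Seq [Br]): Seq [(Dep, Br)] =
  for d <- ds; b <- bs if d.bname == b.bname yield (d, b)

// Indexed Join: probe a primary-key index per left row -- O(|left|) probes
def indexedJoin (ds: Seq [Dep], bs: Seq [Br]): Seq [(Dep, Br)] =
  val index = bs.map (b => b.bname -> b).toMap      // primary-key index
  for d <- ds; b <- index.get (d.bname) yield (d, b)
```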
Note, the link and foreign key value are in some sense redundant. Removing the foreign key column is
possible, but may force the need for an additional join for some queries, so the database designer may wish
to keep the foreign key column. ScalaTion leaves this issue up to the database designer.
The next class moves further in the direction of the Graph Data Model.
4.2.9 VTable Class
@param domain_  the domains/data-types for attributes ('D', 'I', 'L', 'S', 'X', 'T')
@param key_     the attributes forming the primary key
case class VTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema)
     extends Table (name_, schema_, domain_, key_)
     with Serializable:
The VTable class (for Vertex-Table) supports many-to-many relationships with efficient navigation in
both directions. Supporting this is much more complicated than what is needed for LTable, but provides for index-free adjacency, similar to what is provided by Graph Database systems.
The VTable model is graph-like in that it elevates tuples into vertices as first-class citizens of the data
model. However, edges are embedded inside of vertices and are there to establish adjacency. Edges do not
have labels, attributes or properties. Although this simplifies the data model and makes it more relation-like,
it is not set up to naturally support finding, for example, shortest paths.
The Vertex class extends the notion of Tuple into values stored in the tuple part, along with foreign key links captured as outgoing edges.

@param tuple  the tuple part of a vertex

end Vertex
For data models where edges become first-class citizens, see the Appendix on Graph Data Models.
4.3 Columnar Relational Data Model
Of the NoSQL database management systems, columnar databases are closest to traditional relational
databases. Rather than tuples/rows taking center stage, columns/vectors take center stage.
A columnar database is made up of the following components:
• Element - a value from a given Domain or Datatype (e.g., Int, Long, Double, Rational, Real, Complex,
String, TimeNum)
• Column/Vector - a collection of values from the same Datatype (e.g., forming VectorI, VectorL,
VectorD, VectorQ, VectorR, VectorC, VectorS, VectorT)
Table 4.2 shows the first 10 rows (out of 392) for the well-known Auto MPG dataset (see https://
archive.ics.uci.edu/ml/datasets/Auto+MPG).
Table 4.2: Example Columnar Relation: First 10 Rows of Auto MPG Dataset
mpg cylinders displacement horsepower weight acceleration model year origin car name
Double Int Double Double Double Double Int Int String
18.0 8 307.0 130.0 3504.0 12.0 70 1 ”chevrolet chevelle”
15.0 8 350.0 165.0 3693.0 11.5 70 1 ”buick skylark 320”
18.0 8 318.0 150.0 3436.0 11.0 70 1 ”plymouth satellite”
16.0 8 304.0 150.0 3433.0 12.0 70 1 ”amc rebel sst”
17.0 8 302.0 140.0 3449.0 10.5 70 1 ”ford torino”
15.0 8 429.0 198.0 4341.0 10.0 70 1 ”ford galaxie 500”
14.0 8 454.0 220.0 4354.0 9.0 70 1 ”chevrolet impala”
14.0 8 440.0 215.0 4312.0 8.5 70 1 ”plymouth fury iii”
14.0 8 455.0 225.0 4425.0 10.0 70 1 ”pontiac catalina”
15.0 8 390.0 190.0 3850.0 8.5 70 1 ”amc ambassador dpl”
Since each column is stored as a vector, they can be readily compressed. Due to the high repetition in the
cylinders column it can be effectively compressed using Run Length Encoding (RLE) compression. In
addition, a column can be efficiently extracted since it is already stored as a vector in the database. These
vectors can be used in aggregate operators or passed into analytic models.
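Run Length Encoding can be sketched generically in plain Scala; a column like cylinders (8, 8, 8, 8, 8, ..., 4, 4, ...) collapses into a few (value, runLength) pairs:

```scala
// Compress a column into (value, run-length) pairs
def rle [T] (col: Seq [T]): Vector [(T, Int)] =
  col.foldLeft (Vector.empty [(T, Int)]) { (acc, x) =>
    acc.lastOption match
      case Some ((v, n)) if v == x => acc.init :+ ((v, n + 1))   // extend current run
      case _                       => acc :+ ((x, 1))            // start a new run
  }

rle (Seq (8, 8, 8, 8, 8, 4, 4, 6))    // Vector((8,5), (4,2), (6,1))
```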
Data files in various formats (e.g., comma separated values (csv)) can be loaded into the database.
val auto_mpg = Relation ("auto_mpg", "auto_mpg.csv")
It is easy to create a Multiple Linear Regression model for this dataset. Simply pick the response column,
in this case mpg, and the predictor columns, in this case all other columns besides car name. The connection
between car name and mpg is coincidental. The response column/variable goes into a vector.
val y = auto_mpg.toVectorD (0)
val x = auto_mpg.toMatrixD (1 to 7)
Then the matrix x and vector y can be passed into a Regression model constructor.
val rg = new Regression (x, y)
See the next chapter for how to train a model, evaluate the quality of fit and make predictions.
The first API is a Columnar Relational Algebra that includes the standard operators of relational algebra
plus those common to column-oriented databases. It consists of the Table trait and two implementing classes:
Relation and MM Relation. Persistence for Relation is provided by the save method, while MM Relation
utilizes memory-mapped files.
The name of the first relation is “sensor” and it stores information about traffic sensors.
• The fourth argument is the column number for the primary key (key),
• The fifth argument, “ISDDI”, indicates the domains (domain) for the attributes (Integer, String,
Double, Double, Integer).
• The sixth and optional argument can be used to define foreign keys (fKeys).
• The seventh and optional argument indicates whether to enter that relation into the system Catalog.
The second relation road stores the Id, name, beginning and ending latitude-longitude coordinates.
The third relation mroad is for multi-lane roads.
The fourth relation traffic stores the data collected from traffic sensors. The primary key in this case
is composite, Seq (0, 1), as both the time and the sensorId are required for unique identification.
The fifth relation wsensor stores information about weather sensors.
Finally, the sixth relation weather stores data collected from the weather sensors.
Select Operator
The select operator will return the rows that match the predicate, in this case rdName == “I285”.
Project Operator
The project operator will return the specified columns, in this case rdName, lat1, long1.
Union Operator
The union operator will return the rows from r and s with no duplicates. For this operator the textbook
syntax and column db syntax are identical.
r∪s
Minus Operator
The minus operator will return the rows from r that are not in s. For this operator the textbook syntax and
column db syntax are identical.
r−s
Cartesian Product Operator
The product operator will return all combinations of rows in r with rows in s. For this operator the textbook syntax and column db syntax are identical.
r×s
Rename Operator
The rename operator will rename relation r as r2.
r.ρ(“r2”)
The above six operators form the fundamental operators for ScalaTion’s column db package and are
shown as the first group in Table 4.3.
Table 4.3: Columnar Relational Algebra (r = road, s = sensor, t = traffic, q = mroad, w = weather)
The next seven operators, although not fundamental, are important operators in ScalaTion’s column db
package and are shown as the second group in Table 4.3.
Join Operators
In order to combine information from two relations, join operators are preferred over products, as they are much more efficient and only combine related rows. ScalaTion’s column db package supports natural-
join, equi-join, general theta join, left outer join, and right outer join, as shown below.
r ⋈ s                                       natural join
r ⋈ ("roadId", "roadId", s)                 equi-join
r ⋈ [Int] (s, ("roadId", "roadId", ==))     theta join
t ⋉ ("time", "time", w)                     left outer join
t ⋊ ("time", "time", w)                     right outer join
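The difference between an equi-join and a left outer join can be sketched with plain Scala collections (illustrating the semantics only; ScalaTion's operators work on indexed relation columns):

```scala
// traffic rows (time, count) and weather readings keyed by time
val traffic = Seq ((8, 40.0), (9, 55.0), (10, 60.0))
val weather = Map (8 -> "rain", 9 -> "clear")

// equi-join: keep only rows whose time appears in both relations
def equiJoin: Seq [(Int, Double, String)] =
    for (t, c) <- traffic if weather contains t yield (t, c, weather (t))

// left outer join: keep every traffic row, padding with null when no match
def leftOuterJoin: Seq [(Int, Double, String)] =
    for (t, c) <- traffic yield (t, c, weather.getOrElse (t, null))
```

The equi-join drops the unmatched row at time 10, while the left outer join keeps it with a null in the weather column.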
Intersect Operator
The intersect operator will return the rows in r that are also in s. For this operator the textbook syntax
and column db syntax are identical.
r ∩ s
GroupBy Operator
The groupBy operator forms groups within the relation based on the equality of attributes. The following
example groups traffic data based on the value of the "sensorId" attribute.
t.γ("sensorId")
EProject Operator
The extended projection operator eproject applies aggregate operators on aggregation columns (first argu-
ments) and regular project on the other columns (second arguments). Typically it is called after the groupBy
operator.
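The groupBy followed by aggregation can be sketched with plain Scala collections (illustrative only; the names below are not ScalaTion's API):

```scala
// traffic rows: (sensorId, count)
val traffic = Seq ((101, 40.0), (101, 60.0), (102, 10.0), (102, 30.0))

// group rows by sensorId (like γ), then aggregate each group's counts with avg (like eproject)
val avgCount =
    traffic.groupBy (_._1)
           .map ((sid, rows) => (sid, rows.map (_._2).sum / rows.size))
           .toSeq.sortBy (_._1)
```

Each group becomes one output row: the grouping attribute plus the aggregate computed over the group.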
OrderBy Operator
The orderBy operator effectively puts the rows into ascending (descending) order based on the given at-
tributes.
t.ω(“sensorId”)
Compress Operator
The compress operator will compress the given columns of the relation.
t.ζ(“count”)
Uncompress Operator
The uncompress operator will uncompress the given columns of the relation.
t.Z(“count”)
2. Retrieve the automobile mileage data for cars with 8 cylinders, returning the car name and mpg.
3. Retrieve traffic data within a 100 kilometer-grid from the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).
Class Methods:

@param name     the name of the relation
@param colName  the names of columns
@param col      the Scala Vector of columns making up the columnar relation
@param key      the column number for the primary key (< 0 => no primary key)
@param domain   an optional string indicating domains for columns (e.g., 'SD' = 'String', 'Double')
@param fKeys    an optional sequence of foreign keys -
                Seq (column name, ref table name, ref column position)
@param enter    whether to enter the newly created relation into the `Catalog`

class Relation (val name: String, val colName: Seq [String], var col: Vector [Vec] = null,
                val key: Int = 0, val domain: String = null,
                var fKeys: Seq [(String, String, Int)] = null, enter: Boolean = true)
      extends Table with Error with Serializable
4.4 SQL-Like Language
The SQL-Like API in ScalaTion provides many of the language constructs of SQL in a functional style.
1. Retrieve the vehicle traffic counts over time from all sensors on the road with Id = 101.
(traffic join sensor).where [Int] ("roadId", _ == 101)
                     .select ("sensorId", "time", "count")
2. Retrieve the vehicle traffic counts averaged over time from all sensors on the road with Id = 101.
(traffic join sensor).where [Int] ("roadId", _ == 101)
                     .groupBy ("sensorId")
                     .eselect ((avg, "acount", "count")) ("sensorId")
4.4.3 RelationSQL Class
Class Methods:

@param name     the name of the relation
@param colName  the names of columns
@param col      the Scala Vector of columns making up the columnar relation
@param key      the column number for the primary key (< 0 => no primary key)
@param domain   an optional string indicating domains for columns (e.g., 'SD' = 'String', 'Double')
@param fKeys    an optional sequence of foreign keys -
                Seq (column name, ref table name, ref column position)

class RelationSQL (name: String, colName: Seq [String], col: Vector [Vec],
                   key: Int = 0, domain: String = null,
                   fKeys: Seq [(String, String, Int)] = null)
      extends Tabular with Serializable
11
def toVectorL (colName: String): VectorL = r.toVectorL (colName)
def toVectorS (colPos: Int = 0): VectorS = r.toVectorS (colPos)
def toVectorS (colName: String): VectorS = r.toVectorS (colName)
def toVectorT (colPos: Int = 0): VectorT = r.toVectorT (colPos)
def toVectorT (colName: String): VectorT = r.toVectorT (colName)
def show (limit: Int = Int.MaxValue): Unit = r.show (limit)
def save (): Unit = r.save ()
def generateIndex (reset: Boolean = false): Unit = r.generateIndex (reset)
4.5 Exercises
1. Use Scala 3 to complete the implementation of the following ScalaTion data models: Table, LTable,
and VTable in the scalation.table package. A group will work on one of the data models. See Appendix
C for two more data models: GTable and PGraph.
• Test all types of non-unique indices (MIndexMap). Use the import scheme shown in the beginning
of Table.scala.
• Add use of indexing to speed up as many operations as possible.
• Speed up joins by using Unique Indices and Non-Unique Indices.
• Use index-free adjacency when possible for further speed-up.
• Make the save operation efficient, by only serializing tuples/vertices that have changed since the
last load. One way to approach this would be to maintain a map in persistent storage,
Map [KeyType, (TimeNum, Tuple)]
where the key for a tuple/vertex may be used to check the timestamp of a tuple/vertex. Unless
the timestamp of the volatile tuple/vertex is larger, there is no need to save it. Further speed
improvement may be obtained by switching from Java’s text-based serialization to Kryo’s binary
serialization.
4. Create the sensor schema using the RelationSQL class in the columnar db package.
7. Retrieve traffic data within a 100 kilometer-grid from the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).
Formulate a relation algebra expression to list the names of the professors of courses taken by Peter.
Chapter 5
Data Preprocessing
5.2 Methods for Outlier Detection
Data points that are considered outliers may happen because of errors or highly unusual occurrences. For
example, suppose a dataset records the times for members of a football team to run a 100-yard dash and
one of the recorded values is 3.2 seconds. This is an outlier. Some analytics techniques are less sensitive to
outliers, e.g., ℓ1 Regression, while others, e.g., ℓ2 Regression, are more sensitive. Detection of outliers suffers
from the obvious problems of being too strict (in which case good data may be thrown away) or too lenient
(in which case outliers are passed to an analytics technique). One may choose to handle outliers separately,
or turn them into missing values, so that both outliers and missing values may be handled together.
If measured values for a random variable xj are approximately Normally distributed and are several standard
deviation units away from the center (µxj ), they are rare events. Depending on the situation, this may be
important information to examine, but may often indicate incorrect measurement. Table 5.1 shows how
unlikely it is to obtain data points in distant tails of a Normal distribution. The standard way to detect
outliers using the standard deviation method is to examine points beyond three standard deviation (σxj )
units for being outliers. This is also called the z-score method as xj needs to be transformed to zj that
follows the Standard Normal distribution.
zj = (xj − µxj ) / σxj    (5.1)
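A minimal z-score outlier check can be written directly from Equation 5.1 (plain Scala standing in for ScalaTion's DistanceOutlier; the function name is illustrative):

```scala
// flag values more than cap standard deviation units from the mean (z-score method)
def zOutliers (x: Seq [Double], cap: Double = 2.7): Seq [Double] =
    val mu  = x.sum / x.size                                            // sample mean
    val sig = math.sqrt (x.map (v => (v - mu) * (v - mu)).sum / (x.size - 1))  // sample sd
    x.filter (v => math.abs ((v - mu) / sig) > cap)                     // |z| beyond cap

// 100-yard dash times with one impossible value (3.2 seconds)
val times = Seq (11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 3.2)
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason the method needs enough "normal" data points to work well.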
[Figure: pdf for the Standard Normal Distribution, fzj (z) for z ∈ [−4, 4]]
The InterQuartile Range (IQR) method instead flags a data point xj as an outlier when it falls outside the
middle quartiles expanded by a scale factor δ:

xj ∉ [ Q.25 [xj ] − δ · IQR, Q.75 [xj ] + δ · IQR ]    (5.2)
For the Normal distribution case, when the scale factor δ = 1.5, it corresponds to 2.69792 standard deviation
units and at 2.0 it corresponds to 3.3724 standard deviation units (see the exercises). The advantage of
this method over the previous one, is that it can work when the data points are not approximately Normal.
This includes the cases where the distribution is not symmetric (a problematic situation for the previous
method). A weakness of the IQR method occurs when data are concentrated near the median, resulting in
an IQR that is in some sense too small to be useful.
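Equation 5.2 can be sketched directly (plain Scala standing in for ScalaTion's QuartileXOutlier; the crude index-based quartiles are a simplifying assumption):

```scala
// flag values outside [Q1 - delta*IQR, Q3 + delta*IQR]
def iqrOutliers (x: Seq [Double], delta: Double = 1.5): Seq [Double] =
    val s   = x.sorted
    val q1  = s ((s.size - 1) / 4)           // crude lower quartile (index-based)
    val q3  = s (3 * (s.size - 1) / 4)       // crude upper quartile (index-based)
    val iqr = q3 - q1
    x.filter (v => v < q1 - delta * iqr || v > q3 + delta * iqr)

val times = Seq (11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 3.2)
```

Unlike the z-score method, the quartiles here are barely moved by the single extreme value, which is why the IQR method is more robust for skewed or contaminated data.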
Use of Box-Plots provides visual support for looking for outliers. The IQR is shown as a box, with whiskers
extending δ · IQR units beyond the box in both directions and with indications of the locations of extreme
data points beyond the whiskers.
ScalaTion supports three outlier detection methods:
• Standard Deviation Method: data points too many standard deviation units (typically 2.5 to 3.5,
defaults to 2.7) away from the mean, DistanceOutlier;
• InterQuartile Range Method: data points a scale factor/expansion multiplier (typically 1.5 to 2.0,
defaults to 1.5) times the IQR beyond the middle two quartiles, QuartileXOutlier; and
• Quantiles/Percentile Method: data points in the extreme percentages (typically 0.7 to 10 percent,
defaults to 0.7), i.e., having the smallest or largest values, QuantileOutlier.
Note: These defaults put these three outlier detection methods in alignment when data points are approx-
imately Normally distributed.
The following function will turn outliers into missing values, by reassigning the outliers to noDouble,
ScalaTion's indicator of a missing value of type Double.
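A sketch of such a function is given below (plain Scala, with Double.NaN standing in for ScalaTion's noDouble and the z-score method used as the detector; the function name is illustrative):

```scala
val noDouble = Double.NaN                    // stand-in for ScalaTion's missing-value code

// reassign values more than cap standard deviation units from the mean to noDouble
def outliersToMissing (x: Seq [Double], cap: Double = 2.7): Seq [Double] =
    val mu  = x.sum / x.size
    val sig = math.sqrt (x.map (v => (v - mu) * (v - mu)).sum / (x.size - 1))
    x.map (v => if math.abs ((v - mu) / sig) > cap then noDouble else v)

val cleaned = outliersToMissing (
    Seq (11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0, 3.2))
```

The outliers are not removed; they are marked as missing, so that a single imputation pass can later handle both outliers and genuinely missing values.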
An alternative to eliminating outliers during data preprocessing, is to eliminate them during modeling
by looking for extreme residuals. In addition to looking at the magnitude of a residual εi , some argue only
to remove data points that also have high influence on the model's parameters/coefficients, using techniques
such as DFFITS, Cook’s Distance, or DFBETAS [34].
5.3 Imputation Techniques
The two main ways to handle missing values are (1) throw them away, or (2) use imputation to replace them
with reasonable guesses. When there is a gap in time series data, imputation may be used for short gaps,
but is unlikely to be useful for long gaps. This is especially true when imputation techniques are simple. The
alternative could be to use an advanced modeling technique like SARIMA for imputation, but then results
of a modeling study using SARIMA are likely to be biased. Imputation implementations are based on the
Imputation trait in the scalation.modeling package.
Trait Methods:
trait Imputation
2. object ImputeForward extends Imputation: Use the previous value and slope to estimate the next
missing value.
3. object ImputeBackward extends Imputation: Use the subsequent value and slope to estimate the
previous missing value.
4. object ImputeMean extends Imputation: Use the filtered mean to estimate the next missing value.
5. object ImputeMovingAvg extends Imputation: Use the moving-average of the last ’dist’ values to
estimate the next missing value.
6. object ImputeNormal extends Imputation: Use the median of three Normally distributed, based
on filtered mean and variance, random values to estimate the next missing value.
7. object ImputeNormalWin extends Imputation: Same as ImputeNormal except mean and variance
are recomputed over a sliding window.
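The flavor of these imputation objects can be sketched in a few lines of plain Scala (a simplified stand-in for the Imputation trait, with NaN marking missing values; note that the real ImputeForward also uses the slope, which is omitted here):

```scala
def isMissing (v: Double): Boolean = v.isNaN

// forward imputation (simplified): replace a missing value with the last observed value
def imputeForward (x: Seq [Double]): Seq [Double] =
    x.scanLeft (Double.NaN) ((prev, v) => if isMissing (v) then prev else v).drop (1)

// mean imputation: replace missing values with the mean of the observed values
def imputeMean (x: Seq [Double]): Seq [Double] =
    val obs = x.filterNot (isMissing)
    val mu  = obs.sum / obs.size
    x.map (v => if isMissing (v) then mu else v)
```

Forward imputation respects local trends in a time series, while mean imputation ignores time order entirely; this is why the simple methods degrade quickly on long gaps.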
5.4 Align Multiple Time Series
When the data include multiple time series, there are likely to be time alignment problems. The frequency
and/or phase may not be in agreement. For example, traffic count data may be recorded every 15 minutes
and phased on the hour, while weather precipitation data may be collected every 30 minutes and phased to
10 minutes past the hour.
ScalaTion supports the following alignments techniques: (1) approximate left outer join and (2) dy-
namic time warping. The first operator will perform a left outer join between two relations based on their
time (TimeNum) columns. Rather than the usual matching based on equality, approximately equal times are
considered sufficient for alignment. For example, to align traffic data with the weather data, the following
approximate left outer join may be used.
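The idea of joining on approximately equal times can be sketched as follows (plain Scala; ScalaTion works with TimeNum columns, and the tolerance parameter here is an assumed simplification):

```scala
// left outer join traffic with weather, matching times within tol minutes
def approxLeftJoin (t: Seq [(Int, Double)], w: Seq [(Int, Double)], tol: Int = 10):
        Seq [(Int, Double, Option [Double])] =
    for (tt, cnt) <- t yield
        val near = w.find (p => math.abs (p._1 - tt) <= tol)   // first time within tol
        (tt, cnt, near.map (_._2))                             // None when no match

val traffic = Seq ((0, 40.0), (15, 55.0), (30, 60.0), (45, 50.0))  // every 15 min on the hour
val weather = Seq ((10, 0.1), (40, 0.3))                           // every 30 min, 10 past
```

Every traffic row is kept (left outer join semantics); each picks up the weather reading whose timestamp is closest within the tolerance.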
5.5 Creating Vectors and Matrices
Once the data have been preprocessed, columns may be projected out to create a matrix that may be passed
to analytics/modeling techniques.
This matrix may then be passed into multiple modeling techniques, e.g., (1) a Multiple Linear Regression
model or (2) an Auto-Regressive, Integrated, Moving-Average (ARIMA) model.
By default in ScalaTion the rightmost columns are the response/output variables. As many of the
modeling techniques have a single response variable, it will be assumed to be in the last column. There are also
constructors and factory apply functions that take explicit vector and matrix parameters, e.g., a matrix of
predictor variables and a response vector.
5.6 Exercises
1. Assume random variable xj is distributed N (µ, σ).
(a) Show that when the scale factor δ = 1.5, the InterQuartile Range method corresponds to the
Standard Deviation method at 2.69792 standard deviation units.
(b) Show that when the scale factor δ = 2.0, the InterQuartile Range method corresponds to the
Standard Deviation method at 3.3724 standard deviation units.
(c) What should the scale factor δ need to be to correspond to 3 standard deviation units?
2. Randomly generate 10,000 data points from the Standard Normal distribution. Count how many of
these data points are considered as outliers for
(a) the Standard Deviation method set at 3.3724 standard deviation units, and
(b) the InterQuartile Range method with δ = 2.0.
(c) the Quantile/Percentile method set at what? percent.
3. Load the auto_mpg.csv dataset into an auto_mpg relation. Perform the preprocessing steps above to
create a cleaned-up relation auto_mpg2 and produce a data matrix called auto_mat from this relation.
Print out the correlation matrix for auto_mat. Which columns have the highest correlation? To predict
the miles per gallon mpg, which columns are likely to be the best predictors?
4. Find a dataset at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php)
and carry out the same steps.
Part II
Modeling
Chapter 6
Prediction
As the name predictive analytics indicates, the purpose of techniques that fall in this category is to develop
models to predict outcomes. For example, the distance a golf ball travels y when hit by a driver depends
on several factors or inputs x such as club head speed, barometric pressure, and smash factor (how square
the impact is). The models can be developed using a combination of data (e.g., from experiments) and
knowledge (e.g., Newton’s Second Law). The modeling techniques discussed in this technical report tend
to emphasize the use of data more than knowledge, while those in the simulation modeling technical report
emphasize knowledge.
Abstractly, a predictive model can generally be formulated using a prediction function f as follows:
y = f (x, t; b) + ε    (6.1)
where
• y is a response/output scalar,
• x is a predictor/input vector,
• t is time (when the model is time-dependent),
• b is a parameter vector, and
• ε is the residual/error term.
Both the response y and the residuals/errors ε are treated as random variables, while the predictor/feature
variables x may be treated as either random or deterministic depending on context. Depending on the goals
of the study as well as whether the data are the product of controlled/designed experiments, the random or
deterministic view may be more suitable.
The parameters b can be adjusted so that the predictive model matches the available data. Note,
in the definition of a function, the arguments appear before the “;”, while the parameters appear after.
The residuals/errors are typically additive as shown above, but may also be multiplicative. Of course, the
formulation could be generalized by turning the output/response into a vector y and the parameters into a
matrix B.
When a model is time-independent or time can be treated as just another dimension within the x vectors,
prediction functions can be represented as follows:
y = f (x; b) + ε    (6.2)
Another way to look at such models is that we are trying to estimate the conditional expectation of y given
x.

y = E [y|x] + ε
ε = y − f (x; b)
Given a dataset (m instances of data), each instance contributes to an overall residual/error vector ε.
One of the simpler ways to estimate the parameters b is to minimize the size of the residual/error vector,
e.g., its Euclidean norm. The square of this norm is the sum of squared errors, sse = ε · ε.
This corresponds to minimizing the raw mean square error (mse = sse/m). See the section on Generalized
Linear Models for further development along these lines.
In ScalaTion, data are passed to the train function to train the model/fit the parameters b. In the
case of prediction, the predict function is used to predict values for the scalar response y.
A key question to address is the possible functional forms that f may take, such as the importance of
time, the linearity of the function, the domains for y and x, etc. We consider several cases in the subsections
below.
6.1 Predictor
In ScalaTion, the Predictor trait provides a common framework for several predictor classes such as
SimpleRegression or Regression. All of the modeling techniques discussed in this chapter extend the
Predictor trait. They also extend the Fit trait to enable Quality of Fit (QoF) evaluation. (Unlike classes,
traits support multiple inheritance).
Many modeling techniques utilize several predictor/input variables to predict a value for a response/out-
put variable, e.g., given values for [x0 , x1 , x2 ] predict a value for y. The datasets fed into such modeling
techniques will collect multiple instances of the predictor variables into a matrix x and multiple instances of
the response variable into a vector y. The Predictor trait takes datasets of this form.
Trait Methods:
@param x       the input/data m-by-n matrix
               (augment with a first column of ones to include intercept in model)
@param y       the response/output m-vector
@param fname   the feature/variable names (if null, use x_j)
@param hparam  the hyper-parameters for the model
def stepRegressionAll (idx_q: Int = QoF.rSqBar.ordinal, cross: Boolean = true):
    (LinkedHashSet [Int], MatrixD)
The Predictor trait extends the Model trait (see the end of the Probability chapter) and has the following
methods:
1. The getX method returns the actual data/input matrix used by the model. Some complex models
expand the columns in an initial data matrix to add for example quadratic or cross terms.
2. The getY method returns the actual response/output vector used by the model. Some complex models
transform the initial response vector.
3. The getFname method returns the names of predictor variable/features, both given and extended.
5. The train method takes the dataset passed into the model (either the full dataset or a training-data)
and optimizes the model parameters b.
6. The train2 method takes the dataset passed into the model (either the full dataset or a training
dataset) and optimizes the model parameters b. It also optimizes the hyper-parameters.
7. The test method evaluates the Quality of Fit (QoF) either on the full dataset or a designated test-data
using the diagnose method.
8. The trainNtest method trains on the training-set and evaluates on the test-set.
9. The predict method takes a data vector (e.g., a new data instance) and predicts its response. Another
predict method takes a matrix as input (with each row being an instance) and makes predictions for
each row.
10. The hparameter method returns the hyper-parameters for the model. Many simple models have none,
but more sophisticated modeling techniques such as RidgeRegression and LassoRegression have
them (e.g., a shrinkage hyper-parameter).
11. The parameter method returns the estimated parameters for the model.
12. The residual method returns the difference between the actual and predicted response vectors. The
residual indicates what the model has left to explain/account for (e.g., an ideal model will only leave
the noise in the data unaccounted for).
13. The buildModel method builds a sub-model that is restricted to given columns of the data matrix.
This method is called by the following feature selection methods.
14. The selectFeatures method makes it easy to switch between forward, backward and stepwise feature
selection.
15. The forwardSel method is used for forward selection of variables/features for inclusion into the model.
At each step the variable that increases the predictive power of the model the most is selected. This
method is called repeatedly in forwardSelAll to find the "best" combination of features. It is not
guaranteed to find the optimal combination.
16. The importance method is used to indicate the relative importance of the features/variables.
17. The backwardElim method is used for backward elimination of variables/features from the model. At
each step the variable that contributes the least to the predictive power of the model is eliminated. This
method is called repeatedly in backwardElimAll to find the "best" combination of features. It is not
guaranteed to find the optimal combination.
18. The stepRegressionAll method decides to add or remove a variable/feature based on whichever leads
to the greater improvement. It continues until there is no further improvement. A swap operation may
yield a better combination of features.
19. The vif method returns the Variance Inflation Factors (VIFs) for each of the columns in the data/input
matrix. High VIF scores may indicate multi-collinearity.
21. The validate method divides a dataset into a training-set and a test-set, trains on one and tests on
the other to determine out-of-sample Quality of Fit (QoF).
22. The crossValidate method implements k-fold cross-validation, where a dataset is divided into a
training-set and a test-set. The training-set is used by the train method, while the test-set is used by
the test method. The crossValidate method is similar to validate, but more extensive in that it
repeats this process k times and makes sure all the data ends up in one of the k test-sets.
6.2 Quality of Fit for Prediction
The related Fit trait provides a common framework for computing Quality of Fit (QoF) measures. The
dataset for many models comes in the form of an m-by-n data matrix X and an m response vector y. After
the parameters b (an n vector) have been fit/estimated, the error vector may be calculated. The basic
QoF measures involve taking either `1 (Manhattan) or `2 (Euclidean) norms of the error vector as indicated
in Table 6.1.
Typically, if a model has m instances/rows in the dataset and n parameters to fit, the error vector will live
in an m − n dimensional space (ignoring issues related to the rank of the data matrix). Note, if n = m, there
may be a unique solution for the parameter vector b, in which case ε = 0, i.e., the error vector lives in a
0-dimensional space. The Degrees of Freedom (for error) is the dimensionality of the space that the error
vector lives in, namely, df = m − n.
Trait Methods:
@param dfm  the degrees of freedom for model/regression
@param df   the degrees of freedom for error
3
For modeling, a user chooses one of the classes (directly or indirectly) extending the trait Predictor
(e.g., Regression) to instantiate an object. Next the train method would be typically called, followed
by the test method, which computes the residual/error vector and calls the diagnose method. Then the
fitMap method would be called to return quality of fit statistics computed by the diagnose method. The
quality of fit measures computed by the diagnose method in the Fit class are shown below.
@param y   the actual response/output vector to use (test/full)
@param yp  the predicted response/output vector (test/full)
@param w   the weights on the instances (defaults to null)
4
One may look at the sum of squared errors (sse) as an indicator of model quality.
sse = ε · ε    (6.4)
In particular, sse can be compared to the sum of squares total (sst), which measures the total variability of
the response y,
sst = ‖y − µy ‖² = y · y − m µy² = y · y − (1/m) (Σi yi )²    (6.5)
while the sum of squares regression (ssr = sst − sse) measures the variability captured by the model, so the
coefficient of determination measures the fraction of the variability captured by the model.
R² = ssr/sst = 1 − sse/sst ≤ 1    (6.6)
Values for R2 would be non-negative, unless the proposed model is so bad (worse than the Null Model that
simply predicts the mean) that the proposed model actually adds variability.
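These measures are easy to compute directly from Equations 6.4-6.6 (plain Scala; the function name is illustrative, not ScalaTion's Fit API):

```scala
// Quality of Fit: compute R^2 = 1 - sse/sst from actual y and predicted yp
def rSq (y: Seq [Double], yp: Seq [Double]): Double =
    val mu  = y.sum / y.size                                          // mean of response
    val sse = y.zip (yp).map ((yi, ypi) => (yi - ypi) * (yi - ypi)).sum  // sum sq. errors
    val sst = y.map (yi => (yi - mu) * (yi - mu)).sum                 // sum sq. total
    1.0 - sse / sst
```

A perfect fit gives R² = 1, predicting the mean gives R² = 0, and a model worse than the mean gives a negative value.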
6.3 Null Model
The NullModel class implements the simplest type of predictive modeling technique. If all else fails it may
be reasonable to simply guess that y will take on its expected value or mean.
y = E [y] + ε    (6.7)
This could happen if the predictors x are not relevant, not collected in a useful range or the relationship is
too complex for the modeling techniques you have applied.
y = b0 + ε    (6.8)
This intercept-only model is just a constant term plus the error/residual term.
6.3.2 Training
The training dataset in this case only consists of a response vector y. The error vector ε in this case is

ε = y − ŷ = y − b0 1    (6.9)
For Least Squares Estimation (LSE), the loss function L(b) can be set to half the sum of squared errors.
L(b) = ½ sse = ½ ‖ε‖² = ½ ε · ε    (6.10)
Substituting for ε gives

L(b) = ½ (y − b0 1) · (y − b0 1)    (6.11)
(f · g)′ = f ′ · g + f · g′
(f · f )′ = 2 f ′ · f

Dividing by 2 gives,

½ (f · f )′ = f ′ · f    (6.12)
Taking the derivative w.r.t. b0 , dL/db0 , using the derivative product rule and setting it equal to zero yields
the following equation.
dL/db0 = −1 · (y − b0 1) = 0
Therefore, the optimal value for the parameter b0 is
b0 = (1 · y)/(1 · 1) = (1 · y)/m = µy    (6.13)
This shows that the optimal value for the parameter is the mean of the response vector.
In ScalaTion this requires just one line of code inside the train method.
def train (x_null: MatrixD = null, y_ : VectorD = y): Unit =
    b = VectorD (y_.mean)                              // parameter vector [b0]
end train
After values for the model parameters are determined, it is important to assess the Quality of Fit (QoF).
The test method will compute the residual/error vector and then call the diagnose method.
def test (x_null: MatrixD = null, y_ : VectorD = y): (VectorD, VectorD) =
    val yp = VectorD.fill (y_.dim)(b(0))               // y predicted for (test/full)
    (yp, diagnose (y_, yp))                            // return predictions and QoF
end test
The coefficient of determination R2 for the null regression model is always 0, i.e., none of the variance in the
random variable y is explained by the model. A more sophisticated model should only be used if it is better
than the null model, that is, when its R2 is strictly greater than zero. Also, a model can have a negative R2
if its predictions are worse than guessing the mean.
Finally, the predict method simply returns the constant b0 .

def predict (z: VectorD): Double = b(0)
For the example dataset below, with response vector y = [1, 3, 3, 4], the trained Null Model is

y = 2.75 + ε    (6.14)
x      y      ŷ        ε        ε²
1      1      11/4     −7/4     49/16
2      3      11/4     1/4      1/16
3      3      11/4     1/4      1/16
4      4      11/4     5/4      25/16
10     11     11       0        19/4 = 4.75
The sum of squared errors (sse) is given in the lower, right corner of the table. The sum of squares total for
this dataset is 4.75, so
R² = 1 − sse/sst = 1 − 4.75/4.75 = 0
The plot below illustrates how the Null Model attempts to fit the four given data points.
[Plot: the four data points and the horizontal Null Model line ŷ = 2.75]
Class Methods:
@param y  the response/output vector
6.3.6 Exercises
1. Determine the value of the second derivative of the loss function

d²L/db0² = ?

at the critical point b0 = µy . What kind of critical point is this?
2. Let the response vector y be
val y = VectorD (1, 3, 3, 4)
Draw an xy plot of the data points. Give the value for the parameter vector b. Show the error distance
for each point in the plot. Compare the sum of squared errors sse with the sum of squares total sst.
What is the value for the coefficient of determination R2 ?
3. Using ScalaTion, analyze the NullModel for the following response vector y.

val y = VectorD (2.0, 3.0, 5.0, 4.0, 6.0)          // response vector y
println (s"y = $y")
4. Execute the NullModel on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG. What
is the quality of the fit (e.g., R2 or rSq)? Is this value expected? Is it possible for a model to perform
worse than this?
6.4 Simpler Regression
The SimplerRegression class supports simpler linear regression. In this case, the predictor vector x consists
of a single variable x0 , i.e., x = [x0 ] and there is only a single parameter that is the coefficient for x0 in the
model.
y = b · x + ε = b0 x0 + ε    (6.15)

where ε represents the residuals/errors (the part not explained by the model).
6.4.2 Training
A dataset may be collected for providing an estimate for parameter b0 . Given m data points, stored in an
m-dimensional vector x0 and m response values, stored in an m-dimensional vector y, we may obtain the
following vector equation.
y = b0 x0 + ε    (6.16)
One way to find a value for parameter b0 is to minimize the norm of the residual/error vector ε.

min_b0 ‖y − b0 x0 ‖    (6.18)
This is equivalent to minimizing half the dot product (½ ‖ε‖² = ½ ε · ε = ½ sse). Thus the loss function is

L(b) = ½ (y − b0 x0 ) · (y − b0 x0 )    (6.19)
dL/db0 = −x0 · (y − b0 x0 ) = 0    (6.20)
Therefore, the optimal value for the parameter b0 is
b0 = (x0 · y)/(x0 · x0 )    (6.21)
6.4.4 Example Calculation
Consider the following data points {(1, 1), (2, 3), (3, 3), (4, 4)} and solve for the parameter (slope) b0 .
b0 = ([1, 2, 3, 4] · [1, 3, 3, 4]) / ([1, 2, 3, 4] · [1, 2, 3, 4]) = 32/30 = 16/15
Using this optimal value for the parameter b0 = 16/15, we may obtain predicted values for each of the
x-values. The table below shows the values of x, y, ŷ, ε, and ε² for the Simpler Regression Model,

y = (16/15) · [x] + ε = (16/15) x + ε
x      y      ŷ         ε         ε²
1      1      16/15     −1/15     1/225
2      3      32/15     13/15     169/225
3      3      48/15     −3/15     9/225
4      4      64/15     −4/15     16/225
10     11     160/15    5/15      13/15 = 0.867
The sum of squared errors (sse) is given in the lower, right corner of the table. The sum of squares total for
this dataset is 4.75, so
R² = 1 − sse/sst = 1 − 0.867/4.75 = 0.818
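The hand calculation above is easy to verify in a few lines (plain Scala dot products rather than ScalaTion's VectorD):

```scala
// dot product of two sequences
def dot (a: Seq [Double], b: Seq [Double]): Double =
    a.zip (b).map ((ai, bi) => ai * bi).sum

val x0 = Seq (1.0, 2.0, 3.0, 4.0)
val y  = Seq (1.0, 3.0, 3.0, 4.0)

val b0  = dot (x0, y) / dot (x0, x0)               // = 32/30 = 16/15
val yp  = x0.map (b0 * _)                          // predicted values
val e   = y.zip (yp).map ((yi, ypi) => yi - ypi)   // residuals
val sse = dot (e, e)                               // sum of squared errors
```

The computed slope and sum of squared errors agree with the fractions 16/15 and 13/15 in the table.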
The plot below illustrates how the Simpler Regression Model attempts to fit the four given data points.
[Plot: Simpler Regression Model line vs. the four data points]
Note, that this model has no intercept. This makes the solution for the parameter very easy, but may
make the model less accurate. This is remedied in the next section. Since no intercept really means the
intercept is zero, the regression line will go through the origin. This is referred to as Regression Through
the Origin (RTO) and should only be applied when the data scientist has reason to believe it makes sense.
Class Methods:
@param x       the data/input matrix (only use the first column)
@param y       the response/output vector
@param fname_  the feature/variable names (only use the first name)
6.4.6 Exercises
1. For x0 = [1, 2, 3, 4] and y = [1, 3, 3, 4], try various values for the parameter b0 . Plot the sum of squared
errors (sse) vs. b0 . Note, the code must be completed before it is compiled and run.
import scalation.mathstat._

@main def simplerRegression_exer_1 (): Unit =

    val x0  = VectorD (1, 2, 3, 4)
    val y   = VectorD (1, 3, 3, 4)
    val b0  = VectorD.range (0, 50) / 25.0
    val sse = new VectorD (b0.dim)
    for i <- b0.indices do
        val e = ?
        sse(i) = e dot e
    end for
    new Plot (b0, sse, lines = true)

end simplerRegression_exer_1
2. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points. What is the slope of this line? Pass the X matrix and y vector as
arguments to the SimplerRegression class to obtain the b = [b0 ] vector.
// 4 data points: x0
val x = MatrixD ((4, 1), 1,                        // x 4-by-1 matrix
                         2,
                         3,
                         4)
val y = VectorD (1, 3, 3, 4)                       // y vector
An alternative to using the above constructor new SimplerRegression is to use a factory method
SimplerRegression. Substitute in the following lines of code to do this.
val x  = VectorD (1, 2, 3, 4)
val rg = SimplerRegression (x, y, null)
new Plot (x, y, yp, lines = true)
3. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points and intersects the origin [0, 0]. What is the slope of this line? Pass the
X matrix and y vector as arguments to the SimplerRegression class to obtain the b = [b0 ] vector.
//                       x0
val x = MatrixD ((5, 1), 0,                   // x 5-by-1 matrix
                         1,
                         2,
                         3,
                         4)
val y = VectorD (2, 3, 5, 4, 6)               // y vector
4. Execute the SimplerRegression on the Auto MPG dataset. See scalation.modeling.Example AutoMPG.
What is the quality of the fit (e.g., R2 or rSq)? Is this value expected? What does it say about this
model? Try using different columns for the predictor variable.
5. Compute the second derivative of the loss function w.r.t. b0 , d²L/db0². Under what conditions will it be
positive?
6.5 Simple Regression
The SimpleRegression class supports simple linear regression. It combines the benefits of the last two mod-
eling techniques: the intercept model NullModel and the slope model SimplerRegression. It is guaranteed
to be at least as good as the better of these two modeling techniques. In this case, the predictor vector
x ∈ R2 consists of the constant one and a single variable x1 , i.e., [1, x1 ], so there are now two parameters
b = [b0 , b1 ] ∈ R2 in the model.
6.5.1 Model

y = b · x + ε = b0 + b1 x1 + ε

where ε represents the residuals (the part not explained by the model).
6.5.2 Training
The model is trained on a dataset consisting of m data points/vectors, stored row-wise in an m-by-2 matrix
X ∈ Rm×2 and m response values, stored in an m dimensional vector y ∈ Rm .
y = Xb + ε     (6.23)
The parameter vector b may be determined by solving the following optimization problem:

min_b ‖ε‖

Substituting ε = y − ŷ = y − Xb yields

min_b ‖y − Xb‖

Using the fact that the matrix X consists of two column vectors 1 and x1 , it can be rewritten,

min_{b0 ,b1} ‖y − [1 x1] [b0 , b1]ᵀ‖

Since x0 is just 1, for simplicity we drop the subscript on x1 . Thus the loss function ½ sse is

L(b) = ½ [y − (b0 1 + b1 x)] · [y − (b0 1 + b1 x)]     (6.27)
6.5.3 Optimization - Gradient
A function of several variables can be optimized using Vector Calculus by setting its gradient (see the Linear
Algebra Chapter) equal to zero and solving the resulting system of equations. When the system of equations
are linear, matrix factorization may be used, otherwise techniques from Nonlinear Optimization may be
needed.
Taking the gradient of the loss function L gives

∇L = [∂L/∂b0 , ∂L/∂b1]     (6.28)

The goal is to find the value of the parameter vector b that yields a zero gradient (flat response surface).
Setting the gradient equal to zero (0 = [0, 0]) yields two equations.

∇L(b) = [∂L/∂b0 (b), ∂L/∂b1 (b)] = 0     (6.29)
The gradient (the two partial derivatives) may be determined using the derivative product rule for dot
products.

½ (f · f)′ = f′ · f

Setting ∂L/∂b0 to zero and solving for b0 (using 1 · 1 = m):

−1 · (y − (b0 1 + b1 x)) = 0
1 · y − 1 · (b0 1 + b1 x) = 0
b0 1 · 1 = 1 · y − b1 1 · x

b0 = [1 · y − b1 1 · x] / m     (6.30)

Setting ∂L/∂b1 to zero:

−x · (y − (b0 1 + b1 x)) = 0
x · y − x · (b0 1 + b1 x) = 0
b0 1 · x + b1 x · x = x · y
m b0 1 · x + m b1 x · x = m x · y     (6.31)

Substituting for m b0 = 1 · y − b1 1 · x yields

[1 · y − b1 1 · x] 1 · x + m b1 x · x = m x · y
b1 [m x · x − (1 · x)²] = m x · y − (1 · x)(1 · y)
Solving for b1 gives

b1 = [m x · y − (1 · x)(1 · y)] / [m x · x − (1 · x)²]     (6.32)
The b0 parameter gives the intercept, while the b1 parameter gives the slope of the line that best fits the data
points.
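Equations (6.30) and (6.32) can be checked with a small plain-Scala sketch (not ScalaTion); the name simpleFit is illustrative.

```scala
def dot (u: Array [Double], v: Array [Double]): Double = u.zip (v).map (_ * _).sum

def simpleFit (x: Array [Double], y: Array [Double]): (Double, Double) =
    val m   = x.length.toDouble
    val one = Array.fill (x.length)(1.0)
    val b1  = (m * dot (x, y) - dot (one, x) * dot (one, y)) /
              (m * dot (x, x) - dot (one, x) * dot (one, x))     // equation (6.32)
    val b0  = (dot (one, y) - b1 * dot (one, x)) / m             // equation (6.30)
    (b0, b1)

val (b0, b1) = simpleFit (Array (1.0, 2.0, 3.0, 4.0), Array (1.0, 3.0, 3.0, 4.0))
```

For the four data points used below, this gives intercept b0 = 0.5 and slope b1 = 0.9.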
Consider again the problem from the last section where the data points are {(1, 1), (2, 3), (3, 3), (4, 4)} and
solve for the two parameters, (intercept) b0 and (slope) b1 .
The table below shows the values of x, y, ŷ, ε, and ε² for the Simple Regression Model,

x    y    ŷ     ε      ε²
1    1    1.4   -0.4   0.16
2    3    2.3    0.7   0.49
3    3    3.2   -0.2   0.04
4    4    4.1   -0.1   0.01
10   11   11     0     0.7

For which models (NullModel, SimplerRegression and SimpleRegression) did the residual/error vector
sum to zero?
The sum of squared errors (sse) is given in the lower, right corner of the table. The sum of squares total
for this dataset is 4.75, so the Coefficient of Determination,

R² = 1 − sse/sst = 1 − 0.7/4.75 = 0.853
The plot below illustrates how the Simple Regression Model (SimpleRegression) attempts to fit the
four given data points.
[Plot: Simple Regression Model Line vs. Data Points — y vs. x]
More concise and intuitive formulas for the parameters b0 and b1 may be derived.
• Using the definition for mean from Chapter 3 for µx and µy , it can be shown that the expression for
b0 shortens to
b0 = µy − b1 µx (6.33)
Draw a line through the following two points: [0, b0 ] (the intercept) and [µx , µy ] (the center of mass).
How does this line compare to the regression line?
• Now, using the definitions for covariance σx,y and variance σx² from Chapter 3, it can be shown that
the expression for b1 shortens to

b1 = σx,y / σx²     (6.34)
If the slope of the regression line is simply the ratio of the covariance to the variance, what would the
slope be if y = x? It may also be written as follows:

b1 = Sxy / Sxx     (6.35)

where Sxy = Σi (xi − µx)(yi − µy) and Sxx = Σi (xi − µx)².
Table 6.5 extends the previous table to facilitate computing the parameters vector b using the concise
formulas.
Table 6.5: Simple Regression Model: Expanded Table with Centering µx = 2.5, µy = 2.75

x    x − µx   y    y − µy   ŷ     ε      ε²
1    -1.5     1    -1.75    1.4   -0.4   0.16
2    -0.5     3     0.25    2.3    0.7   0.49
3     0.5     3     0.25    3.2   -0.2   0.04
4     1.5     4     1.25    4.1   -0.1   0.01
10    0       11    0       11     0     0.7
Sxx = Σi (xi − µx)² = 1.5² + 0.5² + 0.5² + 1.5² = 5

Syy = Σi (yi − µy)² = 1.75² + 0.25² + 0.25² + 1.25² = 4.75

Sxy = Σi (xi − µx)(yi − µy) = (−1.5 · −1.75) + (−0.5 · 0.25) + (0.5 · 0.25) + (1.5 · 1.25) = 4.5
Therefore,

b1 = Sxy / Sxx = 4.5 / 5 = 0.9
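The centered formulas (6.33)-(6.35) can be verified with a small plain-Scala sketch (not ScalaTion); the helper names mu, sxx and sxy are illustrative.

```scala
val x = Array (1.0, 2.0, 3.0, 4.0)
val y = Array (1.0, 3.0, 3.0, 4.0)

def mu (v: Array [Double]): Double = v.sum / v.length

val mux = mu (x)                                                    // 2.5
val muy = mu (y)                                                    // 2.75
val sxx = x.map (xi => (xi - mux) * (xi - mux)).sum                 // 5.0
val sxy = x.zip (y).map ((xi, yi) => (xi - mux) * (yi - muy)).sum   // 4.5

val b1 = sxy / sxx                                                  // slope (6.35):     0.9
val b0 = muy - b1 * mux                                             // intercept (6.33): 0.5
```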
Next the relationships between the predictor variables xj (the columns of the input/data matrix X) should
be compared. If two of the predictor variables are highly correlated, their individual effects on the response
variable y may be indistinguishable. The correlations between the predictor variables may be seen by
examining the correlation matrix. Including the response variable in a combined data matrix xy allows one
to see how each predictor variable is correlated with the response.
banner ("Correlation Matrix for Columns of xy")
println (s"x_fname = ${stringOf (x_fname)}")
println (s"y_name  = MPG")
println (s"xy.corr = ${xy.corr}")
Although Simple Regression may be too simple for many problems/datasets, it should be used in
Exploratory Data Analysis (EDA). A simple regression model should be created for each predictor variable xj .
The data points and the best fitting line should be plotted with y on the vertical axis and xj on the
horizontal axis. The data scientist should look for patterns/tendencies of y versus xj , such as linear, quadratic,
logarithmic, or exponential patterns. When there is no relationship, the points will appear to be randomly
and uniformly positioned in the plane.
for j <- x.indices2 do
    banner (s"Plot response y vs. predictor variable ${x_fname(j)}")
    val xj  = x(?, j)
    val mod = SimpleRegression (xj, y, Array ("one", x_fname(j)))
    mod.trainNtest ()()                                         // train and test model
    val yp  = mod.predict (mod.getX)
    new Plot (xj, y, yp, s"EDA: y and yp (red) vs. ${x_fname(j)}", lines = true)
end for
The Figure below shows four possible patterns: Linear (blue), Quadratic (purple), Inverse (green), Inverse-
Square (black). Each curve depicts a function 1 + x^p, for p = −2, −1, 1, 2.

[Plot: Finding a Pattern — Linear (blue), Quadratic (purple), Inverse (green), Inverse-Square (black); y vs. x]
To look for quadratic patterns, the following code regresses on the square of each predictor variable (i.e.,
x2j ).
for j <- x.indices2 do
    banner (s"Plot response y vs. predictor variable ${x_fname(j)}")
    val xj  = x(?, j)
    val mod = SimpleRegression.quadratic (xj, y, Array ("one", x_fname(j) + "^2"))
    mod.trainNtest ()()                                         // train and test model
    val yp  = mod.predict (mod.getX)
    new Plot (xj, y, yp, s"EDA: y and yp (red) vs. ${x_fname(j)}", lines = false)
end for
To determine the effect of having linear and quadratic terms (both xj and x²j ), the Regression class that
supports Multiple Linear Regression or the SymbolicRegression object may be used. Generally, one should
include both terms only if there is sufficient improvement over just using one term. If one term is chosen, use
the linear term unless the quadratic term is sufficiently better (see the section on Symbolic Regression for a
more detailed discussion).
Plotting
The Plot and PlotM classes in the mathstat package can be used for plotting data and results. Both use
ZoomablePanel in the scala2d package to support zooming and dragging. The mouse wheel controls the
amount of zooming (scroll value where up is negative and down is positive), while mouse dragging repositions
the objects in the panel (drawing canvas).
@param x      the x vector of data values (horizontal), use null to use y's index
@param y      the y vector of data values (primary vertical, black)
@param z      the z vector of data values (secondary vertical, red) to compare with y
@param _title the title of the plot
@param lines  flag for generating a line plot

class Plot (x: VectorD, y: VectorD, z: VectorD = null, _title: String = "Plot y vs. x",
            lines: Boolean = false)
      extends VizFrame (_title, null):
Class Methods:
@param x      the data/input matrix augmented with a first column of ones
              (only use the first two columns [1, x1])
@param y      the response/output vector
@param fname_ the feature/variable names (only use the first two names)
    ...
    b_ : VectorD = b, vifs: VectorD = vif ()): String =
def confInterval (x_ : MatrixD = getX): VectorD =
6.5.7 Exercises
1. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points (i.e., that minimizes ‖ε‖). Using the formulas developed in this section,
what are the intercept and slope [b0 , b1 ] of this line?
Also, pass the X matrix and y vector as arguments to the SimpleRegression class to obtain the b
vector.
//                       one x1
val x = MatrixD ((4, 2), 1, 1,                // x 4-by-2 matrix
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (1, 3, 3, 4)                  // y vector
2. For more complex models, setting the gradient to zero and solving a system of simultaneous equations
may not work, in which case more general optimization techniques may be applied. Two simple
optimization techniques are grid search and gradient descent.

For grid search, in a spreadsheet set up a 5-by-5 grid around the optimal point for b found in the
previous problem. Compute values for the loss function L = ½ sse for each point in the grid. Plot L
versus b0 across the optimal point. Do the same for b1 . Make a 3D plot of the surface L as a function
of b0 and b1 .

For gradient descent, pick a starting point for the parameter vector b, compute the gradient ∇L and
move by −η∇L, where η is the learning rate (e.g., 0.1). Repeat for a few iterations. What is happening
to the value of the loss function L = ½ sse? Recall that the gradient may be written in terms of the
error vector ε as

∇L = [−1 · ε, −x · ε]
3. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points. What are the intercept and slope of this line? Pass the X matrix and
y vector as arguments to the SimpleRegression class to obtain the b vector.
//                       one x1
val x = MatrixD ((5, 2), 1, 0,                // x 5-by-2 matrix
                         1, 1,
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (2, 3, 5, 4, 6)               // y vector
5. Let the errors εi have E[εi] = 0 and V[εi] = σ², and be independent of each other. Show that the variances
for the parameters b0 and b1 are as follows:

V[b1] = σ² / Sxx

Hint: V[b1] = V[Sxy / Sxx] = Sxx⁻² V[Sxy].

V[b0] = [1/m + µx²/Sxx] σ²
6. Further assume that εi ∼ N(0, σ²). Show that the confidence intervals for the parameters b0 and b1
are as follows:

b1 ± t* s / √Sxx

Hint: Let the error variance estimator be s² = sse / (m − 2) = mse.

b0 ± t* s √(1/m + µx²/Sxx)
estimate the error variance s² = sse / (m − 2) = mse. Take its square root to obtain the residual standard
error s. Use these to compute 95% confidence intervals for the parameters: b0 and b1 .
8. Consider the above simple dataset, but where the y values are reversed so the slope is negative and
the fit line is away from the origin,
val x  = VectorD (1, 2, 3, 4, 5)
val y  = VectorD (4, 5, 3, 3, 1)
val ox = MatrixD.one (x.dim) :^+ x
Compare the SimplerRegression model with the SimpleRegression model. Examine the QoF measures
for each model and make an argument for which model to pick. Also compute R0² (R² relative to 0):

R0² = 1 − ‖y − ŷ‖² / ‖y‖²     (6.37)

R² = 1 − ‖y − ŷ‖² / ‖y − µy‖²     (6.38)
For Regression Through the Origin (RTO), some software packages use R0² in place of R². See
[43] for a deeper discussion of the issues involved, including when it is appropriate to not include an
intercept b0 in the model. ScalaTion provides functions for both in the FitM trait: def rSq (the
default) and def rSq0.
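A plain-Scala sketch (not the actual FitM implementations) contrasts the two measures on the worked example, where yp holds the Simple Regression predictions; the names rSq and rSq0 mirror the trait's but the bodies here are illustrative.

```scala
def sumSq (v: Array [Double]): Double = v.map (e => e * e).sum

def rSq (y: Array [Double], yp: Array [Double]): Double =
    val muy = y.sum / y.length
    val sse = sumSq (y.zip (yp).map ((yi, ypi) => yi - ypi))
    1.0 - sse / sumSq (y.map (_ - muy))                    // R^2 relative to the mean

def rSq0 (y: Array [Double], yp: Array [Double]): Double =
    val sse = sumSq (y.zip (yp).map ((yi, ypi) => yi - ypi))
    1.0 - sse / sumSq (y)                                  // R^2 relative to 0

val y  = Array (1.0, 3.0, 3.0, 4.0)
val yp = Array (1.4, 2.3, 3.2, 4.1)
```

Here rSq gives 1 − 0.7/4.75 ≈ 0.853 while rSq0 gives 1 − 0.7/35 = 0.98, showing how R0² flatters the fit.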
6.6 Regression
The Regression class supports multiple linear regression where multiple input/predictor variables are used
to predict a value for the response/output variable. When the response variable has non-zero correlations
with multiple predictor variables, this technique tends to be effective, efficient and leads to explainable
models. It should be applied typically in combination with more complex modeling techniques. In this case,
the predictor vector x is multi-dimensional [1, x1 , ...xk ] ∈ Rn , so the parameter vector b = [b0 , b1 , . . . , bk ] has
the same dimension as x, while response y is a scalar.
[Figure 6.1: Multiple Linear Regression depicted as a network: input nodes x0, x1, x2 connect to the
output node y via edge weights b0, b1, b2, with bias β]
The intercept can be provided by fixing x0 to one, making b0 the intercept. Alternatively, x0 can be used
as a regular input variable by introducing another parameter β for the intercept. In Neural Networks, β is
referred to as the bias and bj is referred to as the edge weight connecting input vertex/node j to the output
node, as shown in Figure 6.1. Note, if an activation function fa is added to the model, the Multiple Linear
Regression model becomes a Perceptron model.

y = b · x + ε = b0 + b1 x1 + ... + bk xk + ε     (6.39)

where ε represents the residuals (the part not explained by the model).
6.6.2 Training
Using several data samples as a training set (X, y), the Regression class in ScalaTion can be used to
estimate the parameter vector b. Each sample pairs an x input vector with a y response value. The x vectors
are placed into a data/input matrix X ∈ Rm×n row-by-row with a column of ones as the first column in X.
The individual response values taken together form the response vector y ∈ Rm .
The training diagram shown in Figure 6.2 illustrates how the ith instance/row flows through the diagram
computing the predicted response ŷ = b · x and the error = y − ŷ.
[Figure 6.2: Training diagram — instance i of (X, y) feeds inputs xi0, xi1, xi2 through weights b0, b1, b2
to produce ŷ = b · x, which is compared with yi to give ε = y − ŷ]
The matrix-vector product Xb provides an estimate for the response vector ŷ.

y = Xb + ε     (6.40)

The goal is to minimize the distance between y and its estimate ŷ, i.e., minimize the norm of the residual/error
vector.

min_b ‖ε‖

Substituting ε = y − ŷ = y − Xb yields

min_b ‖y − Xb‖

This is equivalent to minimizing half the dot product of the error vector with itself (½‖ε‖² = ½ ε · ε = ½ sse).
Thus, the loss function is

L(b) = ½ (y − Xb) · (y − Xb)     (6.43)
Applying the derivative product rule for dot products

½ (f · f)′ = f′ · f     (6.45)

yields the j th partial derivative.
∂L/∂bj = −x:j · (y − Xb) = −x:jᵀ (y − Xb)     (6.46)
Notice that the parameter bj is only multiplied by column x:j in the matrix-vector product Xb. The dot
product is equivalent to a transpose operation followed by matrix multiplication. The gradient is formed by
collecting all these partial derivatives together.
∇L = −Xᵀ (y − Xb)     (6.47)
Now, setting the gradient equal to the zero vector 0 ∈ Rn yields

−Xᵀ (y − Xb) = 0
−Xᵀ y + (Xᵀ X) b = 0

A more detailed derivation of this equation is given in section 3.4 of “Matrix Calculus: Derivation and Simple
Application” [82]. Rearranging the terms results in the Normal Equations.

(Xᵀ X) b = Xᵀ y     (6.48)
Note: equivalent to minimizing the distance between y and Xb is minimizing the sum of the squared
residuals/errors (Least Squares method).
ScalaTion provides five techniques for solving for the parameter vector b based on the Normal Equa-
tions: Matrix Inversion, LU Factorization, Cholesky Factorization, QR Factorization and SVD Factorization.
Matrix Inversion

Given the Normal Equations

(Xᵀ X) b = Xᵀ y

a simple technique is Matrix Inversion, which involves computing the inverse of Xᵀ X and using it to multiply
both sides of the Normal Equations.

b = (Xᵀ X)⁻¹ Xᵀ y     (6.49)

where (Xᵀ X)⁻¹ is an n-by-n matrix, Xᵀ is an n-by-m matrix and y is an m-vector. When X is full rank,
the expression above involving the X matrix may be referred to as the pseudo-inverse X⁺.

X⁺ = (Xᵀ X)⁻¹ Xᵀ
When X is not full rank, Singular Value Decomposition may be applied to compute X⁺. Using the pseudo-
inverse, the parameter vector b may be solved for as follows:

b = X⁺ y     (6.50)
The pseudo-inverse can be computed by first multiplying X by its transpose. Gaussian Elimination can be
used to compute the inverse of this, which can be then multiplied by the transpose of X. In ScalaTion,
the computation for the pseudo-inverse (x pinv) looks similar to the math.
val x_pinv = (x.T * x).inverse * x.T
Most of the factorization classes/objects implement matrix inversion, including Fac Inv, Fac LU, Fac Cholesky,
and Fac QR. The default Fac LU combines reasonable speed and robustness.

def inverse: MatrixD = Fac_LU.inverse (this)()

For efficiency, the code in Regression does not calculate x_pinv; rather it directly solves for the parameters
b.

val b = fac.solve (x.T * y)
Starting from the solution to the Normal Equations that uses the inverse to determine the optimal parameter
vector b,

b = (Xᵀ X)⁻¹ Xᵀ y     (6.51)

one can substitute the rhs into the prediction equation ŷ = Xb:

ŷ = X (Xᵀ X)⁻¹ Xᵀ y = H y     (6.52)

where H = X (Xᵀ X)⁻¹ Xᵀ is the hat matrix (it puts a hat on y). The hat matrix may be viewed as a projection
matrix.
LU Factorization

Xᵀ X = LU

where L is a lower left triangular n-by-n matrix and U is an upper right triangular n-by-n matrix. Then the
normal equations may be rewritten

LU b = Xᵀ y

Letting w = U b allows the problem to be solved in two steps. The first is solved by forward substitution to
determine the vector w.

L w = Xᵀ y
U b = w
Example Calculation

Consider the example where the input/data matrix X and output/response vector y are as follows:

X = [[1, 1],
     [1, 2],
     [1, 3],
     [1, 4]],        y = [1, 3, 3, 4]

Putting these values into the Normal Equations (Xᵀ X) b = Xᵀ y yields the augmented matrix

[ 4   10 | 11 ]
[ 10  30 | 32 ]

Multiply the first row by -2.5 and add it to the second row,

[ 4   10 | 11  ]
[ 0    5 | 4.5 ]

This results in the following optimal parameter vector b = [.5, .9]. Note, the product of L and U gives Xᵀ X.
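The example calculation can be verified in plain Scala (not ScalaTion): form the Normal Equations for the 4-by-2 X above, then solve the 2-by-2 system (Cramer's rule here, for brevity, in place of elimination).

```scala
val xm = Array (Array (1.0, 1.0), Array (1.0, 2.0), Array (1.0, 3.0), Array (1.0, 4.0))
val y  = Array (1.0, 3.0, 3.0, 4.0)

def col (j: Int): Array [Double] = xm.map (row => row(j))
def dot (u: Array [Double], v: Array [Double]): Double = u.zip (v).map (_ * _).sum

val a = Array (Array (dot (col(0), col(0)), dot (col(0), col(1))),   // X^T X = [[4, 10], [10, 30]]
               Array (dot (col(1), col(0)), dot (col(1), col(1))))
val c = Array (dot (col(0), y), dot (col(1), y))                     // X^T y = [11, 32]

val det = a(0)(0) * a(1)(1) - a(0)(1) * a(1)(0)                      // 4*30 - 10*10 = 20
val b1  = (a(0)(0) * c(1) - a(1)(0) * c(0)) / det                    // slope     0.9
val b0  = (c(0) - a(0)(1) * b1) / a(0)(0)                            // intercept 0.5
```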
Cholesky Factorization

Xᵀ X = L Lᵀ

where L is a lower triangular n-by-n matrix. Then the normal equations may be rewritten

L Lᵀ b = Xᵀ y

Letting w = Lᵀ b, we may solve for w using forward substitution

L w = Xᵀ y

and then solve for b using backward substitution.

Lᵀ b = w

As an example, the product of L and its transpose Lᵀ gives Xᵀ X.
Therefore, w can be determined by forward substitution and b by backward substitution.
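A 2-by-2 Cholesky factorization with its two substitution passes can be sketched in plain Scala (not ScalaTion's Fac_Cholesky), applied to Xᵀ X = [[4, 10], [10, 30]] and Xᵀ y = [11, 32] from the running example.

```scala
val a = Array (Array (4.0, 10.0), Array (10.0, 30.0))      // X^T X
val c = Array (11.0, 32.0)                                 // X^T y

// factor a = L L^T for a 2-by-2 symmetric positive definite matrix
val l00 = math.sqrt (a(0)(0))                              // 2
val l10 = a(1)(0) / l00                                    // 5
val l11 = math.sqrt (a(1)(1) - l10 * l10)                  // sqrt (5)

// forward substitution: L w = X^T y
val w0 = c(0) / l00
val w1 = (c(1) - l10 * w0) / l11

// backward substitution: L^T b = w
val b1 = w1 / l11                                          // slope     0.9
val b0 = (w0 - l10 * b1) / l00                             // intercept 0.5
```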
QR Factorization

X = QR

where Q is an orthogonal m-by-n matrix and R is a right upper triangular n-by-n matrix. Starting
again with the Normal Equations,

(Xᵀ X) b = Xᵀ y
(QR)ᵀ QR b = (QR)ᵀ y
Rᵀ Qᵀ QR b = Rᵀ Qᵀ y

and using the fact that Qᵀ Q = I, we obtain the following:

Rᵀ R b = Rᵀ Qᵀ y

Multiplying both sides by (Rᵀ)⁻¹ yields

R b = Qᵀ y

Since R is an upper triangular matrix, the parameter vector b can be determined by backward substitution.
Alternatively, the pseudo-inverse may be computed as follows:

X⁺ = R⁻¹ Qᵀ
private def solver (x_ : MatrixD): Factorization =
    algorithm match
    case "Fac_Cholesky" => new Fac_Cholesky (x_.T * x_)     // Cholesky Factorization
    case "Fac_LU"       => new Fac_LU (x_.T * x_)           // LU Factorization
    case "Fac_Inverse"  => new Fac_Inverse (x_.T * x_)      // Inverse Factorization
    case "Fac_SVD"      => new Fac_SVD (x_)                 // Singular Value Decomp.
    case _              => new Fac_QR (x_)                  // QR Factorization
    end match
end solver
The train method below computes parameter/coefficient vector b by calling the solve method provided
by the factorization classes.
def train (x_ : MatrixD = x, y_ : VectorD = y): Unit =
    val fac = solver (x_)
    fac.factor ()                                           // factor the matrix
    ...
    if b(0).isNaN then flaw ("train", s"parameter b = $b")
    debug ("train", s"$fac estimates parameter b = $b")
end train
After training, the test method does two things: First, the residual/error vector is computed. Second,
several quality of fit measures are computed by calling the diagnose method.
def test (x_ : MatrixD = x, y_ : VectorD = y): (VectorD, VectorD) =
    val yp = predict (x_)                                   // make predictions
    e = y_ - yp                                             // RECORD the residuals/errors
    (yp, diagnose (y_, yp))                                 // return predictions and QoF
end test
To see how the train and test methods work in a Regression model see the Collinearity Test and Texas
Temperatures examples in subsequent subsections.
Degrees of Freedom
the prediction vector ŷ is a projection of the response vector y ∈ Rm onto Rk , the space (hyperplane)
spanned by the vectors x1 , . . . xk . Since = y − ŷ, one might think that the residual/error ∈ Rm−k . As
P
i i = 0 when an intercept parameter b0 is included in the model (n = k + 1), this constraint reduces the
dimensionality of the space by one, so ∈ Rm−n .
Therefore, the Degrees of Freedom (DoF) captured by the regression model is dfr and left for error is df
are indicated in the table below.
As an example, the equation ŷ = 2x1 + x2 + .5 defines a dfr = 2 dimensional hyperplane (or ordinary plane)
as shown in Figure 6.3.
[Figure 6.3: The hyperplane ŷ = 2x1 + x2 + .5 plotted over x1 and x2]
Adjusted Coefficient of Determination R̄²

rdf = (dfr + df) / df

SimplerRegression is at one extreme of model complexity, where df = m−1 and dfr = 1, so rdf = m/(m−1)
is close to one. For a more complicated model, say with n = m/2, rdf will be close to 2. This ratio can be
used to adjust the Coefficient of Determination R² to reduce it with an increasing number of parameters. This
is called the Adjusted Coefficient of Determination R̄².

R̄² = 1 − rdf (1 − R²)

Suppose m = 121, n = 21 and R² = 0.9; as an exercise, show that rdf = 1.2 and R̄² = 0.88.
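The computation can be sketched in plain Scala (the names rdf and rSqBar mirror the quantities above; this assumes an intercept model, so that dfr = n − 1, df = m − n and dfr + df = m − 1).

```scala
def rdf (m: Int, n: Int): Double = (m - 1).toDouble / (m - n)    // (df_r + df) / df

def rSqBar (rSq: Double, m: Int, n: Int): Double =
    1.0 - rdf (m, n) * (1.0 - rSq)                               // adjusted R^2

val r = rdf (121, 21)                                            // 120 / 100 = 1.2
val a = rSqBar (0.9, 121, 21)                                    // 1 - 1.2 * 0.1 = 0.88
```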
Dividing sse and ssr by their respective Degrees of Freedom gives the mean square error and regression,
respectively
mse = sse / df
msr = ssr / dfr
The mean square error mse follows a Chi-square distribution with df Degrees of Freedom, while the mean
square regression msr follows a Chi-square distribution with dfr Degrees of Freedom. Consequently, the
ratio

msr / mse ∼ F(dfr , df)     (6.53)
that is, it follows an F -distribution with (dfr , df ) Degrees of Freedom. If this number exceeds the critical
value, one can claim that the parameter vector b is not zero, implying the model is useful. More general
quality of fit measures useful for comparing models are the Akaike Information Criterion (AIC) and Bayesian
Information Criterion (BIC).
In ScalaTion, several Quality of Fit (QoF) measures are computed by the diagnose method in the
Fit class, as described in section 1 of this chapter.
def diagnose (y: VectorD, yp: VectorD, w: VectorD = null)
It looks at different ways to measure the difference between the actual y and predicted yp values for the
response. The differences are optionally weighted by the vector w. Weighting is not applied when w is null.
Now the difficult issue is how to guard against over-fitting. With enough flexibility and parameters to
fit, modeling techniques can push quality measures like R² to perfection (R² = 1) by fitting both the signal and
the noise in the data. Doing so tends to make a model worse in practice than a simpler model that just
captures the signal. That is where quality measures like R̄² (or AIC) come into play, but computation of
R̄² requires determination of Degrees of Freedom (df), which may be difficult for some modeling techniques.
Furthermore, the amount of penalty introduced by such quality measures is somewhat arbitrary.

Would it not be better to measure quality in a way in which models fitting noise are downgraded because
they perform more poorly on data they have not seen? Is it really a test, if the model has already seen
the data? The answers to these questions are obvious, but the solution of the underlying problem is a bit
tricky. The first thought would be to divide the dataset in half, but then only half of the data are available
for training. Also, picking a different half may result in substantially different quality measures.
This leads to two guiding principles: First, the majority of the data should be used for training. Second,
multiple testing should be done. In general, conducting real-world tests of a model can be difficult. There
are, however, strategies that attempt to approximate such testing. Two simple and commonly used strategies
are the following: Leave-One-Out and Cross-Validation. In both cases, a dataset is divided into a training
set and a test set.
Leave-One-Out
When fitting the parameters b, the more data available in the training set, in all likelihood, the better the
fit. The Leave-One-Out strategy takes this to the extreme by splitting the dataset into a training set of
size m − 1 and a test set of size 1 (e.g., row t in data matrix X). From this, a test error yt − b · xt can be
computed. This can be repeated by iteratively letting t range from the first to the last row of data matrix
X. For certain predictive analytics techniques such as Multiple Linear Regression, there are efficient ways
to compute the test sse based on the leverage each point in the training set has [85].
k-Fold Cross-Validation
A more generally applicable strategy is called cross-validation, where a dataset is divided into k test sets.
For each test set, the corresponding training set consists of all the instances not chosen for that test set. A simple
way to do this is to let the first test dataset be the first m/k rows of matrix X, the second be the second m/k
rows, etc.
val tsize = m / k                                           // test set size
for l <- 0 until k do
    val x_e = x (l * tsize until (l+1) * tsize)             // l-th test set
    val x_  = x.not (l * tsize until (l+1) * tsize)         // l-th training set
end for
The model is trained k times using each of the training sets. The corresponding test set is then used to
estimate the test sse (or other quality measure such as mse). These are more meaningful out-of-sample
results. From each of these samples, a mean, standard deviation and confidence interval may be computed
for the test sse.
Due to patterns that may exist in the dataset, it is more robust to randomly select each of the test sets.
The row indices may be permuted for a random selection that ensures that all data instances show up in
exactly one test set.
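The randomized fold selection can be sketched in plain Scala (not ScalaTion's crossValidate): permute the row indices once, then slice the permutation into k test sets so that every instance appears in exactly one test set. The name foldIndices is illustrative.

```scala
import scala.util.Random

def foldIndices (m: Int, k: Int, seed: Long = 0L): IndexedSeq [Vector [Int]] =
    val perm  = new Random (seed).shuffle ((0 until m).toVector)   // random permutation of row indices
    val tsize = m / k                                              // test set size
    for l <- 0 until k yield perm.slice (l * tsize, (l + 1) * tsize)

val folds = foldIndices (20, 5)                            // five disjoint test sets of size 4
```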
Typically, training QoF (in-sample) measures such as R² will be better than testing QoF (out-of-sample)
measures such as R²cv. Adjusted measures such as R̄² are intended to more closely follow R²cv than R².
ScalaTion supports cross-validation via its crossValidate method.
@param k     the number of cross-validation iterations/folds (defaults to 5)
@param rando flag indicating whether to use randomized or simple cross-validation
3
It also supports a simpler strategy that only tests once, via its validate method defined in the Predictor
trait. It utilizes the Test-n-Train Split (TnT_Split) from the mathstat package.

@param rando flag indicating whether to use randomized or simple validation
@param ratio the ratio of the TESTING set to the full dataset (e.g., 70-30, 80-20)
@param idx   the prescribed TESTING set indices
    ...
    train (x_, y_)
    val qof = test (x_e, y_e)._2
    if qof (QoF.sst.ordinal) <= 0.0 then
        flaw ("validate", "chosen testing set has no variability")
    end if
    println (FitM.fitMap (qof, QoF.values.map (_.toString)))
    qof
end validate
6.6.11 Collinearity
Consider the matrix-vector equation used for estimating the parameters b via the minimization of ‖ε‖.

y = Xb + ε

The parameter/coefficient vector b = [b0 , b1 , . . . , bk ] may be viewed as weights on the column vectors in the
data/predictor matrix X.

y = b0 1 + b1 x:1 + . . . + bk x:k + ε
A question arises when two of these column vectors are nearly the same (or more generally, nearly parallel
or anti-parallel): they will affect and may obfuscate each other's parameter values.
First, we will examine ways of detecting such problems and then give some remedies. A simple check is to
compute the correlation matrix for the column vectors in matrix X. High (positive or negative) correlation
indicates collinearity.
Example Problem
Consider the following data/input matrix X and response vector y. This is the same example used for
SimpleRegression with new variable x2 added (i.e., y = b0 + b1 x1 + b2 x2 + ). The collinearityTest
main function allows one to see the effects of increasing the collinearity of features/variables x1 and x2 .
package <your-package>

...
//                       one x1 x2
val x = MatrixD ((4, 3), 1, 1, 1,                           // input/data matrix
                         1, 2, 2,
                         1, 3, 3,
                         1, 4, 0)                           // change 0 by adding .5 until it's 4
...
val v = x(?, 0 until 2)
banner (s"Test without column x2")
println (s"v = $v")
var mod = new Regression (v, y)
mod.trainNtest ()()
println (mod.summary ())

for i <- 0 to 8 do
    banner (s"Test Increasing Collinearity: x_32 = ${x(3, 2)}")
    println (s"x = $x")
    println (s"x.corr = ${x.corr}")
    mod = new Regression (x, y)
    mod.trainNtest ()()
    println (mod.summary ())
    x(3, 2) += 0.5
end for

end collinearityTest
Try changing the value of element x32 from 0 to 4 by .5 and observe what happens to the correlation
matrix. What effect do these changes have on the parameter vector b = [b0 , b1 , b2 ], and how do the first two
parameters compare to the regression where the last column of X is removed, giving the parameter vector
b = [b0 , b1 ]?
The corr method is provided by the scalation.mathstat.MatrixD class. For this method, if either
column vector has zero variance, it returns 1.0 when the column vectors are the same, and -0.0
(indicating undefined) otherwise.
Note, perfect collinearity produces a singular matrix, in which case many factorization algorithms will
give NaN (Not-a-Number) for much of their output. In this case, Fac SVD (Singular Value Decomposition)
should be used. This can be done by changing the following hyper-parameter provided by the Regression
object, before instantiating the Regression class.
Multi-Collinearity
Even if no particular entry in the correlation matrix is high, a column in the matrix may still be nearly
a linear combination of other columns. This is the problem of multi-collinearity. This can be checked by
computing the Variance Inflation Factor (VIF) function (or vif in ScalaTion). For a particular parameter
bj for the variable/predictor xj , the function is evaluated as follows:
vif(bj) = 1 / (1 − R²(xj))     (6.54)
where R²(xj) is the R² for the regression of variable xj onto the rest of the predictors. It measures how well the
variable xj (or its column vector x:j ) can be predicted by all xl for l ≠ j. Values above 20 (R²(xj) = 0.95)
are considered by some to be problematic. In particular, the value for parameter bj may be suspect, since
its variance is inflated by vif(bj).
σ̂²(bj) = [mse / (k σ̂²(xj))] · vif(bj)     (6.55)
See the exercises for details. Both corr and vif may be tested in ScalaTion using RegressionTest4.
One remedy to reduce collinearity/multi-collinearity is to eliminate the variable with the highest corr/vif
value. Another is to use regularized regression such as RidgeRegression or LassoRegression.
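As a sketch of the VIF computation, the following self-contained Python (illustrative only, not ScalaTion's implementation; the helper names solve and r_sq and the toy matrix are made up) regresses one column on the remaining columns and applies Equation 6.54:

```python
def solve(A, c):
    # solve the linear system A x = c by Gaussian elimination with pivoting
    n = len(A)
    M = [A[i][:] + [c[i]] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for k in range(i, n + 1):
                M[r][k] -= f * M[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def r_sq(X, y):
    # R^2 of an OLS fit obtained from the normal equations (X^T X) b = X^T y
    m, n = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(n)]
           for i in range(n)]
    Xty = [sum(X[r][i] * y[r] for r in range(m)) for i in range(n)]
    b = solve(XtX, Xty)
    yp = [sum(b[j] * X[r][j] for j in range(n)) for r in range(m)]
    mu = sum(y) / m
    sse = sum((y[r] - yp[r]) ** 2 for r in range(m))
    sst = sum((v - mu) ** 2 for v in y)
    return 1.0 - sse / sst

def vif(X, j):
    # vif(b_j) = 1 / (1 - R^2(x_j)): regress column j on the other columns
    Xo = [[v for c, v in enumerate(row) if c != j] for row in X]
    xj = [row[j] for row in X]
    return 1.0 / (1.0 - r_sq(Xo, xj))

# toy matrix [1, x1, x2] where x2 is nearly 2 * x1 -> huge VIF for column 2
X = [[1.0, 1.0, 2.1], [1.0, 2.0, 3.9], [1.0, 3.0, 6.05],
     [1.0, 4.0, 7.95], [1.0, 5.0, 10.0]]
print(vif(X, 2))
```

Here no single pairwise correlation need be extreme for the VIF to blow up; it is the near linear dependence on the whole set of other columns that matters.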
Forward Selection
The forwardSel method, coded in the Predictor trait, performs forward selection by adding the most
predictive variable to the existing model, returning the variable to be added and a reference to the new
model with the added variable/feature.
1 @ param cols the columns of matrix x currently included in the existing model
2 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
3
4 def forwardSel ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ) : BestStep =
The BestStep is used to record the best improvement step found so far.
1 @ param col the column / variable to ADD / REMOVE for this step
2 @ param qof the Quality of Fit ( QoF ) for this step
3 @ param mod the model including selected features / variables for this step
4
5 case class BestStep ( col : Int = -1 , qof : VectorD = null , mod : Predictor = null )
Selecting the most predictive variable to add boils down to comparing models on the basis of a Quality of Fit (QoF) measure. The default is the Adjusted Coefficient of Determination R̄². The optional argument idx_q indicates which QoF measure to use (defaults to QoF.rSqBar.ordinal). To start with a minimal model, set cols = Set (0) for an intercept-only model. The method will consider every variable/column in x.indices2 not already in cols and pick the best one for inclusion.
1 for j <- x . indices2 if ! ( cols contains j ) do
To find the best model, the forwardSel method should be called repeatedly while the quality of fit measure
is sufficiently improving. This process is automated in the forwardSelAll method.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param cross whether to include the cross - validation QoF measure
3
4 def forwardSelAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :
5 ( LinkedHashSet [ Int ] , MatrixD ) =
The forwardSelAll method takes the QoF measure to use as the selection criterion and whether to apply
cross-validation as inputs and returns the best collection of features/columns to include in the model as well
as the QoF measures for all steps.
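The greedy step that forwardSel performs can be sketched in plain Python (an analogue of the idea, not ScalaTion's code; the solve/r_sq helpers and toy dataset are made up):

```python
def solve(A, c):
    # solve A x = c by Gaussian elimination with partial pivoting
    n = len(A)
    M = [A[i][:] + [c[i]] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for k in range(i, n + 1):
                M[r][k] -= f * M[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def r_sq(X, y):
    # R^2 of an OLS fit via the normal equations (X^T X) b = X^T y
    m, n = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(n)]
           for i in range(n)]
    Xty = [sum(X[r][i] * y[r] for r in range(m)) for i in range(n)]
    b = solve(XtX, Xty)
    yp = [sum(b[j] * X[r][j] for j in range(n)) for r in range(m)]
    mu = sum(y) / m
    sse = sum((y[r] - yp[r]) ** 2 for r in range(m))
    sst = sum((v - mu) ** 2 for v in y)
    return 1.0 - sse / sst

def forward_sel(X, y, cols):
    # one step of forward selection: try each column not yet in cols and
    # return (best column, adjusted R^2) -- a plain-Python analogue of the
    # greedy step with idx_q = rSqBar, not ScalaTion's actual forwardSel
    m, n = len(X), len(X[0])
    best_j, best_adj = -1, float("-inf")
    for j in range(n):
        if j in cols:
            continue
        cs = sorted(cols | {j})
        Xs = [[row[c] for c in cs] for row in X]
        r2 = r_sq(Xs, y)
        adj = 1.0 - (1.0 - r2) * (m - 1) / (m - len(cs))   # adjusted R^2
        if adj > best_adj:
            best_j, best_adj = j, adj
    return best_j, best_adj

# toy data: column 0 is the intercept, column 1 drives y, columns 2-3 less so
X = [[1.0, 1.0, 2.0,  1.0], [1.0, 2.0, 1.0, -1.0],
     [1.0, 3.0, 4.0,  1.0], [1.0, 4.0, 3.0, -1.0],
     [1.0, 5.0, 6.0,  1.0], [1.0, 6.0, 5.0, -1.0],
     [1.0, 7.0, 8.0,  1.0], [1.0, 8.0, 7.0, -1.0]]
y = [5.1, 7.9, 11.0, 14.1, 16.9, 20.0, 23.1, 25.9]
print(forward_sel(X, y, {0}))   # expect column 1 to be added first
```

Calling this step repeatedly while the criterion keeps improving mirrors what forwardSelAll automates.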
To see how R², R̄², sMAPE, and R²cv change with the number of features/parameters added to the model by the forwardSelAll method, run the following test code from the scalation.modeling package.
sMAPE, the symmetric Mean Absolute Percentage Error, is explained in detail in the Time Series/Temporal Models Chapter.
Backward Elimination
The backwardElim method, coded in the Predictor trait, performs backward elimination by removing the
least predictive variable from the existing model, returning the variable to eliminate, the new parameter
vector and a reference to the new model with the removed variable/feature.
1 @ param cols the columns of matrix x currently included in the existing model
2 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
3 @ param first first variable to consider for elimination
4 ( default (1) assume intercept x_0 will be in any model )
5
6 def backwardElim ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ,
7 first : Int = 1) : BestStep =
To start with a maximal model, set cols = Set (0, 1, ..., k) for a full model. As with forwardSel, the idx_q optional argument allows one to choose from among the QoF measures. The last parameter first provides immunity from elimination for any variable/parameter whose index is less than first (e.g., to ensure that models include an intercept b0, set first to one). The method will consider every variable/column from first until x.dim2 that is in cols and pick the worst one for elimination.
1 for j <- first until x . dim2 if cols contains j do
To find the best model, the backwardElim method should be called repeatedly until the quality of fit measure
sufficiently decreases. This process is automated in the backwardElimAll method.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param first first variable to consider for elimination
3 @ param cross whether to include the cross - validation QoF measure
4
The backwardElimAll method takes the QoF measure to use as the selection criterion, the index of the first
variable to consider for elimination, and whether to apply cross-validation as inputs and returns the best
collection of features/columns to include in the model as well as the QoF measures for all steps.
Some studies have indicated that backward elimination can outperform forward selection, but it is difficult
to say in general.
More advanced feature selection techniques include using genetic algorithms to find near optimal subsets
of variables as well as techniques that select variables as part of the parameter estimation process, e.g.,
LassoRegression.
Stepwise Regression
An improvement over Forward Selection and Backward Elimination is possible with Stepwise Regression.
It starts with either no variables or the intercept in the model and adds one variable that improves the
selection criterion the most. It then adds the second best variable for step two. After the second step, it
determines whether it is better to add or remove a variable. It continues in this fashion until no improvement
in the selection criterion is found, at which point it terminates. Note, for Forward Selection and Backward Elimination it may be instructive to continue all the way to the end (all variables for forward/no variables for backward).
Stepwise regression may lead to coincidental relationships being included in the model, particularly if a t-test is the basis of inclusion or a penalty-free QoF measure such as R² is used. Typically, this approach is used when there is a penalty for having extra variables/parameters, e.g., adjusted R² (R̄²), cross-validation R² (R²cv), or the Akaike Information Criterion (AIC). See the section on Maximum Likelihood Estimation for a definition of AIC. Alternatives to Stepwise Regression include Lasso Regression (ℓ1 regularization) and, to a lesser extent, Ridge Regression (ℓ2 regularization).
ScalaTion provides the stepRegressionAll method for Stepwise Regression. At each step it calls
forwardSel and backwardElim and chooses the one yielding better improvement.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param cross whether to include the cross - validation QoF measure
3
An option for further improvement is to add a swapping operation, which finds the best variable to remove
and replace with a variable not in the model. Unfortunately, this may lead to a quadratic number of steps in
the worst-case (as opposed to linear for forward, backward and stepwise without swapping). See the exercises
for more details.
Categorical Variables/Features
For Regression, the variables/features have so far been treated as continuous or ordinal. However, some
variables may be categorical in nature, where there is no ordering of the values for a categorical variable.
Although one can encode “English”, “French”, “Spanish” as 0, 1, and 2, it may lead to problems such as concluding that the average of “English” and “Spanish” is “French”.
In such cases, it may be useful to replace a categorical variable with multiple dummy variables. Typically, a categorical variable (column in the data matrix) taking on k distinct values is replaced with k − 1 dummy variables (columns in the data matrix). For details on how to do this effectively, see the section on
RegressionCat.
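The k − 1 dummy-variable scheme can be sketched in a few lines of Python (illustrative only; the function name and language example are made up for this sketch):

```python
def dummy_columns(values):
    # replace a categorical column taking k distinct values with k - 1
    # zero/one dummy columns; the dropped category acts as the baseline
    cats = sorted(set(values))
    rest = cats[1:]                       # cats[0] is the baseline
    return [[1.0 if v == c else 0.0 for c in rest] for v in values]

langs = ["English", "French", "Spanish", "French"]
print(dummy_columns(langs))   # French and Spanish columns; English = baseline
```

With this encoding there is no artificial ordering among the categories, and the baseline category is represented by all dummies being zero.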
The trainNtest method defined in the Predictor trait does several things: trains the model on x and y ,
tests the model on xx and yy, produces a report about training and testing, and optionally plots y-actual
and y-predicted.
1 @ param x_ the training / full data / input matrix ( defaults to full x )
2 @ param y_ the training / full response / output vector ( defaults to full y )
3 @ param xx the testing / full data / input matrix ( defaults to full x )
4 @ param yy the testing / full response / output vector ( defaults to full y )
5
The report method returns the following basic information: (1) the name of the modeling technique mn, (2)
the values of the hyper-parameters hp (used for controlling the model/optimizer), (3) the feature/predictor
variable names fn, (4) the values of the parameters b, and (5) several Quality of Fit measures qof.
REPORT
----------------------------------------------------------------------------
modelName mn = Regression
----------------------------------------------------------------------------
hparameter hp = HyperParameter(factorization -> (Fac_QR,Fac_QR))
----------------------------------------------------------------------------
features fn = Array(x0, x1, x2, x3)
----------------------------------------------------------------------------
parameter b = VectorD(151.298,-1.99323,-0.000955478,-0.384710)
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(
rSq -> 0.991921, rSqBar -> 0.989902, sst -> 941.937500, sse -> 7.609494,
mse0 -> 0.475593, rmse -> 0.689633, mae -> 0.531353, dfm -> 3.000000,
df -> 12.000000, fStat -> 491.138015, aic -> -8.757481, bic -> -5.667126,
mape -> 1.095990, smape -> 1.094779, mase -> 0.066419)
The plot below shows the results from running the ScalaTion Regression Model in terms of actual (y) vs.
predicted (yp) response vectors.
[Figure: Regression Model: y (*) vs. yp (+), plotted against the instance index]
More details about the parameters/coefficients including standard errors, t-values, p-values, and Variance
Inflation Factors (VIFs) are shown by the summary method.
1 println ( mod . summary () )
For the Texas Temperatures dataset it provides the following information: The Estimate is the value assigned
to the parameter for the given Var. The Std. Error, t-value, p-value and VIF are also given.
Given the following assumptions: (1) ε ∼ D(0, σI) for some distribution D and (2) for each column j, ε and xj are independent, the covariance matrix of the parameter vector b is

$$C[b] = \sigma^2 (X^{\top} X)^{-1} \qquad (6.56)$$
$$\hat{\sigma}^2 = \frac{sse}{df} = mse \qquad (6.57)$$
the standard deviation (or standard error) of the j th parameter/coefficient may be given as the square root
of the j th diagonal element of the covariance matrix.
$$\hat{\sigma}_{b_j} = \hat{\sigma}\,\sqrt{\left[(X^{\top} X)^{-1}\right]_{jj}} \qquad (6.58)$$
The corresponding t-value is simply the parameter value divided by its standard error, which indicates how many standard deviation units it is away from zero. The farther away from zero, the more significant (or more important to the model) the parameter is.
$$t(b_j) = \frac{b_j}{\hat{\sigma}_{b_j}} \qquad (6.59)$$
When the error distribution is Normal, then t(bj ) follows the Student’s t Distribution. For example, the pdf
for the Student’s t Distribution with df = ν = 2 Degrees of Freedom is shown in the figure below (the t
Distribution approaches the Normal Distribution as ν increases).
[Figure: pdf fy(y) of the Student’s t Distribution with ν = 2 Degrees of Freedom]
The corresponding p-value P(|y| > t) measures how significant the t-value is, e.g., for ν = df = 12,

Fy(−1.683018) = 0.0590926
P(|y| > 1.683018) = 2 Fy(−1.683018) = 0.118185
Typically, the t-value is only considered significant if it is in the tails of the Student’s t distribution. The farther out in the tails, the less likely it is for the parameter to be non-zero (and hence be part of the model) simply by chance. The p-value measures the risk (chance of being wrong) in including parameter bj and therefore variable xj in the model.
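The t-value/p-value arithmetic can be sketched in Python using the closed-form CDF of the t distribution for ν = 2 (the case plotted above); the estimate and standard error below are made-up illustrative numbers, not a particular dataset's output:

```python
import math

def t_cdf_nu2(t):
    # closed-form CDF of the Student's t distribution with nu = 2
    # degrees of freedom
    return 0.5 + t / (2.0 * math.sqrt(t * t + 2.0))

def t_and_p(b_j, se_j):
    # t-value = parameter estimate / standard error; two-sided p-value
    t = b_j / se_j
    return t, 2.0 * t_cdf_nu2(-abs(t))

# hypothetical estimate and standard error (illustrative numbers only)
t, p = t_and_p(19.01, 8.42)
print(f"t = {t:.4f}, p = {p:.4f}")
```

For other degrees of freedom the CDF has no such simple closed form and is usually evaluated via the incomplete beta function.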
The predict Method
Finally, given a new data vector z, the predict method may be used to predict its response value.
1 val z = VectorD (1.0 , 30.0 , 1000.0 , 100.0)
2 println ( s" predict (z) = ${ mod . predict ( z ) } " )
Feature Selection
Feature selection (or Variable Selection) may be carried out by using either forwardSel or backwardElim.
These methods add or remove one variable at a time. To iteratively add or remove, the following methods
may be called.
1 mod . forwardSelAll ( cross = false )
2 mod . ba ckw ar dE l im Al l ( cross = false )
3 mod . s t e p R e g r e s s i o n A l l ( cross = false )
The default criterion for choosing which variable to add/remove is Adjusted R². It may be changed via the idx_q parameter to the methods (see the Fit trait for the possible values for this parameter). Note: cross-validation is turned off (cross = false) due to the small size of the dataset.
The source code for the Texas Temperatures example is a test case in Regression.scala.
Class Methods:
1 @ param x the data / input m - by - n matrix
2 ( augment with a first column of ones to include intercept in model )
3 @ param y the response / output m - vector
4 @ param fname_ the feature / variable names ( defaults to null )
5 @ param hparam the hyper - parameters ( defaults to Regression . hp )
6
6.6.15 Exercises
1. For Exercise 1 from the last section, compute A = XᵀX and z = Xᵀy. Now solve the following linear system of equations for b.
Ab = z
2. Gradient descent can be used for Multiple Linear Regression as well. For gradient descent, pick a starting point b0, compute the gradient of the loss function ∇L and move −η∇L from b0, where η is the learning rate. Write a Scala program that repeats this for several iterations for the above data. What is happening to the value of the loss function L?
$$\nabla L = -X^{\top} (y - Xb)$$
Starting with data matrix x, response vector y and parameter vector b, in ScalaTion, the calculations
become
1 val yp = x * b // y predicted
2 val e = y - yp // error
3 val g = x .T * e // - gradient
4 b += g * eta // update parameter b
5 val h = 0.5 * ( e dot e ) // half the sum of squared errors
Unless the dataset is normalized, finding an appropriate learning rate eta may be difficult. See the
MatrixTransform object for details. Do this for the Blood Pressure Example BPressure dataset. Try
using another dataset.
3. Consider the relationships between the predictor variables and the response variable in the AutoMPG dataset. This is a well-known dataset that is available at multiple websites including the UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Auto+MPG. The response variable
is the miles per gallon (mpg: continuous) while the predictor variables are cylinders: multi-valued
discrete, displacement: continuous, horsepower: continuous, weight: continuous, acceleration:
continuous, model year: multi-valued discrete, origin: multi-valued discrete, and car name: string
(unique for each instance). Since the car name is unique and obviously not causal, this variable is
eliminated, leaving seven predictor variables. First compute the correlations between mpg (vector y)
and the seven predictor variables (each column vector x:j in matrix X).
1 val correlation = y corr x_j
and then plot mpg versus each of the predictor variables. The source code for this example is at
http://www.cs.uga.edu/~jam/scalation_2.0/src/main/scala/scalation/modeling/Example_AutoMPG.
scala .
Alternatively, a .csv file containing the AutoMPG dataset may be read into a relation called auto tab from which data matrix x and response vector y may be produced. If the dataset has missing values, they may be replaced using a spreadsheet or using the techniques discussed in the Data Preprocessing Chapter.
1 val auto_tab = Relation ( BASE_DIR + " auto - mpg . csv " , " auto_mpg " , null , -1)
2 val (x , y ) = auto_tab . toMatrixDD (1 to 6 , 0)
3 println ( s" x = $x" )
4 println ( s" y = $y" )
4. Apply Regression analysis on the AutoMPG dataset. Compare with the results of applying the NullModel, SimplerRegression and SimpleRegression. Try using SimplerRegression and SimpleRegression with different predictor variables for these models. How do their R² values compare to the correlation analysis done in the previous exercise?
5. Examine the collinearity and multi-collinearity of the column vectors in the AutoMPG dataset.
6. For the AutoMPG dataset, repeatedly call the backwardElim method to remove the predictor variable
that contributes the least to the model. Show how the various quality of fit (QoF) measures change as
variables are eliminated. Do the same for the forwardSel method. Using R̄2 , select the best models
from the forward and backward approaches. Are they the same?
7. Compare model assessment and model validation. Compute sse, mse and R2 for the full and best
AutoMPG models trained on the entire data set. Compare this with the results of Leave-One-Out,
5-fold Cross-Validation and 10-fold Cross-Validation.
8. Recall the formula for the inflated variance of parameter bj:

$$\hat{\sigma}^2(b_j) = \frac{mse}{k\,\hat{\sigma}^2(x_j)} \cdot \mathrm{vif}(b_j)$$

Derive this formula. The standard error is the square root of this value. Use the estimate for bj and its standard error to compute a t-value and p-value for the estimate. Run the AutoMPG model and explain these values produced by the summary method.
9. Singular Value Decomposition Technique. In cases where the rank of the data/input matrix X
is not full or its multi-collinearity is high, a useful technique to solve for the parameters of the model is
Singular Value Decomposition (SVD). Based on the derivation given in http://www.ime.unicamp.br/
~marianar/MI602/material%20extra/svd-regression-analysis.pdf, we start with the equation
estimating y as the product of the data matrix X and the parameter vector b.
y = Xb
$$X = U \Sigma V^{\top}$$

where in the full-rank case, U is an m-by-n orthogonal matrix, Σ is an n-by-n diagonal matrix of singular values, and Vᵀ is an n-by-n orthogonal matrix. The rank r = rank(X) equals the number of nonzero singular values in Σ, so in general, U is m-by-r, Σ is r-by-r, and Vᵀ is r-by-n. The singular values are the square roots of the nonzero eigenvalues of XᵀX. Substituting for X yields

$$y = U \Sigma V^{\top} b$$
Defining d = ΣVᵀb, we may write

$$y = U d$$
This can be viewed as an estimating equation where X is replaced with U and b is replaced with d.
Consequently, a least squares solution for the alternate parameter vector d is given by
$$d = (U^{\top} U)^{-1} U^{\top} y$$

Since UᵀU = I, this reduces to

$$d = U^{\top} y$$

and, since d = ΣVᵀb, the original parameter vector is recovered as

$$b = V \Sigma^{-1} d$$

where Σ⁻¹ is a diagonal matrix whose main-diagonal elements are the reciprocals of the singular values.
10. Improve Stepwise Regression. Write ScalaTion code to improve the stepRegressionAll method by implementing the swapping operation. Then redo exercise 6 using all three: Forward Selection, Backward Elimination, and Stepwise Regression with all four criteria: R², R̄², R²cv, and AIC. Plot the curve for each criterion, determine the best number of variables and what these variables are. Compare the four criteria.
As part of a larger project compare this form of feature selection with that provided by Ridge Regression
and Lasso Regression. See the next two sections.
Now add features including quadratic terms, cubic terms, and dummy variables to the model using
SymbolicRegression.quadratic, SymbolicRegression.cubic, and RegressionCat. See the subse-
quent sections.
In addition to the AutoMPG dataset, use the Concrete dataset and three more datasets from UCI
Machine Learning Repository. The UCI datasets should have more instances (m) and variables (n)
than the first two datasets. The testing should also be done in R or Python.
11. Regression as Projection. Consider the following six vectors/points in 3D space where the response
variable y is modeled as a linear function of predictor variables x1 and x2 .
1 //                        x1   x2   y
2 val xy = MatrixD ((6 , 3) , 1 , 1 , 2.8 ,
3                            1 , 2 , 4.2 ,
4                            1 , 3 , 4.8 ,
5                            2 , 1 , 5.3 ,
6                            2 , 2 , 5.5 ,
7                            2 , 3 , 6.5)
[Figure: the six points plotted in 3D space over the (x1, x2) plane]

$$y = b_0 x_1 + b_1 x_2 + \epsilon$$
Determine the plane (response surface) that these six points are projected onto.
ŷ = b0 x1 + b1 x2
For this problem, the number of instances m = 6 and the number of parameters/predictor variables
n = 2. Determine the number of Degrees of Freedom for the model dfm and the number of Degrees of
Freedom for the residuals/errors df .
12. Given a data matrix X ∈ R^{m×2} and response vector y ∈ R^m where X = [1, x], compute XᵀX and Xᵀy. Use these to set up an augmented matrix and then apply LU Factorization to make it upper triangular. Solve for the parameters b0 and b1 symbolically. Simplify to reproduce the formulas for b0 and b1 for Simple Regression.
13. Recall that ŷ = Hy where the hat matrix is H = X(XᵀX)⁻¹Xᵀ. The leverage of point i is defined to be hii.

$$h_{ii} = x_i^{\top} (X^{\top} X)^{-1} x_i \qquad (6.61)$$
The main diagonal of the hat matrix gives the leverage for each of the points. Points with high leverage
are those above a threshold such as

$$h_{ii} \ge \frac{2\,\mathrm{tr}(H)}{m} \qquad (6.62)$$
Note that the trace tr(H) = rank(H) = rank(X) will equal n when X has full rank. List the high leverage points for the Example AutoMPG dataset.
14. Points that are influential in determining values for model coefficients/parameters combine high lever-
age with large residuals. Measures of influence include Cook’s Distance, DFFITS, and DFBETAS
[34] and see http://home.iitk.ac.in/~shalab/regression/Chapter6-Regression-Diagnostic%
20for%20Leverage%20and%20Influence.pdf. These measures can also be useful in detecting po-
tential outliers. Compute these measures for the Example AutoMPG dataset.
15. The best two predictor variables for AutoMPG are weight and model year and, with the weight given in units of 1000 pounds, the prediction equation for the Regression model (with intercept) is plotted below.

[Figure: response surface for mpg (y) over weight (x1, in 1000s of pounds) and model year (x2)]
Make a plot of the hyperplane for the second best combination of features. Compare the QoF of these
two models and explain how the feature combinations affect the response variable (mpg).
16. State and explain the conditions required for the Ordinary Least Squares (OLS) estimate of parameter
vector b for multiple linear regression to be B.L.U.E. See the Gauss-Markov Theorem. B.L.U.E. stands
for Best Linear Unbiased Estimator.
6.7 Ridge Regression
The RidgeRegression class supports multiple linear ridge regression. As with Regression, the predictor
variables x are multi-dimensional [x1 , . . . , xk ], as are the parameters b = [b1 , . . . , bk ]. Ridge regression adds
a penalty based on the `2 norm of the parameters b to reduce the chance of them taking on large values
that may lead to less robust models.
The penalty holds down the values of the parameters and this may result in several advantages: (1) better out-of-sample (e.g., cross-validation) quality of fit, (2) reduced impact of multi-collinearity, (3) turning singular matrices non-singular, and to a limited extent (4) eliminating features/predictor variables from the model.
The penalty is not to be included on the intercept parameter b0 , as this would shift predictions in a way
that would adversely affect the quality of the model. See the exercise on scale invariance.
$$y = b \cdot x + \epsilon = b_1 x_1 + \cdots + b_k x_k + \epsilon \qquad (6.63)$$

where ε represents the residuals (the part not explained by the model).
6.7.2 Training
Centering the dataset (X, y) has the following effects: First, when the X matrix is centered, the intercept
b0 = µy . Second, when y is centered, µy becomes zero, implying b0 = 0. To rescale back to the original
response values, µy can be added back during prediction. Therefore, both the data/input matrix X and the
response/output vector y should be centered (zero mean).
The regularization of the model adds an ℓ2-penalty on the parameters b. The objective function to minimize is now the loss function L(b) = ½ sse plus the ℓ2-penalty.

$$f_{obj} = L(b) + \frac{\lambda}{2} \|b\|^2 = \frac{1}{2}\,\epsilon \cdot \epsilon + \frac{\lambda}{2}\, b \cdot b \qquad (6.66)$$
where λ is the shrinkage parameter. A large value for λ will drive the parameters b toward zero, while a
small value can help stabilize the model (e.g., for nearly singular matrices or high multi-collinearity).
$$f_{obj} = \frac{1}{2} (y - Xb) \cdot (y - Xb) + \frac{\lambda}{2}\, b \cdot b \qquad (6.67)$$
2 2
6.7.3 Optimization
Fortunately, the quadratic nature of the penalty function allows it to be combined easily with the quadratic
error terms, so that matrix factorization can still be used for finding optimal values for parameters.
Taking the gradient of the objective function fobj with respect to b and then setting it equal to zero
yields
$$-X^{\top} (y - Xb) + \lambda b = 0 \qquad (6.68)$$
Recall the first term of the gradient was derived in the Regression section. See the exercises below for
deriving the last term of the gradient. Multiplying out gives,
$$-X^{\top} y + (X^{\top} X) b + \lambda b = 0$$
$$(X^{\top} X) b + \lambda b = X^{\top} y$$
$$(X^{\top} X + \lambda I) b = X^{\top} y \qquad (6.69)$$
Matrix factorization may now be used to solve for the parameters b in the modified Normal Equations. For
example, use of matrix inversion yields,
$$b = (X^{\top} X + \lambda I)^{-1} X^{\top} y \qquad (6.70)$$
For Cholesky factorization, one may compute XᵀX and simply add λ to each of the diagonal elements (i.e., along the ridge). QR and SVD factorizations require similar, but slightly more complicated, modifications.
Note, use of SVD can improve the efficiency of searching for an optimal value for λ [71, 196].
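The modified Normal Equations can be sketched directly in Python (an illustrative implementation with a made-up helper and toy centered dataset, not ScalaTion's RidgeRegression):

```python
def solve(A, c):
    # solve A x = c by Gaussian elimination with partial pivoting
    n = len(A)
    M = [A[i][:] + [c[i]] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for k in range(i, n + 1):
                M[r][k] -= f * M[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ridge(X, y, lam):
    # solve the modified normal equations (X^T X + lambda I) b = X^T y
    m, n = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(m)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    z = [sum(X[r][i] * y[r] for r in range(m)) for i in range(n)]
    return solve(A, z)

# centered toy data: a larger lambda shrinks the l2 norm of b
X = [[-2.0, -1.0], [-1.0, 1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-4.2, -1.9, 0.2, 1.8, 4.1]
norm = lambda b: sum(v * v for v in b) ** 0.5
b0, b1 = ridge(X, y, 0.0), ridge(X, y, 10.0)
print(norm(b0), norm(b1))   # the second norm is smaller
```

With λ = 0 this reduces to ordinary least squares; increasing λ adds to the diagonal (the "ridge") and pulls the parameters toward zero.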
6.7.4 Centering
Before creating a RidgeRegression model, the X data matrix and the y response vector should be centered.
This is accomplished by subtracting the means (vector of column means for X and a mean value for y).
1 val mu_x = x . mean // column - wise mean of x
2 val mu_y = y . mean // mean of y
3 val x_c = x - mu_x // centered x ( column - wise )
4 val y_c = y - mu_y // centered y
The centered matrix x_c and centered vector y_c are then passed into the RidgeRegression constructor.
1 val mod = new RidgeRegression ( x_c , y_c )
2 mod . trainNtest ()
Now, when making predictions, the new data vector z needs to be centered by subtracting mu_x. Then the predict method is called, after which the mean of y is added.
1 val z_c = z - mu_x                       // center z first
2 yp = mod . predict ( z_c ) + mu_y       // predict z_c and add y ’s mean
3 println ( s" predict (z) = $yp " )
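The center/uncenter arithmetic is easy to verify in Python; the means and parameter vector below are made-up numbers standing in for quantities computed during training:

```python
# hypothetical column means of X, mean of y, and a parameter vector b that
# was fitted on the centered data (all numbers made up for illustration)
mu_x = [1970.5, 3.2]
mu_y = 25.0
b    = [0.8, -4.5]

def predict(z):
    # center the new point, apply the centered model, then add back y's mean
    z_c = [z[i] - mu_x[i] for i in range(len(z))]
    return sum(b[i] * z_c[i] for i in range(len(z_c))) + mu_y

print(predict([1972.0, 3.0]))
```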
6.7.5 The λ Hyper-parameter
The value for λ can be user specified (typically a small value) or chosen by a method like findLambda. It
finds a roughly optimal value for the shrinkage parameter λ based on the cross-validated sum of squared
errors sse cv. The search starts with the low default value for λ and then doubles it with each iteration,
returning the minimizing λ and its corresponding cross-validated sse. A more precise search could be used
to provide a better value for λ.
1 def findLambda : ( Double , Double ) =
2     var l = lambda                            // start with a small default value
3     var l_best = l
4     var sse = Double . MaxValue
5     for i <- 0 to 20 do
6         RidgeRegression . hp ( " lambda " ) = l
7         val mod = new RidgeRegression (x , y )
8         val stats = mod . crossValidate ()
9         val sse2 = stats ( QoF . sse . ordinal ) . mean
10        banner ( s" RidgeRegression with lambda = ${ mod . lambda_ } has sse = $sse2 " )
11        if sse2 < sse then { sse = sse2 ; l_best = l }
12        l *= 2
13    end for
14    ( l_best , sse )                          // best lambda and its sse_cv
15 end findLambda
Third, predict a value for new input vector z using each model.
1 banner ( " Make Predictions " )
2 val z = VectorD (20.0 , 80.0) // new instance to predict
3 val _1z = VectorD .++ (1.0 , z ) // prepend 1 to z
4 val z_c = z - mu_x // center z
5 println ( s" rg . predict (z) = ${ rg . predict ( _1z ) } " )          // predict using _1z
6 println ( s" mod . predict (z) = ${ mod . predict ( z_c ) + mu_y } " ) // predict using z_c and add y ’s mean
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -281.426985 835.349154 -0.336897 0.768262 NA
x1 -7.611030 8.722908 -0.872534 0.474922 3.653976
x2 19.010291 8.423716 2.256758 0.152633 3.653976
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -7.611271 8.722908 -0.872561 0.474910 NA
x1 19.009947 8.423716 2.256717 0.152638 3.653976
Notice there is very little difference between the two models. Try increasing the value of the shrinkage
hyper-parameter λ beyond its default value of 0.01. This example can be run as follows:
$ sbt
sbt> runMain scalation.modeling.ridgeRegressionTest
Automatic Centering
ScalaTion provides factory methods, apply and center, in the RidgeRegression companion object that center the data for the user.
1 // val mod = RidgeRegression ( xy , fname )          // apply takes a combined matrix xy
2 val mod = RidgeRegression . center (x , y , fname )  // center takes a matrix x and vector y
3 mod . trainNtest () ()
4 val yp = mod . predict ( z - x . mean ) + y . mean
The user must still center any vectors passed into the predict method and add back the response mean at
the end, e.g., pass z - x.mean and add back y.mean.
Note, care should be taken regarding x.mean and y.mean when performing validation or cross-validation. The means for the full, training and testing sets may differ.
Class Methods:
1 @ param x the centered data / input m - by - n matrix NOT augmented with a column of 1 s
2 @ param y the centered response / output m - vector
3 @ param fname_ the feature / variable names ( defaults to null )
4 @ param hparam the shrinkage hyper - parameter , lambda (0 = > OLS ) in the penalty term
5 ’ lambda * b dot b ’
6
6.7.8 Exercises
1. Based on the example given in this section, try increasing the value of the hyper-parameter λ and
examine its effect on the parameter vector b, the quality of fit and predictions made.
1 import RidgeRegression . hp
2
Alternatively,
1 hp ( " lambda " ) = 1.0
See the HyperParameter class in the scalation.modeling package for details.
2. For the AutoMPG dataset, use the findLambda method to find a value for λ that roughly minimizes out-of-sample sse cv based on using the crossValidate method. Plot sse cv vs. λ.
3. Why is it important to center (zero mean) both the data matrix X and the response vector y? What
is scale invariance and how does it relate to centering the data?
4. The Degrees of Freedom (DoF) used in ScalaTion’s RidgeRegression class is approximate. As the
shrinkage parameter λ increases, the effective DoF (eDoF) should be used instead. A general definition of effective DoF is the trace tr of the hat matrix H = X(XᵀX + λI)⁻¹Xᵀ

$$df_m^{\,eff} = \mathrm{tr}(H)$$
Read [86] and explain the difference between DoF and effective DoF (eDoF) for Ridge Regression.
5. A matrix that is close to singularity is said to be ill-conditioned. The condition number κ of a matrix A (e.g., A = XᵀX) is defined as follows:

$$\kappa = \|A\| \, \|A^{-1}\| \ge 1$$
When κ becomes large the matrix is considered to be ill-conditioned. In such cases, it is recommended to use QR or SVD Factorization for Least-Squares Regression [39]. Compute the condition number of XᵀX for various datasets.
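As a small sketch for this exercise, the 2-by-2 case can be computed directly in Python; note this sketch uses the Frobenius norm as an easy stand-in for the spectral norm, so κ(I) comes out as 2 rather than the ideal lower bound of 1 (an assumption of the sketch, not the textbook definition):

```python
import math

def fro(M):
    # Frobenius norm of a matrix
    return math.sqrt(sum(v * v for row in M for v in row))

def cond_2x2(A):
    # condition number kappa = ||A|| * ||A^{-1}|| for a 2-by-2 matrix,
    # using the Frobenius norm (kappa(I) = 2 here, not 1 as with the
    # spectral norm)
    (a, b), (c, d) = A
    det = a * d - b * c
    Ainv = [[d / det, -b / det], [-c / det, a / det]]
    return fro(A) * fro(Ainv)

print(cond_2x2([[1.0, 0.0], [0.0, 1.0]]))       # well-conditioned
print(cond_2x2([[1.0, 0.999], [0.999, 1.0]]))   # nearly singular -> huge
```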
6. For the last term of the gradient of the objective function, show that

$$\frac{\partial}{\partial b_j} \frac{\lambda}{2}\, b \cdot b = \frac{\partial}{\partial b_j} \frac{\lambda}{2} \sum_i b_i^2 = \lambda b_j$$

Put these together to show that ∇ (λ/2) b·b = λb.
2
7. For over-parameterized (or under-determined) regression where n > m (number of parameters > number of instances), it is common to seek a min-norm solution.

$$b = X^{\top} (X X^{\top})^{-1} y$$
8. Compare different algorithms for finding a suitable value for the shrinkage parameter λ.
Hint: see Lecture Notes on Ridge Regression - https://arxiv.org/pdf/1509.09169.pdf - [196].
6.8 Lasso Regression
The LassoRegression class supports multiple linear regression using the Least absolute shrinkage and selection operator (Lasso) that constrains the values of the b parameters and effectively sets those with low impact to zero (thereby deselecting such variables/features). Rather than using an ℓ2-penalty (Euclidean norm) like RidgeRegression, it uses an ℓ1-penalty (Manhattan norm). In RidgeRegression when bj approaches zero, bj² becomes very small and has little effect on the penalty. For LassoRegression, the effect based on |bj| will be larger, so it is more likely to set parameters to zero. See section 6.2.2 in [85] for a more detailed explanation on how LassoRegression can eliminate a variable/feature by setting its parameter/coefficient to zero.
y = b · x + ε = b0 + b1 x1 + ... + bk xk + ε (6.71)
where ε represents the residuals (the part not explained by the model). See the exercise that considers
whether to include the intercept b0 in the shrinkage.
6.8.2 Training
The regularization of the model adds an ℓ1 -penalty on the parameters b. The objective function to minimize
is now the loss function L(b) = ½ sse plus the penalty.
fobj = ½ sse + λ ‖b‖1 = ½ ‖ε‖2^2 + λ ‖b‖1 (6.72)
where λ is the shrinkage parameter. Substituting ε = y − Xb yields,
fobj = ½ ‖y − Xb‖2^2 + λ ‖b‖1 (6.73)
Replacing the norms with dot products gives,
fobj = ½ (y − Xb) · (y − Xb) + λ 1 · |b| (6.74)
Although similar to the ℓ2 -penalty used in Ridge Regression, it may often be more effective. Still, the
ℓ1 -penalty for Lasso has a disadvantage: the absolute values in the ℓ1 norm make the objective function
non-differentiable.
λ 1 · |b| = λ Σ_{j=0}^{k} |bj | (6.75)
Therefore, the straightforward strategy of setting the gradient equal to zero to develop appropriate modified
Normal Equations that allow the parameters to be determined by matrix factorization will no longer work.
Instead, the objective function needs to be minimized using a search-based optimization algorithm.
6.8.3 Optimization Strategies
There are multiple optimization algorithms that can be applied for parameter estimation in Lasso Regression.
Coordinate Descent
Coordinate Descent attempts to optimize one variable/feature at a time (repeated one dimensional optimiza-
tion). For normalized data the following algorithm has been shown to work: https://xavierbourretsicotte.
github.io/lasso_implementation.html.
ScalaTion uses the Alternating Direction Method of Multipliers (ADMM) [22] algorithm to optimize the b
parameter vector. The algorithm for using ADMM for Lasso Regression is outlined in section 6.4 of Boyd [22]
(https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). Optimization problems in ADMM
form separate the objective function into two parts f and g.
For Lasso Regression, the f function will capture the loss function (½ sse), while the g function will capture
the ℓ1 regularization, i.e.,
f (b) = ½ ‖y − Xb‖2^2 , g(z) = λ ‖z‖1 (6.77)
Introducing z allows the functions to be separated, while the constraint keeps z and b close. Therefore, the
iterative step in the ADMM optimization algorithm becomes
b = (X^T X + ρI)^(-1) (X^T y + ρ(z − u))
z = S_{λ/ρ} (b + u)
u = u + b − z
where u is the vector of Lagrange multipliers and S_{λ/ρ} is the soft thresholding function.
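The three updates above can be sketched directly in NumPy (for illustration only; ScalaTion's own ADMM implementation differs in details such as stopping criteria):

```python
import numpy as np

def soft(v: np.ndarray, k: float) -> np.ndarray:
    """Soft-thresholding operator S_k, applied element-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(X, y, lam=0.1, rho=1.0, iters=2000):
    """Iterate the three ADMM updates; returns the sparse iterate z."""
    n = X.shape[1]
    b = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    A = X.T @ X + rho * np.eye(n)        # factor/solve target stays fixed
    Xty = X.T @ y
    for _ in range(iters):
        b = np.linalg.solve(A, Xty + rho * (z - u))   # b-update
        z = soft(b + u, lam / rho)                    # z-update (shrinkage)
        u = u + b - z                                 # dual update
    return z

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.0, -2.0])
print(admm_lasso(X, y, lam=5.0))   # the zero coefficient tends to stay at zero
```

For λ near zero, the iterates converge to the OLS solution; larger λ drives low-impact coefficients exactly to zero via the soft threshold.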
6 LassoRegression . hp ( " lambda " ) = l
7 val mod = new LassoRegression (x , y )
8 val stats = mod . crossValidate ()
9 val sse2 = stats ( QoF . sse . ordinal ) . mean
10 banner ( s " LassoRegression with lambda = ${ mod . lambda_ } has sse = $sse2 " )
11 if sse2 < sse then
12 sse = sse2 ; l_best = l
13 end if
14 Fit . showQofStatTable ( stats )
15 l *= 2
16 end for
17 ( l_best , sse )
18 end findLambda
As the default value for the shrinkage/penalty parameter λ is very small, the optimal solution will be
close to the Ordinary Least Squares (OLS) solution shown in green at b = [b1 , b2 ] = [3, 1] in Figure 6.5.
Increasing the penalty parameter will pull the optimal b towards the origin. At any given point in the plane,
the objective function is the sum of the loss function L(b) and the penalty function p(b). The contours in
blue show points of equal height for the penalty function, while those in black show the same for the loss
function. Suppose for some λ the point [2, 0] is this penalized optimum. This would mean that moving
toward the origin would be non-productive, as the increase in the loss would exceed the drop in the penalty.
On the other hand, moving toward [3, 1] would be non-productive as the increase in the penalty would exceed
the drop in the loss. Notice in this case that the penalty has pulled the b2 parameter to zero (an example of
feature selection). Ridge regression will be less likely to pull a parameter to zero, as its contours are circles
rather than diamonds. Lasso regression’s contours have sharp points on the axis which thereby increase the
chance of intersecting a loss contour on an axis.
[Figure 6.5: contours of the loss function (black) and the ℓ1 penalty (blue) in the (b1 , b2 ) plane]
6.8.5 Regularized and Robust Regression
Regularized and Robust Regression are useful in many cases including high-dimensional data, correlated
data, non-normal data and data with outliers [70]. These techniques work by adding ℓ1 and/or ℓ2 -penalty
terms to shrink the parameters and/or changing from an ℓ2 to an ℓ1 loss function. Modeling techniques include
Ridge, Lasso, Elastic Nets, Least Absolute Deviation (LAD) and Adaptive LAD [70].
Class Methods:
1 @ param x the data / input m - by - n matrix
2 @ param y the response / output m - vector
3 @ param fname_ the feature / variable names ( defaults to null )
4 @ param hparam the shrinkage hyper - parameter , lambda (0 = > OLS ) in the penalty term
5 ’ lambda * b dot b ’
6
6.8.7 Exercises
1. Compare the results of LassoRegression with those of Regression and RidgeRegression. Examine
the parameter vectors, quality of fit and predictions made.
1 // 5 data points : one x_0 x_1
2 val x = MatrixD ((5 , 3) , 1.0 , 36.0 , 66.0 , // 5 - by -3 matrix
3 1.0 , 37.0 , 68.0 ,
4 1.0 , 47.0 , 64.0 ,
5 1.0 , 32.0 , 53.0 ,
6 1.0 , 1.0 , 101.0)
7 val y = VectorD (745.0 , 895.0 , 442.0 , 440.0 , 1598.0)
8 val z = VectorD (1.0 , 20.0 , 80.0)
9
10 // Create a LassoRegression model
11
2. Based on the last exercise, try increasing the value of the hyper-parameter λ and examine its effect on
the parameter vector b, the quality of fit and predictions made.
1 import LassoRegression . hp
2
3. Using the above dataset and the AutoMPG dataset, determine the effects of (a) centering the data
(µ = 0), (b) standardizing the data (µ = 0, σ = 1).
1 import MatrixTransforms . _
2
4. Explain how the Coordinate Descent Optimization Algorithm works for Lasso Regression. See
https://xavierbourretsicotte.github.io/lasso_implementation.html.
5. Explain how the ADMM Optimization Algorithm works for Lasso Regression. See
https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf.
6. Compare LassoRegression with Regression that uses forward selection or backward elimination
for feature selection. What are the advantages and disadvantages of each for feature selection?
7. Compare LassoRegression with Regression on the AutoMPG dataset. Specifically, compare the
quality of fit measures as well as how well feature selection works.
8. Show that the contour curves for the Simple Regression loss function L(b0 , b1 ) are ellipses. The general
equation of an ellipse centered at (h, k) is (x − h)^2 /a^2 + (y − k)^2 /b^2 = 1.
9. Elastic Nets combine both ℓ2 and ℓ1 penalties to try to combine the best features of both RidgeRegression
and LassoRegression. Elastic Nets naturally include two shrinkage parameters, λ1 and λ2 . Is the
additional complexity worth the benefits?
10. Regularization using Lasso has the nice property of being able to force parameters/coefficients to zero,
but this may require a large shrinkage hyper-parameter λ that shrinks non-zero coefficients more than
desired. Newer regularization techniques reduce the shrinkage effect compared to Lasso, by having a
penalty profile that matches Lasso for small coefficients, but is below Lasso for large coefficient values.
Make a plot of the penalty profiles for Lasso, Smoothly Clipped Absolute Deviations (SCAD) and
Minimax Concave Penalty (MCP).
6.8.8 Further Reading
1. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
[22]
6.9 Quadratic Regression
The quadratic method in the SymbolicRegression object adds quadratic terms into the model. It can
often be the case that the response variable y will have a nonlinear relationship with one or more of the
predictor variables xj . The simplest such nonlinear relationship is a quadratic relationship. Looking at a plot of y vs.
xj , it may be evident that a bending curve will fit the data much better than a straight line. For example,
a particle under constant acceleration will have a position that changes quadratically with time.
When there is only one predictor variable x, the response y is modeled as a quadratic function of x
(forming a parabola).
y = b0 + b1 x + b2 x^2 + ε (6.79)
The quadratic method achieves this simply by expanding the data matrix. From the dataset (initial
data matrix), all columns will have another column added that contains the values of the original column
squared. It is important that the initial data matrix has no intercept. The expansion will optionally
add an intercept column (column of all ones). Since 1^2 = 1, a ones column and its square would be perfectly
collinear and make the matrix singular, if the user were to include a ones column.
where x′ = [1, x1 , x2 , x1^2 , x2^2 ], b = [b0 , b1 , b2 , b3 , b4 ], and ε represents the residuals (the part not explained by
the model).
The number of terms (nt) in the model increases linearly with the dimensionality of the space (n)
according to the following formula:
nt = 2n + 1 e.g., nt = 5 for n = 2
Each column in the initial data matrix is expanded into two in the expanded data matrix and an intercept
column is optionally added.
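The expansion itself is simple to state concretely. The following NumPy sketch (illustrative only, not ScalaTion's implementation) appends a squared companion for every column and prepends the optional intercept, giving 2n + 1 columns in total.

```python
import numpy as np

def expand_quadratic(X: np.ndarray, intercept: bool = True) -> np.ndarray:
    """Return [1 | X | X^2]: original columns, their squares, optional ones."""
    XX = np.hstack([X, X**2])
    if intercept:
        XX = np.column_stack([np.ones(X.shape[0]), XX])
    return XX

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
XX = expand_quadratic(X)
print(XX.shape)   # nt = 2n + 1 columns
```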
10 val (x , y ) = ( xy . not (? , 1) , xy (? , 1) ) // x excludes column 1 , y is column 1
11 val ox = VectorD . one ( xy . dim ) +^: x // prepend a column of all ones
12
Now compare their summary results. The summary results for the Regression model are shown below:
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -13.285714 5.154583 -2.577457 0.041913 NA
x1 8.285714 1.020760 8.117205 0.000188 1.000000
The summary results for the SymbolicRegression.quadratic model are given here:
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 4.035714 3.873763 1.041807 0.345231 NA
x1 -2.107143 1.975007 -1.066904 0.334798 21.250000
x2 1.154762 0.214220 5.390553 0.002965 21.250000
The summary results for the SymbolicRegression.quadratic model highlight a couple of important
issues:
Try eliminating x1 to see if these two improve without much of a drop in Adjusted R-squared R̄2 . Note,
eliminating x1 makes the model non-hierarchical (see the exercises). Figure 6.6 shows the predictions (yp)
of the Regression and quadratic models.
Figure 6.6: Actual y (red) vs. Regression yp (green) vs. quadratic yp (blue)
The quadratic method in the SymbolicRegression object creates a Regression object that uses mul-
tiple regression to fit a quadratic surface to the data.
Method:
1 @ param x the initial data / input m - by - n matrix ( before quadratic term expansion )
2 must not include an intercept column of all ones
3 @ param y the response / output m - vector
4 @ param fname the feature / variable names ( defaults to null )
5 @ param intercept whether to include the intercept term ( column of ones ) _1
6 ( defaults to true )
7 @ param cross whether to include 2 - way cross / interaction terms x_i x_j
8 ( defaults to false )
9 @ param hparam the hyper - parameters ( defaults to Regression . hp )
10
The apply method is defined in the SymbolicRegression object. The Set (1, 2) specifies that first
(Linear) and second (Quadratic) order terms will be included in the model. The intercept flag indicates
whether a column of ones will be added to the input/data matrix.
The next few modeling techniques described in subsequent sections support the development of low-order
multi-dimensional polynomial regression models. Higher order polynomial regression models are typically
restricted to one-dimensional problems (see the PolyRegression class).
Model Equation
In two dimensions (2D) where x = [x1 , x2 ], the quadratic cross model/regression equation is the following:
The number of terms (nt) in the model increases quadratically with the dimensionality of the space (n)
according to the formula for triangular numbers shifted by (n → n + 1).
nt = (n+2 choose 2) = (n + 2)(n + 1)/2 e.g., nt = 6 for n = 2 (6.83)
This result may be derived by summing the number of constant terms (1), linear terms (n), quadratic terms
(n), and cross terms ((n choose 2) = n(n − 1)/2).
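This counting argument is easy to verify with Python's `math.comb` (a check outside the text, shown for several dimensions):

```python
from math import comb

# nt for quadratic regression with cross terms in n dimensions
for n in range(1, 6):
    nt = comb(n + 2, 2)              # triangular number shifted by n -> n + 1
    parts = (1, n, n, comb(n, 2))    # constant, linear, quadratic, cross
    print(n, nt, sum(parts))
```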
Such models generalize quadratic by introducing cross terms, e.g., x1 x2 . Adding cross terms makes the
number of terms increase quadratically rather than linearly with the dimensionality. Consequently, multi-
collinearity problems (check VIF scores) may be intensified and the need for feature selection, therefore,
increases.
y = f (x1 , x2 ) + (6.84)
For example, a model with two predictor variables and one response variable may be displayed in three
dimensions. Such a response surface can also be shown in two dimensions using contour plots where a
contour/curve shows points of equal height. Figure 6.7 shows three types of contours that represent the
types of terms in quadratic regression (1) linear terms, (2) quadratic terms, and (3) cross terms. In the
figure, the first green line is for x1 + x2 = 4, the first blue curve is for x21 + x22 = 16, and the first red curve
is for x1 x2 = 4.
[Figure 6.7: contour curves in the (x1 , x2 ) plane for linear (green), quadratic (blue) and cross (red) terms]
A constant term simply moves the whole response surface up or down. The coefficients for each of terms can
rotate and stretch these curves.
The response surface for Quadratic Regression on AutoMPG based on the best combination of features,
weight and modelyear, is shown in Figure 6.8.
[Figure 6.8: response surface for Quadratic Regression on AutoMPG using weight (x1 ) and modelyear (x2 )]
6.9.6 Exercises
1. Enter the x, y dataset from the example given in this section and use it to create a quadratic model.
Show the expanded input/data matrix and the response vector using the following two print statements.
1 val qrg = SymbolicRegression . quadratic (x , y )
2 println ( s " expanded x = ${ qrg . getX } " )
3 println ( s " y = ${ qrg . getY } " )
2. Perform Quadratic Regression on the Example BPressure dataset using the first two columns of its
data matrix x.
1 import Example_BPressure .{ x01 = > x , y }
3. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest Adjusted R-squared R̄2 ?
4. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * m * m).
1 for i <- x . indices do
2 x (i , 0) = i
3 y ( i ) = i * i + i + noise . gen
4 end for
Compare the results of Regression vs. quadratic. Compare the Quality of Fit and the parameter
values. What correspondence do the parameters have with the coefficients used to generate the data?
Plot y vs. x, yp and y vs. t for both Regression and quadratic. Also plot the residuals e vs. x for
both. Note, t is the index vector VectorD.range (0, m).
5. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid ; j <- grid do
3 x ( k ) = VectorD (i , j )
4 y ( k ) = x (k , 0) ~^ 2 + 2 * x (k , 1) + noise . gen
5 k += 1
6 end for
Compare the results of Regression vs. quadratic. Try modifying the equation for the response and
see how the Quality of Fit changes.
6. The quadratic model as well as its more complex cousin cubic may have issues with having high
multi-collinearity or high VIF values. Although high VIF values may not be a problem for predic-
tion accuracy, they can make interpretation and inferencing difficult. For the problem given in this
section, rather than adding x2 to the existing Regression model, find a second order polynomial that
could be added without causing high VIF values. VIF values are the lowest when column vectors are
orthogonal. See the section on Polynomial Regression for more details.
7. Extrapolation far from the training data can be risky for many types of models. Show how having
higher order polynomial terms in the model can increase this risk.
8. A polynomial regression model is said to be hierarchical [143, 167, 127] if it contains all terms up to
xk , e.g., a model with x, x2 , x3 is hierarchical, while a model with x, x3 is not. Show that hierarchical
models are invariant under linear transformations.
Hint: Consider the following two models where x is the distance on I-70 West in miles from the center
of Denver (junction with I-25) and y is the elevation in miles above sea level.
ŷ = b0 + b1 x + b2 x2
ŷ = b0 + b2 x2
The first model is hierarchical, while the second is not. A second study is conducted, but now the
distance z is from the junction of I-70 and I-76. A linear transformation can be used to resolve the
problem.
x = z+7
Putting z into the second model (assuming the first study indicated a linear term is not needed) gives
ŷ = b0 + b2 (z + 7)^2 = (b0 + 49 b2 ) + 14 b2 z + b2 z^2 , which reintroduces a linear term.
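The expansion of the shifted model can be checked numerically (a small NumPy sketch with arbitrary coefficient values):

```python
import numpy as np

# b0 + b2 (z+7)^2 = (b0 + 49 b2) + 14 b2 z + b2 z^2 -- the linear term reappears
b0, b2 = 1.5, -0.3
z = np.linspace(-10, 10, 21)
lhs = b0 + b2 * (z + 7)**2
rhs = (b0 + 49*b2) + 14*b2*z + b2*z**2
print(np.max(np.abs(lhs - rhs)))
```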
9. Perform quadratic and quadratic (with cross terms) regression on the Example BPressure dataset
using the first two columns of its data matrix x.
1 import Example_BPressure .{ x01 = > x , y }
10. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest Adjusted R-squared R̄2 ?
11. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid ; j <- grid do
3 x ( k ) = VectorD (i , j )
4 y ( k ) = x (k , 0) ~^ 2 + 2 * x (k , 1) + x (k , 0) * x (k , 1) + noise . gen
5 k += 1
6 end for
Compare the results of Regression, quadratic with cross = false, and quadratic with cross =
true.
12. Prove that the number of terms for a quadratic function f (x) in n dimensions is (n+2 choose 2), by decomposing
the function into its quadratic (both squared and cross), linear and constant terms,
f (x) = x^T A x + b^T x + c
where A is an n-by-n matrix, b is an n-dimensional column vector and c is a scalar. Hint: A is
symmetric, but the main diagonal is not repeated, and we are looking for unique terms (e.g., x1 x2 and
x2 x1 are treated as the same). Note, when n = 1, A and b become scalars, yielding the usual quadratic
function ax2 + bx + c.
6.10 Cubic Regression
The cubic method in the SymbolicRegression object adds cubic terms in addition to the quadratic terms
added by the quadratic method. Linear terms in a model allow for slopes and quadratic terms allow for
curvature. If the curvature changes substantially or there is an inflection point (curvature changes sign), then
cubic terms may be useful. For example, before the inflection point the curve/surface may be concave upward,
while after the point it may be concave downward, e.g., a car stops accelerating and starts decelerating.
When there is only one predictor variable x, the response y is modeled as a cubic function of x.
y = b0 + b1 x + b2 x 2 + b3 x 3 + (6.85)
The number of terms (nt) in the model still increases quadratically with the dimensionality of the space
(n) according to the formula for triangular numbers shifted by (n → n + 1) plus n for the cubic terms.
nt = (n+2 choose 2) + n = (n + 2)(n + 1)/2 + n e.g., nt = 8 for n = 2 (6.87)
When n = 10, the number of terms and corresponding parameters nt = 76, whereas for Regression,
quadratic and quadratic with cross terms and order 2, it would be 11, 21 and 66, respectively. Issues related
to negative Degrees of Freedom, over-fitting and multi-collinearity will need careful attention.
13 val rg = new Regression ( ox , y ) // create a regression model
14 rg . trainNtest () () // train and test the model
15 println ( rg . summary () ) // show summary
16
Figure 6.9 shows the predictions (yp) of the Regression, quadratic and cubic models.
Figure 6.9: Actual y (red) vs. Regression (green) vs. quadratic (blue) vs. cubic (black)
Notice the quadratic curve follows the linear curve (line), while the cubic curve more closely follows the data.
Class Methods:
1 @ param x the initial data / input m - by - n matrix ( before quadratic term expansion )
2 must not include an intercept column of all ones
3 @ param y the response / output m - vector
4 @ param fname the feature / variable names ( defaults to null )
5 @ param intercept whether to include the intercept term ( column of ones ) _1
6 ( defaults to true )
7 @ param cross whether to include 2 - way cross / interaction terms x_i x_j
8 ( defaults to false )
9 @ param cross3 whether to include 3 - way cross / interaction terms x_i x_j x_k
10 ( defaults to false )
11 @ param hparam the hyper - parameters ( defaults to Regression . hp )
12
The Set (1, 2, 3) specifies that first (Linear), second (Quadratic), and third (Cubic) order terms will
be included in the model. The intercept flag indicates whether a column of ones will be added to the
input/data matrix.
Model Equation
In two dimensions (2D) where x = [x1 , x2 ], the cubic model/regression equation with cross terms is the
following:
x′ = [1, x1 , x2 , x1^2 , x2^2 , x1^3 , x2^3 , x1 x2 , x1^2 x2 , x1 x2^2 ] expanded input vector
b = [b0 , b1 , b2 , b3 , b4 , b5 , b6 , b7 , b8 , b9 ] parameter/coefficient vector
ε = y − b · x′ error/residual
Naturally, the number of terms in the model increases cubically with the dimensionality of the space (n)
according to the formula for tetrahedral numbers shifted by (n → n + 1).
nt = (n+3 choose 3) = (n + 3)(n + 2)(n + 1)/6 e.g., nt = 10 for n = 2 (6.90)
When n = 10, the number of terms and corresponding parameters nt = 286, whereas for Regression,
quadratic, quadratic with cross, and cubic, it would be 11, 21, 66 and 76,
respectively. Issues related to negative Degrees of Freedom, over-fitting and multi-collinearity will need even
more careful attention.
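The counts quoted for n = 10 follow directly from the shifted triangular and tetrahedral number formulas; a quick check with `math.comb`:

```python
from math import comb

n = 10
counts = {
    "Regression (linear)":          n + 1,
    "quadratic":                    2 * n + 1,
    "quadratic with cross":         comb(n + 2, 2),    # shifted triangular
    "cubic (no cross)":             comb(n + 2, 2) + n,
    "cubic with cross and cross3":  comb(n + 3, 3),    # shifted tetrahedral
}
for name, nt in counts.items():
    print(f"{name}: {nt}")
```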
If polynomials of higher degree are needed, ScalaTion provides a couple of means to deal with it. First,
when the data matrix consists of a single column and x is one dimensional, the PolyRegression class may
be used. If one or two variables need higher degree terms, the caller may add these columns themselves as
additional columns in the data matrix input into the Regression class. The SymbolicRegression object
described in the next section allows the user to try many function forms.
Quadratic and Cubic Regression may fail, producing Not-a-Number (NaN) results, when a dataset contains
one or more categorical variables. For example, a variable like citizen with values “no” and “yes” is likely to be
encoded 0, 1. If such a column is squared or cubed, the new column will be identical to the original column, so that
they will be perfectly collinear. One solution is not to expand such columns. If one must, then a different
encoding may be used, e.g., 1, 2. See the section on RegressionCat for more details.
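The collinearity caused by squaring a 0/1 column is easy to demonstrate (a NumPy sketch with a made-up citizen column):

```python
import numpy as np

citizen = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(np.array_equal(citizen**2, citizen))          # squaring reproduces the column

X = np.column_stack([citizen, citizen**2])          # two identical columns
print(np.linalg.matrix_rank(X.T @ X))               # Gram matrix is singular

enc12 = citizen + 1.0                               # 1/2 encoding: 1^2 = 1 but 2^2 = 4
print(np.linalg.matrix_rank(np.column_stack([enc12, enc12**2])))
```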
6.10.5 Exercises
1. Generate and compare the model summaries produced by the three models (Regression, quadratic
and cubic) applied to the dataset given in this section.
2. An inflection point occurs when the second derivative changes sign. Find the inflection point in the
following cubic equation:
Plot the cubic function to illustrate. Explain why there are no inflection points for quadratic models.
3. Many laws in science involve quadratic and cubic terms as well as the inverses of these terms (e.g.,
inverse square laws). Find such a law and an open dataset to test the law.
4. Perform Cubic and Cubic with cross terms Regression on the Example BPressure dataset using the
first two columns of its data matrix x.
1 import Example_BPressure .{ x01 = > x , y }
5. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest Adjusted R-squared R̄2 ?
6. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid ; j <- grid do
3 x ( k ) = VectorD (i , j )
4 y ( k ) = x (k , 0) ~^ 2 + 2 * x (k , 1) + x (k , 0) * x (k , 1) + noise . gen
5 k += 1
6 end for
Compare the results of Regression, quadratic with cross = false, quadratic with cross = true,
cubic with cross = false, cubic with cross = true, and cubic with cross = true, cross3 = true.
Try modifying the equation for the response and see how the Quality of Fit changes.
6.11 Symbolic Regression
The last two sections covered Quadratic and Cubic Regression, but there are many possible functional forms.
For example, in physics, force often decreases with distance following an inverse square law. Newton’s Law
of Universal Gravitation states that masses m1 and m2 with center of mass positions at p1 and p2 (with
distance r = ‖p2 − p1 ‖) will attract each other with force f ,
f = G m1 m2 / r^2 (6.91)
where the gravitational constant G = 6.67408 · 10−11 m3 kg−1 s−2 .
y = b0 x0 x1 x2^(-2) + ε (6.92)
Given a four column dataset [x0 , x1 , x2 , y] a Symbolic Regression could be run to estimate a more general
model that includes all possible terms with powers xj^(-2) , xj^(-1) , xj , xj^2 . It could also include cross (two-way
interaction) terms between all these terms. In this case, it is necessary to add cross3 (three-way interaction)
terms. An intercept would imply force with no masses involved, so it should be left out of the model.
It is easier to collect data where the Earth is used for mass 1 and mass 2 is that of people at various distances
from the center of the Earth (m1 → x0 , r → x1 , f → y).
y = b0 x0 x1^(-2) + ε (6.93)
In this case the parameter b0 will correspond to GM , where G is the Gravitational Constant and M is the
Mass of the Earth. The following code provides simulated data and uses symbolic regression to determine
the Gravitational Constant.
1 val noise = Normal (0 , 10) // random noise
2 val rad = Uniform (6370 , 7000) // distance from center of Earth in km
3 val mas = Uniform (50 , 150) // mass of person
4
7
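The same experiment can be simulated outside ScalaTion. The NumPy sketch below mirrors the snippet above; the value of GM and the noise level are illustrative assumptions, and the estimate uses one-parameter least squares on the single term m/r^2.

```python
import numpy as np

rng = np.random.default_rng(42)
GM = 3.986e5                              # km^3/s^2, Earth's gravitational parameter
m  = rng.uniform(50, 150, size=500)       # mass of person
r  = rng.uniform(6370, 7000, size=500)    # distance from center of Earth in km
f  = GM * m / r**2 + rng.normal(0, 1e-3, size=500)   # simulated force + noise

x  = m / r**2                             # the single model term x0 * x1^(-2)
b0 = (x @ f) / (x @ x)                    # least-squares estimate of GM
print(b0)
```

With modest noise, the recovered parameter b0 is very close to the GM used to generate the data, illustrating how symbolic regression can recover a physical constant.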
The statement val mod = SymbolicRegression (...) invokes the factory method called apply in the
SymbolicRegression object. The SymbolicRegression object provides methods for quadratic, cubic, and
more general symbolic regression.
Object Methods:
1 object SymbolicRegression :
2
11 end SymbolicRegression
The apply method is flexible enough to include many functional forms as terms in a model. Feature
selection can be used to eliminate many of the terms to produce a meaningful and interpretable model.
Note, unless measurements are precise and experiments are controlled, other terms besides the one given by
Newton’s Law of Universal Gravitation are likely to be selected.
1 @ param x the initial data / input m - by - n matrix ( before expansion )
2 must not include an intercept column of all ones
3 @ param y the response / output m - vector
4 @ param fname the feature / variable names ( defaults to null )
5 @ param powers the set of powers to raise matrix x to ( defaults to null )
6 @ param intercept whether to include the intercept term ( column of ones ) _1
7 ( defaults to true )
8 @ param cross whether to include 2 - way cross / interaction terms x_i x_j
9 ( defaults to true )
10 @ param cross3 whether to include 3 - way cross / interaction terms x_i x_j x_k
11 ( defaults to false )
12 @ param hparam the hyper - parameters ( defaults to Regression . hp )
13 @ param terms custom terms to add into the model , e . g . ,
14 Array ((0 , 1.0) , (1 , -2.0) ) adds x0 x1 ^( -2)
15
where type Xj2p = (Int, Double) indicates raising column Xj to the p-th power.
1. The powers set takes each column in matrix X and raises it to the pth power for every p ∈ powers.
The expression X p produces a matrix with all columns raised to the pth power. For example, Set (1,
2, 0.5) will add the original columns, quadratic columns, and square root columns.
2. The intercept flag indicates whether an intercept (column of ones) is to be added to the model.
Again, such a column must not be included in the original matrix.
3. The cross flag indicates whether two-way cross/interaction terms of the form xi xj (for i 6= j) are to
be added to the model.
4. The cross3 flag indicates whether three-way cross/interaction terms of the form xi xj xk (for i, j, k not
all the same) are to be added to the model.
5. The terms (repeated) array allows custom terms to add into the model. For example,
1 Array ((0 , 1.0) , (1 , -2) )
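An illustrative expansion in the spirit of `buildMatrix` (a NumPy sketch, not ScalaTion code): raise every column to each power in `powers` and append a custom product term such as x0 x1^(-2), matching `Array((0, 1.0), (1, -2.0))`.

```python
import numpy as np

def build_matrix(X, powers, terms=()):
    """Columns X^p for each p in powers, plus custom product terms."""
    cols = [X**p for p in powers]                        # X^p, column-wise
    for t in terms:                                      # each t: [(col, power), ...]
        col = np.prod([X[:, j]**p for j, p in t], axis=0)
        cols.append(col.reshape(-1, 1))
    return np.hstack(cols)

X = np.array([[1.0, 2.0],
              [4.0, 3.0]])
XX = build_matrix(X, powers=[1.0, 2.0, 0.5],
                  terms=[[(0, 1.0), (1, -2.0)]])         # x0 * x1^(-2)
print(XX.shape)   # 2 columns x 3 powers + 1 custom term = 7 columns
```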
Much of the functionality to do this is supplied by the MatrixD class in the mathstat package. The operator
++^ concatenates two matrices column-wise, while the operator x~^p returns a new matrix where each of the
columns in the original matrix is raised to the pth power. The crossAll method returns a new matrix
consisting of columns that multiply each column by every other column. The crossAll3 method returns a
new matrix consisting of columns that multiply each column by all combinations of two other columns.
buildMatrix Method
The bulk of the work is done by the buildMatrix method that creates the input data matrix, column by
column.
1 def buildMatrix ( x : MatrixD , fname : Array [ String ] ,
2 powers : Set [ Double ] , intercept : Boolean ,
3 cross : Boolean , cross3 : Boolean ,
4 terms : Array [ Xj2p ]*) : ( MatrixD , Array [ String ]) =
5 val _1 = VectorD . one ( x . dim ) // one vector
6 var xx = new MatrixD ( x . dim , 0) // start empty
7 var fname_ = Array [ String ] ()
8
34 if cross then
35 xx = xx ++^ x . crossAll // add 2 - way cross x_i x_j
36 fname_ ++= crossNames ( fname )
37 end if
38
39 if cross3 then
40 xx = xx ++^ x . crossAll3 // add 3 - way cross x_i x_j x_k
41 fname_ ++= crossNames3 ( fname )
42 end if
43
44 if intercept then
45 xx = _1 +^: xx // add intercept term ( _1 )
46 fname_ = Array ( " one " ) ++ fname_
47 end if
48
6.11.5 Regularization
Because symbolic regression may introduce many terms into the model and may exhibit high multi-collinearity,
regularization becomes even more important.
Symbolic Ridge Regression can be beneficial in dealing with multi-collinearity. The SymRidgeRegression
object supports the same methods that SymbolicRegression does, except buildMatrix, which it reuses.
1 object SymRidgeRegression :
2
Symbolic Lasso Regression
Other forms of regularization can be useful as well. Symbolic Lasso Regression can be beneficial in dealing
with multi-collinearity and more importantly by setting some parameters/coefficients bj to zero, thereby
eliminating the j th term. This is particularly important for symbolic regression as the number of possible
terms can become very large.
1 object SymLassoRegression :
2
6.11.6 Exercises
1. Exploratory Data Analysis Revisited. For each predictor variable xj in the Example AutoMPG
dataset, determine the best power to raise that column to. Plot y and yp versus xj for SimpleRegression.
Compare this to the plot of y and yp versus xj for SymbolicRegression using the best power.
2. Combine all the best powers together to form a model matrix with the same number of columns as
the original AutoMPG matrix and compare SymbolicRegression with Regression on the original
matrix.
3. Use forward, backward and stepwise regression to look for a better (than the last exercise) combination
of features for the AutoMPG dataset.
4. Redo the last exercise using SymRidgeRegression. Note any differences.
6. When there are for example quadratic terms added to the expanded matrix, explain why it will not
work to simply center (by subtracting the column means) the original data matrix X.
7. Compare the effectiveness of the following two search strategies that are used in Symbolic Regression:
(a) Genetic Algorithms and (b) FFX Algorithm.
8. Present a review of a paper that discusses how Symbolic Regression has been used to reproduce a
theory in a scientific discipline.
6.12 Transformed Regression
The TranRegression class supports transformed multiple linear regression and hence, the predictor vector
x is multi-dimensional [1, x1 , ..., xk ]. In certain cases, the relationship between the response scalar y and the
predictor vector x is not linear. There are many possible functional relationships that could apply [144], but
five obvious choices are the following:
1. The response grows exponentially versus a linear combination of the predictor variables.
2. The response grows quadratically versus a linear combination of the predictor variables.
3. The response grows as the square root of a linear combination of the predictor variables.
4. The response grows logarithmically versus a linear combination of the predictor variables.
5. The response grows inversely (as the reciprocal) versus a linear combination of the predictor variables.
This capability can be easily implemented by introducing a transform (transformation function) into Regression.
The transformation function and its inverse are passed into the TranRegression class which extends the
Regression class.
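The idea behind case 1 (exponential growth) can be sketched in NumPy (illustrative data, not ScalaTion's TranRegression): apply the transform (log) to y, fit by ordinary least squares, then apply the inverse transform (exp) to the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
b_true = np.array([0.5, 2.0])
y = np.exp(X @ b_true)                    # noise-free exponential response

b_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)   # transform: log
yp = np.exp(X @ b_hat)                                  # inverse transform: exp
print(b_hat)
```

The fit is linear in the transformed space, so the usual OLS machinery applies unchanged; only the forward and inverse transforms differ across the five cases.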